HBase Indexer

The HBase Indexer allows you to easily and quickly index HBase rows into Solr. It uses HBase replication to copy rows from HBase to the Solr index in real-time.

Features of the Lucidworks HBase Indexer

This project is a fork of the original hbase-indexer project from NGDATA, with these additional features:

  • Support for HBase 0.94, 0.96, 0.98, 1.1.0 and 1.1.2

  • Support for Solr 5.x (Solr 6.x to come soon)

  • Kerberos authentication

  • Output to Lucidworks Fusion pipelines (optional)

Important
Due to the changes required for Solr 5.x support, builds from this fork of the main project will not work with Solr 4.x.

Subprojects

HBase SEP

A standalone library for asynchronously processing HBase mutation events by hooking into HBase replication. See the SEP readme.

HBase SEP & Replication Monitoring

A standalone utility to monitor HBase replication progress. See the SEP-tools readme.

How to Build

Building for General Use

You can build the full hbase-indexer project as follows:

mvn clean package -DskipTests

The default build is linked to HBase 0.94. In order to build for another version, you should specify the version with the -Dhbase.api parameter. For example, to build for HBase 0.98, run the following command:

mvn clean package -DskipTests -Dhbase.api=0.98

To build for HBase 1.1.0, run the following command:

mvn clean package -DskipTests -Dhbase.api=1.1.0

How the HBase Indexer Works

The HBase Indexer works as an HBase replication sink and runs as a daemon. When updates are written to HBase region servers, they are replicated to the HBase Indexer processes. These processes then pass the updates to Solr for indexing.

Before using the HBase Indexer, you must prepare several configuration files. You must then supply a configuration specific to your HBase table that maps HBase columns and rows to Solr fields and documents.

If you are already using HBase replication, the HBase Indexer process should not interfere.

HBase Indexer Daemon Setup

Each of the following configuration changes must be made before you can use the HBase Indexer.

Note
These instructions assume Solr is running and a collection already exists for the data to be indexed. Solr does not need to be restarted to start using the HBase Indexer.

Configure hbase-indexer-site.xml

The first step in configuring the HBase Indexer is to define the ZooKeeper connect string and quorum in hbase-indexer-site.xml, found in /opt/${namePackage}/hbase-indexer/conf.

The properties to define are hbaseindexer.zookeeper.connectstring and hbase.zookeeper.quorum, as in this example:

<configuration>
   <property>
      <name>hbaseindexer.zookeeper.connectstring</name>
      <value>server1:2181,server2:2181,server3:2181</value>
   </property>
   <property>
      <name>hbase.zookeeper.quorum</name>
      <value>server1,server2,server3</value>
   </property>
</configuration>

Customize the server names and ports as appropriate for your environment.
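
The quorum value is the same host list as the connect string, without the ports. As a minimal sketch, the helper below derives one from the other so the two properties stay in sync. The function name connect_to_quorum is hypothetical, and the server names are the placeholders from the example above.

```shell
# Derive the quorum value (hosts only) from a ZooKeeper connect string
# made of host:port pairs.
connect_to_quorum() {
  # Strip every ":port" suffix, leaving a comma-separated host list.
  printf '%s\n' "$1" | sed -E 's/:[0-9]+//g'
}

connect_to_quorum "server1:2181,server2:2181,server3:2181"
# prints: server1,server2,server3
```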

Configure hbase-site.xml

The hbase-site.xml file on each HBase node needs to be modified to include several new properties, as shown below:

<configuration>
   <!-- SEP is basically replication, so enable it -->
   <property>
      <name>hbase.replication</name>
      <value>true</value>
   </property>
   <property>
      <name>replication.source.ratio</name>
      <value>1.0</value>
   </property>
   <property>
      <name>replication.source.nb.capacity</name>
      <value>1000</value>
   </property>
   <property>
      <name>replication.replicationsource.implementation</name>
      <value>com.ngdata.sep.impl.SepReplicationSource</value>
   </property>
</configuration>

In more detail, these properties are:

  • hbase.replication: True or false; enables replication.

  • replication.source.ratio: The ratio of slave cluster region servers that each master cluster region server connects to for updates. Set this to 1.0 (100%) so that every SEP consumer is actually used; otherwise, some can sit idle, especially with small clusters.

  • replication.source.nb.capacity: The maximum number of hlog entries to replicate in one go. If this is large and a consumer takes a while to process the events, the HBase RPC call may time out.

  • replication.replicationsource.implementation: A custom replication source for indexing content to Solr.

Once these values have been added to hbase-site.xml, HBase must be restarted. The restart is a later step in this setup.
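
Before restarting, it can help to confirm that every property is present on each node. The following is a rough sketch, not part of the HBase Indexer distribution: the function name missing_sep_properties is hypothetical, and it only checks that each property name appears in the file, not that its value is correct.

```shell
# List the required property names that are missing from an hbase-site.xml file.
missing_sep_properties() {
  file="$1"
  for name in \
      hbase.replication \
      replication.source.ratio \
      replication.source.nb.capacity \
      replication.replicationsource.implementation; do
    # -F: match a fixed string, -q: no output; look for the full <name> element.
    grep -Fq "<name>${name}</name>" "$file" || echo "$name"
  done
}

# Example call (the path is an assumption; adjust it to your installation):
# missing_sep_properties /etc/hbase/conf/hbase-site.xml
```

No output means all four property names were found.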

Tip
If you use Ambari to manage your cluster, you can set these properties by going to the HBase Configs screen at Ambari → Configs → Advanced tab → Custom hbase-site.

Copy hbase-site.xml to HBase Indexer

Once you have added the new properties to hbase-site.xml, copy it to the hbase-indexer conf directory. In many cases, this is from /etc/hbase/conf to hbase-indexer/conf.

Copying this file ensures all of the parameters configured for HBase (such as settings for Kerberos and ZooKeeper) are available to the HBase Indexer.

Copy JAR Files

The SEP replication being used by HBase Indexer requires 4 .jar files to be copied from the HBase Indexer distribution to each HBase node.

These .jar files can be found in the hbase-indexer/lib directory.

They need to be copied to the $HBASE_HOME/lib directory on each node running HBase. These files are:

  • hbase-sep-api-2.2.2.jar

  • hbase-sep-impl-2.2.2.jar

  • hbase-sep-impl-common-2.2.2.jar

  • hbase-sep-tools-2.2.2.jar
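
As a sketch of the copy step, the helper below prints one scp command per jar and node so the list can be reviewed before anything is copied. The function name, the node names, and the remote path /usr/lib/hbase/lib (standing in for $HBASE_HOME/lib) are assumptions for illustration.

```shell
# The four SEP jars named above.
SEP_JARS="hbase-sep-api-2.2.2.jar hbase-sep-impl-2.2.2.jar hbase-sep-impl-common-2.2.2.jar hbase-sep-tools-2.2.2.jar"

# Print the scp commands that would copy the jars to each HBase node.
# Usage: print_copy_commands REMOTE_LIB_DIR NODE...
print_copy_commands() {
  remote_lib="$1"
  shift
  for node in "$@"; do
    for jar in $SEP_JARS; do
      echo "scp hbase-indexer/lib/${jar} ${node}:${remote_lib}/"
    done
  done
}

print_copy_commands /usr/lib/hbase/lib server1 server2
```

Once the printed commands look right for your environment, they can be run from the directory that contains hbase-indexer.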

Enable Kerberos Support

If you want to index content to a Solr cluster that has been secured with Kerberos for internode communication, you will need to apply additional configuration.

A JAAS file configures the authentication properties, and will include a section for a service principal and keytab file for a user who has access to both HBase and Solr. This user should be a different user than the service principal that Solr is using for internode communication.

Kerberos Parameters

To configure HBase Indexer to be able to write to Kerberized Solr, you will modify the hbase-indexer script found in /opt/${namePackage}/hbase-indexer/bin.

Find the section where two of the properties are commented out by default:

#HBASE_INDEXER_OPTS="$HBASE_INDEXER_OPTS -Dlww.jaas.file="
#HBASE_INDEXER_OPTS="$HBASE_INDEXER_OPTS -Dlww.jaas.appname="

Uncomment those properties to supply the correct values (explained below), and then add another property:

HBASE_INDEXER_OPTS="$HBASE_INDEXER_OPTS -Djava.security.auth.login.config="

The three together should look similar to this:

HBASE_INDEXER_OPTS="$HBASE_INDEXER_OPTS -Dlww.jaas.file="
HBASE_INDEXER_OPTS="$HBASE_INDEXER_OPTS -Djava.security.auth.login.config="
HBASE_INDEXER_OPTS="$HBASE_INDEXER_OPTS -Dlww.jaas.appname="

Remove the # to uncomment the existing lines, and supply a value for each property:

-Dlww.jaas.file

The full path to a JAAS configuration file that includes a section to define the keytab location and service principal that will be used to run the HBase Indexer. This user must have access to both HBase and Solr, but should be a different user than the service principal that Solr uses for internode communication.

-Djava.security.auth.login.config

The path to the JAAS file which includes a section for the HBase user. This can be the same file that contains the definitions for the HBase Indexer user, but must be defined separately.

-Dlww.jaas.appname

The name of the section in the JAAS file that includes the service principal and keytab location for the user who will run the HBase Indexer, as shown below. If this is not defined, a default of "Client" will be used.
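
For illustration, the three lines might look like this once filled in. The file location /etc/hbase-indexer/jaas.conf is an assumed path, and this sketch points both file properties at the same JAAS file, which the descriptions above allow. The section name SolrClient matches the sample JAAS file in the next section.

```shell
# Hypothetical values; replace them with the paths and names from your environment.
JAAS_FILE="/etc/hbase-indexer/jaas.conf"   # assumed location of the JAAS file
JAAS_APPNAME="SolrClient"                  # JAAS section for the HBase Indexer user

HBASE_INDEXER_OPTS="${HBASE_INDEXER_OPTS:-} -Dlww.jaas.file=${JAAS_FILE}"
HBASE_INDEXER_OPTS="$HBASE_INDEXER_OPTS -Djava.security.auth.login.config=${JAAS_FILE}"
HBASE_INDEXER_OPTS="$HBASE_INDEXER_OPTS -Dlww.jaas.appname=${JAAS_APPNAME}"

# Show the resulting options for review.
echo "$HBASE_INDEXER_OPTS"
```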

Sample JAAS File

Here is a sample JAAS file, and the areas that must be changed for your environment:

Client { --(1)
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="/data/hbase.keytab" --(2)
  storeKey=true
  useTicketCache=false
  debug=true
  principal="[email protected]"; --(3)
};
SolrClient { --(4)
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="/data/solr-indexer.keytab" --(5)
  storeKey=true
  useTicketCache=false
  debug=true
  principal="[email protected]"; --(6)
};
  1. The name of the section of the JAAS file for the HBase user. This first section named "Client" should contain the proper credentials for HBase and ZooKeeper.

  2. The path to the keytab file for the HBase user. The user running the HBase Indexer must have access to this file.

  3. The service principal name for the HBase user.

  4. The name of the section for the user who will run the HBase Indexer. This second section is named "SolrClient" and should contain the proper credentials for the user who will run the HBase Indexer. This section name will be used with the -Dlww.jaas.appname parameter as described earlier.

  5. The path to the keytab file for the Solr user. The user running the HBase Indexer must have access to this file.

  6. The service principal name. This should be a different principal than the one used for Solr, but must have access to both Solr and HBase.
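
Because the user running the HBase Indexer must be able to read both keytab files, a quick check like the one below can catch permission problems early. This is a sketch only: the function name check_keytabs is hypothetical, it should be run as the user who will start the daemon, and it assumes each keyTab entry sits on its own line with a double-quoted path, as in the sample.

```shell
# Report whether each keyTab path in a JAAS file is readable by the current user.
check_keytabs() {
  jaas_file="$1"
  # Extract the quoted value of every keyTab="..." entry, one per line.
  sed -n 's/.*keyTab="\([^"]*\)".*/\1/p' "$jaas_file" |
  while IFS= read -r path; do
    if [ -r "$path" ]; then
      echo "readable: $path"
    else
      echo "not readable: $path"
    fi
  done
}

# Example call (the path is an assumption):
# check_keytabs /etc/hbase-indexer/jaas.conf
```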

Restart HBase

Once all of the above changes have been made, restart HBase on all nodes.

Start the HBase Indexer Daemon

When configuration is complete and HBase has been restarted, you can start the HBase Indexer daemon.

From hbase-indexer/bin, run:

hbase-indexer server &

The & at the end of this command runs the server in the background. To run it in the foreground, omit the &.

At this point the HBase Indexer daemon is running, and you are ready to configure an indexer to start indexing content.

Stream Data from HBase Indexer to Solr

Once the HBase configuration files have been updated, HBase has been restarted on all nodes, and the daemon started, the next step is to create an indexer to stream data from a specific HBase table to Solr.

Important
The HBase table that will be indexed must have the REPLICATION_SCOPE set to "1".

If you are not familiar with HBase, the HBase Indexer tutorial provides a good introduction to creating a simple table and indexing it to Solr.
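
As a small sketch of the replication scope requirement, the snippet below writes an HBase shell script that creates a table with REPLICATION_SCOPE set on its column family. The table name indexdemo-user and the column family info are taken from the indexer configuration example in the next section; the script file name is an assumption.

```shell
# Write a short HBase shell script that creates a table whose column
# family has replication enabled.
cat > create-table.hbase <<'EOF'
create 'indexdemo-user', { NAME => 'info', REPLICATION_SCOPE => '1' }
exit
EOF

# Run it on a node with the HBase client installed:
# hbase shell create-table.hbase
```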

Add an Indexer

In order to process HBase events, an indexer must be created.

First, create an indexer configuration file, which is a simple XML file that tells the HBase Indexer how to map HBase columns to Solr fields. For example:

<?xml version="1.0"?>
<indexer table="indexdemo-user">
   <field name="firstname_s" value="info:firstname"/>
   <field name="lastname_s" value="info:lastname"/>
   <field name="age_i" value="info:age" type="int"/>
</indexer>

Note that each field element defines the Solr field name, then the HBase column, and optionally the field type. The table attribute of the indexer element must also match the name of the table you intend to index.

More details on the indexer configuration options are available from https://github.com/NGDATA/hbase-indexer/wiki/Indexer-configuration.

Start an Indexer Process

Once we have an indexer configuration file, we can then start the indexer itself with the add-indexer command.

This command takes several properties, as in this example:

./hbase-indexer add-indexer \ -- (1)
  -n myindexer \ -- (2)
  -c indexer-conf.xml \ -- (3)
  -cp solr.zk=server1:3181,server2:3181,server3:3181/solr \ -- (4)
  -cp solr.collection=myCollection \ -- (5)
  -z server1:3181,server2:3181,server3:3181 -- (6)
  1. The hbase-indexer script is found in the /opt/${namePackage}/hbase-indexer/bin directory. The add-indexer command adds the indexer to the running hbase-indexer daemon.

  2. The -n property provides a name for the indexer.

  3. The -c property defines the location of the indexer configuration file. Provide the path as well as the filename if you are launching the

  4. The -cp property allows you to provide a key-value pair. In this case, we use the solr.zk property to define the location of the ZooKeeper ensemble used with Solr. The ZooKeeper connect string should include /solr as the znode path.

  5. Another -cp key-value pair, which defines the solr.collection property with a value of a collection name in Solr that the documents should be indexed to. This collection must exist prior to running the indexer.

  6. The -z property defines the location of the ZooKeeper ensemble used with HBase. This may be the same ensemble as in item (4) above, but it must still be specified separately.
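
Because a multi-line command is easy to break with a missing line continuation, one option is to assemble it from its parts and print it on a single line for review. This is a sketch under stated assumptions: the function name print_add_indexer is hypothetical, and the values are the placeholders from the example above.

```shell
# Print an add-indexer command on one line.
# Arguments: indexer name, config file, Solr collection,
#            Solr ZooKeeper connect string, HBase ZooKeeper connect string.
print_add_indexer() {
  echo "./hbase-indexer add-indexer" \
    "-n $1" \
    "-c $2" \
    "-cp solr.zk=$4" \
    "-cp solr.collection=$3" \
    "-z $5"
}

ZK="server1:3181,server2:3181,server3:3181"
# The Solr connect string carries the /solr znode path; the HBase one does not.
print_add_indexer myindexer indexer-conf.xml myCollection "${ZK}/solr" "$ZK"
```

The printed command can then be run from the hbase-indexer/bin directory.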

More details on the options for add-indexer are available from https://github.com/NGDATA/hbase-indexer/wiki/CLI-tools.