Skip to content

Latest commit

 

History

History
182 lines (126 loc) · 8.63 KB

README.adoc

File metadata and controls

182 lines (126 loc) · 8.63 KB

Hive SerDe

This project includes tools to build a Hive SerDe to index Hive tables to Solr.

Features

  • Index Hive table data to Solr.

  • Read Solr index data to a Hive table.

  • Kerberos support for securing communication between Hive and Solr.

Build the SerDe Jars

This project has a dependency on the solr-hadoop-common submodule (also a separate GitHub repository, https://github.com/lucidworks/solr-hadoop-common). This submodule must be initialized before building the SerDe .jar.

To initialize the submodule, pull this repo, then:

git submodule init

Once the submodule has been initialized, the command git submodule update will fetch all the data from that project and check out the appropriate commit listed in the superproject. You must initialize and update the submodule before attempting to build the SerDe .jar.

The build uses Gradle; however, you do not need to have Gradle already installed before attempting to build.

To build the .jar files, run this command:

./gradlew clean shadowJar --info

This will make two .jar files, one for use with Hive v0.13 only, and another that can be used with Hive 0.12, 0.14, 0.15.

  • solr-hive-serde/build/libs/solr-hive-serde-2.2.2.jar

  • solr-hive-serde-0.13/build/libs/solr-hive-serde-0.13-2.2.2.jar

Using the SerDe

Install the SerDe Jar to Hive

In order for Hive to work with Solr, the Hive SerDe jar must be added as a plugin to Hive.

From a Hive prompt, use the ADD JAR command and reference the path and filename of the SerDe jar for your Hive version.

   hive> ADD JAR solr-hive-serde-2.2.2.jar;

This can also be done in your Hive command to create the table, as in the example below.

Create External Tables

An external table in Hive allows the data in the table to be used (read or write) by another system or application outside of Hive. For integration with Solr, the external table allows you to have Solr read from and write to Hive.

To create an external table for Solr, you can use a command similar to below. The properties available are described after the example.

hive> CREATE EXTERNAL TABLE solr (id string, field1 string, field2 int)
      STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
      LOCATION '/tmp/solr'
      TBLPROPERTIES('solr.server.url' = 'http://localhost:8888/solr',
                    'solr.collection' = 'collection1',
                    'solr.query' = '*:*');

In this example, we have created an external table named "solr", and defined a custom storage handler (STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler').

The LOCATION indicates the location in HDFS where the table data will be stored. In this example, we have chosen to use /tmp/solr.

In the section TBLPROPERTIES, we define several properties for Solr so the data can be indexed to the right Solr installation and collection:

solr.zkhost

The location of the ZooKeeper quorum if using LucidWorks in SolrCloud mode. If this property is set along with the solr.server.url property, the solr.server.url property will take precedence.

solr.server.url

The location of the Solr instance if not using LucidWorks in SolrCloud mode. If this property is set along with the solr.zkhost property, this property will take precedence.

solr.collection

The Solr collection for this table. If not defined, an exception will be thrown.

solr.query

The specific Solr query to execute to read this table. If not defined, a default of *:* will be used. This property is not needed when loading data to a table, but is needed when defining the table so Hive can later read the table.

lww.commit.on.close

If true, inserts will be automatically committed when the connection is closed. True is the default.

lww.jaas.file

Used only when indexing to or reading from a Solr cluster secured with Kerberos.

This property defines the path to a JAAS file that contains a service principal and keytab location for a user who is authorized to read from and write to Solr and Hive.

The JAAS configuration file must be copied to the same path on every node where a Node Manager is running (i.e., every node where map/reduce tasks are executed). Here is a sample section of a JAAS file:

Client { --(1)
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="/data/solr-indexer.keytab" --(2)
  storeKey=true
  useTicketCache=true
  debug=true
  principal="[email protected]"; --(3)
};
  1. The name of this section of the JAAS file. This name will be used with the lww.jaas.appname parameter.

  2. The location of the keytab file.

  3. The service principal name. This should be a different principal than the one used for Solr, but must have access to both Solr and Hive.

lww.jaas.appname

Used only when indexing to or reading from a Solr cluster secured with Kerberos.

This property provides the name of the section in the JAAS file that includes the correct service principal and keytab path.

If the table needs to be dropped at a later time, you can use the DROP TABLE command in Hive. This will remove the metadata stored in the table in Hive, but will not modify the underlying data (in this case, the Solr index).

Query and Insert Content

Once the table is configured, any syntactically correct Hive query will be able to query the Solr index.

For example, to select three fields named "id", "field1", and "field2" from the "solr" table, you would use a query such as:

hive> SELECT id, field1, field2 FROM solr;

To join data from tables, you can make a request such as:

hive> SELECT id, field1, field2 FROM solr left
      JOIN sometable right
      WHERE left.id = right.id;

And finally, to insert data to a table, simply use the Solr table as the target for the Hive INSERT statement, such as:

hive> INSERT INTO solr
      SELECT id, field1, field2 FROM sometable;

Example Hive Table

Solr includes a small number of sample documents for use when getting started. One of these is a CSV file containing book metadata. This file is found in your Solr installation, at $SOLR_HOME/example/exampledocs/books.csv.

Using the sample books.csv file, we can see a detailed example of creating a table, loading data to it, and indexing that data to Solr.

CREATE TABLE books (id STRING, cat STRING, title STRING, price FLOAT, in_stock BOOLEAN, author STRING, series STRING, seq INT, genre STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; --(1)

LOAD DATA LOCAL INPATH '/solr/example/exampledocs/books.csv' OVERWRITE INTO TABLE books; --(2)

ADD JAR solr-hive-serde-2.2.2.jar; --(3)

CREATE EXTERNAL TABLE solr (id STRING, cat_s STRING, title_s STRING, price_f FLOAT, in_stock_b BOOLEAN, author_s STRING, series_s STRING, seq_i INT, genre_s STRING) --(4)
     STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler' --(5)
     LOCATION '/tmp/solr' --(6)
     TBLPROPERTIES('solr.zkhost' = 'zknode1:2181,zknode2:2181,zknode3:2181/solr',
                   'solr.collection' = 'gettingstarted',
                   'solr.query' = '*:*'),
                   'lww.jaas.file' = '/data/jaas-client.conf'; --(7)


INSERT OVERWRITE TABLE solr SELECT b.* FROM books b;
  1. Define the table books, and provide the field names and field types that will make up the table.

  2. Load the data from the books.csv file.

  3. Add the solr-hive-serde-2.2.2.jar file to Hive. Note the jar name shown here omits the version information which will be included in the jar file you have. If you are using Hive 0.13, you must also use a jar specifically built for 0.13.

  4. Create an external table named solr, and provide the field names and field types that will make up the table. These will be the same field names as in your local Hive table, so we can index all of the same data to Solr.

  5. Define the custom storage handler provided by the solr-hive-serde-2.2.2.jar.

  6. Define storage location in HDFS.

  7. Define the location of Solr (or ZooKeeper if using SolrCloud), the collection in Solr to index the data to, and the query to use when reading the table. This example also refers to a JAAS configuration file that will be used to authenticate to the Kerberized Solr cluster.