Solr is an open source search engine developed by the Apache Software Foundation. On its home page Solr advertises itself as
Solr is the popular, blazing-fast,
open source enterprise search platform built on Apache Lucene.
and the book Solr in Action describes Solr as
Solr is a scalable, ready-to-deploy enterprise search engine
that’s optimized to search large volumes of text-centric data
and return results sorted by relevance [p. 4]
The fact that Solr is a search engine means that there is a strong focus on speed, large volumes of text data, and the ability to sort the results by relevance.
Although Solr could technically be described as a NoSQL database (i.e. it allows us to store and retrieve data in a non-relational form) it is better to think of it as a search engine to emphasize the fact that it is better suited for text-centric and read-mostly environments [Solr in Action, p. 4].
The core functionality that Solr makes available is provided by a Java library called Lucene. Lucene is the brain behind the "indexing and search technology, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities" that we will see in this tutorial.
But Lucene is a Java library that can only be used from other Java programs. Solr, on the other hand, is a wrapper around Lucene that allows us to use the Lucene functionality from any programming language that can submit HTTP requests.
-------------------
| Java Runtime |
[client application] ----> HTTP request ----> | Solr --> Lucene |
-------------------
In this diagram the client application could be a program written in Ruby or Python. In fact, as we will see throughout this tutorial, it can also be a system utility like cURL or a web browser. Anything that can submit HTTP requests can communicate with Solr.
To install Solr we are going to use a tool called Docker that allows us to download and run pre-packaged software in small, isolated environments called containers. In our case we'll download a container with Solr 9.1.0 installed on it and use that during the workshop.
NOTE: You can also download and install the Solr binaries directly on your machine without using Docker. You'll need to have the Java Development Kit (JDK) for this method to work. If you are interested in this approach take a look at these instructions instead.
For the Docker installation let's start by going to https://www.docker.com/, downloading "Docker Desktop", installing it, and running it.
Once installed run the following command from the terminal to make sure it's running:
$ docker ps
#
# You'll see something like this
# CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
If Docker is not running we'll see an error that will indicate something along the lines of
Error response from daemon: dial unix docker.raw.sock: connect: connection refused
If we see this error it could be that the Docker Desktop app has not fully started. Wait a few seconds and try again. We can also open the "Docker Desktop" app and see its status.
Once Docker has been installed and it's up and running we can create a container to host Solr 9.1.0 with the following command:
$ docker run -d -p 8983:8983 --name solr-container solr:9.1.0
#
# You'll see something like this...
#
# Unable to find image 'solr:9.1.0' locally
# 9.1.0: Pulling from library/solr
# 846c0b181fff: Pull complete
# ...
# fc8f2125142b: Pull complete
# Digest: sha256:971cd7a5c682390f8b1541ef74a8fd64d56c6a36e5c0849f6b48210a47b16fa2
# Status: Downloaded newer image for solr:9.1.0
# 47e8cd4d281db5a19e7bfc98ee02ca73e19af66e392e5d8d3532938af5a76e96
The parameter -d in the previous command tells Docker to run the container in the background (i.e. detached) and the parameter -p 8983:8983 tells Docker to forward calls to our local port 8983 to port 8983 on the container.
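If port 8983 happens to be taken on your machine, Docker lets you map a different local port to the container's port 8983. This is just a hypothetical variation; the rest of the tutorial assumes the default 8983:

$ docker run -d -p 8984:8983 --name solr-container solr:9.1.0
#
# With this mapping Solr would be available at http://localhost:8984/solr/ instead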
We can check that the new container is running with the following command:
$ docker ps
#
# You'll see something like this...
#
# CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
# 47e8cd4d281d solr:9.1.0 "docker-entrypoint.s…" 2 minutes ago Up 2 minutes 0.0.0.0:8983->8983/tcp, :::8983->8983/tcp solr-container
Notice that now we have a container NAMED solr-container
using the IMAGE solr:9.1.0
. We can check the status of Solr with the following command:
$ docker exec -it solr-container solr status
# Found 1 Solr nodes:
#
# Solr process 15 running on port 8983
# {
# "solr_home":"/var/solr/data",
# "version":"9.1.0 aa4f3d98ab19c201e7f3c74cd14c99174148616d - ishan - 2022-11-11 13:00:47",
# "startTime":"2023-01-12T20:48:46.084Z",
# "uptime":"0 days, 0 hours, 9 minutes, 15 seconds",
# "memory":"178.3 MB (%34.8) of 512 MB"}
We can also see Solr running by pointing our browser to http://localhost:8983/solr/ which will show the Solr Admin web page. On this page we can see that we do not have any cores defined to store data; we'll fix that in the next section. WARNING: Do not attempt to create Solr cores via the "Add Core" button in the Solr Admin page -- that button only leads to pain.
Solr uses the concept of cores to represent independent environments in which we configure data schemas and store data. This is similar to the concept of a "database" in MySQL or PostgreSQL.
For our purposes, let's create a core named bibdata
with the following command:
$ docker exec -it solr-container solr create_core -c bibdata
#
# WARNING: Using _default configset with data driven schema functionality. NOT RECOMMENDED for production use.
# To turn off: bin/solr config -c bibdata -p 8983 -action set-user-property -property update.autoCreateFields -value false
#
# Created new core 'bibdata'
If we go back to http://localhost:8983/solr/ on our browser (we might need to refresh the page) we should see our newly created bibdata
core available in the "Core Selector" dropdown list.
Now that our core has been created we can query it with the following command:
$ curl 'http://localhost:8983/solr/bibdata/select?q=*'
#
# {
# "responseHeader":{
# "status":0,
# "QTime":0,
# "params":{
# "q":"*"}},
# "response":{"numFound":0,"start":0,"numFoundExact":true,"docs":[]
# }}
and we'll see "numFound":0
indicating that there are no documents on it. We can also point our browser to http://localhost:8983/solr#bibdata/query and click the "Execute Query" button at the bottom of the page and see the same result.
Now let's add a few documents to our bibdata
core. First, download this sample data file:
$ curl -OL https://raw.githubusercontent.com/hectorcorrea/solr-for-newbies/main/books.json
#
# You'll see something like this...
# % Total % Received % Xferd Average Speed Time Time Time Current
# Dload Upload Total Spent Left Speed
# 100 1998 100 1998 0 0 5561 0 --:--:-- --:--:-- --:--:-- 5581
#
File books.json contains a small sample data set with information about a few thousand books. We can take a look at it with something like head books.json or using the text editor of our choice. Below is an example of one of the books in this file:
{
"id": "00008027",
"author_txt_en": "Patent, Dorothy Hinshaw.",
"authors_other_txts_en": [
"Muñoz, William,"
],
"title_txt_en": "Horses /",
"responsibility_txt_en": "by Dorothy Hinshaw Patent ; photographs by William Muñoz.",
"publisher_place_s": "Minneapolis, Minn. :",
"publisher_name_s": "Lerner Publications,",
"publisher_date_s": "c2001.",
"subjects_ss": [
"Horses",
"Horses"
],
"subjects_form_ss": [
"Juvenile literature"
]
},
To import this data to our Solr we'll first copy the file to the Docker container
$ docker cp books.json solr-container:/opt/solr-9.1.0/books.json
and then we load it to Solr:
$ docker exec -it solr-container post -c bibdata books.json
#
# /opt/java/openjdk/bin/java -classpath /opt/solr/server/solr-webapp/webapp/WEB-INF/lib/solr-core-9.1.0.jar ...
# SimplePostTool version 5.0.0
# Posting files to [base] url http://localhost:8983/solr/bibdata/update...
# POSTing file books.json (application/json) to [base]/json/docs
# 1 files indexed.
# COMMITting Solr index changes to http://localhost:8983/solr/bibdata/update...
# Time spent: 0:00:01.951
Now if we re-run our query we should see some results:
$ curl 'http://localhost:8983/solr/bibdata/select?q=*'
#
# {
# "responseHeader":{
# "status":0,
# "QTime":0,
# "params":{
# "q":"*"}},
# "response":{"numFound":30424,"start":0,"numFoundExact":true,"docs":[
# {
# ...the information for the first 10 documents will be displayed here..
#
Notice how the number of documents found is greater than zero (e.g. "numFound":30424
)
Now that we have added a few documents to our bibdata
core we can query Solr for those documents. In a subsequent section we'll explore more advanced searching options and how our schema definition is key to enabling different kinds of searches, but for now we'll start with a few basic searches to get familiar with the way querying works in Solr.
If you look at the content of the books.json
file that we imported into our bibdata
core you'll notice that the documents have the following fields:
- id: string to identify each document (MARC 001)
- author_txt_en: string for the main author (MARC 100a)
- authors_other_txts_en: list of other authors (MARC 700a)
- title_txt_en: title of the book (MARC 245ab)
- publisher_name_s: publisher name (MARC 260b)
- subjects_ss: an array of subjects (MARC 650a)
The suffix added to each field (e.g. _txt_en
) is a hint for Solr to pick the appropriate field type for each field as it ingests the data. We will look closely into this in a later section.
To fetch data from Solr we make an HTTP request to the select
handler. For example:
$ curl 'http://localhost:8983/solr/bibdata/select?q=*'
There are many parameters that we can pass to this handler to define what documents we want to fetch and what fields we want to fetch.
We can use the fl
parameter to indicate what fields we want to fetch. For example to request the id
and the title_txt_en
of the documents we would use fl=id,title_txt_en
as in the following example:
$ curl 'http://localhost:8983/solr/bibdata/select?q=*&fl=id,title_txt_en'
Note: When issuing the commands via cURL (as in the previous example) make sure that the fields are separated by a comma without any spaces in between them. In other words make sure the URL says fl=id,title_txt_en
and not fl=id,
title_txt_en
. If the parameter includes spaces Solr will not return any results and give you a cryptic error message instead.
Try adding and removing some other fields to this list, for example, fl=id,title_txt_en,author_txt_en
or fl=id,title_txt_en,author_txt_en,subjects_ss
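For instance, the last suggestion would look like this (the exact fields returned will depend on what each document contains):

$ curl 'http://localhost:8983/solr/bibdata/select?q=*&fl=id,title_txt_en,author_txt_en,subjects_ss'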
In the previous examples you might have seen an inconspicuous q=*
parameter in the URL. The q
(query) parameter tells Solr what documents to retrieve. This is somewhat similar to the WHERE
clause in a SQL SELECT query.
If we want to retrieve all the documents we can just pass q=*
. But if we want to filter the results we can use the following syntax: q=field:value
to filter documents where a specific field has a particular value. For example, to include only documents where the title_txt_en
has the word "teachers" we would use q=title_txt_en:teachers
:
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:teachers'
We can filter by many different fields, for example to request documents where the title_txt_en includes the word "teachers" or the author_txt_en includes the word "Alice" we would use q=title_txt_en:teachers author_txt_en:Alice
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:teachers+author_txt_en:Alice'
As we saw in the previous example, by default, Solr searches for either of the terms. If we want to require that both conditions match we must explicitly use the AND
operator in the q
value as in q=title_txt_en:teachers AND author_txt_en:Alice
Notice that the AND
operator must be in uppercase.
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:teachers+AND+author_txt_en:Alice'
Now let's try something else. Let's issue a search for books where the title says "art history" using q=title_txt_en:"art history"
(make sure the text "art history" is in quotes)
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:"art+history"'
# the results will include 6 documents, all of them
# with the phrase "art history" somewhere in the title
#
Now let's issue a slightly different query using q=title_txt_en:"art history"~3
to indicate that we want the words "art" and "history" to be present in the title_txt_en
but they can be a few words apart (notice the ~3
):
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:"art+history"~3'
The result for this query will include a few more books (notice that numFound is now 10 instead of 6) and some of the new titles include
# "title_txt_en":"History of art /"},
# "title_txt_en":"American art : a cultural history /"},
# "title_txt_en":"The invention of art : a cultural history /"},
# "title_txt_en":"A history of art in Africa /"}]
These new books include the words "art" and "history" but they don't have to be exactly next to each other; as long as they are close to each other they are considered a match (the ~3 in our query asks for an "edit distance of 3").
When searching for multi-word keywords on a given field make sure the keywords are surrounded by quotes, for example make sure to use q=title_txt_en:"art history" and not q=title_txt_en:art history. The latter will execute a search for "art" in the title_txt_en field and "history" in the _text_ field.
You can validate this by running the query and passing the debug
flag and seeing the parsedquery
value. For example in the following command we surround both search terms in quotes:
$ curl -s 'http://localhost:8983/solr/bibdata/select?debug=all&q=title_txt_en:"art+history"' | grep parsedquery
#
# "parsedquery":"PhraseQuery(title_txt_en:\"art histori\")",
#
notice that the parsedQuery
shows that Solr is searching for, as we would expect, both words in the title_txt_en
field.
Now let's look at the parsedQuery
when we don't surround the search terms in quotes:
$ curl -s 'http://localhost:8983/solr/bibdata/select?debugQuery=on&q=title_txt_en:art+history' | grep parsedquery
#
# "parsedquery":"title_txt_en:art _text_:history",
#
notice that Solr searched for the word "art" in the title_txt_en
field but searched for the word "history" on the _text_
field. Certainly not what we were expecting. We'll elaborate in a later section on the significance of the _text_
field but for now make sure to surround in quotes the search terms when issuing multi word searches.
One last thing to notice is that Solr returns results paginated; by default it returns the first 10 documents that match the query. We'll see later in this tutorial how we can request a larger page size (via the rows parameter) or another page (via the start parameter). But for now just notice that at the top of the results Solr always tells us the total number of results found:
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:education&fl=id,title_txt_en'
#
# response will include
# "response":{"numFound":340,"start":0,"docs":[
#
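As a quick preview of the rows and start parameters mentioned above, a request along these lines should fetch the second page of results, 20 documents at a time (we'll cover pagination in more detail later):

$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:education&fl=id,title_txt_en&rows=20&start=20'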
When we issue a search, Solr is able to return facet information about the data in our core. This is a built-in feature of Solr and easy to use; we just need to include the facet=on parameter and the facet.field parameter with the name of the field that we want to facet on.
For example, to search for all documents with title "education" (q=title_txt_en:education
) and retrieve facets (facet=on
) based on the subjects (facet.field=subjects_ss
) we'll use a query like this:
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:education&facet=on&facet.field=subjects_ss'
# response will include something like this
#
# "facet_counts":{
# "facet_fields":{
# "subjects_ss":[
# "Education",58,
# "Educational change",16,
# "Multicultural education",15,
# "Education, Higher",14,
# "Education and state",13,
# ...
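Solr supports several other faceting parameters that we won't cover here. For example, facet.limit and facet.mincount control how many facet values are returned and how many matching documents a value needs before it is included; a quick sketch:

$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:education&facet=on&facet.field=subjects_ss&facet.limit=5&facet.mincount=10'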
To update a document in Solr we have two options. The most common option is to post the data for that document again to Solr and let Solr overwrite the old document with the new data. The key for this to work is to provide the same ID in the new data as the ID of an existing document.
For example, if we query the document with ID 00007345
we would get:
$ curl 'http://localhost:8983/solr/bibdata/select?q=id:00007345'
# "response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
# {
# "id":"00007345",
# "authors_other_txts_en":["Giannakis, Georgios B."],
# "title_txt_en":"Signal processing advances in wireless and mobile communications /",
# "responsibility_txt_en":"edited by G.B. Giannakis ... [et al.].",
# "publisher_place_s":"Upper Saddle River, NJ :",
# "publisher_name_s":"Prentice Hall PTR,",
# "publisher_date_s":"c2001.",
# "subjects_ss":["Signal processing", "Wireless communication systems"],
# "_version_":1755414312334131200
# }
#
If we post to Solr a new document with the same ID Solr will overwrite the existing document with the new data. Below is an example of how to update this document with new JSON data using curl
to post the data to Solr. Notice that the command is issued against the update
endpoint rather than the select
endpoint we used in our previous commands.
$ curl -X POST --data '[{"id":"00007345","title_txt_en":"the new title"}]' 'http://localhost:8983/solr/bibdata/update?commit=true'
Out of the box Solr supports multiple input formats (JSON, XML, CSV); the section Uploading Data with Index Handlers in the Solr Reference Guide provides more details about this.
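For example, the same update could be expressed in Solr's XML update format; this is a rough sketch (not something we'll use in this tutorial):

$ curl -X POST -H 'Content-type: text/xml' --data-binary '<add><doc><field name="id">00007345</field><field name="title_txt_en">the new title</field></doc></add>' 'http://localhost:8983/solr/bibdata/update?commit=true'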
If we query for the document with ID 00007345 again we will see the new data, and notice that the fields that we did not provide during the update are now gone from the document. That's because Solr overwrote the old document with ID 00007345 with our new data that included only two fields (id and title_txt_en).
$ curl 'http://localhost:8983/solr/bibdata/select?q=id:00007345'
# "response":{"numFound":1,"start":0,"docs":[
# {
# "id":"00007345",
# "title_txt_en":"the new title",
# }]}
#
The second option to update a document in Solr is via atomic updates, in which we indicate what fields of the document will be updated. Details of this method are out of scope for this tutorial but below is a very simple example to show the basic syntax; notice how we are using the set operation on the title_txt_en field to indicate a different kind of update:
$ curl 'http://localhost:8983/solr/bibdata/select?q=id:00007450'
#
# "title_txt_en":"Principles of fluid mechanics /",
#
$ curl -X POST --data '[{"id":"00007450","title_txt_en":{"set":"the new title for 00007450"}}]' 'http://localhost:8983/solr/bibdata/update?commit=true'
$ curl 'http://localhost:8983/solr/bibdata/select?q=id:00007450'
#
# title will say "the new title for 00007450"
# and the rest of the fields will remain unchanged
#
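Atomic updates support other operations besides set (for example add to append a value to a multi-valued field, or inc to increment a numeric field). A hypothetical example that appends an extra subject to an existing document might look like this:

$ curl -X POST --data '[{"id":"00007450","subjects_ss":{"add":"Fluid mechanics"}}]' 'http://localhost:8983/solr/bibdata/update?commit=true'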
To delete documents from the bibdata
core we also use the update
endpoint but the structure of the command is as follows:
$ curl 'http://localhost:8983/solr/bibdata/update?commit=true' --data '<delete><query>id:00008056</query></delete>'
The body of the request (--data
) indicates to Solr that we want to delete a specific document (notice the id:00008056
query).
We can also pass a less specific query like title_txt_en:teachers
to delete all documents where the title includes the word "teachers" (or a variation of it). Or we can delete all documents with a query like *:*
.
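The same delete can also be expressed in JSON rather than XML if that is more convenient; a quick sketch:

$ curl -X POST -H 'Content-type:application/json' --data '{"delete":{"query":"id:00008056"}}' 'http://localhost:8983/solr/bibdata/update?commit=true'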
Be aware that even if you delete all documents from a Solr core the schema and the core's configuration will remain intact. For example, the fields that were defined are still available in the schema even if no documents exist in the core anymore.
If you want to delete the entire core (documents, schema, and other configuration associated with it) you can use the Solr delete command instead:
$ docker exec -it solr-container solr delete -c bibdata
# Deleting core 'bibdata' using command:
# http://localhost:8983/solr/admin/cores?action=UNLOAD&core=bibdata&deleteIndex=true&deleteDataDir=true&deleteInstanceDir=true
You will need to re-create the core if you want to re-import data to it.
Solr uses a document model to represent data. Documents are Solr's basic unit of information and they can contain different fields depending on what information they represent. For example a book in a library catalog stored as a document in Solr might contain fields for author, title, and subjects, whereas information about a house in a real estate system using Solr might include fields for address, taxes, price, and number of rooms.
In earlier versions of Solr documents were self-contained and did not support nested documents. Starting with version 8 Solr provides support for nested documents. This tutorial does not cover nested documents.
Search engines like Solr use a data structure called an inverted index to support fast retrieval of documents even with complex query expressions on large datasets. The basic idea of an inverted index is to use the terms inside a document as the keys of the index rather than the document's ID.
Let's illustrate this with an example. Suppose we have three books that we want to index. With a traditional index we would create something like this:
ID TITLE
-- ------------------------------
1 Princeton guide for dog owners
2 Princeton tour guide
3 Cats and dogs
With an inverted index Solr would take each of the words in the title of our books and use those words as the index key:
KEY DOCUMENT ID
--------- -----------
princeton 1, 2
owners 1
dogs 1, 3
guide 1, 2
tour 2
cats 3
Notice that the inverted index allows us to do searches for individual words within the title. For example a search for the word "guide" immediately tells us that documents 1 and 2 are a match. Likewise a search for "tour" tells us that document 2 is a match.
Chapter 3 in Solr in Action has a more comprehensive explanation of how Solr uses inverted indexes to allow for partial matches as well as to aid with the ranking of the results.
The schema in Solr is the definition of the field types and fields configured for a given core.
Field Types are the building blocks to define fields in our schema. Examples of field types are: binary
, boolean
, pfloat
, string
, text_general
, and text_en
. These are similar to the field types that are supported in a relational database like MySQL but, as we will see later, they are far more configurable than what you can do in a relational database.
There are three kinds of fields that can be defined in a Solr schema:

- Fields are the specific fields that you define for your particular core. Fields are based on a field type, for example, we might define field title based on the string field type, description based on the text field type, and price based on the pfloat field type (a sketch of adding a field via the Schema API appears right after this list).

- dynamicFields are field patterns that we define to automatically create new fields when the data submitted to Solr matches the given pattern. For example, we can define that if we receive data for a field that ends with _txt the field will be created as a text_general field type.

- copyFields are instructions to tell Solr how to automatically copy the value given for one field to another field. This is useful if we want to perform different transformations on the values as we ingest them. For example, we might want to remove punctuation characters for searching but preserve them for display purposes.
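As a reference, this is roughly what defining one of these fields via the Schema API would look like; the description field below is just a hypothetical example, not something our books data uses:

$ curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field": {
    "name": "description",
    "type": "text_general",
    "multiValued": false,
    "stored": true
  }
}' http://localhost:8983/solr/bibdata/schema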
Our newly created bibdata
core already has a schema and you can view the definition through the Solr Admin web page via the Schema Browser Screen at http://localhost:8983/solr/#/bibdata/schema or by exploring the managed-schema
file via the Files Screen.
You can also view this information with the Schema API as shown in the following example. The (rather long) response will be organized in four categories: fieldTypes
, fields
, dynamicFields
, and copyFields
as shown below:
$ curl localhost:8983/solr/bibdata/schema
# {
# "responseHeader": {"status": 0, "QTime": 2},
# "schema": {
# "fieldTypes": [lots of field types defined],
#
# "fields": [lots of fields defined],
#
# "dynamicFields":[lots of dynamic fields defined],
#
# "copyFields": [a few copy fields defined]
# }
# }
#
The advantage of the Schema API is that it allows you to view and update the information programmatically, which is useful if you need to recreate identical Solr cores without manually configuring each field definition (e.g. development vs production).
You can request information about each of these categories individually in the Schema API with the following commands (notice that combined words like fieldTypes
and dynamicFields
are not capitalized in the URLs below):
$ curl localhost:8983/solr/bibdata/schema/fieldtypes
$ curl localhost:8983/solr/bibdata/schema/fields
$ curl localhost:8983/solr/bibdata/schema/dynamicfields
$ curl localhost:8983/solr/bibdata/schema/copyfields
Notice that unlike a relational database, where only a handful of field types are available to choose from (e.g. integer, date, boolean, char, and varchar), in Solr there are lots of predefined field types available out of the box, each with its own configuration.
Note for Solr 4.x users: In Solr 4 the default mechanism to update the schema was by editing the file schema.xml
. Starting in Solr 5 the default mechanism is through the "Managed Schema Definition" which uses the Schema API to add, edit, and remove fields. There is now a managed-schema
file with the same information as schema.xml
but you are not supposed to edit this file. See section "Managed Schema Definition in SolrConfig" in the Solr Reference Guide 5.0 (PDF) for more information about this.
You might be wondering how fields like id, title_txt_en, author_txt_en, and subjects_ss in our bibdata core were created if we never explicitly defined them.
Solr automatically created most of these fields when we imported the data from the books.json
file. If you look at a few of the elements in the books.json
file you'll recognize that they match most of the fields defined in our schema. Below is the data for one of the records in our sample data:
{
"id":"00007345",
"authors_other_txts_en":["Giannakis, Georgios B."],
"title_txt_en":"Signal processing advances in wireless and mobile communications /",
"responsibility_txt_en":"edited by G.B. Giannakis ... [et al.].",
"publisher_place_s":"Upper Saddle River, NJ :",
"publisher_name_s":"Prentice Hall PTR,",
"publisher_date_s":"c2001.",
"subjects_ss":["Signal processing", "Wireless communication systems"],
}
The process that Solr follows when a new document is ingested is more or less as follows:

- If there is an exact match between a field being ingested and a field defined in the schema then Solr will use the definition in the schema to ingest the data. This is what happened for the id field. Our JSON data has an id field and so does the schema, therefore Solr stored the id value in the id field as indicated in the schema (i.e. as a single-value string.)

- If there is no exact match in the schema then Solr will look at the dynamicFields definitions to see if the field can be handled with some predefined settings. This is what happened with the title_txt_en field. Because there is no title_txt_en definition in the schema Solr used the dynamic field definition for *_txt_en, which indicated that the value should be indexed using the text in English (text_en) field definition.

- If no match is found in the dynamic fields either, Solr will guess the best type to use based on the data for this field in the first document. This is what happened with the authors_other_txts_en field (notice that this field ends with _txts_en rather than _txt_en). In this case, since there is no dynamic field definition to handle this ending, Solr guessed and created field authors_other_txts_en as text_general. For production use Solr recommends disabling this automatic guessing; this is what the "WARNING: Using _default configset with data driven schema functionality. NOT RECOMMENDED for production use" message was about when we first created our Solr core (see the command sketched right after this list).
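If you want to turn off the automatic guessing, the WARNING itself suggests the command to run. Inside our Docker container that would be something along these lines (we won't do this in the tutorial):

$ docker exec -it solr-container solr config -c bibdata -p 8983 -action set-user-property -property update.autoCreateFields -value false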
In the following sections we are going to drill down into some of the specifics of the fields and dynamic field definitions that are configured in our Solr core.
Let's look at the details of the id
field in our schema
$ curl localhost:8983/solr/bibdata/schema/fields/id
#
# Will return something like this
# {
# "responseHeader":{...},
# "field":{
# "name":"id",
# "type":"string",
# "multiValued":false,
# "indexed":true,
# "required":true,
# "stored":true
# }
# }
#
Notice how the field is of type string and also that it is marked as not multi-valued, indexed, required, and stored.
The string type also has its own definition, which we can view via:
$ curl localhost:8983/solr/bibdata/schema/fieldtypes/string
# {
# "responseHeader":{...},
# "fieldType":{
# "name":"string",
# "class":"solr.StrField",
# "sortMissingLast":true,
# "docValues":true
# }
# }
#
In this case the class
points to an internal Solr class (solr.StrField
) that will be used to handle values of the string type.
Now let's look at a more complex field and field type. If we look for a definition for the title_txt_en
Solr will report that we don't have one:
$ curl localhost:8983/solr/bibdata/schema/fields/title_txt_en
# {
# "responseHeader":{...
# "error":{
# "metadata":[
# "error-class","org.apache.solr.common.SolrException",
# "root-error-class","org.apache.solr.common.SolrException"],
# "msg":"No such path /schema/fields/title_txt_en",
# "code":404}}
#
However, if we look at the dynamic field definitions we'll notice that there is one for fields that end in _txt_en
:
$ curl localhost:8983/solr/bibdata/schema/dynamicfields/*_txt_en
# {
# "responseHeader":{...
# "dynamicField":{
# "name":"*_txt_en",
# "type":"text_en",
# "indexed":true,
# "stored":true}}
#
This tells Solr that any field name in the source data that does not already exist in the schema and that ends in _txt_en
should be created as a field of type text_en
. That looks innocent enough, so let's look closer to see what field type text_en
means:
$ curl localhost:8983/solr/bibdata/schema/fieldtypes/text_en
# {
# "responseHeader":{...}
# "fieldType":{
# "name":"text_en",
# "class":"solr.TextField",
# "positionIncrementGap":"100",
# "indexAnalyzer":{
# "tokenizer":{
# "class":"solr.StandardTokenizerFactory"
# },
# "filters":[
# { "class":"solr.StopFilterFactory" ... },
# { "class":"solr.LowerCaseFilterFactory" },
# { "class":"solr.EnglishPossessiveFilterFactory" },
# { "class":"solr.KeywordMarkerFilterFactory" ... },
# { "class":"solr.PorterStemFilterFactory" }
# ]
# },
# "queryAnalyzer":{
# "tokenizer":{
# "class":"solr.StandardTokenizerFactory"
# },
# "filters":[
# { "class":"solr.SynonymGraphFilterFactory" ... },
# { "class":"solr.StopFilterFactory" ... },
# { "class":"solr.LowerCaseFilterFactory" },
# { "class":"solr.EnglishPossessiveFilterFactory" },
# { "class":"solr.KeywordMarkerFilterFactory" ... },
# { "class":"solr.PorterStemFilterFactory" }
# ]
# }
# }
# }
This is obviously a much more complex definition than the ones we saw before. Although the basics are the same (e.g. the field type points to class solr.TextField
) notice that there are two new sections indexAnalyzer
and queryAnalyzer
for this field type. We will explore those in the next section.
Note: The fact that the Solr schema API does not show dynamically created fields (like title_txt_en
) is baffling, particularly since they do show in the Schema Browser Screen of the Solr Admin screen. This has been a known issue for many years as shown in this Stack Overflow question from 2010 in which one of the answers suggests using the following command to list all fields, including those created via dynamicField
definitions: curl localhost:8983/solr/bibdata/admin/luke?numTerms=0
The indexAnalyzer
section defines the transformations to perform as the data is indexed in Solr and queryAnalyzer
defines transformations to perform as we query for data out of Solr. It's important to notice that the output of the indexAnalyzer
affects the terms indexed, but not the value stored. The Solr Reference Guide says:
The output of an Analyzer affects the terms indexed in a given field
(and the terms used when parsing queries against those fields) but
it has no impact on the stored value for the fields. For example:
an analyzer might split "Brown Cow" into two indexed terms "brown"
and "cow", but the stored value will still be a single String: "Brown Cow"
When a value is indexed for a particular field the value is first passed to a tokenizer
and then to the filters
defined in the indexAnalyzer
section for that field type. Similarly, when we query for a value in a given field the value our query is first processed by a tokenizer
and then by the filters
defined in the queryAnalyzer
section for that field.
If we look again at the definition for the text_en
field type we'll notice that "stop words" (i.e. words to be ignored) are handled at index and query time (notice the StopFilterFactory
filter appears in the indexAnalyzer
and the queryAnalyzer
sections.) However, notice that "synonyms" will only be applied at query time since the filter SynonymGraphFilterFactory
only appears on the queryAnalyzer
section.
We can customize field type definitions to use different filters and tokenizers via the Schema API, which we will discuss later in this tutorial.
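Just to give a flavor of what such a customization could look like, the Schema API has an add-field-type command that accepts an analyzer definition. The field type below is purely hypothetical (an English text field without stemming) and is only a sketch:

$ curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field-type": {
    "name": "text_en_nostem",
    "class": "solr.TextField",
    "analyzer": {
      "tokenizer": { "class": "solr.StandardTokenizerFactory" },
      "filters": [
        { "class": "solr.LowerCaseFilterFactory" },
        { "class": "solr.EnglishPossessiveFilterFactory" }
      ]
    }
  }
}' http://localhost:8983/solr/bibdata/schema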
For most purposes we can think of a tokenizer as something that splits a given text into individual tokens or words. The Solr Reference Guide defines Tokenizers as follows:
Tokenizers are responsible for breaking
field data into lexical units, or tokens.
For example if we give the text "hello world" to a tokenizer it might split the text into two tokens like "hello" and "word".
Solr comes with several built-in tokenizers that handle a variety of data. For example if we expect a field to have information about a person's name the Standard Tokenizer might be appropriate for it. However, for a field that contains e-mail addresses the UAX29 URL Email Tokenizer might be a better option.
You can only have one tokenizer per analyzer
Whereas a tokenizer takes a string of text and produces a set of tokens, a filter takes a set of tokens, processes them, and produces a different set of tokens. The Solr Reference Guide says that
in most cases a filter looks at each token in the stream sequentially
and decides whether to pass it along, replace it or discard it.
Notice that unlike tokenizers, whose job is to split text into tokens, the job of filters is a bit more complex since they might replace the token with a new one or discard it altogether.
Solr comes with many built-in Filters that we can use to perform useful transformations. For example the ASCII Folding Filter converts non-ASCII characters to their ASCII equivalent (e.g. "México" is converted to "Mexico"). Likewise the English Possessive Filter removes singular possessives (trailing 's) from words. Another useful filter is the Porter Stem Filter that calculates word stems using English language rules (e.g. both "jumping" and "jumped" will be reduced to "jump".)
When we looked at the definition for the text_en
field type we noticed that at index time several filters were applied (StopFilterFactory
, LowerCaseFilterFactory
, EnglishPossessiveFilterFactory
, KeywordMarkerFilterFactory
, and PorterStemFilterFactory
.)
That means that if we index the text "The Television is Broken!" in a text_en
field the filters defined in the indexAnalyzer
will transform this text into two tokens: "televis", and "broken". Notice how the tokens were lowercased, the stop words ("the" and "is") dropped, and only the stem of the word "television" was indexed.
Likewise, the definition for text_en
included the additional filter SynonymGraphFilter
at query time. So if we were to query for the text "The TV is Broken!" Solr will run this text through the filters indicated in the queryAnalyzer
section and generate the following tokens: "televis", "tv", and "broken". Notice that an additional transformation was done to this text, namely, the word "TV" was expanded to its synonyms. This is because the queryAnalyzer
uses the SynonymGraphFilter
and a standard Solr configuration comes with those synonyms predefined in the synonyms.txt
file.
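Besides the Analysis Screen described below, Solr also exposes this analysis step over HTTP via the /analysis/field handler. Assuming that handler is enabled (it is an implicit handler in recent versions of Solr) a request like this should show the tokens produced for a given field type:

$ curl 'http://localhost:8983/solr/bibdata/analysis/field?analysis.fieldtype=text_en&analysis.fieldvalue=The+Television+is+Broken!'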
The Analysis Screen in the Solr Admin tool is a great way to see how a particular text is either indexed or queried by Solr depending on the field type. Point your browser to http://localhost:8983/solr/#/bibdata/analysis and try the following examples:
- Enter "The quick brown fox jumps over the lazy dog" in the "Field Value (index)" box, select string as the field type and see how it is indexed. Then select text_general and click "Analyze Values" to see how it's indexed. Lastly, select text_en and see how it's indexed. You might want to uncheck the "Verbose output" option to see the differences more clearly.

- With the text still in the "Field Value (index)" text box, enter "The quick brown fox jumps over the LAZY dog" in the "Field Value (query)" box and try the different field types (string/text_general/text_en) again to see how each of them shows different matches.

- Try changing the text in the "Field Value (query)" text box to "The quick brown foxes jumped over the LAZY dogs". Compare the results using text_general versus text_en.

- Now enter "The TV is broken!" in the "Field Value (index)" text box, clear the "Field Value (query)" text box, select text_en, and see how the value is indexed. Then do the reverse: clear the indexed value and enter "The TV is broken!" in the "Field Value (query)" text box and notice synonyms being applied.

- Now enter "The TV is broken!" in the "Field Value (index)" text box and "the television is broken" in the "Field Value (query)" box. Notice how they are matched because of the synonyms applied for text_en fields.

- Now enter "The TV is broken!" in the "Field Value (index)" text box and clear the "Field Value (query)" text box, select text_general and notice how the stop words were not removed because we are not using English-specific rules.
If your data has text in Chinese, Japanese, or Korean (CJK) Solr has built-in support for searching text in these languages using the proper transformations. Just as Solr uses different transformations for field type text_en than for text_general, Solr applies different rules when using field type text_cjk.
You can see the definition of this field type with the following command. Notice how there are two new filters (CJKWidthFilterFactory
and CJKBigramFilterFactory
) that are different from what we saw in the text_en
definition.
$ curl localhost:8983/solr/bibdata/schema/fieldtypes/text_cjk
# ...
# "fieldType":{
# "name":"text_cjk",
# "class":"solr.TextField",
# "positionIncrementGap":"100",
# "analyzer":{
# "tokenizer":{
# "class":"solr.StandardTokenizerFactory"},
# "filters":[
# {"class":"solr.CJKWidthFilterFactory"},
# {"class":"solr.LowerCaseFilterFactory"},
# {"class":"solr.CJKBigramFilterFactory"}]}}}
#
If you go to the Analysis Screen again and enter "胡志明" (Ho Chi Minh) as the "Field Value (index)", select text_general as the FieldType, and analyze the values, you'll notice how Solr calculated three tokens ("胡", "志", and "明"), which is incorrect in Chinese. However, if you select text_cjk and analyze the values again you'll notice that you end up with two tokens ("胡志" and "志明") thanks to the CJKBigramFilterFactory, and that is the expected behavior for text in Chinese.
The data for this section was taken from this blog post. Although the technology referenced in the blog post is a bit dated, the basic concepts explained are still relevant, particularly if you, like me, are not a CJK speaker. Naomi Dushay's CJK with Solr for Libraries is a great resource on this topic.
There are two properties on a Solr field that control whether its values are stored
, indexed
, or both.
- Fields that are stored but not indexed can be fetched once a document has been found, but you cannot search by those fields (i.e. you cannot reference them in the
q
parameter). - Fields that are indexed but not stored are the reverse, you can search by them but you cannot fetch their values once a document has been found (i.e. you cannot reference them in the
fl
parameter).
Technically it's also possible to add a field that is neither stored nor indexed but that's beyond the scope of this tutorial.
There are many reasons to toggle the stored and indexed properties of a field. For example, perhaps we want to store a complex object as string in Solr so that we can display it to the user but we really don't want to index its values because we don't expect to ever search by this field. Conversely, perhaps we want to create a field with a combination of values to optimize a particular kind of search, but we don't want to display it to the users (the default _text_
field in our schema is such an example).
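If we ever need a field like that we can declare it explicitly via the Schema API and set the stored and indexed properties ourselves. The notes_display field below is hypothetical, just to show the syntax:

$ curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field": {
    "name": "notes_display",
    "type": "string",
    "stored": true,
    "indexed": false
  }
}' http://localhost:8983/solr/bibdata/schema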
So far we have only worked with the fields that were automatically added to our bibdata core as we imported the data. Because the fields in our source data had suffixes (e.g. _txt_en) that match the default dynamicField definitions in a standard Solr installation, most of our fields were created with the proper field type except, as we saw earlier, the fields ending in _txts_en which were created as text_general fields rather than text_en fields (because there was no dynamicField definition for _txts_en fields).
Also, although it's nice that we can do sophisticated searches by title (because it is a text_en field) we cannot sort the results by this field because it's a tokenized field (technically we can sort by it but the results will not be what we would expect.)
Let's customize our schema a little bit to get the most out of Solr.
Let's begin by recreating our Solr core so that we have a clean slate.
Delete the existing bibdata
core in Solr
$ docker exec -it solr-container solr delete -c bibdata
# Deleting core 'bibdata' using command:
# http://localhost:8983/solr/admin/cores?action=UNLOAD&core=bibdata...
Then re-create it
$ docker exec -it solr-container solr create_core -c bibdata
# WARNING: Using _default configset with data driven schema functionality.
# ...
#
# Created new core 'bibdata'
And finally query it (you should have zero documents since we have not re-imported the data)
$ curl 'http://localhost:8983/solr/bibdata/select?q=*:*'
#
# "response":{"numFound":0,"start":0,"docs":[]
#
This time before we import the data in the books.json
file we are going to add a few field definitions to the schema to make sure the data is indexed and stored in the way that we want to.
The first thing we'll do is add a new dynamicField
definition to account for multi-value text fields in English for fields that end with _txts_en
in our JSON data:
$ curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-dynamic-field":{
"name":"*_txts_en",
"type":"text_en",
"multiValued":true}
}' http://localhost:8983/solr/bibdata/schema
This will make sure Solr indexes these fields as text_en rather than the default text_general that it used when we did not have a dynamicField to account for them.
Secondly we'll ask Solr to store a string version of the title (in addition to the text version) so we can sort results by title. To do this we'll add a copy-field
directive to our Schema to copy the value of the title_txt_en
to another field (title_s
). This way we'll have a text version for searching and a string version for sorting.
$ curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-copy-field":[
{
"source":"title_txt_en",
"dest":[ "title_s" ]
}
]
}' http://localhost:8983/solr/bibdata/schema
Right now we have two separate fields for author information (author_txt_en
for the main author and authors_other_txts_en
for additional authors) which means that if we want to find books by a particular author we have to issue a query against two separate fields: author_txt_en:"Sarah" OR authors_other_txts_en:"Sarah"
Let's use a copy-field
directive to have Solr automatically combine the main author and additional authors into a new field. Notice that the new field authors_all_txts_en
matches the dynamicField
directive that we just created, meaning that it will be indexed as text_en
multi-valued.
$ curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-copy-field":[
{
"source":"author_txt_en",
"dest":[ "authors_all_txts_en" ]
},
{
"source":"authors_other_txts_en",
"dest":[ "authors_all_txts_en" ]
}
]
}' http://localhost:8983/solr/bibdata/schema
Another customization that we'll do is to aggregate all the subject fields (subjects_ss, subjects_geo_ss, subjects_chrono_ss) into a new single field subjects_all_txts_en and we'll make that field a text field so that we can search by subject easily. We'll do this via a copy field
$ curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-copy-field":[
{
"source":"subjects_ss",
"dest": "subjects_all_txts_en"
},
{
"source":"subjects_geo_ss",
"dest": "subjects_all_txts_en"
},
{
"source":"subjects_chrono_ss",
"dest": "subjects_all_txts_en"
}
]
}' http://localhost:8983/solr/bibdata/schema
As we saw earlier, by default, if no field is indicated in a search, Solr searches in the _text_
field. This field is already defined in our schema but we are currently not populating it with anything since the field does not exist in our books.json
data file. Let's fix that, let's tell Solr to copy the value of every field into the _text_
field by using a copyField
definition like the one below:
$ curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-copy-field":[
{
"source":"*",
"dest":[ "_text_" ]
}
]
}' http://localhost:8983/solr/bibdata/schema
In a production environment we would probably want to be more selective about how we populate _text_ but this will do for us.
Now that we have configured our schema with a few specific field definitions let's re-import the data so that fields are indexed using the new configuration.
$ docker exec -it solr-container post -c bibdata books.json
# /opt/java/openjdk/bin/java -classpath ...
# SimplePostTool version 5.0.0
# Posting files to [base] url http://localhost:8983/solr/bibdata/update...
# Entering auto mode. ...
# POSTing file books.json (application/json) to [base]/json/docs
# 1 files indexed.
# COMMITting Solr index changes to http://localhost:8983/solr/bibdata/update...
# Time spent: 0:00:02.871
Now that we have a string version of the title field it is possible to sort our search results by this field. For example, let's search for books that have the word "water" in the title (q=title_txt_en:water) and sort them by title (sort=title_s+asc):
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:water&sort=title_s+asc'
#
# response will include
# ...
# "title_txt_en":"A practical guide to creating and maintaining water quality /",
# "title_txt_en":"A practical guide to particle counting for drinking water treatment /",
# "title_txt_en":"Applied ground-water hydrology and well hydraulics /",
# "title_txt_en":"Assessment of blue-green algal toxins in raw and finished drinking water /",
# "title_txt_en":"Bureau of Reclamation..."
# "title_txt_en":"Carry me across the water : a novel /",
# "title_txt_en":"Clean Water Act : proposed revisions to EPA regulations to clean up polluted waters /",
# "title_txt_en":"Cold water burning /"
# ...
#
Notice that the results are sorted alphabetically by title because we are using the string version of the field (title_s) for sorting. Try and see what the results look like if you sort by the text version of the title (title_txt_en):
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:water&sort=title_txt_en+asc'
The results in this case will not look correct because Solr will be using the tokenized value of the title_txt_en
field to sort rather than the string version.
Take a look at the data for this particular book that has many authors and notice how the authors_all_txts_en
field has the combination of author_txt_en
and authors_other_txts_en
even though our source data didn't have an authors_all_txts_en
field:
$ curl 'http://localhost:8983/solr/bibdata/select?q=id:00009214'
#
# {
# "id":"00009214",
# "author_txt_en":"Everett, Barbara,",
# "authors_other_txts_en":["Gallop, Ruth,"]
# "authors_all_txts_en":["Everett, Barbara,", "Gallop, Ruth,"],
# }
#
Likewise, let's search for books authored by "Gallop" using our new authors_all_txts_en
field (q=authors_all_txts_en:Gallop
) and notice how this document will be on the results regardless of whether Ruth Gallop is the main author or an additional author.
$ curl 'http://localhost:8983/solr/bibdata/select?q=authors_all_txts_en:Gallop'
Let's run a query without specifying what field to search on, for example q=biology
$ curl 'http://localhost:8983/solr/bibdata/select?q=biology&debug=all'
The result will include all documents where the word "biology" is found in the _text_
field and since we are now populating this field with a copy of every value in our documents this means that we'll get back any document that has the word "biology" in the title, the author, or the subject.
We can confirm that Solr is searching on the _text_ field by looking at the information in the parsed query, which will look like this:
"debug":{
"rawquerystring":"biology",
"querystring":"biology",
"parsedquery":"_text_:biology",
Notice that our raw query "biology"
got parsed as "_text_:biology"
.
When we issue a search to Solr we pass the search parameters in the query string. In previous examples we passed values in the q
parameter to indicate the values that we want to search for and fl
to indicate what fields we want to retrieve. For example:
$ curl 'http://localhost:8983/solr/bibdata/select?q=*&fl=id,title_txt_en'
In some instances we passed rather sophisticated values for these parameters, for example we used q=title_txt_en:"art history"~3 when we wanted to search for books with the words "art" and "history" in the title within a few words of each other.
The components in Solr that parse these parameters are called Query Parsers. Their job is to extract the parameters and create a query that Lucene can understand. Remember that Lucene is the search engine underneath Solr.
Out of the box Solr comes with three query parsers: Standard, DisMax, and Extended DisMax (eDisMax). Each of them has its own advantages and disadvantages.
-
The Standard query parser (aka the Lucene Parser) "supports a robust and fairly intuitive syntax allowing you to create a variety of structured queries. The largest disadvantage is that it’s very intolerant of syntax errors, as compared with something like the DisMax Query Parser which is designed to throw as few errors as possible."
-
The DisMax query parser (DisMax) interface "is more like that of Google than the interface of the 'lucene' Solr query parser. This similarity makes DisMax the appropriate query parser for many consumer applications. It accepts a simple syntax, and it rarely produces error messages."
-
The Extended DisMax (eDisMax) query parser is an improved version of the DisMax parser that is also very forgiving on errors when parsing user entered queries and like the Standard query parser supports complex query expressions.
One key difference among these parsers is that they recognize different parameters. For example, the DisMax and eDisMax parsers support a qf parameter to specify what fields should be searched, but this parameter is not supported by the Standard parser.
The rest of the examples in this section are going to use the eDisMax parser; notice the defType=edismax in our queries to Solr to make this selection. As we will see later in this tutorial you can also set the default query parser of your Solr core to use eDisMax by updating the defType parameter in your solrconfig.xml so that you don't have to explicitly set it on every query.
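One way to change that default without editing solrconfig.xml by hand is Solr's Config API. The following is only a rough, untested sketch of the idea (overriding the defaults of the /select handler); check the Config API documentation before relying on it:

$ curl -X POST -H 'Content-type:application/json' --data-binary '{
  "update-requesthandler": {
    "name": "/select",
    "class": "solr.SearchHandler",
    "defaults": { "defType": "edismax" }
  }
}' http://localhost:8983/solr/bibdata/config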
The number of search parameters that you can pass to Solr is rather large and, as we've mentioned, they also depend on what query parser you are using.
To see a comprehensive list of the parameters that apply to all parsers take a look at the Common Query Parameters and the Standard Query Parser sections in the Solr Reference Guide.
Below are some of the parameters that are supported by all parsers:
- defType: Query parser to use (default is lucene, other possible values are dismax and edismax)
- q: Search query, the basic syntax is field:"value".
- sort: Sorting of the results (default is score desc, i.e. highest ranked document first)
- rows: Number of documents to return (default is 10)
- start: Index of the first document to return (default is 0)
- fl: List of fields to return in the result.
- fq: Filters results without calculating a score.
Below are a few sample queries to show these parameters in action. Notice that spaces are URL encoded as +
in the curl
commands below, you do not need to encode them if you are submitting these queries via the Solr Admin interface in your browser.
- Retrieve the first 10 documents where the
title_txt_en
includes the word "washington" (q=title_txt_en:washington
)
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:washington'
- The next 15 documents for the same query (notice the
start=10
androws=15
parameters)
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:washington&start=10&rows=15'
- Retrieve the
id
andtitle_txt_en
(fl=id,title_txt_en
) where the title includes the words "women writers" but allowing for a word in between e.g. "women nature writers" (q=title_txt_en:"women writers"~1
) Technically the~N
means "N edit distance away" (See Solr in Action, p. 63).
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:"women+writers"~1&fl=id,title_txt_en'
- Documents that have additional authors (
q=authors_other_txts_en:*
), the*
means "any value".
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en,author_txt_en,authors_other_txts_en&q=authors_other_txts_en:*'
- Documents that do not have additional authors (
q=NOT authors_other_txts_en:*
) be aware that theNOT
must be in uppercase.
$ curl 'http://localhost:8983/solr/bibdata/select?fl=*&q=NOT+authors_other_txts_en:*'
- Documents where at least one of the subjects is about "communication" (
q=subjects_all_txts_en:communication
) -- in reality because this field is a Text in English field this query will return all documents where thesubjects_all_txts_en
has the word "commun", the stem of "communication", you can validate this in the debug output:
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en,subjects_all_txts_en&q=subjects_all_txts_en:communication&debug=all'
- Documents where title include "science" and at least one of the subjects is "women" (
q=title_txt_en:science AND subjects_all_txts_en:women
notice that both search conditions are indicated in theq
parameter) Again, notice that theAND
operator must be in uppercase.
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en,subjects_all_txts_en&q=title_txt_en:science+AND+subjects_all_txts_en:women'
- Documents where title includes the word "history" but does not include the word "art" (
q=title_txt_en:history AND NOT title_txt_en:art
)
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:history+AND+NOT+title_txt_en:art'
The Solr Reference Guide and this Lucene tutorial are good places to check for quick reference on the query syntax.
The DisMax and eDisMax query parsers provide another parameter, Query Fields qf
, that should not be confused with the q
or fq
parameters. The qf
parameter is used to indicate the list of fields that the search should be executed on along with their boost values.
If we want to search for the same value in multiple fields at once (e.g. if we want to find all books where the title or the author includes the text "Washington") we must indicate each field/value pair individually: q=title_txt_en:"Washington" authors_all_txts_en:"Washington"
.
The qf parameter allows us to specify the fields separately from the terms so that we can instead use q="Washington" and qf=title_txt_en authors_all_txts_en. This is really handy if we want to customize what fields are searched in an application in which the user enters a single text (say "Washington") and the application automatically searches multiple fields.
Below is an example of this (remember to select the eDisMax parser (defType=edismax
) when using the qf
parameter):
$ curl 'http://localhost:8983/solr/bibdata/select?q="washington"&qf=title_txt_en+authors_all_txts_en&defType=edismax'
Solr provides an extra parameter debug=all
that we can use to get debug information about a query. This is particularly useful if the results that we get are not what we were expecting. For example, let's run the same query again but this time passing the debug=all
parameter:
$ curl 'http://localhost:8983/solr/bibdata/select?q="washington"&qf=title_txt_en+authors_all_txts_en&defType=edismax&debug=all'
# response will include
# {
# "responseHeader":{...}
# "response":{...}
# "debug":{
# "rawquerystring":"\"washington\"",
# "querystring":"\"washington\"",
# "parsedquery":"+DisjunctionMaxQuery((title_txt_en:washington | authors_all_txts_en:washington))",
# "parsedquery_toString":"+(title_txt_en:washington | authors_all_txts_en:washington)",
# "explain":{
# ... tons of information here ...
# }
# "QParser":"ExtendedDismaxQParser",
# }
# }
#
Notice the debug property in the output. Inside this property there is information about:
- what value the server received for the search (querystring), which is useful to detect if we are not properly URL encoding the value sent to the server
- how the server parsed the query (parsedquery), which is useful to detect if the syntax in the q parameter was parsed as we expected (e.g. remember the example earlier when we passed the two words art history without surrounding them in quotes and the parsed query showed that Solr was querying two different fields: title_txt_en for "art" and _text_ for "history"); you can also see that some of the search terms were stemmed, e.g. if you query for "running" you'll notice that the parsed query shows "run" (see the example after this list)
- how each document was ranked (explain)
- what query parser (QParser) was used
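For example, here is a quick query we could run to see stemming reflected in the debug output. This is just a sketch; assuming the English stemming described above, the parsedquery value should show the stemmed term "run" rather than "running":
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:running&debug=all'
#
# the "parsedquery" value in the debug output should look
# something like title_txt_en:run
#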
Check out this blog post for more information about debugQuery.
When Solr finds documents that match the query it ranks them so that the most relevant documents show up first. You can provide Solr guidance on what fields are more important to you so that Solr considers this when ranking documents that match a given query.
Let's say that we want documents where the word "Washington" (q=washington
) is found in the title or in the author (qf=title_txt_en authors_all_txts_en
)
$ curl 'http://localhost:8983/solr/bibdata/select?&q=washington&qf=title_txt_en+authors_all_txts_en&defType=edismax'
Now let's say that we want to boost the documents where the author has the word "Washington" ahead of the documents where "Washington" was found in the title. To do this we update the qf parameter as follows: qf=title_txt_en authors_all_txts_en^5 (notice the ^5 to boost the authors_all_txts_en field)
$ curl 'http://localhost:8983/solr/bibdata/select?&q=washington&qf=title_txt_en+authors_all_txts_en^5&defType=edismax'
Notice how documents where the author is named "Washington" come first, but we still get documents where the title includes the word "Washington".
Boost values are arbitrary: you can use 1, 20, 789, 76.2, 1000, or whatever number you like, and you can even use negative numbers (qf=title_txt_en authors_all_txts_en^-10). They are just a way for us to hint to Solr which fields we consider more important in a particular search.
If we want to see why Solr ranked a result higher than another, we can pass an additional parameter debug.explain.structured=true to see an explanation of how Solr ranked each of the documents in the result:
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:west+authors_all_txts_en:washington&debug=all&debug.explain.structured=true'
The result will include an explain
node with a ton of information for each of the documents ranked. This information is rather complex but it has a wealth of details that could help us figure out why a particular document is ranked higher or lower than what we would expect. Take a look at this blog post to get an idea on how to interpret this information.
You can also filter a field to be within a range by using the bracket operator with the following syntax: field:[firstValue TO lastValue]
. For example, to request documents with id
between 00010500
and 00012050
we could do: id:[00010500 TO 00012050]
. You can also indicate open-ended ranges by passing an asterisk as the value, for example: id:[* TO 00012050]
.
$ curl 'http://localhost:8983/solr/bibdata/select?q=id:\[00010500+TO+00012050\]'
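And a quick sketch of the open-ended version mentioned above, where the asterisk takes the place of the first value:
$ curl 'http://localhost:8983/solr/bibdata/select?q=id:\[*+TO+00012050\]'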
Be aware that range filtering with string fields works as you would expect, but with text_general and text_en fields it filters on the indexed terms, not on the value of the field.
Searching is a large and complex topic. I've found the book "Relevant search with applications for Solr and Elasticsearch" (see references) to be a good conceptual reference with specifics on how to understand and configure Solr to improve search results. Chapter 3 of this book goes into great detail on how to read and understand the ranking of results.
One of the most popular features of Solr is the concept of facets. The Solr Reference Guide defines it as:
Faceting is the arrangement of search results into categories
based on indexed terms.
Searchers are presented with the indexed terms, along with numerical
counts of how many matching documents were found for each term.
Faceting makes it easy for users to explore search results, narrowing
in on exactly the results they are looking for.
You can easily get facet information from a query by selecting what field (or fields) you want to use to generate the categories and the counts. The basic syntax is facet=on
followed by facet.field=name-of-field
. For example to facet our dataset by subjects we would use the following syntax: facet.field=subjects_ss
as in the following example:
$ curl 'http://localhost:8983/solr/bibdata/select?q=*&facet=on&facet.field=subjects_ss'
# result will include
#
# "facet_counts":{
# "facet_queries":{},
# "facet_fields":{
# "subjects_ss":[
# "Women",435,
# "Large type books",415,
# "African Americans",337,
# "English language",330,
# "World War, 1939-1945",196,
# ...
#
IMPORTANT: You might have noticed that we are using the string
representation of the subjects (subjects_ss
) to generate the facets rather than the text_en
version stored in the subjects_all_txts_en
field. This is because, as the Solr Reference Guide indicates facets are calculated "based on indexed terms". The indexed version of the subjects_all_txts_en
field is tokenized whereas the indexed version of subjects_ss
is the entire string.
You can indicate more than one facet.field
in a query to Solr (e.g. facet.field=publisher_name_s&facet.field=subjects_ss
) to get facets for more than one field.
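For example, a quick sketch combining the two fields mentioned above in a single request:
$ curl 'http://localhost:8983/solr/bibdata/select?q=*&facet=on&facet.field=publisher_name_s&facet.field=subjects_ss'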
There are several extra parameters that you can pass to Solr to customize how many facets are returned in the result set. For example, if you want to list only the top 20 subjects in the facets rather than all of them you can indicate this with the following syntax: f.subjects_ss.facet.limit=20. You can also request only facets that have at least a certain number of matches, for example only subjects that have at least 50 books (f.subjects_ss.facet.mincount=50) as shown in the following example:
$ curl 'http://localhost:8983/solr/bibdata/select?q=*&facet=on&facet.field=subjects_ss&f.subjects_ss.facet.limit=20&f.subjects_ss.facet.mincount=50'
You can also facet by multiple fields at once; this is called Pivot Faceting. The way to do this is via the facet.pivot parameter.
Note: Unfortunately the facet.pivot parameter is not available via the Solr Admin web page; if you want to try this example you will have to do it via the command line on the terminal.
This parameter allows you to list the fields that should be used to facet the data, for example to facet the information by subject and then by publisher (facet.pivot=subjects_ss,publisher_name_s
) you could issue the following command:
$ curl 'http://localhost:8983/solr/bibdata/select?q=*&facet=on&facet.pivot=subjects_ss,publisher_name_s&facet.limit=5'
#
# response will include facets organized as follows:
#
# "facet_counts":{
# "facet_pivot":{
# "subjects_ss,publisher_name_s":[{
# "field":"subjects_ss",
# "value":"Women",
# "count":435,
# "pivot":[{
# "field":"publisher_name_s",
# "value":"Chelsea House Publishers,",
# "count":22},
# {
# "field":"publisher_name_s",
# "value":"Enslow Publishers,",
# "count":13},
# ...
# ]
# }
# ]
# ...
#
Notice how the results for the subject "Women" (435 results) are broken down by publisher under the "pivot" section.
Another Solr feature is the ability to return a fragment of the document where the match was found for a given search term. This is called highlighting.
Let's say that we search for books where one of the authors or the title includes the word "Washington". To do this we'll set our parameters as follows:
- q=washington
- qf=title_txt_en authors_all_txts_en
- defType=edismax (the qf parameter does not work with the Standard parser so we explicitly select eDisMax)
- hl=on (this is what enables hit highlighting)
$ curl 'http://localhost:8983/solr/bibdata/select?defType=edismax&q=washington&qf=title_txt_en+authors_all_txts_en&hl=on'
#
# response will include a highlight section like this
#
# "highlighting":{
# "00065343":{
# "title_txt_en":["<em>Washington</em> Irving's The legend of Sleepy Hollow.."],
# "authors_all_txts_en":["Irving, <em>Washington</em>,"]},
# "00107795":{
# "authors_all_txts_en":["<em>Washington</em>, Durthy."]},
# "00044606":{
# "title_txt_en":["University of <em>Washington</em> /"]},
#
Notice how the highlighting
property includes the id
of each document in the result (e.g. 00065343
), the field where the match was found (e.g. authors_all_txts_en
and/or title_txt_en
) and the text that matched within the field (e.g. University of <em>Washington</em> /
). You can display this information along with your search results to allow the user to "preview" why each result was returned.
In the next sections we'll make a few changes to the configuration of our bibdata core in order to enable some other features of Solr like synonyms and spell checking.
In Linux, Solr is typically installed under the /opt/solr
folder and the data for our cores is stored under the /var/solr/data
folder. We can see this in our Docker container if we log into it.
Open a separate terminal window and execute the following command to login into the container and see the files inside it:
$ docker exec -it solr-container /bin/bash
$ ls -la
#
# You'll see something like this
#
# bin CHANGES.txt docker lib LICENSE.txt NOTICE.txt README.txt
# books.json contrib example licenses modules prometheus-exporter server
#
While still inside the Docker container, issue the following command to see the files with the configuration for our bibdata core:
$ ls -la /var/solr/data/bibdata/conf/
#
# You'll see something like this
#
# drwxr-xr-x 2 solr solr 4096 Nov 11 07:31 lang
# -rw-r--r-- 1 solr solr 26665 Jan 15 18:07 managed-schema.xml
# -rw-r--r-- 1 solr solr 873 Nov 11 07:31 protwords.txt
# -rw-r--r-- 1 503 dialout 48192 Jan 15 19:45 solrconfig.xml
# -rw-r--r-- 1 solr solr 781 Nov 11 07:31 stopwords.txt
# -rw-r--r-- 1 solr solr 1124 Nov 11 07:31 synonyms.txt
Notice the solrconfig.xml
, managed-schema.xml
and the synonyms.txt
files. These are the files that we saw before under the "Files" option in the Solr Admin web page.
File managed-schema.xml
is where field definitions are declared. File solrconfig.xml
is where we configure many of the features of Solr for our particular bibdata
core. File synonyms.txt
is where we define what words are considered synonyms; we'll look more closely into this next.
Before we continue let's exit from the Docker container with the exit
command (don't worry the Docker container is still up and running in the background):
$ exit
In a previous section, when we looked at the text_general and text_en field types, we noticed that they use a filter to handle synonyms at query time.
Here is how to view that definition again:
$ curl 'http://localhost:8983/solr/bibdata/schema/fieldtypes/text_en'
#
# "queryAnalyzer":{
# "tokenizer":{
# ...
# },
# "filters":[
# ...
# a few filters go here
# ...
# {
# "class":"solr.SynonymGraphFilterFactory",
# "expand":"true",
# "ignoreCase":"true",
# "synonyms":"synonyms.txt"
# },
# ...
#
Notice how one of the filters uses the SynonymGraphFilterFactory to handle synonyms and references a file called synonyms.txt.
You can view the contents of the synonyms.txt
file for our bibdata
core through the Files option in the Solr Admin web page: http://localhost:8983/solr/#/bibdata/files?file=synonyms.txt
The contents of this file look more or less like this:
# Some synonym groups specific to this example
GB,gib,gigabyte,gigabytes
MB,mib,megabyte,megabytes
Television, Televisions, TV, TVs
#notice we use "gib" instead of "GiB" so any WordDelimiterGraphFilter coming
#after us won't split it into two words.
# Synonym mappings can be used for spelling correction too
pixima => pixma
In the data in our bibdata
core several of the books have the words "twentieth century" in the title but these books would not be retrieved if a user were to search for "20th century".
Let's try it, first let's search for q=title_txt_en:"twentieth century"
:
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:"twentieth+century"'
#
# result will include 84 results
#
And now let's search for q=title_txt_en:"20th century"
:
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:"20th+century"'
#
# result will include 22 results
#
We can tell Solr that "twentieth" and "20th" are synonyms by updating the synonyms.txt file, adding a line as follows:
20th,twentieth
Because our Solr is running inside a Docker container we need to update the synonyms.txt file inside the container. We are going to do this in four steps:
- First we'll copy synonyms.txt from the Docker container to our machine
- Then we'll update the file on our machine (with whatever editor we are comfortable with)
- Next we'll copy our updated local copy back to the container
- And lastly, we'll tell Solr to reload the core's configuration so the changes take effect.
To copy the synonyms.txt
from the container to our machine we'll issue the following command:
$ docker cp solr-container:/var/solr/data/bibdata/conf/synonyms.txt .
$ ls
#
# drwxr-xr-x 3 user-id staff 96 Jan 16 18:02 .
# drwxr-xr-x 51 user-id staff 1632 Jan 12 20:10 ..
# -rw-r--r--@ 1 user-id staff 1124 Nov 11 02:31 synonyms.txt
#
We can view the contents of the file with a command as follows:
$ cat synonyms.txt
#
# will include a few lines including
#
# GB,gib,gigabyte,gigabytes
# Television, Televisions, TV, TVs
#
Let's edit this file with whatever editor you are comfortable with. Our goal is to add a new line to make 20th and twentieth synonyms; we can do it like this:
$ echo "20th,twentieth" >> synonyms.txt
Now that we have updated our local copy of the synonyms file we need to copy this new version back to the Docker container; we can do this with a command like this:
$ docker cp synonyms.txt solr-container:/var/solr/data/bibdata/conf/
If we refresh the page http://localhost:8983/solr/#/bibdata/files?file=synonyms.txt in our browser we should see that the new line has been added to the synonyms.txt file. However, we must reload our core for the changes to take effect. We can do this as follows:
$ curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=bibdata'
# response will look similar to this
# {
# "responseHeader":{
# "status":0,
# "QTime":221}}
#
You can also reload the core via the Solr Admin page. Select "Core Admin", then "bibdata", and click "Reload".
If you run the queries again they will both report "106 results found" regardless of whether you search for q=title_txt_en:"twentieth century"
or q=title_txt_en:"20th century"
:
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:"twentieth+century"'
#
# result will include 106 results
# 84 with "twentieth century" plus 22 with "20th century"
#
To find more about synonyms take a look at this blog post where I talk about the different ways of adding synonyms, how to test them in the Solr Admin tool, and the differences between applying synonyms at index time versus query time.
One of the most important configuration files for a Solr core is solrconfig.xml
located in the configuration folder for the core. We can view the content of this file in our bibdata
core in this URL http://localhost:8983/solr/#/bibdata/files?file=solrconfig.xml
A default solrconfig.xml
file is about 1100 lines of heavily documented XML. We won't need to make changes to most of the content of this file, but there are a couple of areas that are worth knowing about: request handlers and search components.
Note: Despite its name, file solrconfig.xml
controls the configuration for our core, not for the entire Solr installation. Each core has its own solrconfig.xml
file.
To make things easier for the rest of this section let's download two copies of this file to our local machine:
$ docker cp solr-container:/var/solr/data/bibdata/conf/solrconfig.xml solrconfig.xml
$ docker cp solr-container:/var/solr/data/bibdata/conf/solrconfig.xml solrconfig.bak
$ ls
#
# drwxr-xr-x 4 user-id staff 128 Jan 16 18:19 .
# drwxr-xr-x 51 user-id staff 1632 Jan 12 20:10 ..
# -rw-r--r--@ 1 user-id staff 47746 Jan 16 18:36 solrconfig.bak
# -rw-r--r--@ 1 user-id staff 47746 Jan 16 18:36 solrconfig.xml
# -rw-r--r--@ 1 user-id staff 1151 Jan 16 18:12 synonyms.txt
#
solrconfig.xml
is the file that we will be working on. Like with the synonyms.txt
file before, the general workflow will be to make changes to this local version of the file, copy the updated file to the Docker container, and reload the Solr core to pick up the changes.
solrconfig.bak on the other hand is just a backup, in case we mess up solrconfig.xml and need to go back to a known good state.
When we submit a request to Solr the request is processed by a request handler. Throughout this tutorial all our queries to Solr have gone to a URL that ends with /select
, for example:
$ curl 'http://localhost:8983/solr/bibdata/select?q=*'
The /select in the URL points to a request handler defined in solrconfig.xml. If we look at the content of this file you'll notice (around line 733) a definition that looks like the one below; notice the "/select" in this request handler definition:
#
# <!-- Primary search handler, expected by most clients, examples and UI frameworks -->
# <requestHandler name="/select" class="solr.SearchHandler">
# <lst name="defaults">
# <str name="echoParams">explicit</str>
# <int name="rows">10</int>
# </lst>
# </requestHandler>
#
We can make changes to this section to indicate that we want to use the eDisMax query parser (defType
) by default and set the default query fields (qf
) to title and author. To do so we could update the "defaults" section as follows:
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="defType">edismax</str>
<str name="qf">title_txt_en authors_all_txts_en</str>
</lst>
</requestHandler>
We need to copy our updated file back to the Docker container and reload the core for the changes to take effect; let's do this with the following commands:
$ docker cp solrconfig.xml solr-container:/var/solr/data/bibdata/conf/
$ curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=bibdata'
Notice that now we can issue a much simpler query, since we don't have to specify the qf or defType parameters in the URL:
$ curl 'http://localhost:8983/solr/bibdata/select?q=west'
Be careful: an incorrect setting in the solrconfig.xml file can take our core down or cause queries to give unexpected results. For example, entering the qf value as title_txt_en, authors_all_txts_en (notice the comma separating the fields) will cause Solr to ignore this parameter.
The Solr Reference Guide has excellent documentation on what the values for a request handler mean and how we can configure them.
Request handlers in turn use search components to execute different operations on a search. Solr comes with several built-in default search components to implement faceting, highlighting, and spell checking to name a few.
You can find the definition of the search components in the solrconfig.xml
by looking at the searchComponent
elements defined in this file. For example, in our solrconfig.xml
there is a section like this for the highlighting feature that we used before:
<searchComponent class="solr.HighlightComponent" name="highlight">
<highlighting>
... lots of other properties are defined here...
<formatter name="html"
default="true"
class="solr.highlight.HtmlFormatter">
<lst name="defaults">
<str name="hl.simple.pre"><![CDATA[<em>]]></str>
<str name="hl.simple.post"><![CDATA[</em>]]></str>
</lst>
</formatter>
... lots of other properties are defined here...
Notice that the HTML tokens (<em> and </em>) that we saw in the highlighting results in the previous section are defined here.
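If we wanted the matches wrapped in a different tag, say <strong> instead of <em>, one option (a sketch, assuming this formatter is the one used for our queries) is to adjust these defaults in our local solrconfig.xml, copy the file back to the container, and reload the core as we did before:
<formatter name="html"
           default="true"
           class="solr.highlight.HtmlFormatter">
  <lst name="defaults">
    <str name="hl.simple.pre"><![CDATA[<strong>]]></str>
    <str name="hl.simple.post"><![CDATA[</strong>]]></str>
  </lst>
</formatter>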
Solr provides spellcheck functionality out of the box that we can use to help users when they misspell a word in their queries. For example, if a user searches for "Washingon" (notice the missing "t") most likely Solr will return zero results, but with the spellcheck turned on Solr is able to suggest the correct spelling for the query (i.e. "Washington").
In our current bibdata
core a search for "Washingon" will return zero results:
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:washingon'
#
# response will indicate
# {
# "responseHeader":{
# "status":0,
# "params":{
# "q":"title:washingon",
# "fl":"id,title"}},
# "response":{"numFound":0,"start":0,"docs":[]
# }}
#
Spellchecking is configured under the /select
request handler in solrconfig.xml
. To enable it we need to update the defaults
settings and enable the spellcheck
search component.
To do this let's edit our local solrconfig.xml
and replace the <requestHandler name="/select" class="solr.SearchHandler">
node again but now with the following content:
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="defType">edismax</str>
<str name="spellcheck">on</str>
<str name="spellcheck.extendedResults">false</str>
<str name="spellcheck.count">5</str>
<str name="spellcheck.alternativeTermCount">2</str>
<str name="spellcheck.maxResultsForSuggest">5</str>
<str name="spellcheck.collate">true</str>
<str name="spellcheck.collateExtendedResults">true</str>
<str name="spellcheck.maxCollationTries">5</str>
<str name="spellcheck.maxCollations">3</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
Then copy our updated version back to the Docker container and reload the core:
$ docker cp solrconfig.xml solr-container:/var/solr/data/bibdata/conf/
$ curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=bibdata'
The spellcheck component indicated above is already defined in solrconfig.xml with the following defaults:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<str name="queryAnalyzerFieldType">text_general</str>
<lst name="spellchecker">
<str name="name">default</str>
<str name="field">_text_</str>
<str name="classname">solr.DirectSolrSpellChecker</str>
...
</lst>
</searchComponent>
Notice how by default it will use the _text_
field for spellcheck.
Now that our bibdata
core has been configured to use spellcheck let's try our misspelled "washingon" query again:
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:washingon'
#
# response will still indicate zero documents found,
# but the spellcheck node will be populated
#
# "response":{"numFound":0,"start":0,"numFoundExact":true,"docs":[]},
# "spellcheck":{
# "suggestions":[
# "washingon",{
# "numFound":3,
# "startOffset":13,
# "endOffset":22,
# "suggestion":["washington",
# "washigton",
# "washing"]}],
# "collations":[
# "collation",{
# "collationQuery":"title_txt_en:washington",
# "hits":41,
# "misspellingsAndCorrections":[
# "washingon","washington"]},
Notice that even though we still got zero results back (numFound:0), the response now includes a spellcheck section with the words that were misspelled and the suggested spellings for them. We can use this information to alert the user that they may have misspelled a word, or to re-submit the query with the correct spelling.
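For example, a quick sketch of re-submitting the query using the collationQuery that Solr suggested above; based on the hits value in the collation we would expect this corrected query to return matching documents:
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:washington'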
- Solr Reference Guide
- Solr in Action by Trey Grainger and Timothy Potter
- Relevant search with applications for Solr and Elasticsearch by Doug Turnbull and John Berryman
- Let's build a Full-Text Search engine by Artem Krylysov
- The technology behind GitHub’s new code search
File books.json
contains roughly 30,000 books taken from the Library of Congress' MARC Distribution Services.
The steps to create the books.json file from the original MARC data are as follows:
- Download file BooksAll.2016.part01.utf8.gz from https://www.loc.gov/cds/downloads/MDSConnect/BooksAll.2016.part01.utf8.gz.
- Unzip it: gzip -d BooksAll.2016.part01.utf8.gz
- Process the unzipped file with marcli with the following command: marcli -file BooksAll.2016.part01.utf8 -match 2001 -matchFields 260 -format solr > books.json to include only books published in 2001. The original MARC file has 250,000 books but the resulting file only includes 30,424 records.
marcli
is a small utility program that I wrote in Go to parse MARC files. If you are interested in the part that generates the JSON out of the MARC record take a look at the processorSolr.go file.
I would like to thank my former team at the Brown University Library for their support and recommendations as I prepared the initial version of this tutorial back in 2017 as well as those that attended the workshop at the Code4Lib conference in Washington, DC in 2018 and San Jose, CA in 2019. A special thanks goes to Birkin Diana for helping me run the workshop in 2018 and 2019 and for taking the time to review the materials (multiple times!) and painstakingly testing each of the examples.
Likewise, a big thanks to Bess Sadler, Carolyn Cole, Francis Kayiwa, and James Griffin from the Princeton University Library for helping me run the workshop at Code4Lib 2023.