
Reindexing documents

Adam Hooper edited this page Jan 19, 2015 · 11 revisions

When Overview slices and dices your documents, it stores different parts of them in different places. The authoritative data store is Postgres. We store text in ElasticSearch for a speed boost; that is derived data.

This page explains how to rebuild the ElasticSearch data using the data in Postgres.

Why reindex?

You may wish to reindex:

  • If we suggest a new document mapping
  • If you want to reconfigure shards and replicas
  • If you had an unexpected failure and you aren't certain your ElasticSearch data is correct
  • If you want to perform an upgrade and this seems like an easy option

Why not reindex?

Reindexing can take a few hours, and it will slow down Overview noticeably.

That's all. In particular, reindexing does not make Overview return incorrect results -- even while it's in progress.

How to reindex

1. Learn the concepts (if you want)

You need some ElasticSearch concepts:

  • A cluster is a group of ElasticSearch servers with a name.
  • An index is a place where we store documents.
  • A mapping describes how those documents are stored and indexed.
  • An alias is a name we use to refer to an index.
  • ElasticSearch runs an HTTP server on ports 9200-9299 by default and a "transport" server on ports 9300-9400. (When it starts up, it picks an available port in each range.)

Overview follows best practices. It uses an index, documents_v1, with a mapping. It writes new documents to documents, an alias which points to documents_v1. When you create a document set with ID 1234, Overview creates an alias documents_1234 which also points to documents_v1.

Upgrading involves creating a new index -- say, documents_v2 -- and pointing all the aliases to it as we fill it with documents from Postgres. Overview will automatically forget about documents_v1 and start using documents_v2 exclusively.
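You can see this alias layout for yourself with ElasticSearch's _aliases endpoint (GET /_aliases on the HTTP server). The following is a sketch of the kind of response you might get, using the example document-set ID 1234 from above; the JSON is built and parsed locally here, so it doubles as a syntax check:

```shell
# Illustrative _aliases response for the layout described above.
# Both "documents" and the per-document-set alias point at documents_v1.
cat > /tmp/aliases-example.json <<'EOT'
{
  "documents_v1": {
    "aliases": {
      "documents": {},
      "documents_1234": {}
    }
  }
}
EOT

# Pretty-print it; a typo in the JSON would fail here.
python -m json.tool < /tmp/aliases-example.json
```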

For the purposes of this example, we'll use these settings:

  • ElasticSearch cluster name: SearchIndex (in development, it would be Dev SearchIndex)
  • Old index name: documents_v1
  • New index name: documents_v2
  • Database URL: postgres://overview:overview@dbserver:9010/overview
  • ElasticSearch HTTP server: http://esserver:9200
  • ElasticSearch transport: esserver:9300

2. Create the new index and mapping

We'll use curl to do this from the command line.

curl -XPUT 'http://esserver:9200/documents_v2' -d @- <<EOT
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "document": {
      "_id": { "path": "id" },
      "properties": {
        "id":              { "type": "long", "store": "yes" },
        "document_set_id": { "type": "long" },
        "text":            { "type": "string" },
        "supplied_id":     { "type": "string" },
        "title":           { "type": "string" }
      }
    }
  }
}
EOT

Choose the settings that you think most appropriate.
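For instance, on a multi-node cluster you might spread the index across several shards and keep one replica so a single node failure loses nothing. The numbers below are illustrative, not recommendations; the settings body is parsed locally so mistakes are caught before you send it:

```shell
# Alternative settings sketch: 3 shards, 1 replica (values are examples).
cat > /tmp/settings-example.json <<'EOT'
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
EOT

# Validate the JSON before using it in the curl -XPUT body above.
python -m json.tool < /tmp/settings-example.json
```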

Unless you know what you're doing, copy/paste the mapping from common/src/main/resources/documents-mapping.json.

You should see a response like this:

{"ok":true,"acknowledged":true}

3. Run the reindexer (within checked-out Overview source code)

  1. Compile it: ./sbt reindex-documents/stage
  2. Run it:
upgrade/reindex-documents/target/universal/stage/bin/reindex-documents \
  --database-url "postgres://overview:overview@dbserver:9010/overview" \
  --elasticsearch-url "esserver:9300" \
  --elasticsearch-cluster "SearchIndex" \
  --index-name "documents_v2"

This will take a long time. If you cancel it by mistake, run it again to start over.

4. Check that the aliases have moved

  1. Upload a new document set and test that you can search it.
  2. curl -XGET 'http://esserver:9200/documents_v2/_aliases' should output a lot of aliases. That's because it's the new main index. One of those aliases should be for the document set you just created.
  3. curl -XGET 'http://esserver:9200/documents_v1/_aliases' should output {"documents_v1":{"aliases":{}}}. That proves that Overview has forgotten about it.
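If the documents alias did not end up where you expect (say, the reindexer was interrupted at an awkward moment), you can repoint it yourself with a single atomic _aliases call. This is a sketch of a recovery step the reindexer normally performs for you; the index names match this example, and the body is validated locally before the (commented-out) POST:

```shell
# Atomically remove the "documents" alias from the old index and add it
# to the new one. Hypothetical recovery step; normally not needed.
cat > /tmp/move-alias.json <<'EOT'
{
  "actions": [
    { "remove": { "index": "documents_v1", "alias": "documents" } },
    { "add":    { "index": "documents_v2", "alias": "documents" } }
  ]
}
EOT

# Catch typos before sending the request.
python -m json.tool < /tmp/move-alias.json

# Then, against your real cluster:
# curl -XPOST 'http://esserver:9200/_aliases' -d @/tmp/move-alias.json
```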

5. Delete the old index

When you're ready: curl -XDELETE 'http://esserver:9200/documents_v1'

Remember, you're only deleting derived data. If you delete the wrong index by mistake, just run these steps again to rebuild it.
