
Reindexing documents

Adam Hooper edited this page Jan 19, 2015 · 11 revisions

When Overview slices and dices your documents, it stores different parts of them in different places. The authoritative data store is Postgres. We store text in ElasticSearch for a speed boost; that is derived data.

This page explains how to rebuild the ElasticSearch data using the data in Postgres.

Why reindex?

You may wish to reindex:

  • If we suggest a new document mapping
  • If you want to reconfigure shards and replicas
  • If you had an unexpected failure and you aren't certain your ElasticSearch data is correct
  • If you want to perform an upgrade and this seems like an easy option

Why not reindex?

Reindexing can take a few hours, and it will slow down Overview noticeably.

That's all. In particular, reindexing does not make Overview return incorrect results -- even while it's in progress.

How to reindex

1. Learn the concepts (if you want)

You need some ElasticSearch concepts:

  • A cluster is a group of ElasticSearch servers with a name.
  • An index is a place where we store documents.
  • A mapping describes how those documents are stored and indexed.
  • An alias is a name we use to refer to an index.
  • ElasticSearch runs an HTTP server on ports 9200-9299 by default and a "transport" server on ports 9300-9400. (When it starts up, it picks an available port in each range.)

Overview follows best practices. It uses an index, documents_v1, with a mapping. It writes new documents to documents, an alias which points to documents_v1. When you create a document set with ID 1234, Overview creates an alias documents_1234 which also points to documents_v1.

Upgrading involves creating a new index -- say, documents_v2 -- and pointing all the aliases to it as we fill it with documents from Postgres. Overview will automatically forget about documents_v1 and start using documents_v2 exclusively.
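You can see this alias layout for yourself with ElasticSearch's _aliases endpoint (GET /_aliases on the HTTP server). The following is a sketch of the kind of response you might get, using the example document-set ID 1234 from above; the JSON is built and parsed locally here, so it doubles as a syntax check:

```shell
# Illustrative _aliases response for the layout described above.
# Both "documents" and the per-document-set alias point at documents_v1.
cat > /tmp/aliases-example.json <<'EOT'
{
  "documents_v1": {
    "aliases": {
      "documents": {},
      "documents_1234": {}
    }
  }
}
EOT

# Pretty-print it; a typo in the JSON would fail here.
python -m json.tool < /tmp/aliases-example.json
```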

For the purposes of this example, we'll use these settings:

  • ElasticSearch cluster name: SearchIndex (in development, it would be Dev SearchIndex)
  • Old index name: documents_v1
  • New index name: documents_v2
  • Database URL: postgres://overview:overview@dbserver:9010/overview
  • ElasticSearch HTTP server: http://esserver:9200
  • ElasticSearch transport: esserver:9300

2. Create the new index and mapping

We'll use curl to do this from the command line.

curl -XPUT 'http://esserver:9200/documents_v2' -d @- <<EOT
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "document": {
      "_id": { "path": "id" },
      "properties": {
        "id":              { "type": "long", "store": "yes" },
        "document_set_id": { "type": "long" },
        "text":            { "type": "string" },
        "supplied_id":     { "type": "string" },
        "title":           { "type": "string" }
      }
    }
  }
}
EOT

Choose the settings that you think most appropriate.
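For instance, on a multi-node cluster you might spread the index across several shards and keep one replica so a single node failure loses nothing. The numbers below are illustrative, not recommendations; the settings body is parsed locally so mistakes are caught before you send it:

```shell
# Alternative settings sketch: 3 shards, 1 replica (values are examples).
cat > /tmp/settings-example.json <<'EOT'
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
EOT

# Validate the JSON before using it in the curl -XPUT body above.
python -m json.tool < /tmp/settings-example.json
```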

Unless you know what you're doing, copy/paste the mapping from common/src/main/resources/documents-mapping.json.

You should see a response like this:

{"ok":true,"acknowledged":true}

3. Run the reindexer (within checked-out Overview source code)

  1. Compile it: ./sbt reindex-documents/stage
  2. Run it:
upgrade/reindex-documents/target/universal/stage/bin/reindex-documents \
  --database-url "postgres://overview:overview@dbserver:9010/overview" \
  --elasticsearch-url "esserver:9300" \
  --elasticsearch-cluster "SearchIndex" \
  --index-name "documents_v2"

This will take a long time. If you cancel it by mistake, run it again to start over.

4. Check that the aliases have moved

  1. Upload a new document set and test that you can search it.
  2. curl -XGET 'http://esserver:9200/documents_v2/_aliases' should output a lot of aliases. That's because it's the new main index. One of those aliases should be for the document set you just created.
  3. curl -XGET 'http://esserver:9200/documents_v1/_aliases' should output {"documents_v1":{"aliases":{}}}. That proves that Overview has forgotten about it.
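If the documents alias did not end up where you expect (say, the reindexer was interrupted at an awkward moment), you can repoint it yourself with a single atomic _aliases call. This is a sketch of a recovery step the reindexer normally performs for you; the index names match this example, and the body is validated locally before the (commented-out) POST:

```shell
# Atomically remove the "documents" alias from the old index and add it
# to the new one. Hypothetical recovery step; normally not needed.
cat > /tmp/move-alias.json <<'EOT'
{
  "actions": [
    { "remove": { "index": "documents_v1", "alias": "documents" } },
    { "add":    { "index": "documents_v2", "alias": "documents" } }
  ]
}
EOT

# Catch typos before sending the request.
python -m json.tool < /tmp/move-alias.json

# Then, against your real cluster:
# curl -XPOST 'http://esserver:9200/_aliases' -d @/tmp/move-alias.json
```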

5. Delete the old index

When you're ready: curl -XDELETE 'http://esserver:9200/documents_v1'

Remember, you're only deleting derived data. If you delete the wrong index by mistake, just run these steps again to rebuild it.
