Reindexing documents
When Overview slices and dices your documents, it stores some parts of them in different places. The authoritative data store is Postgres. We store text in ElasticSearch for a speed boost; that is derived data.
This page explains how to rebuild the ElasticSearch data using the data in Postgres.
You may wish to reindex:
- If we suggest a new document mapping
- If you want to reconfigure shards and replicas
- If you had an unexpected failure and you aren't certain your ElasticSearch data is correct
- If you want to perform an upgrade and this seems like an easy option
Reindexing can take a few hours, and it will slow down Overview noticeably.
That's all. In particular, reindexing does not make Overview return incorrect results -- even while it's in progress.
You need some ElasticSearch concepts:
- A cluster is a group of ElasticSearch servers with a name.
- An index is a place where we store documents.
- A mapping describes how those documents are stored and indexed.
- An alias is a name we use to refer to an index.
- ElasticSearch runs an HTTP server on port 9200-9299 by default and a "transport" server on port 9300-9400. (When it starts up, it picks one that's available.)
Overview follows best practices. It uses an index, documents_v1, with a mapping. It writes new documents to documents, an alias which points to documents_v1. When you create a document set with ID 1234, Overview creates an alias documents_1234 which also points to documents_v1.
Upgrading involves creating a new index -- say, documents_v2 -- and pointing all the aliases to it as we fill it with documents from Postgres. Overview will automatically forget about documents_v1 and start using documents_v2 exclusively.
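To make "pointing an alias at a new index" concrete, here is a minimal sketch of the request body that ElasticSearch's _aliases endpoint accepts for moving one alias between indices. The function name alias_move_body is hypothetical, and this is not necessarily how Overview's reindex tool does it internally; the tool repoints aliases for you, so you should not need to run this by hand.

```shell
# Build the JSON body for ElasticSearch's _aliases endpoint that moves
# one alias from an old index to a new one in a single request.
# alias_move_body is a hypothetical helper name, used for illustration only.
alias_move_body() {
  alias_name=$1
  old_index=$2
  new_index=$3
  printf '{"actions":[{"remove":{"index":"%s","alias":"%s"}},{"add":{"index":"%s","alias":"%s"}}]}\n' \
    "$old_index" "$alias_name" "$new_index" "$alias_name"
}

# Example (not run here): send the body to the example server.
# alias_move_body documents_1234 documents_v1 documents_v2 |
#   curl -XPOST 'http://esserver:9200/_aliases' -d @-
```

Because both actions travel in one request, searches against documents_1234 never see a moment where the alias points nowhere.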
For the purposes of this example, we'll use these settings:
- ElasticSearch cluster name: SearchIndex (in development, it would be Dev SearchIndex)
- Old index name: documents_v1
- New index name: documents_v2
- Database URL: postgres://overview:overview@dbserver:9010/overview
- ElasticSearch HTTP server: http://esserver:9200
- ElasticSearch transport: esserver:9300
First, create the new index. We'll use curl to do this from the command line:
curl -XPUT 'http://esserver:9200/documents_v2' -d @- <<EOT
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "document": {
      "_id": { "path": "id" },
      "properties": {
        "id": { "type": "long", "store": "yes" },
        "document_set_id": { "type": "long" },
        "text": { "type": "string" },
        "supplied_id": { "type": "string" },
        "title": { "type": "string" }
      }
    }
  }
}
EOT
Choose the settings that you think most appropriate.
Unless you know what you're doing, copy/paste the mapping from common/src/main/resources/documents-mapping.json.
You should see a response like this:
{"ok":true,"acknowledged":true}
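If you script this step, it may help to stop when the response does not look like the one above. The following is a minimal sketch; the helper name is_acknowledged is hypothetical, and it assumes you have captured curl's output in a shell variable.

```shell
# Succeed (exit status 0) only if an ElasticSearch JSON response reports
# "acknowledged": true. Whitespace around the colon is tolerated.
# is_acknowledged is a hypothetical helper name, used for illustration only.
is_acknowledged() {
  printf '%s' "$1" | grep -Eq '"acknowledged"[[:space:]]*:[[:space:]]*true'
}

# Example (not run here): capture the response from the index-creation
# request, then refuse to continue unless it was acknowledged.
# is_acknowledged "$response" || { echo "index creation failed: $response" >&2; exit 1; }
```

This is a plain text match rather than a JSON parse, which is enough for a one-line response like the one shown.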
- Compile it:
./sbt reindex-documents/stage
- Run it:
upgrade/reindex-documents/target/universal/stage/bin/reindex-documents \
--database-url "postgres://overview:overview@dbserver:9010/overview" \
--elasticsearch-url "esserver:9300" \
--elasticsearch-cluster "SearchIndex" \
--index-name "documents_v2"
This will take a long time. If you cancel it by mistake, run it again to start over.
- Upload a new document set and test that you can search it.
- curl -XGET 'http://esserver:9200/documents_v2/_aliases' should output a lot of aliases. That's because it's the new main index. One of those aliases should be for the document set you just created.
- curl -XGET 'http://esserver:9200/documents_v1/_aliases' should output {"documents_v1":{"aliases":{}}}. That proves that Overview has forgotten about it.
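The empty-aliases check can also be done in a script before you remove anything. This is a sketch under the assumption that the response has exactly the shape shown above; has_no_aliases is a hypothetical helper name.

```shell
# Succeed only if an _aliases response for the given index lists no aliases,
# i.e. it equals {"<index>":{"aliases":{}}} once whitespace is removed.
# has_no_aliases is a hypothetical helper name, used for illustration only.
has_no_aliases() {
  index=$1
  response=$2
  compact=$(printf '%s' "$response" | tr -d '[:space:]')
  [ "$compact" = "{\"$index\":{\"aliases\":{}}}" ]
}

# Example (not run here): only delete the old index once nothing points at it.
# response=$(curl -s -XGET 'http://esserver:9200/documents_v1/_aliases')
# has_no_aliases documents_v1 "$response" &&
#   curl -XDELETE 'http://esserver:9200/documents_v1'
```

If the check fails, some alias still points at documents_v1 and the reindex has probably not finished.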
When you're ready, delete the old index: curl -XDELETE 'http://esserver:9200/documents_v1'
Remember, you're only deleting derived data. If you delete the wrong index by mistake, just run these steps again to rebuild it.