Search Engine Internals

Overview is built to search documents. The feature list keeps growing; at the time of writing, we support:

Tag search
ViewFilter search (i.e., search-by-plugin)
Full-text search with phrase searches, fuzzy searches and boolean operations, metadata search, notes search and search of document text and title
regular-expression search of the document's full text
sort results by metadata in addition to the default, document title

"Search" means transforming an HTTP request into a SelectionRequest, and a SelectionRequest into a Selection. We'll define these soon.

This document explains how Overview's search works in broad strokes; then it explains why we couldn't find a pre-existing solution for these problems.

Important Pieces

Many callers rely on Search. For instance:

DocumentListController#index: The document list in the front-end shows search results. (For now, this is the only use of search that reports progress. See the Sorting section for more about progress reporting.)
TagDocumentsController: We count, add and remove Tags on the document list using search results.
DocumentSetArchiveController: The user can export a selection of documents.
api.DocumentController#index: API users can send HTTP requests to /api/v1/document-sets/:documentSetId/documents

Broadly, the division of responsibilities is:

Web server:
- models.SelectionRequest: the user's query, as sent in an HTTP request
- models.DocumentIdList: an Array of sorted, 32-bit IDs that is somewhat expensive to transmit (~0.5s for 10M documents).
- models.DocumentIdSet: a set of document IDs, unordered, that can be transmitted cheaply. (See Document Lists and Where They're Stored for more details on DocumentIdList and DocumentIdSet.)
- models.Selection: a handle to a DocumentIdList and a list of models.SelectionWarnings, stored in Redis.
- controllers.SelectionHelpers: convenience methods for controllers
- backend.SelectionBackend: stores and retrieves Selections from Redis
- backend.DocumentSelectionBackend: transforms a SelectionRequest into an InMemorySelection (which can be written to Redis)
Worker(s):
- searchindex.DocumentSetLuceneIndex: queries a Lucene index of the document set (which it also maintains)
- sort.Sorter: runs an external merge sort, with progress reporting
- jobhandler.documentset.SortRunner: updates document_set.document_ids_sorted and document_id_list rows in the database by invoking Sorter.
Both:
- query.Query is the parsed representation of the user's search string. It is the most complex component of a SelectionRequest.
Front-end:
- Show.models.DocumentListParams: the client-side analogue to SelectionRequest, which plugins may modify
- Show.models.DocumentList: communicates with server to execute search
- Show.views.DocumentListTitle: displays progress of search
- Show.views.DocumentList: displays results

Full Specification

"Search" means:

Parse the SelectionRequest from the HTTP request
Check Redis for an existing Selection (result), and return it if it exists
Branch out to individual services to find DocumentIdSets:
1. Query Postgres with tags
2. Query Lucene with full-text
3. Query each ViewFilter
Intersect all results to find the definitive DocumentIdSet
Sort the DocumentIdSet
Filter by regular expression
Cache the resulting Selection in Redis

1. Parse the SelectionRequest

First, understand a distinction:

a SelectionRequest is a user's request: for instance, (tags A and B) AND (full-text query "foo"). It is calculated synchronously (without communicating with any other service). This is the input of the search engine.
a Selection is the search result. This is the output of the search engine. It includes:
- a DocumentIdList.
- a list of SelectionWarnings from various search engines.
- an ID (a [UUID])(https://en.wikipedia.org/wiki/Universally_unique_identifier)

A SelectionRequest is the search engine's representation of the user's HTTP request. It is completely independent from the database and search index.

A Selection is specific to the user's data at the exact time a search happened. If you search once, then add or edit documents, then search again, the second search's Selection may differ from the first search's Selection -- even with the same SelectionRequest.

We parse the SelectionRequest from HTTP GET query parameters or POST form data. The most complex part is the Query parser, which generates the Query we send to Lucene. The Query is a tree structure of arbitrary depth (to handle (A OR (B AND C))-type queries).

The definitive list of valid search queries is in QueryParserSpec.scala.

If query-parsing fails, we return that error to the user. Otherwise ... now we have a SelectionRequest!

Future directions: currently, the Query is one part of the SelectionRequest. Originally, Query was exactly the part which was sent to the search index. Now, it also includes regular expressions. Someday, it may encompass all aspects of the search.

2. Check Redis for an existing `Selection`

We have two different code paths that query Redis:

SelectionHelpers.selectFast() applies if the user supplies a selectionId. If selectionId is set, selectFast() queries Redis for the Selection and returns the result: either "not found" or the Selection.
SelectionHelpers.selectSlow() applies as long as the user does not pass ?refresh=true. It hashes the SelectionRequest and asks Redis for a selectionId matching the hash. If the selectionId exists, the Selection is valid and gets returned. Otherwise, we run the main query engine.

3. Gather `DocumentIdSet`s

Overview's search uses set logic: AND and OR. DocumentIdSet stores these in bitsets.

(When you browse the code you'll see references to DocumentIdFilter. Conceptually, this is equivalent to a DocumentIdSet. (DocumentIdFilter simply optimizes the Empty and All special cases.)

DocumentSelectionBackend queries all services in parallel. Each service can return a DocumentIdFilter and a list of SelectionWarnings.

3a. Tags (and other little-used features)

We store document tags, Tree nodes and StoreObjects in PostgreSQL. One query finds the document IDs that match the SelectionRequest.

3b. Lucene

Most full-text search is offloaded to the document set's dedicated Lucene index, DocumentSetLuceneIndex. Its searchForIds() converts the Query to Lucene syntax (producing SelectionWarnings if there are too many terms in the query, if queried metadata fields aren't indexed, or if the index hasn't been built yet) and builds the DocumentIdSet result.

The Lucene index is run on the worker machine(s). The DocumentSelectionBackend requests the search over Akka.

3c. ViewFilters

Plugins may create their own index structures, and DocumentSelectionBackend will search them in parallel over http. Each request can produce a warning if the plugin server doesn't respond or produces an invalid response.

4. Intersect all results

DocumentSelectionBackend's final idsFilter is simply the intersection of all DocumentIdSets.

5. Sort the results

Let's face reality: sorting is slow.

Sorting 10M IDs takes ~1s on a 3.5Ghz machine. Sorting documents by anything other than ID takes orders of magnitude longer. (Overview scales to 10M-document sets.)

Sorting can take minutes. If we sort by title, for instance, we need to read every title from the database, build 10M collation keys (say ~100 bytes each), and then sort the resulting 1GB of text. If we sort by a metadata field, it's slower still: we need to parse all the metadata JSON, which takes time; and we're forced to sort on disk, because there may be 100GB of collation keys to sort.

This is a perfectly-defined problem. Sorting algorithms have existed since the dawn of computing. PostgreSQL, Lucene and virtually any other database will run an external merge sort, which grows more-than-linearly with the number of documents and is may take minutes. The problem isn't how sorting should happen: The problem is when sorting should happen. We can't make every search take minutes.

Let's add restrictions to simplify the problem:

Overview sorts by title by default, and the user's other sort options are just metadata fields. That means there are few ways to sort a document list: few enough to cache each ordering.
Users can update metadata at any time, changing the sort order, and then search immediately afterwards. That means we need user-friendly feedback when we get a cache miss.

Overview's solution:

Cache sort results as DocumentIdLists in Postgres: document_set.document_ids_sorted (sorted by title) and the dedicated document_id_lists table (one row per metadata field). Each DocumentIdList in Postgres is a cache of the sort order of all documents in the document set. Overview invalidates these caches when the user modifies a document.

Sort using DocumentIdLists. Overview's sort algorithm is dead simple and takes milliseconds: SELECT the DocumentIdList matching the user's chosen sort field, and filter it for only the documents in the idsFilter. When a cache value exists, the bottleneck is the SELECT: it transfers a DocumentIdList over the network, which can be 40MB. This is an acceptable delay.

Give feedback when there is no cached value. Postgres and Lucene don't report progress. That means they don't give an estimate of when the sort will finish. Overview solves this by implementing its own sort engine that streams all Documents in the DocumentSet and produces a DocumentIdList that SortRunner writes to the database. The sort is a standard external merge sort: it reads documents and produces sort keys, sorting 200MB chunks at a time and flushing them to disk; then it merges the sorted pages, 16 at a time, into larger and larger files until at most 16 files remain; finally, it extracts document IDs while reading the last files. During these operations, SortRunner sends progress reports back to the DocumentSelectionBackend.

Progress reporting is a cross-cutting concern. It's more complex than external sort:

The SortRunner sends messages (via Akka) to DocumentSelectionBackend: several Sorting messages and then a SortDone message to indicate the DocumentIdList has been written to the database. (A DocumentIdList is too large to transfer via Akka.)
DocumentSelectionBackend's getSortedIds() accepts these Sorting messages and calls an onCallback function argument.
The onCallback argument is defined by DocumentListController, which fashions a special HTTP response (for non-API users). The response looks like: [ { progress: 0.25 }, { progress: 0.5 }, { progress: 0.7 }, <the actual result> ] The response is sent chunked (as progress reports come in).
The Overview web interface streams the JSON response and displays the progress.

After all that wiring, we have extremely fast sorting in the common case and a user-friendly progress bar when fast sort isn't possible.

6. Filter by regular expression

Overview's regular-expression searches operate on document text. There's no index for that: it's orders of magnitude slower than other types of search.

Currently, Overview has a failsafe to make sure queries return quickly enough: it limits the number of documents being searched by regex. The default limit is 2,000.

Future directions: an alternate approach would be to show a progressbar and allow users to cancel regex searches. But cancellation is another cross-cutting concern; the limit is simpler.

Remember the original Query is a tree of search instructions. Regex searches are part of that tree. But we don't process them with Lucene because Lucene doesn't hold the document text. And we don't run them in parallel with all the other searches, because we want the user to filter a smaller-than-2,000-documents subset of the document set.

Overview's solution: defer regular-expression search as long as possible.

First, a slight revision to the Lucene logic above: DocumentSetLuceneIndex simply ignores Regex query nodes. It rewrites them as "all documents."

After sorting, DocumentSelectionBackend#queryToRegexSearchRules() extracts the regex nodes. It generates warnings for Regex nodes that are inside OR(...) or NOT(AND(...)) nodes: Overview doesn't handle them. Regex search only works correctly when the Query tree looks like, AND(regexes, luceneCompatibleQuery), because those are the searches that let us defer regex queries.

With the regex nodes in hand, DocumentSelectionBackend filters the already-sorted DocumentIdList by streaming documents and applying all filters to all documents. If there are fewer than 2,000 documents before matching, it knows the search result is correct. Otherwise, filtering stops: it returns the set of matching documents and a warning that only 2,000 were tested.

7. Cache the result

Now we have a Selection: a list of SelectionWarnings and a DocumentIdList. SelectionBackend writes it to Redis in three separate keys:

The value at hashKey (a hash of the SelectionRequest) becomes selection.id (a new, randomly-generated UUID)
The value at selectionId:document-ids becomes the DocumentIdList (an array of big-endian integers up to 40MB large, stored as a String -- since Redis Strings are binary and GETRANGE lets us query substrings without ever sending the DocumentIdList over the wire again)
The value at selectionId:warnings serializes all the SelectionWarnings. (The serialization format is non-standard and may change in the future.)

Now we have a Selection. It's a handle to a value in Redis that holds up to 40MB of document IDs. Helper methods let the user load a Page of IDs at a time: that can create a result page, or it can be part of a streaming operation.

Summary: The Call Chain

To summarize, let's draw the call chain.

Search means transforming an HTTP request into a SelectionRequest, and a SelectionRequest into a Selection.

The call chain looks like this: (Signatures and variable names have been simplified.)

The controller calls SelectionHelpers#requestToSelection(documentSetId, userEmail, httpRequest, onProgress): Either[HttpResult,Selection].
- SelectionHelpers#selectionRequest(documentSetId, httpRequest): Either[HttpResult,SelectionRequest] creates a SelectionRequest which identifies what the user wants. (If the HTTP request specifies an invalid selection, it fails fast.)
- SelectionHelpers#selectFast(selectionRequest, userEmail, maybeSelectionId, refresh) calls SelectionBackend.find(documentSetId, selectionId): Future[Option[Selection]] if the the user's HTTP request included a selectionId option. (selectionId guarantees that the search engine won't run. That makes an HTTP query with a selectionId "fast".) This queries Redis and returns a Selection if the selectionId is valid or an HttpResult otherwise. If selectionId wasn't specified...:
- SelectionHelpers#selectSlow(params, onProgress) tests whether the selectionRequest has been searched recently by looking it in Redis based on its hash. (The user can disable this check by setting refresh=true.) This is SelectionBackend#findOrCreate(userEmail, selectionRequest, maybeSelectionId, onProgress). If Redis has a Selection, it returns it. Otherwise...:
- SelectionBackend#create(userEmail, selectionRequest, onProgress): Future[Selection] runs the search (and will cache the result when done):
  - DocumentSelectionBackend#indexSelectedIds(selectionRequest, onProgress): Future[Selection] does the heavy lifting:
    - DocumentSelectionBackend#indexByDB(selectionRequest): Future[DocumentIdSet] searches the database
    - DocumentSetLuceneIndex#searchForIds(query): Future[(List[SelectionWarning],DocumentIdSet)] searches the Lucene index
    - ViewFilterBackend#resolve(documentSetId, viewFilterSelection): Future[Either[SelectionWarning,DocumentIdSet]] searches ViewFilters (external plugin servers) by sending HTTP requests
    - Sorter#sortIds(documentIdList, onProgress): Future[DocumentIdList] helps sort the intersected results
    - DocumentSelectionBackend#runRegexFilters(documentIdList, query): Future[Selection] filters the results by regular expression

Off-the-shelf software doesn't address Overview's needs

We deliberated for years and finally abandoned ElasticSearch (and its obvious competitor, Postgres full-text search). They had restrictions we couldn't accept.

Don't get us wrong: Overview hasn't reinvented full-text search. It uses Lucene, which is the same backend ElasticSearch uses. (PostgreSQL uses the same algorithms, too.) Overview differs in where it stores its data and what it queries. The underlying algorithms are identical.

(We link to ElasticSearch documentation throughout this page because their documents are fantastic.)

1. We need an index per document set

ElasticSearch and PostgreSQL are both built to house One Big Database. All the documents go to the same place. That assumption simplifies their wonderful replication and distributed-search features. But the same assumption disqualifies them for Overview's data.

Off-the-shelf software won't work properly with one index per document set:

In ElasticSearch, each search index requires running processes. It's inefficient and will run up against operating-system limits such as maximum number of children or file descriptors per process.
In PostgreSQL, using an arbitrary number of indexes on a table makes INSERT arbitrarily slow (because each INSERT is checked against each index); using an arbitrary number of tables makes everything arbitrarily slow.

And so if Overview relies on ElasticSearch or PostgreSQL for search, it needs One Big Index that holds all documents from all document sets.

Here's why the One Big Index approach doesn't work for Overview.

1a. One Big Index has too many terms

Full-text search stores documents the way a textbook's index stores pages. It finds all the terms (words) in the document, and then writes an index: dog appears in documents 3, 5, and 8, for instance; and doll appears in documents 2, 3 and 6.

When you search for do*, the search engine walks through the index, collecting all terms that match do*. In this example, it finds dog and doll. This is called multi-term rewrite. The search becomes dog OR doll. The engine adds up both lists of document IDs and returns the result: documents 2, 3, 5, 6 and 8.

What happens if you index 100,000,000 terms? Well, a whole lot of terms start with do*. That makes queries awfully slow.

OCR often produces thousands of garbage terms per document. With the One Big Index approach, those garbage terms slow down every query -- even queries of completely-different document sets.

Overview solves this problem by creating one index per document set. When you search a document set in Overview, your search won't consider terms from other users' documents.

1b. One Big Index has too many fields

Overview supports user-specified "Fields". (They're sometimes called "Metadata fields.") Each field the user creates is indexed as text.

Search indexes aren't built to handle a crazy number of fields. In ElasticSearch, the default limit is 1,000. With One Big Index, every Field in every document set counts against this limit.

Overview solves this problem by creating one index per document set. When you add a field to a document set in Overview, other document sets aren't affected.

1c. One Big Index is too slow to reindex

When Overview gets a new search feature, we reindex all document sets. At the time of writing, reindexing takes hours. What happens while reindexing is happening?

ElasticSearch's index aliases let us index to new data structures in the background and then flip a switch when the task is done. But even this takes days of planning.

Overview's solution is neater: delete all search indexes. It works because each document set gets its own index -- and so it takes only a few seconds to reindex a document set.

Overview schedules the most-recently-searched DocumentSets' reindex jobs first. If a user searches a DocumentSet that does not yet have an index, Overview will tell the user to re-run the search in a few seconds. At worst, reindexing can force a currently-online user or two to wait a few seconds. On the flipside, system administrators (and overview-local users) needn't consider deployments or backups when Overview gains a new search feature.

1d. One Big Index shares term vectors across document sets (which may matter in the future)

Today, Overview's doesn't sort search results by relevance (how important each found document is, with respect to the search terms). And today, Overview doesn't expose a document set's term vectors (its counts of how often a term appears in the document set).

But these features seem like plausible directions Overview may take. One Big Index makes them impossible.

Regular-expression search is special

Regular-expression search works by scanning all the text of every document in a document set. ElasticSearch won't do this. (ElasticSearch's Regexp matches terms, not text.) PostgreSQL does, but it explicitly recommends against allowing end users to write regular expressions, since an erroneous (or malicious) search could take Overview offline.

Overview uses re2j so a single user can't disable the Overview web server.

A plausible future speedup would be to compile re2 as a PostgreSQL extension. This would speed up regex searches somewhat by sending less over the network; but it would make deployment more complex.

Sort needs progress reporting

We know: sorting document IDs within Postgres or Lucene could be faster than a custom sort algorithm, because they'd transmit less data over the network. It would transmit less data over the network. But the operation could still cost minutes. A sensible user would assume the website is broken. It's a non-option.

Is there truly no way to sort without progress reporting?

We haven't found one.

Here's the obvious alternate sort strategy: store a clustered B-tree index per document set, per sort field. This would slow down indexing and reindexing acceptably; and it could speed DocumentIdList sorting to the point that we wouldn't need to report progress. We'd want to keep our DocumentIdList cached in PostgreSQL to handle sub-second searches; but a cache miss would only cost a few seconds, not minutes.

But this introduces new problems:

Where to store the B-tree indexes? Maybe in PostgreSQL, but each index makes file lookup slower) (and we're discussing perhaps tens of thousands of indexes, across all document sets). Plus, immediately after import the visibility map wouldn't allow index-only scans, so PostgreSQL's implementation risks being too slow to avoid progress reports. (This isn't to say the concept of B-trees is bad: just that PostgreSQL's implementation might not suit Overview's use case.)
How do we create a new index? When the user adds a Field, the documents' metadata JSON might already contain values. We'd need to create a new index. That could take minutes. And if the user sorted by it, we'd need to report index-creation progress.

These are the exact problems we face with Lucene. So it seems wise to store the sort indexes and search indexes in the same place.

In conclusion: there may be architectures that are better for end-users. But any solution that allows users to sort requires progress reporting.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Search Engine Internals

Important Pieces

Full Specification

1. Parse the SelectionRequest

2. Check Redis for an existing `Selection`

3. Gather `DocumentIdSet`s

3a. Tags (and other little-used features)

3b. Lucene

3c. ViewFilters

4. Intersect all results

5. Sort the results

6. Filter by regular expression

7. Cache the result

Summary: The Call Chain

Off-the-shelf software doesn't address Overview's needs

1. We need an index per document set

1a. One Big Index has too many terms

1b. One Big Index has too many fields

1c. One Big Index is too slow to reindex

1d. One Big Index shares term vectors across document sets (which may matter in the future)

Regular-expression search is special

Sort needs progress reporting

Is there truly no way to sort without progress reporting?

Clone this wiki locally

Search Engine Internals

Important Pieces

Full Specification

1. Parse the SelectionRequest

2. Check Redis for an existing Selection

3. Gather DocumentIdSets

3a. Tags (and other little-used features)

3b. Lucene

3c. ViewFilters

4. Intersect all results

5. Sort the results

6. Filter by regular expression

7. Cache the result

Summary: The Call Chain

Off-the-shelf software doesn't address Overview's needs

1. We need an index per document set

1a. One Big Index has too many terms

1b. One Big Index has too many fields

1c. One Big Index is too slow to reindex

1d. One Big Index shares term vectors across document sets (which may matter in the future)

Regular-expression search is special

Sort needs progress reporting

Is there truly no way to sort without progress reporting?

Clone this wiki locally

2. Check Redis for an existing `Selection`

3. Gather `DocumentIdSet`s