Search Engine Internals
Overview is built to search documents. The feature list keeps growing; at the time of writing, we support:
- Tag search
- ViewFilter search (i.e., search-by-plugin)
- Full-text search with phrase searches, fuzzy searches and boolean operations, metadata search, notes search and search of document text and title
- Regular-expression search of the document's full text
- Sorting of results by metadata, in addition to the default sort (document title)
"Search" means transforming an HTTP request into a SelectionRequest
, and a `SelectionRequest` into a `Selection`. We'll define these soon.
This document explains how Overview's search works in broad strokes; then it explains why we couldn't find a pre-existing solution for these problems.
Many callers rely on Search. For instance:
- DocumentListController#index: The document list in the front-end shows search results. (For now, this is the only use of search that reports progress. See the Sorting section for more about progress reporting.)
- TagDocumentsController: We count, add and remove Tags on the document list using search results.
- DocumentSetArchiveController: The user can export a selection of documents.
- api.DocumentController#index: API users can send HTTP requests to /api/v1/document-sets/:documentSetId/documents
Broadly, the division of responsibilities is:
- Web server:
  - `models.SelectionRequest`: the user's query, as sent in an HTTP request
  - `models.DocumentIdList`: an Array of sorted, 32-bit IDs that is somewhat expensive to transmit (~0.5s for 10M documents)
  - `models.DocumentIdSet`: a set of document IDs, unordered, that can be transmitted cheaply. (See Document Lists and Where They're Stored for more details on `DocumentIdList` and `DocumentIdSet`.)
  - `models.Selection`: a handle to a `DocumentIdList` and a list of `models.SelectionWarning`s, stored in Redis
  - `controllers.SelectionHelpers`: convenience methods for controllers
  - `backend.SelectionBackend`: stores and retrieves `Selection`s from Redis
  - `backend.DocumentSelectionBackend`: transforms a `SelectionRequest` into an `InMemorySelection` (which can be written to Redis)
- Worker(s):
  - `searchindex.DocumentSetLuceneIndex`: queries a Lucene index of the document set (which it also maintains)
  - `sort.Sorter`: runs an external merge sort, with progress reporting
  - `jobhandler.documentset.SortRunner`: updates `document_set.document_ids_sorted` and `document_id_list` rows in the database by invoking `Sorter`
- Both:
  - `query.Query`: the parsed representation of the user's search string. It is the most complex component of a `SelectionRequest`.
- Front-end:
  - `Show.models.DocumentListParams`: the client-side analogue to `SelectionRequest`, which plugins may modify
  - `Show.models.DocumentList`: communicates with the server to execute the search
  - `Show.views.DocumentListTitle`: displays the progress of the search
  - `Show.views.DocumentList`: displays results
"Search" means:
- Parse the `SelectionRequest` from the HTTP request
- Check Redis for an existing `Selection` (result), and return it if it exists
- Branch out to individual services to find `DocumentIdSet`s:
  - Query Postgres with tags
  - Query Lucene with full-text
  - Query each ViewFilter
- Intersect all results to find the definitive `DocumentIdSet`
- Sort the `DocumentIdSet`
- Filter by regular expression
- Cache the resulting `Selection` in Redis
First, understand a distinction:
- A `SelectionRequest` is a user's request: for instance, `(tags A and B) AND (full-text query "foo")`. It is calculated synchronously (without communicating with any other service). This is the input of the search engine.
- A `Selection` is the search result. This is the output of the search engine. It includes:
  - a `DocumentIdList`
  - a list of `SelectionWarning`s from various search engines
  - an ID (a [UUID](https://en.wikipedia.org/wiki/Universally_unique_identifier))
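For orientation, here is a minimal sketch of what those two shapes might look like. The field names are simplified and hypothetical; the real classes live in the `models` package and carry more detail (node IDs, StoreObject IDs, sort options, ...).

```scala
import java.util.UUID

// Simplified, illustrative shapes -- not the actual Overview definitions.
case class SelectionRequest(
  documentSetId: Long,
  tagIds: Seq[Long],                      // e.g. "tags A and B"
  q: String,                              // the full-text query string, e.g. "foo"
  sortByMetadataField: Option[String]     // None means the default sort: document title
)

sealed trait SelectionWarning              // e.g. "too many terms", "plugin didn't respond"

case class Selection(
  id: UUID,                                // the handle stored in Redis
  warnings: List[SelectionWarning],
  documentIds: Vector[Int]                 // the DocumentIdList: sorted, 32-bit IDs
)
```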
A `SelectionRequest` is the search engine's representation of the user's HTTP request. It is completely independent from the database and search index.

A `Selection` is specific to the user's data at the exact time a search happened. If you search once, then add or edit documents, then search again, the second search's `Selection` may differ from the first search's `Selection` -- even with the same `SelectionRequest`.
We parse the `SelectionRequest` from HTTP `GET` query parameters or `POST` form data. The most complex part is the Query parser, which generates the `Query` we send to Lucene. The `Query` is a tree structure of arbitrary depth (to handle `(A OR (B AND C))`-type queries).
The definitive list of valid search queries is in QueryParserSpec.scala.
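For intuition, a query tree of that shape can be modelled as a small algebraic data type. The node names below are illustrative only; the real `query.Query` and its parser support more node types.

```scala
// A hypothetical, cut-down model of the Query tree.
sealed trait Query
case class And(children: Seq[Query]) extends Query
case class Or(children: Seq[Query]) extends Query
case class Not(child: Query) extends Query
case class Phrase(field: String, text: String) extends Query
case class Fuzzy(field: String, term: String, maxEdits: Int) extends Query
case class Regex(pattern: String) extends Query

// "(A OR (B AND C))" parses into a tree of arbitrary depth:
val example: Query =
  Or(Seq(
    Phrase("text", "A"),
    And(Seq(Phrase("text", "B"), Phrase("text", "C")))
  ))
```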
If query-parsing fails, we return that error to the user. Otherwise ... now we have a `SelectionRequest`!
Future directions: currently, the `Query` is one part of the `SelectionRequest`. Originally, `Query` was exactly the part which was sent to the search index. Now, it also includes regular expressions. Someday, it may encompass all aspects of the search.
We have two different code paths that query Redis:
- `SelectionHelpers.selectFast()` applies if the user supplies a `selectionId`. If `selectionId` is set, `selectFast()` queries Redis for the Selection and returns the result: either "not found" or the `Selection`.
- `SelectionHelpers.selectSlow()` applies as long as the user does not pass `?refresh=true`. It hashes the `SelectionRequest` and asks Redis for a `selectionId` matching the hash. If the `selectionId` exists, the Selection is valid and gets returned. Otherwise, we run the main query engine.
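In rough pseudocode (the trait and method signatures below are invented for illustration, reusing the simplified `Selection`/`SelectionRequest` shapes sketched earlier), the two paths look like this:

```scala
import java.util.UUID
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical Redis-backed interface; the real SelectionBackend has different signatures.
trait SelectionStore {
  def find(documentSetId: Long, selectionId: UUID): Future[Option[Selection]]
  def findByHash(requestHash: Int): Future[Option[Selection]]
  def create(request: SelectionRequest): Future[Selection] // runs the main query engine
}

// selectFast: only valid when the client sent a selectionId; never runs the search.
def selectFast(store: SelectionStore, documentSetId: Long, selectionId: UUID): Future[Option[Selection]] =
  store.find(documentSetId, selectionId)

// selectSlow: hash the SelectionRequest, look for a cached Selection, fall back to a real search.
def selectSlow(store: SelectionStore, request: SelectionRequest, refresh: Boolean)
              (implicit ec: ExecutionContext): Future[Selection] =
  if (refresh) store.create(request)                  // ?refresh=true bypasses the cache
  else store.findByHash(request.hashCode).flatMap {
    case Some(selection) => Future.successful(selection)
    case None            => store.create(request)     // cache miss: run the query engine
  }
```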
Overview's search uses set logic: `AND` and `OR`. `DocumentIdSet` stores these in bitsets. (When you browse the code you'll see references to `DocumentIdFilter`. Conceptually, this is equivalent to a `DocumentIdSet`; `DocumentIdFilter` simply optimizes the `Empty` and `All` special cases.)
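For intuition only, here is how AND/OR reduce to bitwise operations on bitsets, and what the special cases buy. The class names (other than the operators) are illustrative, not Overview's:

```scala
import scala.collection.immutable.BitSet

// Two hypothetical per-service results within one document set:
val taggedWithA   = BitSet(2, 3, 5, 8)   // documents matching a Tag
val matchingQuery = BitSet(3, 5, 6)      // documents matching the full-text query

val and = taggedWithA & matchingQuery    // BitSet(3, 5)
val or  = taggedWithA | matchingQuery    // BitSet(2, 3, 5, 6, 8)

// DocumentIdFilter-style special cases avoid materializing enormous bitsets:
sealed trait IdFilter
case object All   extends IdFilter               // "no restriction": intersecting with it is a no-op
case object Empty extends IdFilter               // short-circuits every intersection to empty
case class Ids(bits: BitSet) extends IdFilter    // the general case
```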
DocumentSelectionBackend queries all services in parallel. Each service can return a `DocumentIdFilter` and a list of `SelectionWarning`s.
We store document tags, Tree nodes and StoreObjects in PostgreSQL. One query finds the document IDs that match the SelectionRequest.
Most full-text search is offloaded to the document set's dedicated Lucene index, DocumentSetLuceneIndex. Its searchForIds() converts the `Query` to Lucene syntax (producing `SelectionWarning`s if there are too many terms in the query, if queried metadata fields aren't indexed, or if the index hasn't been built yet) and builds the `DocumentIdSet` result.
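A hedged sketch of what that conversion can look like, reusing the cut-down `Query` ADT from earlier. The real searchForIds() also handles phrases, metadata fields, term-count limits and the warnings described above:

```scala
import org.apache.lucene.index.Term
import org.apache.lucene.search.{
  BooleanClause, BooleanQuery, FuzzyQuery, MatchAllDocsQuery, TermQuery, Query => LuceneQuery
}

def toLucene(q: Query): LuceneQuery = q match {
  case Phrase(field, text) =>
    new TermQuery(new Term(field, text)) // simplified: real phrases become Lucene PhraseQuery
  case Fuzzy(field, term, maxEdits) =>
    new FuzzyQuery(new Term(field, term), maxEdits)
  case And(children) =>
    val b = new BooleanQuery.Builder()
    children.foreach(c => b.add(toLucene(c), BooleanClause.Occur.MUST))
    b.build()
  case Or(children) =>
    val b = new BooleanQuery.Builder()
    children.foreach(c => b.add(toLucene(c), BooleanClause.Occur.SHOULD))
    b.build()
  case Not(child) =>
    new BooleanQuery.Builder()
      .add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST)
      .add(toLucene(child), BooleanClause.Occur.MUST_NOT)
      .build()
  case Regex(_) =>
    new MatchAllDocsQuery() // regex nodes are deferred -- see the regular-expression section below
}
```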
The Lucene index is run on the worker machine(s). The `DocumentSelectionBackend` requests the search over Akka.
Plugins may create their own index structures, and `DocumentSelectionBackend` will search them in parallel over HTTP. Each request can produce a warning if the plugin server doesn't respond or produces an invalid response.
`DocumentSelectionBackend`'s final idsFilter is simply the intersection of all `DocumentIdSet`s.
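In other words (a sketch, with `BitSet` standing in for `DocumentIdSet` and warnings omitted):

```scala
import scala.collection.immutable.BitSet
import scala.concurrent.{ExecutionContext, Future}

// Fire every service at once, then intersect whatever comes back.
def selectedIds(
  byDatabase: Future[BitSet],           // tags, Tree nodes, StoreObjects (Postgres)
  byLucene: Future[BitSet],             // full-text (worker, over Akka)
  byViewFilters: Seq[Future[BitSet]]    // one HTTP request per plugin ViewFilter
)(implicit ec: ExecutionContext): Future[BitSet] =
  Future.sequence(Seq(byDatabase, byLucene) ++ byViewFilters)
    .map(_.reduce(_ & _))               // idsFilter = intersection of all DocumentIdSets
```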
Let's face reality: sorting is slow.
Sorting 10M IDs takes ~1s on a 3.5GHz machine. Sorting documents by anything other than ID takes orders of magnitude longer. (Overview scales to 10M-document sets.)
Sorting can take minutes. If we sort by title, for instance, we need to read every title from the database, build 10M collation keys (say ~100 bytes each), and then sort the resulting 1GB of text. If we sort by a metadata field, it's slower still: we need to parse all the metadata JSON, which takes time; and we're forced to sort on disk, because there may be 100GB of collation keys to sort.
This is a perfectly-defined problem. Sorting algorithms have existed since the dawn of computing. PostgreSQL, Lucene and virtually any other database will run an external merge sort, which grows more-than-linearly with the number of documents and may take minutes. The problem isn't how sorting should happen: the problem is when sorting should happen. We can't make every search take minutes.
Let's add restrictions to simplify the problem:
- Overview sorts by title by default, and the user's other sort options are just metadata fields. That means there are few ways to sort a document list: few enough to cache each ordering.
- Users can update metadata at any time, changing the sort order, and then search immediately afterwards. That means we need user-friendly feedback when we get a cache miss.
Overview's solution:
Cache sort results as DocumentIdLists in Postgres: `document_set.document_ids_sorted` (sorted by title) and the dedicated `document_id_lists` table (one row per metadata field). Each `DocumentIdList` in Postgres is a cache of the sort order of all documents in the document set. Overview invalidates these caches when the user modifies a document.
Sort using DocumentIdLists. Overview's sort algorithm is dead simple and takes milliseconds: `SELECT` the `DocumentIdList` matching the user's chosen sort field, and filter it for only the documents in the `idsFilter`. When a cache value exists, the bottleneck is the `SELECT`: it transfers a `DocumentIdList` over the network, which can be 40MB. This is an acceptable delay.
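The "sort" on the fast path is therefore just a filter, roughly like this sketch (with an `Array[Int]` standing in for the cached `DocumentIdList` and a `BitSet` for the `idsFilter`):

```scala
import scala.collection.immutable.BitSet

// The cached list is already in the user's chosen order; keep only the IDs the
// search selected, preserving that order.
def sortByCachedOrder(cachedSortedIds: Array[Int], idsFilter: BitSet): Array[Int] =
  cachedSortedIds.filter(idsFilter.contains)
```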
Give feedback when there is no cached value. Postgres and Lucene don't report progress. That means they don't give an estimate of when the sort will finish. Overview solves this by implementing its own sort engine that streams all Documents in the DocumentSet and produces a `DocumentIdList` that SortRunner writes to the database. The sort is a standard external merge sort: it reads documents and produces sort keys, sorting 200MB chunks at a time and flushing them to disk; then it merges the sorted pages, 16 at a time, into larger and larger files until at most 16 files remain; finally, it extracts document IDs while reading the last files. During these operations, `SortRunner` sends progress reports back to the `DocumentSelectionBackend`.
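To make the merge phase concrete, here is a sketch of a k-way merge over already-sorted runs, with in-memory iterators standing in for the temporary files the real `Sorter` spills to disk (progress reporting and the 200MB chunking are omitted):

```scala
import scala.collection.mutable

// Merge up to 16 sorted runs into one sorted stream, smallest element first.
def mergeSortedRuns[A](runs: Seq[Iterator[A]])(implicit ord: Ordering[A]): Iterator[A] =
  new Iterator[A] {
    // PriorityQueue is a max-heap, so reverse the ordering to pop the smallest head first.
    private val heap = mutable.PriorityQueue.empty[(A, Iterator[A])](
      Ordering.by[(A, Iterator[A]), A](_._1).reverse
    )
    runs.filter(_.hasNext).foreach(run => heap.enqueue((run.next(), run)))

    def hasNext: Boolean = heap.nonEmpty
    def next(): A = {
      val (value, run) = heap.dequeue()
      if (run.hasNext) heap.enqueue((run.next(), run)) // refill from the run we just consumed
      value
    }
  }

// mergeSortedRuns(Seq(Iterator(1, 4, 9), Iterator(2, 3, 8), Iterator(5, 7))).toList
//   => List(1, 2, 3, 4, 5, 7, 8, 9)
```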
Progress reporting is a cross-cutting concern. It's more complex than the external sort:
- The `SortRunner` sends messages (via Akka) to `DocumentSelectionBackend`: several `Sorting` messages and then a `SortDone` message to indicate the `DocumentIdList` has been written to the database. (A `DocumentIdList` is too large to transfer via Akka.)
- `DocumentSelectionBackend`'s getSortedIds() accepts these `Sorting` messages and calls an `onCallback` function argument.
- The `onCallback` argument is defined by DocumentListController, which fashions a special HTTP response (for non-API users). The response looks like: `[ { progress: 0.25 }, { progress: 0.5 }, { progress: 0.7 }, <the actual result> ]`. The response is sent `chunked` (as progress reports come in).
- The Overview web interface streams the JSON response and displays the progress.
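To make the response shape concrete, here is a hedged sketch of how those chunks could be assembled; the actual DocumentListController streams them through Play, and the helper below is purely illustrative:

```scala
// Each chunk is flushed as soon as it is known, so the browser can update its
// progress bar while the sort is still running.
def progressChunk(fraction: Double): String = s"""{ "progress": $fraction },"""

def responseChunks(progressFractions: Seq[Double], resultJson: String): Seq[String] =
  Seq("[") ++ progressFractions.map(progressChunk) ++ Seq(resultJson, "]")

// responseChunks(Seq(0.25, 0.5, 0.7), """{ "documents": [] }""") produces the chunks:
//   "[", "{ "progress": 0.25 },", "{ "progress": 0.5 },", "{ "progress": 0.7 },",
//   "{ "documents": [] }", "]"
```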
After all that wiring, we have extremely fast sorting in the common case and a user-friendly progress bar when fast sort isn't possible.
Overview's regular-expression searches operate on document text. There's no index for that: it's orders of magnitude slower than other types of search.
Currently, Overview has a failsafe to make sure queries return quickly enough: it limits the number of documents being searched by regex. The default limit is 2,000.
Future directions: an alternate approach would be to show a progress bar and allow users to cancel regex searches. But cancellation is another cross-cutting concern; the limit is simpler.
Remember the original `Query` is a tree of search instructions. Regex searches are part of that tree. But we don't process them with Lucene because Lucene doesn't hold the document text. And we don't run them in parallel with all the other searches, because we want the user to filter a smaller-than-2,000-documents subset of the document set.
Overview's solution: defer regular-expression search as long as possible.
First, a slight revision to the Lucene logic above: `DocumentSetLuceneIndex` simply ignores `Regex` query nodes. It rewrites them as "all documents."
After sorting, `DocumentSelectionBackend#queryToRegexSearchRules()` extracts the regex nodes. It generates warnings for `Regex` nodes that are inside `OR(...)` or `NOT(AND(...))` nodes: Overview doesn't handle them. Regex search only works correctly when the `Query` tree looks like `AND(regexes, luceneCompatibleQuery)`, because those are the searches that let us defer regex queries.
With the regex nodes in hand, `DocumentSelectionBackend` filters the already-sorted `DocumentIdList` by streaming documents and applying all filters to all documents. If there are fewer than 2,000 documents before matching, it knows the search result is correct. Otherwise, filtering stops: it returns the set of matching documents and a warning that only 2,000 were tested.
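A sketch of that deferral, using re2j (which Overview uses for regex matching, as described below) and a hypothetical `loadText` function standing in for streaming document text from storage:

```scala
import com.google.re2j.Pattern

case class RegexFilterResult(matchingIds: Vector[Int], truncated: Boolean)

// Run the regexes last, against at most `maxDocuments` already-sorted candidates.
def runRegexFilters(
  sortedCandidateIds: Seq[Int],
  patterns: Seq[Pattern],
  loadText: Int => String,
  maxDocuments: Int = 2000
): RegexFilterResult = {
  val tested = sortedCandidateIds.take(maxDocuments)
  val matching = tested.filter { id =>
    val text = loadText(id)
    patterns.forall(_.matcher(text).find()) // every regex node must match (AND semantics)
  }
  // If we had to cut the candidate list short, the caller should emit a SelectionWarning.
  RegexFilterResult(matching.toVector, truncated = sortedCandidateIds.size > maxDocuments)
}
```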
Now we have a `Selection`: a list of `SelectionWarning`s and a `DocumentIdList`. `SelectionBackend` writes it to Redis in three separate keys:
- The value at `hashKey` (a hash of the SelectionRequest) becomes `selection.id` (a new, randomly-generated UUID)
- The value at `selectionId:document-ids` becomes the `DocumentIdList` (an array of big-endian integers up to 40MB large, stored as a String -- since Redis Strings are binary and GETRANGE lets us query substrings without ever sending the `DocumentIdList` over the wire again)
- The value at `selectionId:warnings` serializes all the `SelectionWarning`s. (The serialization format is non-standard and may change in the future.)
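The `selectionId:document-ids` encoding, and the byte arithmetic that GETRANGE paging relies on, look roughly like this (a sketch; the Redis calls themselves are omitted):

```scala
import java.nio.ByteBuffer

// Encode the DocumentIdList as one binary blob of big-endian 32-bit integers.
def encode(ids: Array[Int]): Array[Byte] = {
  val buf = ByteBuffer.allocate(ids.length * 4) // ByteBuffer is big-endian by default
  ids.foreach(buf.putInt)
  buf.array
}

def decode(bytes: Array[Byte]): Array[Int] = {
  val buf = ByteBuffer.wrap(bytes)
  Array.fill(bytes.length / 4)(buf.getInt)
}

// A page of `limit` IDs starting at `offset` is just a byte range of the Redis value,
// which GETRANGE can fetch without transferring the whole (up to 40MB) list.
def pageByteRange(offset: Int, limit: Int): (Int, Int) =
  (offset * 4, (offset + limit) * 4 - 1) // GETRANGE's end index is inclusive
```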
Now we have a `Selection`. It's a handle to a value in Redis that holds up to 40MB of document IDs. Helper methods let the user load a Page of IDs at a time: that can create a result page, or it can be part of a streaming operation.
To summarize, let's draw the call chain. Search means transforming an HTTP request into a `SelectionRequest`, and a `SelectionRequest` into a `Selection`.
The call chain looks like this (signatures and variable names have been simplified):
- The controller calls `SelectionHelpers#requestToSelection(documentSetId, userEmail, httpRequest, onProgress): Either[HttpResult,Selection]`.
  - `SelectionHelpers#selectionRequest(documentSetId, httpRequest): Either[HttpResult,SelectionRequest]` creates a `SelectionRequest` which identifies what the user wants. (If the HTTP request specifies an invalid selection, it fails fast.)
  - `SelectionHelpers#selectFast(selectionRequest, userEmail, maybeSelectionId, refresh)` calls `SelectionBackend.find(documentSetId, selectionId): Future[Option[Selection]]` if the user's HTTP request included a `selectionId` option. (`selectionId` guarantees that the search engine won't run. That makes an HTTP query with a `selectionId` "fast".) This queries Redis and returns a `Selection` if the `selectionId` is valid or an `HttpResult` otherwise. If `selectionId` wasn't specified...:
  - `SelectionHelpers#selectSlow(params, onProgress)` tests whether the `selectionRequest` has been searched recently by looking it up in Redis based on its hash. (The user can disable this check by setting `refresh=true`.) This is `SelectionBackend#findOrCreate(userEmail, selectionRequest, maybeSelectionId, onProgress)`. If Redis has a `Selection`, it returns it. Otherwise...:
  - `SelectionBackend#create(userEmail, selectionRequest, onProgress): Future[Selection]` runs the search (and will cache the result when done):
    - `DocumentSelectionBackend#indexSelectedIds(selectionRequest, onProgress): Future[Selection]` does the heavy lifting:
      - `DocumentSelectionBackend#indexByDB(selectionRequest): Future[DocumentIdSet]` searches the database
      - `DocumentSetLuceneIndex#searchForIds(query): Future[(List[SelectionWarning],DocumentIdSet)]` searches the Lucene index
      - `ViewFilterBackend#resolve(documentSetId, viewFilterSelection): Future[Either[SelectionWarning,DocumentIdSet]]` searches ViewFilters (external plugin servers) by sending HTTP requests
      - `Sorter#sortIds(documentIdList, onProgress): Future[DocumentIdList]` helps sort the intersected results
      - `DocumentSelectionBackend#runRegexFilters(documentIdList, query): Future[Selection]` filters the results by regular expression
We deliberated for years and finally abandoned ElasticSearch (and its obvious competitor, Postgres full-text search). They had restrictions we couldn't accept.
Don't get us wrong: Overview hasn't reinvented full-text search. It uses Lucene, which is the same backend ElasticSearch uses. (PostgreSQL uses the same algorithms, too.) Overview differs in where it stores its data and what it queries. The underlying algorithms are identical.
(We link to ElasticSearch documentation throughout this page because their documents are fantastic.)
ElasticSearch and PostgreSQL are both built to house One Big Database. All the documents go to the same place. That assumption simplifies their wonderful replication and distributed-search features. But the same assumption disqualifies them for Overview's data.
Off-the-shelf software won't work properly with one index per document set:
- In ElasticSearch, each search index requires running processes. It's inefficient and will run up against operating-system limits such as maximum number of children or file descriptors per process.
- In PostgreSQL, using an arbitrary number of indexes on a table makes `INSERT` arbitrarily slow (because each `INSERT` is checked against each index); using an arbitrary number of tables makes everything arbitrarily slow.
And so if Overview relies on ElasticSearch or PostgreSQL for search, it needs One Big Index that holds all documents from all document sets.
Here's why the One Big Index approach doesn't work for Overview.
Full-text search stores documents the way a textbook's index stores pages. It finds all the terms (words) in the document, and then writes an index: `dog` appears in documents 3, 5, and 8, for instance; and `doll` appears in documents 2, 3 and 6.

When you search for `do*`, the search engine walks through the index, collecting all terms that match `do*`. In this example, it finds `dog` and `doll`. This is called multi-term rewrite. The search becomes `dog OR doll`. The engine adds up both lists of document IDs and returns the result: documents 2, 3, 5, 6 and 8.
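The same example in code, as a toy inverted index (illustrative only; Lucene's real data structures are far more compact):

```scala
// term -> the documents that contain it
val invertedIndex: Map[String, Set[Int]] = Map(
  "dog"  -> Set(3, 5, 8),
  "doll" -> Set(2, 3, 6),
  "cat"  -> Set(1, 4)
)

// Multi-term rewrite: expand "do*" to every matching term, then OR their postings.
val expandedTerms = invertedIndex.keys.filter(_.startsWith("do")) // dog, doll
val result = expandedTerms.flatMap(invertedIndex).toSet           // Set(2, 3, 5, 6, 8)

// The query's cost grows with the number of terms that match the prefix --
// which is why millions of garbage OCR terms slow every do* query down.
```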
What happens if you index 100,000,000 terms? Well, a whole lot of terms match `do*`. That makes queries awfully slow.
OCR often produces thousands of garbage terms per document. With the One Big Index approach, those garbage terms slow down every query -- even queries of completely-different document sets.
Overview solves this problem by creating one index per document set. When you search a document set in Overview, your search won't consider terms from other users' documents.
Overview supports user-specified "Fields". (They're sometimes called "Metadata fields.") Each field the user creates is indexed as text.
Search indexes aren't built to handle a crazy number of fields. In ElasticSearch, the default limit is 1,000. With One Big Index, every Field in every document set counts against this limit.
Overview solves this problem by creating one index per document set. When you add a field to a document set in Overview, other document sets aren't affected.
When Overview gets a new search feature, we reindex all document sets. At the time of writing, reindexing takes hours. What happens while a reindex is in progress?
ElasticSearch's index aliases let us index to new data structures in the background and then flip a switch when the task is done. But even this takes days of planning.
Overview's solution is neater: delete all search indexes. It works because each document set gets its own index -- and so it takes only a few seconds to reindex a document set.
Overview schedules the most-recently-searched DocumentSets' reindex jobs first. If a user searches a DocumentSet that does not yet have an index, Overview will tell the user to re-run the search in a few seconds. At worst, reindexing can force a currently-online user or two to wait a few seconds. On the flip side, system administrators (and `overview-local` users) needn't consider deployments or backups when Overview gains a new search feature.
Today, Overview doesn't sort search results by relevance (how important each found document is, with respect to the search terms). And today, Overview doesn't expose a document set's term vectors (its counts of how often a term appears in the document set).
But these features seem like plausible directions Overview may take. One Big Index makes them impossible.
Regular-expression search works by scanning all the text of every document in a document set. ElasticSearch won't do this. (ElasticSearch's Regexp matches terms, not text.) PostgreSQL does, but it explicitly recommends against allowing end users to write regular expressions, since an erroneous (or malicious) search could take Overview offline.
Overview uses re2j so a single user can't disable the Overview web server.
A plausible future speedup would be to compile re2 as a PostgreSQL extension. This would speed up regex searches somewhat by sending less over the network; but it would make deployment more complex.
We know: sorting document IDs within Postgres or Lucene could be faster than a custom sort algorithm, because they'd transmit less data over the network. But the operation could still cost minutes. A sensible user would assume the website is broken. It's a non-option.
We haven't found one.
Here's the obvious alternate sort strategy: store a clustered B-tree index per document set, per sort field. This would slow down indexing and reindexing acceptably; and it could speed `DocumentIdList` sorting to the point that we wouldn't need to report progress. We'd want to keep our `DocumentIdList` cached in PostgreSQL to handle sub-second searches; but a cache miss would only cost a few seconds, not minutes.
But this introduces new problems:
- Where to store the B-tree indexes? Maybe in PostgreSQL, but each index makes file lookup slower (and we're discussing perhaps tens of thousands of indexes, across all document sets). Plus, immediately after import the visibility map wouldn't allow index-only scans, so PostgreSQL's implementation risks being too slow to avoid progress reports. (This isn't to say the concept of B-trees is bad: just that PostgreSQL's implementation might not suit Overview's use case.)
- How do we create a new index? When the user adds a Field, the documents' metadata JSON might already contain values. We'd need to create a new index. That could take minutes. And if the user sorted by it, we'd need to report index-creation progress.
These are the exact problems we face with Lucene. So it seems wise to store the sort indexes and search indexes in the same place.
In conclusion: there may be architectures that are better for end-users. But any solution that allows users to sort requires progress reporting.