Basically, Lucene will try to determinize a finite automaton for a case-insensitive string query in a similar way to how it handles regular expressions. That is, a case-insensitive term query for abc behaves the same as a regexp query for [Aa][Bb][Cc]. That automaton has four states:
```mermaid
stateDiagram-v2
    direction LR
    [*] --> s1: [Aa]
    s1 --> s2: [Bb]
    s2 --> [*]: [Cc]
```
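To make the equivalence concrete, here is a minimal sketch of how a term could be expanded into the regexp form described above. `CaseInsensitivePattern` and its methods are hypothetical names used only for illustration; this is not how Lucene or `AutomatonQueries` builds the automaton, and it assumes the term contains only letters and digits (no regexp metacharacters).

```java
// Illustrative sketch only; not the Lucene or OpenSearch implementation.
final class CaseInsensitivePattern {

    /**
     * Expands each cased character into a two-character class,
     * e.g. "abc" becomes "[Aa][Bb][Cc]". Characters without a case
     * distinction (digits, for example) are copied through unchanged.
     * Assumes the term has no regexp metacharacters.
     */
    static String toRegex(String term) {
        StringBuilder sb = new StringBuilder();
        term.codePoints().forEach(cp -> {
            int upper = Character.toUpperCase(cp);
            int lower = Character.toLowerCase(cp);
            if (upper == lower) {
                sb.appendCodePoint(cp);
            } else {
                sb.append('[').appendCodePoint(upper).appendCodePoint(lower).append(']');
            }
        });
        return sb.toString();
    }

    /** A linear automaton over an n-character term has n + 1 states (start through accept). */
    static int stateCount(String term) {
        return term.codePointCount(0, term.length()) + 1;
    }
}
```

For `abc`, `toRegex` yields `[Aa][Bb][Cc]` and `stateCount` yields 4, matching the diagram.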
The Javadoc on the determinize method says "Worst case complexity: exponential in number of states." It also describes the workLimit parameter as:
* Maximum amount of "work" that the powerset construction will spend before
* throwing {@link TooComplexToDeterminizeException}. Higher numbers allow this operation to
* consume more memory and CPU but allow more complex automatons. Use {@link
* #DEFAULT_DETERMINIZE_WORK_LIMIT} as a decent default if you don't otherwise know what to
* specify.
Unfortunately, all of our case-insensitive queries (which all go through the AutomatonQueries helper class) call MinimizationOperations.minimize (which calls determinize) with a limit of Integer.MAX_VALUE. That effectively disables the work limit, so memory usage can blow up if someone runs a case-insensitive query over a large string.
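To illustrate why the limit matters, the following is a toy powerset (subset) construction with a work budget. It is a sketch under stated assumptions, not Lucene's `determinize`: `BoundedDeterminizer`, `TooComplexException`, and `nthFromLast` are hypothetical names, and charging one unit of work per NFA state visited is a simplification chosen for this example. The sample NFA accepts strings over {a, b} whose n-th character from the end is `a`, a standard case where n + 1 NFA states become 2^n DFA states.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only; not Lucene's determinize implementation.
final class BoundedDeterminizer {

    /** Hypothetical stand-in for Lucene's TooComplexToDeterminizeException. */
    static final class TooComplexException extends RuntimeException {
        TooComplexException(String message) {
            super(message);
        }
    }

    /**
     * nfa.get(s) maps an input character to the set of successor states of s.
     * State 0 is the start state. Returns the number of reachable DFA states,
     * or throws once more than workLimit units of work have been spent.
     */
    static int determinize(List<Map<Character, Set<Integer>>> nfa, int workLimit) {
        Set<Set<Integer>> seen = new HashSet<>();
        Deque<Set<Integer>> queue = new ArrayDeque<>();
        Set<Integer> start = Set.of(0);
        seen.add(start);
        queue.add(start);
        int work = 0;
        while (!queue.isEmpty()) {
            Set<Integer> current = queue.remove();
            Map<Character, Set<Integer>> moves = new HashMap<>();
            for (int state : current) {
                // Assumption: one unit of work per NFA state visited.
                if (++work > workLimit) {
                    throw new TooComplexException("work limit " + workLimit + " exceeded");
                }
                for (Map.Entry<Character, Set<Integer>> e : nfa.get(state).entrySet()) {
                    moves.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
                }
            }
            for (Set<Integer> target : moves.values()) {
                if (seen.add(target)) {
                    queue.add(target);
                }
            }
        }
        return seen.size();
    }

    /** NFA for "the n-th character from the end is 'a'" over {a, b}; requires n >= 1. */
    static List<Map<Character, Set<Integer>>> nthFromLast(int n) {
        List<Map<Character, Set<Integer>>> nfa = new ArrayList<>();
        for (int s = 0; s <= n; s++) {
            nfa.add(new HashMap<>());
        }
        nfa.get(0).put('a', Set.of(0, 1)); // stay in 0, or guess this 'a' is n-th from the end
        nfa.get(0).put('b', Set.of(0));
        for (int s = 1; s < n; s++) {
            nfa.get(s).put('a', Set.of(s + 1));
            nfa.get(s).put('b', Set.of(s + 1));
        }
        return nfa;
    }
}
```

With a generous budget, `determinize(nthFromLast(3), 10_000)` returns 8 DFA states; with a small budget such as 100, `nthFromLast(10)` fails fast instead of materializing all 1024 subsets. Passing `Integer.MAX_VALUE` as the budget removes that safety net.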
Related component
Search:Resiliency
To Reproduce
Submit a case-insensitive term query over a large (100+ character) string. You'll probably run out of memory.
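As a rough aid for reproducing this, the sketch below builds a request body for a case-insensitive term query with a long value. `ReproRequest` is a hypothetical helper, the field name passed in is a placeholder for a keyword field in your own index, and whether the node actually runs out of memory will depend on heap size and the value used.

```java
// Illustrative sketch only; builds the JSON body as a plain string.
final class ReproRequest {

    /**
     * Returns a search body containing a term query with case_insensitive set to true.
     * The value is valueLength repetitions of 'a'. The field name is assumed to be
     * a simple identifier that needs no JSON escaping.
     */
    static String termQueryBody(String field, int valueLength) {
        String value = "a".repeat(valueLength);
        return "{\"query\":{\"term\":{\"" + field + "\":{"
            + "\"value\":\"" + value + "\","
            + "\"case_insensitive\":true}}}}";
    }
}
```

For example, `ReproRequest.termQueryBody("my_field", 150)` produces a body that can be sent to the `_search` endpoint of a test index.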
Expected behavior
Like with regexp queries, we should have a max_determinized_states parameter for any query type that supports case-insensitive matching. For backward compatibility on the 2.x branch, we can keep a default value of Integer.MAX_VALUE, while 3.0 can switch to the Lucene default of 10000 (overridable by users).
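One possible shape for the proposed defaulting logic is sketched below. `DeterminizeLimits` and `resolve` are hypothetical names, and the version check is an assumption about how the 2.x and 3.0 behaviors might be distinguished; only the two default values come from the proposal above.

```java
// Illustrative sketch only; names and structure are assumptions.
final class DeterminizeLimits {

    /** Default quoted in this issue for 3.0 and later. */
    static final int LUCENE_DEFAULT = 10_000;

    /**
     * Picks the limit for a case-insensitive query.
     * An explicit user value always wins; otherwise 2.x keeps the
     * unbounded behavior and 3.0+ falls back to the Lucene default.
     */
    static int resolve(Integer requested, int majorVersion) {
        if (requested != null) {
            if (requested <= 0) {
                throw new IllegalArgumentException("max_determinized_states must be positive");
            }
            return requested;
        }
        return majorVersion >= 3 ? LUCENE_DEFAULT : Integer.MAX_VALUE;
    }
}
```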
Additional Details
N/A
Describe the bug
This follows on from @dbwiddis's doc suggestion here and my reply in particular.