[BUG] Can (easily) run out of memory with case-insensitive term queries #16975

Open
msfroh opened this issue Jan 7, 2025 · 0 comments
Labels
bug Something isn't working Search:Resiliency

Describe the bug

This follows on from @dbwiddis's doc suggestion here and my reply in particular.

Basically, Lucene will try to determinize a finite automaton for a case-insensitive string query in a similar way to how it handles regular expressions. That is, a case-insensitive term query for abc behaves the same as a regexp query for [Aa][Bb][Cc]. That automaton has four states:

stateDiagram-v2
direction LR
[*] --> s1: [Aa]
s1 --> s2: [Bb]
s2 --> [*]: [Cc]
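
The equivalence between a case-insensitive term and a character-class regexp can be sketched as below. This is only an illustration: the class and method names are hypothetical, it handles ASCII letters only, and it does not reproduce how Lucene or AutomatonQueries actually build the automaton.

```java
public class CaseInsensitivePattern {
    // Hypothetical helper: expands each ASCII letter into a two-character class,
    // e.g. 'a' -> "[Aa]". Other characters are copied unchanged; regexp
    // metacharacters are NOT escaped in this sketch.
    public static String toCharClassRegexp(String term) {
        StringBuilder sb = new StringBuilder();
        for (char c : term.toCharArray()) {
            if (c < 128 && Character.isLetter(c)) {
                sb.append('[')
                  .append(Character.toUpperCase(c))
                  .append(Character.toLowerCase(c))
                  .append(']');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // A linear chain over n characters has n + 1 states:
    // the start state plus one state per character consumed.
    public static int chainStateCount(String term) {
        return term.length() + 1;
    }

    public static void main(String[] args) {
        System.out.println(toCharClassRegexp("abc")); // [Aa][Bb][Cc]
        System.out.println(chainStateCount("abc"));   // 4
    }
}
```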

The Javadoc on the determinize method says "Worst case complexity: exponential in number of states." It also describes the workLimit parameter as:

   *  Maximum amount of "work" that the powerset construction will spend before
   *     throwing {@link TooComplexToDeterminizeException}. Higher numbers allow this operation to
   *     consume more memory and CPU but allow more complex automatons. Use {@link
   *     #DEFAULT_DETERMINIZE_WORK_LIMIT} as a decent default if you don't otherwise know what to
   *     specify.
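
To make the idea of a work budget concrete, here is a minimal, self-contained sketch of a textbook powerset construction that gives up once a budget is exceeded. It is not Lucene's implementation: the class, the exception, the NFA encoding, and the way "work" is counted (one unit per NFA state visited per symbol) are all assumptions made for illustration. The example NFA is the classic "k-th symbol from the end is a" automaton, whose deterministic form has 2^(k+1) states.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class BudgetedDeterminize {
    // Stand-in for Lucene's TooComplexToDeterminizeException (hypothetical class).
    static class TooComplexException extends RuntimeException {
        TooComplexException(String msg) { super(msg); }
    }

    // nfa[s][c] lists the NFA states reachable from state s on symbol c
    // (no epsilon moves). Returns the number of reachable DFA states.
    public static int determinize(int[][][] nfa, int alphabetSize, int workLimit) {
        Map<Set<Integer>, Integer> seen = new HashMap<>();
        Deque<Set<Integer>> queue = new ArrayDeque<>();
        Set<Integer> start = new TreeSet<>(List.of(0));
        seen.put(start, 0);
        queue.add(start);
        long work = 0;
        while (!queue.isEmpty()) {
            Set<Integer> current = queue.poll();
            for (int c = 0; c < alphabetSize; c++) {
                Set<Integer> next = new TreeSet<>();
                for (int s : current) {
                    // Assumed accounting: one unit of work per NFA state visited.
                    if (++work > workLimit) {
                        throw new TooComplexException("work limit " + workLimit + " exceeded");
                    }
                    for (int t : nfa[s][c]) {
                        next.add(t);
                    }
                }
                if (!next.isEmpty() && !seen.containsKey(next)) {
                    seen.put(next, seen.size());
                    queue.add(next);
                }
            }
        }
        return seen.size();
    }

    // NFA over {a=0, b=1} accepting strings whose (k+1)-th symbol from the end is 'a'.
    public static int[][][] lookbackNfa(int k) {
        int[][][] nfa = new int[k + 2][2][];
        nfa[0][0] = new int[] {0, 1}; // on 'a': stay, or guess this is the marked 'a'
        nfa[0][1] = new int[] {0};    // on 'b': stay
        for (int s = 1; s <= k; s++) {
            nfa[s][0] = new int[] {s + 1};
            nfa[s][1] = new int[] {s + 1};
        }
        nfa[k + 1][0] = new int[0];   // final state has no outgoing transitions
        nfa[k + 1][1] = new int[0];
        return nfa;
    }

    public static void main(String[] args) {
        // Small input: finishes well inside the budget.
        System.out.println(determinize(lookbackNfa(1), 2, 10_000));
        // Larger input: the same budget is exhausted and the call fails fast.
        try {
            determinize(lookbackNfa(12), 2, 10_000);
        } catch (TooComplexException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With a budget of Integer.MAX_VALUE the second call would run to completion and materialize every subset state, which is the kind of unbounded growth the limit exists to prevent.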

Unfortunately, all of our case-insensitive queries (which all go through the AutomatonQueries helper class) call MinimizationOperations.minimize (which calls determinize) with a limit of Integer.MAX_VALUE. That effectively disables the work limit, so memory usage can blow up if someone runs a case-insensitive query over a long string.

Related component

Search:Resiliency

To Reproduce

Submit a case-insensitive term query over a large (100+ character) string. You'll probably run out of memory.

Expected behavior

Like with regexp queries, we should have a max_determinized_states parameter for any query type that supports case-insensitive queries. For backward compatibility on the 2.x branch, we can keep a default value of Integer.MAX_VALUE, while 3.0 can switch to the Lucene default of 10000 (overridable by users).
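
The proposed defaulting could look roughly like the sketch below. The class and method are hypothetical and do not exist in OpenSearch; the only values taken from this issue are the 2.x default of Integer.MAX_VALUE, the 3.0 default of 10000, and the rule that a user-supplied value wins.

```java
public class DeterminizeLimitDefaults {
    // Lucene default cited in this issue.
    public static final int LUCENE_DEFAULT_WORK_LIMIT = 10_000;

    // userValue == null means the request did not set the proposed
    // max_determinized_states parameter. majorVersion is the major
    // version of the running node (an assumption of this sketch).
    public static int resolve(Integer userValue, int majorVersion) {
        if (userValue != null) {
            return userValue; // explicit user override always wins
        }
        // 2.x keeps the old unbounded behaviour; 3.0+ adopts the Lucene default.
        return majorVersion >= 3 ? LUCENE_DEFAULT_WORK_LIMIT : Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, 2));   // 2147483647
        System.out.println(resolve(null, 3));   // 10000
        System.out.println(resolve(50_000, 3)); // 50000
    }
}
```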

Additional Details

N/A
