[DOC] Term Query documentation does not warn of performance impact of case_insensitive searches #9028

dbwiddis · 2025-01-07T07:36:24Z

Tell us about your request. Provide a summary of the request.

The documentation for term query includes the case_insensitive parameter.

Due to the implementation details of this type of search, every alphabetic character in such a query doubles the complexity of the search, which can consume a large amount of heap memory and potentially crash nodes through high CPU usage and GC thrashing. Even a relatively short search term (about 16 characters) could consume nearly 8 GB of heap.
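
To make the doubling concrete, here is a rough sketch that only counts how many distinct strings a case-insensitive match has to accept. It assumes each cased character has exactly two forms, and it does not model Lucene's automaton or its actual memory use; `count_case_variants` is a hypothetical helper name, not an OpenSearch API.

```python
def count_case_variants(term: str) -> int:
    """Count the strings that equal `term` when case is ignored.

    Assumes every cased character contributes exactly two forms
    (lowercase and uppercase). Digits and punctuation contribute one.
    """
    # A character is treated as cased if its lower and upper forms differ.
    cased = sum(1 for ch in term if ch.lower() != ch.upper())
    return 2 ** cased


if __name__ == "__main__":
    # Each additional letter doubles the count.
    for length in (4, 8, 16):
        print(length, count_case_variants("a" * length))
```

For a 16-letter term this gives 65,536 variants, which illustrates why the cost grows so quickly with term length.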

This potential impact should be highlighted in the documentation as a warning. Additionally, there are preferred strategies for case-insensitive searches that should be presented as alternatives in the docs.
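
The issue does not say which alternative strategies are meant. One commonly used approach, shown here only as an assumption, is to lowercase values at index time with a normalizer on a keyword field, so that a plain term query can match without case_insensitive. The field name `title`, the normalizer name `lowercase_normalizer`, and both helper functions are hypothetical.

```python
import json


def lowercase_keyword_index(field: str) -> dict:
    """Build an index body whose keyword field is lowercased at index time."""
    return {
        "settings": {
            "analysis": {
                "normalizer": {
                    # Hypothetical name; any name works if the mapping matches.
                    "lowercase_normalizer": {
                        "type": "custom",
                        "filter": ["lowercase"],
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                field: {"type": "keyword", "normalizer": "lowercase_normalizer"}
            }
        },
    }


def term_query(field: str, value: str) -> dict:
    """Build a plain term query, lowercasing the value on the client side."""
    # Lowercasing here is a precaution so the sketch does not depend on
    # how the server normalizes the query term.
    return {"query": {"term": {field: {"value": value.lower()}}}}


if __name__ == "__main__":
    print(json.dumps(lowercase_keyword_index("title"), indent=2))
    print(json.dumps(term_query("title", "OpenSearch"), indent=2))
```

The two printed bodies would be sent when creating the index and when searching it. Because the stored terms are already lowercase, the query does not need to expand into case variants.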

I'm happy to write such docs but would hope to get some assistance from @msfroh in validating them technically.

Version: List the OpenSearch version to which this issue applies, e.g. 2.14, 2.12--2.14, or all.

all

msfroh · 2025-01-07T17:58:47Z

We may also want to treat this as a bug in OpenSearch.

Specifically, we should set a value for maxDeterminizedStates in the AutomatonQueries class, like here: https://github.com/opensearch-project/OpenSearch/blob/ad7ce4cd446523d52286530ccfbb1dabc2dc7f86/server/src/main/java/org/opensearch/common/lucene/search/AutomatonQueries.java#L88

Elsewhere, such as in RegexpQueryBuilder, we default max_determinized_states to 10000 (by referencing the constant defined in Lucene as Operations.DEFAULT_DETERMINIZE_WORK_LIMIT).

IMO, we should:

deprecate all of the existing caseInsensitive*Query methods in AutomatonQueries,
replace them with calls that specify maxDeterminizedStates,
add max_determinized_states as a query parameter to any query type that may generate an automaton query, and
make the default max_determinized_states value for those query types Integer.MAX_VALUE on the 2.x branch (for dangerous backward compatibility) and Operations.DEFAULT_DETERMINIZE_WORK_LIMIT (i.e. 10000) on the main branch so we're safe by default on 3.0.

That way, folks using 2.19 can at least safeguard themselves by explicitly setting max_determinized_states to something reasonable, while 3.0 is safe by default (but we let users risk shooting themselves in the foot if they explicitly ask to).
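
For comparison, the regexp query already accepts max_determinized_states as a request parameter. The sketch below builds such a request body with the 10000 default quoted above. The parameter on term queries is only proposed in this thread, so it is not shown. The helper name `regexp_query` and the field name `title` are hypothetical.

```python
import json

# Value quoted in this thread for Operations.DEFAULT_DETERMINIZE_WORK_LIMIT.
DEFAULT_DETERMINIZE_WORK_LIMIT = 10000


def regexp_query(field: str, pattern: str,
                 max_states: int = DEFAULT_DETERMINIZE_WORK_LIMIT) -> dict:
    """Build a regexp query body with an explicit automaton state limit."""
    return {
        "query": {
            "regexp": {
                field: {
                    "value": pattern,
                    "max_determinized_states": max_states,
                }
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(regexp_query("title", "open.*"), indent=2))
```

Passing a larger `max_states` explicitly corresponds to the opt-in risk described above, while the default keeps the limit bounded.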

msfroh · 2025-01-07T18:31:28Z

I'm going to turn my previous message into an issue for https://github.com/opensearch-project/OpenSearch

msfroh · 2025-01-07T18:59:03Z

opensearch-project/OpenSearch#16975

dbwiddis added the untriaged label Jan 7, 2025

This was referenced Jan 7, 2025

Support searching from doc_value using termQueryCaseInsensitive/termQuery in flat_object/keyword field opensearch-project/OpenSearch#16974

Open

[BUG] Can (easily) run out of memory with case-insensitive term queries opensearch-project/OpenSearch#16975

Open
