4942 Fix document_number filtering for docket numbers with repeated values #5018
+238
−27
@mlissner your observation in #4942 (comment) was correct!
The issue was indeed caused by duplicated terms in a docket number and the search analyzer used. When using the default `search_analyzer`, a docket number like "1:22-cr-00001" is tokenized without its repeated terms.

This behavior is due to the `remove_duplicates` filter, which removes duplicate tokens. As a result, in this case, the second occurrence of "1" is removed. Similarly, for "3:25-cv-00025", the second occurrence of "25" is removed.
This explains why a filter query like `docket_number:3:25-cv-00025` only matched up to "cv".

Solution:
The solution is to use the `search_analyzer_exact` for the `docket_number` filter. Since `docket_number` queries use a `match_phrase` query in `build_term_query`, they already search for exact matches. This method is also used for other filters that require exact matches, such as integers like `court_id` and keyword fields, which are not modified at indexing time. That's why this issue affected only `docket_number`, which undergoes transformation during indexing, and not other filters.

The second part of the solution is to use the `docketNumber.exact` field for indexing, ensuring it aligns with the `search_analyzer_exact`
query terms.

- This issue didn't affect `q=3:25-cv-00025`, because we detect the docket number in the query and convert it to a fielded `docketNumber` query wrapped within quotes, `docketNumber:"3:25-cv-00025"`, which looks for an exact match.
- It did affect `q=docketNumber:3:25-cv-00025` if the user didn't wrap the docket number within quotes. So I applied a fix for this case by tweaking `cleanup_main_query`.
- I also updated the `docketNumber` field definition in the parenthetical index because it previously didn't support an `exact` version of the field. However, this change won't take effect in production until we recreate the parenthetical index. Before we do that, we may want to review the entire parenthetical approach, since it follows an older design and many of its fields might require changes.
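To make the unquoted `q=docketNumber:3:25-cv-00025` case more concrete, here is a minimal sketch of how an unquoted fielded docket number could be wrapped in quotes before the query is sent. This is not the actual `cleanup_main_query` code: the regular expression and the `quote_docket_number` helper are hypothetical and cover only the simple case of a value without spaces.

```python
import re

# Hypothetical pattern: the docketNumber field prefix followed by a value
# that is not already quoted and contains no spaces or parentheses.
UNQUOTED_DOCKET = re.compile(r'\bdocketNumber:([^\s()"]+)')


def quote_docket_number(query: str) -> str:
    """Wrap an unquoted docketNumber value in double quotes so that it is
    searched as an exact phrase. Already quoted values are left untouched."""
    return UNQUOTED_DOCKET.sub(r'docketNumber:"\1"', query)


print(quote_docket_number("docketNumber:3:25-cv-00025"))
# docketNumber:"3:25-cv-00025"
print(quote_docket_number('docketNumber:"3:25-cv-00025"'))
# docketNumber:"3:25-cv-00025"
```

A real implementation would also need to handle the other syntax the query parser accepts, which this sketch does not attempt.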