4942 Fix document_number filtering for docket numbers with repeated values #5018

albertisfu · 2025-01-31T22:14:19Z

@mlissner your observation in #4942 (comment) was correct!

The issue was indeed caused by repeated terms in a docket number combined with the search analyzer in use. With the default search_analyzer, a docket number like "1:22-cr-00001" is tokenized as follows:

{
    "tokens": [
        {
            "token": "1:22-cr-00001",
            "start_offset": 0,
            "end_offset": 13,
            "type": "word",
            "position": 0
        },
        {
            "token": "1",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0
        },
        {
            "token": "22",
            "start_offset": 2,
            "end_offset": 4,
            "type": "word",
            "position": 1
        },
        {
            "token": "cr",
            "start_offset": 5,
            "end_offset": 7,
            "type": "word",
            "position": 2
        }
    ]
}

This behavior is due to the remove_duplicates filter, which removes duplicate tokens. As a result, in this case, the second occurrence of "1" is removed, so the final segment of the docket number is missing from the token list.

Similarly, for "3:25-cv-00025", the second occurrence of "25" is removed.

This explains why a filter query like docket_number:3:25-cv-00025 only matched up to "cv".
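The effect described above can be reproduced with a toy model in plain Python. This is only a sketch, not the Elasticsearch analyzer: the function name `toy_search_tokens` is hypothetical, and the model assumes that the docket number is split on punctuation, that leading zeros are stripped from numeric parts, and that any repeated term is dropped.

```python
import re


def toy_search_tokens(docket_number: str) -> list[str]:
    """Toy model of the tokenization described above (NOT the real analyzer).

    Assumptions: split on non-alphanumerics, strip leading zeros from
    numeric parts, and drop any term that has already been emitted.
    """
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", docket_number) if p]
    normalized = [
        (p.lstrip("0") or "0") if p.isdigit() else p.lower() for p in parts
    ]

    tokens = [docket_number]  # the original value is kept as its own token
    seen = set()
    for term in normalized:
        if term in seen:
            continue  # a repeated term is dropped, which loses the last segment
        seen.add(term)
        tokens.append(term)
    return tokens


print(toy_search_tokens("1:22-cr-00001"))
print(toy_search_tokens("3:25-cv-00025"))
```

In this model, both example docket numbers lose their final segment because its normalized value already appeared earlier in the number.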

Solution:
The solution is to use search_analyzer_exact for the docket_number filter. Since docket_number queries use a match_phrase query in build_term_query, they already search for exact matches. This method is also used for other filters that require exact matches, such as integers, court_id, and keyword fields, which are not modified at indexing time. That's why this issue affected only docket_number, which undergoes transformation during indexing, and not other filters.

The second part of the solution is to use the docketNumber.exact field for indexing, ensuring it aligns with the search_analyzer_exact query terms.
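As a rough illustration of the two parts of the solution, the query body might look like the following. This is a sketch written as a plain dictionary rather than the actual output of build_term_query; the helper name `docket_number_phrase_query` is hypothetical, and it assumes the analyzer is passed through the `analyzer` option of the `match_phrase` query.

```python
def docket_number_phrase_query(docket_number: str) -> dict:
    """Hypothetical helper: build a match_phrase query body.

    It targets the exact sub-field and names the exact search analyzer,
    so query terms are analyzed the same way as the indexed terms.
    """
    return {
        "match_phrase": {
            "docketNumber.exact": {
                "query": docket_number,
                "analyzer": "search_analyzer_exact",
            }
        }
    }


print(docket_number_phrase_query("3:25-cv-00025"))
```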

This was not an issue for docket number queries entered in the text box, like q=3:25-cv-00025, because we detect the docket number in the query and convert it to a fielded docketNumber query wrapped in quotes, docketNumber:"3:25-cv-00025", which looks for an exact match.
However, I found the issue was also present in fielded queries like q=docketNumber:3:25-cv-00025 when the user didn't wrap the docket number in quotes, so I applied a fix for this case by tweaking cleanup_main_query.
To pass a parenthetical test, it was necessary to change the docketNumber field definition in the parenthetical index, because it previously didn't support an exact version of the field. However, this change won't take effect in production until we recreate the parenthetical index. Before we do that, we may want to review the entire parenthetical implementation, since it follows an older approach and many of its fields might require changes.
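The handling of unquoted fielded queries can be pictured with a small regex sketch. This is not the code in cleanup_main_query: the function name `quote_fielded_docket_numbers` and the pattern are assumptions made for illustration, and the sketch does not handle the case covered by a later commit, where the fielded value sits inside a phrase.

```python
import re

# Matches docketNumber:<value> only when the value is not already quoted.
# The value ends at whitespace, a quote, or a parenthesis.
_UNQUOTED_DOCKET_NUMBER = re.compile(r'\bdocketNumber:(?!")([^\s"()]+)')


def quote_fielded_docket_numbers(query: str) -> str:
    """Hypothetical helper: wrap unquoted docketNumber values in quotes."""
    return _UNQUOTED_DOCKET_NUMBER.sub(r'docketNumber:"\1"', query)


print(quote_fielded_docket_numbers("docketNumber:3:25-cv-00025"))
print(quote_fielded_docket_numbers('docketNumber:"3:25-cv-00025"'))
```

Values that are already quoted are left untouched, so applying the helper twice gives the same result as applying it once.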

mlissner

Looks like we might be able to do it without re-indexing. That's nice. I don't know this code well enough, so I only gave it a glance. Onward to Eduardo for a better look.

albertisfu added 5 commits January 31, 2025 12:22

fix(search): Fixed docket_number filtering for numbers with repeated …

d96362b

…values Fixes: #4942

fix(search): Apply fix for repeated docket_number values in fielded d…

631c542

…ocketNumber queries

fix(search): Fix cleanup_main_query for fielded queries with a value …

d3cfd84

…inside a phrase

fix(search): Fixed cleanup_main_query comments

d206980

Merge branch 'main' into 4942-fix-docket-number-search

10c3542

albertisfu requested a review from mlissner January 31, 2025 22:14

mlissner reviewed Jan 31, 2025

mlissner assigned ERosendo Jan 31, 2025

mlissner requested a review from ERosendo January 31, 2025 22:24
