4942 Fix document_number filtering for docket numbers with repeated values #5018

Open
wants to merge 5 commits into base: main


albertisfu
Contributor

@mlissner your observation in #4942 (comment) was correct!

The issue was indeed caused by the combination of repeated terms in a docket number and the search analyzer in use. With the default search_analyzer, a docket number like "1:22-cr-00001" is tokenized as follows:

{
    "tokens": [
        {
            "token": "1:22-cr-00001",
            "start_offset": 0,
            "end_offset": 13,
            "type": "word",
            "position": 0
        },
        {
            "token": "1",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0
        },
        {
            "token": "22",
            "start_offset": 2,
            "end_offset": 4,
            "type": "word",
            "position": 1
        },
        {
            "token": "cr",
            "start_offset": 5,
            "end_offset": 7,
            "type": "word",
            "position": 2
        }
    ]
} 

This behavior is due to the remove_duplicates filter, which removes duplicate tokens. As a result, in this case, the second occurrence of "1" is removed.

Similarly, for "3:25-cv-00025", the second occurrence of "25" is removed.

This explains why a filter query like docket_number:3:25-cv-00025 only matched up to "cv".
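To make the effect concrete, here is a toy model of the behavior described above. It is not the real Elasticsearch analyzer chain: the function name `toy_analyze`, the split rule, and the leading-zero normalization are all assumptions made only to reproduce the token lists shown in this comment.

```python
import re


def toy_analyze(docket_number: str, dedupe: bool) -> list[str]:
    """Toy stand-in for the analysis described above (NOT the real analyzer)."""
    # Assumption: the value is split on any non-alphanumeric character.
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", docket_number.lower()) if p]
    # Assumption: numeric parts lose their leading zeros ("00025" -> "25").
    parts = [(p.lstrip("0") or "0") if p.isdigit() else p for p in parts]
    if not dedupe:
        return parts
    # Drop every repeated token, keeping only its first occurrence.
    seen: set[str] = set()
    unique: list[str] = []
    for part in parts:
        if part not in seen:
            seen.add(part)
            unique.append(part)
    return unique


if __name__ == "__main__":
    for value in ("1:22-cr-00001", "3:25-cv-00025"):
        print(value, toy_analyze(value, dedupe=False), toy_analyze(value, dedupe=True))
```

With de-duplication on, the trailing term disappears from both examples, so a phrase built from the remaining terms stops at "cr" or "cv".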

Solution:
The solution is to use search_analyzer_exact for the docket_number filter. Since docket_number queries use a match_phrase query in build_term_query, they already search for exact matches. This method is also used for other filters that require exact matches, such as integer and keyword fields (court_id, for example), which are not modified at indexing time. That's why this issue affected only docket_number, which undergoes transformation during indexing, and not other filters.

The second part of the solution is to use the docketNumber.exact field for indexing, ensuring it aligns with the search_analyzer_exact query terms.
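As a rough illustration of the two parts of the fix, the query body could look like the sketch below. The helper name `build_docket_number_phrase_query` is hypothetical and is not the actual build_term_query code; only the field name docketNumber.exact, the analyzer name search_analyzer_exact, and the use of match_phrase come from this description.

```python
import json


def build_docket_number_phrase_query(value: str) -> dict:
    """Hypothetical helper: phrase query against the exact sub-field."""
    return {
        "match_phrase": {
            # Target the exact version of the field.
            "docketNumber.exact": {
                "query": value,
                # Analyze the query terms with the exact search analyzer so
                # they line up with the terms stored in docketNumber.exact.
                "analyzer": "search_analyzer_exact",
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_docket_number_phrase_query("3:25-cv-00025"), indent=2))
```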

  • This was not an issue for docket number queries entered in the text box, like q=3:25-cv-00025, because we detect the docket number in the query and convert it to a fielded docketNumber query wrapped in quotes, docketNumber:"3:25-cv-00025", which looks for an exact match.
  • However, I found the issue was also present in fielded queries like q=docketNumber:3:25-cv-00025 when the user didn't wrap the docket number in quotes, so I applied a fix for this case by tweaking cleanup_main_query.
  • To pass a parenthetical test, it was necessary to change the docketNumber field definition in the parenthetical index, because it previously didn't support an exact version of the field. However, this change won't take effect in production until we recreate the parenthetical index. Before we do that, we may want to review the entire parenthetical index design: it follows an older approach, so many of its fields might require changes.
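The quoting fix for unquoted fielded queries can be sketched as follows. This is not the actual cleanup_main_query change: the function name `quote_fielded_docket_numbers` and the docket number pattern are assumptions chosen to match the examples in this comment.

```python
import re

# Assumed docket number shape, e.g. 3:25-cv-00025. The negative lookahead
# skips values that are already wrapped in quotes.
DOCKET_FIELD_RE = re.compile(r'\bdocketNumber:(?!")(\d+:\d{2}-[A-Za-z]{1,4}-\d+)')


def quote_fielded_docket_numbers(query: str) -> str:
    """Hypothetical helper: wrap unquoted fielded docket numbers in quotes."""
    return DOCKET_FIELD_RE.sub(r'docketNumber:"\1"', query)


if __name__ == "__main__":
    print(quote_fielded_docket_numbers("docketNumber:3:25-cv-00025"))
    print(quote_fielded_docket_numbers('docketNumber:"3:25-cv-00025"'))
```

An already quoted value is left untouched, so applying the helper twice gives the same result as applying it once.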

albertisfu requested a review from mlissner January 31, 2025 22:14
Member

mlissner left a comment

Looks like we might be able to do it without re-indexing. That's nice. I don't know this code well enough, so I only gave it a glance. Onward to Eduardo for a better look.
