feat(lib): Enhances get_parties_from_case_name method #4971

Merged: 13 commits merged into main from 4802-feat-get-parties-from-bankruptcy-case-names on Feb 7, 2025

Contributor

@ERosendo ERosendo commented Jan 27, 2025

This PR enhances the get_parties_from_case_name method by removing common strings from bankruptcy case names before extracting party information. This improves the accuracy of party identification.

This PR also adds a new separator character to the list of valid separators for identifying parties in bankruptcy cases.

Fixes #4802
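
For readers skimming the thread, the two changes described above can be illustrated roughly as follows. This is only a sketch and not the code in the PR: `NOISE_PATTERNS`, `SEPARATORS`, and `split_parties` are hypothetical names, and the specific patterns and separators are assumptions chosen to fit the test case quoted in the review of this PR.

```python
import re

# Assumed for illustration; the PR description does not list the actual
# strings that are removed or the separators that are accepted.
NOISE_PATTERNS = [
    r"<[^>]+>",                         # leftover HTML tags
    r"Case Consolidated under [\d-]+",  # consolidation note
]
SEPARATORS = [" v. ", " vs. ", " and "]


def split_parties(case_name: str) -> list[str]:
    """Clean a case name, then split it on the first separator found."""
    cleaned = case_name
    for pattern in NOISE_PATTERNS:
        cleaned = re.sub(pattern, " ", cleaned, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by the removals.
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    for separator in SEPARATORS:
        if separator in cleaned:
            return [part.strip() for part in cleaned.split(separator, 1)]
    # No valid separator: report no parties rather than guessing.
    return []
```

Cleaning before splitting matters here because the noise can follow the last party name, so splitting first would leave it attached to that party.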

@ERosendo ERosendo force-pushed the 4802-feat-get-parties-from-bankruptcy-case-names branch from 192f1e2 to 7a79fa8 Compare January 27, 2025 15:02
@ERosendo ERosendo requested a review from mlissner January 27, 2025 15:12
This commit enhances the get_parties_from_case_name method by removing common strings from bankruptcy case names before extracting party information. This improves the accuracy of party identification.

- Adds a new separator character to the list of valid separators for identifying parties in bankruptcy cases.
@ERosendo ERosendo force-pushed the 4802-feat-get-parties-from-bankruptcy-case-names branch from 7a79fa8 to e31d9f3 Compare January 27, 2025 15:51
cl/lib/tests.py Outdated Show resolved Hide resolved
cl/lib/tests.py Outdated Show resolved Hide resolved
cl/lib/tests.py Outdated Show resolved Hide resolved
cl/lib/tests.py Outdated Show resolved Hide resolved
cl/lib/tests.py Outdated Show resolved Hide resolved
Member

@mlissner mlissner left a comment

The approach looks good, but I think I might have explained the goal poorly. I made a handful of suggestions to the tests that should help. Sorry for the miscommunication.

Comment on lines +1274 to +1275
"case_name": 'Saucedo and Green Dream International, LLC <b> <font color="red"> Case Consolidated under 23-03142 </font> </b>',
"output": ["Saucedo", "Green Dream International, LLC"],
Member

This one looks like it's actually wrong, but not sure we can do much better.

Contributor

Also from #4802 (comment) it seems that there might be cases where the words In or of are not part of the party names.

For instance something like:
In re: Advantage LLC

Is this possible in bankruptcy?

If so, in these cases, the indexed party would be In re: Advantage LLC, which doesn't seem correct. In district courts, we simply ignore anything that doesn't have a valid separator, but here, it seems more complicated since we're performing cleanup before splitting parties.

Perhaps, in these cases, we could completely ignore anything that contains In or of? Or we could look for examples of these case names and check if we can identify a common pattern for cleanup?

Contributor Author

> Is this possible in bankruptcy?

I looked into how often "In re" appears in case names. After refining the dataset to include more records (2 million total records with RECAP source or a derived one) and searching, I found only 36 instances (0.0018%) where a case name begins with "In re." A few examples are:

I think we should add a step to the cleanup process that removes "In re" before we try to figure out the party names.

For reference, here's a CSV file containing these 36 instances:

case_names_re_recap.csv

@albertisfu Let me know what you think.

Contributor

Thanks! Yeah, it seems like this type of case name is not very common.

I think we can try removing in re or in re: before splitting the parties; however, that would also require removing other common terms that seem to be typical in this type of case name structure but do not appear to be part of the parties, such as:

Matter of
Receivership of
Appearances of

Not sure if it's possible to compile a list of all potential terms that might appear in a bankruptcy case name but are not part of the parties.

Additionally, some case names don't seem to contain parties at all.
In re Matter of Ascendium Replacement Filings
In re: Proceedings to Review Attorney Usage of CM/ECF Filing Credentials
In re: Proceedings to Enforce Fed.R.Bankr.9036
In Re: Proceedings to Enforce Fed. R. Bankr. P. 9036 as to various high-volume paper-notice recipients relating to cases pending within the District of Connecticut.
In re Matter of Proof of Claim Replacement Filings
In re Appointments and Reappointments of Ohio Sout

In these cases, if we remove "in re," it might not be correct to treat the remaining text as a party.

Another option is to simply ignore any case name that contains in re or in re: and not index parties from those cases. Perhaps @mlissner has an opinion on this?

Member

For bankruptcy, I'm fine with just not indexing anything that starts with in re or in the matter of, etc.
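
A minimal sketch of that rule, assuming a simple prefix check is enough. The name `should_skip_party_extraction` is hypothetical, and since the "etc." above is not enumerated, only the two prefixes named in this comment are covered.

```python
import re

# Only the prefixes named in the thread; any further prefixes are unknown.
NON_PARTY_PREFIX = re.compile(
    r"^\s*(in re\b|in the matter of\b)", re.IGNORECASE
)


def should_skip_party_extraction(case_name: str) -> bool:
    """Return True when the case name starts with a non-party prefix."""
    return bool(NON_PARTY_PREFIX.match(case_name))
```

The word boundary after "re" keeps names that merely start with similar letters (for example "In regard to ...") from being skipped.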

@mlissner mlissner assigned ERosendo and unassigned mlissner Jan 28, 2025
@ERosendo
Contributor Author

> I made a handful of suggestions to the tests that should help.

They're very helpful. I'll update the helper function based on your feedback.

> Sorry for the miscommunication.

I apologize, I just reread the GitHub issue and it was all there. I was just trying to use the same approach as for district cases.

@ERosendo ERosendo force-pushed the 4802-feat-get-parties-from-bankruptcy-case-names branch from a06f20d to ca278cd Compare January 31, 2025 18:30
@ERosendo
Contributor Author

@mlissner I've extracted the case name cleaning logic into a dedicated method. This incorporates your feedback, and the associated test now passes.

@ERosendo ERosendo requested a review from mlissner January 31, 2025 18:55
Member

@mlissner mlissner left a comment

A couple little things. Let's see if Alberto catches anything else since he's been in this area of the code most recently.

cl/search/documents.py Outdated Show resolved Hide resolved
cl/lib/search_index_utils.py Outdated Show resolved Hide resolved
@mlissner mlissner assigned albertisfu and unassigned ERosendo Jan 31, 2025
@mlissner mlissner requested a review from albertisfu January 31, 2025 19:27
Contributor

@albertisfu albertisfu left a comment

Thanks @ERosendo this looks great! I only left some comments that might be worth confirming.

cl/lib/search_index_utils.py Show resolved Hide resolved

field_value = get_parties_from_case_name(
main_instance.case_name
field_value = (
get_parties_from_case_name_bankr(
Contributor

Since we're adding a special method for splitting parties in bankruptcy cases both here and in prepare_parties, I’d suggest adding a test case similar to those in test_index_party_from_case_name_when_parties_are_not_available to confirm that the correct method is selected for bankruptcy.

I think two additional test cases should be enough:

  1. Splitting parties from the case_name when creating a bankruptcy docket (which will use the logic in prepare_parties).
  2. Splitting parties when updating a case_name (which will use the logic in document_fields_to_update).

Currently, in test_index_party_from_case_name_when_parties_are_not_available, the factory docket_with_no_parties comes from a bankruptcy court. To differentiate the method get_parties_from_case_name_bankr, it would be necessary to change the court in this factory to a district court and create a new factory for bankruptcy. You could rely on the expected parties for the assertion, considering that get_parties_from_case_name_bankr performs some cleanup, or simply confirm that the correct method is being called using a mock. The same approach can be applied for the case_name update test case for bankruptcy.

I don’t think it'd be necessary to replicate the rest of the assertions from test_index_party_from_case_name_when_parties_are_not_available for bankruptcy since they share common logic that hasn’t changed.
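
To illustrate the mock-based option, here is a rough, self-contained sketch. `PartyExtractor`, `parties_for`, and `bankr_splitter_selected` are hypothetical stand-ins and not the project's actual code or test helpers; the two splitter bodies are placeholders. Only the pattern of asserting which splitter was called is the point.

```python
from unittest import mock


class PartyExtractor:
    """Hypothetical stand-in for the module that holds both splitters."""

    @staticmethod
    def get_parties_from_case_name(case_name: str) -> list[str]:
        return case_name.split(" v. ")  # placeholder logic

    @staticmethod
    def get_parties_from_case_name_bankr(case_name: str) -> list[str]:
        return case_name.split(" and ")  # placeholder logic

    @classmethod
    def parties_for(cls, case_name: str, is_bankruptcy: bool) -> list[str]:
        # Select the splitter based on the kind of court.
        if is_bankruptcy:
            return cls.get_parties_from_case_name_bankr(case_name)
        return cls.get_parties_from_case_name(case_name)


def bankr_splitter_selected(is_bankruptcy: bool) -> bool:
    """Patch both splitters and report whether only the bankruptcy one ran."""
    with mock.patch.object(
        PartyExtractor, "get_parties_from_case_name_bankr", return_value=[]
    ) as bankr:
        with mock.patch.object(
            PartyExtractor, "get_parties_from_case_name", return_value=[]
        ) as district:
            PartyExtractor.parties_for("A v. B", is_bankruptcy)
            return bankr.called and not district.called
```

Asserting on the call rather than on the returned parties keeps the test independent of the cleanup rules, which may keep changing.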

@albertisfu albertisfu assigned ERosendo and unassigned albertisfu Feb 3, 2025
This commit introduces a new helper function, `is_bankruptcy_court`, which checks if a given court ID corresponds to a bankruptcy court.
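
As a rough illustration of what such a helper could look like (the commit message does not show how the check is implemented, so the lookup set below is purely an assumption with a few sample court IDs):

```python
# Illustrative sample only; a real implementation would presumably
# consult the project's court data instead of a hardcoded set.
SAMPLE_BANKRUPTCY_COURT_IDS = frozenset({"nysb", "canb", "txsb"})


def is_bankruptcy_court(court_id: str) -> bool:
    """Return True if court_id is one of the sample bankruptcy court IDs."""
    return court_id in SAMPLE_BANKRUPTCY_COURT_IDS
```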
@ERosendo ERosendo force-pushed the 4802-feat-get-parties-from-bankruptcy-case-names branch from 0e94712 to 7442b1e Compare February 4, 2025 19:24
@ERosendo ERosendo force-pushed the 4802-feat-get-parties-from-bankruptcy-case-names branch from 5728ca7 to 41ea9cd Compare February 5, 2025 00:41
This commit introduces logic to handle bankruptcy case names that begin with "in re" or "in the matter of".  These types of case names typically don't contain party information in the standard format, so the function now returns an empty list in these cases.  This prevents incorrect parsing and ensures more accurate extraction of party names.
@ERosendo ERosendo requested a review from albertisfu February 6, 2025 21:14
@ERosendo ERosendo assigned albertisfu and unassigned ERosendo Feb 6, 2025
Contributor

@albertisfu albertisfu left a comment

Thanks @ERosendo changes look great!

Just one last thing. While running a test, I noticed a case in my database with the title:

Toby Edward Torres - Adversary Proceeding

I was surprised that these words didn't appear in your sample. Maybe the reason is that there are only about 5,000 cases with this title.

The "- Adversary Proceeding" part is not actually in the docket title; it is added by Juriscraper.

So, I think we should remove "- Adversary Proceeding" before extracting the parties.

I also just realized that Mike mentioned this in one of his earlier comments.

Additionally, while reviewing Juriscraper, I noticed that some cases have the title "Unknown Case Title". I think we should ignore these completely.
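
Both suggestions could be sketched along these lines. `normalize_title` is a hypothetical name, and the exact matching rules (case-insensitive, suffix anchored at the end of the title) are assumptions rather than what Juriscraper or this PR guarantees.

```python
import re
from typing import Optional

# Suffix that, per the comment above, is appended by the scraper.
ADVERSARY_SUFFIX = re.compile(r"\s*-\s*Adversary Proceeding\s*$", re.IGNORECASE)
PLACEHOLDER_TITLE = "unknown case title"


def normalize_title(case_name: str) -> Optional[str]:
    """Drop the scraper-added suffix; return None for placeholder titles."""
    if case_name.strip().lower() == PLACEHOLDER_TITLE:
        return None
    return ADVERSARY_SUFFIX.sub("", case_name).strip()
```

Returning `None` for the placeholder lets the caller skip party extraction entirely instead of indexing "Unknown Case Title" as a party.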

@albertisfu albertisfu assigned ERosendo and unassigned albertisfu Feb 7, 2025
@ERosendo
Contributor Author

ERosendo commented Feb 7, 2025

Hey @albertisfu Thanks for catching that.

> I was surprised that these words didn't appear in your sample.

I investigated the development database and found 3,000 instances of "Adversary Proceeding." However, my random dataset only contained 20 instances.

@ERosendo ERosendo requested a review from albertisfu February 7, 2025 02:48
@ERosendo ERosendo assigned albertisfu and unassigned ERosendo Feb 7, 2025
Contributor

@albertisfu albertisfu left a comment

Thanks! @ERosendo this looks great now. :shipit:

@albertisfu albertisfu merged commit f303e37 into main Feb 7, 2025
15 checks passed
@albertisfu albertisfu deleted the 4802-feat-get-parties-from-bankruptcy-case-names branch February 7, 2025 15:24
Development

Successfully merging this pull request may close these issues.

Extract parties from case_name when they are not available in bankruptcy cases
3 participants