
Feature/process adjudication data #130

Merged: 6 commits merged into develop from feature/process-adjudication-data on Dec 19, 2024
Conversation

@laurejt (Contributor) commented Dec 16, 2024

Ref: #84

Script for processing Prodigy adjudication data files (.jsonl). Also adds project support for type checking via mypy.
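
For context, Prodigy exports annotation data as JSON Lines (one JSON object per line). A minimal sketch of reading such a file in Python; the helper name and the key inspection are illustrative, not taken from the script:

import json
from pathlib import Path

def read_jsonl(path):
    """Yield one parsed record per line of a .jsonl file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# e.g. peek at the first adjudication record (filename follows the review comment below)
first = next(read_jsonl("allround2.jsonl"))
print(sorted(first.keys()))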

@laurejt requested a review from rlskoeser on December 16, 2024 21:46
@laurejt assigned and then unassigned laurejt on Dec 16, 2024

codecov bot commented Dec 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.59%. Comparing base (f8fa533) to head (a879842).
Report is 3 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #130      +/-   ##
===========================================
- Coverage    99.69%   99.59%   -0.11%     
===========================================
  Files            5        8       +3     
  Lines          665      993     +328     
===========================================
+ Hits           663      989     +326     
- Misses           2        4       +2     

@rlskoeser (Collaborator) left a comment

Looks good; I was able to run it on the round2 adjudication data and review the results. It just needs a bit of documentation on what this is and how to run it, then it can be merged.

@@ -0,0 +1,176 @@
import argparse
Collaborator

Please add a short docstring documenting briefly what this script does, what it is for, and one example of how to run it.

Here's how I ran it locally on the round 2 adjudication data:

python src/corppa/poetry_detection/annotation/process_adjudication_data.py allround2.jsonl round2_pages round2_excerpts

It wasn't obvious to me from the help information that I probably should have named the second param round2_pages.jsonl and the third one round2_excerpts.csv. I also wasn't sure why we need both outputs or what the goals are for the different outputs.
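
For reference, a minimal sketch of inspecting the two outputs from that run; it assumes the pages output is JSON Lines and the excerpts output is CSV, per the filenames suggested in this comment, and the paths follow the command above:

import csv
import json

# Page-level output: one JSON object per page, with its adjudicated excerpts nested.
with open("round2_pages", encoding="utf-8") as f:
    pages = [json.loads(line) for line in f if line.strip()]
print(f"{len(pages)} pages")

# Excerpt-level output: one CSV row per excerpt.
with open("round2_excerpts", encoding="utf-8", newline="") as f:
    excerpts = list(csv.DictReader(f))
print(f"{len(excerpts)} excerpts")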

Contributor Author

Just want to verify that the parser help descriptions are clear. For example, output_pages was described as "Filename where extracted pages data (JSONL file) should be written".

Collaborator

"extracted pages data" was ambiguous to me in this context, I didn't know what it included or how it differed from the other output until poking around in the files a bit.

Maybe something like "annotations aggregated/grouped by page"?
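
A hypothetical argparse sketch of the wording under discussion; output_pages is the argument name mentioned above, while the other argument names and help strings are illustrative:

import argparse

parser = argparse.ArgumentParser(
    description="Process Prodigy adjudication data (.jsonl) into page- and excerpt-level outputs."
)
parser.add_argument("input", help="Prodigy adjudication data file (.jsonl)")
parser.add_argument(
    "output_pages",
    help="Filename for page-level output (JSONL): annotations aggregated/grouped by page",
)
parser.add_argument(
    "output_excerpts",
    help="Filename for excerpt-level output (CSV): one row per adjudicated excerpt",
)
args = parser.parse_args()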

Contributor Author

Does it make more sense in context now?

Contributor Author

I'm fine with changing the name later, but it's not something worth burning more of my R&D time on.

Collaborator

Thanks for updating the help text; it's clearer now.

Comment on lines +67 to +76
for excerpt in page_data["excerpts"]:
    entry = {
        "page_id": page_data["page_id"],
        "work_id": page_data["work_id"],
        "work_title": page_data["work_title"],
        "work_author": page_data["work_author"],
        "work_year": page_data["work_year"],
        "start": excerpt["start"],
        "end": excerpt["end"],
        "text": excerpt["text"],
Collaborator

when/if we come back to this (if we use this script again), let's tweak these fields to match the ones we've agreed on for the poem dataset
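
For illustration, a minimal sketch of writing entries with these fields to the excerpts CSV via csv.DictWriter; the writer setup and function name are assumptions, not taken from the script:

import csv

# Field names as they appear in the excerpt shown above.
FIELDNAMES = [
    "page_id", "work_id", "work_title", "work_author",
    "work_year", "start", "end", "text",
]

def write_excerpts_csv(entries, out_path):
    """Write one CSV row per excerpt entry."""
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(entries)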

@laurejt merged commit a386f5f into develop on Dec 19, 2024
6 checks passed
@laurejt deleted the feature/process-adjudication-data branch December 19, 2024 18:08