Feature/process adjudication data #130
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##           develop     #130     +/-   ##
===========================================
- Coverage    99.69%   99.59%   -0.11%
===========================================
  Files            5        8      +3
  Lines          665      993    +328
===========================================
+ Hits           663      989    +326
- Misses           2        4      +2
Looks good, was able to run it on the round2 adjudication data and review the results. Just needs a bit of documentation on what this is and how to run it, then it can be merged.
@@ -0,0 +1,176 @@
import argparse
Please add a short docstring documenting briefly what this script does, what it is for, and one example of how to run it.
Here's how I ran it locally on the round 2 adjudication data:
python src/corppa/poetry_detection/annotation/process_adjudication_data.py allround2.jsonl round2_pages round2_excerpts
It wasn't obvious to me from the help information that I should probably have named the second param round2_pages.jsonl and the third one round2_excerpts.csv. I wasn't sure why we need both outputs or what the goals are for the different outputs.
Just want to verify that the parser help descriptions are clear. For example, output_pages was described as "Filename where extracted pages data (JSONL file) should be written".
"extracted pages data" was ambiguous to me in this context; I didn't know what it included or how it differed from the other output until poking around in the files a bit.
Maybe something like "annotations aggregated/grouped by page"?
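For illustration, here is a minimal sketch of what more explicit help text could look like. It is not the script's actual parser: only the name output_pages and the three-positional-argument invocation come from this thread; the names input and output_excerpts, the function build_parser, and all of the help wording are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose help text names each file's format and contents.

    Hypothetical sketch: argument names other than output_pages are assumed.
    """
    parser = argparse.ArgumentParser(
        description=(
            "Process Prodigy adjudication data (JSONL) into a page-level "
            "file and an excerpt-level file."
        ),
    )
    parser.add_argument(
        "input",
        help="Path to the Prodigy adjudication data (JSONL file)",
    )
    parser.add_argument(
        "output_pages",
        help=(
            "Filename for page-level output: annotations grouped by page "
            "(JSONL file, e.g. round2_pages.jsonl)"
        ),
    )
    parser.add_argument(
        "output_excerpts",
        help=(
            "Filename for excerpt-level output: one row per excerpt "
            "(CSV file, e.g. round2_excerpts.csv)"
        ),
    )
    return parser


# Parse the same invocation used in the review, with file extensions added.
args = build_parser().parse_args(
    ["allround2.jsonl", "round2_pages.jsonl", "round2_excerpts.csv"]
)
print(args.output_pages, args.output_excerpts)
```

Putting the expected extension and a one-phrase description of the contents in each help string would address both points raised above: what each output contains and how the filenames should be chosen.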
Does it make more sense in context now?
I'm fine with changing the name later, but it's not something worth burning more of my R&D time on.
Thanks for updating the help text; it is clearer now.
for excerpt in page_data["excerpts"]:
    entry = {
        "page_id": page_data["page_id"],
        "work_id": page_data["work_id"],
        "work_title": page_data["work_title"],
        "work_author": page_data["work_author"],
        "work_year": page_data["work_year"],
        "start": excerpt["start"],
        "end": excerpt["end"],
        "text": excerpt["text"],
When/if we come back to this (if we use this script again), let's tweak these fields to match the ones we've agreed on for the poem dataset.
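If the fields do get revisited, one option is to keep the field lists in one place so they are easy to rename. The following is only a sketch under assumptions, not the script's implementation: the field names are the ones visible in the diff excerpt above, while excerpt_rows, PAGE_FIELDS, EXCERPT_FIELDS, and the sample record are hypothetical.

```python
import csv
import io
from typing import Any, Iterator

# Field names taken from the diff excerpt; edit these lists to rename columns.
PAGE_FIELDS = ["page_id", "work_id", "work_title", "work_author", "work_year"]
EXCERPT_FIELDS = ["start", "end", "text"]


def excerpt_rows(page_data: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield one flat row per excerpt, repeating the page-level metadata."""
    page_meta = {field: page_data[field] for field in PAGE_FIELDS}
    for excerpt in page_data["excerpts"]:
        yield {**page_meta, **{field: excerpt[field] for field in EXCERPT_FIELDS}}


# Made-up sample record, for illustration only.
sample_page = {
    "page_id": "work-1.0001",
    "work_id": "work-1",
    "work_title": "Example Title",
    "work_author": "Example Author",
    "work_year": 1850,
    "excerpts": [
        {"start": 0, "end": 12, "text": "first excerpt"},
        {"start": 40, "end": 55, "text": "second excerpt"},
    ],
}

rows = list(excerpt_rows(sample_page))

# Write the rows as CSV to an in-memory buffer rather than a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=PAGE_FIELDS + EXCERPT_FIELDS)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

With the column names in two lists, matching the poem dataset's agreed fields would mean editing the lists (and any key mapping) rather than the loop body.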
Ref: #84
Script for processing Prodigy adjudication data files (.jsonl). Also adds project support for type checking via mypy.