
Feature/process adjudication data #130

Merged: 6 commits merged into develop from feature/process-adjudication-data on Dec 19, 2024
Conversation

@laurejt (Contributor) commented Dec 16, 2024

Ref: #84

Script for processing Prodigy adjudication data files (.jsonl). Also adds project support for type checking via mypy.
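
For context, Prodigy exports annotation data as JSON Lines (one JSON object per line). A minimal sketch of reading such a file in Python; the helper name and the key inspection are illustrative, not taken from the script:

import json
from pathlib import Path

def read_jsonl(path):
    """Yield one parsed record per line of a .jsonl file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# e.g. peek at the first adjudication record (filename follows the review comment below)
first = next(read_jsonl("allround2.jsonl"))
print(sorted(first.keys()))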

@laurejt requested a review from rlskoeser on December 16, 2024 21:46
@laurejt assigned and then unassigned laurejt on Dec 16, 2024

codecov bot commented Dec 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.59%. Comparing base (f8fa533) to head (a879842).
Report is 3 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #130      +/-   ##
===========================================
- Coverage    99.69%   99.59%   -0.11%     
===========================================
  Files            5        8       +3     
  Lines          665      993     +328     
===========================================
+ Hits           663      989     +326     
- Misses           2        4       +2     

@rlskoeser (Collaborator) left a comment

Looks good; I was able to run it on the round2 adjudication data and review the results. It just needs a bit of documentation on what this is and how to run it, then it can be merged.

@@ -0,0 +1,176 @@
import argparse
Collaborator

Please add a short docstring documenting briefly what this script does, what it is for, and one example of how to run it.

Here's how I ran it locally on the round 2 adjudication data:

python src/corppa/poetry_detection/annotation/process_adjudication_data.py allround2.jsonl round2_pages round2_excerpts

It wasn't obvious to me from the help information that I probably should have named the second param round2_pages.jsonl and the third one round2_excerpts.csv. I also wasn't sure why we need both outputs or what the goals are for the different outputs.
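
For reference, a minimal sketch of inspecting the two outputs from that run; it assumes the pages output is JSON Lines and the excerpts output is CSV, per the filenames suggested in this comment, and the paths follow the command above:

import csv
import json

# Page-level output: one JSON object per page, with its adjudicated excerpts nested.
with open("round2_pages", encoding="utf-8") as f:
    pages = [json.loads(line) for line in f if line.strip()]
print(f"{len(pages)} pages")

# Excerpt-level output: one CSV row per excerpt.
with open("round2_excerpts", encoding="utf-8", newline="") as f:
    excerpts = list(csv.DictReader(f))
print(f"{len(excerpts)} excerpts")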

Contributor Author

Just want to verify that the parser help descriptions are clear. For example, output_pages was described as "Filename where extracted pages data (JSONL file) should be written".

Collaborator

"extracted pages data" was ambiguous to me in this context, I didn't know what it included or how it differed from the other output until poking around in the files a bit.

Maybe something like "annotations aggregated/grouped by page"?
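
A hypothetical argparse sketch of the wording under discussion; output_pages is the argument name mentioned above, while the other argument names and help strings are illustrative:

import argparse

parser = argparse.ArgumentParser(
    description="Process Prodigy adjudication data (.jsonl) into page- and excerpt-level outputs."
)
parser.add_argument("input", help="Prodigy adjudication data file (.jsonl)")
parser.add_argument(
    "output_pages",
    help="Filename for page-level output (JSONL): annotations aggregated/grouped by page",
)
parser.add_argument(
    "output_excerpts",
    help="Filename for excerpt-level output (CSV): one row per adjudicated excerpt",
)
args = parser.parse_args()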

Contributor Author

Does it make more sense in context now?

Contributor Author

I'm fine with changing the name later, but it's not something worth burning more of my R&D time on.

Collaborator

Thanks for updating the help text; it's clearer now.

Comment on lines +67 to +76
for excerpt in page_data["excerpts"]:
    entry = {
        "page_id": page_data["page_id"],
        "work_id": page_data["work_id"],
        "work_title": page_data["work_title"],
        "work_author": page_data["work_author"],
        "work_year": page_data["work_year"],
        "start": excerpt["start"],
        "end": excerpt["end"],
        "text": excerpt["text"],
Collaborator

when/if we come back to this (if we use this script again), let's tweak these fields to match the ones we've agreed on for the poem dataset
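
For illustration, a minimal sketch of writing entries with these fields to the excerpts CSV via csv.DictWriter; the writer setup and function name are assumptions, not taken from the script:

import csv

# Field names as they appear in the excerpt shown above.
FIELDNAMES = [
    "page_id", "work_id", "work_title", "work_author",
    "work_year", "start", "end", "text",
]

def write_excerpts_csv(entries, out_path):
    """Write one CSV row per excerpt entry."""
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(entries)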

@laurejt merged commit a386f5f into develop on Dec 19, 2024
6 checks passed
@laurejt deleted the feature/process-adjudication-data branch December 19, 2024 18:08