Enrich date #1090
Looks good. One small comment inline.

One question I do have: should we be running the tests of the date parsers as part of our CI suite?
metadata_mapper/mappers/mapper.py (Outdated)

> self.mapped_data is always a dictionary and should always be a dictionary, so directionally, I'd rather resolve this by figuring out why pylance thinks it could be otherwise and adding clarity there.
Do you need to add the type declaration when `self.mapped_data` is initialized above?

```python
self.mapped_data: dict = {}
```

Also, another parsing library to consider in the future is https://github.com/bear/parsedatetime
Adds date parsing logic for:

- `enrich_date`
- `enrich_earliest_date`
- `add_facet_decade`
- `make_sort_dates`
- `unpack_display_date`
All fetched vernacular data is run through a mapper. I did not do an analysis of what the different mappers are doing with regard to dates. Once the record is mapped, `enrich_date` and `enrich_earliest_date` are optionally run, pending their configuration in the enrichment chain for that particular collection.

`enrich_earliest_date` retrieves the mapped record's `date` value, manipulates it, and then replaces the mapped record's `date` value with the results of its manipulation. Internally, `convert_date` converts a string or list of strings into a list of dictionaries with the keys `begin`, `end`, and `displayDate`. Strings are parsed using DPLA's zen library: https://github.com/dpla-attic/zen (used in DPLA's ingestion 1 and copied here with some minor Python 2 to Python 3 updates to duplicate existing functionality). If the mapped record's `date` value is already a dictionary, or if it is a list containing any dictionary, it does no date manipulation and simply returns the `date` value.

Despite its name, I didn't see any indication that `enrich_earliest_date` has any kind of logic determining which dates are "earliest".
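As a minimal sketch of the control flow described above (the function name suffix and the stand-in string parsing are mine; the real `convert_date` delegates parsing to the zen library):

```python
from typing import Union


def convert_date_sketch(value: Union[str, list, dict]) -> Union[list, dict]:
    """Illustrative sketch of convert_date's control flow -- not the
    actual implementation, which parses strings with DPLA's zen library."""
    # If the value is already a dictionary, or a list containing any
    # dictionary, no manipulation is performed.
    if isinstance(value, dict):
        return value
    if isinstance(value, list) and any(isinstance(v, dict) for v in value):
        return value

    values = [value] if isinstance(value, str) else value
    converted = []
    for date_string in values:
        # Stand-in for the zen library's parsing: treat a bare "YYYY"
        # or "YYYY-YYYY" string as begin/end years.
        parts = date_string.split("-")
        converted.append({
            "begin": parts[0].strip(),
            "end": parts[-1].strip(),
            "displayDate": date_string,
        })
    return converted
```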
`enrich_date` does the exact same thing as `enrich_earliest_date`, except it takes a field name as an argument, operates on the value of the specified field, and replaces that field's value in the mapped record. The default field name is `temporal`. In my analysis, `enrich_date` exists in 2,749 enrichment chains with no field name specified, and in 1,003 enrichment chains with the field name `date` specified. Instances of `enrich_date&prop=sourceResource/date` are functional duplicates of `enrich_earliest_date`, and since `convert_dates` does not perform any manipulation on lists of dictionaries, this duplication has no effect on the record. There are 3 collections that have `enrich_date&prop=sourceResource/date` but do not have `enrich_earliest_date` in their enrichment chains.

Unlike the two enrichments listed previously, the following 3 functions run on every record (except Calisphere Solr extracted records), regardless of enrichment configuration:
`add_facet_decade` is part of the `solr_updater` function that was migrated from our legacy harvester. It operates on the `date` value of the mapped and enriched record to write to the `facet_decade` field in the `solr_doc` (the final output of the metadata_mapper module). It is difficult to say what this `date` value might look like at this point, since there are some collections that never call `enrich_earliest_date` and also never call `enrich_date&prop=sourceResource/date`. In those cases, the value of the `date` field looks however it does when the mapper is finished with it. Some mappers do seem to produce lists of dictionaries with the keys `begin`, `end`, and `displayDate` (providing their own special, mapper-specific date parsing logic), but it seems that some mappers also simply produce strings, or lists of strings. Nevertheless, `add_facet_decade` makes the `date` value into a list if it is not already one, and then tries `get_facet_decade` for each value in that list, returning the first successful response from `get_facet_decade`. For its part, `get_facet_decade` does its own special parsing of the `displayDate` using Brian's decade facet logic.
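The `add_facet_decade` flow described above can be sketched roughly as follows; `get_facet_decade_stub` is a made-up stand-in for Brian's decade facet logic (including the output format), which is not reproduced here:

```python
import re


def get_facet_decade_stub(date_value):
    """Made-up stand-in for get_facet_decade: pull the first 4-digit
    year out of displayDate (or a raw string) and truncate to a decade."""
    if isinstance(date_value, dict):
        text = date_value.get("displayDate") or ""
    else:
        text = str(date_value)
    match = re.search(r"\b(\d{4})\b", text)
    if match:
        year = int(match.group(1))
        return f"{year - year % 10}s"
    return None


def add_facet_decade_sketch(record):
    """Sketch of the add_facet_decade flow: listify the date value,
    then keep the first successful get_facet_decade response."""
    dates = record.get("date") or []
    if not isinstance(dates, list):
        dates = [dates]
    for date_value in dates:
        decade = get_facet_decade_stub(date_value)
        if decade:
            record["facet_decade"] = decade
            break  # first successful response wins
    return record
```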
`make_sort_dates` is also part of the `solr_updater` function and operates on the `date` value of the mapped and enriched record to write to the `sort_date_start` and `sort_date_end` fields of the `solr_doc`. It depends on the `date` value containing at least one dictionary with the `begin` and `end` keys. If the `date` value contains multiple dictionaries with the `begin` and `end` keys, it sorts all the begin dates and returns the first one as the start date, and then sorts all the end dates and returns the first one as the end date.
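A rough sketch of the `make_sort_dates` behavior just described, under the assumption that dates sort lexicographically (the function name suffix and the filtering details are mine):

```python
def make_sort_dates_sketch(record):
    """Sketch of make_sort_dates: sort all begin dates and take the
    first as the start date; likewise for end dates."""
    dates = record.get("date") or []
    if not isinstance(dates, list):
        dates = [dates]
    # Only dictionaries with both begin and end keys participate.
    ranges = [d for d in dates
              if isinstance(d, dict) and d.get("begin") and d.get("end")]
    if ranges:
        record["sort_date_start"] = sorted(d["begin"] for d in ranges)[0]
        record["sort_date_end"] = sorted(d["end"] for d in ranges)[0]
    return record
```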
`unpack_display_date` is also part of the `solr_updater` function and operates on the `date` value and the `temporal` value to write `date` and `temporal` values to the `solr_doc`. This function turns the value into a list if it is not already one; then, for each element in the list, if it's a dictionary, it puts the `displayDate` value in the return value, and if it's a string, it puts the string into the return value, and finally it returns this new list of display dates and strings.

There are also some updates to the validator. We shut down the CouchDB instance, so validations against Couch no longer work. As such, I've commented out
the `is_shown_by` and `is_shown_at` validation functions.

There is also a new test module, `metadata_mapper/test/test_date_processing.py`, along with 5 CSVs defining inputs and expected outputs for the 5 functions mentioned previously. These CSVs were created using the 27 test collections specified here: https://docs.google.com/spreadsheets/d/1JNw7ynxCJS8d4f5W4Glwwqx0qMm8ss-aB_GpjVxN1dI/edit?gid=0#gid=0 (filter on Calisphere Solr Mapper Type = No) and their validated output. You can run these tests using `pytest metadata_mapper`.
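For reference, the `unpack_display_date` behavior described earlier can be sketched as (hypothetical function name; the real logic lives inside the `solr_updater` function):

```python
def unpack_display_date_sketch(value):
    """Sketch of unpack_display_date: listify the value, keep
    displayDate from dictionaries and keep strings as-is."""
    values = value if isinstance(value, list) else [value]
    unpacked = []
    for item in values:
        if isinstance(item, dict):
            unpacked.append(item.get("displayDate"))
        elif isinstance(item, str):
            unpacked.append(item)
    return unpacked
```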
This PR requires a new pip install in your environment: I've added a new dependency to the project, `timelib`.
I'm also not 100% sure the conditional validation works (commit: 1b78253). I'm not sure which OS the `validation_collection_task` is running on, or whether `UCLDC_SOLR_URL` and `UCLDC_COUCH_URL` are available within that environment, but this is the general idea regarding conditional validation.