-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Calisphere etl #714
Calisphere etl #714
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question. Looks good to me.
@@ -56,7 +56,8 @@ def parse_enrichment_url(enrichment_url): | |||
|
|||
|
|||
def run_enrichments(records, collection, enrichment_set, page_filename): | |||
for enrichment_url in collection.get(enrichment_set, []): | |||
enrichment_urls = collection.get(enrichment_set) or [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You've done this in several places, replacing collection.get(some_value, [])
with collection.get(some_value) or []
. Is there a difference in behavior between those two? It isn't obvious to me what that is. No issue with it, just wondering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amywieliczka @bibliotechy so yeah, it turns out that if the key exists and the value is explicitly None
then this is what happens:
Python 3.9.0 (default, May 23 2023, 15:28:21)
[Clang 13.0.0 (clang-1300.0.27.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> mydict = {"foo": None}
>>> for x in mydict.get("foo", []):
... print(x)
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'NoneType' object is not iterable
Whereas:
>>> mydict.get("foo") or []
[]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, but similarly confused by the replacement of <dict>.get(<key>, [])
with <dict>.get(<key>) or []
The mapper is ready for review.
I see what you mean about the solr API returning a final empty page using nextCursorMark. This causes problems when running the
harvest_collection
DAG where the mapper fans out by page. I can't figure out what the heck is going on with Solr.Since the solr index is static and this is transitional functionality that we'll retire after we cutover to rikolti, I went ahead and added a bit of a hacky workaround to the fetcher's
increment()
method, which reports "finished" if we've fetched the number of docs that we're expecting. (See this commit: 2fe62d0).