Calisphere etl #714

barbarahui · 2024-01-27T00:49:06Z

The mapper is ready for review.

I see what you mean about the Solr API returning a final empty page when paging with nextCursorMark. This causes problems when running the harvest_collection DAG, where the mapper fans out by page. I can't figure out what the heck is going on with Solr.

Since the Solr index is static and this is transitional functionality that we'll retire after we cut over to Rikolti, I went ahead and added a bit of a hacky workaround to the fetcher's increment() method, which reports "finished" once we've fetched the number of docs that we're expecting. (See this commit: 2fe62d0.)
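For readers following along, here is a minimal sketch of the idea behind the workaround, not the code in 2fe62d0. The class name `CursorPager`, its attributes, and the page shape (a Solr-style JSON body with `response.docs` and `nextCursorMark`) are assumptions made for illustration. For what it's worth, Solr's cursor documentation describes the end of results as the point where the returned nextCursorMark repeats the cursorMark that was sent, which may be why a trailing page with zero docs shows up.

```python
class CursorPager:
    """Hypothetical stand-in for the fetcher; not the actual Rikolti class."""

    def __init__(self, expected_docs, cursor_mark="*"):
        self.expected_docs = expected_docs  # e.g. numFound from the first response
        self.fetched_docs = 0
        self.cursor_mark = cursor_mark
        self.finished = False

    def increment(self, page):
        """Advance the cursor, or report finished once all expected docs are in."""
        self.fetched_docs += len(page["response"]["docs"])
        if self.fetched_docs >= self.expected_docs:
            # Stop here so the trailing empty page is never requested.
            self.finished = True
        else:
            self.cursor_mark = page["nextCursorMark"]


# Simulated two-page harvest of a three-doc collection.
pager = CursorPager(expected_docs=3)
pager.increment({"response": {"docs": [{"id": 1}, {"id": 2}]}, "nextCursorMark": "A"})
first_page_finished = pager.finished
pager.increment({"response": {"docs": [{"id": 3}]}, "nextCursorMark": "B"})
second_page_finished = pager.finished
```

Counting docs only works because the index is static; on a live index the expected count could change mid-harvest.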

bibliotechy

One question. Looks good to me.

bibliotechy · 2024-01-29T20:11:06Z

metadata_mapper/lambda_function.py

@@ -56,7 +56,8 @@ def parse_enrichment_url(enrichment_url):


 def run_enrichments(records, collection, enrichment_set, page_filename):
-    for enrichment_url in collection.get(enrichment_set, []):
+    enrichment_urls = collection.get(enrichment_set) or []


You've done this in several places, replacing collection.get(some_value, []) with collection.get(some_value) or []. Is there a difference in behavior between those two? It isn't obvious to me what that is. No issue with it, just wondering.

barbarahui · 2024-01-30

@amywieliczka @bibliotechy so yeah, it turns out that if the key exists and the value is explicitly None, the default passed to get() is not used, and this is what happens:

Python 3.9.0 (default, May 23 2023, 15:28:21)
[Clang 13.0.0 (clang-1300.0.27.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> mydict = {"foo": None}
>>> for x in mydict.get("foo", []):
...     print(x)
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'NoneType' object is not iterable

Whereas:

>>> mydict.get("foo") or []
[]
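To spell out the difference side by side, here is a small self-contained sketch. The key name `enrichments` is made up for the example and is not necessarily a key in the real collection record.

```python
# Key is present, but its value is explicitly None.
collection = {"enrichments": None}

# The default passed to get() is only used when the key is missing,
# so the stored None comes straight back.
with_default = collection.get("enrichments", [])

# `or []` replaces None (and any other falsy value) with an empty list.
with_or = collection.get("enrichments") or []

# When the key is missing entirely, both forms agree.
missing = {}
missing_default = missing.get("enrichments", [])
missing_or = missing.get("enrichments") or []

# Iterating over the `or []` form is always safe.
for enrichment_url in with_or:
    print(enrichment_url)
```

One caveat with `or []`: it also swallows other falsy values such as `0` or `""`, which is fine here because the value is expected to be a list or None.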

amywieliczka

This looks good to me, but similarly confused by the replacement of <dict>.get(<key>, []) with <dict>.get(<key>) or []

amywieliczka and others added 9 commits January 24, 2024 14:21

update fetcher docs, check_page signature

635ef5c

CalisphereSolrFetcher for Rikolti ETL jobs

93f9d04

Add auth to CalisphereSolrFetcher

afb023f

update request endpoint from /select? to /query?

70e2571

Beginnings of calisphere solr mapper

55bdaca

Calisphere Solr mapper is functional (no validator yet)

f32803f

Add thumbnail_source mapping and validator

4263e3a

Make use of remove_validatable_field() function

da142b1

Merge branch 'main' into calisphere-etl

e76a597

barbarahui requested a review from amywieliczka as a code owner January 27, 2024 00:49

barbarahui added the core feature MVP label Jan 27, 2024

barbarahui linked an issue Jan 27, 2024 that may be closed by this pull request

Solr to ElasticSearch ETL development #657

Closed

8 tasks

barbarahui added 5 commits January 26, 2024 17:00

Remove unused import

1d8f2be

Add workaround for solr returning an extra page with zero docs

2fe62d0

Fix logic

f0511e5

Cleanup dict.get() syntax

1e8cdef

Fix typo

b0c1ade

bibliotechy approved these changes Jan 29, 2024


amywieliczka approved these changes Jan 29, 2024


barbarahui merged commit 4092654 into main Jan 30, 2024
2 checks passed

barbarahui deleted the calisphere-etl branch January 30, 2024 00:21

christinklez added this to the Migration Planning milestone Mar 18, 2024
