Calisphere etl #714
Merged
14 commits
635ef5c update fetcher docs, check_page signature (amywieliczka)
93f9d04 CalisphereSolrFetcher for Rikolti ETL jobs (amywieliczka)
afb023f Add auth to CalisphereSolrFetcher (amywieliczka)
70e2571 update request endpoint from /select? to /query? (amywieliczka)
55bdaca Beginnings of calisphere solr mapper (barbarahui)
f32803f Calisphere Solr mapper is functional (no validator yet) (barbarahui)
4263e3a Add thumbnail_source mapping and validator (barbarahui)
da142b1 Make use of remove_validatable_field() function (barbarahui)
e76a597 Merge branch 'main' into calisphere-etl (barbarahui)
1d8f2be Remove unused import (barbarahui)
2fe62d0 Add workaround for solr returning an extra page with zero docs (barbarahui)
f0511e5 Fix logic (barbarahui)
1e8cdef Cleanup dict.get() syntax (barbarahui)
b0c1ade Fix typo (barbarahui)
@@ -0,0 +1,99 @@
import json
import requests

from .Fetcher import Fetcher
from ..settings import CALISPHERE_ETL_TOKEN

class CalisphereSolrFetcher(Fetcher):
    def __init__(self, params: dict[str, str]):
        super(CalisphereSolrFetcher, self).__init__(params)
        self.collection_id = params.get("collection_id")
        self.cursor_mark = params.get("cursor_mark", "*")
        self.num_found = params.get("num_found", 0)
        self.num_fetched = params.get("num_fetched", 0)

    def build_fetch_request(self) -> dict[str, str]:
        """
        Generates arguments for `requests.get()`.

        Returns: dict[str, str]
        """
        params = {
            "fq": (
                "collection_url:\"https://registry.cdlib.org/api/v1/"
                f"collection/{self.collection_id}/\""
            ),
            "rows": 100,
            "cursorMark": self.cursor_mark,
            "wt": "json",
            "sort": "id asc"
        }

        request = {
            "url": "https://solr.calisphere.org/solr/query",
            "headers": {'X-Authentication-Token': CALISPHERE_ETL_TOKEN},
            "params": params
        }

        print(
            f"[{self.collection_id}]: Fetching page {self.write_page} "
            f"at {request.get('url')} with params {params}")

        return request

    def check_page(self, http_resp: requests.Response) -> int:
        """
        Parameters:
            http_resp: requests.Response

        Returns: int: number of records in the response
        """

        resp_dict = http_resp.json()
        hits = len(resp_dict["response"]["docs"])

        print(
            f"[{self.collection_id}]: Fetched page {self.write_page} "
            f"at {http_resp.url} with {hits} hits"
        )

        return hits

    def increment(self, http_resp: requests.Response):
        """
        Sets the `next_url` to fetch and increments the page number.

        Parameters:
            http_resp: requests.Response
        """
        super(CalisphereSolrFetcher, self).increment(http_resp)
        resp_dict = http_resp.json()

        # this is a workaround for solr giving us an extra page
        # with zero docs after the last page of results
        self.num_found = resp_dict["response"]["numFound"]
        self.num_fetched = self.num_fetched + len(resp_dict["response"]["docs"])
        if self.cursor_mark != resp_dict["nextCursorMark"] \
                and self.num_fetched != self.num_found:
            self.cursor_mark = resp_dict["nextCursorMark"]
            self.finished = False
        else:
            self.finished = True

    def json(self) -> str:
        """
        Generates JSON for the next page of results.

        Returns: str
        """
        current_state = {
            "harvest_type": self.harvest_type,
            "collection_id": self.collection_id,
            "write_page": self.write_page,
            "cursor_mark": self.cursor_mark,
            "num_found": self.num_found,
            "num_fetched": self.num_fetched,
            "finished": self.finished
        }

        return json.dumps(current_state)
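The stop condition in `increment` can be exercised without a live Solr server. The following is a stand-alone sketch rather than code from this PR: `should_continue` and the `pages` list are hypothetical names, the cursor values are made up, and the fake responses assume Solr repeats the same `nextCursorMark` once the cursor is exhausted.

```python
def should_continue(cursor_mark: str, resp_dict: dict, num_fetched: int) -> bool:
    """Mirror the check in increment(): keep paging only while the cursor
    still advances and fewer than numFound docs have been fetched."""
    num_found = resp_dict["response"]["numFound"]
    return (
        cursor_mark != resp_dict["nextCursorMark"]
        and num_fetched != num_found
    )


# Hypothetical Solr responses for a collection of 250 docs fetched 100 at a time.
pages = [
    {"response": {"numFound": 250, "docs": [{}] * 100}, "nextCursorMark": "A"},
    {"response": {"numFound": 250, "docs": [{}] * 100}, "nextCursorMark": "B"},
    {"response": {"numFound": 250, "docs": [{}] * 50}, "nextCursorMark": "C"},
    # The extra zero-doc page that the workaround is meant to avoid requesting.
    {"response": {"numFound": 250, "docs": []}, "nextCursorMark": "C"},
]

cursor_mark, num_fetched, pages_requested = "*", 0, 0
for resp_dict in pages:
    pages_requested += 1
    num_fetched += len(resp_dict["response"]["docs"])
    if not should_continue(cursor_mark, resp_dict, num_fetched):
        break  # corresponds to self.finished = True
    cursor_mark = resp_dict["nextCursorMark"]

print(pages_requested, num_fetched)
```

In this sketch the loop stops after the third page because `num_fetched` has reached `numFound`, so the fourth, empty page is never requested. Comparing cursor marks alone would only detect the end after that extra request.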
metadata_mapper/mappers/calisphere_solr/calisphere_solr_mapper.py
87 changes: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
import json

from ..mapper import Record, Vernacular, Validator

class CalisphereSolrRecord(Record):
    # This mapper does not handle Nuxeo record complexities, meaning:
    # - it ignores structmap* solr fields for complex objects
    # - it does not map media_source
    def UCLDC_map(self) -> dict:
        return {
            "calisphere-id": self.map_calisphere_id(),
            "is_shown_at": self.source_metadata.get("url_item"),
            "thumbnail_source": self.map_thumbnail_source(),
            "title": self.source_metadata.get("title"),
            "alternative_title": self.source_metadata.get("alternative_title", None),
            "contributor": self.source_metadata.get("contributor", None),
            "coverage": self.source_metadata.get("coverage", None),
            "creator": self.source_metadata.get("creator", None),
            "date": self.source_metadata.get("date", None),
            "extent": self.source_metadata.get("extent", None),
            "format": self.source_metadata.get("format", None),
            "genre": self.source_metadata.get("genre", None),
            "identifier": self.source_metadata.get("identifier", None),
            "language": self.source_metadata.get("language", None),
            "location": self.source_metadata.get("location", None),
            "publisher": self.source_metadata.get("publisher", None),
            "relation": self.source_metadata.get("relation", None),
            "rights": self.source_metadata.get("rights", None),
            "rights_holder": self.source_metadata.get("rights_holder", None),
            "rights_note": self.source_metadata.get("rights_note", None),
            "rights_date": self.source_metadata.get("rights_date", None),
            "source": self.source_metadata.get("source", None),
            "spatial": self.source_metadata.get("spatial", None),
            "subject": self.source_metadata.get("subject", None),
            "temporal": self.source_metadata.get("temporal", None),
            "type": self.source_metadata.get("type", None),
            "sort_title": self.source_metadata.get("sort_title", None),
            "description": self.source_metadata.get("description", None),
            "provenance": self.source_metadata.get("provenance", None),
            "transcription": self.source_metadata.get("transcription", None),
            "id": self.source_metadata.get("id", None),
            "campus_name": self.source_metadata.get("campus_name", None),
            "campus_data": self.source_metadata.get("campus_data", None),
            "collection_name": self.source_metadata.get("collection_name", None),
            "collection_data": self.source_metadata.get("collection_data", None),
            "collection_url": self.source_metadata.get("collection_url", None),
            "sort_collection_data": self.source_metadata.get("sort_collection_data", None),
            "repository_name": self.source_metadata.get("repository_name", None),
            "repository_data": self.source_metadata.get("repository_data", None),
            "repository_url": self.source_metadata.get("repository_url", None),
            "rights_uri": self.source_metadata.get("rights_uri", None),
            "manifest": self.source_metadata.get("manifest", None),
            "object_template": self.source_metadata.get("object_template", None),
            "url_item": self.source_metadata.get("url_item", None),
            "created": self.source_metadata.get("created", None),
            "last_modified": self.source_metadata.get("last_modified", None),
            "sort_date_start": self.source_metadata.get("sort_date_start", None),
            "sort_date_end": self.source_metadata.get("sort_date_end", None),
            "campus_id": self.source_metadata.get("campus_id", None),
            "collection_id": self.source_metadata.get("collection_id", None),
            "repository_id": self.source_metadata.get("repository_id", None),
            "item_count": self.source_metadata.get("item_count", None),
            "reference_image_md5": self.source_metadata.get("reference_image_md5", None),
            "reference_image_dimensions": self.source_metadata.get("reference_image_dimensions", None),
        }

    def map_calisphere_id(self):
        harvest_id = self.source_metadata.get('harvest_id_s')
        return harvest_id.split("--")[1]

    def map_thumbnail_source(self):
        image_md5 = self.source_metadata.get("reference_image_md5", None)
        if image_md5:
            return f"https://static-ucldc-cdlib-org.s3.us-west-2.amazonaws.com/harvested_images/{image_md5}"

class CalisphereSolrValidator(Validator):
    def setup(self):
        self.remove_validatable_field(field="is_shown_by")

class CalisphereSolrVernacular(Vernacular):
    record_cls = CalisphereSolrRecord
    validator = CalisphereSolrValidator

    def parse(self, api_response):
        page_element = json.loads(api_response)
        records = page_element.get("response", {}).get("docs", [])
        return self.get_records([record for record in records])
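To make the two custom mappings concrete, here is a stand-alone sketch that applies the same steps to one fake Solr page. It is an illustration under assumptions, not the mapper itself: the helper names `calisphere_id` and `thumbnail_source` are hypothetical, every field value is made up, and the `<collection>--<record>` shape of `harvest_id_s` is inferred only from the `split("--")[1]` call in the diff. Unlike `map_calisphere_id`, the sketch returns `None` instead of raising when `harvest_id_s` is missing or has no `--`.

```python
import json

# Hypothetical page of Solr results; all field values are made up.
api_response = json.dumps({
    "response": {
        "numFound": 1,
        "docs": [{
            "harvest_id_s": "12345--abc-001",
            "reference_image_md5": "0123456789abcdef0123456789abcdef",
            "title": ["Example title"],
        }],
    }
})

THUMBNAIL_BASE = (
    "https://static-ucldc-cdlib-org.s3.us-west-2.amazonaws.com/harvested_images/"
)


def calisphere_id(doc: dict):
    """Return the segment after the first "--", or None if it is unavailable."""
    parts = (doc.get("harvest_id_s") or "").split("--")
    return parts[1] if len(parts) > 1 else None


def thumbnail_source(doc: dict):
    """Build the thumbnail URL only when an image md5 is present."""
    image_md5 = doc.get("reference_image_md5")
    return f"{THUMBNAIL_BASE}{image_md5}" if image_md5 else None


# Same extraction as parse(): response -> docs, defaulting to an empty list.
docs = json.loads(api_response).get("response", {}).get("docs", [])
mapped = [
    {"calisphere-id": calisphere_id(d), "thumbnail_source": thumbnail_source(d)}
    for d in docs
]
print(mapped)
```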
You've done this in several places, replacing `collection.get(some_value, [])` with `collection.get(some_value) or []`. Is there a difference in behavior between those two? It isn't obvious to me what that is. No issue with it, just wondering.

@amywieliczka @bibliotechy so yeah, it turns out that if the key exists and the value is explicitly `None` then this is what happens:
Whereas:
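A minimal sketch of the difference being described, using a made-up `record` dict (the key name and values are illustrative and not taken from the PR):

```python
# Hypothetical record where the key is present but its value is None.
record = {"subject": None}

# The default is only used when the key is missing, so None comes back.
with_default = record.get("subject", [])

# None is falsy, so `or` swaps in the empty list.
with_or = record.get("subject") or []

print(with_default)  # None
print(with_or)       # []

# When the key is missing entirely, both forms agree.
print({}.get("subject", []), {}.get("subject") or [])  # [] []

# Caveat: `or` also replaces other falsy values such as "" or 0.
print({"subject": ""}.get("subject") or [])  # []
```

So the `or []` form is the safer choice when later code iterates over the result, provided that treating every falsy value as "empty" is acceptable for that field.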