This would need to:
- for each page of Solr objects, map the fields to the same fields in OpenSearch (for the most part, this is a direct mapping)
- stash these "mapped" pages to s3?
- migrate thumbnails from the legacy s3 location (in the dsc account) to the new s3 thumbnail location (in the pad account)
- for each page of mapped metadata, bulk index to OpenSearch
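The per-page map-and-index step above could be sketched as below. The field names (`title_ss`, `collection_url`, `reference_image_md5`) and the index name are assumptions for illustration; the real mapping would mirror the Solr schema field-for-field.

```python
def map_solr_record(solr_doc: dict) -> dict:
    """Map one Solr document to an OpenSearch document (mostly 1:1).

    Field names here are hypothetical placeholders for the real schema.
    """
    return {
        "id": solr_doc["id"],
        "title": solr_doc.get("title_ss", []),
        "collection_url": solr_doc.get("collection_url"),
        "reference_image_md5": solr_doc.get("reference_image_md5"),
        # ...remaining fields copied straight across...
    }


def bulk_actions(page: list, index: str = "rikolti-stage"):
    """Yield opensearch-py style bulk actions for one mapped page."""
    for doc in (map_solr_record(d) for d in page):
        yield {"_index": index, "_id": doc["id"], "_source": doc}
```

A page of these actions could then be handed to `opensearchpy.helpers.bulk` for indexing.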
Dev Work:
- add a "Rikolti ETL" value to the registry for a collection's harvest type field
- update all collections to use the "Rikolti ETL" harvest type
- write a "CalisphereSolrFetcher" that queries the Solr index given a collection ID from the registry
- update all collections to have a single enrichments item: /dpla-mapper?mapper_type=rikolti_etl
- write a "rikolti_etl" mapper that implements a very straightforward mapping
- write a new migrate_thumbnails_task that reads every page of mapped metadata, finds each thumbnail in our s3 bucket at the md5 listed in the metadata, and copies it over to the new s3 bucket
- create a new "rikolti_etl" Airflow DAG that runs the rikolti fetcher, the rikolti mapper, the new thumbnail migration task, and the rikolti record indexer tasks

Open question: should we bulk process on the Airflow side, on the registry side, or run by collection?
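A "CalisphereSolrFetcher" could page through a collection with Solr's cursorMark deep paging, which needs a sort on the uniqueKey field. The Solr URL shape and the `collection_url` query field are assumptions; the `get` parameter is injectable so the pagination logic can be exercised without a live Solr.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_collection_pages(solr_url: str, collection_id: str,
                           page_size: int = 100, get=None):
    """Yield pages of Solr docs for one collection via cursorMark paging.

    `get` takes a URL and returns the decoded JSON response; it defaults
    to a plain urlopen-based fetch.
    """
    get = get or (lambda url: json.load(urlopen(url)))
    cursor = "*"
    while True:
        params = urlencode({
            "q": f'collection_url:"{collection_id}"',  # assumed query field
            "rows": page_size,
            "sort": "id asc",  # cursorMark requires a uniqueKey sort
            "cursorMark": cursor,
            "wt": "json",
        })
        resp = get(f"{solr_url}/select?{params}")
        docs = resp["response"]["docs"]
        if docs:
            yield docs
        next_cursor = resp["nextCursorMark"]
        if next_cursor == cursor:  # cursor stopped advancing: no more pages
            break
        cursor = next_cursor
```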
Airflow has access to Solr queries, to the Rikolti s3 buckets, and to the Rikolti OpenSearch instance. The thumbnail s3 bucket is open to the public, so Airflow should be able to access it as well, making Airflow a well-positioned piece of infrastructure for running this job.
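The thumbnail migration step could then build its copy list from the mapped metadata, as in this sketch. The bucket names and the assumption that thumbnails are keyed by their md5 are placeholders, not the real layout.

```python
def thumbnail_copies(mapped_page: list,
                     src_bucket: str = "legacy-dsc-thumbnails",
                     dest_bucket: str = "rikolti-thumbnails"):
    """For each record with a thumbnail md5, return the (source,
    destination) S3 locations for the copy. Bucket names are hypothetical.
    """
    copies = []
    for record in mapped_page:
        md5 = record.get("reference_image_md5")
        if md5:
            copies.append((
                {"Bucket": src_bucket, "Key": md5},
                {"Bucket": dest_bucket, "Key": md5},
            ))
    return copies


# migrate_thumbnails_task would then loop over these, e.g. with boto3:
#   s3 = boto3.client("s3")
#   for src, dest in thumbnail_copies(page):
#       s3.copy(src, dest["Bucket"], dest["Key"])
```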