Solr to ElasticSearch ETL development #657

Closed
4 of 8 tasks
christinklez opened this issue Nov 30, 2023 · 2 comments · Fixed by #714

christinklez commented Nov 30, 2023

This would need to:

  1. run a paginated Solr query for all objects in a particular collection, effectively the same query that powers this page: https://calisphere.org/collections/26233/
  2. stash these "fetched" pages to s3?
  3. for each page of Solr objects, map the fields to the same fields in OpenSearch (for the most part, this is a direct mapping)
  4. stash these "mapped" pages to s3?
  5. migrate thumbnails from the legacy s3 location (in dsc acct) to the new s3 thumbnail location (in pad acct)
  6. for each page of mapped metadata, bulk index to OpenSearch
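
A minimal sketch of steps 1, 3, and 6, assuming Solr's `cursorMark` deep paging and the OpenSearch `_bulk` NDJSON format. The `query_solr` callable, the `collection_url` filter field, the `id` unique key, and the `FIELD_RENAMES` table are all hypothetical stand-ins, not names taken from the Rikolti codebase:

```python
import json
from typing import Callable, Dict, Iterator, List

# Hypothetical rename table: most fields map directly, so only exceptions go here.
FIELD_RENAMES: Dict[str, str] = {"reference_image_md5": "thumbnail_md5"}


def fetch_pages(query_solr: Callable[[dict], dict],
                collection_id: str,
                rows: int = 100) -> Iterator[List[dict]]:
    """Yield pages of Solr docs for one collection using cursorMark paging.

    `query_solr` is an assumed helper that sends the params to Solr's
    /select handler and returns the parsed JSON response.
    """
    cursor = "*"  # "*" asks Solr to start a new cursor
    while True:
        resp = query_solr({
            "q": f"collection_url:{collection_id}",  # assumed filter field
            "rows": rows,
            "sort": "id asc",  # cursorMark requires a sort on the unique key
            "cursorMark": cursor,
        })
        docs = resp["response"]["docs"]
        if docs:
            yield docs
        next_cursor = resp["nextCursorMark"]
        if next_cursor == cursor:  # an unchanged cursor means no more results
            break
        cursor = next_cursor


def map_record(doc: dict) -> dict:
    """Copy every field through, renaming only those listed in FIELD_RENAMES."""
    return {FIELD_RENAMES.get(key, key): value for key, value in doc.items()}


def bulk_body(records: List[dict], index: str) -> str:
    """Build a newline-delimited _bulk request body for one mapped page."""
    lines = []
    for record in records:
        lines.append(json.dumps({"index": {"_index": index, "_id": record["id"]}}))
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline
```

Stashing the "fetched" and "mapped" pages to s3 (steps 2 and 4) would slot in between these calls. Passing `query_solr` in as a callable keeps the paging logic testable without a live Solr instance.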

Dev Work:

  • add a "Rikolti ETL" value to the registry for a collection's harvest type field
  • update all the collections to have a "Rikolti ETL" harvest type
  • write a "CalisphereSolrFetcher" that queries the Solr index given a collection ID from the registry
  • modify all the collections to have an enrichments list of exactly one item: /dpla-mapper?mapper_type=rikolti_etl
  • write a "rikolti_etl" mapper that implements a very straightforward mapping.
  • write a new migrate_thumbnails_task that reads every page of mapped metadata, finds each thumbnail in our s3 bucket at the md5 listed in the metadata, and copies it to the new s3 bucket.
  • create a new "rikolti_etl" Airflow DAG that runs the rikolti fetcher, rikolti mapper, the new thumbnail migration task, and the rikolti record indexer tasks.
  • bulk process on airflow side, on registry side, or run by collection?
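
As a rough illustration of the thumbnail migration task, the sketch below copies one page's thumbnails between buckets. It assumes a boto3-style S3 client is passed in (only `copy_object` is called), that the source object key is the bare md5, and that the mapped metadata stores the md5 under a hypothetical `thumbnail_md5` field. The destination key layout is likewise an assumption:

```python
from typing import Iterable, List


def migrate_thumbnails(s3_client,
                       mapped_page: Iterable[dict],
                       src_bucket: str,
                       dest_bucket: str,
                       md5_field: str = "thumbnail_md5",  # hypothetical field name
                       dest_prefix: str = "") -> List[str]:
    """Copy each record's thumbnail from the legacy bucket to the new one.

    Returns the md5s that were copied. Records without an md5 are skipped.
    """
    copied = []
    for record in mapped_page:
        md5 = record.get(md5_field)
        if not md5:
            continue  # no thumbnail recorded for this object
        s3_client.copy_object(
            Bucket=dest_bucket,
            Key=f"{dest_prefix}{md5}",  # assumed destination key layout
            CopySource={"Bucket": src_bucket, "Key": md5},  # assumed source key
        )
        copied.append(md5)
    return copied
```

Because the two buckets live in different accounts, the credentials used here would need read access on the source and write access on the destination. An Airflow task could call this once per mapped page.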

Airflow can query Solr and has access to the Rikolti s3 buckets and the Rikolti OpenSearch instance. The thumbnail s3 bucket is open to the public, so Airflow should be able to read from it as well. That makes Airflow a well-positioned piece of infrastructure for running this job.

@amywieliczka

Punting to 2024. Alas.

amywieliczka commented Jan 19, 2024

[moved to issue description]

@barbarahui barbarahui linked a pull request Jan 27, 2024 that will close this issue