Commit 77521ae

Merge branch 'main' into barbara/csudh-call-collate_field

barbarahui committed Nov 28, 2023
2 parents c4fbb12 + e757872
Showing 41 changed files with 1,333 additions and 1,167 deletions.
5 changes: 3 additions & 2 deletions content_harvester/Dockerfile → Dockerfile.content_harvester
@@ -9,11 +9,12 @@ RUN sed -i 's/<policy domain="coder" rights="none" pattern="PDF" \/>/<!--<policy

WORKDIR /

-COPY requirements.txt ./
+COPY content_harvester/requirements.txt ./

RUN pip install --upgrade pip && pip install -r requirements.txt

-COPY ./ /content_harvester
+COPY content_harvester/ /content_harvester
+COPY utils/ /rikolti/utils

RUN chmod +x /content_harvester/by_collection.py

File renamed without changes.
30 changes: 15 additions & 15 deletions README.md
@@ -48,20 +48,18 @@ vi env.local

Currently, I only use one virtual environment, even though each folder located at the root of this repository represents an isolated component. If dependency conflicts are encountered, I'll wind up creating separate environments.

-Similarly, I use only one env.local. Rikolti fetches data to your local system, maps that data, and then fetches relevant content files (media files, previews, and thumbnails). Set `FETCHER_DATA_DEST` to the URI where you would like Rikolti to store fetched data - Rikolti will create a folder (or s3 prefix) `<collection_id>/vernacular_metadata` at this location. Set `MAPPER_DATA_SRC` to the URI where Rikolti can find a `<collection_id>/vernacular_metadata` folder that contains the fetched data you're attempting to map. Set `MAPPER_DATA_DEST` to the URI where you would like Rikolti to store mapped data - Rikolti will create a folder (or s3 prefix) `<collection_id>/mapped_metadata` at this location. Set `CONTENT_DATA_SRC` to the URI where Rikolti can find a `<collection_id>/mapped_metadata` folder that contains the mapped metadata describing where to find content. Set `CONTENT_DATA_DEST` to the URI where you would like Rikolti to store mapped data that has been updated with pointers to content files - Rikolti will create a folder (or s3 prefix) `<collection_id>/mapped_with_content` at this location. Set `CONTENT_DEST` to the URI where you would like Rikolti to store content files.
+Similarly, I use only one env.local. Rikolti fetches data to your local system, maps that data, and then fetches relevant content files (media files, previews, and thumbnails). Set `VERNACULAR_DATA` to the URI where you would like Rikolti to store and retrieve fetched data - Rikolti will create a folder (or s3 prefix) `<collection_id>/vernacular_metadata` at this location. Set `MAPPED_DATA` to the URI where you would like Rikolti to store and retrieve mapped data - Rikolti will create a folder (or s3 prefix) `<collection_id>/mapped_metadata` at this location. Set `CONTENT_DATA` to the URI where you would like Rikolti to store mapped data that has been updated with pointers to content files - Rikolti will create a folder (or s3 prefix) `<collection_id>/mapped_with_content` at this location. Set `CONTENT_ROOT` to the URI where you would like Rikolti to store content files.

For example, one way to configure `env.local` is:

```
-FETCHER_DATA_DEST=file:///Users/awieliczka/Projects/rikolti/rikolti_data
-MAPPER_DATA_SRC=$FETCHER_DATA_DEST
-MAPPER_DATA_DEST=$FETCHER_DATA_DEST
-CONTENT_DATA_SRC=$FETCHER_DATA_DEST
-CONTENT_DATA_DEST=$FETCHER_DATA_DEST
-CONTENT_DEST=file:///Users/awieliczka/Projects/rikolti/rikolti_content
+VERNACULAR_DATA=file:///Users/awieliczka/Projects/rikolti/rikolti_data
+MAPPED_DATA=$VERNACULAR_DATA
+CONTENT_DATA=$VERNACULAR_DATA
+CONTENT_ROOT=file:///Users/awieliczka/Projects/rikolti/rikolti_content
```

-Each of these can be different locations, however. For example, if you're attempting to re-run a mapper locally off of previously fetched data stored on s3, you might set `MAPPER_DATA_SRC=s3://rikolti_data`.
+Each of these can be different locations, however. For example, if you're attempting to re-run a mapper locally off of previously fetched data stored on s3, you might set `VERNACULAR_DATA=s3://rikolti_data`.
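
For instance, a hypothetical env.local that reads previously fetched data from s3 but writes mapped data and content locally might look like this (the paths are illustrative):

```
VERNACULAR_DATA=s3://rikolti_data
MAPPED_DATA=file:///Users/awieliczka/Projects/rikolti/rikolti_data
CONTENT_DATA=file:///Users/awieliczka/Projects/rikolti/rikolti_data
CONTENT_ROOT=file:///Users/awieliczka/Projects/rikolti/rikolti_content
```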

In env.example you'll also see `CONTENT_DATA_MOUNT` and `CONTENT_MOUNT` environment variables. These are only relevant if you are running the content harvester using airflow and want to set any of the `CONTENT_*` environment variables to the local filesystem. Their usage is described below in the Airflow Development section.

@@ -172,9 +170,8 @@ The docker socket will typically be at `/var/run/docker.sock`. On Mac OS Docker
Next, back in the Rikolti repository, create the `startup.sh` file by running `cp env.example dags/startup.sh`. Update the startup.sh file with Nuxeo, Flickr, and Solr keys as available, and make sure that the following environment variables are set:

```
-export FETCHER_DATA_DEST=file:///usr/local/airflow/rikolti_data
-export MAPPER_DATA_SRC=file:///usr/local/airflow/rikolti_data
-export MAPPER_DATA_DEST=file:///usr/local/airflow/rikolti_data
+export VERNACULAR_DATA=file:///usr/local/airflow/rikolti_data
+export MAPPED_DATA=file:///usr/local/airflow/rikolti_data
```

The folder located at `RIKOLTI_DATA_HOME` (set in `aws-mwaa-local-runner/docker/.env`) is mounted to `/usr/local/airflow/rikolti_data` on the airflow docker container.
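
For example, a minimal `aws-mwaa-local-runner/docker/.env` entry might look like this (the path is an assumed example, not prescribed by this repo):

```
RIKOLTI_DATA_HOME=/Users/awieliczka/Projects/rikolti_data
```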
@@ -184,9 +181,8 @@ Please also make sure the following `CONTENT_*` variables are set - `CONTENT_DAT
```
export CONTENT_DATA_MOUNT=/Users/awieliczka/Projects/rikolti_data
export CONTENT_MOUNT=/Users/awieliczka/Projects/rikolti_content
-export CONTENT_DATA_SRC=file:///rikolti_data
-export CONTENT_DATA_DEST=file:///rikolti_data
-export CONTENT_DEST=file:///rikolti_content
+export CONTENT_DATA=file:///rikolti_data
+export CONTENT_ROOT=file:///rikolti_content
```

The folder located at `CONTENT_DATA_MOUNT` is mounted to `/rikolti_data` and the folder located at `CONTENT_MOUNT` is mounted to `/rikolti_content` on the content_harvester docker container.
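
Expressed as docker CLI flags, the equivalent mounts would look roughly like the following sketch (illustrative only - in practice the content_harvester container is launched for you):

```
docker run \
  -v /Users/awieliczka/Projects/rikolti_data:/rikolti_data \
  -v /Users/awieliczka/Projects/rikolti_content:/rikolti_content \
  rikolti/content_harvester
```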
@@ -197,7 +193,11 @@ If you would like to run the content harvester on AWS infrastructure using the E

> A note about Docker vs. ECS: Since we do not actively maintain our own Docker daemon, and since MWAA workers do not come with a Docker daemon installed, we cannot use a docker execution environment in deployed MWAA and instead use ECS to run our content harvester containers on Fargate infrastructure. The EcsRunTaskOperator allows us to run a pre-defined ECS Task Definition. The EcsRegisterTaskDefinitionOperator allows us to define an ECS Task Definition which we could then run. At this time, we are defining the Task Definition in our [cloudformation templates](https://github.com/cdlib/pad-airflow), rather than using the EcsRegisterTaskDefinitionOperator, but this does mean that we cannot modify the container's image or version using the EcsRunTaskOperator.
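
For reference, a minimal `EcsRunTaskOperator` invocation looks roughly like the sketch below. The cluster, task definition, and container names here are hypothetical placeholders, not the values defined in the cloudformation templates:

```python
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

# Sketch only: cluster, task definition, and container names are hypothetical.
content_harvest = EcsRunTaskOperator(
    task_id="content_harvest",
    cluster="rikolti-cluster",            # hypothetical ECS cluster
    task_definition="content_harvester",  # pre-defined Task Definition (see cloudformation templates)
    launch_type="FARGATE",
    overrides={
        "containerOverrides": [
            {
                "name": "content_harvester",
                "command": ["python3", "-m", "content_harvester.by_collection"],
            }
        ]
    },
)
```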
-If you would like to run your own rikolti/content_harvester image instead of pulling the image from AWS, then from inside the Rikolti repo, run `docker build -t rikolti/content_harvester content_harvester` to build the `rikolti/content_harvester` image locally and update the `content_harvester_image` to be `rikolti/content_harvester`.
+If you would like to run your own rikolti/content_harvester image instead of pulling the image from AWS, then from inside the Rikolti repo, run `docker build -f Dockerfile.content_harvester -t rikolti/content_harvester .` to build the `rikolti/content_harvester` image locally and add the following line to `dags/startup.sh` to update `CONTENT_HARVEST_IMAGE` to be `rikolti/content_harvester`:

```
export CONTENT_HARVEST_IMAGE=rikolti/content_harvester
```

Finally, from inside the aws-mwaa-local-runner repo, run `./mwaa-local-env build-image` to build the docker image, and `./mwaa-local-env start` to start the mwaa local environment.

8 changes: 5 additions & 3 deletions content_harvester/README.md
@@ -34,12 +34,14 @@ The above media and thumbnail fetching processes are enacted upon child metadata

# Settings

-You can bypass uploading to s3 by setting `settings.CONTENT_DATA_DEST = "file://<local_path>"` and `settings.CONTENT_DEST = "file://<local_path>"`. This is useful for local development and testing. This will, however, set the metadata records' `media['media_filepath']` and `thumbnail['thumbnail_filepath']` to a local filepath.
+You can bypass uploading to s3 by setting `settings.CONTENT_DATA = "file://<local_path>"` and `settings.CONTENT_ROOT = "file://<local_path>"`. This is useful for local development and testing. This will, however, set the metadata records' `media['media_filepath']` and `thumbnail['thumbnail_filepath']` to a local filepath.
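
For example, in env.local (the paths are illustrative):

```
CONTENT_DATA=file:///Users/awieliczka/Projects/rikolti/rikolti_data
CONTENT_ROOT=file:///Users/awieliczka/Projects/rikolti/rikolti_content
```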

# Local Development

From inside the rikolti folder:
```
-docker build -t rikolti/content_harvester .
+docker build -f Dockerfile.content_harvester -t rikolti/content_harvester .
+cd content_harvester
docker compose run --entrypoint "python3 -m content_harvester.by_registry_endpoint" --rm content_harvester https://registry.cdlib.org/api/v1/rikoltimapper/26147/?format=json
```

@@ -59,7 +61,7 @@ To build manually: From a terminal with AWS credentials, get login password for
```
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/b6c7x7s4
docker buildx create --use
-docker buildx build --platform linux/arm64,linux/amd64 -t public.ecr.aws/b6c7x7s4/rikolti/content_harvester content_harvester --push
+docker buildx build -f Dockerfile.content_harvester --platform linux/arm64,linux/amd64 -t public.ecr.aws/b6c7x7s4/rikolti/content_harvester . --push
```

# TODO:
57 changes: 21 additions & 36 deletions content_harvester/by_collection.py
@@ -1,54 +1,38 @@
import json
-import os

-import boto3

-from . import settings
from .by_page import harvest_page_content


-def get_mapped_pages(collection_id):
-    page_list = []
-    if settings.DATA_SRC['STORE'] == 'file':
-        mapped_path = settings.local_path(collection_id, 'mapped_metadata')
-        try:
-            page_list = [f for f in os.listdir(mapped_path)
-                         if os.path.isfile(os.path.join(mapped_path, f))]
-        except FileNotFoundError as e:
-            print(f"{e} - have you mapped {collection_id}?")
-    else:
-        s3_client = boto3.client(
-            's3',
-            aws_access_key_id=settings.AWS_ACCESS_KEY_ID,
-            aws_secret_access_key=settings.AWS_SECRET_ACCESS_KEY,
-            aws_session_token=settings.AWS_SESSION_TOKEN,
-            region_name=settings.AWS_REGION
-        )
-        response = s3_client.list_objects_v2(
-            Bucket=settings.DATA_SRC["BUCKET"],
-            Prefix=f'{collection_id}/mapped_metadata/'
-        )
-        page_list = [obj['Key'].split('/')[-1] for obj in response['Contents']]
-    return page_list
+from . import settings
+from rikolti.utils.versions import get_mapped_pages, create_content_data_version


# {"collection_id": 26098, "rikolti_mapper_type": "nuxeo.nuxeo"}
-def harvest_collection(collection):
+def harvest_collection(collection, mapped_data_version: str):
    if isinstance(collection, str):
        collection = json.loads(collection)

    collection_id = collection.get('collection_id')

-    if not collection_id:
-        print("ERROR ERROR ERROR\ncollection_id required")
+    if not collection_id or not mapped_data_version:
+        print("ERROR ERROR ERROR\ncollection_id and mapped_data_version required")
        exit()

-    page_list = get_mapped_pages(collection_id)
+    page_list = get_mapped_pages(
+        mapped_data_version,
+        aws_access_key_id=settings.AWS_ACCESS_KEY_ID,
+        aws_secret_access_key=settings.AWS_SECRET_ACCESS_KEY,
+        aws_session_token=settings.AWS_SESSION_TOKEN,
+        region_name=settings.AWS_REGION
+    )

    print(f"[{collection_id}]: Harvesting content for {len(page_list)} pages")
    collection_stats = {}
-    for page in page_list:
-        collection.update({'page_filename': page})

+    collection.update({
+        'content_data_version': create_content_data_version(mapped_data_version)
+    })

+    for page_path in page_list:
+        collection.update({'mapped_page_path': page_path})
        page_stats = harvest_page_content(**collection)

        # in some cases, value is int and in some cases, value is Counter
@@ -71,11 +55,12 @@ def harvest_collection(collection):
    parser = argparse.ArgumentParser(
        description="Harvest content by collection using mapped metadata")
    parser.add_argument('collection_id', help="Collection ID")
+    parser.add_argument('mapped_data_version', help="URI to mapped data version: ex: s3://rikolti-data-root/3433/vernacular_data_version_1/mapped_data_version_2/")
    parser.add_argument('--nuxeo', action="store_true", help="Use Nuxeo auth")
    args = parser.parse_args()
    arguments = {
        'collection_id': args.collection_id,
    }
    if args.nuxeo:
        arguments['rikolti_mapper_type'] = 'nuxeo.nuxeo'
-    print(harvest_collection(arguments))
+    print(harvest_collection(arguments, args.mapped_data_version))
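
With the new signature, a run of this module might look like the following (the collection id and URI are illustrative, echoing the argparse help text; the module is invoked with `-m` since it uses relative imports):

```
python3 -m content_harvester.by_collection 3433 s3://rikolti-data-root/3433/vernacular_data_version_1/mapped_data_version_2/
```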