Commit 77521ae

Merge branch 'main' into barbara/csudh-call-collate_field

barbarahui committed Nov 28, 2023
2 parents c4fbb12 + e757872
Showing 41 changed files with 1,333 additions and 1,167 deletions.
5 changes: 3 additions & 2 deletions content_harvester/Dockerfile → Dockerfile.content_harvester
@@ -9,11 +9,12 @@ RUN sed -i 's/<policy domain="coder" rights="none" pattern="PDF" \/>/<!--<policy

WORKDIR /

-COPY requirements.txt ./
+COPY content_harvester/requirements.txt ./

RUN pip install --upgrade pip && pip install -r requirements.txt

-COPY ./ /content_harvester
+COPY content_harvester/ /content_harvester
+COPY utils/ /rikolti/utils

RUN chmod +x /content_harvester/by_collection.py

File renamed without changes.
30 changes: 15 additions & 15 deletions README.md
@@ -48,20 +48,18 @@ vi env.local

Currently, I only use one virtual environment, even though each folder located at the root of this repository represents an isolated component. If dependency conflicts are encountered, I'll wind up creating separate environments.

-Similarly, I use only one env.local. Rikolti fetches data to your local system, maps that data, and then fetches relevant content files (media files, previews, and thumbnails). Set `FETCHER_DATA_DEST` to the URI where you would like Rikolti to store fetched data - Rikolti will create a folder (or s3 prefix) `<collection_id>/vernacular_metadata` at this location. Set `MAPPER_DATA_SRC` to the URI where Rikolti can find a `<collection_id>/vernacular_metadata` folder that contains the fetched data you're attempting to map. Set `MAPPER_DATA_DEST` to the URI where you would like Rikolti to store mapped data - Rikolti will create a folder (or s3 prefix) `<collection_id>/mapped_metadata` at this location. Set `CONTENT_DATA_SRC` to the URI where Rikolti can find a `<collection_id>/mapped_metadata` folder that contains the mapped metadata describing where to find content. Set `CONTENT_DATA_DEST` to the URI where you would like Rikolti to store mapped data that has been updated with pointers to content files - Rikolti will create a folder (or s3 prefix) `<collection_id>/mapped_with_content` at this location. Set `CONTENT_DEST` to the URI where you would like Rikolti to store content files.
+Similarly, I use only one env.local. Rikolti fetches data to your local system, maps that data, and then fetches relevant content files (media files, previews, and thumbnails). Set `VERNACULAR_DATA` to the URI where you would like Rikolti to store and retrieve fetched data - Rikolti will create a folder (or s3 prefix) `<collection_id>/vernacular_metadata` at this location. Set `MAPPED_DATA` to the URI where you would like Rikolti to store and retrieve mapped data - Rikolti will create a folder (or s3 prefix) `<collection_id>/mapped_metadata` at this location. Set `CONTENT_DATA` to the URI where you would like Rikolti to store mapped data that has been updated with pointers to content files - Rikolti will create a folder (or s3 prefix) `<collection_id>/mapped_with_content` at this location. Set `CONTENT_ROOT` to the URI where you would like Rikolti to store content files.

For example, one way to configure `env.local` is:

```
-FETCHER_DATA_DEST=file:///Users/awieliczka/Projects/rikolti/rikolti_data
-MAPPER_DATA_SRC=$FETCHER_DATA_DEST
-MAPPER_DATA_DEST=$FETCHER_DATA_DEST
-CONTENT_DATA_SRC=$FETCHER_DATA_DEST
-CONTENT_DATA_DEST=$FETCHER_DATA_DEST
-CONTENT_DEST=file:///Users/awieliczka/Projects/rikolti/rikolti_content
+VERNACULAR_DATA=file:///Users/awieliczka/Projects/rikolti/rikolti_data
+MAPPED_DATA=$VERNACULAR_DATA
+CONTENT_DATA=$VERNACULAR_DATA
+CONTENT_ROOT=file:///Users/awieliczka/Projects/rikolti/rikolti_content
```

-Each of these can be different locations, however. For example, if you're attempting to re-run a mapper locally off of previously fetched data stored on s3, you might set `MAPPER_DATA_SRC=s3://rikolti_data`.
+Each of these can be different locations, however. For example, if you're attempting to re-run a mapper locally off of previously fetched data stored on s3, you might set `VERNACULAR_DATA=s3://rikolti_data`.
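
For instance, a hypothetical env.local that reads previously fetched data from s3 but writes mapped data and content locally might look like this (the paths are illustrative):

```
VERNACULAR_DATA=s3://rikolti_data
MAPPED_DATA=file:///Users/awieliczka/Projects/rikolti/rikolti_data
CONTENT_DATA=file:///Users/awieliczka/Projects/rikolti/rikolti_data
CONTENT_ROOT=file:///Users/awieliczka/Projects/rikolti/rikolti_content
```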

In env.example you'll also see `CONTENT_DATA_MOUNT` and `CONTENT_MOUNT` environment variables. These are only relevant if you are running the content harvester using airflow and want to set any of the `CONTENT_*` environment variables to the local filesystem. Their usage is described below in the Airflow Development section.

@@ -172,9 +170,8 @@ The docker socket will typically be at `/var/run/docker.sock`. On Mac OS Docker
Next, back in the Rikolti repository, create the `startup.sh` file by running `cp env.example dags/startup.sh`. Update the startup.sh file with Nuxeo, Flickr, and Solr keys as available, and make sure that the following environment variables are set:

```
-export FETCHER_DATA_DEST=file:///usr/local/airflow/rikolti_data
-export MAPPER_DATA_SRC=file:///usr/local/airflow/rikolti_data
-export MAPPER_DATA_DEST=file:///usr/local/airflow/rikolti_data
+export VERNACULAR_DATA=file:///usr/local/airflow/rikolti_data
+export MAPPED_DATA=file:///usr/local/airflow/rikolti_data
```

The folder located at `RIKOLTI_DATA_HOME` (set in `aws-mwaa-local-runner/docker/.env`) is mounted to `/usr/local/airflow/rikolti_data` on the airflow docker container.
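
For example, a minimal `aws-mwaa-local-runner/docker/.env` entry might look like this (the path is an assumed example, not prescribed by this repo):

```
RIKOLTI_DATA_HOME=/Users/awieliczka/Projects/rikolti_data
```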
@@ -184,9 +181,8 @@ Please also make sure the following `CONTENT_*` variables are set - `CONTENT_DAT
```
export CONTENT_DATA_MOUNT=/Users/awieliczka/Projects/rikolti_data
export CONTENT_MOUNT=/Users/awieliczka/Projects/rikolti_content
-export CONTENT_DATA_SRC=file:///rikolti_data
-export CONTENT_DATA_DEST=file:///rikolti_data
-export CONTENT_DEST=file:///rikolti_content
+export CONTENT_DATA=file:///rikolti_data
+export CONTENT_ROOT=file:///rikolti_content
```

The folder located at `CONTENT_DATA_MOUNT` is mounted to `/rikolti_data` and the folder located at `CONTENT_MOUNT` is mounted to `/rikolti_content` on the content_harvester docker container.
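
Expressed as docker CLI flags, the equivalent mounts would look roughly like the following sketch (illustrative only - in practice the content_harvester container is launched for you):

```
docker run \
  -v /Users/awieliczka/Projects/rikolti_data:/rikolti_data \
  -v /Users/awieliczka/Projects/rikolti_content:/rikolti_content \
  rikolti/content_harvester
```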
@@ -197,7 +193,11 @@ If you would like to run the content harvester on AWS infrastructure using the E

> A note about Docker vs. ECS: Since we do not actively maintain our own Docker daemon, and since MWAA workers do not come with a Docker daemon installed, we cannot use a docker execution environment in deployed MWAA and instead use ECS to run our content harvester containers on Fargate infrastructure. The EcsRunTaskOperator allows us to run a pre-defined ECS Task Definition. The EcsRegisterTaskDefinitionOperator allows us to define an ECS Task Definition which we could then run. At this time, we are defining the Task Definition in our [cloudformation templates](https://github.com/cdlib/pad-airflow), rather than using the EcsRegisterTaskDefinitionOperator, but this does mean that we cannot modify the container's image or version using the EcsRunTaskOperator.
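
For reference, a minimal `EcsRunTaskOperator` invocation looks roughly like the sketch below. The cluster, task definition, and container names here are hypothetical placeholders, not the values defined in the cloudformation templates:

```python
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

# Sketch only: cluster, task definition, and container names are hypothetical.
content_harvest = EcsRunTaskOperator(
    task_id="content_harvest",
    cluster="rikolti-cluster",            # hypothetical ECS cluster
    task_definition="content_harvester",  # pre-defined Task Definition (see cloudformation templates)
    launch_type="FARGATE",
    overrides={
        "containerOverrides": [
            {
                "name": "content_harvester",
                "command": ["python3", "-m", "content_harvester.by_collection"],
            }
        ]
    },
)
```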
-If you would like to run your own rikolti/content_harvester image instead of pulling the image from AWS, then from inside the Rikolti repo, run `docker build -t rikolti/content_harvester content_harvester` to build the `rikolti/content_harvester` image locally and update the `content_harvester_image` to be `rikolti/content_harvester`.
+If you would like to run your own rikolti/content_harvester image instead of pulling the image from AWS, then from inside the Rikolti repo, run `docker build -f Dockerfile.content_harvester -t rikolti/content_harvester .` to build the `rikolti/content_harvester` image locally and add the following line to `dags/startup.sh` to update `CONTENT_HARVEST_IMAGE` to be `rikolti/content_harvester`:

```
export CONTENT_HARVEST_IMAGE=rikolti/content_harvester
```

Finally, from inside the aws-mwaa-local-runner repo, run `./mwaa-local-env build-image` to build the docker image, and `./mwaa-local-env start` to start the mwaa local environment.

8 changes: 5 additions & 3 deletions content_harvester/README.md
@@ -34,12 +34,14 @@ The above media and thumbnail fetching processes are enacted upon child metadata

# Settings

-You can bypass uploading to s3 by setting `settings.CONTENT_DATA_DEST = "file://<local_path>"` and `settings.CONTENT_DEST = "file://<local_path>"`. This is useful for local development and testing. This will, however, set the metadata records' `media['media_filepath']` and `thumbnail['thumbnail_filepath']` to a local filepath.
+You can bypass uploading to s3 by setting `settings.CONTENT_DATA = "file://<local_path>"` and `settings.CONTENT_ROOT = "file://<local_path>"`. This is useful for local development and testing. This will, however, set the metadata records' `media['media_filepath']` and `thumbnail['thumbnail_filepath']` to a local filepath.
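
For example, in env.local (the paths are illustrative):

```
CONTENT_DATA=file:///Users/awieliczka/Projects/rikolti/rikolti_data
CONTENT_ROOT=file:///Users/awieliczka/Projects/rikolti/rikolti_content
```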

# Local Development

From inside the rikolti folder:
```
-docker build -t rikolti/content_harvester .
+docker build -f Dockerfile.content_harvester -t rikolti/content_harvester .
+cd content_harvester
docker compose run --entrypoint "python3 -m content_harvester.by_registry_endpoint" --rm content_harvester https://registry.cdlib.org/api/v1/rikoltimapper/26147/?format=json
```

@@ -59,7 +61,7 @@ To build manually: From a terminal with AWS credentials, get login password for
```
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/b6c7x7s4
docker buildx create --use
-docker buildx build --platform linux/arm64,linux/amd64 -t public.ecr.aws/b6c7x7s4/rikolti/content_harvester content_harvester --push
+docker buildx build -f Dockerfile.content_harvester --platform linux/arm64,linux/amd64 -t public.ecr.aws/b6c7x7s4/rikolti/content_harvester . --push
```

# TODO:
57 changes: 21 additions & 36 deletions content_harvester/by_collection.py
@@ -1,54 +1,38 @@
import json
-import os

-import boto3

-from . import settings
from .by_page import harvest_page_content


-def get_mapped_pages(collection_id):
-    page_list = []
-    if settings.DATA_SRC['STORE'] == 'file':
-        mapped_path = settings.local_path(collection_id, 'mapped_metadata')
-        try:
-            page_list = [f for f in os.listdir(mapped_path)
-                         if os.path.isfile(os.path.join(mapped_path, f))]
-        except FileNotFoundError as e:
-            print(f"{e} - have you mapped {collection_id}?")
-    else:
-        s3_client = boto3.client(
-            's3',
-            aws_access_key_id=settings.AWS_ACCESS_KEY_ID,
-            aws_secret_access_key=settings.AWS_SECRET_ACCESS_KEY,
-            aws_session_token=settings.AWS_SESSION_TOKEN,
-            region_name=settings.AWS_REGION
-        )
-        response = s3_client.list_objects_v2(
-            Bucket=settings.DATA_SRC["BUCKET"],
-            Prefix=f'{collection_id}/mapped_metadata/'
-        )
-        page_list = [obj['Key'].split('/')[-1] for obj in response['Contents']]
-    return page_list
+from . import settings
+from rikolti.utils.versions import get_mapped_pages, create_content_data_version


# {"collection_id": 26098, "rikolti_mapper_type": "nuxeo.nuxeo"}
-def harvest_collection(collection):
+def harvest_collection(collection, mapped_data_version: str):
    if isinstance(collection, str):
        collection = json.loads(collection)

    collection_id = collection.get('collection_id')

-    if not collection_id:
-        print("ERROR ERROR ERROR\ncollection_id required")
+    if not collection_id or not mapped_data_version:
+        print("ERROR ERROR ERROR\ncollection_id and mapped_data_version required")
        exit()

-    page_list = get_mapped_pages(collection_id)
+    page_list = get_mapped_pages(
+        mapped_data_version,
+        aws_access_key_id=settings.AWS_ACCESS_KEY_ID,
+        aws_secret_access_key=settings.AWS_SECRET_ACCESS_KEY,
+        aws_session_token=settings.AWS_SESSION_TOKEN,
+        region_name=settings.AWS_REGION
+    )

    print(f"[{collection_id}]: Harvesting content for {len(page_list)} pages")
    collection_stats = {}
-    for page in page_list:
-        collection.update({'page_filename': page})

+    collection.update({
+        'content_data_version': create_content_data_version(mapped_data_version)
+    })

+    for page_path in page_list:
+        collection.update({'mapped_page_path': page_path})
        page_stats = harvest_page_content(**collection)

        # in some cases, value is int and in some cases, value is Counter
@@ -71,11 +55,12 @@ def harvest_collection(collection):
    parser = argparse.ArgumentParser(
        description="Harvest content by collection using mapped metadata")
    parser.add_argument('collection_id', help="Collection ID")
+    parser.add_argument('mapped_data_version', help="URI to mapped data version: ex: s3://rikolti-data-root/3433/vernacular_data_version_1/mapped_data_version_2/")
    parser.add_argument('--nuxeo', action="store_true", help="Use Nuxeo auth")
    args = parser.parse_args()
    arguments = {
        'collection_id': args.collection_id,
    }
    if args.nuxeo:
        arguments['rikolti_mapper_type'] = 'nuxeo.nuxeo'
-    print(harvest_collection(arguments))
+    print(harvest_collection(arguments, args.mapped_data_version))
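
With the new signature, a run of this module might look like the following (the collection id and URI are illustrative, echoing the argparse help text; the module is invoked with `-m` since it uses relative imports):

```
python3 -m content_harvester.by_collection 3433 s3://rikolti-data-root/3433/vernacular_data_version_1/mapped_data_version_2/
```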