Barbara/publishing pipeline #703

Merged
merged 5 commits into from
Jan 23, 2024
4 changes: 3 additions & 1 deletion .gitignore
@@ -20,4 +20,6 @@ samconfig.toml
!/.github
startup.sh

__pycache__
__pycache__
.venv/
env/
17 changes: 17 additions & 0 deletions README.md
@@ -205,6 +205,23 @@ If you would like to mount your own codebase to the content_harvester container
export MOUNT_CODEBASE=<path to rikolti, for example: /Users/awieliczka/Projects/rikolti>
```

In order to run the indexer code, make sure the following variables are set:

```
export RIKOLTI_ES_ENDPOINT= # ask for endpoint url
export RIKOLTI_HOME=/usr/local/airflow/dags/rikolti
export INDEX_RETENTION=1
```

Also make sure to set your temporary AWS credentials and the region so that the mwaa-local-runner container can authenticate when talking to the OpenSearch API:

```
export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=
export AWS_SESSION_TOKEN=
export AWS_REGION=us-west-2
```

Finally, from inside the aws-mwaa-local-runner repo, run `./mwaa-local-env build-image` to build the docker image, and `./mwaa-local-env start` to start the mwaa local environment.

For more information on `mwaa-local-env`, look for instructions in the [ucldc/aws-mwaa-local-runner:README](https://github.com/ucldc/aws-mwaa-local-runner/#readme) to build the docker image, run the container, and do local development.
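The variables listed above can be checked before starting the local environment. The following is a minimal sketch, not part of this PR: the helper name `missing_env_vars` and the two list names are hypothetical, and the variable names are taken from the README text above.

```python
import os

# Variable names copied from the README section above.
INDEXER_VARS = ["RIKOLTI_ES_ENDPOINT", "RIKOLTI_HOME", "INDEX_RETENTION"]
AWS_VARS = [
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
    "AWS_REGION",
]


def missing_env_vars(required, environ=os.environ):
    """Return the names in `required` that are unset or empty in `environ`.

    Hypothetical helper for illustration; it is not part of rikolti.
    """
    return [name for name in required if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(INDEXER_VARS + AWS_VARS)
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All indexer environment variables are set.")
```

Running a check like this before `./mwaa-local-env start` surfaces a missing credential early, rather than at the first OpenSearch request.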
2 changes: 1 addition & 1 deletion dags/index_to_prod_dag.py
@@ -11,7 +11,7 @@
schedule=None,
start_date=datetime(2023, 1, 1),
catchup=False,
params={'collection_id': Param(None, description="Collection ID to index")},
params={'collection_id': Param(None, description="Collection ID to move to prod")},
tags=["rikolti"],
)
def index_to_prod_dag():
1 change: 1 addition & 0 deletions dags/requirements.txt
@@ -1,5 +1,6 @@
--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.6.3/constraints-3.10.txt"
boto3
opensearch-py
requests
sickle
python-dotenv
10 changes: 7 additions & 3 deletions env.example
@@ -32,7 +32,11 @@ export CONTENT_ROOT=file:///usr/local/airflow/rikolti_content

# indexer
export RIKOLTI_ES_ENDPOINT= # ask for endpoint url
export RIKOLTI_ES_PASS= # ask for password

export RIKOLTI_HOME=/usr/local/airflow/dags/rikolti
export INDEX_RETENTION=1 # number of unaliased indices to retain during cleanup
export INDEX_RETENTION=1 # number of unaliased indices to retain during cleanup

# indexer when run locally via aws-mwaa-local-runner
# export AWS_ACCESS_KEY_ID=
# export AWS_SECRET_ACCESS_KEY=
# export AWS_SESSION_TOKEN=
# export AWS_REGION=us-west-2
23 changes: 22 additions & 1 deletion record_indexer/README.md
@@ -8,7 +8,7 @@ python index_templates/rikolti_template.py

This creates a template that will be used whenever an index with name matching `rikolti*` is added to the cluster.

## Run indexer
## Run indexer from command line

Create a new index for a collection and add it to the `rikolti-stg` alias:

@@ -22,8 +22,29 @@ Add the current stage index for a collection to the `rikolti-prd` alias:
python -m record_indexer.move_index_to_prod <collection_id>
```

## Indexer development using aws-mwaa-local-runner

See the Rikolti README page section on [Airflow Development](https://github.com/ucldc/rikolti/#airflow-development). In particular, make sure that indexer-related env vars are set as described there.

## Index lifecycle

The lifecycle of an index is as follows:

#### Create new index
1. Create a new index named `rikolti-{collection_id}-{version}`, where `version` is the current datetime.
2. Remove any existing indices for the collection from the `rikolti-stg` alias.
3. Add the new index to the `rikolti-stg` alias.
4. Delete any older unaliased indices, retaining the number of unaliased indices specified by `settings.INDEX_RETENTION`.

Note that the index creation code ensures that each collection has only one stage index at a time.
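
The cleanup in step 4 can be sketched as a pure function. This is an illustration under stated assumptions, not the code in this PR: the function name `indices_to_delete` and the input shape (index name mapped to its list of aliases) are hypothetical, and the sketch assumes the `version` suffix sorts chronologically as a string (for example a zero-padded timestamp).

```python
def indices_to_delete(indices: dict, retention: int) -> list:
    """Return unaliased index names that fall outside the retention window.

    `indices` maps index name -> list of alias names (empty if unaliased).
    Hypothetical helper; assumes index names sort oldest to newest.
    """
    unaliased = sorted(
        (name for name, aliases in indices.items() if not aliases),
        reverse=True,  # newest first, given the sortable-suffix assumption
    )
    # Keep the `retention` newest unaliased indices and return the rest.
    return unaliased[max(retention, 0):]


# Example data with made-up collection ID and version suffixes.
collection_indices = {
    "rikolti-123-20240101000000": [],
    "rikolti-123-20240105000000": [],
    "rikolti-123-20240110000000": [],
    "rikolti-123-20240120000000": ["rikolti-stg", "rikolti-prd"],
}
print(indices_to_delete(collection_indices, retention=1))
```

With `retention=1`, the newest unaliased index is kept and the two older unaliased indices are returned for deletion; aliased indices are never candidates.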

#### Move staged index to production
1. Identify the current stage index for the collection.
2. Remove any existing indices for the collection from the `rikolti-prd` alias.
3. Add the current stage index to the `rikolti-prd` alias. (This means that at this stage in the lifecycle, the index will be aliased to `rikolti-stg` and `rikolti-prd` at the same time.)
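
Steps 2 and 3 map naturally onto a single request to the OpenSearch `_aliases` endpoint, which accepts a list of `remove` and `add` actions. The sketch below only builds that request body; the function name `move_to_prod_actions` and the example index names are hypothetical, and whether the PR's code issues one request or several is not shown in this diff.

```python
import json


def move_to_prod_actions(stage_index, current_prod_indices, prod_alias="rikolti-prd"):
    """Build an `_aliases` request body that points `prod_alias` at `stage_index`.

    Hypothetical helper for illustration; it performs no network calls.
    """
    actions = [
        {"remove": {"index": name, "alias": prod_alias}}
        for name in current_prod_indices
        # Skip the stage index itself so it is not removed and re-added.
        if name != stage_index
    ]
    actions.append({"add": {"index": stage_index, "alias": prod_alias}})
    return {"actions": actions}


body = move_to_prod_actions(
    stage_index="rikolti-123-20240120000000",
    current_prod_indices=["rikolti-123-20240101000000"],
)
print(json.dumps(body, indent=2))
```

Sending the removals and the addition in one body means the alias is never left without an index for the collection between the two steps.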

#### Delete old index
This happens during index creation (see step 4 above).



2 changes: 1 addition & 1 deletion record_indexer/add_page_to_index.py
@@ -43,7 +43,7 @@ def build_bulk_request_body(records: list, index: str):
# https://opensearch.org/docs/1.2/opensearch/rest-api/document-apis/bulk/
body = ""
for record in records:
doc_id = record.get("calisphere-id")
doc_id = record.get("id")

action = {"create": {"_index": index, "_id": doc_id}}

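The hunk above is truncated after the `action` line. For context, the bulk API expects newline-delimited JSON: one action line followed by one document line, with a trailing newline after the final line. The following is a minimal sketch of how such a body could be assembled; everything after the `action` assignment is an assumption, not the code from this PR.

```python
import json


def build_bulk_request_body(records: list, index: str) -> str:
    """Assemble a newline-delimited bulk body of `create` actions.

    Sketch only: the lines after `action = ...` are assumed, not taken
    from the PR.
    """
    body = ""
    for record in records:
        # After this PR, the document ID comes from the record's "id" field.
        doc_id = record.get("id")
        action = {"create": {"_index": index, "_id": doc_id}}
        # One action line, then one source line, each newline-terminated.
        body += f"{json.dumps(action)}\n{json.dumps(record)}\n"
    return body


# Example records and index name are made up for illustration.
example_records = [{"id": "123--abc", "title": "First"}, {"id": "123--def", "title": "Second"}]
example_body = build_bulk_request_body(example_records, "rikolti-123-20240120000000")
print(example_body)
```

Because the action is `create` rather than `index`, a document whose `_id` already exists in the target index is rejected instead of overwritten, so the choice of ID field directly affects which records are accepted.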
1 change: 1 addition & 0 deletions record_indexer/requirements.txt
@@ -1,4 +1,5 @@
boto3
opensearch-py
python-dotenv
requests
requests-aws4auth
8 changes: 7 additions & 1 deletion record_indexer/settings.py
@@ -1,11 +1,17 @@
import os

from boto3 import Session
from dotenv import load_dotenv
from opensearchpy import AWSV4SignerAuth

load_dotenv()

def get_auth():
credentials = Session().get_credentials()
return AWSV4SignerAuth(credentials, os.environ.get("AWS_REGION"))

ENDPOINT = os.environ.get("RIKOLTI_ES_ENDPOINT")
AUTH = ("rikolti", os.environ.get("RIKOLTI_ES_PASS"))
AUTH = get_auth()

RIKOLTI_HOME = os.environ.get("RIKOLTI_HOME", "/usr/local/airflow/dags/rikolti")
RECORD_INDEX_CONFIG = os.sep.join(
1 change: 1 addition & 0 deletions requirements_dev.txt
@@ -1,5 +1,6 @@
-r ./metadata_mapper/requirements.txt
-r ./metadata_fetcher/requirements.txt
-r ./record_indexer/requirements.txt
ipython
ruff
isort