LOCA2 First Implementation #3

Merged: 20 commits, Jan 6, 2025
1 change: 1 addition & 0 deletions .dockerignore
@@ -0,0 +1 @@
.venv
4 changes: 4 additions & 0 deletions .flake8
@@ -0,0 +1,4 @@
[flake8]
exclude = .venv,__pycache__,.pytest_cache,.git,.tox,.eggs,*.egg,*.egg-info,*.egg-info/*
max-line-length = 90
ignore = I201, I100, I101, W503
59 changes: 59 additions & 0 deletions .github/workflows/ci.yaml
@@ -0,0 +1,59 @@
name: CI/CD Pipeline

on:
  push:
    branches: [ '*' ]
  pull_request:
    branches: [ '*' ]

jobs:
  test-and-build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.12'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install flake8 flake8-import-order
          pip install -e ".[test]"

      - name: Run flake8
        run: |
          flake8 . --count --statistics \
            --select=E,W,F,I \
            --show-source

      - name: Run pytest
        run: |
          pytest tests/

      - name: Login to DockerHub
        if: github.event_name == 'push'
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - name: Extract branch name
        if: github.event_name == 'push'
        shell: bash
        run: echo "BRANCH_NAME=$(echo ${GITHUB_REF#refs/heads/})" >> $GITHUB_ENV

      - name: Build and push Docker image
        if: github.event_name == 'push'
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ncsa/downscaled-climate-data:${{ env.BRANCH_NAME }}
2 changes: 1 addition & 1 deletion .gitignore
@@ -159,4 +159,4 @@ cython_debug/
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
-#.idea/
+.idea/
7 changes: 7 additions & 0 deletions Dockerfile
@@ -0,0 +1,7 @@
FROM python:3.12
LABEL org.opencontainers.image.source="https://github.com/atmsillinois/DownscaledClimateData"
WORKDIR /project
COPY pyproject.toml /project/
RUN pip install .
COPY downscaled_climate_data/ /project/downscaled_climate_data/
RUN pip install -e .
147 changes: 147 additions & 0 deletions README.md
@@ -1,2 +1,149 @@
# DownscaledClimateData
Data Pipelines for Downscaled Climate Datasets

Dagster port of original work by Maile Sasaki in the [Climate_Map Repo](https://github.com/mailesasaki/climate_map).

This is a Dagster project that downloads and processes data from various downscaled climate datasets.
The data is stored in a cloud bucket for use in the University of Illinois
Department of Climate, Meteorology & Atmospheric Sciences analysis facility.

The goal of this project is to deploy a set of sensors and jobs to maintain data assets of downscaled
climate data. The datasets will include:
- Raw netcdf files downloaded from the sources
- Cloud optimized Zarr files
- Analyzed results ready for display on the Climate Map

## Getting Started
Dagster offers a great developer experience. It's easy to run a Dagster instance on your laptop and
try out these pipelines.

### Install the Project
Create a virtual environment and install the project and the development dependencies.

```bash
python3 -m venv .venv

source .venv/bin/activate
pip install -e ".[dev]"
```

### Configure Access to Storage Bucket
The goal of the project is to store data in a cloud bucket. You can experiment with some
portions of the pipeline without a cloud bucket, but you will need to configure the bucket
to run the full pipeline.

The access credentials and endpoint URL are read from environment variables. Dagster offers an easy way to
configure these through a `.env` file.

Create this file in the root of the project and add the following lines:

```bash
S3_ENDPOINT_URL=https://url-of-your-s3-provider
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-access-key

LOCA2_BUCKET=loca2-data

# No leading slashes on these paths - the downloaded netcdf and zarr files will
# be stored in subdirectories of these paths
LOCA2_ZARR_PATH_ROOT=zarr/LOCA2
LOCA2_RAW_PATH_ROOT=raw/LOCA2
```

The `.env` file is already listed in `.gitignore`, so you don't have to worry about accidentally
committing your secrets.

### Run the Dagster Dashboard
The [Dagster UI](https://docs.dagster.io/concepts/webserver/ui) is a web-based interface
for viewing and interacting with Dagster objects. Since the project is already installed
in the virtual environment, you can launch the Dagster UI with the command:

```bash
dagster dev
```

This will run the web server and print a local URL you can visit to open the dashboard.

### Development Configuration
The project includes a `dagster.yaml` file that sets up the configuration for the project.
It currently limits the number of concurrent runs to 1 to avoid blowing up your laptop.

## Dagster Project Structure
There are three main concepts in the Dagster project, sketched briefly after this list:
1. Assets - The data produced by the pipelines. Assets are implemented by Python code and can depend on other assets.
2. Sensors - Sensors monitor sources for new data. They can run on a schedule and trigger asset jobs.
3. Resources - Resources are the connections to external systems. They are used here to connect to cloud storage and other services.
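
As a rough sketch (illustrative names only, not this project's actual code), an asset is a decorated Python function, a downstream asset declares its dependency by taking the upstream asset as a parameter, and a resource is a configurable object injected by parameter name:

```python
import dagster as dg


class BucketResource(dg.ConfigurableResource):
    """Illustrative resource holding connection details for cloud storage."""
    endpoint_url: str


@dg.asset
def raw_file(bucket: BucketResource) -> str:
    # Assets are produced by Python code; resources are injected by parameter name.
    return f"{bucket.endpoint_url}/raw/LOCA2/example.nc"


@dg.asset
def zarr_store(raw_file: str) -> str:
    # An asset depends on another asset simply by taking it as an argument.
    return raw_file.replace(".nc", ".zarr")
```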

Here are descriptions of the assets, sensors, and resources that make up the project.

### Assets
[loca2_raw_netcdf](downscaled_climate_data/assets/loca2.py)

This asset represents the raw netcdf data downloaded from the LOCA2 dataset.
The data is stored in a cloud bucket and can be used as the source for the other assets. It accepts
the following parameters:
- `url` - The URL of the netcdf file on the UCSD web server
- `bucket` - The name of the cloud bucket where the data will be stored
- `s3_key` - The key of the object in the bucket; this is the full path where the object will be stored and reads like a directory structure.

These values are typically produced by the `Loca2Datasets` resource.
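
For illustration, launching this asset by hand might use a run config shaped like the following (the nesting is Dagster's standard ops/config layout; the URL, bucket, and key values here are placeholders, not real LOCA2 paths):

```python
# Placeholder values only -- in normal operation the real URL, bucket, and key
# are supplied by the Loca2Datasets resource via the sensor.
run_config = {
    "ops": {
        "loca2_raw_netcdf": {
            "config": {
                "url": "https://example.ucsd.edu/LOCA2/tasmax.example.nc",
                "bucket": "loca2-data",
                "s3_key": "raw/LOCA2/ACCESS-CM2/ssp245/tasmax.example.nc",
            }
        }
    }
}
```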

[loca2_zarr](downscaled_climate_data/assets/loca2.py)

This asset converts the netcdf files to Zarr format. It uses the `xarray` library to read each netcdf file and
write it back out as Zarr, a cloud-optimized format that is more efficient for reading
data in the cloud. The asset accepts the output from the `loca2_raw_netcdf` asset as input.
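
At its core the conversion comes down to a couple of `xarray` calls, roughly like this sketch (the file names are placeholders; the real asset reads from and writes to the cloud bucket):

```python
import xarray as xr

# Read the raw netcdf file and rewrite it as a Zarr store.
ds = xr.open_dataset("tasmax.example.nc")
ds.to_zarr("tasmax.example.zarr", mode="w")
```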

[loca2_esm_catalog](downscaled_climate_data/assets/loca2.py)

This asset scans the objects in the bucket to produce an intake-esm catalog and uploads the catalog to the bucket.
The catalog is used to drive the analysis of the data in the Climate Map. The catalog is a JSON file that describes
the dataset and references a CSV file listing each dataset along with some searchable metadata.

The asset config controls which datasets to include in the catalog. The asset accepts the following parameters (an illustrative use of the resulting catalog follows the list):
- `data_format` - Can be 'zarr' or 'netcdf'. This is important catalog metadata and also controls how the bucket is scanned.
- `id` - The dataset id, used to identify the dataset in the catalog.
- `description` - A description of the dataset
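
As a hedged example of how an analyst might open and search the resulting catalog with intake-esm (the catalog location and the search columns shown here are assumptions, not taken from this PR):

```python
import intake  # requires intake-esm to be installed

# Open the catalog JSON and search its CSV-backed metadata table.
cat = intake.open_esm_datastore("s3://loca2-data/catalogs/loca2-zarr.json")
subset = cat.search(model="ACCESS-CM2", scenario="ssp245", variable="tasmax")
datasets = subset.to_dataset_dict()  # dict of xarray Datasets keyed by dataset id
```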

### Resources
These resources are consumed by the sensor to make the entire pipeline easily configurable and to
encapsulate the interactions with the UCSD web server and the cloud storage.

`Loca2Models` - This resource doesn't actually interact with the outside world, but is the source of the
dictionary of models and scenarios that are used in the LOCA2 dataset. This resource is used to drive
the search for new files on the UCSD web server.

`Loca2Datasets` - This resource is used to interact with the UCSD web server to find new files in the LOCA2
dataset for a specific model/scenario combination. It uses the `beautifulsoup4` library to parse the
web pages and find the links to the netcdf files. It filters out any monthly summary files. This resource works
like an iterator and yields one record per file on the web server. Each record is a dictionary with the
following keys (an illustrative record is shown after the list):
- `model` - The name of the model
- `scenario` - The name of the scenario
- `member ID` - The member ID of the model
- `variable` - The variable represented in the dataset
- `url` - The url of the netcdf file
- `s3_key` - The suggested S3 key this file should be saved as in the cloud bucket
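
For illustration, a single yielded record might look like this (every value is a placeholder showing the shape of the dictionary, not a real LOCA2 entry):

```python
record = {
    "model": "ACCESS-CM2",
    "scenario": "ssp245",
    "member ID": "r1i1p1f1",
    "variable": "tasmax",
    "url": "https://example.ucsd.edu/LOCA2/ACCESS-CM2/ssp245/tasmax.nc",
    "s3_key": "ACCESS-CM2/ssp245/r1i1p1f1/tasmax.nc",
}
```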



### Sensors
[LOCA2_Sensor](downscaled_climate_data/sensors/loca2_sensor.py)

This sensor finds new data in the LOCA2 dataset and triggers the asset job to download the data. It
creates an asset key for each file to make sure we don't download the same file twice. Dagster, by design, wants
sensors to complete in a short amount of time. We work at the model/scenario/member level to keep the
number of files relatively small, so the sensor can finish in time.
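
A simplified sketch of that de-duplication idea, written here with Dagster run keys and a stand-in discovery helper rather than the project's actual sensor code:

```python
import dagster as dg


def discover_new_files():
    """Stand-in for the Loca2Datasets resource: yields one record per file."""
    yield {"url": "https://example.ucsd.edu/LOCA2/tasmax.nc",
           "s3_key": "ACCESS-CM2/ssp245/r1i1p1f1/tasmax.nc"}


@dg.sensor(asset_selection=dg.AssetSelection.keys("loca2_raw_netcdf"))
def loca2_sensor_sketch(context: dg.SensorEvaluationContext):
    for record in discover_new_files():
        # Dagster skips any RunRequest whose run_key it has seen before,
        # so each discovered file triggers at most one download run.
        yield dg.RunRequest(run_key=record["s3_key"])
```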

## Development
Since the instructions above install the project with the `-e` flag, changes
to the source code are immediately available to the Dagster UI. You do need to tell Dagster to reload
the project, which you can do by clicking the "Reload" button on the _Deployment_ tab of the Dagster UI.

The assets and sensors have unit tests which also simplify development since you can
observe the code running in a variety of scenarios without the whole Dagster infrastructure.
You can run the tests with the command:

```bash
pytest
```
2 changes: 2 additions & 0 deletions dagster.yaml
@@ -0,0 +1,2 @@
run_queue:
  max_concurrent_runs: 1
3 changes: 3 additions & 0 deletions downscaled_climate_data/__init__.py
@@ -0,0 +1,3 @@
import importlib.metadata

__version__ = importlib.metadata.version("DownscaledClimateData")