
Bye prefect #9

Merged 11 commits on Apr 24, 2024
28 changes: 28 additions & 0 deletions .dockerignore
@@ -0,0 +1,28 @@
.env
service-account.json
gdrive-service-account.json
.git
.gitignore
environment.yaml
Makefile
README.md
LICENSE
deps/requirements.in
deps/dev-requirements.in
deps/dev-requirements.txt
notebooks/
scripts/
pipeline/data/**/*.csv
pipeline/data/**/*.json
pipeline/data/**/*.kml
pipeline/data/**/*.aspx
pipeline/logs/*
pipeline/bfro_mini_warehouse/logs/*
pipeline/bfro_mini_warehouse/target/*
pipeline/bfro_mini_warehouse/README.md
pipeline/bfro_mini_warehouse/.user.yml
**/__pycache__/
.dvc
.github
.ruff_cache
.vscode
4 changes: 2 additions & 2 deletions .github/workflows/test-and-lint.yml
@@ -13,12 +13,12 @@ jobs:
- name: Set up python
uses: actions/setup-python@v2
with:
python-version: "3.9"
python-version: "3.11"

- name: Install dependencies
run: pip install -r deps/dev-requirements.txt

- name: Run checks
run: make check
run: make lint

# uhhhh no tests for now I guess
5 changes: 4 additions & 1 deletion .gitignore
@@ -13,4 +13,7 @@ logs
data_old
test.csv
data_backup

service-account.json
gdrive-service-account.json
run-docker.sh
run-pipeline-docker.sh
16 changes: 13 additions & 3 deletions Dockerfile
@@ -1,5 +1,15 @@
FROM python:3.9-slim-buster
FROM python:3.11-slim-bullseye

COPY deps/requirements.txt .
ENV VIRTUAL_ENV=/usr/local
# Install basics
RUN apt-get update -y && apt-get install -y zip wget apt-transport-https ca-certificates gnupg curl
# Install mc
RUN curl https://dl.min.io/client/mc/release/linux-amd64/mc -o /usr/bin/mc && chmod +x /usr/bin/mc
# Install gcloud
RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg && apt-get update -y && apt-get install google-cloud-sdk -y

RUN pip install -r requirements.txt
# Install the python stuff.
COPY deps/requirements.txt requirements.txt
RUN pip install uv && uv pip install -r requirements.txt

COPY . .
41 changes: 7 additions & 34 deletions Makefile
@@ -2,61 +2,34 @@

## Compile requirements.in into requirements.txt
deps/requirements.txt: deps/requirements.in
pip-compile deps/requirements.in --output-file deps/requirements.txt
uv pip compile deps/requirements.in --output-file deps/requirements.txt

## Compile dev-requirements.in into dev-requirements.txt
deps/dev-requirements.txt: deps/dev-requirements.in deps/requirements.txt
pip-compile deps/dev-requirements.in --output-file deps/dev-requirements.txt
uv pip compile deps/dev-requirements.in --output-file deps/dev-requirements.txt

## Install non-dev dependencies.
env: deps/requirements.txt
pip-sync deps/requirements.txt
uv pip sync deps/requirements.txt

## Install dev and non-dev dependencies.
dev-env: deps/dev-requirements.txt
pip-sync deps/dev-requirements.txt
uv pip sync deps/dev-requirements.txt

## Lint project with ruff.
lint:
python -m ruff .
python -m ruff check .

## Format imports and code.
format:
python -m ruff . --fix
python -m black .

## Check linting and formatting.
check:
python -m ruff check .
python -m black --check .
python -m ruff format .

.PHONY: build-docker
## Build docker with local registry tag
## Build docker with local registry tag and push to local registry
build-docker:
docker build --tag localhost:5000/bfro_pipeline:latest .

.PHONY: push-docker
## Push docker to local registry
push-docker:
docker push localhost:5000/bfro_pipeline:latest

.PHONY: build-deployment
## Builds the Prefect deployment yaml file.
build-deployment:
cd pipeline && \
prefect deployment build \
bfro_pipeline_docker:main \
--name bfro-pipeline \
--pool bfro-agent-pool \
--work-queue default \
--infra-block process/bfro-local \
--storage-block gcs/bfro-pipeline-storage

.PHONY: apply-deployment
## Sends the Prefect deployment file to the server.
apply-deployment:
prefect deployment apply pipeline/main-deployment.yaml

.PHONY: pull-data
## Downloads the data locally for testing
pull-data:
35 changes: 10 additions & 25 deletions README.md
@@ -18,7 +18,7 @@ To get started, run the following commands to set up the environment.

```shell
conda env create -f environment.yaml
conda activate bfro
conda activate bfro-sightings-data
make dev-env
```

@@ -33,38 +33,23 @@ For more information on the weather data, see the [Visual Crossing documentation
## Full Pipeline

The pipeline (including the scraper, weather, and the DBT project) is in the `pipeline/` directory.
Everything assumes relative paths from that directory, rather than the project root (which is just for setup and deployment operations).
However, there's a shell script that will run it all for you.

```
cd pipeline/
python bfro_pipeline.py
```

```sh
# In project root.
./run-pipeline-local.sh False
```

For a test run (which runs the scraper on a small set of pages, and pulls a small set of weather data), use `--test-run`.
To run a test run, use

```sh
./run-pipeline-local.sh True
```

```
python bfro_pipeline.py --test-run
```

Once the sources are in place (however you decide to run them), you can run dbt directly from within the `pipeline/bfro_mini_warehouse` directory, or use the script.

```
python bfro_pipeline.py --dbt-only
```

The pipeline command runs the source tests first, builds the csv files, then runs the rest of the data tests.
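
If you'd rather invoke dbt yourself, here's a minimal sketch, assuming dbt and its adapter are installed from the dev requirements and a profile for this project is already configured:

```sh
cd pipeline/bfro_mini_warehouse
dbt build   # runs the models and the data tests in one pass
```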

## Deployment and Orchestration

The pipelines are [Prefect](https://www.prefect.io/) flows, and they can be deployed but require some setup.
The `pipeline/bfro_pipeline_docker.py` file has the blocks you need (basically `prefect-gcs-rw` as GCP credentials, and `visual-crossing-key` as a Secret).
I assume if you're messing with deployments you probably know how that stuff works.
It's not _super_ hard to self-host Prefect, but it's not super easy either.

Also worth noting - while the thing says `_docker` in the file name and pipeline name, I don't actually have the dockerized version working yet 😬 .

It will still deploy and run as is, though, with a process infrastructure block on an agent running inside the conda env provided in this repo.
When I get docker working, you'll be able to launch it with a docker container infrastructure block and no code change to the flow.
There's a Dockerfile and docker make targets (set to push to a local registry).
If you want to run in a container, wrap the `run-pipeline-local.sh` script in one that pulls data down before the run and pushes the updated data back before the container tears down, or run with a bind mount or volume mount on the local project root (a sketch of the latter is below).
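
For example, a rough sketch of the bind-mount route; the mount path, working directory, and reliance on the Makefile's local-registry tag are assumptions, not a supported entry point:

```sh
# Build and push the image (assumes a registry is listening on localhost:5000).
make build-docker push-docker

# Run the pipeline with the project root mounted so scraped data, weather data,
# and the built csv files end up back on the host before the container exits.
docker run --rm \
  -v "$(pwd)":/bfro \
  -w /bfro \
  localhost:5000/bfro_pipeline:latest \
  bash run-pipeline-local.sh False
```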

## Data Dictionary

1 change: 0 additions & 1 deletion deps/dev-requirements.in
Original file line number Diff line number Diff line change
@@ -1,3 +1,2 @@
-r requirements.txt
black
ruff