Better Nightly Builds #3002

bendnorman · 2023-11-01T20:26:11Z

bendnorman
Nov 1, 2023
Maintainer

Requirements:

Kick off an arbitrary number of builds so we can do full builds on multiple feature branches
Get notified when any part of the build process fails
Collect the logs for all of the build steps in one place. Currently, to debug the full nightly build process you need to look at the GitHub Actions logs, the ETL logs and GCP Cloud Logging.

Nice to haves:

Deploy builds in a way that would allow us to process data on multiple machines. Not important right now but could be when we get into larger datasets like EQR and SEC PDFs.
Have access to a full dagster deployment so we can use the UI, track changes between runs, use sensors.
Deploy the builds in a way that allows us to use postgres for the pudl database

Notes from call with Dazhong

Do we need to spin up and tear down a Cloud SQL database? If something goes wrong in the ETL or the container is killed, the database isn't killed. That's fighting the intent of Cloud SQL.
Does a full dagster deployment actually solve these problems? Moving from batch to continuous process feels like a big change. The easiest thing to do is just stay in the batch model. Put one service in a container, database is only used by the database. The gcp_etl.sh is trying to do too much. Using the UI to do debug things is the biggest benefit. Not worth doing a full deployment.
How hard would it be to start a postgres db in the VM? Install postgres in the image. Probably nuke the lifecycle stuff.
Long-term merge ferc2sqlite stuff into normal ETL.

jdangerx · 2023-11-01T21:12:19Z

jdangerx
Nov 1, 2023
Maintainer

@bendnorman and I had a call, I think our high level takeaways are:

running a whole Dagster service seems like overkill - the main benefit we would get is the UI, but then we incur the cost of Running a Whole Service
batch-world gets us what we need - the ability to kick off ETL runs for different git refs.
we can run postgres within the container to avoid having to manage a whole cloud SQL instance - it's not really being exposed to the outside world anyways, so doesn't seem like it adds a ton of complexity to the system
we would like to move complexity out of gcp_pudl_etl.sh

Immediate next steps are:

don't try to manage Cloud SQL lifecycle within gcp_pudl_etl.sh - it's brittle and doesn't really save us that much money
keep pushing on the Batch-ification of the nightly builds in between other work

1 reply

bendnorman Nov 1, 2023
Maintainer Author

I think another worthwhile benefit of running dagster as a service would be tracking metadata about assets over time, which we're discussing in #2944. Also, while a full Kubernetes deployment is likely overkill in the short and medium term, I wonder if it would benefit us in the long term to use k8s, given it is the standard for orchestrating multi-container applications. I could imagine a world where PUDL is doing:

multi run metric tracking
using postgres databases to store processed data and dagster event logs
spinning up multiple VMs to process large datasets like EQR and SEC PDFs.

Basically, I don't want to sink time into another semi-interim nightly build solution just to have to rework the entire thing in a year or two. I think setting up Google Batch and postgres inside the container is good for now but I thought I'd share my anxieties haha

jdangerx · 2023-12-23T00:18:31Z

jdangerx
Dec 23, 2023
Maintainer

cc @zaneselvans @bendnorman

Thought a bunch about how we might want to set up our build process with Batch. If the plan / steps sound reasonable I'll turn this into GH issues and start working on them.

Actual high level needs

We want to be able to run the ETL, validate its outputs, publish artifacts, and update the nightly/stable branches.

We'd also like to be able to correlate the build artifacts with the code version that generated them.

Finally, we'd like to be able to change some behavior based on whether it's a nightly, stable, or ad-hoc run:

Nightly:

publish build artifacts to an internal cache on GCS, AWS, and datasette
update nightly branch

Stable:

publish build artifacts to GCS, AWS, but not datasette
update stable branch

Ad-hoc

only publish build artifacts to GCS
potentially run with different ETL configuration files
do not update any Git branches
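The three build types above boil down to a small lookup table. Here is a minimal sketch of that table in Python; the `BuildSettings` class and its field names are hypothetical, only the true/false values come from the lists above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BuildSettings:
    """Publication and Git behavior for one kind of build (hypothetical shape)."""

    publish_gcs: bool
    publish_aws: bool
    publish_datasette: bool
    git_target_branch: Optional[str]  # None means "do not update any branch"


# Values taken from the nightly / stable / ad-hoc lists above.
BUILD_SETTINGS = {
    "nightly": BuildSettings(True, True, True, "nightly"),
    "stable": BuildSettings(True, True, False, "stable"),
    "ad-hoc": BuildSettings(True, False, False, None),
}

print(BUILD_SETTINGS["stable"])
```

Keeping this in one place would mean the workflow only has to decide which of the three keys applies.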

Desired end technical state

We'll still kick off the build process with GitHub Actions, which will configure/submit a Google Batch job. The Batch job will run a build script within a Docker container.

GHA workflow

build docker image
using GHA context, such as github ref / triggering event / workflow dispatch inputs, set specific settings in Google Batch job description as env vars:
- ETL configuration
  - path to configuration YML file - workflow dispatch input
- publication settings
  - GCS namespace (nightly/stable/ad-hoc) - choose nightly or stable, or ad-hoc based on the tag name
  - AWS namespace (nightly/stable/none)
  - do/don't publish to Datasette
- Git settings
  - git branch to update (nightly/stable/none)
  - current git ref

Google batch job description

This will have to be generated dynamically by a Python script that passes the various settings from the GHA context into a JSON file.

The secrets will be kept in Google Secrets so that we don't have to pass them around.

The non-secret settings will be passed into the main script as CLI args via the commands array.

{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          {
            "container": {
              "imageUri": "docker.io/catalystcoop/pudl-etl:<TAG>",
              "commands": [
                "micromamba",
                ...
              ]
            },
            "environment": {
              "secretVariables": {
                "PUDL_BOT_PAT": "projects/PROJECT_ID/secrets/SECRET_NAME/versions/VERSION",
                ...
              }
            }
          }
        ]
      }
    }
  ],
  "allocationPolicy": {
    "service_account": {
      "email": "some-special-service-account"
    }
  },
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}
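A rough sketch of what the generation script could look like, assuming it just builds the structure above as a Python dict and dumps it to JSON. The function name `make_batch_job`, its parameters, and the example arguments are all made up for illustration; the field names simply mirror the example above and have not been checked against the Batch API.

```python
import json


def make_batch_job(image_tag, cli_args, secrets, service_account_email):
    """Return a Batch job description as a plain dict.

    cli_args: list of strings to run inside the container.
    secrets: mapping of env var name -> Secret Manager resource path.
    """
    return {
        "taskGroups": [
            {
                "taskSpec": {
                    "runnables": [
                        {
                            "container": {
                                "imageUri": f"docker.io/catalystcoop/pudl-etl:{image_tag}",
                                "commands": list(cli_args),
                            },
                            "environment": {"secretVariables": dict(secrets)},
                        }
                    ]
                }
            }
        ],
        # Key names copied from the example job description above.
        "allocationPolicy": {"service_account": {"email": service_account_email}},
        "logsPolicy": {"destination": "CLOUD_LOGGING"},
    }


# Hypothetical usage: every value here is a placeholder.
job = make_batch_job(
    image_tag="example-tag",
    cli_args=["./run_the_dang_build.py", "--git-target-branch", "nightly"],
    secrets={"PUDL_BOT_PAT": "projects/PROJECT_ID/secrets/SECRET_NAME/versions/VERSION"},
    service_account_email="some-special-service-account",
)
print(json.dumps(job, indent=2))
```

The GHA workflow would then write this JSON to a file and submit it as the job config.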

In-container script

This will be a Python script that replicates the functionality of gcp_pudl_etl.sh, except that it takes command-line arguments that correspond to the actual behaviors we want to customize (as opposed to inferring these from GH ref, etc. in the script as gcp_pudl_etl.sh does).

Example call:

$ ./run_the_dang_build.py --etl-config-file /path/to/etl_fast.yml --gcs-dest gs://foo --aws-dest s3://... --do-publish-datasette --current-git-ref nightly-YYYY-MM-DD --git-target-branch nightly

It will pick up secrets from the environment variables.
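For concreteness, here's a minimal sketch of the argument parsing, using the flags from the example call. The defaults, which flags are required, and the `get_secret` helper are assumptions, not decisions we've made.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the build script's CLI args (flag names from the example call)."""
    parser = argparse.ArgumentParser(description="Run a PUDL build (sketch).")
    parser.add_argument("--etl-config-file", required=True)
    parser.add_argument("--gcs-dest", required=True)
    # Assumption: leaving out --aws-dest means "don't publish to AWS".
    parser.add_argument("--aws-dest", default=None)
    parser.add_argument("--do-publish-datasette", action="store_true")
    parser.add_argument("--current-git-ref", required=True)
    # Assumption: leaving out --git-target-branch means "update no branch".
    parser.add_argument("--git-target-branch", default=None, choices=["nightly", "stable"])
    return parser.parse_args(argv)


def get_secret(name):
    """Read a secret from the environment, failing loudly if it is missing."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"Required secret {name} is not set in the environment")
    return value


# Example: an ad-hoc build that only publishes to GCS.
args = parse_args(
    ["--etl-config-file", "/path/to/etl_fast.yml", "--gcs-dest", "gs://foo", "--current-git-ref", "my-branch"]
)
print(args)
```

Because each flag maps to one behavior, the script never needs to look at the GitHub ref to decide what to do.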

Path to power

There's a bunch of things that have to happen - here's the order in which we should do things so that we get something useful out fast.

1. Update build-deploy-pudl.yml to use a new batch-configuration generation script to run the existing gcp_pudl_etl.sh in a container on Batch. We keep gcp_pudl_etl.sh largely unchanged, just getting rid of the VM shutdown logic. This means we have to do all the configuration of gcp_pudl_etl.sh with env vars. After this step, we can trigger build-deploy-pudl-batch to run the ETL on Batch, replacing the existing VM-based workflow. We'll be able to debug the nightly/stable issues we found in Fix errors in new build-deploy-pudl workflow #3183 with better logs. We can also start to see how increasing Batch job resources affects the job performance.
2. Port gcp_pudl_etl.sh to a new Python script, which takes semantically meaningful CLI args; pass these arguments instead of the various env vars from GHA. After this, we'll have a code structure that more closely mirrors the actual concepts behind our deploy process, which makes further changes easier.
3. Use Terraform to manage secrets in Google Secret Manager; then use those secrets in the Batch job. This helps our security posture and sets us up to use Google Secret Manager for more things in the future.

User flows

Running nightly builds

On a schedule, push nightly-YYYY-MM-DD tag.
build-deploy-pudl workflow kicks off on tag push, kicking off Google Batch job, which runs all the stuff.

Running stable builds

Manually push vYYYY.MM.DD tag.
build-deploy-pudl workflow kicks off on tag push, kicking off Google Batch job, which runs all the stuff.

Running ad-hoc builds

Manually push tag/branch that doesn't match any of the nightly/stable patterns.
Trigger manual run via workflow-dispatch, targeting tag/branch.

How to re-run a nightly build that failed?

Apply a fix, then push tag nightly-YYYY-MM-DD-fix-<num> - this will get treated as a nightly build.

Or, if you don't need to apply a code fix, just re-run the GHA workflow that failed / trigger workflow run manually based on the nightly-YYYY-MM-DD tag.
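The tag conventions in these flows could be turned into a build type with a couple of regexes. A minimal sketch, assuming zero-padded dates in the tag names; `classify_ref` and the pattern names are hypothetical.

```python
import re

# nightly-YYYY-MM-DD, optionally followed by -fix-<num> for re-runs after a fix.
NIGHTLY_RE = re.compile(r"^nightly-\d{4}-\d{2}-\d{2}(-fix-\d+)?$")
# vYYYY.MM.DD (assumes zero-padded month and day).
STABLE_RE = re.compile(r"^v\d{4}\.\d{2}\.\d{2}$")


def classify_ref(ref_name):
    """Return "nightly", "stable" or "ad-hoc" for a tag or branch name."""
    if NIGHTLY_RE.match(ref_name):
        return "nightly"
    if STABLE_RE.match(ref_name):
        return "stable"
    return "ad-hoc"


for name in ["nightly-2023-12-23", "nightly-2023-12-23-fix-1", "v2023.12.01", "some-feature-branch"]:
    print(name, "->", classify_ref(name))
```

Anything that doesn't match a pattern falls through to ad-hoc, so a typo in a tag name can't accidentally publish as nightly or stable.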

5 replies

bendnorman Dec 27, 2023
Maintainer Author

This sounds good to me! I have a few additional thoughts:

Our logging is kind of all over the place right now. We have the docker image build logs in GitHub Actions, some container logs captured and written to pudl-etl.log, all VM logs collected in Cloud Logging and notifications sent to Slack. To debug a failure we currently need to visit a handful of these logging locations to track down the issue. It would be nice if all logs from the image build, ETL, tests, and deployment steps were collected in one place. I assume there are more advanced ways to use Cloud Logging that allow us to collect all of these logs in one place and make them all searchable. We could also use Cloud Build so the image build logs are in Googly land. Looks like you can send structured logs from python to Cloud Logging if we want to add metadata to our ETL logs.
How do we want to improve the orchestration of the logic stored in gcp_pudl_etl.sh? The combination of if-else statements and env vars feels brittle. Could we use something like Google Cloud Workflows to specify a DAG of deployment tasks? Would it be appropriate to create a dagster job to manage the orchestration of these steps?
Currently, we're running the ETL using the pudl_etl and ferc_to_sqlite commands, which are wrappers around the dagster python API. This isn't ideal because our ETL configuration is split between python dictionaries and yaml files, and the dagster multiprocess python executor API is still experimental and kind of kludgy. While resolving Remove pudl_etl and ferc_to_sqlite commands in favor of dagster job execute #3161 isn't a requirement for moving towards this new deployment workflow you've specified @jdangerx, I think it would be a nice improvement.
How can we authenticate batch VMs to the Postgres dagster logs server? Is this a good time to tackle Spin up a postgres database within our nightly build container #3003?
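On the structured logging point, one low-dependency option might be to emit one JSON object per log line using only the standard library. This is a sketch, not something we've tested on Batch: the `JsonFormatter` class and the `labels` convention are made up, and whether Batch's log collection parses JSON lines into structured Cloud Logging fields is an assumption to verify (the google-cloud-logging client library would be the other route).

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object on one line."""

    def format(self, record):
        payload = {
            "severity": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        # Extra metadata passed via logger.info(..., extra={"labels": {...}}).
        payload.update(getattr(record, "labels", {}))
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("pudl.build_sketch")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# "build_id" and "step" are hypothetical metadata keys.
logger.info("ETL finished", extra={"labels": {"build_id": "example-build", "step": "etl"}})
```

With metadata like a build ID on every line, logs from different steps of the same build could be filtered together.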

I think we should write up a full design doc before we implement these changes.

zaneselvans Dec 27, 2023
Maintainer

I would really like to get the nightly/stable tagging and release-on-tag up and running sooner rather than later. Is there a minimal set of changes that make that happen before we dive into a bigger design discussion? Also who are we going to bill that process to?

The reason Batch came up again is the current VM deployment was exhibiting some very flaky behavior in this PR which made it impossible to update the nightly tag from within the docker container.

So far as I know, updating the nightly branch automatically contingent on the successful build outcome, plus getting rid of dev are the only things blocking the switch to the new release system.

bendnorman Dec 28, 2023
Maintainer Author

Sounds like step 1 from @jdangerx's comment would unblock #3140. If we are confident Batch is going to be a part of our CD system it seems like we can tackle step 1 without doing a design doc. There are a lot of pieces to this system so it would be nice to create a clear vision of what we're working towards.

@jdangerx do you have thoughts on how to authenticate the Batch VMs with the postgres database? Currently our two VMs have static IPs that are whitelisted.

jdangerx Dec 29, 2023
Maintainer

Yes, step 1 of "port everything over to Batch with minimal changes" would potentially fix some of the flakiness we were encountering with VM startup & container image loading.

To authenticate the batch VMs with postgres, I think we can investigate a couple options:

use private IP: we might need to do some twiddling to make sure everything is in the same VPC, not quite sure how the GCP default VPC works right now. I'd assume "everything you make is in the same VPC, unless you configure otherwise" but that might not be the case.
use the Cloud SQL Auth Proxy from within the container

I'm sort of partial to using Auth Proxy since that's the recommended route, though it will probably be good for us to learn how our VPC / private IP stuff works at some point.

zaneselvans · 2023-12-27T17:50:03Z

zaneselvans
Dec 27, 2023
Maintainer

For reasons that I don't understand at all, the current VM deployment setup seems to have gotten very flaky when combined with the new nightly tagging / branch migration setup, so it seems like that flakiness is currently what stands in the way of getting the nightly/stable branches + tagging / trunk-based development / release-on-tag setup working.

It might be helpful to have a slightly bigger outline of that chain of tasks and the order they need to happen in? Are there things other than:

Switch builds to use Batch so VMs aren't flaky
Turn on automatic update of nightly branch on successful nightly build
Pythonize the nightly build script
Fix the "success" vs. "failure" logic in the build script which is currently broken (earlier failed step can get overwritten by later successful step)
Switch to trunk-based development workflow and remove dev
Enable / test data+software release-on-tag
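On the "success" vs. "failure" item: one way to avoid a later successful step overwriting an earlier failure is to accumulate failures instead of tracking a single status variable. A minimal sketch with a hypothetical `run_steps` helper; whether the build should keep going after a failed step is a separate choice this sketch doesn't settle.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each (name, command) step in order and return the names that failed."""
    failed = []
    for name, command in steps:
        result = subprocess.run(command)
        if result.returncode != 0:
            # Record the failure; later successes never remove it.
            failed.append(name)
    return failed


# Stand-in commands: the first step fails, the second succeeds.
steps = [
    ("etl", [sys.executable, "-c", "import sys; sys.exit(1)"]),
    ("publish", [sys.executable, "-c", "pass"]),
]
failed_steps = run_steps(steps)
build_succeeded = not failed_steps
print("failed steps:", failed_steps, "build succeeded:", build_succeeded)
```

The overall build only counts as successful if the list of failed steps is empty at the end.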

It seems like pythonizing the build script can be deferred until the other deployment infrastructure has been updated.

Note that we'll distribute 2 different kinds of "build artifacts"

successful nightly builds & stable builds will get published to /nightly and /vYYYY.MM.DD, which only contain the public-facing subset of the files generated by the nightly builds. The SQLite files are zipped and there are no subdirectories containing additional files (e.g. CEMS outputs). These go both to AWS and GCP in a pudl.catalyst.coop bucket. These should always be exact mirrors of each other, so should it even be possible to configure other destinations or allow these to diverge from each other? Why do we want configuration here beyond specifying the path after the bucket name?
All builds (both failed and successful nightly and ad-hoc builds), including all build outputs whether or not they're intended for public consumption get saved to the private bucket: gs://nightly-build-outputs.catalyst.coop/$BUILD_ID before the outputs are cleaned up for distribution. This is for debugging / forensic use, and I don't think it should be conditional on tags or branches or workflow triggers. These outputs are kept around for 30 days.

0 replies

zaneselvans · 2024-01-02T19:44:36Z

zaneselvans
Jan 2, 2024
Maintainer

Another thing we should do in the new nightly builds setup is abort the build if there have been no changes to the codebase.

This should be easy once we get the nightly branch updating, since the nightly build workflow can just compare against the current checkout. This could save us a couple hundred dollars a month on cloud costs.
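Roughly, the check could compare two commit hashes and skip the build when they match. This is only a sketch of the idea, not what #3195 does; `rev_parse`, `should_skip_build` and the refs in the usage comment are hypothetical.

```python
import subprocess


def rev_parse(ref, repo_dir="."):
    """Return the commit hash that a branch, tag or other ref points at."""
    result = subprocess.run(
        # ^{commit} peels annotated tags down to the commit they point at.
        ["git", "rev-parse", f"{ref}^{{commit}}"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def should_skip_build(candidate_sha, last_built_sha):
    """Skip the build when the candidate commit was already built."""
    return candidate_sha == last_built_sha


# Hypothetical usage inside the nightly workflow's checkout:
#   if should_skip_build(rev_parse("HEAD"), rev_parse("origin/nightly")):
#       print("No changes since the last successful nightly build, skipping.")
```

This relies on the nightly branch only moving forward after a successful build, so a failed build would still get retried the next night.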

2 replies

zaneselvans Jan 2, 2024
Maintainer

It's not the most elegant thing in the world but I think I have this working in #3195 now.

bendnorman Jan 3, 2024
Maintainer Author

Ah yes this is a great idea
