# Reproducible environments for local dev, CI, and nightly builds #2979
---
Yeah, love these goals! And I think the specific bullets listed out are a good way to get there. I think we should consider a generic task runner like Invoke over …

It seems like most of this work is already done in #2968 - I think if we want to try using …
---
One note - if we make it harder/unsupported to use PUDL as a library, we are going to have to mess with how we use …
---
Recently we've had a slew of build issues resulting from changes in our upstream dependencies. We've also known for a while that our previously released `catalystcoop.pudl` packages don't always keep working, because of creeping dependency incompatibilities.

We've talked in the past year about treating PUDL like an application, not a library -- with the move to writing all of our derived outputs into the database that we publish, the main PUDL repository has become a centralized means of producing data, which is then distributed, rather than a tool that we expect others to use to produce data on their own. That said, the data we produce should ideally be reproducible: given a commit or tag from the main repo, we (or someone else down the line) should be able to generate the same data outputs. The biggest weak point in this aspiration has been our Python environment, which hasn't used individually pinned dependencies. We've archived Docker images in the past, but that's quite a heavyweight system for most people (including us).
In PR #2968 I've created a setup that uses `conda-lock` to create a reproducible conda environment, which refers to exact versions of released packages, including hashes, for all of our direct and indirect dependencies, based entirely on the dependencies specified in `pyproject.toml`. Using these conda environments, we should be able to use exactly the same software, operating on exactly the same input data (archived on Zenodo), to produce exactly the same outputs. The software environment shouldn't change unless we update the lockfile, and the lockfile is checked into the GitHub repo, so a given git commit or tag will always contain the full conda environment specification.

The trick with the conda lockfiles is that we need to be able to use them to manage our environment in several different places, which have different environment expectations:
- **Local development:** The `pudl-dev` conda environment needs to be created from the appropriate platform-specific rendered environment file (e.g. `environments/conda-osx-arm64.lock.yml`); see the sketch after this list.
- **Local testing:** If you use `pytest` directly, this works fine, since it runs in your local development environment. However, we're currently using Tox to do several distinct things: isolate the installation of the PUDL package from the repository, manage virtual environments that are separate from the conda environment and use `pip` to install dependencies, and store a bunch of script-like logic about which commands are run, which environment variables are set, and which sets of optional dependencies are installed for each test environment. Unfortunately, Tox doesn't really integrate with the conda lockfiles (the `tox-conda` extension is way out of date), so to use the locked conda environment for local testing, these Tox functionalities would need to either be abandoned or migrated to some other system.
- **CI on GitHub:** We're using the `mambaorg/setup-micromamba` action, so `micromamba` is available in CI, and we can use it to install the locked environment quickly from the master lockfile (no solver is invoked). This means that on the `conda-lockfile` branch the locked environment is already in place. However, Tox is then used to run the tests, which means `pip` is invoked to build another virtual environment, so we aren't actually using the locked environment for the tests.
- **Nightly builds:** On the `conda-lockfile` branch, the Docker image for the nightly builds has been switched to `mambaorg/micromamba`, which can build the locked conda environment very quickly using `micromamba` and the explicit master lockfile (no solver is invoked).

This is all working on the `conda-lockfile` branch already.
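To make this concrete, here's a rough sketch of how the lockfile workflow could look as Makefile targets. The target names, platform list, and file locations are illustrative assumptions on my part, not the exact setup from #2968:

```make
# Sketch only: target names, platforms, and paths are assumptions.

# Re-solve the environment from pyproject.toml, write the unified master
# lockfile, and render per-platform environment files that can later be
# installed without running a solver.
conda_lockfile:
	conda-lock lock \
		--file pyproject.toml \
		--platform linux-64 --platform osx-64 --platform osx-arm64 \
		--lockfile environments/conda-lock.yml
	cd environments && conda-lock render --kind env conda-lock.yml

# Create the local pudl-dev environment from the pre-rendered,
# platform-specific file (no solve required).
pudl_dev:
	mamba env create --name pudl-dev \
		--file environments/conda-osx-arm64.lock.yml
```

Since the master lockfile and the rendered files are all checked in, local dev, CI, and the nightly builds can all consume the very same artifacts.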
## One Scenario
- We standardize on conda-managed environments, with packages coming from `conda-forge`.
- We check the master lockfile and the rendered, platform-specific environment files into `environments/`.
- We keep using `pytest` directly to run the tests locally for debugging purposes.
- We move the commands currently stored in `tox.ini` into either the GitHub Actions workflow file (where they will definitely be run in CI) or into something like a Makefile so we can run them locally (and also potentially on GitHub Actions). E.g. `make docs` could do all the things that `tox -e docs` does now -- lint the docs, remove old docs builds, and recreate the docs using Sphinx (a sketch follows this list); see the catalog of all the commands in `tox.ini` below.
- In CI and the nightly builds, we use `micromamba` and the explicit master lockfile to create the Python environment that the tests or ETL run in.
- We migrate from `tox.ini` to a simpler, commonly used tool like a `Makefile`, and then use `make` to run various tests and builds locally and in CI on GitHub in a uniform way.
- We add `Makefile` targets for tasks like re-locking the conda environment, and use `make conda_lockfile` both locally and in scheduled GitHub Actions.
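As a concrete illustration of the `make docs` idea, a minimal target might look like this; the specific lint command (`doc8`) and directory layout are assumptions standing in for whatever `tox -e docs` actually runs today:

```make
# Sketch: stand-in for `tox -e docs`. The doc8 invocation and the docs/
# layout are assumptions about the current configuration.
docs:
	rm -rf docs/_build                            # remove old docs builds
	doc8 docs/                                    # lint the docs
	sphinx-build -b html docs docs/_build/html    # recreate the docs with Sphinx
```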
The above is just one possibility, but I think it would:

- Minimize the number of tools we need to manage our environments and run tasks (`conda-lock` and `make`).
- Keep all of our dependency specifications in `pyproject.toml`, which can be consumed by multiple other tools if need be (`pip`, `tox`, `conda-lock`, dependabot, etc.).

## Stuff in `tox.ini`
- `linters`: This is stuff that's already done by `pre-commit` / `pre-commit.ci`, and should also be happening in your IDE. It can just be removed.
- `docs`: Can easily be replaced with a Makefile target `make docs`.
- `unit` / `integration` / `minmax_rows` / `validate` / `jupyter` / `full_integration` / `full`: I think these `pytest` commands and compositions of `pytest` commands can be turned into `Makefile` targets pretty easily (see the sketch after this list).
- `ci`: Could be a high-level `Makefile` target which runs everything that would be run as part of CI, but locally. Could imagine it setting the `$PUDL_OUTPUT` and `$DAGSTER_HOME` environment variables to temporary directories, and running the `ferc_to_sqlite` and `pudl_etl` scripts in such a way as to gather coverage information, using the `etl_fast` inputs.
- `nuke`: Could be another high-level `Makefile` target that doesn't reset `$PUDL_OUTPUT` or `$DAGSTER_HOME` to a temporary directory, clobbers everything from scratch, and then runs all the tests, validations, etc.
- `get_unmapped_ids`: Could be a Makefile target `make unmapped_ids` that invokes the current `pytest` command (or a script which replaces it) and outputs the new IDs for mapping.
- `build` / `testrelease` / `release`: Can be replaced with the package-on-tag action that has been deployed in some of our other repositories, or set aside if we don't want to create a pip / conda installable version of `catalystcoop.pudl` going forward. (If we are going to distribute that package, it needs to be tested somewhere!)
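Here's a sketch of how a few of these could translate; every target name, test path, and settings-file location below is an assumption about the repo layout rather than a finished design:

```make
# Sketch: hypothetical Makefile stand-ins for some tox environments.
unit:
	pytest test/unit

integration:
	pytest test/integration

# Run what CI runs, but locally: point $PUDL_OUTPUT and $DAGSTER_HOME at
# throwaway directories, run the ETL scripts under coverage with the
# etl_fast inputs, then run the tests.
ci:
	tmpdir=$$(mktemp -d) && \
	export PUDL_OUTPUT=$$tmpdir/output DAGSTER_HOME=$$tmpdir/dagster && \
	mkdir -p $$PUDL_OUTPUT $$DAGSTER_HOME && \
	coverage run $$(which ferc_to_sqlite) src/pudl/package_data/settings/etl_fast.yml && \
	coverage run --append $$(which pudl_etl) src/pudl/package_data/settings/etl_fast.yml && \
	coverage run --append -m pytest test/unit test/integration
```

A `nuke` target would look like `ci` minus the temporary directories, pointed at the real `$PUDL_OUTPUT` and `$DAGSTER_HOME`.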
## Parallel ETL in CI; Don't make pytest build PUDL DB
We should update the `ferc_to_sqlite` and `pudl_etl` scripts to invoke Dagster with multiprocessing support, and take the database construction work away from the `pytest` fixtures (which was always kind of a hack), before running the integration tests with `--live-dbs`.
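For instance, the database construction could become an ordinary Makefile prerequisite instead of a fixture (a sketch: the settings-file paths are assumptions, and how the scripts would enable Dagster multiprocessing is still to be worked out):

```make
# Sketch: build the databases with the entry-point scripts rather than
# pytest fixtures, then test against the pre-built DBs with --live-dbs.
pudl_db:
	ferc_to_sqlite src/pudl/package_data/settings/etl_fast.yml
	pudl_etl src/pudl/package_data/settings/etl_fast.yml

integration_live: pudl_db
	pytest --live-dbs test/integration
```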
## Other options?
I'm sure there are other options we could explore (like moving from `tox` to the Python-based Nox). Any other thoughts or suggestions?