# Reproducible environments for local dev, CI, and nightly builds #2979
---
Yeah, love these goals! And I think the specific bullets listed out are a good way to get there. I think we should consider a generic task runner like Invoke over …

It seems like most of this work is already done in #2968 - I think if we want to try using …
---
One note - if we make it harder/unsupported to use PUDL as a library, we are going to have to mess with how we use …
---
Recently we've had a slew of build issues resulting from changes in our upstream dependencies. We've also known for a while that our previously released `catalystcoop.pudl` packages don't always keep working, because of creeping dependency incompatibilities.

We've talked in the past year about treating PUDL like an application, not a library -- with the move to writing all of our derived outputs into the database that we publish, the main PUDL repository has become a centralized means of producing data, which is then distributed, rather than a tool that we expect others to use to produce data on their own. That said, the data we produce should ideally be reproducible: given a commit or tag from the main repo, we (or someone else down the line) should be able to generate the same data outputs. The biggest weak point in this aspiration has been our Python environment, which hasn't used individually pinned dependencies. We've archived Docker images in the past, but that's quite a heavyweight system for most people (including us).
In PR #2968 I've created a setup that uses `conda-lock` to create a reproducible conda environment, which refers to exact versions of released packages, including hashes, for all of our direct and indirect dependencies, based entirely on the dependencies specified in `pyproject.toml`. Using these conda environments, we should be able to use exactly the same software, operating on exactly the same input data (archived on Zenodo), to produce exactly the same outputs. The software environment shouldn't change unless we update the lockfile, and the lockfile is checked into the GitHub repo, so a given git commit or tag will always contain the full conda environment specification.

The trick with the conda lockfiles is that we need to be able to use them to manage our environment in several different places, which have different environment expectations:
- **Local development:** The `pudl-dev` conda environment needs to be created from the appropriate platform-specific rendered environment file (e.g. `environments/conda-osx-arm64.lock.yml`); see the sketch after this list.
- **Local testing:** If you use `pytest` directly, this works fine, since it runs in your local development environment. However, we're currently using Tox to do several distinct things: isolate the installation of the PUDL package from the repository, manage virtual environments that are separate from the conda environment and use `pip` to install dependencies, and store a bunch of script-like logic about which commands are run, which environment variables are set, and which sets of optional dependencies are installed for each test environment. Unfortunately, Tox doesn't really integrate with the conda lockfiles (the `tox-conda` extension is way out of date), so to use the locked conda environment for local testing, these Tox functionalities would need to either be abandoned or migrated to some other system.
- **CI on GitHub:** We're using the `mambaorg/setup-micromamba` action, so `micromamba` is available in CI, and we can use it to install the locked environment quickly from the master lockfile (no solver is invoked). This means that on the `conda-lockfile` branch the locked environment is already in place. However, Tox is then used to run the tests, which means `pip` is invoked to build another virtual environment, so we aren't actually using the locked environment for the tests.
- **Nightly builds:** On the `conda-lockfile` branch, the Docker image for the nightly builds has been switched to `mambaorg/micromamba`, which can build the locked conda environment very quickly using `micromamba` and the explicit master lockfile (no solver is invoked).

This is all working on the `conda-lockfile` branch already.
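To make this concrete, here's a rough sketch of how the lockfile workflow could look as Makefile targets. The target names, platform list, and file locations are illustrative assumptions on my part, not the exact setup from #2968:

```make
# Sketch only: target names, platforms, and paths are assumptions.

# Re-solve the environment from pyproject.toml, write the unified master
# lockfile, and render per-platform environment files that can later be
# installed without running a solver.
conda_lockfile:
	conda-lock lock \
		--file pyproject.toml \
		--platform linux-64 --platform osx-64 --platform osx-arm64 \
		--lockfile environments/conda-lock.yml
	cd environments && conda-lock render --kind env conda-lock.yml

# Create the local pudl-dev environment from the pre-rendered,
# platform-specific file (no solve required).
pudl_dev:
	mamba env create --name pudl-dev \
		--file environments/conda-osx-arm64.lock.yml
```

Since the master lockfile and the rendered files are all checked in, local dev, CI, and the nightly builds can all consume the very same artifacts.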
## One Scenario
- We standardize on conda-managed environments, with packages coming from `conda-forge`.
- We check the master lockfile and the rendered, platform-specific environment files into `environments/`.
- We keep using `pytest` directly to run the tests locally for debugging purposes.
- We move the commands currently stored in `tox.ini` into either the GitHub Actions workflow file (where they will definitely be run in CI) or into something like a Makefile so we can run them locally (and also potentially on GitHub Actions). E.g. `make docs` could do all the things that `tox -e docs` does now -- lint the docs, remove old docs builds, and recreate the docs using Sphinx (a sketch follows this list); see the catalog of all the commands in `tox.ini` below.
- In CI and the nightly builds, we use `micromamba` and the explicit master lockfile to create the Python environment that the tests or ETL run in.
- We migrate from `tox.ini` to a simpler, commonly used tool like a `Makefile`, and then use `make` to run various tests and builds locally and in CI on GitHub in a uniform way.
- We add `Makefile` targets for tasks like re-locking the conda environment, and use `make conda_lockfile` both locally and in scheduled GitHub Actions.
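As a concrete illustration of the `make docs` idea, a minimal target might look like this; the specific lint command (`doc8`) and directory layout are assumptions standing in for whatever `tox -e docs` actually runs today:

```make
# Sketch: stand-in for `tox -e docs`. The doc8 invocation and the docs/
# layout are assumptions about the current configuration.
docs:
	rm -rf docs/_build                            # remove old docs builds
	doc8 docs/                                    # lint the docs
	sphinx-build -b html docs docs/_build/html    # recreate the docs with Sphinx
```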
The above is just one possibility, but I think it would:

- Minimize the number of tools we need to manage our environments and run tasks (`conda-lock` and `make`).
- Keep all of our dependency specifications in `pyproject.toml`, which can be consumed by multiple other tools if need be (`pip`, `tox`, `conda-lock`, dependabot, etc.).

## Stuff in `tox.ini`
- `linters`: This is stuff that's already done by `pre-commit` / `pre-commit.ci`, and should also be happening in your IDE. It can just be removed.
- `docs`: Can easily be replaced with a Makefile target `make docs`.
- `unit` / `integration` / `minmax_rows` / `validate` / `jupyter` / `full_integration` / `full`: I think these `pytest` commands and compositions of `pytest` commands can be turned into `Makefile` targets pretty easily (see the sketch after this list).
- `ci`: Could be a high-level `Makefile` target which runs everything that would be run as part of CI, but locally. Could imagine it setting the `$PUDL_OUTPUT` and `$DAGSTER_HOME` environment variables to temporary directories, and running the `ferc_to_sqlite` and `pudl_etl` scripts in such a way as to gather coverage information, using the `etl_fast` inputs.
- `nuke`: Could be another high-level `Makefile` target that doesn't reset `$PUDL_OUTPUT` or `$DAGSTER_HOME` to a temporary directory, clobbers everything from scratch, and then runs all the tests, validations, etc.
- `get_unmapped_ids`: Could be a Makefile target `make unmapped_ids` that invokes the current `pytest` command (or a script which replaces it) and outputs the new IDs for mapping.
- `build` / `testrelease` / `release`: Can be replaced with the package-on-tag action that has been deployed in some of our other repositories, or set aside if we don't want to create a pip / conda installable version of `catalystcoop.pudl` going forward. (If we are going to distribute that package, it needs to be tested somewhere!)
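Here's a sketch of how a few of these could translate; every target name, test path, and settings-file location below is an assumption about the repo layout rather than a finished design:

```make
# Sketch: hypothetical Makefile stand-ins for some tox environments.
unit:
	pytest test/unit

integration:
	pytest test/integration

# Run what CI runs, but locally: point $PUDL_OUTPUT and $DAGSTER_HOME at
# throwaway directories, run the ETL scripts under coverage with the
# etl_fast inputs, then run the tests.
ci:
	tmpdir=$$(mktemp -d) && \
	export PUDL_OUTPUT=$$tmpdir/output DAGSTER_HOME=$$tmpdir/dagster && \
	mkdir -p $$PUDL_OUTPUT $$DAGSTER_HOME && \
	coverage run $$(which ferc_to_sqlite) src/pudl/package_data/settings/etl_fast.yml && \
	coverage run --append $$(which pudl_etl) src/pudl/package_data/settings/etl_fast.yml && \
	coverage run --append -m pytest test/unit test/integration
```

A `nuke` target would look like `ci` minus the temporary directories, pointed at the real `$PUDL_OUTPUT` and `$DAGSTER_HOME`.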
## Parallel ETL in CI; Don't make pytest build PUDL DB
We should update the `ferc_to_sqlite` and `pudl_etl` scripts to invoke Dagster with multiprocessing support, and take the database construction work away from the `pytest` fixtures (which was always kind of a hack), before running the integration tests with `--live-dbs`.
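For instance, the database construction could become an ordinary Makefile prerequisite instead of a fixture (a sketch: the settings-file paths are assumptions, and how the scripts would enable Dagster multiprocessing is still to be worked out):

```make
# Sketch: build the databases with the entry-point scripts rather than
# pytest fixtures, then test against the pre-built DBs with --live-dbs.
pudl_db:
	ferc_to_sqlite src/pudl/package_data/settings/etl_fast.yml
	pudl_etl src/pudl/package_data/settings/etl_fast.yml

integration_live: pudl_db
	pytest --live-dbs test/integration
```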
## Other options?
I'm sure there are other options we could explore (like moving from `tox` to the Python-based Nox). Any other thoughts or suggestions?