Next-generation CI/CD pipelines with RunsOn #11001
--------- Co-authored-by: Hyunsu Cho <[email protected]>
I will re-visit the R tests after this PR is merged.
#10933 might be due to the particular setup of the RunsOn stack. Today I am seeing random occurrence of error
I tried to review this today by reading the following:
- the PR description
- the prior discussions on the PR
- the docs at https://xgboost--11001.org.readthedocs.build/en/11001/contrib/ci.html (thanks for that link!)
- the diff
But I'm sorry, this diff is just too large for me to understand what's going on. Is it possible to make this smaller, for example to only modify the workflows that affect the Python package? And to move to a separate PR seemingly unrelated changes like whitespace and adding type hints to Python code?
Things I haven't been able to figure out in this first pass:
- where / when are docker images and VM images built?
  - every commit to `master`? on a schedule? manually by maintainers?
- what's the relationship between the VM images (built with `packer`) and the container images (built with `docker` and pushed to ECR)?
- which logs from this PR should I be looking at to see jobs run on RunsOn?
- I still see BuildKite check statuses on this PR... is that expected?
Sorry for not leaving a more helpful review. If there are smaller, more self-contained pieces of this you want my feedback on please point me to them and I'd be happy to review.
Merging this early so that we can start monitoring for potential issues. We will continue our design discussion in #11046.
Suggestion to reviewers: Go to https://github.com/hcho3/xgboost/tree/xgboost-ci-ng to easily browse the workflow definitions and scripts. Also pay attention to the following:
See https://xgboost--11001.org.readthedocs.build/en/11001/contrib/ci.html for the overview.
In #8142, we migrated our CI/CD pipelines from Jenkins to BuildKite. BuildKite has served us well over the past 2 years.
BuildKite was superior to Jenkins in the following ways:
However, over time, the BuildKite setup has entailed the following disadvantages:
- … `tests/buildkite/infrastructure`.) The infra scripts require specialized knowledge of the cloud, and currently I am the only person who understands these scripts.
- … `tests/buildkite` and `tests/ci_build`, and it was unclear what kinds of files belonged to each directory. (Better organize scripts for CI and maintenance tasks #9896)

I propose a new solution for hosting our CI/CD pipelines: RunsOn. RunsOn is a web app that lets us host GitHub Actions runners on Amazon EC2. By migrating to RunsOn, we address all the issues described above.
- … `ops/packer`, but at least this bit is easy to understand for newcomers. (It basically provisions an EC2 instance, runs a few commands, and saves the machine image for later use.)
- … `packer build linux.pck.hcl`. Unlike BuildKite, RunsOn's CloudFormation template allows us to specify machine images using wildcards (not exact IDs), meaning that we can refresh machine images while keeping the rest of the cloud infra the same. Furthermore, we should be able to set up an automated pipeline that runs Packer periodically.
- … `ops/` directory, as follows:
  - `ops/conda_env`: Definitions for Conda environments
  - `ops/packer`: Packer scripts to build machine images for Amazon EC2
  - `ops/patch`: Patch files
  - `ops/pipeline`: Shell scripts defining CI/CD pipelines. Most of these scripts can be run locally (to assist with development and debugging); those that must be run in the CI are guarded with an invocation of `enforce-ci.sh`.
  - `ops/script`: Various utility scripts useful for testing.
  - `ops/docker`: Dockerfiles to define containers
  - `ops/docker_build.py` / `ops/docker_run.py`: replaces `ci_build.sh` with a much friendlier UI. Allows users and CI pipelines to run arbitrary tasks inside containers.

Other improvements:
- … `ops/docker/ci_container.yml`) that keeps track of all build args to the containers.
- … `.github/workflows/jvm_tests.yml`.

The only disadvantage of RunsOn is that we no longer have the option to require manual approvals before CI workflows run. So we need to retain `enforce_daily_budget.sh` for cost control. (It's currently broken; I will fix it and bring it back in a different PR.) In addition, I temporarily disabled Dependabot until we figure out how to prevent the bot from filing a torrent of pull requests (which would explode our CI budget!).

Closes #10229
Closes #6306
Closes #9896
Closes #7525 (Now we build `libxgboost4j.so` once, using the cheapest instance.)

Progress towards #8311: A future pull request will automatically run `packer build` to rotate machine images.
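
To make the wildcard idea concrete, here is a minimal Python sketch of why selecting machine images by name pattern (rather than by exact ID) lets us rotate images without touching the rest of the infra. This is an illustration only, not how RunsOn's CloudFormation template actually resolves images: the `MachineImage` type, the `resolve_image` helper, the image names, and the "newest match wins" rule are all assumptions made up for this example.

```python
from dataclasses import dataclass
from datetime import datetime
from fnmatch import fnmatchcase
from typing import Optional


@dataclass(frozen=True)
class MachineImage:
    """Hypothetical record for a machine image (not a RunsOn or AWS type)."""
    image_id: str
    name: str
    created: datetime


def resolve_image(pattern: str, images: list[MachineImage]) -> Optional[MachineImage]:
    """Return the newest image whose name matches the wildcard pattern.

    Assumption: "newest match wins". Returns None if nothing matches.
    """
    matches = [img for img in images if fnmatchcase(img.name, pattern)]
    return max(matches, key=lambda img: img.created, default=None)


# Hypothetical images; a periodic `packer build` would append newer entries.
images = [
    MachineImage("ami-aaaa", "example-ci-linux-20240101", datetime(2024, 1, 1)),
    MachineImage("ami-bbbb", "example-ci-linux-20240201", datetime(2024, 2, 1)),
    MachineImage("ami-cccc", "example-ci-windows-20240301", datetime(2024, 3, 1)),
]

selected = resolve_image("example-ci-linux-*", images)
print(selected.image_id)  # prints "ami-bbbb"
```

Because the configuration holds only the pattern `example-ci-linux-*`, publishing a newer image under a matching name changes what gets selected without any edit to the configuration itself.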