Next-generation CI/CD pipelines with RunsOn #11001

Merged on Dec 6, 2024 (98 commits)

Collaborator

@hcho3 hcho3 commented Nov 19, 2024

Suggestion to reviewers: Go to https://github.com/hcho3/xgboost/tree/xgboost-ci-ng to easily browse the workflow definitions and scripts. Also pay attention to the following:

  1. Overall organization of workflows
  2. How containers are defined and constructed

See https://xgboost--11001.org.readthedocs.build/en/11001/contrib/ci.html for the overview.


In #8142, we migrated our CI/CD pipelines from Jenkins to BuildKite. BuildKite has served us well in the past 2 years.
BuildKite was superior to Jenkins in the following ways:

  • The provisioning of cloud components is now documented in scripts that are stored as part of the XGBoost git repository.
  • We are no longer subject to frequent failures and security issues from Jenkins plug-ins.

However, over time, the BuildKite setup has entailed the following disadvantages:

  1. We ended up with lots of infra scripts and glue code that provision cloud components. (See tests/buildkite/infrastructure.) The infra scripts require specialized knowledge of the cloud, and currently I am the only person who understands these scripts.
  2. The provisioning of worker images was not fully automated. The setup steps were documented in bootstrap scripts, but someone had to run them manually. Also, BuildKite's CloudFormation template required that we specify exact IDs for machine images, meaning that we had to re-provision the entire cloud stack every time we regenerated the machine images. In short: it was cumbersome to generate new machine images, so over time we tended not to update them, which led to bit rot.
  3. Developers who are familiar with GitHub Actions face a substantial learning curve when configuring BuildKite, as BuildKite is a more niche application.
  4. Over time, we added more and more files to tests/buildkite and tests/ci_build, and it became unclear which kinds of files belonged in each directory. (Better organize scripts for CI and maintenance tasks #9896)

I propose a new solution for hosting our CI/CD pipelines: RunsOn. RunsOn is a web app that lets us host GitHub Actions runners on Amazon EC2. By migrating to RunsOn, we address all the issues described above.

  1. We got rid of (almost) all specialized infra scripts. The only remaining infra script is the Packer build definition in ops/packer, but at least this bit is easy to understand for newcomers. (It basically provisions an EC2 instance, runs a few commands, and saves the machine image for later use.)
  2. It's now extremely easy to refresh worker images: just run packer build linux.pkr.hcl. Unlike BuildKite, RunsOn's CloudFormation template allows us to specify machine images using wildcards (not exact IDs), meaning that we can refresh machine images while keeping the rest of the cloud infra the same. Furthermore, we should be able to set up an automated pipeline that runs Packer periodically.
  3. Now we can configure all CI/CD pipelines using the familiar syntax of GitHub Actions. Developers now have access to the rich trove of existing workflow actions.
  4. This pull request re-organizes files inside the ops/ directory, as follows:
  • ops/conda_env: Definitions for Conda environments
  • ops/packer: Packer scripts to build machine images for Amazon EC2
  • ops/patch: Patch files
  • ops/pipeline: Shell scripts defining CI/CD pipelines. Most of these scripts can be run locally (to assist with development and debugging); those that must be run in the CI are guarded with an invocation of enforce-ci.sh.
  • ops/script: Various utility scripts useful for testing.
  • ops/docker: Dockerfiles to define containers
  • ops/docker_build.py / ops/docker_run.py: Replace ci_build.sh with a much friendlier interface. They allow users and CI pipelines to run arbitrary tasks inside containers.
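
The guard mentioned for ops/pipeline can be pictured roughly as follows. This is a hypothetical Python sketch, not the contents of enforce-ci.sh (a shell script whose body is not shown here). It relies only on the fact that GitHub Actions sets the environment variable GITHUB_ACTIONS to "true" on its runners; the function names are invented for illustration.

```python
import os


def running_in_github_actions(env=None):
    """Return True when the environment looks like a GitHub Actions runner.

    GitHub Actions sets GITHUB_ACTIONS=true for every job. The function
    name is hypothetical and not part of the XGBoost repository.
    """
    env = os.environ if env is None else env
    return env.get("GITHUB_ACTIONS") == "true"


def require_ci(env=None):
    """Raise an error unless we appear to be running inside the CI."""
    if not running_in_github_actions(env):
        raise RuntimeError("This step must be run inside the CI.")
```

A CI-only script would call require_ci() as its first step, while scripts meant for local debugging would simply skip the call.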

Other improvements:

  • We now have a single YAML file (ops/docker/ci_container.yml) that keeps track of all build args to the containers.
  • Windows workers now launch much faster, using the Fast Launch feature. The launch time is reduced from 10 minutes to 2 minutes.
  • RunsOn uses Spot instances automatically to achieve savings of 20-50%. An upcoming version of RunsOn will add the ability to retry jobs automatically if Spot instances are interrupted.
  • In a single workflow file, we can mix jobs that run on GitHub-hosted and self-hosted runners, allowing us to group tasks that are related. For example, see .github/workflows/jvm_tests.yml.
  • GitHub Actions provides a nice graph visualization for pipelines. [Two screenshots, 2024-11-20]
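
To illustrate why a single file of build args is convenient, here is a minimal sketch of turning such a mapping into a docker build invocation. The schema of ops/docker/ci_container.yml is not shown in this description, so the dictionary layout, the container name, and the build-arg names below are all made up; only the Docker flags (-t, -f, --build-arg) are real.

```python
def docker_build_command(container_id, spec, dockerfile, context_dir="."):
    """Assemble a `docker build` command line from a build-arg mapping.

    `spec` is assumed to look like {"build_args": {"NAME": "value", ...}};
    this layout is hypothetical, not the actual ci_container.yml schema.
    """
    cmd = ["docker", "build", "-t", container_id, "-f", dockerfile]
    # Sort the args so the generated command is reproducible.
    for key, value in sorted(spec.get("build_args", {}).items()):
        cmd += ["--build-arg", f"{key}={value}"]
    cmd.append(context_dir)
    return cmd


# Hypothetical entry, standing in for one container defined in the YAML file.
example_spec = {"build_args": {"EXAMPLE_VERSION": "1.0", "ANOTHER_ARG": "on"}}
example_cmd = docker_build_command(
    "example_container", example_spec, "Dockerfile.example"
)
print(" ".join(example_cmd))
```

Keeping the args in one mapping means a version bump touches a single entry rather than every script that builds the container.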

The only disadvantage of RunsOn is that we no longer have the option to require manual approvals before CI workflows run. So we need to retain enforce_daily_budget.sh for cost control. (It's currently broken; I will fix it and bring it back in a different PR.) In addition, I temporarily disabled Dependabot until we figure out how to prevent the bot from filing a torrent of pull requests (which would explode our CI budget!).

Closes #10229
Closes #6306
Closes #9896
Closes #7525 (Now we build libxgboost4j.so once, using the cheapest instance.)

Progress towards #8311: A future pull request will automatically run packer build to rotate machine images.
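
Rotation is feasible because machine images are selected by wildcard rather than by exact ID. The sketch below shows the idea in Python with made-up image records. How RunsOn actually resolves a wildcard is not described here, so "the newest matching image wins" is an assumption, as are the function name and the image names.

```python
from fnmatch import fnmatchcase


def pick_latest_image(images, name_pattern):
    """Return the most recently created image whose name matches the pattern.

    `images` is a list of dicts with "id", "name", and "created" keys, where
    "created" is an ISO-8601 UTC timestamp (so string order is time order).
    """
    matching = [img for img in images if fnmatchcase(img["name"], name_pattern)]
    if not matching:
        raise LookupError(f"no image matches {name_pattern!r}")
    return max(matching, key=lambda img: img["created"])


# Made-up records for illustration only.
example_images = [
    {"id": "ami-old", "name": "example-linux-v1", "created": "2024-11-01T00:00:00Z"},
    {"id": "ami-new", "name": "example-linux-v2", "created": "2024-12-01T00:00:00Z"},
    {"id": "ami-win", "name": "example-windows-v1", "created": "2024-12-05T00:00:00Z"},
]
chosen = pick_latest_image(example_images, "example-linux-*")
print(chosen["id"])
```

Under this assumption, publishing a new image under a matching name is enough for subsequent jobs to pick it up, with no change to the rest of the stack.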

@hcho3 hcho3 force-pushed the xgboost-ci-ng branch 6 times, most recently from 28f51be to 0783f38 on November 29, 2024 08:46
Collaborator Author

hcho3 commented Nov 29, 2024

I will re-visit the R tests after this PR is merged.

Collaborator Author

hcho3 commented Dec 1, 2024

#10933 might be due to the particular setup of the RunsOn stack. Today I am seeing random occurrences of the error "cudaErrorNoDevice: no CUDA-capable device is detected" in the JVM tests: https://github.com/dmlc/xgboost/actions/runs/12101273094/job/33741364702. For now I am adding the --privileged flag to Docker runs to see whether that makes the error go away.
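
For readers unfamiliar with the workaround, it amounts to adding one flag to the docker run command line. Below is a minimal sketch with a hypothetical helper; it is not the actual ops/docker_run.py, and the image name and mount paths are placeholders. The flags used (--rm, -v, -w, --gpus all, --privileged) are standard Docker CLI options.

```python
def docker_run_command(image, command, *, host_dir, workdir="/workspace",
                       use_gpus=False, privileged=False):
    """Assemble a `docker run` command line (hypothetical helper).

    The host directory is mounted at `workdir` inside the container, and
    the given command is executed there.
    """
    cmd = ["docker", "run", "--rm", "-v", f"{host_dir}:{workdir}", "-w", workdir]
    if use_gpus:
        cmd += ["--gpus", "all"]  # expose all GPUs to the container
    if privileged:
        cmd.append("--privileged")  # the workaround under test
    cmd.append(image)
    cmd += list(command)
    return cmd


example_run = docker_run_command(
    "example-image:latest", ["nvidia-smi"],
    host_dir="/src", use_gpus=True, privileged=True,
)
print(" ".join(example_run))
```

Note that --privileged grants the container broad access to the host, so it is best treated as a diagnostic step rather than a permanent setting.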

Contributor

@jameslamb jameslamb left a comment

I tried to review this today by reading the following:

But I'm sorry, this diff is just too large for me to understand what's going on. Is it possible to make this smaller, for example to only modify the workflows that affect the Python package? And to move to a separate PR seemingly unrelated changes like whitespace and adding type hints to Python code?

Things I haven't been able to figure out in this first pass:

  • where / when are docker images and VM images built?
    • every commit to master? on a schedule? manually by maintainers?
  • what's the relationship between the VM images (built with packer) and the container images (built with docker and pushed to ECR)?
  • which logs from this PR should I be looking at to see jobs run on RunsOn?
  • I still see BuildKite check statuses on this PR... is that expected?

Sorry for not leaving a more helpful review. If there are smaller, more self-contained pieces of this you want my feedback on please point me to them and I'd be happy to review.

.github/workflows/freebsd.yml (resolved)
ops/docker_run.py (resolved)
.github/workflows/python_wheels_macos.yml (resolved)
Collaborator Author

hcho3 commented Dec 3, 2024

where / when are docker images and VM images built?

VM images are built manually, by running packer build linux.pkr.hcl.
On the other hand, Docker images are built much more frequently, at every commit and with every pull request.

what's the relationship between the VM images (built with packer) and the container images (built with docker and pushed to ECR)?

VM images are expected to be refreshed sparingly, whereas container images are to be refreshed at every commit. The VM image contains the minimal set of drivers and system software so that it can run the containers (possibly with access to GPUs).

which logs from this PR should I be looking at to see jobs run on RunsOn?

Jobs that run on RunsOn will have the following entry:
Screenshot 2024-12-02 192555

I still see BuildKite check statuses on this PR... is that expected?

Yes, BuildKite status checks will remain until we uninstall BuildKite completely from the repo.

Collaborator Author

hcho3 commented Dec 6, 2024

Merging this early so that we can start monitoring for potential issues. We will continue our design discussion in #11046.

@hcho3 hcho3 merged commit 2043679 into dmlc:master Dec 6, 2024
40 of 44 checks passed
@hcho3 hcho3 deleted the xgboost-ci-ng branch December 6, 2024 19:32