bump NCCL floor to 2.18.1.1, relax PyTorch pin #218
Conversation
@@ -285,13 +285,13 @@ dependencies:
       # If conda-forge supports the new cuda-* packages for CUDA 11.8
       # at some point, then we can fully support/properly specify
       # this environment.
-      - pytorch=2.0.0
+      - &pytorch pytorch>=2.0,<2.4.0a0
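For readers skimming the diff: `&pytorch` is a YAML anchor, which lets the same version constraint be referenced later in the file via the alias `*pytorch` instead of being duplicated. A minimal, hypothetical sketch of the pattern (the real dependencies file in this repo is laid out differently):

```yaml
# hypothetical example of a YAML anchor/alias; not the project's actual file
conda_packages:
  - &pytorch pytorch>=2.0,<2.4.0a0   # define the constraint once, under the anchor "pytorch"
  - cpuonly
pip_packages:
  - *pytorch                         # alias: resolves to the same "pytorch>=2.0,<2.4.0a0" string
```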
Docs builds here have been failing like this:
LibMambaUnsatisfiableError: Encountered problems while solving:
- nothing provides _python_rc needed by python-3.12.0rc3-rc3_hab00c5b_1_cpython
Could not solve for environment specs
The following packages are incompatible
├─ cpuonly is requested and can be installed;
├─ python 3.12** is installable with the potential options
│ ├─ python [3.12.0|3.12.1|...|3.12.5] would require
│ │ └─ python_abi 3.12.* *_cp312, which can be installed;
│ └─ python 3.12.0rc3 would require
│ └─ _python_rc, which does not exist (perhaps a missing channel);
└─ pytorch 2.0.0** is not installable because there are no viable options
├─ pytorch 2.0.0 would require
│ └─ python_abi 3.8.* *_cp38, which conflicts with any installable versions previously reported;
├─ pytorch 2.0.0 would require
│ └─ python >=3.10,<3.11.0a0 , which conflicts with any installable versions previously reported;
├─ pytorch 2.0.0 would require
│ └─ cpuonly <0 , which conflicts with any installable versions previously reported;
├─ pytorch 2.0.0 would require
│ └─ python >=3.8,<3.9.0a0 , which conflicts with any installable versions previously reported;
├─ pytorch 2.0.0 would require
│ └─ python >=3.9,<3.10.0a0 , which conflicts with any installable versions previously reported;
├─ pytorch 2.0.0 would require
│ └─ python_abi 3.10.* *_cp310, which conflicts with any installable versions previously reported;
├─ pytorch 2.0.0 would require
│ └─ python_abi 3.11.* *_cp311, which conflicts with any installable versions previously reported;
├─ pytorch 2.0.0 would require
│ └─ python_abi 3.9.* *_cp39, which conflicts with any installable versions previously reported;
└─ pytorch 2.0.0 would require
└─ __cuda, which is missing on the system.
That's happening because:

- the docker image rapidsai/ci-conda:latest was updated to Python 3.12 in "update 'latest' tags to Python 3.12 images" (ci-imgs#188)
- wholegraph is pinning pytorch=2.0.0, and there are no Python 3.12 packages for that version on conda-forge (https://anaconda.org/conda-forge/pytorch/files?version=2.0.0)

This PR proposes loosening the pin on pytorch here. cugraph currently pins to pytorch>=2.0,<2.2.0a0 (code link), but might go to <2.4.0a0 in rapidsai/cugraph#4615 (comment).
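To make the failure mode concrete, here is a minimal, hypothetical conda environment of the shape that cannot solve with the exact `pytorch=2.0.0` pin on Python 3.12, and that the relaxed range is intended to allow (assuming conda-forge publishes Python 3.12 builds for at least one pytorch version inside `>=2.0,<2.4.0a0`):

```yaml
# minimal sketch only; the project's real environment files contain many more packages
name: docs-build-sketch
channels:
  - conda-forge
dependencies:
  - python=3.12
  # - pytorch=2.0.0            # no Python 3.12 builds exist on conda-forge for this version
  - pytorch>=2.0,<2.4.0a0      # relaxed pin lets the solver pick a newer, py3.12-compatible build
```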
Agree this seems sensible (especially for this release). In 24.12 we can look at relaxing this further.
I'll defer to a wholegraph dev on the pytorch change in case there are any reasons not to do that, but the nccl update looks fine.
Thanks! @linhu-nv could you please review here?
This seems good to me. Thanks!
/merge
Follow-up to #218. This bumps the NCCL floor here slightly higher, to `>=2.19`. Part of a RAPIDS-wide update of that floor for the 24.10 release. See rapidsai/build-planning#102 (comment) for context. cc @linhu-nv for awareness

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- https://github.com/jakirkham

URL: #223
Contributes to rapidsai/build-planning#102
Fixes #217

Notes for Reviewers

How I tested this

Temporarily added a CUDA 11.4.3 test job to CI here (the same specs as the failing nightly) by pointing at the branch from rapidsai/shared-workflows#246. Observed the exact same failures with CUDA 11.4 reported in rapidsai/build-planning#102. (build link)

Pushed a commit adding a floor of nccl>=2.18.1.1. Saw all tests pass with CUDA 11.4 😁 (build link)
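For illustration only (the exact file and list the commit touches are not shown in this excerpt), the floor amounts to a conda dependency entry like the sketch below; the follow-up PR #223 then raises it to `>=2.19`:

```yaml
# hypothetical excerpt of a conda dependencies list; actual file layout may differ
dependencies:
  - nccl>=2.18.1.1   # floor added in this PR (#218)
  # raised further in the follow-up PR (#223):
  # - nccl>=2.19
```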