This repository has been archived by the owner on Nov 25, 2024. It is now read-only.

bump NCCL floor to 2.18.1.1, relax PyTorch pin #218

Merged
merged 5 commits into rapidsai:branch-24.10 from fix/update-nccl
Sep 25, 2024

@jameslamb jameslamb commented Sep 20, 2024

Contributes to rapidsai/build-planning#102

Fixes #217

Notes for Reviewers

How I tested this

Temporarily added a CUDA 11.4.3 test job to CI here (the same specs as the failing nightly), by pointing at the branch from rapidsai/shared-workflows#246.

Observed the same CUDA 11.4 failures that were reported in rapidsai/build-planning#102.

...
  + nccl                     2.10.3.1  hcad2f07_0                  rapidsai-nightly     125MB
...
./WHOLEGRAPH_CSR_WEIGHTED_SAMPLE_WITHOUT_REPLACEMENT_TEST: symbol lookup error: /opt/conda/envs/test/bin/gtests/libwholegraph/../../../lib/libwholegraph.so: undefined symbol: ncclCommSplit
sh -c exec "$0" ./WHOLEMEMORY_HANDLE_TEST 
./WHOLEMEMORY_HANDLE_TEST: symbol lookup error: /opt/conda/envs/test/bin/gtests/libwholegraph/../../../lib/libwholegraph.so: undefined symbol: ncclCommSplit
sh -c exec "$0" ./GRAPH_APPEND_UNIQUE_TEST 

(build link)
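
The `undefined symbol: ncclCommSplit` errors above indicate that the NCCL library resolved at run time does not export a function that `libwholegraph.so` expects. Below is a minimal Python sketch of how one might check for that symbol directly. It assumes an environment where `ctypes` can locate a shared library named `nccl`. The helper names `library_exports` and `nccl_exports` are hypothetical and not part of this repository.

```python
import ctypes
import ctypes.util


def library_exports(lib, symbol):
    """Return True if the loaded library exposes `symbol`."""
    # ctypes raises AttributeError when the dynamic loader cannot resolve
    # the name, which hasattr() reports as False.
    return hasattr(lib, symbol)


def nccl_exports(symbol="ncclCommSplit", lib_name="nccl"):
    """Return True/False, or None if the library cannot be found or loaded."""
    path = ctypes.util.find_library(lib_name)
    if path is None:
        return None
    try:
        lib = ctypes.CDLL(path)
    except OSError:
        return None
    return library_exports(lib, symbol)


if __name__ == "__main__":
    print("ncclCommSplit exported:", nccl_exports())
```

In an environment like the failing one, where `nccl 2.10.3.1` was installed, a check along these lines would be expected to report `False`, consistent with the symbol lookup errors in the log.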

Pushed a commit adding a floor of nccl>=2.18.1.1. Saw all tests pass with CUDA 11.4 😁

...
  + nccl                     2.22.3.1  hee583db_1                  conda-forge          131MB
...
(various log messages showing all tests passed)

(build link)
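
The two solver outputs above differ only in which NCCL version was selected. A short Python sketch of the floor comparison follows. It assumes purely numeric dotted versions like the ones in these logs and is not conda's actual version-matching logic. The function names are hypothetical.

```python
def parse_version(version):
    """Turn a dotted version such as '2.18.1.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def meets_floor(installed, floor="2.18.1.1"):
    """Return True if `installed` is at or above `floor`."""
    # Tuples of ints compare component by component, so 2.9 < 2.18,
    # which a plain string comparison would get wrong.
    return parse_version(installed) >= parse_version(floor)


if __name__ == "__main__":
    for candidate in ("2.10.3.1", "2.22.3.1"):
        print(candidate, meets_floor(candidate))
```

With the floor from this PR, `2.10.3.1` (the version in the failing job) is rejected and `2.22.3.1` (the version in the passing job) is accepted.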

@jameslamb jameslamb added the improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) labels Sep 20, 2024
@jameslamb jameslamb changed the title WIP: bump nccl floor to 2.18.1.1 WIP: bump NCCL floor to 2.18.1.1 Sep 20, 2024
@jameslamb jameslamb changed the title WIP: bump NCCL floor to 2.18.1.1 bump NCCL floor to 2.18.1.1 Sep 20, 2024
@jameslamb jameslamb requested a review from linhu-nv September 20, 2024 21:51
@jameslamb jameslamb marked this pull request as ready for review September 20, 2024 21:51
@jameslamb jameslamb requested a review from a team as a code owner September 20, 2024 21:51
@jameslamb jameslamb requested a review from msarahan September 20, 2024 21:51
@@ -285,13 +285,13 @@ dependencies:
   # If conda-forge supports the new cuda-* packages for CUDA 11.8
   # at some point, then we can fully support/properly specify
   # this environment.
-  - pytorch=2.0.0
+  - &pytorch pytorch>=2.0,<2.4.0a0
Member Author

The docs builds here have been failing like this:

LibMambaUnsatisfiableError: Encountered problems while solving:
  - nothing provides _python_rc needed by python-3.12.0rc3-rc3_hab00c5b_1_cpython

Could not solve for environment specs
The following packages are incompatible
├─ cpuonly is requested and can be installed;
├─ python 3.12**  is installable with the potential options
│  ├─ python [3.12.0|3.12.1|...|3.12.5] would require
│  │  └─ python_abi 3.12.* *_cp312, which can be installed;
│  └─ python 3.12.0rc3 would require
│     └─ _python_rc, which does not exist (perhaps a missing channel);
└─ pytorch 2.0.0**  is not installable because there are no viable options
   ├─ pytorch 2.0.0 would require
   │  └─ python_abi 3.8.* *_cp38, which conflicts with any installable versions previously reported;
   ├─ pytorch 2.0.0 would require
   │  └─ python >=3.10,<3.11.0a0 , which conflicts with any installable versions previously reported;
   ├─ pytorch 2.0.0 would require
   │  └─ cpuonly <0 , which conflicts with any installable versions previously reported;
   ├─ pytorch 2.0.0 would require
   │  └─ python >=3.8,<3.9.0a0 , which conflicts with any installable versions previously reported;
   ├─ pytorch 2.0.0 would require
   │  └─ python >=3.9,<3.10.0a0 , which conflicts with any installable versions previously reported;
   ├─ pytorch 2.0.0 would require
   │  └─ python_abi 3.10.* *_cp310, which conflicts with any installable versions previously reported;
   ├─ pytorch 2.0.0 would require
   │  └─ python_abi 3.11.* *_cp311, which conflicts with any installable versions previously reported;
   ├─ pytorch 2.0.0 would require
   │  └─ python_abi 3.9.* *_cp39, which conflicts with any installable versions previously reported;
   └─ pytorch 2.0.0 would require
      └─ __cuda, which is missing on the system.

(build link)

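because the environment requests Python 3.12, and the pytorch 2.0.0 packages listed above only support Python 3.8 through 3.11.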

This PR proposes loosening the pin on pytorch here.

cugraph currently pins to pytorch>=2.0,<2.2.0a0 (code link), but might go to <2.4.0a0 in rapidsai/cugraph#4615 (comment).
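
The upper bound in these pins is written as `<2.4.0a0` rather than `<2.4.0`. Under an ordering where pre-releases sort before the final release, the `a0` suffix keeps out 2.4.0 release candidates as well as 2.4.0 itself. The Python sketch below is a simplified model of that ordering, not conda's real implementation. It only understands `a`, `b`, and `rc` suffixes, and the names `version_key` and `satisfies_pin` are hypothetical.

```python
import re

# Matches e.g. "2.0", "2.3.1", "2.4.0a0", "2.4.0rc1" (simplified on purpose).
_VERSION_RE = re.compile(r"^(\d+(?:\.\d+)*)(?:(a|b|rc)(\d+))?$")


def version_key(version):
    """Build a sortable key where pre-releases sort before the final release."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    release = [int(part) for part in match.group(1).split(".")]
    # Drop trailing zeros so that "2.0" and "2.0.0" compare equal.
    while len(release) > 1 and release[-1] == 0:
        release.pop()
    if match.group(2) is None:
        return (tuple(release), 1, "", 0)  # final release
    return (tuple(release), 0, match.group(2), int(match.group(3)))


def satisfies_pin(version, lower="2.0", upper="2.4.0a0"):
    """Check `lower <= version < upper` under the simplified ordering."""
    return version_key(lower) <= version_key(version) < version_key(upper)


if __name__ == "__main__":
    for candidate in ("2.0.0", "2.3.1", "2.4.0rc1", "2.4.0"):
        print(candidate, satisfies_pin(candidate))
```

In this model, `2.4.0rc1` is rejected by the `<2.4.0a0` bound but would slip through an upper bound of plain `2.4.0`.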

Member

Agree this seems sensible (especially for this release). In 24.12 we can look at relaxing this further.

@jameslamb jameslamb changed the title bump NCCL floor to 2.18.1.1 bump NCCL floor to 2.18.1.1, relax PyTorch pin Sep 23, 2024
Contributor

@vyasr vyasr left a comment

I'll defer to a wholegraph dev on the pytorch change in case there are any reasons not to do that, but the nccl update looks fine.

@jameslamb
Member Author

Thanks!

@linhu-nv could you please review here?

Contributor

@linhu-nv linhu-nv left a comment

This seems good to me. Thanks!

@jameslamb
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 73266e2 into rapidsai:branch-24.10 Sep 25, 2024
48 checks passed
@jameslamb jameslamb deleted the fix/update-nccl branch September 25, 2024 14:11
rapids-bot bot pushed a commit that referenced this pull request Sep 26, 2024
Follow-up to #218 

This bumps the NCCL floor here slightly higher, to `>=2.19`. Part of a RAPIDS-wide update of that floor for the 24.10 release. See rapidsai/build-planning#102 (comment) for context.

cc @linhu-nv for awareness

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - https://github.com/jakirkham

URL: #223
Development

Successfully merging this pull request may close these issues.

Nightly docs-build is broken
5 participants