[gpu] strict driver and cuda version assignment #1275
Open
cjac wants to merge 111 commits into GoogleCloudDataproc:master from LLC-Technologies-Collier:gpu-20241212
111 commits
c0ea631
[gpu] toward a more consistent driver and CUDA install
cjac f210adf
correcting driver for cuda 12.4
cjac f6ff5a3
correcting cuda subversion. 12.4.0 instead of 12.4.1 so that driver …
cjac e36b25b
corrected cannonical 11.8 driver version ; removed extra code and com…
cjac a2400a7
skipping most tests ; using 11.7 from the cuda 11 line instead of the…
cjac a137719
verified that the cuda and driver versions match up
cjac 693bc7f
reducing log capture
cjac 4ce1efc
temporarily increasing machine shape for build caching
cjac 05b3e2b
64 is too many for a single T4
cjac e2ab509
added a subversion for 11.7
cjac 1a39be6
add more tests to the install function
cjac 41ae069
only including architectures supported by this version of CUDA
cjac 39ac281
pinning down versions better ; more caching ; more ram disks ; new py…
cjac f116717
using maximum from 8.9 series on rocky for 11.7
cjac 976f869
skip full build
cjac 6ef2fdb
pinning to bazel-7.4.0
cjac 1539cdb
NCCL requires gcc-11 for cuda11
cjac 9a54f4c
rocky8 is now building from the source in the .run file
cjac 3316518
reverting to previous state of only selecting a compiler version on l…
cjac 722e436
replaced literal path names with variable values ; indexing builds by…
cjac f42fee6
moved variable definition to prepare function ; moved driver signing …
cjac a13122e
test whether variable is defined before checking its value
cjac 3b72048
cache only the bins and logs
cjac 2cc19ce
build index of kernel modules after unpacking ; remove call to non-ex…
cjac 5a2d783
only build module dependency index once
cjac 1cf12ab
skipping CUDA 11 NCCL build on debian12
cjac 77a95ff
skip cuda11 on debian12, rocky9
cjac 0b2da14
renamed verify_pyspark to verify_instance_pyspark
cjac 0c1df7f
failing somewhat gracefully ; skipping tests that would fail
cjac ce60b03
skipping single node tests for rocky8
cjac d16e625
re-enable other tests
cjac 7284ad7
Specifying bazel version with variable
cjac 35e4ba2
fixing up some skip logic
cjac be37569
replaced OS_NAME with _shortname
cjac c9d1d95
skip more single instance tests for rocky8
cjac b63ae17
fixing indentation ; skipping redundant test
cjac 94c1f13
remove retries of flakey tests
cjac ac477b3
oops ; need to define the cuda version to test for
cjac db7aacf
passing -q to gcloud to generate empty passphrase if no ssh key exist…
cjac e152fd8
including instructions on how to create a secure-boot key pair
cjac f113ef8
-e for expert, not -p for pro
cjac dfc433d
updated 11.8 and 12.0 driver versions
cjac 77fc42a
added a signature check test which allows granular selection of platf…
cjac 8ed498e
tuning the layout of arguments to userspace.run
cjac 842d7e5
scoping DEFAULT_CUDA_VERSION correctly ; exercising rocky including k…
cjac bb35d11
add a connect timeout to the ssh call instead of trying to patch arou…
cjac 2541a6f
add some entropy to the process
cjac ab668ff
perhaps a re-run would have fixed 2.0-rocky8 on that last run
cjac 934289a
increasing init action timeout to account for uncached builds
cjac e5920f8
cache non-open kernel build results
cjac 386177d
per-kernel sub-directory for kmod tarballs
cjac b9668e0
using upstream repo and branch
cjac 2f0148a
corrected grammar error
cjac 19b9ddb
testing Kerberos some more
cjac 1e5fc0f
better implementation of numa node selection
cjac 4023031
this time with a test which is exercised
cjac 03f59a6
skip debian11 on Kerberos
cjac f2146e3
also skipping 2.1-ubuntu20 on kerberos clusters
cjac 1cb99f8
re-adjusting tests to be performed ; adjusting rather than skipping k…
cjac 3a238d1
more temporal variance
cjac cc16aa8
skipping CUDA=12.0 for ubuntu22
cjac 3ac04bc
kerberos not known to succeed on 2.0-rocky8
cjac c6bf91a
2.2 dataproc images do not support CUDA <= 12.0
cjac d1b3d48
skipping SINGLE configuration for rocky8 again
cjac 751e7a0
not testing 2.0
cjac e5e3a9e
trying without test retries ; retries should happen within the test, …
cjac c1cd1d9
kerberos only works on 2.2
cjac eac2d46
using expectedFailure instead of skipTest for tests which are known t…
cjac bf1f0c6
document one of the failure states
cjac 12e6de9
skipping expected failures
cjac f7bf9ab
updated manual-test-runner.sh instructions
cjac 47a6e3b
this one generated from template after refactor
cjac 26719af
do not point to local rpm pgp key
cjac 74c09f4
re-ordering to reduce delta from master
cjac 53c1ef1
custom image usage can come later
cjac 97046b1
see #1283
cjac 484308b
replaced incorrectly removed presubmit.sh and removed custom image ke…
cjac 61b94da
revert nearly to master
cjac 8b4f4f8
can include extended test suite later
cjac 3bc45ff
order commands correctly
cjac 6a76b4e
placing all completion files in a common directory
cjac e592146
extend supported version list to include latest release of each minor…
cjac 4559ecc
tested with CUDA 11.6.2/510.108.03
cjac 16c8485
exercised with cuda 11.1
cjac afd5f2f
reverting cloudbuild/Dockerfile to master
cjac 2272f97
nvidia is 404ing for download.nvidia.com ; using us.download.nvidia.com
cjac 3b2dc66
skipping rocky9
cjac 0c420b7
* adding version 12.6 to the support matrix
cjac f69d071
incorrect version check removed
cjac 73ffce5
only install pytorch if include-pytorch metadata set to true
cjac 521df62
since call to install_pytorch is protected by metadata check, skip me…
cjac c0b60b2
increasing timeout and machine shape to reduce no-cache build time
cjac 30c97c4
skip full test run due to edits to integration_tests directory
cjac 84b1fb9
ubuntu18 does not know about kex-gss ; use correct driver version num…
cjac 11cbe95
on rocky9 sshd service is called sshd instead of ssh as the rest of t…
cjac 56fe50c
kex-gss is new in debian11
cjac b1cd1d0
all rocky call it sshd it seems
cjac ca94393
cudnn no longer available on debian10
cjac 1d2166c
compared with #1282 ; this change matches parity more closely
cjac 50142f6
slightly better variable declaration ordering ; it is better still in…
cjac 6363203
install spark rapids
cjac dba00df
cache the results of nvidia-smi --query-gpu
cjac 96a8d6d
reduce development time
cjac 11f099c
exercising more CUDA variants ; testing whether tests fail on long runs
cjac 8ae2c0a
try to reduce concurrent builds ; extend build time further ; only en…
cjac 02732e1
fixed bug with spark rapids version assignment ; more conservative ab…
cjac 57fef50
* gpu does not work on capacity scheduler on dataproc 2.0 ; use fair
cjac cc5abca
revert test_install_gpu_cuda_nvidia_with_spark_job cuda versions
cjac 8936442
configure for use with JupyterLab
cjac 0bc3c1f
2.2 should use 12.6.3 (latest)
cjac e56ddd0
Addressing review from cnauroth
cjac
this can be removed
Thanks, yes. I will remove it once bcheena or cnauroth have had an opportunity to suggest changes, and those changes, if any, are implemented and tested.