[gpu] strict driver and cuda version assignment #1275

Open
wants to merge 111 commits into
base: master
Commits
111 commits
c0ea631
[gpu] toward a more consistent driver and CUDA install
cjac Dec 7, 2024
f210adf
correcting driver for cuda 12.4
cjac Dec 7, 2024
f6ff5a3
correcting cuda subversion. 12.4.0 instead of 12.4.1 so that driver …
cjac Dec 8, 2024
e36b25b
corrected canonical 11.8 driver version ; removed extra code and com…
cjac Dec 8, 2024
a2400a7
skipping most tests ; using 11.7 from the cuda 11 line instead of the…
cjac Dec 9, 2024
a137719
verified that the cuda and driver versions match up
cjac Dec 9, 2024
693bc7f
reducing log capture
cjac Dec 9, 2024
4ce1efc
temporarily increasing machine shape for build caching
cjac Dec 9, 2024
05b3e2b
64 is too many for a single T4
cjac Dec 9, 2024
e2ab509
added a subversion for 11.7
cjac Dec 9, 2024
1a39be6
add more tests to the install function
cjac Dec 9, 2024
41ae069
only including architectures supported by this version of CUDA
cjac Dec 9, 2024
39ac281
pinning down versions better ; more caching ; more ram disks ; new py…
cjac Dec 10, 2024
f116717
using maximum from 8.9 series on rocky for 11.7
cjac Dec 10, 2024
976f869
skip full build
cjac Dec 10, 2024
6ef2fdb
pinning to bazel-7.4.0
cjac Dec 10, 2024
1539cdb
NCCL requires gcc-11 for cuda11
cjac Dec 10, 2024
9a54f4c
rocky8 is now building from the source in the .run file
cjac Dec 11, 2024
3316518
reverting to previous state of only selecting a compiler version on l…
cjac Dec 11, 2024
722e436
replaced literal path names with variable values ; indexing builds by…
cjac Dec 11, 2024
f42fee6
moved variable definition to prepare function ; moved driver signing …
cjac Dec 11, 2024
a13122e
test whether variable is defined before checking its value
cjac Dec 11, 2024
3b72048
cache only the bins and logs
cjac Dec 11, 2024
2cc19ce
build index of kernel modules after unpacking ; remove call to non-ex…
cjac Dec 12, 2024
5a2d783
only build module dependency index once
cjac Dec 12, 2024
1cf12ab
skipping CUDA 11 NCCL build on debian12
cjac Dec 12, 2024
77a95ff
skip cuda11 on debian12, rocky9
cjac Dec 12, 2024
0b2da14
renamed verify_pyspark to verify_instance_pyspark
cjac Dec 12, 2024
0c1df7f
failing somewhat gracefully ; skipping tests that would fail
cjac Dec 12, 2024
ce60b03
skipping single node tests for rocky8
cjac Dec 12, 2024
d16e625
re-enable other tests
cjac Dec 12, 2024
7284ad7
Specifying bazel version with variable
cjac Dec 12, 2024
35e4ba2
fixing up some skip logic
cjac Dec 12, 2024
be37569
replaced OS_NAME with _shortname
cjac Dec 12, 2024
c9d1d95
skip more single instance tests for rocky8
cjac Dec 12, 2024
b63ae17
fixing indentation ; skipping redundant test
cjac Dec 12, 2024
94c1f13
remove retries of flakey tests
cjac Dec 12, 2024
ac477b3
oops ; need to define the cuda version to test for
cjac Dec 12, 2024
db7aacf
passing -q to gcloud to generate empty passphrase if no ssh key exist…
cjac Dec 12, 2024
e152fd8
including instructions on how to create a secure-boot key pair
cjac Dec 13, 2024
f113ef8
-e for expert, not -p for pro
cjac Dec 13, 2024
dfc433d
updated 11.8 and 12.0 driver versions
cjac Dec 13, 2024
77fc42a
added a signature check test which allows granular selection of platf…
cjac Dec 13, 2024
8ed498e
tuning the layout of arguments to userspace.run
cjac Dec 13, 2024
842d7e5
scoping DEFAULT_CUDA_VERSION correctly ; exercising rocky including k…
cjac Dec 13, 2024
bb35d11
add a connect timeout to the ssh call instead of trying to patch arou…
cjac Dec 13, 2024
2541a6f
add some entropy to the process
cjac Dec 13, 2024
ab668ff
perhaps a re-run would have fixed 2.0-rocky8 on that last run
cjac Dec 13, 2024
934289a
increasing init action timeout to account for uncached builds
cjac Dec 13, 2024
e5920f8
cache non-open kernel build results
cjac Dec 14, 2024
386177d
per-kernel sub-directory for kmod tarballs
cjac Dec 14, 2024
b9668e0
using upstream repo and branch
cjac Dec 14, 2024
2f0148a
corrected grammar error
cjac Dec 14, 2024
19b9ddb
testing Kerberos some more
cjac Dec 14, 2024
1e5fc0f
better implementation of numa node selection
cjac Dec 14, 2024
4023031
this time with a test which is exercised
cjac Dec 14, 2024
03f59a6
skip debian11 on Kerberos
cjac Dec 14, 2024
f2146e3
also skipping 2.1-ubuntu20 on kerberos clusters
cjac Dec 14, 2024
1cb99f8
re-adjusting tests to be performed ; adjusting rather than skipping k…
cjac Dec 14, 2024
3a238d1
more temporal variance
cjac Dec 14, 2024
cc16aa8
skipping CUDA=12.0 for ubuntu22
cjac Dec 14, 2024
3ac04bc
kerberos not known to succeed on 2.0-rocky8
cjac Dec 14, 2024
c6bf91a
2.2 dataproc images do not support CUDA <= 12.0
cjac Dec 14, 2024
d1b3d48
skipping SINGLE configuration for rocky8 again
cjac Dec 15, 2024
751e7a0
not testing 2.0
cjac Dec 15, 2024
e5e3a9e
trying without test retries ; retries should happen within the test, …
cjac Dec 15, 2024
c1cd1d9
kerberos only works on 2.2
cjac Dec 15, 2024
eac2d46
using expectedFailure instead of skipTest for tests which are known t…
cjac Dec 15, 2024
bf1f0c6
document one of the failure states
cjac Dec 15, 2024
12e6de9
skipping expected failures
cjac Dec 16, 2024
f7bf9ab
updated manual-test-runner.sh instructions
cjac Dec 16, 2024
47a6e3b
this one generated from template after refactor
cjac Dec 23, 2024
26719af
do not point to local rpm pgp key
cjac Dec 24, 2024
74c09f4
re-ordering to reduce delta from master
cjac Dec 24, 2024
53c1ef1
custom image usage can come later
cjac Dec 24, 2024
97046b1
see #1283
cjac Dec 24, 2024
484308b
replaced incorrectly removed presubmit.sh and removed custom image ke…
cjac Dec 24, 2024
61b94da
revert nearly to master
cjac Dec 24, 2024
8b4f4f8
can include extended test suite later
cjac Dec 24, 2024
3bc45ff
order commands correctly
cjac Dec 24, 2024
6a76b4e
placing all completion files in a common directory
cjac Dec 24, 2024
e592146
extend supported version list to include latest release of each minor…
cjac Jan 13, 2025
4559ecc
tested with CUDA 11.6.2/510.108.03
cjac Jan 13, 2025
16c8485
exercised with cuda 11.1
cjac Jan 14, 2025
afd5f2f
reverting cloudbuild/Dockerfile to master
cjac Jan 14, 2025
2272f97
nvidia is 404ing for download.nvidia.com ; using us.download.nvidia.com
cjac Jan 15, 2025
3b2dc66
skipping rocky9
cjac Jan 15, 2025
0c420b7
* adding version 12.6 to the support matrix
cjac Jan 15, 2025
f69d071
incorrect version check removed
cjac Jan 15, 2025
73ffce5
only install pytorch if include-pytorch metadata set to true
cjac Jan 21, 2025
521df62
since call to install_pytorch is protected by metadata check, skip me…
cjac Jan 21, 2025
c0b60b2
increasing timeout and machine shape to reduce no-cache build time
cjac Jan 21, 2025
30c97c4
skip full test run due to edits to integration_tests directory
cjac Jan 21, 2025
84b1fb9
ubuntu18 does not know about kex-gss ; use correct driver version num…
cjac Jan 21, 2025
11cbe95
on rocky9 sshd service is called sshd instead of ssh as the rest of t…
cjac Jan 22, 2025
56fe50c
kex-gss is new in debian11
cjac Jan 22, 2025
b1cd1d0
all rocky call it sshd it seems
cjac Jan 22, 2025
ca94393
cudnn no longer available on debian10
cjac Jan 22, 2025
1d2166c
compared with #1282 ; this change matches parity more closely
cjac Jan 23, 2025
50142f6
slightly better variable declaration ordering ; it is better still in…
cjac Jan 23, 2025
6363203
install spark rapids
cjac Jan 23, 2025
dba00df
cache the results of nvidia-smi --query-gpu
cjac Jan 23, 2025
96a8d6d
reduce development time
cjac Jan 23, 2025
11f099c
exercising more CUDA variants ; testing whether tests fail on long runs
cjac Jan 23, 2025
8ae2c0a
try to reduce concurrent builds ; extend build time further ; only en…
cjac Jan 23, 2025
02732e1
fixed bug with spark rapids version assignment ; more conservative ab…
cjac Jan 24, 2025
57fef50
* gpu does not work on capacity scheduler on dataproc 2.0 ; use fair
cjac Jan 24, 2025
cc5abca
revert test_install_gpu_cuda_nvidia_with_spark_job cuda versions
cjac Jan 24, 2025
8936442
configure for use with JupyterLab
cjac Jan 28, 2025
0bc3c1f
2.2 should use 12.6.3 (latest)
cjac Jan 29, 2025
e56ddd0
Addressing review from cnauroth
cjac Feb 1, 2025
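
As a reading aid for the commit history above: "strict" assignment presumably means that each supported CUDA release maps to exactly one driver version, and that unknown versions are rejected rather than guessed. A minimal sketch of that idea follows. The function name `driver_for_cuda` is hypothetical and is not taken from the PR; the only version pair used is the one named in the commit log (CUDA 11.6.2 with driver 510.108.03).

```bash
# Hypothetical helper, not the PR's code: strict CUDA -> driver lookup.
driver_for_cuda() {
  case "$1" in
    # The only pair taken from the commit log above.
    11.6.2) echo "510.108.03" ;;
    # Strict: refuse to guess a driver for any version that is not listed.
    *) echo "no pinned driver for CUDA '$1'" >&2; return 1 ;;
  esac
}
```

Calling `driver_for_cuda 11.6.2` prints `510.108.03`; any other version returns a non-zero status, so a caller can stop instead of installing a mismatched driver.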
2 changes: 1 addition & 1 deletion cloudbuild/presubmit.sh
@@ -70,6 +70,7 @@ determine_tests_to_run() {
     changed_dir="${changed_dir%%/*}/"
     # Run all tests if common directories modified
     if [[ ${changed_dir} =~ ^(integration_tests|util|cloudbuild)/$ ]]; then
+      continue # to be removed before merge

Collaborator

this can be removed

Contributor Author

Thanks, yes. I will remove it once bcheena or cnauroth have had an opportunity to suggest changes, and those changes, if any, are implemented and tested.

       echo "All tests will be run: '${changed_dir}' was changed"
       TESTS_TO_RUN=(":DataprocInitActionsTestSuite")
       return 0
@@ -104,7 +105,6 @@ run_tests() {
   bazel test \
     --jobs="${max_parallel_tests}" \
     --local_test_jobs="${max_parallel_tests}" \
-    --flaky_test_attempts=3 \
     --action_env="INTERNAL_IP_SSH=true" \
     --test_output="all" \
     --noshow_progress \
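
The first presubmit.sh hunk is easier to follow with the directory check isolated. The sketch below mirrors it under stated assumptions: `classify_changed_path` is a hypothetical name, and a `case` pattern stands in for the script's `[[ =~ ]]` regex so the snippet also runs in plain POSIX shells. It reports `all` where the unmodified script would select the whole suite; the temporary `continue` in this PR skips that branch.

```bash
# Hypothetical stand-alone sketch of the check in determine_tests_to_run().
# classify_changed_path is an illustrative name, not a function in presubmit.sh.
classify_changed_path() {
  # Reduce e.g. "gpu/Dockerfile" to its top-level directory, "gpu/".
  local changed_dir="${1%%/*}/"
  case "${changed_dir}" in
    # Common directories: a change here normally triggers every test.
    integration_tests/|util/|cloudbuild/) echo "all" ;;
    # Anything else: only that directory's tests are candidates.
    *) echo "${changed_dir}" ;;
  esac
}
```

For the two files touched in this PR, `classify_changed_path cloudbuild/presubmit.sh` prints `all` and `classify_changed_path gpu/Dockerfile` prints `gpu/`.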
14 changes: 10 additions & 4 deletions gpu/Dockerfile
@@ -15,19 +15,25 @@ RUN apt-get -qq update \
     curl jq less screen > /dev/null 2>&1 && apt-get clean

 # Install bazel signing key, repo and package
-ENV bazel_kr_path=/usr/share/keyrings/bazel-release.pub.gpg
-ENV bazel_repo_data="http://storage.googleapis.com/bazel-apt stable jdk1.8"
+ENV bazel_kr_path=/usr/share/keyrings/bazel-keyring.gpg \
+    bazel_version=7.4.0 \
+    bazel_repo_data="http://storage.googleapis.com/bazel-apt stable jdk1.8" \
+    DEBIAN_FRONTEND=noninteractive

 RUN /usr/bin/curl -s https://bazel.build/bazel-release.pub.gpg \
     | gpg --dearmor -o "${bazel_kr_path}" \
     && echo "deb [arch=amd64 signed-by=${bazel_kr_path}] ${bazel_repo_data}" \
     | dd of=/etc/apt/sources.list.d/bazel.list status=none \
     && apt-get update -qq

-RUN apt-get autoremove -y -qq && \
-    apt-get install -y -qq default-jdk python3-setuptools bazel > /dev/null 2>&1 && \
+RUN apt-get autoremove -y -qq > /dev/null 2>&1 && \
+    apt-get install -y -qq default-jdk python3-setuptools bazel-${bazel_version} > /dev/null 2>&1 && \
     apt-get clean

+# Set bazel-${bazel_version} as the default bazel alternative in this container
+RUN update-alternatives --install /usr/bin/bazel bazel /usr/bin/bazel-${bazel_version} 1 && \
+    update-alternatives --set bazel /usr/bin/bazel-${bazel_version}
+
 # Install here any utilities you find useful when troubleshooting
 RUN apt-get -y -qq install emacs-nox vim uuid-runtime > /dev/null 2>&1 && apt-get clean
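
One way to confirm that the pin took effect inside the container is to compare the version that `bazel` reports against `bazel_version`. The check below is a sketch and not part of the PR: `bazel_version_matches` is a hypothetical helper, and it assumes `bazel --version` prints a single line of the form `bazel <version>`.

```bash
# Hypothetical check, not part of the PR: compare a `bazel --version` line
# against the pinned version. Succeeds only on an exact match.
bazel_version_matches() {
  local expected="$1" reported="$2"
  [ "${reported}" = "bazel ${expected}" ]
}

# Usage inside the container (assumes bazel is on PATH):
#   bazel_version_matches "7.4.0" "$(bazel --version)" \
#     || echo "unexpected bazel version" >&2
```

Passing the reported line in as an argument keeps the comparison testable without bazel installed.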
