Add post-kepler nvidia open kernel module installation steps #163

Merged
merged 1 commit into from
Dec 9, 2023

fierlion
Member

@fierlion fierlion commented Nov 10, 2023

Summary

This change builds a DKMS tarball (.dkms.tar.gz) of the open source NVIDIA kernel module and stores it in /var/lib/dkms-archive/nvidia-open. It also adds a script, /var/lib/ecs/scripts/install-nvidia-open-kmod.sh, that replaces the closed source NVIDIA kernel module with the open source one. AMI users (especially those running P4/P5 instances with EFA) can then swap the kernel module by running a single command from userdata/cloud-init.

This change also unpins the GPU kernel version so that the AMI build pulls in the latest updates.

Testing

Built the AMI with Packer: REGION=us-west-2 make al2gpu
Resulting test AMI: ami-028c1796865c0d2ae
I launched an instance from this AMI with the following userdata, which runs the open module installation script:

#!/bin/bash
/var/lib/ecs/scripts/install-nvidia-open-kmod.sh
yum versionlock \
    cuda-* \
    cuda-drivers-* \
    cuda-drivers-fabricmanager-* \
    nvidia-container-toolkit-*

cloud-init output:

Cloud-init v. 19.3-46.amzn2.0.1 running 'modules:final' at Fri, 08 Dec 2023 19:27:06 +0000. Up 83.46 seconds.
+ DKMS=/usr/sbin/dkms
+ DKMS_ARCHIVE_DIR=/var/lib/dkms-archive
++ uname -r
+ KERNEL_VERSION=4.14.322-246.539.amzn2.x86_64
++ /usr/sbin/dkms status -m nvidia
++ awk '{print $2}'
++ tr -d ,:
+ MODULE_VERSION=535.129.03
+ /usr/sbin/dkms uninstall -m nvidia -v 535.129.03

-------- Uninstall Beginning --------
Module:  nvidia
Version: 535.129.03
Kernel:  4.14.322-246.539.amzn2.x86_64 (x86_64)
-------------------------------------
...
------------------------------
Deleting module version: 535.129.03
completely from the DKMS tree.
------------------------------
Done.
+ echo 'found nvidia kernel module: 535.129.03'
found nvidia kernel module: 535.129.03
+ MODULE_ARCHIVE=/var/lib/dkms-archive/nvidia-open/nvidia-open-535.129.03-kernel4.14.322-246.539.amzn2.x86_64-x86_64.dkms.tar.gz
+ echo 'loading from /var/lib/dkms-archive/nvidia-open/nvidia-open-535.129.03-kernel4.14.322-246.539.amzn2.x86_64-x86_64.dkms.tar.gz'
loading from /var/lib/dkms-archive/nvidia-open/nvidia-open-535.129.03-kernel4.14.322-246.539.amzn2.x86_64-x86_64.dkms.tar.gz
+ /usr/sbin/dkms ldtarball /var/lib/dkms-archive/nvidia-open/nvidia-open-535.129.03-kernel4.14.322-246.539.amzn2.x86_64-x86_64.dkms.tar.gz
...
DKMS: install completed.
+ sudo systemctl daemon-reload
+ /usr/sbin/dkms status -m nvidia
nvidia, 535.129.03, 4.14.322-246.539.amzn2.x86_64, x86_64: installed
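
The flow visible in the trace above can be sketched roughly as follows. This is a reconstruction from the log, not the script merged in this PR: the function names are hypothetical, the `dkms remove` and `dkms install` calls are assumptions standing in for the steps elided as "..." in the output, and `uname -m` is an assumed source for the trailing architecture field in the archive name.

```shell
#!/bin/bash
# Sketch reconstructed from the cloud-init trace; NOT the script shipped in the PR.
# Function names are hypothetical. Steps marked "Assumption" are not in the trace.

DKMS="${DKMS:-/usr/sbin/dkms}"
DKMS_ARCHIVE_DIR="${DKMS_ARCHIVE_DIR:-/var/lib/dkms-archive}"

# Extract the version from `dkms status -m nvidia` output read on stdin, e.g.
# "nvidia, 535.129.03, <kernel>, x86_64: installed" (same awk/tr pipeline as the trace).
parse_dkms_version() {
    awk '{print $2}' | tr -d ',:'
}

# Build the archive path using the naming visible in the trace.
# Arguments: module version, kernel release, architecture.
nvidia_open_archive_path() {
    local version="$1" kernel="$2" arch="$3"
    echo "${DKMS_ARCHIVE_DIR}/nvidia-open/nvidia-open-${version}-kernel${kernel}-${arch}.dkms.tar.gz"
}

# Replace the installed nvidia module with the archived open one (run as root).
install_nvidia_open_kmod() {
    local kernel arch version archive
    kernel="$(uname -r)"
    arch="$(uname -m)"   # Assumption: the trailing arch field comes from uname -m.
    version="$("$DKMS" status -m nvidia | parse_dkms_version)"
    if [ -z "$version" ]; then
        echo "no nvidia module registered with dkms" >&2
        return 1
    fi
    echo "found nvidia kernel module: ${version}"

    archive="$(nvidia_open_archive_path "$version" "$kernel" "$arch")"
    if [ ! -f "$archive" ]; then
        echo "archive not found: ${archive}" >&2
        return 1
    fi

    # From the trace: uninstall the closed source module.
    "$DKMS" uninstall -m nvidia -v "$version" || return 1
    # Assumption: the "Deleting module version ... from the DKMS tree" output
    # suggests a remove step; the exact command is elided in the log.
    "$DKMS" remove -m nvidia -v "$version" --all || return 1

    echo "loading from ${archive}"
    "$DKMS" ldtarball "$archive" || return 1
    # Assumption: an explicit install produces "DKMS: install completed."
    "$DKMS" install -m nvidia -v "$version" || return 1

    systemctl daemon-reload
    "$DKMS" status -m nvidia
}
```

The block only defines functions; calling `install_nvidia_open_kmod` as root would perform the swap. Unlike the ordering in the trace, the sketch checks that the archive exists before uninstalling anything, so a missing tarball does not leave the instance without a driver.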

New tests cover the changes: no

Description for the changelog

Add open source nvidia kernel module .tar file and install script.

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@fierlion fierlion force-pushed the fierlion/kmodEfaOpen branch 2 times, most recently from 8508c74 to 852b682 Compare November 10, 2023 01:17
@fierlion fierlion force-pushed the fierlion/kmodEfaOpen branch 4 times, most recently from 7016356 to b0214e9 Compare December 7, 2023 18:56
@fierlion fierlion changed the title [do not review] Add nvidia-open kernel module installation steps Add nvidia-open kernel module installation steps Dec 7, 2023
@fierlion fierlion force-pushed the fierlion/kmodEfaOpen branch 3 times, most recently from b14288b to 81f02e7 Compare December 8, 2023 19:39
variables.pkr.hcl Outdated Show resolved Hide resolved
@fierlion fierlion force-pushed the fierlion/kmodEfaOpen branch 2 times, most recently from 6078166 to bd32ddc Compare December 8, 2023 21:02
@fierlion fierlion force-pushed the fierlion/kmodEfaOpen branch from bd32ddc to 956d2bb Compare December 8, 2023 21:03
sudo mv $tmpfile /etc/yum.repos.d/amzn2-nvidia-tmp.repo

# only install open driver for post-kepler gpus
Contributor

nit (non-blocking): perhaps also make this apparent in the PR description?

@fierlion fierlion changed the title Add nvidia-open kernel module installation steps Add post-kepler nvidia open kernel module installation steps Dec 8, 2023
@fierlion fierlion merged commit 9e52d3f into main Dec 9, 2023
3 checks passed
@fierlion fierlion deleted the fierlion/kmodEfaOpen branch December 9, 2023 01:48
@mye956 mye956 mentioned this pull request Dec 11, 2023
rwarren25 pushed a commit to rwarren25/amazon-ecs-ami that referenced this pull request Jul 9, 2024
rwarren25 pushed a commit to rwarren25/amazon-ecs-ami that referenced this pull request Jul 9, 2024

5 participants