GPU drivers not installing with host kernel 6.8 and vGPU 16.5 (535.161.05) #718

Closed
urbaman opened this issue May 13, 2024 · 8 comments
Comments

@urbaman

urbaman commented May 13, 2024

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04
  • Kernel Version: 5.15.0-106-generic
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd 1.6.28
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): Kubeadm
  • GPU Operator Version: 23.9.2

2. Issue or feature description

Driver installation fails in a VM when the host runs kernel 6.8 with vGPU 16.5 (driver 535.161.05).

3. Steps to reproduce the issue

Install vGPU 16.5 (535.161.05) on the host, then try to install gpu-operator.

4. Information to attach (optional if deemed irrelevant)

nvidia-driver-daemonset-k59mv logs:

Updating the package cache...
Resolving Linux kernel version...
Proceeding with Linux kernel version 5.15.0-106-generic
Installing Linux kernel headers...
Installing Linux kernel module files...
Generating Linux kernel version string...
Compiling NVIDIA driver kernel modules...
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
  You are using:           cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
/usr/src/nvidia-535.129.03-grid/kernel/nvidia-uvm/uvm_perf_events_test.c: In function 'test_events':
/usr/src/nvidia-535.129.03-grid/kernel/nvidia-uvm/uvm_perf_events_test.c:83:1: warning: the frame size of 1048 bytes is larger than 1024 bytes [-Wframe-larger-than=]
   83 | }
      | ^
/usr/src/nvidia-535.129.03-grid/kernel/nvidia-drm/nvidia-drm-crtc.c: In function '__nv_drm_plane_atomic_destroy_state':
/usr/src/nvidia-535.129.03-grid/kernel/nvidia-drm/nvidia-drm-crtc.c:695:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  695 |     struct nv_drm_plane_state *nv_drm_plane_state =
      |     ^~~~~~
/usr/src/nvidia-535.129.03-grid/kernel/nvidia-peermem/nvidia-peermem.c: In function 'nv_mem_client_init':
/usr/src/nvidia-535.129.03-grid/kernel/nvidia-peermem/nvidia-peermem.c:490:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  490 |     int status = 0;
      |     ^~~
ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'
make[2]: *** [scripts/Makefile.modpost:133: /usr/src/nvidia-535.129.03-grid/kernel/Module.symvers] Error 1
make[2]: *** Deleting file '/usr/src/nvidia-535.129.03-grid/kernel/Module.symvers'
make[1]: *** [Makefile:1830: modules] Error 2
make: *** [Makefile:82: modules] Error 2
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
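
The failing line in a build log like the one above can be picked out mechanically. The following is a minimal sketch, not part of the GPU Operator tooling: the function name `find_gpl_symbol_errors` is hypothetical, and it assumes the log uses the same `modpost` error wording shown in this report.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of gpu-operator): reads a driver build log on
# stdin and prints each GPL-only symbol named in a modpost
# "GPL-incompatible module" error. Returns non-zero if no such error is found.
find_gpl_symbol_errors() {
  grep -F "GPL-incompatible module" \
    | sed -n "s/.*uses GPL-only symbol '\([^']*\)'.*/\1/p" \
    | sort -u \
    | grep .   # fails when nothing was printed, so callers can test the result
}
```

It could be fed from the daemonset pod, for example `kubectl logs -n gpu-operator nvidia-driver-daemonset-k59mv | find_gpl_symbol_errors`, where the `gpu-operator` namespace is an assumption about the deployment.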

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]

@kollachaitanyakrishna

kollachaitanyakrishna commented May 13, 2024

I'm seeing a similar issue. Attaching the crash report.

Azure VM:
Linux 5.15.0-1063-azure x86_64
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

nvidia-dkms-515.0.crash.txt

@bqm1111

bqm1111 commented May 16, 2024

I'm also encountering the same problem on Ubuntu 20.04 with nvidia-driver-535.171.04 and kernel 5.15.0-107-generic.

@vicaya

vicaya commented May 18, 2024

Appears to be a known issue for kernel upgrades. The current/stable nvidia driver version 550.x works fine.

@bqm1111

bqm1111 commented May 19, 2024

> Appears to be a known issue for kernel upgrades. The current/stable nvidia driver version 550.x works fine.

How can I install nvidia-driver-550 on ubuntu 20.04?

@Stephenfang51

> > Appears to be a known issue for kernel upgrades. The current/stable nvidia driver version 550.x works fine.
>
> How can I install nvidia-driver-550 on ubuntu 20.04?

Hi, did you solve your problem? I'm running into the same issue :(

@bqm1111

bqm1111 commented Jun 6, 2024

Hi

You have to manually download the driver from this site.

@2019211753

> Hi
>
> You have to manually download the driver from this site.

Manual installation works for me!

@cdesiniotis
Contributor

The following error was fixed in the 535.183.08 driver:

ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'

Closing this issue.
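
To check whether a given driver version is at or past the 535.183.08 release named above, a version-aware comparison can be sketched as follows. The helper name `version_at_least` is hypothetical, it relies on GNU `sort -V`, and it takes the 535.183.08 figure at face value: it does not prove that a particular build carries the fix, and other builds or branches may number their releases differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when version $1 is greater than or equal to
# version $2, using GNU sort's version ordering (sort -V).
version_at_least() {
  # After a version sort, the first line is the smaller of the two versions.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Assumption: threshold taken from the closing comment in this thread.
FIXED_IN="535.183.08"
```

For example, `version_at_least 535.161.05 "$FIXED_IN" || echo "driver predates the reported fix"` would flag the version from the issue title.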

7 participants