
Driver daemonset uninstalls the driver on node reboot even if no new version is available #705

Open

slik13 opened this issue Apr 25, 2024 · 2 comments

slik13 (Contributor) commented Apr 25, 2024


1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): RHEL 8.9
  • Kernel Version: 4.18.0-513.5.1.el8_9.x86_64
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): containerd
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): K8s, 1.29
  • GPU Operator Version: v23.9.0

2. Issue or feature description

The init container of the gpu-driver pod (k8s-driver-manager) runs uninstall_driver from the driver-manager script. The problem is that it does not check whether a driver is already installed when using the compiled-driver route, so any node reboot triggers an unnecessary driver recompile and reinstall even when no new driver version is available.

The nodes already expose which driver version is installed through labels, so the driver manager should not try to uninstall the driver when a driver of the same version is already installed.

See uninstall code here: https://github.com/NVIDIA/k8s-driver-manager/blob/master/driver-manager#L573
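
For illustration, a minimal sketch of the kind of guard being proposed, assuming the desired version is available to the init container as an environment variable and the installed version via a node label. The names NODE_NAME, DRIVER_VERSION, and nvidia.com/gpu.driver-version below are hypothetical, not the actual names used by k8s-driver-manager:

```sh
#!/usr/bin/env bash
# Hypothetical pre-uninstall guard, NOT the actual driver-manager logic.
# Assumes NODE_NAME is injected via the downward API and DRIVER_VERSION
# matches the driver container image tag; the label name is illustrative.
set -euo pipefail

installed=$(kubectl get node "${NODE_NAME}" \
  -o jsonpath='{.metadata.labels.nvidia\.com/gpu\.driver-version}')

if [ -n "${installed}" ] && [ "${installed}" = "${DRIVER_VERSION}" ]; then
  echo "Driver ${installed} already installed on ${NODE_NAME}; skipping uninstall."
  exit 0
fi

echo "Installed driver (${installed:-none}) differs from desired ${DRIVER_VERSION}; proceeding with uninstall."
# ... fall through to the existing uninstall_driver path ...
```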

3. Steps to reproduce the issue

  1. Have a GPU node with a driver.
  2. Let the gpu-operator install the driver on the node.
  3. Reboot the node.
  4. The daemonset will uninstall and reinstall the driver (see the sketch below for one way to observe this).
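
One rough way to observe this after rebooting the node (the gpu-operator namespace and the app=nvidia-driver-daemonset label assume a default install; adjust for your deployment):

```sh
# Find the driver daemonset pod on the rebooted node and read the init
# container's logs; they show uninstall_driver running on every pod restart,
# including restarts caused purely by a node reboot.
NS=gpu-operator
POD=$(kubectl -n "${NS}" get pods -l app=nvidia-driver-daemonset \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n "${NS}" logs "${POD}" -c k8s-driver-manager
```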
shivamerla (Contributor) commented

@slik13 this is a current limitation, and we have a feature on the roadmap to avoid it. Currently, we use bind mounts to expose the necessary installation files (/usr/bin, /lib/modules, /lib) from the container under /run/nvidia/driver on the host, so on every driver container restart the mount is removed. We want to persist these files onto a persistent driver root and configure the nvidia-container-toolkit to look up that path instead.
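
For context, a rough illustration of the mount behavior described above (the paths come from the comment; the exact commands the driver container runs may differ):

```sh
# Illustration only, not the actual driver-container entrypoint: the container
# exposes its installed driver files to the host under /run/nvidia/driver.
mkdir -p /run/nvidia/driver/usr/bin /run/nvidia/driver/lib/modules /run/nvidia/driver/lib
mount --rbind /usr/bin     /run/nvidia/driver/usr/bin
mount --rbind /lib/modules /run/nvidia/driver/lib/modules
mount --rbind /lib         /run/nvidia/driver/lib

# /run is normally a tmpfs, so these mounts (and anything written under them)
# do not survive a node reboot -- hence the reinstall on every boot until the
# driver files live under a persistent driver root instead.
findmnt -no FSTYPE /run   # typically prints: tmpfs
```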

slik13 (Contributor, Author) commented Apr 25, 2024

Thanks for the quick response; I think I understand the reason for the current behavior better now. Would you have something I could track (a feature/issue number, etc.) to keep an eye on the addition of this feature?
