
Driver Daemonset crash looping on azure flavour (Standard_NV36ads_A10_v5) #691

Closed
GabQ-Bib opened this issue Apr 3, 2024 · 2 comments
GabQ-Bib commented Apr 3, 2024

1. Quick Debug Information

  • OS/Version: Ubuntu 22.04
  • Kernel Version: 5.15.0-1031-azure x86_64
  • Container Runtime Type/Version: containerd
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): Tanzu Kubernetes Grid (IaaS distribution deployed on Azure VMs)
  • GPU Operator Version: v23.9.2

2. Issue or feature description

The Daemonset Pod nvidia-driver-daemonset fails to load the driver when starting on a VM with flavour Standard_NV36ads_A10_v5.
As far as I can tell from the compatibility matrix, the GPU in this flavour is compatible with this operator/device-driver combination.
The same operator on the same cluster deploys drivers successfully to T4-type GPUs.

It may be worth noting that this flavour was selected because it is compatible with Gen1 hypervisors, which is a prerequisite for the OS image our Kubernetes distribution uses.

An acceptable workaround would be a compatible flavour with comparable GPU performance (the project team specifically requested this GPU because of a performance bottleneck on T4-based VMs).

3. Steps to reproduce the issue

1/ Add a GPU node to the cluster with flavour Standard_NV36ads_A10_v5 on Azure
2/ Use a standard OS image (no preinstalled drivers)
3/ Deploy the GPU Operator with Helm chart v23.9.2
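For reference, step 3 typically looks like the following Helm commands; the release name and namespace are illustrative assumptions, not values taken from this issue:

```shell
# Add the NVIDIA Helm repository and deploy the GPU Operator chart at v23.9.2.
# Release name and namespace "gpu-operator" are illustrative choices.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --version=v23.9.2
```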

4. nvidia-driver-daemonset pod logs:

[1624686.270184] device teststor-3e0101 left promiscuous mode
[1624693.290594] device pbsmexpc-2be292 left promiscuous mode
[1624695.307411] device pbsmestp-bb1e30 left promiscuous mode
[1624726.882775] SUNRPC: reached max allowed number (1) did not add transport to server: 52.239.243.168
[1624727.259206] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[1624727.259257] IPv6: ADDRCONF(NETDEV_CHANGE): teststor-891195: link becomes ready
[1624727.262738] device teststor-891195 entered promiscuous mode
[1624732.369340] device teststor-891195 left promiscuous mode
[1624787.675720] SUNRPC: reached max allowed number (1) did not add transport to server: 52.239.243.168
[1624787.857829] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[1624787.857867] IPv6: ADDRCONF(NETDEV_CHANGE): teststor-7540d4: link becomes ready
[1624787.861027] device teststor-7540d4 entered promiscuous mode
[1624793.511435] device teststor-7540d4 left promiscuous mode
[1624833.096438] nvidia-nvlink: Nvlink Core is being initialized, major device number 510

[1624833.098384] NVRM: The NVIDIA GPU 0002:00:00.0 (PCI ID: 10de:2236)
                 NVRM: installed in this system is not supported by the
                 NVRM: NVIDIA 550.54.14 driver release.
                 NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
                 NVRM: in this release's README, available on the operating system
                 NVRM: specific graphics driver download page at www.nvidia.com.
[1624833.098729] nvidia: probe of 0002:00:00.0 failed with error -1
[1624833.098767] NVRM: The NVIDIA probe routine failed for 1 device(s).
[1624833.098768] NVRM: None of the NVIDIA devices were initialized.
[1624833.098998] nvidia-nvlink: Unregistered Nvlink Core, major device number 510
@shivamerla
Contributor

@GabQ-Bib you need the NVIDIA GRID drivers on these machines. By default, the GPU Operator installs the NVIDIA Data Center drivers. The process to build and deploy NVIDIA vGPU/GRID drivers with the operator is covered here.
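For anyone landing here, a hedged sketch of what pointing the operator at a pre-built vGPU/GRID driver image can look like. The registry, image version, and ConfigMap name below are placeholders, not values from this issue; the `driver.repository`, `driver.version`, and `driver.licensingConfig.configMapName` chart values are the ones the vGPU deployment flow relies on:

```shell
# Assumes a vGPU/GRID driver container image has already been built and pushed
# to a private registry, and a ConfigMap holding the gridd.conf / client
# license token has been created. All concrete values below are placeholders.
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.repository=registry.example.com/nvidia \
  --set driver.version=<vgpu-driver-image-tag> \
  --set driver.licensingConfig.configMapName=licensing-config
```

The key difference from a default install is that the driver image comes from your own registry (built from the vGPU driver package downloaded via the NVIDIA Licensing Portal) rather than from NVIDIA's public Data Center driver images.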

@GabQ-Bib
Author

GabQ-Bib commented Apr 9, 2024

Hello @shivamerla,
You were indeed correct! I suspected I needed a different driver for vGPU but couldn't find the proper documentation.
I followed the doc you pointed to, made sure the operator uses a GRID driver, and it worked!
Thanks a lot for your help.

@GabQ-Bib GabQ-Bib closed this as completed Apr 9, 2024