
Driver Daemonset crash looping on azure flavour (Standard_NV36ads_A10_v5) #691

Closed
GabQ-Bib opened this issue Apr 3, 2024 · 2 comments
GabQ-Bib commented Apr 3, 2024

1. Quick Debug Information

  • OS/Version: Ubuntu 22.04
  • Kernel Version: 5.15.0-1031-azure x86_64
  • Container Runtime Type/Version: containerd
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): Tanzu Kubernetes Grid (IaaS distribution deployed on Azure VMs)
  • GPU Operator Version: v23.9.2

2. Issue or feature description

The Daemonset Pod nvidia-driver-daemonset fails to load the driver when starting on a VM with flavour Standard_NV36ads_A10_v5.
As far as I can tell from the compatibility matrix, the GPU in this flavour is compatible with this operator/device-driver combination.
The same operator on the same cluster deploys drivers successfully to T4-type GPUs.

It may be worth noting that this flavour was selected because it is compatible with Gen1 hypervisors, which is a prerequisite for the OS image our Kubernetes distribution uses.

An acceptable workaround would be a compatible flavour with comparable GPU performance (the project team specifically requested this GPU because of a performance bottleneck on T4-based VMs).

3. Steps to reproduce the issue

1/ Add a GPU node to the cluster with flavour Standard_NV36ads_A10_v5 on Azure
2/ Use a standard OS image (no preinstalled drivers)
3/ Deploy the GPU Operator with Helm chart v23.9.2
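For reference, step 3 typically looks like the following Helm commands; the release name and namespace are illustrative assumptions, not values taken from this issue:

```shell
# Add the NVIDIA Helm repository and deploy the GPU Operator chart at v23.9.2.
# Release name and namespace "gpu-operator" are illustrative choices.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --version=v23.9.2
```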

4. nvidia-driver-daemonset pod logs:

[1624686.270184] device teststor-3e0101 left promiscuous mode
[1624693.290594] device pbsmexpc-2be292 left promiscuous mode
[1624695.307411] device pbsmestp-bb1e30 left promiscuous mode
[1624726.882775] SUNRPC: reached max allowed number (1) did not add transport to server: 52.239.243.168
[1624727.259206] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[1624727.259257] IPv6: ADDRCONF(NETDEV_CHANGE): teststor-891195: link becomes ready
[1624727.262738] device teststor-891195 entered promiscuous mode
[1624732.369340] device teststor-891195 left promiscuous mode
[1624787.675720] SUNRPC: reached max allowed number (1) did not add transport to server: 52.239.243.168
[1624787.857829] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[1624787.857867] IPv6: ADDRCONF(NETDEV_CHANGE): teststor-7540d4: link becomes ready
[1624787.861027] device teststor-7540d4 entered promiscuous mode
[1624793.511435] device teststor-7540d4 left promiscuous mode
[1624833.096438] nvidia-nvlink: Nvlink Core is being initialized, major device number 510

[1624833.098384] NVRM: The NVIDIA GPU 0002:00:00.0 (PCI ID: 10de:2236)
                 NVRM: installed in this system is not supported by the
                 NVRM: NVIDIA 550.54.14 driver release.
                 NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
                 NVRM: in this release's README, available on the operating system
                 NVRM: specific graphics driver download page at www.nvidia.com.
[1624833.098729] nvidia: probe of 0002:00:00.0 failed with error -1
[1624833.098767] NVRM: The NVIDIA probe routine failed for 1 device(s).
[1624833.098768] NVRM: None of the NVIDIA devices were initialized.
[1624833.098998] nvidia-nvlink: Unregistered Nvlink Core, major device number 510
@shivamerla
Contributor

@GabQ-Bib you need the NVIDIA GRID drivers on these machines. By default, the GPU Operator installs the NVIDIA Data Center drivers. The process to build and deploy NVIDIA vGPU/GRID drivers with the operator is covered here.
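For anyone landing here, a hedged sketch of what pointing the operator at a pre-built vGPU/GRID driver image can look like. The registry, image version, and ConfigMap name below are placeholders, not values from this issue; the `driver.repository`, `driver.version`, and `driver.licensingConfig.configMapName` chart values are the ones the vGPU deployment flow relies on:

```shell
# Assumes a vGPU/GRID driver container image has already been built and pushed
# to a private registry, and a ConfigMap holding the gridd.conf / client
# license token has been created. All concrete values below are placeholders.
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.repository=registry.example.com/nvidia \
  --set driver.version=<vgpu-driver-image-tag> \
  --set driver.licensingConfig.configMapName=licensing-config
```

The key difference from a default install is that the driver image comes from your own registry (built from the vGPU driver package downloaded via the NVIDIA Licensing Portal) rather than from NVIDIA's public Data Center driver images.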

@GabQ-Bib
Author

GabQ-Bib commented Apr 9, 2024

Hello @shivamerla,
You were indeed correct! I suspected I needed a different driver for vGPU but couldn't find the proper documentation.
I followed the doc you pointed to, made sure the operator uses a GRID driver, and it worked!
Thanks a lot for your help.

@GabQ-Bib GabQ-Bib closed this as completed Apr 9, 2024