Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured #413

Closed
searce-aditya opened this issue Sep 29, 2022 · 27 comments


@searce-aditya

searce-aditya commented Sep 29, 2022

We are using a GKE cluster (v1.23.8-gke.1900) with NVIDIA multi-instance (MIG) A100 GPU nodes, and we want to install the NVIDIA GPU Operator on this cluster.

The default container runtime in our case is containerd, so we changed the default runtime to nvidia by adding the config below to /etc/containerd/config.toml:

version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

Then we restarted the containerd daemon:

systemctl restart containerd
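
As a quick sanity check (a sketch; it assumes crictl is installed and pointed at the containerd socket), you can confirm that the restarted containerd actually registered the nvidia runtime:

# Dump the merged containerd configuration and look for the nvidia runtime entry
sudo containerd config dump | grep -A 5 'runtimes.nvidia'

# Or inspect the runtime handlers as the CRI/kubelet side sees them
sudo crictl info | grep -i nvidia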

We followed these steps to deploy the GPU Operator:
Note: the NVIDIA drivers and device plugins are already present by default in the kube-system namespace.

  1. curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
    && chmod 700 get_helm.sh
    && ./get_helm.sh

  2. helm repo add nvidia https://nvidia.github.io/gpu-operator
    && helm repo update

  3. helm install gpu-operator nvidia/gpu-operator --set operator.defaultRuntime=containerd --namespace kube-system --create-namespace --set driver.enabled=false --set mig.strategy=single

Now when I run kubectl get all -n kube-system, none of the pods listed below are coming up:

pod/gpu-feature-discovery-5lxwl                                   0/1     Init:0/1   0              85m
pod/gpu-feature-discovery-6cjmm                                   0/1     Init:0/1   0              85m
pod/gpu-feature-discovery-bc5c9                                   0/1     Init:0/1   0              85m
pod/gpu-feature-discovery-chhng                                   0/1     Init:0/1   0              85m
pod/nvidia-container-toolkit-daemonset-47g2p                      0/1     Init:0/1   0              85m
pod/nvidia-container-toolkit-daemonset-r6s67                      0/1     Init:0/1   0              85m
pod/nvidia-container-toolkit-daemonset-svksk                      0/1     Init:0/1   0              85m
pod/nvidia-container-toolkit-daemonset-z5m4r                      0/1     Init:0/1   0              85m
pod/nvidia-dcgm-exporter-49t58                                    0/1     Init:0/1   0              85m
pod/nvidia-dcgm-exporter-k6wbg                                    0/1     Init:0/1   0              85m
pod/nvidia-dcgm-exporter-m8jrq                                    0/1     Init:0/1   0              85m
pod/nvidia-dcgm-exporter-p5tl9                                    0/1     Init:0/1   0              85m
pod/nvidia-device-plugin-daemonset-2jnnn                          0/1     Init:0/1   0              85m
pod/nvidia-device-plugin-daemonset-hwmlp                          0/1     Init:0/1   0              85m
pod/nvidia-device-plugin-daemonset-qnv6n                          0/1     Init:0/1   0              85m
pod/nvidia-device-plugin-daemonset-zxvgh                          0/1     Init:0/1   0              85m

When I describe one of the DCGM pods with kubectl describe pod nvidia-dcgm-exporter-49t58 -n kube-system, it shows the following error:

Type     Reason                  Age                    From     Message
  ----     ------                  ----                   ----     -------
  Warning  FailedCreatePodSandBox  4m22s (x394 over 89m)  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

Please suggest if we are missing something and how we can resolve this issue quickly.
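
For anyone hitting the same symptom: the message means the kubelet asked containerd for a runtime handler named "nvidia" that containerd does not know about. A minimal check (a sketch; it assumes the GPU Operator created a RuntimeClass named nvidia) is to compare the handler in the RuntimeClass with the runtimes section of the containerd config on the GPU node:

# Which low-level handler does the RuntimeClass point at?
kubectl get runtimeclass nvidia -o jsonpath='{.handler}{"\n"}'

# Is that handler present in the containerd config on the node?
sudo grep -n 'runtimes.nvidia' /etc/containerd/config.toml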

@elezar
Member

elezar commented Sep 29, 2022

@searce-aditya I see from the list of pods that you are deploying the nvidia-container-toolkit-daemonset; however, your comments seem to indicate that the driver and the NVIDIA Container Toolkit are already installed on the host.

Could you add --set toolkit.enabled=false to the options you use when deploying the operator?
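
For reference, applied to the install command from the original report, that would look roughly like this (a sketch using only the flags already discussed above; adjust values to your setup):

helm install gpu-operator nvidia/gpu-operator \
  --namespace kube-system --create-namespace \
  --set operator.defaultRuntime=containerd \
  --set driver.enabled=false \
  --set toolkit.enabled=false \
  --set mig.strategy=single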

@searce-aditya
Author

Hello @elezar, we tried adding --set toolkit.enabled=false while deploying the operator; we are still facing the same issue.

@shivamerla
Contributor

Is the driver ready on the node? Can you share the output of nvidia-smi from the node?

@arag00rn

arag00rn commented Oct 10, 2022

I have a different setup, but I'm facing exactly the same error. I tried chart versions 1.11.1 and 22.9.0 with the same results:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --set version=22.9.0 --set driver.enabled=false --set operator.defaultRuntime=containerd

| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:05:00.0  On |                  N/A |
|  0%   35C    P8    13W / 170W |    101MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1577      G   /usr/lib/xorg/Xorg                 70MiB |
|    0   N/A  N/A      1762      G   /usr/bin/gnome-shell               28MiB |
+-----------------------------------------------------------------------------+

I tried 22.9.0 following the [Platform Support](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/platform-support.html) matrix.

@LukasIAO

LukasIAO commented Oct 18, 2022

I'm running into the same problem on an NVIDIA DGX Station A100 running microk8s 1.25.2, following the process outlined in the GPU Operator manual for DGX systems.

The container environment works as expected in Docker.

@mristok

mristok commented Oct 20, 2022

I'm experiencing this same issue running K8s v1.21.14 and containerd. I have tried all the suggestions in this issue and it has not been resolved.

@shivamerla
Contributor

shivamerla commented Oct 20, 2022

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured is not an error as such; it is expected until the drivers are loaded and nvidia-container-runtime is set up. With driver containers it takes about 3-4 minutes to set up and load the drivers and toolkit, after which these errors should go away. The reason these errors appear is that we run some operands (device-plugin, gfd, dcgm, etc.) with runtimeClass set to nvidia, and this causes the error above until the drivers/toolkit are ready.

Please ignore these errors and let us know the state of all pods running with the GPU Operator. If any pod is still stuck in the init phase, we need to debug that component. Also make sure you understand the caveats with pre-installed drivers and container toolkit in some cases, as described here.
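
A quick way to check whether that setup has completed on a node (a sketch; the path is the one the operand init containers shown later in this thread wait on):

# On the GPU node: the operand init containers block until this file exists
ls -l /run/nvidia/validations/
test -f /run/nvidia/validations/toolkit-ready && echo "toolkit ready" || echo "toolkit not ready"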

@LukasIAO

LukasIAO commented Oct 21, 2022

It's definitely not a timing issue; these pods have been in this state for quite a while.

kube-system    calico-node-lkd9l                                                 1/1     Running    0          2d23h
kube-system    calico-kube-controllers-97b47d84d-2kghb                           1/1     Running    0          2d23h
kube-system    coredns-d489fb88-wnxnh                                            1/1     Running    0          2d23h
kube-system    dashboard-metrics-scraper-64bcc67c9c-thq65                        1/1     Running    0          2d23h
kube-system    kubernetes-dashboard-74b66d7f9c-lkjcg                             1/1     Running    0          2d23h
kube-system    hostpath-provisioner-85ccc46f96-sqgnl                             1/1     Running    0          2d23h
kube-system    metrics-server-6b6844c455-w56rj                                   1/1     Running    0          2d23h
gpu-operator   gpu-operator-1666083289-node-feature-discovery-worker-gq9fw       1/1     Running    0          2d23h
gpu-operator   gpu-operator-5dc6b8989b-dflzh                                     1/1     Running    0          2d23h
gpu-operator   gpu-operator-1666083289-node-feature-discovery-master-6f49r85q7   1/1     Running    0          2d23h
gpu-operator   nvidia-operator-validator-z8bvv                                   0/1     Init:0/4   0          2d23h
gpu-operator   nvidia-device-plugin-daemonset-c84bd                              0/1     Init:0/1   0          2d23h
gpu-operator   nvidia-dcgm-exporter-fn2cs                                        0/1     Init:0/1   0          2d23h
gpu-operator   gpu-feature-discovery-4px4v                                       0/1     Init:0/1   0          2d23h
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:01:00.0 Off |                    0 |
| N/A   41C    P0    56W / 275W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   41C    P0    56W / 275W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:81:00.0 Off |                    0 |
| N/A   41C    P0    58W / 275W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA DGX Display  On   | 00000000:C1:00.0 Off |                  N/A |
| 34%   43C    P8    N/A /  50W |      1MiB /  3911MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:C2:00.0 Off |                    0 |
| N/A   42C    P0    63W / 275W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Containerd and Docker configs have been set up as instructed in the documentation. Initially, I tried to install the operator via the microk8s GPU add-on but ran into issues. Following the recommended Helm 3 installation, all expected pods were created, but some of them got stuck in the init phase.

Docker config:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Containerd:

#temp disabled
#disabled_plugins = ["cri"]

#root = "/var/lib/containerd"
#state = "/run/containerd"
#subreaper = true
#oom_score = 0

#[grpc]
#  address = "/run/containerd/containerd.sock"
#  uid = 0
#  gid = 0

#[debug]
#  address = "/run/containerd/debug.sock"
#  uid = 0
#  gid = 0
#  level = "info"

version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
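
One microk8s-specific caveat worth checking here (an aside, not from this thread, and the paths below are assumptions based on a default snap install): microk8s runs its own bundled containerd, so edits to /etc/containerd/config.toml may never be seen by the containerd instance the kubelet actually talks to. Something like the following can show which config file each containerd process was started with:

# List running containerd processes and the --config argument they were launched with
ps -eo args | grep '[c]ontainerd'

# Check whether the nvidia runtime appears in the microk8s containerd config (path is an assumption)
sudo grep -rn 'nvidia' /var/snap/microk8s/current/args/ 2>/dev/null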

More details on the feature-discovery pod:

Name:                 gpu-feature-discovery-4px4v
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Runtime Class Name:   nvidia
Service Account:      nvidia-gpu-feature-discovery
Node:                 dgxadmin-dgx-station-a100-920-23487-2530-000/10.36.40.65
Start Time:           Tue, 18 Oct 2022 10:55:23 +0200
Labels:               app=gpu-feature-discovery
                      app.kubernetes.io/part-of=nvidia-gpu
                      controller-revision-hash=5ffb7c7b8b
                      pod-template-generation=1
Annotations:          <none>
Status:               Pending
IP:
IPs:                  <none>
Controlled By:        DaemonSet/gpu-feature-discovery
Init Containers:
  toolkit-validation:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for nvidia container stack to be setup; sleep 5; done
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /run/nvidia from run-nvidia (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5wqqm (ro)
Containers:
  gpu-feature-discovery:
    Container ID:
    Image:          nvcr.io/nvidia/gpu-feature-discovery:v0.6.2-ubi8
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      GFD_SLEEP_INTERVAL:          60s
      GFD_FAIL_ON_INIT_ERROR:      true
      GFD_MIG_STRATEGY:            single
      NVIDIA_MIG_MONITOR_DEVICES:  all
    Mounts:
      /etc/kubernetes/node-feature-discovery/features.d from output-dir (rw)
      /sys/class/dmi/id/product_name from dmi-product-name (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5wqqm (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  output-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/node-feature-discovery/features.d
    HostPathType:
  dmi-product-name:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/class/dmi/id/product_name
    HostPathType:
  run-nvidia:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  Directory
  kube-api-access-5wqqm:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.gpu-feature-discovery=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason                  Age                       From     Message
  ----     ------                  ----                      ----     -------
  Warning  FailedCreatePodSandBox  4m5s (x19852 over 2d23h)  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

Logs of the feature-discovery container are unavailable due to the init status.

@shivamerla
Contributor

@LukasIAO Can you add the following debug options under /etc/nvidia-container-runtime/config.toml? Can you also confirm that root is set to / there? With this in place, if you restart any of the operand pods (the operator-validator, for example), we should see logs in /var/log/nvidia-container-runtime.log and /var/log/nvidia-container-cli.log. That will help us confirm whether the runtime hook is invoked by docker or containerd. (See the short sketch after the config below for triggering these logs.)

disable-require = false
  
[nvidia-container-cli]
  debug = "/var/log/nvidia-container-toolkit.log"
  environment = []
  ldconfig = "@/run/nvidia/driver/sbin/ldconfig.real"
  load-kmods = true
  path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
  root = "/"

[nvidia-container-runtime]
  debug = "/var/log/nvidia-container-runtime.log"
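
To actually trigger those logs once the debug options are in place, a minimal sketch (the label matches the operator-validator DaemonSet referenced elsewhere in this thread; adjust the namespace if yours differs):

# Recreate the validator pod so the nvidia runtime is invoked again
kubectl delete pod -n gpu-operator -l app=nvidia-operator-validator

# Then watch the debug log paths configured above on the node
sudo tail -f /var/log/nvidia-container-runtime.log /var/log/nvidia-container-toolkit.log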

@LukasIAO

LukasIAO commented Oct 24, 2022

Hi @shivamerla,
thank you very much for your help. I've added/enabled the recommended lines in the config. Some variables differed in my installation, and I left those at their default values. I assume they are configured correctly, since the DGX OS ships with the runtime already installed.

Current config:

#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig.real"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"

# Specify the runtimes to consider. This list is processed in order and the PATH
# searched for matching executables unless the entry is an absolute path.
runtimes = [
    "docker-runc",
    "runc",
]

mode = "auto"

    [nvidia-container-runtime.modes.csv]

    mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

Updated config:

disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#temp change to '/'
root = "/"
#disabled by default
path = "/usr/bin/nvidia-container-cli"
environment = []
#disabled by default
debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig.real"




[nvidia-container-runtime]
#disabled by default
debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"

# Specify the runtimes to consider. This list is processed in order and the PATH
# searched for matching executables unless the entry is an absolute path.
runtimes = [
    "docker-runc",
    "runc",
]

mode = "auto"

    [nvidia-container-runtime.modes.csv]

    mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

I've restarted the daemonset pod, which is still running into the same issue.

Name:                 nvidia-device-plugin-daemonset-tltdv
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Runtime Class Name:   nvidia
Service Account:      nvidia-device-plugin
Node:                 dgxadmin-dgx-station-a100-920-23487-2530-000/10.36.40.65
Start Time:           Mon, 24 Oct 2022 10:57:01 +0200
Labels:               app=nvidia-device-plugin-daemonset
                      controller-revision-hash=784ffbf4f9
                      pod-template-generation=1
Annotations:          <none>
Status:               Pending
IP:
IPs:                  <none>
Controlled By:        DaemonSet/nvidia-device-plugin-daemonset
Init Containers:
  toolkit-validation:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for nvidia container stack to be setup; sleep 5; done
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /run/nvidia from run-nvidia (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2dpht (ro)
Containers:
  nvidia-device-plugin:
    Container ID:
    Image:         nvcr.io/nvidia/k8s-device-plugin:v0.12.3-ubi8
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      bash
      -c
    Args:
      [[ -f /run/nvidia/validations/host-driver-ready ]] && driver_root=/ || driver_root=/run/nvidia/driver; export NVIDIA_DRIVER_ROOT=$driver_root; exec nvidia-device-plugin;
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      PASS_DEVICE_SPECS:           true
      FAIL_ON_INIT_ERROR:          true
      DEVICE_LIST_STRATEGY:        envvar
      DEVICE_ID_STRATEGY:          uuid
      NVIDIA_VISIBLE_DEVICES:      all
      NVIDIA_DRIVER_CAPABILITIES:  all
      MIG_STRATEGY:                single
      NVIDIA_MIG_MONITOR_DEVICES:  all
    Mounts:
      /run/nvidia from run-nvidia (rw)
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2dpht (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  run-nvidia:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  Directory
  kube-api-access-2dpht:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.device-plugin=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason                  Age                     From     Message
  ----     ------                  ----                    ----     -------
  Warning  FailedCreatePodSandBox  4m24s (x26 over 9m50s)  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

Restarting the pod unfortunately did not create any nvidia-container-runtime or CLI logs. However, here are the logs for the gpu-operator as well as the gpu-operator namespace; perhaps they can be of some help.

/var/log/gpu-operator.log:

log_file: /var/log/gpu-manager.log
last_boot_file: /var/lib/ubuntu-drivers-common/last_gfx_boot
new_boot_file: /var/lib/ubuntu-drivers-common/last_gfx_boot
can't access /opt/amdgpu-pro/bin/amdgpu-pro-px
Looking for nvidia modules in /lib/modules/5.4.0-125-generic/kernel
Looking for nvidia modules in /lib/modules/5.4.0-125-generic/kernel/nvidia-515srv
Looking for nvidia modules in /lib/modules/5.4.0-125-generic/kernel/nvidia-515
Looking for nvidia modules in /lib/modules/5.4.0-125-generic/kernel/nvidia-510srv
Looking for nvidia modules in /lib/modules/5.4.0-125-generic/kernel/nvidia-510
Looking for nvidia modules in /lib/modules/5.4.0-125-generic/kernel/nvidia-470srv
Found nvidia.ko module in /lib/modules/5.4.0-125-generic/kernel/nvidia-470srv/nvidia.ko
Looking for amdgpu modules in /lib/modules/5.4.0-125-generic/kernel
Looking for amdgpu modules in /lib/modules/5.4.0-125-generic/updates/dkms
Is nvidia loaded? yes
Was nvidia unloaded? no
Is nvidia blacklisted? no
Is intel loaded? no
Is radeon loaded? no
Is radeon blacklisted? no
Is amdgpu loaded? no
Is amdgpu blacklisted? no
Is amdgpu versioned? no
Is amdgpu pro stack? no
Is nouveau loaded? no
Is nouveau blacklisted? yes
Is nvidia kernel module available? yes
Is amdgpu kernel module available? no
Vendor/Device Id: 1a03:2000
BusID "PCI:70@0:0:0"
Is boot vga? yes
Vendor/Device Id: 10de:20b0
BusID "PCI:129@0:0:0"
can't open /sys/bus/pci/devices/0000:81:00.0/boot_vga
Is boot vga? no
can't open /sys/bus/pci/devices/0000:81:00.0/boot_vga
Chassis type: "17"
Laptop not detected
Is nvidia runtime pm supported for "0x20b0"? no
Checking power status in /proc/driver/nvidia/gpus/0000:81:00.0/power
Runtime D3 status:          ?
Is nvidia runtime pm enabled for "0x20b0"? no
Vendor/Device Id: 10de:20b0
BusID "PCI:71@0:0:0"
can't open /sys/bus/pci/devices/0000:47:00.0/boot_vga
Is boot vga? no
can't open /sys/bus/pci/devices/0000:47:00.0/boot_vga
Chassis type: "17"
Laptop not detected
Is nvidia runtime pm supported for "0x20b0"? no
Checking power status in /proc/driver/nvidia/gpus/0000:47:00.0/power
Runtime D3 status:          ?
Is nvidia runtime pm enabled for "0x20b0"? no
Vendor/Device Id: 10de:1fb0
BusID "PCI:193@0:0:0"
Is boot vga? no
Chassis type: "17"
Laptop not detected
Is nvidia runtime pm supported for "0x1fb0"? no
Checking power status in /proc/driver/nvidia/gpus/0000:c1:00.0/power
Runtime D3 status:          ?
Is nvidia runtime pm enabled for "0x1fb0"? no
Vendor/Device Id: 10de:20b0
BusID "PCI:1@0:0:0"
can't open /sys/bus/pci/devices/0000:01:00.0/boot_vga
Is boot vga? no
can't open /sys/bus/pci/devices/0000:01:00.0/boot_vga
Chassis type: "17"
Laptop not detected
Is nvidia runtime pm supported for "0x20b0"? no
Checking power status in /proc/driver/nvidia/gpus/0000:01:00.0/power
Runtime D3 status:          Disabled by default
Is nvidia runtime pm enabled for "0x20b0"? no
Vendor/Device Id: 10de:20b0
BusID "PCI:194@0:0:0"
can't open /sys/bus/pci/devices/0000:c2:00.0/boot_vga
Is boot vga? no
can't open /sys/bus/pci/devices/0000:c2:00.0/boot_vga
Chassis type: "17"
Laptop not detected
Is nvidia runtime pm supported for "0x20b0"? no
Checking power status in /proc/driver/nvidia/gpus/0000:c2:00.0/power
Runtime D3 status:          ?
Is nvidia runtime pm enabled for "0x20b0"? no
Skipping "/dev/dri/card5", driven by "nvidia-drm"
Skipping "/dev/dri/card4", driven by "nvidia-drm"
Skipping "/dev/dri/card3", driven by "nvidia-drm"
Skipping "/dev/dri/card2", driven by "nvidia-drm"
Skipping "/dev/dri/card1", driven by "nvidia-drm"
Skipping "/dev/dri/card0", driven by "ast"
Skipping "/dev/dri/card5", driven by "nvidia-drm"
Skipping "/dev/dri/card4", driven by "nvidia-drm"
Skipping "/dev/dri/card3", driven by "nvidia-drm"
Skipping "/dev/dri/card2", driven by "nvidia-drm"
Skipping "/dev/dri/card1", driven by "nvidia-drm"
Skipping "/dev/dri/card0", driven by "ast"
Skipping "/dev/dri/card5", driven by "nvidia-drm"
Skipping "/dev/dri/card4", driven by "nvidia-drm"
Skipping "/dev/dri/card3", driven by "nvidia-drm"
Skipping "/dev/dri/card2", driven by "nvidia-drm"
Skipping "/dev/dri/card1", driven by "nvidia-drm"
Skipping "/dev/dri/card0", driven by "ast"
Skipping "/dev/dri/card5", driven by "nvidia-drm"
Skipping "/dev/dri/card4", driven by "nvidia-drm"
Skipping "/dev/dri/card3", driven by "nvidia-drm"
Skipping "/dev/dri/card2", driven by "nvidia-drm"
Skipping "/dev/dri/card1", driven by "nvidia-drm"
Skipping "/dev/dri/card0", driven by "ast"
Does it require offloading? no
last cards number = 6
Has amd? no
Has intel? no
Has nvidia? yes
How many cards? 6
Has the system changed? No
Takes 0ms to wait for nvidia udev rules completed.
Unsupported discrete card vendor: 10de
Nothing to do

/var/log/containers/gpu-operator*.logs:

2022-10-24T11:19:44.889924275+02:00 stderr F 1.666603184889807e+09      INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed {"state:": "state-node-status-exporter", "status": "disabled"}
2022-10-24T11:19:44.897367095+02:00 stderr F 1.666603184897253e+09      INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed {"state:": "state-vgpu-manager", "status": "disabled"}
2022-10-24T11:19:44.903645824+02:00 stderr F 1.6666031849035316e+09     INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed {"state:": "state-vgpu-device-manager", "status": "disabled"}
2022-10-24T11:19:44.911491405+02:00 stderr F 1.6666031849113731e+09     INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed {"state:": "state-sandbox-validation", "status": "disabled"}
2022-10-24T11:19:44.920587781+02:00 stderr F 1.6666031849204967e+09     INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed {"state:": "state-vfio-manager", "status": "disabled"}
2022-10-24T11:19:44.928322513+02:00 stderr F 1.6666031849282417e+09     INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed {"state:": "state-sandbox-device-plugin", "status": "disabled"}
2022-10-24T11:19:44.928330588+02:00 stderr F 1.666603184928259e+09      INFO    controllers.ClusterPolicy       ClusterPolicy isn't ready {"states not ready": ["state-operator-validation", "state-device-plugin", "state-dcgm-exporter", "gpu-feature-discovery"]}
2022-10-24T11:19:49.929276951+02:00 stderr F 1.6666031899291391e+09     INFO    controllers.ClusterPolicy       Sandbox workloads       {"Enabled": false, "DefaultWorkload": "container"}
2022-10-24T11:19:49.929296869+02:00 stderr F 1.666603189929195e+09      INFO    controllers.ClusterPolicy       GPU workload configuration{"NodeName": "dgxadmin-dgx-station-a100-920-23487-2530-000", "GpuWorkloadConfig": "container"}
2022-10-24T11:19:49.929301377+02:00 stderr F 1.6666031899292035e+09     INFO    controllers.ClusterPolicy       Checking GPU state labels on the node   {"NodeName": "dgxadmin-dgx-station-a100-920-23487-2530-000"}
2022-10-24T11:19:49.929305134+02:00 stderr F 1.6666031899292104e+09     INFO    controllers.ClusterPolicy       Number of nodes with GPU label  {"NodeCount": 1}
2022-10-24T11:19:49.929308751+02:00 stderr F 1.6666031899292326e+09     INFO    controllers.ClusterPolicy       Using container runtime: containerd
2022-10-24T11:19:49.929341132+02:00 stderr F 1.6666031899292479e+09     INFO    controllers.ClusterPolicy       Found Resource, updating...     {"RuntimeClass": "nvidia"}
2022-10-24T11:19:49.932388012+02:00 stderr F 1.6666031899322655e+09     INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed {"state:": "pre-requisites", "status": "ready"}
2022-10-24T11:19:49.932422487+02:00 stderr F 1.666603189932333e+09      INFO    controllers.ClusterPolicy       Found Resource, updating...     {"Service": "gpu-operator", "Namespace": "gpu-operator"}
2022-10-24T11:19:49.935075583+02:00 stderr F 1.666603189934942e+09      INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed {"state:": "state-operator-metrics", "status": "ready"}
2022-10-24T11:19:49.944488527+02:00 stderr F 1.6666031899443653e+09     INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed {"state:": "state-driver", "status": "disabled"}
2022-10-24T11:19:49.950519708+02:00 stderr F 1.6666031899504201e+09     INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed {"state:": "state-container-toolkit", "status": "disabled"}
2022-10-24T11:19:49.952929514+02:00 stderr F 1.6666031899528463e+09     INFO    controllers.ClusterPolicy       Found Resource, skipping update {"ServiceAccount": "nvidia-operator-validator", "Namespace": "gpu-operator"}
2022-10-24T11:19:49.955563323+02:00 stderr F 1.666603189955487e+09      INFO    controllers.ClusterPolicy       Found Resource, updating...     {"Role": "nvidia-operator-validator", "Namespace": "gpu-operator"}
2022-10-24T11:19:49.959463015+02:00 stderr F 1.6666031899593854e+09     INFO    controllers.ClusterPolicy       Found Resource, updating...     {"ClusterRole": "nvidia-operator-validator", "Namespace": "gpu-operator"}
2022-10-24T11:19:49.963319546+02:00 stderr F 1.6666031899632366e+09     INFO    controllers.ClusterPolicy       Found Resource, updating...     {"RoleBinding": "nvidia-operator-validator", "Namespace": "gpu-operator"}
2022-10-24T11:19:49.967390461+02:00 stderr F 1.6666031899673078e+09     INFO    controllers.ClusterPolicy       Found Resource, updating...     {"ClusterRoleBinding": "nvidia-operator-validator", "Namespace": "gpu-operator"}
2022-10-24T11:19:49.969474571+02:00 stderr F 1.666603189969397e+09      INFO    controllers.ClusterPolicy       DaemonSet identical, skipping update    {"DaemonSet": "nvidia-operator-validator", "Namespace": "gpu-operator", "name": "nvidia-operator-validator"}
2022-10-24T11:19:49.969479841+02:00 stderr F 1.6666031899694111e+09     INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"LabelSelector": "app=nvidia-operator-validator"}
2022-10-24T11:19:49.969482196+02:00 stderr F 1.6666031899694364e+09     INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"NumberOfDaemonSets": 1}
2022-10-24T11:19:49.969484289+02:00 stderr F 1.6666031899694402e+09     INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"NumberUnavailable": 1}
2022-10-24T11:19:49.969486083+02:00 stderr F 1.6666031899694436e+09     INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed {"state:": "state-operator-validation", "status": "notReady"}
2022-10-24T11:19:49.971824964+02:00 stderr F 1.6666031899717581e+09     INFO    controllers.ClusterPolicy       Found Resource, skipping update {"ServiceAccount": "nvidia-device-plugin", "Namespace": "gpu-operator"}
2022-10-24T11:19:49.974246151+02:00 stderr F 1.666603189974167e+09      INFO    controllers.ClusterPolicy       Found Resource, updating...     {"Role": "nvidia-device-plugin", "Namespace": "gpu-operator"}
2022-10-24T11:19:49.97832913+02:00 stderr F 1.6666031899782453e+09      INFO    controllers.ClusterPolicy       Found Resource, updating...     {"ClusterRole": "nvidia-device-plugin", "Namespace": "gpu-operator"}
2022-10-24T11:19:49.982385448+02:00 stderr F 1.666603189982306e+09      INFO    controllers.ClusterPolicy       Found Resource, updating...     {"RoleBinding": "nvidia-device-plugin", "Namespace": "gpu-operator"}
2022-10-24T11:19:49.98709994+02:00 stderr F 1.6666031899865522e+09      INFO    controllers.ClusterPolicy       Found Resource, updating...     {"ClusterRoleBinding": "nvidia-device-plugin", "Namespace": "gpu-operator"}
2022-10-24T11:19:49.989852654+02:00 stderr F 1.666603189989776e+09      INFO    controllers.ClusterPolicy       DaemonSet identical, skipping update    {"DaemonSet": "nvidia-device-plugin-daemonset", "Namespace": "gpu-operator", "name": "nvidia-device-plugin-daemonset"}
2022-10-24T11:19:49.989857924+02:00 stderr F 1.6666031899898036e+09     INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"LabelSelector": "app=nvidia-device-plugin-daemonset"}
2022-10-24T11:19:49.989889704+02:00 stderr F 1.6666031899898317e+09     INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"NumberOfDaemonSets": 1}
2022-10-24T11:19:49.989891998+02:00 stderr F 1.6666031899898388e+09     INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"NumberUnavailable": 1}
2022-10-24T11:19:49.989894192+02:00 stderr F 1.666603189989843e+09      INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed {"state:": "state-device-plugin", "status": "notReady"}
2022-10-24T11:19:49.99555849+02:00 stderr F 1.6666031899954753e+09      INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed {"state:": "state-dcgm", "status": "disabled"}
2022-10-24T11:19:49.998001799+02:00 stderr F 1.6666031899979186e+09     INFO    controllers.ClusterPolicy       Found Resource, skipping update {"ServiceAccount": "nvidia-dcgm-exporter", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.000277611+02:00 stderr F 1.666603190000206e+09      INFO    controllers.ClusterPolicy       Found Resource, updating...     {"Role": "nvidia-dcgm-exporter", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.004443246+02:00 stderr F 1.666603190004365e+09      INFO    controllers.ClusterPolicy       Found Resource, updating...     {"RoleBinding": "nvidia-dcgm-exporter", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.006322138+02:00 stderr F 1.666603190006245e+09      INFO    controllers.ClusterPolicy       Found Resource, updating...     {"Service": "nvidia-dcgm-exporter", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.008463596+02:00 stderr F 1.6666031900083907e+09     INFO    controllers.ClusterPolicy       DaemonSet identical, skipping update    {"DaemonSet": "nvidia-dcgm-exporter", "Namespace": "gpu-operator", "name": "nvidia-dcgm-exporter"}
2022-10-24T11:19:50.00847625+02:00 stderr F 1.666603190008402e+09       INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"LabelSelector": "app=nvidia-dcgm-exporter"}
2022-10-24T11:19:50.008479637+02:00 stderr F 1.6666031900084198e+09     INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"NumberOfDaemonSets": 1}
2022-10-24T11:19:50.00848162+02:00 stderr F 1.6666031900084226e+09      INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"NumberUnavailable": 1}
2022-10-24T11:19:50.008484415+02:00 stderr F 1.6666031900084267e+09     INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed {"state:": "state-dcgm-exporter", "status": "notReady"}
2022-10-24T11:19:50.01115833+02:00 stderr F 1.6666031900110757e+09      INFO    controllers.ClusterPolicy       Found Resource, skipping update {"ServiceAccount": "nvidia-gpu-feature-discovery", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.013800435+02:00 stderr F 1.6666031900137217e+09     INFO    controllers.ClusterPolicy       Found Resource, updating...     {"Role": "nvidia-gpu-feature-discovery", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.017510739+02:00 stderr F 1.6666031900174518e+09     INFO    controllers.ClusterPolicy       Found Resource, updating...     {"ClusterRole": "nvidia-gpu-feature-discovery", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.021401925+02:00 stderr F 1.666603190021325e+09      INFO    controllers.ClusterPolicy       Found Resource, updating...     {"RoleBinding": "nvidia-gpu-feature-discovery", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.025175158+02:00 stderr F 1.6666031900250971e+09     INFO    controllers.ClusterPolicy       Found Resource, updating...     {"ClusterRoleBinding": "nvidia-gpu-feature-discovery", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.027109224+02:00 stderr F 1.6666031900270455e+09     INFO    controllers.ClusterPolicy       DaemonSet identical, skipping update    {"DaemonSet": "gpu-feature-discovery", "Namespace": "gpu-operator", "name": "gpu-feature-discovery"}
2022-10-24T11:19:50.027116348+02:00 stderr F 1.66660319002706e+09       INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"LabelSelector": "app=gpu-feature-discovery"}
2022-10-24T11:19:50.027118752+02:00 stderr F 1.6666031900270865e+09     INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"NumberOfDaemonSets": 1}
2022-10-24T11:19:50.027121447+02:00 stderr F 1.666603190027092e+09      INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"NumberUnavailable": 1}
2022-10-24T11:19:50.027129413+02:00 stderr F 1.6666031900270965e+09     INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed {"state:": "gpu-feature-discovery", "status": "notReady"}
2022-10-24T11:19:50.029697307+02:00 stderr F 1.6666031900296195e+09     INFO    controllers.ClusterPolicy       Found Resource, skipping update {"ServiceAccount": "nvidia-mig-manager", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.031903999+02:00 stderr F 1.6666031900318484e+09     INFO    controllers.ClusterPolicy       Found Resource, updating...     {"Role": "nvidia-mig-manager", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.035654919+02:00 stderr F 1.6666031900355768e+09     INFO    controllers.ClusterPolicy       Found Resource, updating...     {"ClusterRole": "nvidia-mig-manager", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.039583005+02:00 stderr F 1.6666031900394998e+09     INFO    controllers.ClusterPolicy       Found Resource, updating...     {"RoleBinding": "nvidia-mig-manager", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.043497384+02:00 stderr F 1.6666031900434043e+09     INFO    controllers.ClusterPolicy       Found Resource, updating...     {"ClusterRoleBinding": "nvidia-mig-manager", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.047675753+02:00 stderr F 1.666603190047592e+09      INFO    controllers.ClusterPolicy       Found Resource, updating...     {"ConfigMap": "default-mig-parted-config", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.051867748+02:00 stderr F 1.6666031900517852e+09     INFO    controllers.ClusterPolicy       Found Resource, updating...     {"ConfigMap": "default-gpu-clients", "Namespace": "gpu-operator"}
2022-10-24T11:19:50.053732303+02:00 stderr F 1.666603190053671e+09      INFO    controllers.ClusterPolicy       DaemonSet identical, skipping update    {"DaemonSet": "nvidia-mig-manager", "Namespace": "gpu-operator", "name": "nvidia-mig-manager"}
2022-10-24T11:19:50.053737482+02:00 stderr F 1.666603190053683e+09      INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"LabelSelector": "app=nvidia-mig-manager"}
2022-10-24T11:19:50.053739546+02:00 stderr F 1.6666031900537012e+09     INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"NumberOfDaemonSets": 1}
2022-10-24T11:19:50.05374156+02:00 stderr F 1.6666031900537043e+09      INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"NumberUnavailable": 0}
2022-10-24T11:19:50.053743965+02:00 stderr F 1.666603190053708e+09      INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed {"state:": "state-mig-manager", "status": "ready"}
2022-10-24T11:19:50.062922406+02:00 stderr F 1.666603190062841e+09      INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed {"state:": "state-node-status-exporter", "status": "disabled"}
2022-10-24T11:19:50.070873386+02:00 stderr F 1.6666031900707908e+09     INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed {"state:": "state-vgpu-manager", "status": "disabled"}
2022-10-24T11:19:50.077472792+02:00 stderr F 1.6666031900773983e+09     INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed {"state:": "state-vgpu-device-manager", "status": "disabled"}
2022-10-24T11:19:50.084970015+02:00 stderr F 1.6666031900848877e+09     INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed {"state:": "state-sandbox-validation", "status": "disabled"}
2022-10-24T11:19:50.093761595+02:00 stderr F 1.6666031900936785e+09     INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed {"state:": "state-vfio-manager", "status": "disabled"}
2022-10-24T11:19:50.101456491+02:00 stderr F 1.6666031901013722e+09     INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed {"state:": "state-sandbox-device-plugin", "status": "disabled"}
2022-10-24T11:19:50.101463524+02:00 stderr F 1.6666031901013896e+09     INFO    controllers.ClusterPolicy       ClusterPolicy isn't ready {"states not ready": ["state-operator-validation", "state-device-plugin", "state-dcgm-exporter", "gpu-feature-discovery"]}

ClusterPolicy isn't ready {"states not ready": ["state-operator-validation", "state-device-plugin", "state-dcgm-exporter", "gpu-feature-discovery"]} Does this tell us anything useful?

@LukasIAO

LukasIAO commented Nov 9, 2022

@shivamerla It turns out that an unrelated system reboot did lead to the creation of the /var/log/nvidia-container-runtime.log file.

.
.
.
{"level":"info","msg":"Using OCI specification file path: /run/containerd/io.containerd.runtime.v2.task/moby/ebb013110bd0849f6ead70f5251c0d607cdfae181bca8f6f7ed05adc33c3af14/config.json","time":"2022-11-09T20:27:15+01:00"}
{"level":"info","msg":"Auto-detected mode as 'legacy'","time":"2022-11-09T20:27:15+01:00"}
{"level":"info","msg":"Using prestart hook path: /usr/bin/nvidia-container-runtime-hook","time":"2022-11-09T20:27:15+01:00"}
{"level":"info","msg":"Applied required modification to OCI specification","time":"2022-11-09T20:27:15+01:00"}
{"level":"info","msg":"Forwarding command to runtime","time":"2022-11-09T20:27:15+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:27:15+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:27:16+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:27:16+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:27:58+01:00"}
{"level":"info","msg":"Using OCI specification file path: /run/containerd/io.containerd.runtime.v2.task/moby/84f417760864250d40425109d07c6d03dddecf355a022f2001057726782c049a/config.json","time":"2022-11-09T20:27:58+01:00"}
{"level":"info","msg":"Auto-detected mode as 'legacy'","time":"2022-11-09T20:27:58+01:00"}
{"level":"info","msg":"Using prestart hook path: /usr/bin/nvidia-container-runtime-hook","time":"2022-11-09T20:27:58+01:00"}
{"level":"info","msg":"Applied required modification to OCI specification","time":"2022-11-09T20:27:58+01:00"}
{"level":"info","msg":"Forwarding command to runtime","time":"2022-11-09T20:27:58+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:27:58+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:27:59+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:27:59+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:33:33+01:00"}
{"level":"info","msg":"Using OCI specification file path: /run/containerd/io.containerd.runtime.v2.task/moby/146ba8cd3689b35d81c18c33bce37f687f3174f947f2a59bba2669437b355172/config.json","time":"2022-11-09T20:33:33+01:00"}
{"level":"info","msg":"Auto-detected mode as 'legacy'","time":"2022-11-09T20:33:33+01:00"}
{"level":"info","msg":"Using prestart hook path: /usr/bin/nvidia-container-runtime-hook","time":"2022-11-09T20:33:33+01:00"}
{"level":"info","msg":"Applied required modification to OCI specification","time":"2022-11-09T20:33:33+01:00"}
{"level":"info","msg":"Forwarding command to runtime","time":"2022-11-09T20:33:33+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:33:33+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:33:36+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:33:36+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:35:26+01:00"}
{"level":"info","msg":"Using OCI specification file path: /run/containerd/io.containerd.runtime.v2.task/moby/a46c815e945c9c29ec298bb247e3cbb6845dcaddf9c5463ec51b58c3cbbe5d3f/config.json","time":"2022-11-09T20:35:26+01:00"}
{"level":"info","msg":"Auto-detected mode as 'legacy'","time":"2022-11-09T20:35:26+01:00"}
{"level":"info","msg":"Using prestart hook path: /usr/bin/nvidia-container-runtime-hook","time":"2022-11-09T20:35:26+01:00"}
{"level":"info","msg":"Applied required modification to OCI specification","time":"2022-11-09T20:35:26+01:00"}
{"level":"info","msg":"Forwarding command to runtime","time":"2022-11-09T20:35:26+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:35:26+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:35:28+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:35:28+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:36:56+01:00"}
{"level":"info","msg":"Using OCI specification file path: /run/containerd/io.containerd.runtime.v2.task/moby/a504d55e7a9f34970a6c5eb0f8a69e15aa64fa7e1b75b1da0556cff8fa1da4fe/config.json","time":"2022-11-09T20:36:56+01:00"}
{"level":"info","msg":"Auto-detected mode as 'legacy'","time":"2022-11-09T20:36:56+01:00"}
{"level":"info","msg":"Using prestart hook path: /usr/bin/nvidia-container-runtime-hook","time":"2022-11-09T20:36:56+01:00"}
{"level":"info","msg":"Applied required modification to OCI specification","time":"2022-11-09T20:36:56+01:00"}
{"level":"info","msg":"Forwarding command to runtime","time":"2022-11-09T20:36:56+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:36:56+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:36:58+01:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-11-09T20:36:58+01:00"}

The /var/log/nvidia-container-cli.log file is still missing, however. I hope it's still helpful.
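
Since the question above was whether the runtime hook is invoked by docker or by containerd/CRI, one way to read this log (a sketch; it just groups the OCI spec paths already printed above) is to count invocations per containerd namespace. The entries above all reference .../moby/... paths, i.e. containers launched via Docker, while CRI-launched pods would normally appear under a different namespace such as k8s.io:

# Count nvidia-container-runtime invocations per containerd namespace
sudo grep -o 'io.containerd.runtime.v2.task/[^/]*' /var/log/nvidia-container-runtime.log | sort | uniq -c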

@elezar
Member

elezar commented Nov 10, 2022

@LukasIAO the other log file of interest may be /var/log/nvidia-container-toolkit.log.

@LukasIAO

Hi @elezar, thank you for taking the time.

Here is the toolkit.log:

I1109 19:36:56.714689 2558903 nvc.c:376] initializing library context (version=1.10.0, build=395fd41701117121f1fd04ada01e1d7e006a37ae)
I1109 19:36:56.714724 2558903 nvc.c:350] using root /
I1109 19:36:56.714729 2558903 nvc.c:351] using ldcache /etc/ld.so.cache
I1109 19:36:56.714734 2558903 nvc.c:352] using unprivileged user 65534:65534
I1109 19:36:56.714774 2558903 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I1109 19:36:56.714846 2558903 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
I1109 19:36:56.716665 2558909 nvc.c:278] loading kernel module nvidia
I1109 19:36:56.716824 2558909 nvc.c:282] running mknod for /dev/nvidiactl
I1109 19:36:56.716851 2558909 nvc.c:286] running mknod for /dev/nvidia0
I1109 19:36:56.716867 2558909 nvc.c:286] running mknod for /dev/nvidia1
I1109 19:36:56.716882 2558909 nvc.c:286] running mknod for /dev/nvidia2
I1109 19:36:56.716896 2558909 nvc.c:286] running mknod for /dev/nvidia3
I1109 19:36:56.716911 2558909 nvc.c:286] running mknod for /dev/nvidia4
I1109 19:36:56.716926 2558909 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps
I1109 19:36:56.722348 2558909 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I1109 19:36:56.722439 2558909 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I1109 19:36:56.725132 2558909 nvc.c:296] loading kernel module nvidia_uvm
I1109 19:36:56.725192 2558909 nvc.c:300] running mknod for /dev/nvidia-uvm
I1109 19:36:56.725255 2558909 nvc.c:305] loading kernel module nvidia_modeset
I1109 19:36:56.725310 2558909 nvc.c:309] running mknod for /dev/nvidia-modeset
I1109 19:36:56.725496 2558910 rpc.c:71] starting driver rpc service
I1109 19:36:56.729632 2558911 rpc.c:71] starting nvcgo rpc service
I1109 19:36:56.730197 2558903 nvc_container.c:240] configuring container with 'compute utility video supervised'
I1109 19:36:56.731494 2558903 nvc_container.c:262] setting pid to 2558855
I1109 19:36:56.731502 2558903 nvc_container.c:263] setting rootfs to /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged
I1109 19:36:56.731507 2558903 nvc_container.c:264] setting owner to 0:0
I1109 19:36:56.731512 2558903 nvc_container.c:265] setting bins directory to /usr/bin
I1109 19:36:56.731517 2558903 nvc_container.c:266] setting libs directory to /usr/lib/x86_64-linux-gnu
I1109 19:36:56.731521 2558903 nvc_container.c:267] setting libs32 directory to /usr/lib/i386-linux-gnu
I1109 19:36:56.731527 2558903 nvc_container.c:268] setting cudart directory to /usr/local/cuda
I1109 19:36:56.731533 2558903 nvc_container.c:269] setting ldconfig to @/sbin/ldconfig.real (host relative)
I1109 19:36:56.731538 2558903 nvc_container.c:270] setting mount namespace to /proc/2558855/ns/mnt
I1109 19:36:56.731542 2558903 nvc_container.c:272] detected cgroupv1
I1109 19:36:56.731547 2558903 nvc_container.c:273] setting devices cgroup to /sys/fs/cgroup/devices/docker/a504d55e7a9f34970a6c5eb0f8a69e15aa64fa7e1b75b1da0556cff8fa1da4fe
I1109 19:36:56.731553 2558903 nvc_info.c:766] requesting driver information with ''
I1109 19:36:56.732442 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.470.141.03
I1109 19:36:56.732486 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.470.141.03
I1109 19:36:56.732509 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.470.141.03
I1109 19:36:56.732534 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.141.03
I1109 19:36:56.732570 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.470.141.03
I1109 19:36:56.732607 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.470.141.03
I1109 19:36:56.732635 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-nscq.so.470.141.03
I1109 19:36:56.732667 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.470.141.03
I1109 19:36:56.732689 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.141.03
I1109 19:36:56.732720 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.470.141.03
I1109 19:36:56.732751 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.470.141.03
I1109 19:36:56.732795 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.470.141.03
I1109 19:36:56.732816 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.470.141.03
I1109 19:36:56.732838 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.470.141.03
I1109 19:36:56.732869 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.470.141.03
I1109 19:36:56.732900 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.470.141.03
I1109 19:36:56.732921 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.470.141.03
I1109 19:36:56.732944 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.470.141.03
I1109 19:36:56.732972 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.470.141.03
I1109 19:36:56.732992 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.470.141.03
I1109 19:36:56.733023 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.470.141.03
I1109 19:36:56.733210 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.470.141.03
I1109 19:36:56.733292 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.470.141.03
I1109 19:36:56.733314 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.470.141.03
I1109 19:36:56.733335 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.470.141.03
I1109 19:36:56.733359 2558903 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.470.141.03
W1109 19:36:56.733390 2558903 nvc_info.c:399] missing library libcudadebugger.so
W1109 19:36:56.733395 2558903 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
W1109 19:36:56.733400 2558903 nvc_info.c:399] missing library libnvidia-pkcs11.so
W1109 19:36:56.733405 2558903 nvc_info.c:399] missing library libvdpau_nvidia.so
W1109 19:36:56.733409 2558903 nvc_info.c:403] missing compat32 library libnvidia-ml.so
W1109 19:36:56.733414 2558903 nvc_info.c:403] missing compat32 library libnvidia-cfg.so
W1109 19:36:56.733419 2558903 nvc_info.c:403] missing compat32 library libnvidia-nscq.so
W1109 19:36:56.733423 2558903 nvc_info.c:403] missing compat32 library libcuda.so
W1109 19:36:56.733428 2558903 nvc_info.c:403] missing compat32 library libcudadebugger.so
W1109 19:36:56.733433 2558903 nvc_info.c:403] missing compat32 library libnvidia-opencl.so
W1109 19:36:56.733437 2558903 nvc_info.c:403] missing compat32 library libnvidia-ptxjitcompiler.so
W1109 19:36:56.733442 2558903 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so
W1109 19:36:56.733446 2558903 nvc_info.c:403] missing compat32 library libnvidia-allocator.so
W1109 19:36:56.733451 2558903 nvc_info.c:403] missing compat32 library libnvidia-compiler.so
W1109 19:36:56.733456 2558903 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so
W1109 19:36:56.733460 2558903 nvc_info.c:403] missing compat32 library libnvidia-ngx.so
W1109 19:36:56.733465 2558903 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
W1109 19:36:56.733469 2558903 nvc_info.c:403] missing compat32 library libnvidia-encode.so
W1109 19:36:56.733474 2558903 nvc_info.c:403] missing compat32 library libnvidia-opticalflow.so
W1109 19:36:56.733479 2558903 nvc_info.c:403] missing compat32 library libnvcuvid.so
W1109 19:36:56.733483 2558903 nvc_info.c:403] missing compat32 library libnvidia-eglcore.so
W1109 19:36:56.733488 2558903 nvc_info.c:403] missing compat32 library libnvidia-glcore.so
W1109 19:36:56.733493 2558903 nvc_info.c:403] missing compat32 library libnvidia-tls.so
W1109 19:36:56.733497 2558903 nvc_info.c:403] missing compat32 library libnvidia-glsi.so
W1109 19:36:56.733502 2558903 nvc_info.c:403] missing compat32 library libnvidia-fbc.so
W1109 19:36:56.733506 2558903 nvc_info.c:403] missing compat32 library libnvidia-ifr.so
W1109 19:36:56.733511 2558903 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so
W1109 19:36:56.733523 2558903 nvc_info.c:403] missing compat32 library libnvoptix.so
W1109 19:36:56.733527 2558903 nvc_info.c:403] missing compat32 library libGLX_nvidia.so
W1109 19:36:56.733532 2558903 nvc_info.c:403] missing compat32 library libEGL_nvidia.so
W1109 19:36:56.733537 2558903 nvc_info.c:403] missing compat32 library libGLESv2_nvidia.so
W1109 19:36:56.733541 2558903 nvc_info.c:403] missing compat32 library libGLESv1_CM_nvidia.so
W1109 19:36:56.733546 2558903 nvc_info.c:403] missing compat32 library libnvidia-glvkspirv.so
W1109 19:36:56.733550 2558903 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
I1109 19:36:56.733745 2558903 nvc_info.c:299] selecting /usr/bin/nvidia-smi
I1109 19:36:56.733758 2558903 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
I1109 19:36:56.733770 2558903 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
I1109 19:36:56.733788 2558903 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
I1109 19:36:56.733801 2558903 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server
W1109 19:36:56.733830 2558903 nvc_info.c:425] missing binary nv-fabricmanager
I1109 19:36:56.733849 2558903 nvc_info.c:343] listing firmware path /usr/lib/firmware/nvidia/470.141.03/gsp.bin
I1109 19:36:56.733864 2558903 nvc_info.c:529] listing device /dev/nvidiactl
I1109 19:36:56.733869 2558903 nvc_info.c:529] listing device /dev/nvidia-uvm
I1109 19:36:56.733874 2558903 nvc_info.c:529] listing device /dev/nvidia-uvm-tools
I1109 19:36:56.733879 2558903 nvc_info.c:529] listing device /dev/nvidia-modeset
I1109 19:36:56.733896 2558903 nvc_info.c:343] listing ipc path /run/nvidia-persistenced/socket
W1109 19:36:56.733911 2558903 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket
W1109 19:36:56.733921 2558903 nvc_info.c:349] missing ipc path /tmp/nvidia-mps
I1109 19:36:56.733926 2558903 nvc_info.c:822] requesting device information with ''
I1109 19:36:56.740598 2558903 nvc_info.c:713] listing device /dev/nvidia4 (GPU-f03eaa43-86e1-e9e9-9611-a7ac07919c57 at 00000000:01:00.0)
I1109 19:36:56.747015 2558903 nvc_info.c:713] listing device /dev/nvidia3 (GPU-07f52695-96d8-e750-2a03-fd133cfc332e at 00000000:47:00.0)
I1109 19:36:56.753315 2558903 nvc_info.c:713] listing device /dev/nvidia2 (GPU-f8ef4daa-4956-327d-fa81-1f9168e99402 at 00000000:81:00.0)
I1109 19:36:56.759505 2558903 nvc_info.c:713] listing device /dev/nvidia1 (GPU-c725c00f-684f-0913-2178-b85c8589f26d at 00000000:c2:00.0)
I1109 19:36:56.759543 2558903 nvc_mount.c:366] mounting tmpfs at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/proc/driver/nvidia
I1109 19:36:56.759937 2558903 nvc_mount.c:134] mounting /usr/bin/nvidia-smi at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/bin/nvidia-smi
I1109 19:36:56.759990 2558903 nvc_mount.c:134] mounting /usr/bin/nvidia-debugdump at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/bin/nvidia-debugdump
I1109 19:36:56.760025 2558903 nvc_mount.c:134] mounting /usr/bin/nvidia-persistenced at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/bin/nvidia-persistenced
I1109 19:36:56.760059 2558903 nvc_mount.c:134] mounting /usr/bin/nvidia-cuda-mps-control at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/bin/nvidia-cuda-mps-control
I1109 19:36:56.760091 2558903 nvc_mount.c:134] mounting /usr/bin/nvidia-cuda-mps-server at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/bin/nvidia-cuda-mps-server
I1109 19:36:56.760250 2558903 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.141.03 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.141.03
I1109 19:36:56.760294 2558903 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.470.141.03 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.470.141.03
I1109 19:36:56.760339 2558903 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-nscq.so.470.141.03 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/lib/x86_64-linux-gnu/libnvidia-nscq.so.470.141.03
I1109 19:36:56.760373 2558903 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libcuda.so.470.141.03 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/lib/x86_64-linux-gnu/libcuda.so.470.141.03
I1109 19:36:56.760409 2558903 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.470.141.03 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.470.141.03
I1109 19:36:56.760448 2558903 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.141.03 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.141.03
I1109 19:36:56.760482 2558903 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.470.141.03 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.470.141.03
I1109 19:36:56.760518 2558903 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.470.141.03 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.470.141.03
I1109 19:36:56.760556 2558903 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.470.141.03 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.470.141.03
I1109 19:36:56.760591 2558903 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.470.141.03 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.470.141.03
I1109 19:36:56.760625 2558903 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.470.141.03 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/lib/x86_64-linux-gnu/libnvcuvid.so.470.141.03
I1109 19:36:56.760645 2558903 nvc_mount.c:527] creating symlink /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
I1109 19:36:56.760668 2558903 nvc_mount.c:527] creating symlink /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so -> libnvidia-opticalflow.so.1
I1109 19:36:56.761145 2558903 nvc_mount.c:85] mounting /usr/lib/firmware/nvidia/470.141.03/gsp.bin at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/lib/firmware/nvidia/470.141.03/gsp.bin with flags 0x7
I1109 19:36:56.761247 2558903 nvc_mount.c:261] mounting /run/nvidia-persistenced/socket at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/run/nvidia-persistenced/socket
I1109 19:36:56.761280 2558903 nvc_mount.c:230] mounting /dev/nvidiactl at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/dev/nvidiactl
I1109 19:36:56.761449 2558903 nvc_mount.c:230] mounting /dev/nvidia-uvm at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/dev/nvidia-uvm
I1109 19:36:56.761531 2558903 nvc_mount.c:230] mounting /dev/nvidia-uvm-tools at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/dev/nvidia-uvm-tools
I1109 19:36:56.761624 2558903 nvc_mount.c:230] mounting /dev/nvidia4 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/dev/nvidia4
I1109 19:36:56.761696 2558903 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:01:00.0 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/proc/driver/nvidia/gpus/0000:01:00.0
I1109 19:36:56.761792 2558903 nvc_mount.c:230] mounting /dev/nvidia3 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/dev/nvidia3
I1109 19:36:56.761837 2558903 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:47:00.0 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/proc/driver/nvidia/gpus/0000:47:00.0
I1109 19:36:56.761926 2558903 nvc_mount.c:230] mounting /dev/nvidia2 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/dev/nvidia2
I1109 19:36:56.761969 2558903 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:81:00.0 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/proc/driver/nvidia/gpus/0000:81:00.0
I1109 19:36:56.762058 2558903 nvc_mount.c:230] mounting /dev/nvidia1 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/dev/nvidia1
I1109 19:36:56.762098 2558903 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:c2:00.0 at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged/proc/driver/nvidia/gpus/0000:c2:00.0
I1109 19:36:56.762144 2558903 nvc_ldcache.c:372] executing /sbin/ldconfig.real from host at /var/lib/docker/overlay2/0e055b09100eda7c57cc731534aa536fe7c0b1e279ddbf0aaa21eeb034320613/merged
I1109 19:36:56.800505 2558903 nvc.c:434] shutting down library context
I1109 19:36:56.800661 2558911 rpc.c:95] terminating nvcgo rpc service
I1109 19:36:56.801331 2558903 rpc.c:135] nvcgo rpc service terminated successfully
I1109 19:36:56.802516 2558910 rpc.c:95] terminating driver rpc service
I1109 19:36:56.802611 2558903 rpc.c:135] driver rpc service terminated successfully
 

@hholst80
Copy link

hholst80 commented Jan 1, 2023

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

You need to define a RuntimeClass named nvidia whose handler is nvidia:

apiVersion: node.k8s.io/v1
handler: nvidia
kind: RuntimeClass
metadata:
  name: nvidia
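
A minimal way to apply and verify this, assuming the manifest is saved as nvidia-runtimeclass.yaml (the file name is only an example):

kubectl apply -f nvidia-runtimeclass.yaml
kubectl get runtimeclass nvidia

Pods that should use the NVIDIA runtime then reference it with runtimeClassName: nvidia in their spec.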

@captainsk7
Copy link

@hholst80 in which file should I define a runtimeClass?

@shivamerla
Copy link
Contributor

@captainsk7 The nvidia RuntimeClass is created by default by the gpu-operator and set as the default runtime in the containerd config. failed to get sandbox runtime: no runtime for "nvidia" is configured is expected while the driver is still being installed, but it should recover after that. Can you provide more details about your setup?

  1. Are drivers/container-toolkit pre-installed on the host or installed by the GPU operator?
  2. OS version
  3. Containerd config (/etc/containerd/config.toml)
  4. Status of all pods under gpu-operator namespace
  5. Logs from init-containers of device-plugin and container-toolkit. (kubectl logs --all-containers -lapp=nvidia-device-plugin-daemonset -n gpu-operator)
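
For reference, a minimal set of commands to gather the items above; this assumes the operator was installed into the gpu-operator namespace and that the toolkit daemonset carries a matching app label (adjust to your install):

cat /etc/os-release
cat /etc/containerd/config.toml
kubectl get pods -n gpu-operator
kubectl logs --all-containers -lapp=nvidia-device-plugin-daemonset -n gpu-operator
kubectl logs --all-containers -lapp=nvidia-container-toolkit-daemonset -n gpu-operator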

@captainsk7
Copy link

captainsk7 commented Jan 11, 2023

@shivamerla thanks for the reply.
I have created a multi-node k0s Kubernetes cluster using this blog: https://www.padok.fr/en/blog/k0s-kubernetes-gpu
I'm getting the same error: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured.

1. Are drivers/container-toolkit pre-installed on the host or installed by the GPU operator?

  • on both worker nodes the drivers/container-toolkit are pre-installed.
  • on the controller node they are not installed because it is a non-GPU machine.

2. OS version Ubuntu 20.04.5 LTS

3. Status of all pods under gpu-operator namespace

NAME                                                              READY   STATUS     RESTARTS   AGE
gpu-feature-discovery-jc4wt                                       0/1     Init:0/1   0          18h
gpu-feature-discovery-r27zv                                       0/1     Init:0/1   0          18h
gpu-operator-1673351272-node-feature-discovery-master-65d8hl88v   1/1     Running    0          18h
gpu-operator-1673351272-node-feature-discovery-worker-8j72k       1/1     Running    0          18h
gpu-operator-1673351272-node-feature-discovery-worker-wj5gd       1/1     Running    0          18h
gpu-operator-95b545d6f-r2cnp                                      1/1     Running    0          18h
nvidia-container-toolkit-daemonset-lg79g                          1/1     Running    0          18h
nvidia-container-toolkit-daemonset-q26kq                          1/1     Running    0          18h
nvidia-dcgm-exporter-2vpwj                                        0/1     Init:0/1   0          18h
nvidia-dcgm-exporter-gx6dv                                        0/1     Init:0/1   0          18h
nvidia-device-plugin-daemonset-tbbgb                              0/1     Init:0/1   0          18h
nvidia-device-plugin-daemonset-z29kx                              0/1     Init:0/1   0          18h
nvidia-operator-validator-79s4j                                   0/1     Init:0/4   0          18h
nvidia-operator-validator-thbq2                                   0/1     Init:0/4   0          18h

4. Logs from init-containers

from device-plugin

Error from server (BadRequest): container "toolkit-validation" in pod "nvidia-device-plugin-daemonset-tbbgb" is waiting to start: PodInitializing

from container-toolkit
time="2023-01-10T11:57:43Z" level=info msg="Successfully loaded config"
time="2023-01-10T11:57:43Z" level=info msg="Config version: 2"
time="2023-01-10T11:57:43Z" level=info msg="Updating config"
time="2023-01-10T11:57:43Z" level=info msg="Successfully updated config"
time="2023-01-10T11:57:43Z" level=info msg="Flushing config"
time="2023-01-10T11:57:43Z" level=info msg="Successfully flushed config"
time="2023-01-10T11:57:43Z" level=info msg="Sending SIGHUP signal to containerd"
time="2023-01-10T11:57:43Z" level=info msg="Successfully signaled containerd"
time="2023-01-10T11:57:43Z" level=info msg="Completed 'setup' for containerd"
time="2023-01-10T11:57:43Z" level=info msg="Waiting for signal"
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
time="2023-01-10T11:51:53Z" level=info msg="Successfully loaded config"
time="2023-01-10T11:51:53Z" level=info msg="Config version: 2"
time="2023-01-10T11:51:53Z" level=info msg="Updating config"
time="2023-01-10T11:51:53Z" level=info msg="Successfully updated config"
time="2023-01-10T11:51:53Z" level=info msg="Flushing config"
time="2023-01-10T11:51:53Z" level=info msg="Successfully flushed config"
time="2023-01-10T11:51:53Z" level=info msg="Sending SIGHUP signal to containerd"
time="2023-01-10T11:51:53Z" level=info msg="Successfully signaled containerd"
time="2023-01-10T11:51:53Z" level=info msg="Completed 'setup' for containerd"
time="2023-01-10T11:51:53Z" level=info msg="Waiting for signal"

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1601      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1601      G   /usr/lib/xorg/Xorg                  9MiB |
|    1   N/A  N/A      1736      G   /usr/bin/gnome-shell                8MiB |
+-----------------------------------------------------------------------------+

@shivamerla
Copy link
Contributor

shivamerla commented Jan 12, 2023

@captainsk7 can you get the output of kubectl logs nvidia-operator-validator-79s4j -n gpu-operator -c driver-validation and also double-check that the "nvidia" runtime settings are correct in /etc/containerd/config.toml?

@captainsk7
Copy link

@shivamerla the output of kubectl logs nvidia-operator-validator-79s4j -n gpu-operator -c driver-validation is
Error from server (BadRequest): container "driver-validation" in pod "nvidia-operator-validator-79s4j" is waiting to start: PodInitializing
and the "nvidia" runtime settings are:

version = 2

[plugins]

  [plugins."io.containerd.grpc.v1.cri"]

    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental.options]
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"

@shivamerla
Copy link
Contributor

The config looks good, so containerd might not be picking up this change. Did you confirm that the config file containerd actually uses (/etc/k0s/containerd.toml) is the one that was changed? Please also try restarting the containerd service to confirm.
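
A quick way to check what the running containerd actually picked up; the k0sworker service name and the /run/k0s/containerd.sock socket path assume a default k0s worker install, so adjust them to how containerd is managed on your nodes:

# Render the resolved config from the k0s config file and look for the nvidia runtime
containerd --config /etc/k0s/containerd.toml config dump | grep -A 8 'runtimes.nvidia'

# CRI view of the configured runtimes (on k0s the CRI socket is typically /run/k0s/containerd.sock)
crictl --runtime-endpoint unix:///run/k0s/containerd.sock info | grep -i -A 5 nvidia

# On k0s, containerd is supervised by the worker service, so restart that rather than containerd directly
sudo systemctl restart k0sworker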

@captainsk7
Copy link

captainsk7 commented Jan 13, 2023

@shivamerla yes, the file (/etc/k0s/containerd.toml) has been changed on both worker nodes:

version = 2

[plugins]

  [plugins."io.containerd.grpc.v1.cri"]

    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental.options]
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"

@captainsk7
Copy link

@shivamerla reply please

@shivamerla
Copy link
Contributor

@captainsk7 We have to debug a couple of things.

  1. Whether nvidia-container-runtime is getting invoked at all. This can be debugged by following the steps here.
  2. Whether containerd is picking up the config file changes.
    2.1. Confirm the correct config is in place and restart containerd.
    2.2. Spin up an additional sample pod with runtimeClass set to nvidia to verify that the container can start (a sample pod spec is sketched below).
    2.3. Create a complete containerd config file using "containerd config default > /etc/k0s/containerd.toml" and then install the GPU Operator. This confirms that all required fields are set. SystemdCgroup = true is required for K8s 1.25 and above.

I am not too familiar with k0s and it is not something we test internally, but the above steps will ensure the runtime is set up correctly.
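
For step 2.2, a minimal test pod sketch; the pod name and image tag are only illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-runtime-test   # hypothetical name
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
  - name: cuda
    image: nvidia/cuda:11.8.0-base-ubuntu20.04   # illustrative tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1

If this pod completes and prints the nvidia-smi table, the nvidia runtime is wired up correctly; if it fails with the same "no runtime for "nvidia" is configured" sandbox error, containerd is not using the expected config.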

@Dentrax
Copy link

Dentrax commented Sep 18, 2023

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

This is not an error and is expected until the drivers are loaded and nvidia-container-runtime is set up. With driver containers, it takes about 3-4 minutes to set up/load the drivers and toolkit; after that these errors should go away. The reason these errors appear is that some operands (device-plugin, gfd, dcgm, etc.) run with runtimeClass set to nvidia, and this causes the above error until the drivers/toolkit are ready.

I'd expect the gpu-operator to deploy the nvidia-driver-daemonset first, before everything else; once all of its Pods reach the Running state, it should deploy the other daemonsets. Otherwise, a race condition occurs while the other daemonsets are up and running: failed to get sandbox runtime: no runtime for "nvidia" is configured.

cc @shivamerla

@RameshOswal
Copy link

In our case, the cluster is up and running for a few days and everything with NVIDIA works. But suddenly, after a few days, we hit an issue where the pods don't see nvidia-smi. Only after restarting the nvidia-driver-daemonset does the issue get resolved, and when we restart the nvidia-driver-daemonset it throws the above error on another nvidia- pod.
Did anyone face the same issue?
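
For reference, restarting that daemonset is usually done with something like the following (assuming the operator namespace is gpu-operator):

kubectl rollout restart daemonset/nvidia-driver-daemonset -n gpu-operator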

@riteshsonawane1372
Copy link

Getting the same issue on a kubeadm setup. Need help.

@e3oroush
Copy link

e3oroush commented Jul 2, 2024

I faced similar problems, but using microk8s on a single DGX machine with an H100. The problem for me was that the default operator deployment didn't support the H100 with the latest NVIDIA driver. I got the hint from this GitHub issue, and that fix worked for me.
