The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
1. Quick Debug Information
OS: Ubuntu 22.04
Kernel version: 5.15.0-1060-azure
Container runtime: containerd (github.com/containerd/containerd) 1.7.15-1
Kubernetes (AKS) version: 1.27.9 / 1.28.5
GPU Operator version: v23.9.2
2. Issue or feature description
Deleting the container toolkit pod directly on a GPU node causes the pod to get stuck in Terminating and the GPU node to go into NotReady status.
3. Steps to reproduce the issue
helm install --wait gpu-operator -n nvidia-gpu-operator nvidia/gpu-operator --set operator.runtimeClass=nvidia-container-runtime --set driver.enabled=false --set devicePlugin.enabled=true --set migManager.enabled=false --set toolkit.enabled=true
kubectl -n nvidia-gpu-operator delete pod nvidia-container-toolkit-daemonset-sgrz4
pod "nvidia-container-toolkit-daemonset-sgrz4" deleted
kubectl get nodes
NAME STATUS ROLES AGE VERSION
aks-gpunodepool-64725455-vmss000000 NotReady agent 5h v1.28.5
aks-nodepool1-22111699-vmss000000 Ready agent 46h v1.28.5
aks-nodepool1-22111699-vmss000001 Ready agent 46h v1.28.5
aks-ondemandpool-14960313-vmss000000 Ready agent 46h v1.28.5
aks-spotazlinux1-26861551-vmss000000 Ready agent 46h v1.28.5
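To confirm what the kubelet reports for the NotReady node, the following generic kubectl checks can be run from outside the node (node name taken from the listing above):
kubectl describe node aks-gpunodepool-64725455-vmss000000
# or print just the Ready condition reason/message:
kubectl get node aks-gpunodepool-64725455-vmss000000 -o jsonpath='{.status.conditions[?(@.type=="Ready")].reason}{" "}{.status.conditions[?(@.type=="Ready")].message}{"\n"}'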
4. Information to attach (optional if deemed irrelevant)
Kubernetes pods status:
kubectl get pods -n OPERATOR_NAMESPACE
kubectl -n nvidia-gpu-operator get pods
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-rcdlm 1/1 Running 0 5h13m
gpu-operator-574c687b59-t7xlz 1/1 Running 0 4h57m
gpu-operator-node-feature-discovery-gc-7cc7ccfff8-8k2df 1/1 Running 0 4h57m
gpu-operator-node-feature-discovery-master-d8597d549-vp28d 1/1 Running 0 4h57m
gpu-operator-node-feature-discovery-worker-4wxvl 1/1 Running 0 5h14m
gpu-operator-node-feature-discovery-worker-dk7k2 1/1 Running 0 5h14m
gpu-operator-node-feature-discovery-worker-lhqmx 1/1 Running 0 5h14m
gpu-operator-node-feature-discovery-worker-qvx82 1/1 Running 0 5h14m
nvidia-container-toolkit-daemonset-sgrz4 1/1 Terminating 0 2m37s
nvidia-cuda-validator-dgtjd 0/1 Completed 0 2m13s
nvidia-dcgm-exporter-dq89t 1/1 Running 0 5h13m
nvidia-device-plugin-daemonset-cx8nk 1/1 Running 0 5h13m
nvidia-operator-validator-v6dcr 1/1 Running 0 5h13m
Kubernetes daemonset status:
kubectl get ds -n OPERATOR_NAMESPACE
kubectl -n nvidia-gpu-operator get ds
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-feature-discovery 0 0 0 0 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 5h14m
gpu-operator-node-feature-discovery-worker 3 3 3 3 3 5h14m
nvidia-container-toolkit-daemonset 0 0 0 0 0 nvidia.com/gpu.deploy.container-toolkit=true 5h14m
nvidia-dcgm-exporter 0 0 0 0 0 nvidia.com/gpu.deploy.dcgm-exporter=true 5h14m
nvidia-device-plugin-daemonset 0 0 0 0 0 nvidia.com/gpu.deploy.device-plugin=true 5h14m
nvidia-operator-validator 0 0 0 0 0 nvidia.com/gpu.deploy.operator-validator=true 5h14m
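The DESIRED count of 0 for the GPU daemonsets above follows from their node selectors: as far as I understand, the operator manages the nvidia.com/gpu.deploy.* labels on GPU nodes, so a quick way to see whether those labels changed on the affected node (label names taken from the NODE SELECTOR column above):
kubectl get node aks-gpunodepool-64725455-vmss000000 --show-labels | tr ',' '\n' | grep 'nvidia.com/gpu.deploy'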
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
Events:
Type Reason Age From Message
Normal Scheduled 4m7s default-scheduler Successfully assigned nvidia-gpu-operator/nvidia-container-toolkit-daemonset-sgrz4 to aks-gpunodepool-64725455-vmss000000
Warning FailedCreatePodSandBox 4m6s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a82b1c9978772313ebcaf17c11cd585c534989ae82ca2e8a53a4284feca8f8c4": plugin type="azure-vnet" failed (add): IPAM Invoker Add failed with error: Failed to get IP address from CNS: http request failed: Post "http://localhost:10090/network/requestipconfigs": dial tcp 127.0.0.1:10090: connect: connection refused
Normal Pulling 3m51s kubelet Pulling image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2"
Normal Pulled 3m46s kubelet Successfully pulled image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2" in 358ms (4.617s including waiting)
Normal Created 3m46s kubelet Created container driver-validation
Normal Started 3m46s kubelet Started container driver-validation
Normal Pulling 3m44s kubelet Pulling image "nvcr.io/nvidia/k8s/container-toolkit:v1.14.6-ubuntu20.04"
Normal Pulled 3m37s kubelet Successfully pulled image "nvcr.io/nvidia/k8s/container-toolkit:v1.14.6-ubuntu20.04" in 3.088s (7.127s including waiting)
Normal Created 3m37s kubelet Created container nvidia-container-toolkit-ctr
Normal Started 3m37s kubelet Started container nvidia-container-toolkit-ctr
Normal Killing 2m26s kubelet Stopping container nvidia-container-toolkit-ctr
Warning FailedKillPod 2m26s kubelet error killing pod: [failed to "KillContainer" for "nvidia-container-toolkit-ctr" with KillContainerError: "rpc error: code = Unavailable desc = error reading from server: EOF", failed to "KillPodSandbox" for "8ac03393-3529-4611-b82b-3440b0631a87" with KillPodSandboxError: "rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: con
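The FailedKillPod warning above points at the containerd socket (/run/containerd/containerd.sock) being unreachable, i.e. containerd itself appears to go down when the toolkit container is stopped. A rough node-level check sketch, assuming node shell access via kubectl debug (AKS nodes are not reachable over SSH by default; the image is just an example, and whether systemctl/journalctl work from the chroot depends on the node image):
kubectl debug node/aks-gpunodepool-64725455-vmss000000 -it --image=ubuntu
# inside the debug pod the host filesystem is mounted at /host
chroot /host
systemctl status containerd
journalctl -u containerd --since "30 min ago"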
kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
{"level":"info","ts":1715327639.301056,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-sandbox-device-plugin","status":"disabled"}
{"level":"info","ts":1715327639.3285427,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-kata-manager","status":"disabled"}
{"level":"info","ts":1715327639.3493087,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-cc-manager","status":"disabled"}
{"level":"info","ts":1715327694.9110024,"logger":"controllers.Upgrade","msg":"Building state"}
{"level":"info","ts":1715327694.9249668,"logger":"controllers.Upgrade","msg":"Propagate state to state manager","upgrade":{"name":"cluster-policy"}}
{"level":"info","ts":1715327694.9249868,"logger":"controllers.Upgrade","msg":"State Manager, got state update"}
{"level":"info","ts":1715327694.9249895,"logger":"controllers.Upgrade","msg":"Node states:","Unknown":0,"upgrade-done":0,"upgrade-required":0,"cordon-required":0,"wait-for-jobs-required":0,"pod-deletion-required":0,"upgrade-failed":0,"drain-required":0,"pod-restart-required":0,"validation-required":0,"uncordon-required":0}
Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
N/A (driver.enabled=false, so no driver container is deployed)
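Since the driver here comes from the node itself rather than from a driver pod, a hedged alternative is to run nvidia-smi from any GPU pod on that node (while the node is still Ready), as the NVIDIA container runtime normally mounts the driver utilities into GPU containers; pod name taken from the listing above:
kubectl exec -n nvidia-gpu-operator nvidia-dcgm-exporter-dq89t -- nvidia-smi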
journalctl -u containerd > containerd.log
containerd.log
Collecting full debug bundle (optional):
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: [email protected]