Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured #413
Comments
@searce-aditya I see from the list of pods that you are deploying the Could you add |
Hello @elezar, we tried adding --set toolkit.enabled=false while deploying the operator, but we are still facing the same issue. |
Is the driver ready on the node? Can you share the output of |
I have a different setup, but I'm facing exactly the same error. I tried using charts 1.11.1 and 22.9.0 with the same results:
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --set version=22.9.0 --set driver.enabled=false --set operator.defaultRuntime=containerd
|
I'm running into the same problem on an NVIDIA DGX Station A100 running microk8s 1.25.2 and following the process outlined in the GPU-operator manual for DGX systems. The container environment works as expected in Docker. |
I'm experiencing this same issue running K8s v1.21.14 and containerd. I have tried all of the suggestions in this issue, and it has not been resolved. |
Please ignore these errors and let us know the state of all pods running with the GPU operator. If any pod is still in |
It's definitely not a timing issue; these pods have been in this state for quite a while.
Containerd and docker configs have been set up as instructed in the documentation. Initially, I tried to install the operator via the microk8s GPU add-on but ran into issues. Following the recommended Helm3 installation, all expected pods were created, but some of them got stuck in the init phase. Docker config:
Containerd:
More details on the feature-discovery pod:
Logs of the feature-discovery container are unavailable due to the init status. |
@LukasIAO Can you add the following
|
Hi @shivamerla, Current config:
Updated config:
I've restarted the daemonset pod, which is still running into the same issue.
Restarting the pod, unfortunately, did not create any nvidia-container-runtime or cli logs. However, here are the logs for the gpu-operator as well as the gpu-operator namespace; perhaps they can be of some help. /var/log/gpu-operator.log:
/var/log/containers/gpu-operator*.logs:
|
@shivamerla It turns out an unrelated system reboot did lead to the creation of the
The |
@LukasIAO the other log file of interest may be |
Hi @elezar, thank you for taking the time. Here is the
|
You need to define a RuntimeClass named nvidia with the handler nvidia.
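For reference, a minimal RuntimeClass manifest along these lines might look like the following (a sketch; it assumes the containerd configuration already registers a runtime handler named nvidia):
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
Applying it with kubectl apply -f runtimeclass.yaml (the file name is arbitrary) makes runtimeClassName: nvidia resolvable; the handler value must match the runtime name configured in containerd.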
|
@hholst80 in which file should I define a RuntimeClass? |
@captainsk7
|
@shivamerla thanks for the reply. 1. Are drivers/container-toolkit pre-installed on the host or installed by the GPU operator?
2. OS version
|
@captainsk7 can you get output of |
@shivamerla output of
|
The config looks good; somehow containerd might not be picking up this change. Did you confirm that the config file used by containerd (/etc/k0s/containerd.toml) is the one that was changed? Please also try restarting the containerd service to confirm. |
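One way to confirm what the running containerd has actually loaded is to query it over CRI, for example (a sketch; the k0s containerd socket path used here is an assumption and may differ on your nodes):
sudo crictl --runtime-endpoint unix:///run/k0s/containerd.sock info | grep -i nvidia
If nothing is printed, the nvidia runtime entry has not been picked up and containerd still needs to be restarted with the updated config.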
@shivamerla yes, the file (/etc/k0s/containerd.toml) has been changed on both worker nodes
|
@shivamerla reply please |
@captainsk7 We have to debug a couple of things.
I am not too familiar with k0s, and it is not something we test internally, but the above steps will ensure the runtime is set up correctly. |
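Two quick host-level checks in this direction (a sketch; the toolkit install path /usr/local/nvidia/toolkit is the GPU operator default and is an assumption here):
ls /usr/local/nvidia/toolkit/nvidia-container-runtime
grep -A3 'runtimes.nvidia' /etc/k0s/containerd.toml
The first confirms the runtime binary installed by the toolkit container is present on the node; the second confirms the nvidia runtime entry actually landed in the containerd config that k0s uses.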
I'd expect the gpu-operator should deploy cc @shivamerla |
In our case, the cluster is up and running for a few days with everything nvidia-related working. But suddenly, after a few days, we get an issue where the pods don't see |
Getting the same issue on a kubeadm setup. Need help.
I faced similar problems but using |
We are using a GKE cluster (v1.23.8-gke.1900) with NVIDIA multi-instance A100 GPU nodes. We want to install the NVIDIA GPU operator on this cluster.
The default container runtime in our case is containerd, so we changed the container runtime to nvidia by adding the config below to /etc/containerd/config.toml and restarting the containerd service:
Then restarting the containerd daemon:
systemctl restart containerd
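For reference, a typical nvidia runtime entry in /etc/containerd/config.toml looks roughly like the following (a sketch based on the NVIDIA container toolkit defaults; the BinaryName path is an assumption and depends on where the runtime is installed):
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"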
We followed these steps to deploy the GPU operator:
Note: The nvidia drivers and device-plugins are already present by default in the kube-system namespace.
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
&& chmod 700 get_helm.sh
&& ./get_helm.sh
helm repo add nvidia https://nvidia.github.io/gpu-operator
&& helm repo update
helm install gpu-operator nvidia/gpu-operator --set operator.defaultRuntime=containerd --namespace kube-system --create-namespace --set driver.enabled=false --set mig.strategy=single
Now, when I run the kubectl get all -n kube-system command, the pods listed below are not coming up:
When I try to describe one of the dcgm pods by running kubectl describe pod nvidia-dcgm-exporter-49t58 -n kube-system, it shows the following error:
Please suggest whether we are missing something and how we can resolve this issue quickly.
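As a general check for the error in the issue title (a sketch, assuming kubectl access to the cluster), confirm that a RuntimeClass named nvidia exists and note its handler:
kubectl get runtimeclass nvidia -o yaml
The sandbox error in the title comes from containerd reporting that the handler named in this RuntimeClass has no matching runtime entry in its configuration, so the handler value and the containerd config must agree on every GPU node.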