UnexpectedAdmissionError with Nvidia GPU Operator #38

Closed
malcolmlewis opened this issue May 15, 2024 · 2 comments

Comments

@malcolmlewis

Hi
Whilst I'm deploying via the open-webui helm chart, the issue I'm having is that on a reboot of the bare-metal node the gpu-operator namespace is not running in time for the ollama deployment, so the pod bails, waits, and then once the gpu-operator is up a new pod is created. This has led to me losing downloaded models; it seems to be a bit hit and miss...

I'm not sure about the Mount warning either?

Is it possible to add a delay/readiness check when using a GPU, so the pod doesn't attempt to start until the gpu-operator is available?

Operating System: openSUSE Leap 15.5
Kubernetes version: v1.28.9+k3s1
Nvidia GPU: Tesla P4

Side note: The values.yaml file has false for the pvc, however the default value in the table shows true...

kubectl describe pod open-webui-ollama-758499c94d-lj6p9 -n open-webui 

Name:                open-webui-ollama-758499c94d-lj6p9
Namespace:           open-webui
Priority:            0
Runtime Class Name:  nvidia
Service Account:     open-webui-ollama
Node:                oscar/
Start Time:          Wed, 15 May 2024 07:35:22 -0500
Labels:              app.kubernetes.io/component=ollama
                     app.kubernetes.io/instance=open-webui
                     app.kubernetes.io/managed-by=Helm
                     app.kubernetes.io/version=0.1.34
                     helm.sh/chart=ollama-0.25.0
                     pod-template-hash=758499c94d
Annotations:         <none>
Status:              Failed
Reason:              UnexpectedAdmissionError
Message:             Pod was rejected: Allocate failed due to no healthy devices present; cannot allocate unhealthy devices nvidia.com/gpu, which is unexpected
IP:                  
IPs:                 <none>
Controlled By:       ReplicaSet/open-webui-ollama-758499c94d
Containers:
  ollama:
    Image:      ollama/ollama:0.1.34
    Port:       11434/TCP
    Host Port:  0/TCP
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Liveness:          http-get http://:http/ delay=60s timeout=5s period=10s #success=1 #failure=20
    Readiness:         http-get http://:http/ delay=30s timeout=3s period=5s #success=1 #failure=20
    Environment:
      PATH:                        /usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
      LD_LIBRARY_PATH:             /usr/local/nvidia/lib:/usr/local/nvidia/lib64
      NVIDIA_DRIVER_CAPABILITIES:  compute,utility
      NVIDIA_VISIBLE_DEVICES:      all
    Mounts:
      /root/.ollama from ollama-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2wckv (ro)
Volumes:
  ollama-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  open-webui-ollama
    ReadOnly:   false
  kube-api-access-2wckv:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason                    Age                From               Message
  ----     ------                    ----               ----               -------
  Normal   Scheduled                 37m                default-scheduler  Successfully assigned open-webui/open-webui-ollama-758499c94d-lj6p9 to oscar
  Normal   Pulling                   37m                kubelet            Pulling image "ollama/ollama:0.1.34"
  Normal   Pulled                    37m                kubelet            Successfully pulled image "ollama/ollama:0.1.34" in 16.519s (16.519s including waiting)
  Normal   Created                   37m                kubelet            Created container ollama
  Normal   Started                   37m                kubelet            Started container ollama
  Warning  UnexpectedAdmissionError  18m                kubelet            Allocate failed due to no healthy devices present; cannot allocate unhealthy devices nvidia.com/gpu, which is unexpected
  Warning  FailedMount               18m (x2 over 18m)  kubelet            MountVolume.SetUp failed for volume "kube-api-access-2wckv" : object "open-webui"/"kube-root-ca.crt" not registered
  Warning  UnexpectedAdmissionError  17m                kubelet            Allocate failed due to no healthy devices present; cannot allocate unhealthy devices nvidia.com/gpu, which is unexpected
  Warning  FailedMount               17m (x5 over 17m)  kubelet            MountVolume.SetUp failed for volume "kube-api-access-2wckv" : object "open-webui"/"kube-root-ca.crt" not registered
kubectl get cm -A | grep kube-root-ca.crt

default           kube-root-ca.crt                                         1      48m
kube-node-lease   kube-root-ca.crt                                         1      48m
kube-public       kube-root-ca.crt                                         1      48m
kube-system       kube-root-ca.crt                                         1      48m
metallb-system    kube-root-ca.crt                                         1      46m
gpu-operator      kube-root-ca.crt                                         1      44m
open-webui        kube-root-ca.crt                                         1      42m
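(For reference, two quick checks after a reboot that show whether the device plugin has re-registered the GPU with the kubelet; oscar is the node name from the output above.)

kubectl get pods -n gpu-operator
kubectl describe node oscar | grep nvidia.com/gpu
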
@jdetroyes
Contributor

Hello @malcolmlewis

We can add an "initContainers" entry, so that everyone can define specific requirements before starting the ollama container.

This initContainers example may help you: NVIDIA/gpu-operator#615 (comment)


I will update the values table for the pvc, thank you.
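
For reference, a minimal sketch of that approach, assuming the chart's volumes / volumeMounts / initContainers values (used later in this thread) and that the GPU Operator writes its validation markers under /run/nvidia/validations on the host:

volumes:
  - name: nvidia-validations
    hostPath:
      path: /run/nvidia/validations

initContainers:
  - name: wait-for-nvidia
    image: busybox:stable
    command: ['sh', '-c']
    args:
      - until [ -f /run/nvidia/validations/plugin-ready ]; do echo waiting for the nvidia container stack; sleep 5; done
    volumeMounts:
      - name: nvidia-validations
        mountPath: /run/nvidia/validations
        readOnly: true
        mountPropagation: HostToContainer

(As the next comment notes, the timing of when that directory appears on the host can still be an issue.)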

@malcolmlewis
Author

Hi
Thanks for adding the initContainers support; it does make it possible to set a delay and avoid the error.

So, monitoring the /run/nvidia/validations directory, the last file to be created is plugin-ready; however, the directory itself is not created early enough for the initContainer, which crashed...

On a warm reboot, it would work about 90% of the time, on a cold boot around 50% of the time.

My workaround was to:

  1. Create a script that waits for the plugin-ready file (the systemd unit in step 2 runs it as /usr/bin/gpu-check):
#!/usr/bin/bash

## Simple bash script to monitor the nvidia toolkit startup.
## When the GPU Operator writes /run/nvidia/validations/plugin-ready,
## drop a "ready" flag that the pod's initContainer can wait on.

# Start from a clean state: ensure the flag directory exists and remove any stale flag.
mkdir -p /tmp/nvidia-gpu
rm -f /tmp/nvidia-gpu/ready

# Wait for the GPU Operator to report the device plugin as ready.
until [ -f /run/nvidia/validations/plugin-ready ]; do
   echo "Waiting for nvidia container stack to be setup"
   sleep 5
done

# Signal readiness to the initContainer.
touch /tmp/nvidia-gpu/ready
echo "Nvidia stack setup complete"
  2. Create a systemd service to run the script at boot (install/enable commands are shown after this list):
# /etc/systemd/system/nvidia-gpu-check.service
#
[Unit]
Description=Systemd service for k3s gpu stack

[Service]
Type=oneshot
ExecStart=/usr/bin/gpu-check

[Install]
WantedBy=multi-user.target
  3. Add the following to values.yaml:
# -- Additional volumes on the output Deployment definition.
volumes:
  - name: nvidia-validation
    hostPath:
      path: /tmp/nvidia-gpu

# -- Additional volumeMounts on the output Deployment definition.
volumeMounts:
  - mountPath: /tmp/nvidia-gpu
    name: nvidia-validation
    readOnly: true
    mountPropagation: HostToContainer

# -- Init containers to add to the pod
initContainers:
  - name: toolkit-validation
    image: busybox:stable
    volumeMounts:
      - mountPath: /tmp/nvidia-gpu
        name: nvidia-validation
        readOnly: true
        mountPropagation: HostToContainer
    command: ['sh', '-c']
    args:
      - until [ -f /tmp/nvidia-gpu/ready ]; do echo Waiting for nvidia container stack to be setup; sleep 5; done
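
For completeness, a sketch of how the script and unit above might be installed and enabled on the node (assuming the script is saved locally as gpu-check; the target path matches the ExecStart= line in the unit):

sudo install -m 0755 gpu-check /usr/bin/gpu-check
sudo systemctl daemon-reload
sudo systemctl enable --now nvidia-gpu-check.service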

I might have to go upstream and see if the /run/nvidia/validations/ directory could somehow be created earlier in the process, as I feel this would negate the need for the script/service?

Anyway, I thought I'd post what I did in case it's of use to others, and I can close this issue.
