UnexpectedAdmissionError with Nvidia GPU Operator #38
Hello @malcolmlewis. We can add an entry for "initContainers", so that everyone can define specific requirements before starting the ollama container. This initContainers example may help you: NVIDIA/gpu-operator#615 (comment). I will update the value table for pvc, thank you.
Hi. So, on monitoring the nvidia container stack startup: on a warm reboot it would work about 90% of the time, on a cold boot around 50% of the time. My workaround was to:
#!/usr/bin/bash
## Simple bash script to monitor the nvidia toolkit startup.
## Installed as /usr/bin/gpu-check (see the systemd unit below).
mkdir -p /tmp/nvidia-gpu
rm -f /tmp/nvidia-gpu/ready
until [ -f /run/nvidia/validations/plugin-ready ]; do
    echo "Waiting for nvidia container stack to be setup"
    sleep 5
done
touch /tmp/nvidia-gpu/ready
echo "Nvidia stack setup complete"
# /etc/systemd/system/nvidia-gpu-check.service
[Unit]
Description=Systemd service for k3s gpu stack
[Service]
Type=oneshot
ExecStart=/usr/bin/gpu-check
[Install]
WantedBy=multi-user.target
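For completeness, a sketch of wiring the two files together, assuming the script above is saved locally as gpu-check (the unit's ExecStart expects it at /usr/bin/gpu-check):

# Sketch only: copy the script to where ExecStart points, then enable
# the oneshot service so it runs on every boot.
install -m 0755 gpu-check /usr/bin/gpu-check
cp nvidia-gpu-check.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable nvidia-gpu-check.service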
# -- Additional volumes on the output Deployment definition.
volumes:
  - name: nvidia-validation
    hostPath:
      path: /tmp/nvidia-gpu

# -- Additional volumeMounts on the output Deployment definition.
volumeMounts:
  - mountPath: /tmp/nvidia-gpu
    name: nvidia-validation
    readOnly: true
    mountPropagation: HostToContainer

# -- Init containers to add to the pod
initContainers:
  - name: toolkit-validation
    image: busybox:stable
    volumeMounts:
      - mountPath: /tmp/nvidia-gpu
        name: nvidia-validation
        readOnly: true
        mountPropagation: HostToContainer
    command: ['sh', '-c']
    args:
      - until [ -f /tmp/nvidia-gpu/ready ]; do echo Waiting for nvidia container stack to be setup; sleep 5; done

I might have to go upstream and see if the gpu-operator project can address this. Anyway, I thought I'd post what I did in case it's of use to others, and I can close this issue.
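A rough way to confirm the gate works after a reboot (pod name is a placeholder; namespace taken from the output further down): the ollama pod should sit in Init:0/1 until the script touches /tmp/nvidia-gpu/ready, then start normally.

# Watch the pod come up; it should wait in Init:0/1 on the
# toolkit-validation init container until the ready flag appears.
kubectl get pods -n open-webui -w
# Inspect the init container's "Waiting for nvidia container stack" output.
kubectl logs -n open-webui <ollama-pod> -c toolkit-validation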
Hi
Whilst I'm deploying via the open-webui helm chart, the issue I'm having is that on a reboot of the bare-metal node the gpu-operator namespace is not up in time for the ollama deployment, so it bails, waits, and then once the gpu-operator is up a new pod is created. This has led to me losing downloaded models; it seems to be a bit hit and miss...
I'm not sure about the mount warning either.
Is it possible to add a delay/readiness check when using a gpu, so that ollama doesn't attempt to start until the gpu-operator is available?
Operating System: openSUSE Leap 15.5
Kubernetes version: v1.28.9+k3s1
Nvidia GPU: Tesla P4
Side note: the values.yaml file has false for the pvc, but the default value in the table shows true...
kubectl get cm -A | grep kube-root-ca.crt
default           kube-root-ca.crt   1   48m
kube-node-lease   kube-root-ca.crt   1   48m
kube-public       kube-root-ca.crt   1   48m
kube-system       kube-root-ca.crt   1   48m
metallb-system    kube-root-ca.crt   1   46m
gpu-operator      kube-root-ca.crt   1   44m
open-webui        kube-root-ca.crt   1   42m
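For anyone debugging the same race, a rough sketch of checks, assuming the gpu-operator namespace and the validation path used by the workaround above:

# If plugin-ready is missing, the device plugin hasn't finished validating
# and GPU pods admitted at that point can fail with UnexpectedAdmissionError.
kubectl get pods -n gpu-operator
ls /run/nvidia/validations/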