
Unable to install dcgm, dcgm-exporter in kubevirt vm-passthrough mode #711

Open
rokkiter opened this issue Apr 30, 2024 · 2 comments
Labels: feature (issue/PR that proposes a new feature or functionality)

Comments

@rokkiter

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

1. Quick Debug Information

  • OS/Version (e.g. RHEL 8.6, Ubuntu 22.04): Ubuntu 22.04
  • Kernel Version:
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): v1.25.6
  • GPU Operator Version: latest

2. Issue or feature description

I want to monitor GPUs in KubeVirt passthrough mode, but DCGM and dcgm-exporter are not deployed on nodes configured for vm-passthrough. Is there any way to monitor GPUs in KubeVirt passthrough mode?
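For reference, a quick way to see what the operator actually scheduled is to compare the GPU Operator daemonsets against each node's workload label. A minimal check, assuming the operator is installed in the gpu-operator namespace (namespace name is illustrative):

# List GPU Operator daemonsets; the DCGM and dcgm-exporter daemonsets would appear here if deployed
kubectl get ds -n gpu-operator

# Show each node's workload configuration label (container, vm-passthrough, or vm-vgpu)
kubectl get nodes -L nvidia.com/gpu.workload.config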

3. Steps to reproduce the issue

Refer to this document to build the kubevirt vm-passthrough environment.
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-kubevirt.html
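In short, the setup from that guide is roughly the following (a sketch only; the release name, namespace, and node name are illustrative):

# Install the GPU Operator with sandbox (VM) workloads enabled
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set sandboxWorkloads.enabled=true

# Label the GPU node so its GPUs are handled as VM passthrough devices
kubectl label node <node-name> nvidia.com/gpu.workload.config=vm-passthrough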

4. Information to attach (optional if deemed irrelevant)

  • Kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • Kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs: journalctl -u containerd > containerd.log (see the filled-in examples below)
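Filled in for this environment (assuming the default gpu-operator namespace; pod names are placeholders), the commands above look like:

kubectl get pods -n gpu-operator
kubectl get ds -n gpu-operator
kubectl describe pod -n gpu-operator <pod-name>
kubectl logs -n gpu-operator <pod-name> --all-containers
kubectl exec <driver-pod-name> -n gpu-operator -c nvidia-driver-ctr -- nvidia-smi
journalctl -u containerd > containerd.log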

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script to see which debug data it collects.

This bundle can be submitted to us via email: [email protected]

@cdesiniotis
Contributor

@rokkiter we do not have a solution available today for gathering GPU metrics on nodes configured for GPU passthrough.

cdesiniotis added the feature label on May 2, 2024
@raghavendra-rafay

@cdesiniotis, Is there a plan to add this support anytime soon?
