Based on documentation found at: https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html
Adding env variables to the DCGM Exporter to use a ConfigMap for custom collectors results in the following error on the DCGM Pod:
"Could not retrieve ConfigMap 'gpu-operator:metrics-config': configmaps "metrics-config" is forbidden: User "system:serviceaccount:gpu-operator:nvidia-dcgm-exporter" cannot get resource "configmaps" in API group "" in the namespace "gpu-operator""
Manually adding a new Role with access to the ConfigMap and binding it to the ServiceAccount resolves the permission error.
However, the ConfigMap created by following the older documentation is then rejected:
level=fatal msg="Malformed configmap contents. No metrics found"
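For reference, the manual RBAC workaround described above can be sketched roughly as follows. This is an illustration rather than the exact manifests used: the Role and RoleBinding names (`dcgm-exporter-metrics-config`) are hypothetical, while the namespace, ServiceAccount, and ConfigMap names are taken from the error message. A separate Role is used because edits to the operator-managed role are reverted.

```yaml
# Hypothetical Role granting read access to the single ConfigMap named in the error.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dcgm-exporter-metrics-config   # hypothetical name
  namespace: gpu-operator
rules:
  - apiGroups: [""]                    # core API group, as in the error message
    resources: ["configmaps"]
    resourceNames: ["metrics-config"]
    verbs: ["get"]
---
# Binds the Role above to the exporter's ServiceAccount.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dcgm-exporter-metrics-config   # hypothetical name
  namespace: gpu-operator
subjects:
  - kind: ServiceAccount
    name: nvidia-dcgm-exporter
    namespace: gpu-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: dcgm-exporter-metrics-config
```

Restricting the rule with `resourceNames` keeps the grant limited to the one ConfigMap the exporter needs.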
1. Quick Debug Information
2. Issue or feature description
Side note: I am not sure what the ConfigMap should look like, since the documentation does not describe its format.
I used the older documentation to create the ConfigMap, assuming that it still works:
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/22.9.2/installation.html#gpu-telemetry
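As a point of comparison, here is a guess at a ConfigMap shape the exporter might accept. It assumes the exporter reads the collector list from a data key named `metrics`, which the "No metrics found" message hints at but the documentation linked here does not confirm. The two collectors shown are only examples. Note that a ConfigMap created with `kubectl create configmap --from-file=<file>` stores the content under a key equal to the file name, which would not match under this assumption.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: metrics-config
  namespace: gpu-operator
data:
  # Assumption: the exporter looks up the key "metrics" (not a file name such as a .csv key).
  # Each line: DCGM field, Prometheus metric type, help text.
  metrics: |
    DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
    DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
```

If the existing ConfigMap was built from a file, `kubectl get configmap metrics-config -n gpu-operator -o yaml` shows which data key it actually uses.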
3. Steps to reproduce the issue
Create a ConfigMap with the DCGM collectors.
Deploy the GPU Operator with the following addition to values.yaml:
dcgmExporter:
  env:
    - name: DCGM_EXPORTER_CONFIGMAP_DATA
      value: gpu-operator:metrics-config
4.
Modifying the nvidia-dcgm-exporter RBAC Role to give it access to the ConfigMap does not persist: the change is overwritten by the gpu-operator after a couple of seconds.