DCGM-Exporter cannot access configmap, access denied #706

Closed
Bromhir84 opened this issue Apr 26, 2024 · 3 comments

Comments

Bromhir84 commented Apr 26, 2024

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): RHEL 8.9
  • Kernel Version: kernel-version.full=4.18.0-513.24.1.el8_9.x86_64
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): RKE2
  • GPU Operator Version: 23.9.2

2. Issue or feature description

Based on documentation found at: https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html
Adding env variables to the DCGM Exporter to use a ConfigMap for custom collectors results in the following error on the DCGM Pod:
"Could not retrieve ConfigMap 'gpu-operator:metrics-config': configmaps "metrics-config" is forbidden: User "system:serviceaccount:gpu-operator:nvidia-dcgm-exporter" cannot get resource "configmaps" in API group "" in the namespace "gpu-operator""

Side note: I am not sure what the ConfigMap should look like, since the documentation does not describe its format.
I used the older documentation to create the ConfigMap, assuming that still works:
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/22.9.2/installation.html#gpu-telemetry

3. Steps to reproduce the issue

Create ConfigMap with dcgm collectors.

Deploy GPU Operator with the following addition to values.yaml:
dcgmExporter:
  env:
    - name: DCGM_EXPORTER_CONFIGMAP_DATA
      value: gpu-operator:metrics-config

4.

Modifying the nvidia-dcgm-exporter RBAC role to give it access to the ConfigMap does not stick: the gpu-operator overwrites the change after a couple of seconds.
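
Since edits to the operator-managed role get reverted, one possible workaround is to grant the permission through a separate Role and RoleBinding that the operator does not own. The following is a minimal sketch, not a confirmed fix: the name `dcgm-exporter-configmap-reader` is hypothetical, while the namespace, service account, and ConfigMap name are taken from the error message above.

```yaml
# Hypothetical Role: read-only access to the one ConfigMap named in the error.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dcgm-exporter-configmap-reader   # assumed name, pick your own
  namespace: gpu-operator
rules:
  - apiGroups: [""]                      # core API group, as in the error message
    resources: ["configmaps"]
    resourceNames: ["metrics-config"]
    verbs: ["get"]
---
# Bind the Role to the service account the exporter pod runs as.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dcgm-exporter-configmap-reader   # assumed name
  namespace: gpu-operator
subjects:
  - kind: ServiceAccount
    name: nvidia-dcgm-exporter
    namespace: gpu-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: dcgm-exporter-configmap-reader
```

Because these objects are created separately from the operator's own manifests, the operator should have no reason to reconcile them away, though that behavior is an assumption here.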

@Bromhir84
Author

Manually adding a new role with access to the ConfigMap and binding it to the service account seems to work.
However, the ConfigMap created by following the older documentation does not seem to be accepted:
level=fatal msg="Malformed configmap contents. No metrics found"
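
The "No metrics found" message suggests the exporter did not find the data key it looks for. The following is a sketch of a ConfigMap that may be accepted, assuming the exporter reads its CSV collector list from a data key named `metrics` when `DCGM_EXPORTER_CONFIGMAP_DATA` is set. That key name is an assumption and is not confirmed anywhere in this thread. A ConfigMap created from a file, as in the older documentation, would instead use the file name as its key. The three collectors listed are only illustrative.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: metrics-config
  namespace: gpu-operator
data:
  # Assumed key name; CSV rows are: DCGM field, Prometheus metric type, help text.
  metrics: |
    DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
    DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
```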

@tariq1890
Contributor

Hi @Bromhir84, we'll be fixing this in the next release of gpu-operator (24.3.0).

@cdesiniotis
Contributor

Hi @Bromhir84 -- GPU Operator 24.3.0 has been released and contains a fix for this issue.
https://github.com/NVIDIA/gpu-operator/releases/tag/v24.3.0

I am closing this issue, but please re-open it if you are still encountering this with 24.3.0.
