
Prometheus service discovery can't find DCGM exporter #363

Closed
TarekMSayed opened this issue Jun 28, 2022 · 11 comments

@TarekMSayed

Prometheus service discovery can't find the DCGM exporter. Usually this requires adding extra labels/annotations to the pods (e.g. release: "prometheus-operator") or creating a ServiceMonitor carrying those labels/annotations.
I can't find a way to add either through the Helm chart.
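For context, kube-prometheus-stack by default configures its Prometheus to scrape only ServiceMonitors that carry the release label of the Prometheus Helm release, roughly as in the sketch below (the object name and the value "prometheus-operator" are assumptions matching the release name used in this issue):

# Sketch of the default selector kube-prometheus-stack puts on its Prometheus object;
# ServiceMonitors without this label are ignored by service discovery.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-operator-prometheus   # hypothetical name
spec:
  serviceMonitorSelector:
    matchLabels:
      release: prometheus-operator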

@shivamerla
Contributor

@TarekMSayed This has been missing from the operator for a while now; we will consider adding it (setting pod labels/annotations for dcgm-exporter) in the next release.

@shivamerla shivamerla self-assigned this Jun 29, 2022
@TarekMSayed
Author

@shivamerla Adding an optional ServiceMonitor for the DCGM exporter metrics service would also be nice. I forked the Helm chart and added one to the templates manually:
dcgm-servicemonitor.yaml

{{- if .Values.dcgmExporter.metrics.serviceMonitor.enabled -}}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ include "gpu-operator.fullname" . }}-dcgm-exporter 
{{- if .Values.dcgmExporter.metrics.serviceMonitor.namespace }}
  namespace: {{ .Values.dcgmExporter.metrics.serviceMonitor.namespace | quote }}
{{- end }}
  labels:
    {{- include "gpu-operator.labels" . | nindent 4 }}
    app: nvidia-dcgm-exporter
  {{- if .Values.dcgmExporter.metrics.serviceMonitor.additionalLabels }}
    {{- toYaml .Values.dcgmExporter.metrics.serviceMonitor.additionalLabels | nindent 4 }}
  {{- end }}
spec:
  endpoints:
    - port: gpu-metrics
      interval: {{ .Values.dcgmExporter.metrics.serviceMonitor.scrapeInterval }}
    {{- if .Values.dcgmExporter.metrics.serviceMonitor.honorLabels }}
      honorLabels: true
    {{- end }}
    {{- if .Values.dcgmExporter.metrics.serviceMonitor.metricRelabelings }}
      metricRelabelings: {{ toYaml .Values.dcgmExporter.metrics.serviceMonitor.metricRelabelings | nindent 8 }}
    {{- end }}
{{- if .Values.dcgmExporter.metrics.serviceMonitor.jobLabel }}
  jobLabel: {{ .Values.dcgmExporter.metrics.serviceMonitor.jobLabel | quote }}
{{- end }}
{{- if .Values.dcgmExporter.metrics.serviceMonitor.namespaceSelector }}
  namespaceSelector: {{ toYaml .Values.dcgmExporter.metrics.serviceMonitor.namespaceSelector | nindent 4 }}
{{ else }}
  namespaceSelector:
    matchNames:
      - {{ .Release.Namespace }}
{{- end }}
{{- if .Values.dcgmExporter.metrics.serviceMonitor.targetLabels }}
  targetLabels:
  {{- range .Values.dcgmExporter.metrics.serviceMonitor.targetLabels }}
    - {{ . }}
  {{- end }}
{{- end }}
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
{{- end }}
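For reference, the values section this template reads could look like the following (a sketch; the concrete defaults are illustrative, not part of the upstream chart):

dcgmExporter:
  metrics:
    serviceMonitor:
      enabled: true
      # namespace: monitoring          # optional; defaults to the release namespace
      scrapeInterval: 30s
      honorLabels: false
      additionalLabels:
        release: prometheus-operator   # lets kube-prometheus-stack discover it
      # jobLabel: app
      # metricRelabelings: []
      # namespaceSelector: {}
      # targetLabels: []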

Generated template:

# Source: gpu-operator/templates/dcgm-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: RELEASE-NAME-gpu-operator-dcgm-exporter
  labels:
    app.kubernetes.io/name: gpu-operator
    helm.sh/chart: gpu-operator-v1.7.0
    app.kubernetes.io/instance: RELEASE-NAME
    app.kubernetes.io/version: "v1.7.0"
    app.kubernetes.io/managed-by: Helm
    app: nvidia-dcgm-exporter
    release: prometheus-operator
spec:
  endpoints:
    - port: gpu-metrics
      interval: 30s
  namespaceSelector: 
    any: true

  selector:
    matchLabels:
      app: nvidia-dcgm-exporter

@samavedulark

samavedulark commented Aug 5, 2022

Hi,
After configuring the NVIDIA GPU operator on RKE2 as below:

helm install gpu-operator \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set operator.defaultRuntime=containerd \
  --set toolkit.env[0].name=CONTAINERD_CONFIG \
  --set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml \
  --set toolkit.env[1].name=CONTAINERD_SOCKET \
  --set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
  --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
  --set toolkit.env[2].value=nvidia \
  --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
  --set-string toolkit.env[3].value=true \
  --set dcgmExporter.config.name=metrics-config \
  --set dcgmExporter.env[0].name=DCGM_EXPORTER_COLLECTORS \
  --set dcgmExporter.env[0].value=/etc/dcgm-exporter/dcgm-metrics.csv

All the pods are running well, and I have installed Prometheus too:

helm install prometheus-community/kube-prometheus-stack \
  --create-namespace --namespace prometheus \
  --generate-name \
  --set prometheus.service.type=NodePort \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false

Pod status for nvidia-dcgm-exporter
nvidia-dcgm-exporter-dq5l2 0/1 CrashLoopBackOff 5 4m3s

Logs are below
time="2022-08-05T10:59:19Z" level=info msg="Starting dcgm-exporter"
time="2022-08-05T10:59:19Z" level=info msg="DCGM successfully initialized!"
time="2022-08-05T10:59:19Z" level=info msg="Collecting DCP Metrics"
time="2022-08-05T10:59:19Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcgm-metrics.csv"
time="2022-08-05T10:59:19Z" level=fatal msg="Could not find Prometheus metry type label"

@shivamerla
Contributor

@samavedulark Can you run "kubectl get configmap metrics-config -n gpu-operator -o yaml" to confirm that the dcgm-metrics.csv file is specified correctly?

@samavedulark

@shivamerla
This is the output:
apiVersion: v1
data:
  dcgm-metrics.csv: |
    # Format
    # If line starts with a '#' it is considered a comment
    # DCGM FIELD, Prometheus metric type, help message

    # Clocks
    DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
    DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

    # Temperature
    DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
    DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

    # Power
    DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
    DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

    # PCIE
    # DCGM_FI_DEV_PCIE_TX_THROUGHPUT,  counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
    # DCGM_FI_DEV_PCIE_RX_THROUGHPUT,  counter, Total number of bytes received through PCIe RX (in KB) via NVML.
    DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.

    # Utilization (the sample period varies depending on the product)
    DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
    DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
    DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
    DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).

    # Errors and violations
    DCGM_FI_DEV_XID_ERRORS,            gauge,   Value of the last XID error encountered.
    # DCGM_FI_DEV_POWER_VIOLATION,       counter, Throttling duration due to power constraints (in us).
    # DCGM_FI_DEV_THERMAL_VIOLATION,     counter, Throttling duration due to thermal constraints (in us).
    # DCGM_FI_DEV_SYNC_BOOST_VIOLATION,  counter, Throttling duration due to sync-boost constraints (in us).
    # DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
    # DCGM_FI_DEV_LOW_UTIL_VIOLATION,    counter, Throttling duration due to low utilization (in us).
    # DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).

    # Memory usage
    DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

    # ECC
    # DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
    # DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
    # DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
    # DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.

    # Retired pages
    # DCGM_FI_DEV_RETIRED_SBE,     counter, Total number of retired pages due to single-bit errors.
    # DCGM_FI_DEV_RETIRED_DBE,     counter, Total number of retired pages due to double-bit errors.
    # DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.

    # NVLink
    # DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
    # DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
    # DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL,   counter, Total number of NVLink retries.
    # DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
    DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL,            counter, Total number of NVLink bandwidth counters for all lanes.
    # DCGM_FI_DEV_NVLINK_BANDWIDTH_L0,               counter, The number of bytes of active NVLink rx or tx data including both header and payload.

    # VGPU License status
    DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status

    # Remapped rows
    DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
    DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
    DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed

    # Static configuration information. These appear as labels on the other metrics
    DCGM_FI_DRIVER_VERSION,        label, Driver Version
    # DCGM_FI_NVML_VERSION,          label, NVML Version
    # DCGM_FI_DEV_BRAND,             label, Device Brand
    # DCGM_FI_DEV_SERIAL,            label, Device Serial Number
    # DCGM_FI_DEV_OEM_INFOROM_VER,   label, OEM inforom version
    # DCGM_FI_DEV_ECC_INFOROM_VER,   label, ECC inforom version
    # DCGM_FI_DEV_POWER_INFOROM_VER, label, Power management object inforom version
    # DCGM_FI_DEV_INFOROM_IMAGE_VER, label, Inforom image version
    # DCGM_FI_DEV_VBIOS_VERSION,     label, VBIOS version of the device
kind: ConfigMap
metadata:
  creationTimestamp: "2022-08-05T10:28:46Z"
  name: metrics-config
  namespace: gpu-operator
  resourceVersion: "1609781"
  uid: 28cad40f-a80c-4183-8e61-d5696175f39a

@shivamerla
Contributor

@glowkey This seems to happen with the following metric enabled.

# Static configuration information. These appear as labels on the other metrics
DCGM_FI_DRIVER_VERSION,        label, Driver Version
time="2022-08-05T10:59:19Z" level=fatal msg="Could not find Prometheus metry type label"

Do you know of any specific Prometheus settings that could cause this?

@samavedulark

@shivamerla
Removing this line will make it work.

@glowkey

glowkey commented Aug 8, 2022

@shivamerla The "label" type will be supported in the next release of DCGM Exporter; until that version is released, that line should be commented out or removed.
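Until then, one possible way to apply that workaround with the ConfigMap used earlier in this thread (a sketch; the DaemonSet name is inferred from the pod name above):

# Comment out the unsupported "label" type entry and restart the exporter so it
# re-reads the metrics file (DaemonSet name assumed to be nvidia-dcgm-exporter).
kubectl -n gpu-operator edit configmap metrics-config
#   before: DCGM_FI_DRIVER_VERSION,        label, Driver Version
#   after:  # DCGM_FI_DRIVER_VERSION,      label, Driver Version
kubectl -n gpu-operator rollout restart daemonset nvidia-dcgm-exporter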

@shivamerla
Contributor

Thanks @glowkey for checking. @samavedulark, you can continue to use it without label-type metrics for now.

@hcnhcn012

hcnhcn012 commented May 10, 2024

@samavedulark Can you run "kubectl get configmap metrics-config -n gpu-operator -o yaml" to confirm dcgm-metrics.csv file is specified correctly.

@shivamerla Hi, I've also hit this problem after installing kube-prometheus and gpu-operator, but I could not find metrics-config when I ran that command. I can see there is one ServiceMonitor named gpu-operator in the gpu-operator namespace by running kubectl get servicemonitors --all-namespaces; here is the output of kubectl describe servicemonitor gpu-operator -n gpu-operator:

Name:         gpu-operator
Namespace:    gpu-operator
Labels:       app=gpu-operator
Annotations:  <none>
API Version:  monitoring.coreos.com/v1
Kind:         ServiceMonitor
Metadata:
  Creation Timestamp:  2024-05-09T06:48:02Z
  Generation:          1
  Owner References:
    API Version:           nvidia.com/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  ClusterPolicy
    Name:                  cluster-policy
    UID:                   be489d14-66e8-46a6-a3e4-47a90396ddf7
  Resource Version:        1530789
  UID:                     d8f741f1-76b8-45c6-9ed4-74600037ffa0
Spec:
  Endpoints:
    Path:     /metrics
    Port:     gpu-operator-metrics
  Job Label:  operator
  Namespace Selector:
    Match Names:
      gpu-operator
  Selector:
    Match Labels:
      App:  gpu-operator
Events:     <none>

@RodgerLZ


Try this --set dcgmExporter.serviceMonitor.enabled=true
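Applied to an existing install, that could look like this (a sketch; the release name and namespace are assumed from earlier in the thread):

# Enable the chart-managed ServiceMonitor on an existing release.
helm upgrade gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --reuse-values \
  --set dcgmExporter.serviceMonitor.enabled=true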
