
Prometheus service discovery can't find DCGM exporter #363

Closed
TarekMSayed opened this issue Jun 28, 2022 · 11 comments

@TarekMSayed

Prometheus service discovery can't find the DCGM exporter. Usually this requires adding extra labels/annotations to the pods (e.g. release: "prometheus-operator") or creating a ServiceMonitor carrying those labels/annotations.
I can't find a way to add either through the Helm chart.
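For context, kube-prometheus-stack by default configures its Prometheus to scrape only ServiceMonitors that carry the release label of the Prometheus Helm release, roughly as in the sketch below (the object name and the value "prometheus-operator" are assumptions matching the release name used in this issue):

# Sketch of the default selector kube-prometheus-stack puts on its Prometheus object;
# ServiceMonitors without this label are ignored by service discovery.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-operator-prometheus   # hypothetical name
spec:
  serviceMonitorSelector:
    matchLabels:
      release: prometheus-operator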

@shivamerla
Contributor

@TarekMSayed This has been missing from the operator for a while now; we will consider adding it (setting pod labels/annotations for dcgm-exporter) in the next release.

@shivamerla shivamerla self-assigned this Jun 29, 2022
@TarekMSayed
Author

@shivamerla Adding an optional ServiceMonitor for the DCGM exporter metrics service would also be nice. I forked the Helm chart and added one to the templates manually:
dcgm-servicemonitor.yaml

{{- if .Values.dcgmExporter.metrics.serviceMonitor.enabled -}}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ include "gpu-operator.fullname" . }}-dcgm-exporter 
{{- if .Values.dcgmExporter.metrics.serviceMonitor.namespace }}
  namespace: {{ .Values.dcgmExporter.metrics.serviceMonitor.namespace | quote }}
{{- end }}
  labels:
    {{- include "gpu-operator.labels" . | nindent 4 }}
    app: nvidia-dcgm-exporter
  {{- if .Values.dcgmExporter.metrics.serviceMonitor.additionalLabels }}
    {{- toYaml .Values.dcgmExporter.metrics.serviceMonitor.additionalLabels | nindent 4 }}
  {{- end }}
spec:
  endpoints:
    - port: gpu-metrics
      interval: {{ .Values.dcgmExporter.metrics.serviceMonitor.scrapeInterval }}
    {{- if .Values.dcgmExporter.metrics.serviceMonitor.honorLabels }}
      honorLabels: true
    {{- end }}
    {{- if .Values.dcgmExporter.metrics.serviceMonitor.metricRelabelings }}
      metricRelabelings: {{ toYaml .Values.dcgmExporter.metrics.serviceMonitor.metricRelabelings | nindent 8 }}
    {{- end }}
{{- if .Values.dcgmExporter.metrics.serviceMonitor.jobLabel }}
  jobLabel: {{ .Values.dcgmExporter.metrics.serviceMonitor.jobLabel | quote }}
{{- end }}
{{- if .Values.dcgmExporter.metrics.serviceMonitor.namespaceSelector }}
  namespaceSelector: {{ toYaml .Values.dcgmExporter.metrics.serviceMonitor.namespaceSelector | nindent 4 }}
{{ else }}
  namespaceSelector:
    matchNames:
      - {{ .Release.Namespace }}
{{- end }}
{{- if .Values.dcgmExporter.metrics.serviceMonitor.targetLabels }}
  targetLabels:
  {{- range .Values.dcgmExporter.metrics.serviceMonitor.targetLabels }}
    - {{ . }}
  {{- end }}
{{- end }}
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
{{- end }}
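For reference, the values section this template reads could look like the following (a sketch; the concrete defaults are illustrative, not part of the upstream chart):

dcgmExporter:
  metrics:
    serviceMonitor:
      enabled: true
      # namespace: monitoring          # optional; defaults to the release namespace
      scrapeInterval: 30s
      honorLabels: false
      additionalLabels:
        release: prometheus-operator   # lets kube-prometheus-stack discover it
      # jobLabel: app
      # metricRelabelings: []
      # namespaceSelector: {}
      # targetLabels: []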

Generated template:

# Source: gpu-operator/templates/dcgm-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: RELEASE-NAME-gpu-operator-dcgm-exporter
  labels:
    app.kubernetes.io/name: gpu-operator
    helm.sh/chart: gpu-operator-v1.7.0
    app.kubernetes.io/instance: RELEASE-NAME
    app.kubernetes.io/version: "v1.7.0"
    app.kubernetes.io/managed-by: Helm
    app: nvidia-dcgm-exporter
    release: prometheus-operator
spec:
  endpoints:
    - port: gpu-metrics
      interval: 30s
  namespaceSelector: 
    any: true

  selector:
    matchLabels:
      app: nvidia-dcgm-exporter

@samavedulark

samavedulark commented Aug 5, 2022

Hi,
After configuring the NVIDIA GPU operator on RKE2 as below:

helm install gpu-operator \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set operator.defaultRuntime=containerd \
  --set toolkit.env[0].name=CONTAINERD_CONFIG \
  --set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml \
  --set toolkit.env[1].name=CONTAINERD_SOCKET \
  --set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
  --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
  --set toolkit.env[2].value=nvidia \
  --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
  --set-string toolkit.env[3].value=true \
  --set dcgmExporter.config.name=metrics-config \
  --set dcgmExporter.env[0].name=DCGM_EXPORTER_COLLECTORS \
  --set dcgmExporter.env[0].value=/etc/dcgm-exporter/dcgm-metrics.csv

All the pods are running well, and I have installed Prometheus too:

helm install prometheus-community/kube-prometheus-stack \
  --create-namespace --namespace prometheus \
  --generate-name \
  --set prometheus.service.type=NodePort \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false

Pod status for nvidia-dcgm-exporter
nvidia-dcgm-exporter-dq5l2 0/1 CrashLoopBackOff 5 4m3s

Logs are below
time="2022-08-05T10:59:19Z" level=info msg="Starting dcgm-exporter"
time="2022-08-05T10:59:19Z" level=info msg="DCGM successfully initialized!"
time="2022-08-05T10:59:19Z" level=info msg="Collecting DCP Metrics"
time="2022-08-05T10:59:19Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcgm-metrics.csv"
time="2022-08-05T10:59:19Z" level=fatal msg="Could not find Prometheus metry type label"

@shivamerla
Contributor

@samavedulark Can you run "kubectl get configmap metrics-config -n gpu-operator -o yaml" to confirm that the dcgm-metrics.csv file is specified correctly?

@samavedulark

@shivamerla
This is the output:
apiVersion: v1
data:
  dcgm-metrics.csv: |
    # Format
    # If line starts with a '#' it is considered a comment
    # DCGM FIELD, Prometheus metric type, help message

    # Clocks
    DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
    DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

    # Temperature
    DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
    DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

    # Power
    DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
    DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

    # PCIE
    # DCGM_FI_DEV_PCIE_TX_THROUGHPUT,  counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
    # DCGM_FI_DEV_PCIE_RX_THROUGHPUT,  counter, Total number of bytes received through PCIe RX (in KB) via NVML.
    DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.

    # Utilization (the sample period varies depending on the product)
    DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
    DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
    DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
    DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).

    # Errors and violations
    DCGM_FI_DEV_XID_ERRORS,            gauge,   Value of the last XID error encountered.
    # DCGM_FI_DEV_POWER_VIOLATION,       counter, Throttling duration due to power constraints (in us).
    # DCGM_FI_DEV_THERMAL_VIOLATION,     counter, Throttling duration due to thermal constraints (in us).
    # DCGM_FI_DEV_SYNC_BOOST_VIOLATION,  counter, Throttling duration due to sync-boost constraints (in us).
    # DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
    # DCGM_FI_DEV_LOW_UTIL_VIOLATION,    counter, Throttling duration due to low utilization (in us).
    # DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).

    # Memory usage
    DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

    # ECC
    # DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
    # DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
    # DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
    # DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.

    # Retired pages
    # DCGM_FI_DEV_RETIRED_SBE,     counter, Total number of retired pages due to single-bit errors.
    # DCGM_FI_DEV_RETIRED_DBE,     counter, Total number of retired pages due to double-bit errors.
    # DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.

    # NVLink
    # DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
    # DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
    # DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL,   counter, Total number of NVLink retries.
    # DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
    DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL,            counter, Total number of NVLink bandwidth counters for all lanes.
    # DCGM_FI_DEV_NVLINK_BANDWIDTH_L0,               counter, The number of bytes of active NVLink rx or tx data including both header and payload.

    # VGPU License status
    DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status

    # Remapped rows
    DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
    DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
    DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed

    # Static configuration information. These appear as labels on the other metrics
    DCGM_FI_DRIVER_VERSION,        label, Driver Version
    # DCGM_FI_NVML_VERSION,          label, NVML Version
    # DCGM_FI_DEV_BRAND,             label, Device Brand
    # DCGM_FI_DEV_SERIAL,            label, Device Serial Number
    # DCGM_FI_DEV_OEM_INFOROM_VER,   label, OEM inforom version
    # DCGM_FI_DEV_ECC_INFOROM_VER,   label, ECC inforom version
    # DCGM_FI_DEV_POWER_INFOROM_VER, label, Power management object inforom version
    # DCGM_FI_DEV_INFOROM_IMAGE_VER, label, Inforom image version
    # DCGM_FI_DEV_VBIOS_VERSION,     label, VBIOS version of the device
kind: ConfigMap
metadata:
  creationTimestamp: "2022-08-05T10:28:46Z"
  name: metrics-config
  namespace: gpu-operator
  resourceVersion: "1609781"
  uid: 28cad40f-a80c-4183-8e61-d5696175f39a

@shivamerla
Contributor

@glowkey This seems to happen with the following metric enabled.

# Static configuration information. These appear as labels on the other metrics
DCGM_FI_DRIVER_VERSION,        label, Driver Version
time="2022-08-05T10:59:19Z" level=fatal msg="Could not find Prometheus metry type label"

Do you know of any specific Prometheus settings that could cause this?

@samavedulark

@shivamerla
Removing this line will make it work.

@glowkey

glowkey commented Aug 8, 2022

@shivamerla The "label" type will be supported in the next release of DCGM Exporter; until that version is released, that line should be commented out or removed.
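Until then, one possible way to apply that workaround with the ConfigMap used earlier in this thread (a sketch; the DaemonSet name is inferred from the pod name above):

# Comment out the unsupported "label" type entry and restart the exporter so it
# re-reads the metrics file (DaemonSet name assumed to be nvidia-dcgm-exporter).
kubectl -n gpu-operator edit configmap metrics-config
#   before: DCGM_FI_DRIVER_VERSION,        label, Driver Version
#   after:  # DCGM_FI_DRIVER_VERSION,      label, Driver Version
kubectl -n gpu-operator rollout restart daemonset nvidia-dcgm-exporter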

@shivamerla
Contributor

Thanks @glowkey for checking. @samavedulark, you can continue to use it without label-type metrics for now.

@hcnhcn012

hcnhcn012 commented May 10, 2024

@samavedulark Can you run "kubectl get configmap metrics-config -n gpu-operator -o yaml" to confirm dcgm-metrics.csv file is specified correctly.

@shivamerla Hi, I've also hit this problem after installing kube-prometheus and gpu-operator, but I could not find metrics-config when I ran that command. I can see there is one ServiceMonitor named gpu-operator in the gpu-operator namespace by running kubectl get servicemonitors --all-namespaces; here is the output of kubectl describe servicemonitor gpu-operator -n gpu-operator:

Name:         gpu-operator
Namespace:    gpu-operator
Labels:       app=gpu-operator
Annotations:  <none>
API Version:  monitoring.coreos.com/v1
Kind:         ServiceMonitor
Metadata:
  Creation Timestamp:  2024-05-09T06:48:02Z
  Generation:          1
  Owner References:
    API Version:           nvidia.com/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  ClusterPolicy
    Name:                  cluster-policy
    UID:                   be489d14-66e8-46a6-a3e4-47a90396ddf7
  Resource Version:        1530789
  UID:                     d8f741f1-76b8-45c6-9ed4-74600037ffa0
Spec:
  Endpoints:
    Path:     /metrics
    Port:     gpu-operator-metrics
  Job Label:  operator
  Namespace Selector:
    Match Names:
      gpu-operator
  Selector:
    Match Labels:
      App:  gpu-operator
Events:     <none>

@RodgerLZ


Try this --set dcgmExporter.serviceMonitor.enabled=true
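Applied to an existing install, that could look like this (a sketch; the release name and namespace are assumed from earlier in the thread):

# Enable the chart-managed ServiceMonitor on an existing release.
helm upgrade gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --reuse-values \
  --set dcgmExporter.serviceMonitor.enabled=true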
