Prometheus service discovery can't find DCGM exporter #363
Comments
@TarekMSayed This has been missing from the operator for a while now. We will consider adding it in the next release (setting pod labels/annotations for dcgm-exporter). |
@shivamerla Adding an optional ServiceMonitor for the DCGM exporter metrics service would also be nice. I forked the Helm chart and added one manually to the templates:

{{- if .Values.dcgmExporter.metrics.serviceMonitor.enabled -}}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ include "gpu-operator.fullname" . }}-dcgm-exporter
  {{- if .Values.dcgmExporter.metrics.serviceMonitor.namespace }}
  namespace: {{ .Values.dcgmExporter.metrics.serviceMonitor.namespace | quote }}
  {{- end }}
  labels:
    {{- include "gpu-operator.labels" . | nindent 4 }}
    app: nvidia-dcgm-exporter
    {{- if .Values.dcgmExporter.metrics.serviceMonitor.additionalLabels }}
    {{- toYaml .Values.dcgmExporter.metrics.serviceMonitor.additionalLabels | nindent 4 }}
    {{- end }}
spec:
  endpoints:
    - port: gpu-metrics
      interval: {{ .Values.dcgmExporter.metrics.serviceMonitor.scrapeInterval }}
      {{- if .Values.dcgmExporter.metrics.serviceMonitor.honorLabels }}
      honorLabels: true
      {{- end }}
      {{- if .Values.dcgmExporter.metrics.serviceMonitor.metricRelabelings }}
      metricRelabelings: {{ toYaml .Values.dcgmExporter.metrics.serviceMonitor.metricRelabelings | nindent 8 }}
      {{- end }}
  {{- if .Values.dcgmExporter.metrics.serviceMonitor.jobLabel }}
  jobLabel: {{ .Values.dcgmExporter.metrics.serviceMonitor.jobLabel | quote }}
  {{- end }}
  {{- if .Values.dcgmExporter.metrics.serviceMonitor.namespaceSelector }}
  namespaceSelector: {{ toYaml .Values.dcgmExporter.metrics.serviceMonitor.namespaceSelector | nindent 4 }}
  {{ else }}
  namespaceSelector:
    matchNames:
      - {{ .Release.Namespace }}
  {{- end }}
  {{- if .Values.dcgmExporter.metrics.serviceMonitor.targetLabels }}
  targetLabels:
    {{- range .Values.dcgmExporter.metrics.serviceMonitor.targetLabels }}
    - {{ . }}
    {{- end }}
  {{- end }}
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
{{- end }}

Generated template:

# Source: gpu-operator/templates/dcgm-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: RELEASE-NAME-gpu-operator-dcgm-exporter
  labels:
    app.kubernetes.io/name: gpu-operator
    helm.sh/chart: gpu-operator-v1.7.0
    app.kubernetes.io/instance: RELEASE-NAME
    app.kubernetes.io/version: "v1.7.0"
    app.kubernetes.io/managed-by: Helm
    app: nvidia-dcgm-exporter
    release: prometheus-operator
spec:
  endpoints:
    - port: gpu-metrics
      interval: 30s
  namespaceSelector:
    any: true
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
|
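For anyone trying the forked template, the values below are a sketch of what would plausibly render the generated ServiceMonitor shown above. The keys are read directly off the template's `.Values.dcgmExporter.metrics.serviceMonitor.*` references, and the concrete values (`30s`, `release: prometheus-operator`, `any: true`) are taken from the generated output. The key layout is an assumption about the fork; the upstream chart may not define these keys at all.

```yaml
# values.yaml fragment (assumption: only valid for a chart that ships the
# ServiceMonitor template from this comment, not the upstream gpu-operator chart)
dcgmExporter:
  metrics:
    serviceMonitor:
      enabled: true                  # gates the whole template
      scrapeInterval: 30s            # rendered as endpoints[0].interval
      additionalLabels:
        release: prometheus-operator # label the Prometheus instance selects on
      namespaceSelector:
        any: true                    # overrides the default of .Release.Namespace
```

Leaving `namespace`, `honorLabels`, `metricRelabelings`, `jobLabel`, and `targetLabels` unset matches the generated output, where none of those fields appear.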
Hi, all the pods are running well, and I have installed Prometheus too. The pod status and logs for nvidia-dcgm-exporter are below. |
@samavedulark Can you run "kubectl get configmap metrics-config -n gpu-operator -o yaml" to confirm |
@shivamerla
kind: ConfigMap |
@glowkey This seems to happen with the following metric enabled.
Do you know of any specific Prometheus settings that could cause this? |
@shivamerla |
@shivamerla The "label" type will be supported in the next release of DCGM Exporter. Until that version is released, that line should be commented out or removed. |
Thanks @glowkey for checking. @samavedulark, you can continue to use it without label-type metrics for now. |
@shivamerla Hi, I've also hit this problem after installing kube-prometheus and gpu-operator, but could not find metrics-config when I ran this. I can see there is one ServiceMonitor
|
Try this |
Prometheus service discovery can't find the DCGM exporter. Usually this requires adding extra labels/annotations to the pods, e.g.:
release: "prometheus-operator"
or creating a ServiceMonitor with these labels/annotations. I can't find a way to add this to the Helm chart.
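For context on why the `release` label matters: with the Prometheus Operator, a Prometheus instance only picks up ServiceMonitor objects whose labels match its `serviceMonitorSelector`. The fragment below is a minimal sketch of that selector, assuming an install whose Prometheus resource selects on `release: prometheus-operator`. The resource name and namespace are hypothetical, and the actual selector in a given cluster may differ.

```yaml
# Sketch of the relevant part of a Prometheus custom resource.
# metadata.name and metadata.namespace are hypothetical placeholders.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example-prometheus
  namespace: monitoring
spec:
  # Only ServiceMonitors carrying this label are turned into scrape configs.
  serviceMonitorSelector:
    matchLabels:
      release: prometheus-operator
  # An empty selector matches ServiceMonitors in all namespaces.
  serviceMonitorNamespaceSelector: {}
```

If the ServiceMonitor for the DCGM exporter lacks the label the selector expects, it is ignored and the exporter never appears in service discovery, which matches the symptom described here.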