dcgm-exporter: enable ServiceMonitor by default and skip gracefully when Prometheus CRD is absent #2262
…present Signed-off-by: Vincent Gimenes <[email protected]>
Hi, just following up on this PR. The main goal here is to fix an inconsistency in the operator:
Since both rely on the same Prometheus Operator setup, this often leads to confusion and missing GPU metrics in practice, and significantly increases debugging time for users. The change is backward-compatible:
Happy to adjust the approach if you’d prefer a different direction (e.g. keeping default=false and only fixing the reconciliation behavior). Would appreciate your feedback when you have time.
@rahulait hey, gentle ping on this PR, would truly appreciate your feedback when you have time
Hey @VincentG1234, thanks for this contribution. I'll plan to review this next week. Please feel free to ping again if you don't hear back by next weekend.
Problem
The DCGM Exporter `ServiceMonitor` has been opt-in (`enabled: false`) since the feature was introduced in 2022. Every user who wants GPU metrics scraped by Prometheus must remember to set `serviceMonitor.enabled: true`. Forgetting it is a silent misconfiguration that causes hours of debugging (#305 #363).

On top of that, when `enabled: true` is set and the Prometheus `ServiceMonitor` CRD is absent, the operator returns `NotReady` and blocks the entire reconcile loop (re-queuing every 5 seconds). This was documented in release 23.3 with added logging, but the underlying blocking behavior was never fixed. As a result, missing Prometheus CRDs prevent GFD pods from starting.

Solution
Two minimal changes:

- `serviceMonitor.enabled: true` by default in `values.yaml`.
- When the `ServiceMonitor` CRD is absent, the operator now returns `Ready` and skips gracefully instead of blocking the reconcile loop. This aligns `state-dcgm-exporter` with the existing behavior of `state-operator-metrics`, which already handled this case correctly.

An explicit `enabled: false` continues to disable and remove the resource.

| Case | Before | After |
|---|---|---|
| `enabled: true` + CRD absent | `NotReady` (blocks reconcile loop, GFD stalls) | `Ready` (silent skip) |
| `enabled: true` + CRD present | | |
| `enabled: false` explicit | `Disabled` | `Disabled` (unchanged) |

Changes
- `deployments/gpu-operator/values.yaml`: `serviceMonitor.enabled: false` → `enabled: true`.
- `controllers/object_controls.go`, `ServiceMonitor()`: CRD-absent path returns `Ready` instead of `NotReady`, consistent with `state-operator-metrics`.
- `api/nvidia/v1/clusterpolicy_types.go`, `DCGMExporterServiceMonitorConfig.IsEnabled()`: nil defaults to `true`, consistent with the rest of the codebase.

Testing
go test ./controllers/... -run TestServiceMonitor -v
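
For reviewers who want the intended decision logic at a glance, here is a minimal, self-contained sketch of the behavior described under Changes. It is not the actual diff: `ServiceMonitorConfig`, `State`, and `reconcileServiceMonitor` are hypothetical stand-ins for the operator's real types and for `ServiceMonitor()` in `controllers/object_controls.go`, and the `crdPresent` flag replaces the real CRD lookup. The result for `enabled: true` with the CRD present is assumed to be `Ready` once the resource is applied.

```go
package main

import "fmt"

// State is a simplified stand-in for the operator's per-component state.
type State string

const (
	Ready    State = "Ready"
	NotReady State = "NotReady" // pre-change result for the CRD-absent case
	Disabled State = "Disabled"
)

// ServiceMonitorConfig is a hypothetical, reduced version of
// DCGMExporterServiceMonitorConfig: only the optional Enabled flag.
type ServiceMonitorConfig struct {
	Enabled *bool
}

// IsEnabled treats a nil config or a nil Enabled field as true,
// so the ServiceMonitor is on unless explicitly disabled.
func (c *ServiceMonitorConfig) IsEnabled() bool {
	if c == nil || c.Enabled == nil {
		return true
	}
	return *c.Enabled
}

// reconcileServiceMonitor sketches the post-change decision table.
// crdPresent stands in for the real check that the Prometheus
// ServiceMonitor CRD is installed in the cluster.
func reconcileServiceMonitor(cfg *ServiceMonitorConfig, crdPresent bool) State {
	if !cfg.IsEnabled() {
		// Explicit enabled: false still disables (and removes) the resource.
		return Disabled
	}
	if !crdPresent {
		// Skip instead of blocking the reconcile loop with NotReady.
		return Ready
	}
	// Assumption: the ServiceMonitor is created or updated here.
	return Ready
}

func main() {
	off := false
	fmt.Println(reconcileServiceMonitor(nil, false))
	fmt.Println(reconcileServiceMonitor(nil, true))
	fmt.Println(reconcileServiceMonitor(&ServiceMonitorConfig{Enabled: &off}, true))
}
```

The point of the sketch is that the CRD check only decides whether the resource is applied, never whether the component is considered ready, so a missing Prometheus Operator can no longer stall the other operands.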