feat: Add waiting duration metric to query gate #18378
AdeshDeshmukh wants to merge 1 commit into prometheus:main
Fixes: prometheus#11365

The query gate limits concurrent requests, but we had no visibility into how long requests wait when the limit is hit.

This adds a histogram metric to track waiting duration, so operators can see whether the gate is becoming a bottleneck and whether they need to increase the concurrency limit.

The metric is named 'prometheus_query_gate_waiting_duration_seconds' and uses standard histogram buckets. Waiting time is measured from when Start() is called until the request acquires a gate slot.

This includes tests covering normal operation, context cancellation, and metric recording.

Signed-off-by: Test User <[email protected]>
Force-pushed from 4b17969 to 92241cf
ogulcanaydogan left a comment
Hi @AdeshDeshmukh — I also have an open PR for this issue (#18355).
A few observations on this approach:
- Global metric via promauto: The histogram is a package-level singleton, which means it can't be customized per caller and is harder to test (can't verify observations through a test registry). #18355 uses the prometheus.Registerer pattern (like util/notifications) so the caller controls naming and registration.
- No New() signature change: This keeps backward compat, but it also means the metric is always registered, even if the gate is used in a context where metrics aren't wanted.
- Metric naming: prometheus_query_gate_waiting_duration_seconds assumes the gate is only used for queries. The remote read handler also uses it, so a more generic name (or caller-provided prefix) might be better.
Happy to collaborate on converging the approaches — the core logic (measure time.Since(start) in Start()) is the same in both PRs.