util/gate: add waiting duration histogram metric #18509

Closed
cairon-ab wants to merge 1 commit into prometheus:main from cairon-ab:gate-waiting-duration

@cairon-ab

What this PR does

Adds a gate_waiting_seconds histogram metric to the Gate utility so that operators can detect when the remote read concurrency limit is causing requests to queue.

Changes

  1. util/gate/gate.go: Added NewInstrumented(reg prometheus.Registerer, length int) constructor that wraps the gate with a histogram metric recording how long each Start() call waits for a free slot. The existing New() constructor is unchanged for backward compatibility.

  2. util/gate/gate_test.go: New test file covering both instrumented and non-instrumented gates, verifying metric observations are recorded correctly.

  3. storage/remote/read_handler.go: Updated NewReadHandler to use gate.NewInstrumented(r, ...) instead of gate.New(...) so that the remote read handler exposes the waiting duration metric.
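
The constructor change described above can be sketched roughly as follows. This is not the PR's code: to stay free of dependencies it takes a hypothetical `observe` callback where the PR passes a `prometheus.Registerer` and records into a histogram, and `runDemo` is an illustrative helper that forces one caller to queue.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Gate limits the number of operations that may run concurrently.
type Gate struct {
	ch chan struct{}
	// observe is a hypothetical stand-in for a histogram's Observe method.
	// It is nil for gates created with New.
	observe func(seconds float64)
}

// New returns an uninstrumented gate with the given number of slots.
func New(length int) *Gate {
	return &Gate{ch: make(chan struct{}, length)}
}

// NewInstrumented returns a gate that reports how long each Start call waited.
// Assumption: the PR's version takes a prometheus.Registerer, not a callback.
func NewInstrumented(observe func(float64), length int) *Gate {
	return &Gate{ch: make(chan struct{}, length), observe: observe}
}

// Start blocks until a slot is free or ctx is done.
func (g *Gate) Start(ctx context.Context) error {
	start := time.Now()
	select {
	case <-ctx.Done():
		return ctx.Err()
	case g.ch <- struct{}{}:
		if g.observe != nil {
			g.observe(time.Since(start).Seconds())
		}
		return nil
	}
}

// Done releases a slot acquired by Start.
func (g *Gate) Done() {
	select {
	case <-g.ch:
	default:
		panic("gate.Done: more operations done than started")
	}
}

// runDemo fills a one-slot gate, then measures how long a second caller queues.
func runDemo() []float64 {
	var waits []float64
	g := NewInstrumented(func(s float64) { waits = append(waits, s) }, 1)
	ctx := context.Background()

	_ = g.Start(ctx) // takes the only slot without waiting
	go func() {
		time.Sleep(50 * time.Millisecond)
		g.Done() // frees the slot for the queued caller
	}()
	_ = g.Start(ctx) // queues until the goroutine above calls Done
	g.Done()
	return waits
}

func main() {
	fmt.Println("observed waits (seconds):", runDemo())
}
```

Keeping the observer optional mirrors the stated goal that `New()` callers pay no metrics overhead unless they opt in.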

How this solves the issue

When Prometheus is close to the --storage.remote.read-concurrent-limit, incoming remote read requests queue at the gate. Today there's no way to know this is happening. With this change, the gate_waiting_seconds histogram tracks how long each request waited — a spike in wait times or high-percentile values directly indicates gate contention. Operators can alert on histogram_quantile(0.99, rate(gate_waiting_seconds_bucket[5m])) to detect when the concurrency limit is becoming a bottleneck.
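
To make the alert expression more concrete, here is a simplified sketch of how `histogram_quantile` estimates a quantile from classic cumulative buckets by interpolating linearly inside the bucket that contains the target rank. The bucket bounds and counts in `demoBuckets` are invented for illustration and are not the buckets this PR defines; the real function works on per-second rates and handles more edge cases than this sketch.

```go
package main

import (
	"fmt"
	"math"
)

// bucket is one classic histogram bucket: the "le" bound and its cumulative count.
type bucket struct {
	upperBound float64
	count      float64
}

// quantile approximates histogram_quantile for buckets sorted by upperBound,
// where the last bucket is +Inf. Simplification: it returns NaN when q is
// outside (0, 1], which differs from PromQL's handling of such values.
func quantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 || q <= 0 || q > 1 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total

	// Find the first bucket whose cumulative count reaches the rank.
	i := 0
	for i < len(buckets)-1 && buckets[i].count < rank {
		i++
	}
	if i == len(buckets)-1 {
		// The rank falls in the +Inf bucket: report the highest finite bound.
		return buckets[len(buckets)-2].upperBound
	}

	// The lowest bucket is assumed to start at 0.
	lower, prev := 0.0, 0.0
	if i > 0 {
		lower, prev = buckets[i-1].upperBound, buckets[i-1].count
	}
	b := buckets[i]
	return lower + (b.upperBound-lower)*(rank-prev)/(b.count-prev)
}

// demoBuckets returns made-up wait-time buckets (seconds) for illustration.
func demoBuckets() []bucket {
	return []bucket{
		{0.1, 50},
		{0.5, 90},
		{1, 99},
		{math.Inf(1), 100},
	}
}

func main() {
	for _, q := range []float64{0.5, 0.95, 1} {
		fmt.Printf("q=%.2f -> %.3f s\n", q, quantile(q, demoBuckets()))
	}
}
```

Because the estimate is interpolated, its accuracy depends on how finely the buckets cover the range of wait times seen under contention.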

Design decisions

  • Following the pattern from util/notifications/notifications.go, metrics are registered via prometheus.Registerer passed at construction time, keeping the util package clean.
  • Uses promauto.With(reg) for safe registration.
  • Native histogram support is enabled for modern Prometheus setups.
  • The non-instrumented New() constructor remains for backward compatibility (no metrics overhead unless explicitly opted in).

Fixes #11365

[ENHANCEMENT] Remote Read: Add `gate_waiting_seconds` histogram metric to track time spent waiting for the remote read concurrency gate.

Add a NewInstrumented constructor to Gate that accepts a
prometheus.Registerer and records how long callers wait
for a free gate slot as a histogram (gate_waiting_seconds).

The existing New constructor is unchanged for backward
compatibility. The remote read handler is updated to use
NewInstrumented so that waiting durations are exposed.

This lets operators detect when the remote read concurrency
limit is causing requests to queue.

Fixes prometheus#11365

Signed-off-by: Cairon <[email protected]>
@bboreham
Member

Please explain how this PR relates to #17024, #18098, #18355, #18378, #18450, #18491.

@cairon-ab
Author

Thanks for the pointer — I missed these existing PRs covering the same functionality. #17024 and #18355 appear to address the same util/gate instrumentation. Closing this as a duplicate to avoid noise. Apologies for the extra PR.

@cairon-ab closed this on Apr 14, 2026

Development

Successfully merging this pull request may close these issues.

Gate needs a waiting duration metric
