util/gate: add waiting duration histogram metric #18509
Closed
cairon-ab wants to merge 1 commit into prometheus:main from
Add a NewInstrumented constructor to Gate that accepts a prometheus.Registerer and records how long callers wait for a free gate slot as a histogram (gate_waiting_seconds). The existing New constructor is unchanged for backward compatibility.

The remote read handler is updated to use NewInstrumented so that waiting durations are exposed. This lets operators detect when the remote read concurrency limit is causing requests to queue.

Fixes prometheus#11365

Signed-off-by: Cairon <[email protected]>
What this PR does
Adds a `gate_waiting_seconds` histogram metric to the Gate utility so that operators can detect when the remote read concurrency limit is causing requests to queue.

Changes
- `util/gate/gate.go`: Added a `NewInstrumented(reg prometheus.Registerer, length int)` constructor that wraps the gate with a histogram metric recording how long each `Start()` call waits for a free slot. The existing `New()` constructor is unchanged for backward compatibility.
- `util/gate/gate_test.go`: New test file covering both instrumented and non-instrumented gates, verifying that metric observations are recorded correctly.
- `storage/remote/read_handler.go`: Updated `NewReadHandler` to use `gate.NewInstrumented(r, ...)` instead of `gate.New(...)` so that the remote read handler exposes the waiting duration metric.

How this solves the issue
When Prometheus is close to the `--storage.remote.read-concurrent-limit`, incoming remote read requests queue at the gate. Today there is no way to know this is happening. With this change, the `gate_waiting_seconds` histogram tracks how long each request waited; a spike in wait times or high-percentile values directly indicates gate contention. Operators can alert on `histogram_quantile(0.99, rate(gate_waiting_seconds_bucket[5m]))` to detect when the concurrency limit is becoming a bottleneck.

Design decisions
- As in `util/notifications/notifications.go`, metrics are registered via a `prometheus.Registerer` passed at construction time, keeping the util package clean.
- `promauto.With(reg)` is used for safe registration.
- The `New()` constructor remains for backward compatibility (no metrics overhead unless explicitly opted in).

Fixes #11365