feat: add provisioner job queue wait time histogram and jobs enqueued counter #21869
Conversation
Code Review
Reviewed the provisioner job queue metrics implementation. The changes add two new metrics:
- `coderd_provisioner_job_queue_wait_seconds` - histogram tracking job queue wait times
- `coderd_workspace_builds_enqueued_total` - counter tracking workspace build enqueue attempts
The implementation is generally solid with good test coverage. Found 3 issues:
Issues Found
- Potential nil pointer dereference in metrics recording - The code checks `job != nil` but doesn't verify `job.Provisioner` is valid before converting it to a string
- Missing metric recording for prebuild claims - The prebuild claim flow doesn't record the enqueue metric, creating inconsistency
- Histogram bucket configuration could be optimized - The bucket ranges have gaps that may not capture typical queue wait patterns effectively
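The nil-pointer concern in the first issue could be addressed with a defensive fallback, sketched here with hypothetical stand-in types (not the PR's actual code):

```go
package main

import "fmt"

// ProvisionerJob is a hypothetical stand-in for the real job type.
type ProvisionerJob struct {
	Provisioner string // e.g. "terraform"; may be empty on malformed rows
}

// provisionerLabel guards both the job pointer and its Provisioner field
// before producing a metric label value, falling back to "unknown".
func provisionerLabel(job *ProvisionerJob) string {
	if job == nil || job.Provisioner == "" {
		return "unknown"
	}
	return job.Provisioner
}

func main() {
	fmt.Println(provisionerLabel(nil))                                  // unknown
	fmt.Println(provisionerLabel(&ProvisionerJob{Provisioner: "terraform"})) // terraform
}
```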
Positive Aspects
- Comprehensive test coverage including both direct workspace builds and autostart scenarios
- Proper nil checks before accessing metrics objects
- Consistent use of the new `BuildReasonPrebuild` constant for prebuild tracking
- Native histogram support configured correctly
- Good separation of concerns with metrics recording in the appropriate layers
coderd/provisionerdserver/metrics.go (outdated)

```diff
 	Help: "Total number of workspace build enqueue attempts.",
 	}, []string{"provisioner_type", "build_reason", "transition", "status"}),
 }

 func (m *Metrics) Register(reg prometheus.Registerer) error {
 	if err := reg.Register(m.workspaceCreationTimings); err != nil {
 		return err
 	}
-	return reg.Register(m.workspaceClaimTimings)
+	if err := reg.Register(m.workspaceClaimTimings); err != nil {
+		return err
+	}
+	if err := reg.Register(m.jobQueueWait); err != nil {
```
The histogram buckets have some gaps that might not capture typical queue patterns well:
- Large gap between 10ms and 25ms (150% increase)
- Jump from 25ms to 50ms (100% increase)
- Big leap from 10s to 30s
Consider adding intermediate buckets for more granular measurement:
```go
Buckets: []float64{
	0.01,  // 10ms
	0.025, // 25ms
	0.05,  // 50ms
	0.1,   // 100ms
	0.25,  // 250ms
	0.5,   // 500ms
	1,     // 1s
	2.5,   // 2.5s
	5,     // 5s
	10,    // 10s
	20,    // 20s
	30,    // 30s
	60,    // 1m
	120,   // 2m
	300,   // 5m
},
```
We can probably get away without a 5m bucket here, and add another sub-100ms bucket?
I actually think we want stuff on the high end more. Many customers have builds that exceed 15 minutes; I've heard as high as 30 minutes. In that context, 5 minutes of waiting time is not a big deal, and they'd want to know whether 95% of builds start within 30m or whatever.
```go
	},
	audit.WorkspaceBuildBaggage{},
)
if c.provisionerdServerMetrics != nil && provisionerJob != nil {
```
The prebuild reconciliation records workspace build enqueued metrics, but the claim flow in claim.go doesn't. This creates inconsistency where some prebuild builds are tracked and others aren't.
Consider adding metric recording in the claim flow as well for complete coverage of all prebuild build paths.
AFAICT the path in claim.go does not actually result in the enqueuing of a job, it just updates the owner information for a workspace. After the call to Claim in workspace.go we go through the regular process of calling builder.Build where we do track the new metrics.
We could check claimedWorkspace at the point where we observe the metric value here, to set the build reason to prebuild claim instead of initiator (a regular user-initiated build).
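A rough sketch of that suggestion; all names here are hypothetical stand-ins, not the PR's actual code:

```go
package main

import "fmt"

// buildReasonLabel picks the metric's build_reason label at observation
// time: if the workspace was just claimed from the prebuild pool, report
// "prebuild_claim" instead of the generic database reason.
func buildReasonLabel(claimedWorkspace bool, dbReason string) string {
	if claimedWorkspace {
		return "prebuild_claim"
	}
	return dbReason
}

func main() {
	fmt.Println(buildReasonLabel(true, "initiator"))  // prebuild_claim
	fmt.Println(buildReasonLabel(false, "autostart")) // autostart
}
```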
All contributors have signed the CLA ✍️ ✅
Documentation Check: Updates Needed
The metrics documentation table needs to be updated for the new metrics. (Automated review via Coder Tasks)
coderd/provisionerdserver/metrics.go (outdated)

```go
// operations. This is distinct from database.BuildReason values since prebuilds
// use BuildReasonInitiator in the database but we want to track them separately
// in metrics.
const BuildReasonPrebuild = "prebuild"
```
You only use this for workspace_builds_enqueued_total but use the same label name in both cases. It might be confusing that we get prebuilds separated out in one case but not the other.
It would be really nice to get prebuild information for the queue time histogram because non-prebuilds are prioritized over prebuilds, so they'll likely have different distributions. You can determine whether a build is a prebuild via the initiator ID. Prebuilds are always 'c42fdf75-3097-471c-8c33-fb52454d81c0'
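A minimal sketch of the initiator-ID check described above; `isPrebuild` is a hypothetical helper name, and the UUID is the one quoted in the comment:

```go
package main

import "fmt"

// PrebuildsInitiatorID is the fixed initiator UUID used for builds created
// by the prebuilds system, per the reviewer's comment.
const PrebuildsInitiatorID = "c42fdf75-3097-471c-8c33-fb52454d81c0"

// isPrebuild reports whether a build was initiated by the prebuilds system,
// which could be used to label the queue-wait histogram as well.
func isPrebuild(initiatorID string) bool {
	return initiatorID == PrebuildsInitiatorID
}

func main() {
	fmt.Println(isPrebuild(PrebuildsInitiatorID)) // true
	fmt.Println(isPrebuild("some-user-uuid"))     // false
}
```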
coderd/provisionerdserver/metrics.go (outdated)

```go
	NativeHistogramZeroThreshold:    0,
	NativeHistogramMaxZeroThreshold: 0,
}, []string{"provisioner_type", "job_type", "transition", "build_reason"}),
workspaceBuildsEnqueued: prometheus.NewCounterVec(prometheus.CounterOpts{
```
I'm not sure this belongs here. Provisionerdserver doesn't enqueue jobs.
In all the cases where you call RecordWorkspaceBuildEnqueued, it's right after wsbuilder.Build. That's the common code we have for the business logic of creating a workspace build, so the builder should get a reference to this metric (call it buildMetrics?), and it should be responsible for incrementing it on every build.
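The suggested restructuring could look roughly like this; `Builder`, `buildMetrics`, and the stub `counter` type are hypothetical stand-ins (the real code would hold a `*prometheus.CounterVec`):

```go
package main

import "fmt"

// counter is a dependency-free stand-in for a prometheus counter.
type counter struct{ n int }

func (c *counter) Inc() { c.n++ }

// Builder sketches wsbuilder.Builder carrying a buildMetrics reference,
// so the enqueue counter is incremented in one place for every build path.
type Builder struct {
	buildMetrics *counter // may be nil, e.g. in tests
}

func (b *Builder) Build() {
	// ... create the workspace build job ...
	if b.buildMetrics != nil {
		b.buildMetrics.Inc() // one increment per enqueue attempt
	}
}

func main() {
	m := &counter{}
	b := &Builder{buildMetrics: m}
	b.Build()
	b.Build()
	fmt.Println(m.n) // 2
}
```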
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: Callum Styan <[email protected]>
The wsbuilder.Metrics were created but never registered with the prometheus registry in the production path. This meant the workspace_builds_enqueued_total counter was never exported. Register in the main server flow (not enablePrometheus) so metrics are always available, matching the pattern used by notifications. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Callum Styan <[email protected]>
This PR adds some metrics to help identify job enqueue rates and latencies. This work was initiated as a way to help reduce the cost of the observation/measurement itself for autostart scaletests, which impacts our ability to identify/reason about the load caused by autostart. See: coder/internal#1209
I've extended the metrics here to account for regular user-initiated builds, prebuilds, autostarts, etc. IMO there is still the question of whether we want or need the `transition` label, which is only present on workspace builds. Including it does lead to an increase in cardinality, and in the case of the histogram (when not using native histograms) that's at least a few extra series for every bucket. We could remove the transition label there but keep it on the counter. Additionally, the histogram currently observes latencies for other jobs, such as template builds/version imports; those do not have a transition type associated with them.
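The cardinality concern can be made concrete with rough arithmetic; the label counts below are illustrative assumptions, not the PR's actual values:

```go
package main

import "fmt"

// classicHistogramSeries is a back-of-the-envelope estimate of how many
// time series a classic (non-native) histogram produces: one series per
// bucket for each label combination, plus +Inf, _sum, and _count.
func classicHistogramSeries(labelCardinalities []int, buckets int) int {
	combos := 1
	for _, c := range labelCardinalities {
		combos *= c
	}
	return combos * (buckets + 3)
}

func main() {
	// Assumed label cardinalities: 2 provisioner types, 3 job types,
	// 3 transitions, 4 build reasons; 15 buckets as in the suggestion above.
	with := classicHistogramSeries([]int{2, 3, 3, 4}, 15)
	// Dropping the transition label removes a factor of 3.
	without := classicHistogramSeries([]int{2, 3, 4}, 15)
	fmt.Println(with, without) // 1296 432
}
```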
Tested briefly in a workspace, can see metric values like the following:
```
coderd_workspace_builds_enqueued_total{build_reason="autostart",provisioner_type="terraform",status="success",transition="start"} 1
coderd_provisioner_job_queue_wait_seconds_bucket{build_reason="autostart",job_type="workspace_build",provisioner_type="terraform",transition="start",le="0.025"} 1
```