feat: add provisioner job queue wait time histogram and jobs enqueued counter #21869
Conversation
Code Review
Reviewed the provisioner job queue metrics implementation. The changes add two new metrics:
- `coderd_provisioner_job_queue_wait_seconds` - histogram tracking job queue wait times
- `coderd_workspace_builds_enqueued_total` - counter tracking workspace build enqueue attempts
The implementation is generally solid with good test coverage. Found 3 issues:
Issues Found
- Potential nil pointer dereference in metrics recording - The code checks `job != nil` but doesn't verify `job.Provisioner` is valid before converting it to a string
- Missing metric recording for prebuild claims - The prebuild claim flow doesn't record the enqueue metric, creating inconsistency
- Histogram bucket configuration could be optimized - The bucket ranges have gaps that may not capture typical queue wait patterns effectively
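The nil-pointer concern in the first issue could be addressed with a defensive fallback, sketched here with hypothetical stand-in types (not the PR's actual code):

```go
package main

import "fmt"

// ProvisionerJob is a hypothetical stand-in for the real job type.
type ProvisionerJob struct {
	Provisioner string // e.g. "terraform"; may be empty on malformed rows
}

// provisionerLabel guards both the job pointer and its Provisioner field
// before producing a metric label value, falling back to "unknown".
func provisionerLabel(job *ProvisionerJob) string {
	if job == nil || job.Provisioner == "" {
		return "unknown"
	}
	return job.Provisioner
}

func main() {
	fmt.Println(provisionerLabel(nil))                                  // unknown
	fmt.Println(provisionerLabel(&ProvisionerJob{Provisioner: "terraform"})) // terraform
}
```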
Positive Aspects
- Comprehensive test coverage including both direct workspace builds and autostart scenarios
- Proper nil checks before accessing metrics objects
- Consistent use of the new `BuildReasonPrebuild` constant for prebuild tracking
- Native histogram support configured correctly
- Good separation of concerns with metrics recording in the appropriate layers
coderd/provisionerdserver/metrics.go (outdated)

```diff
 	Help: "Total number of workspace build enqueue attempts.",
 	}, []string{"provisioner_type", "build_reason", "transition", "status"}),
 }

 func (m *Metrics) Register(reg prometheus.Registerer) error {
 	if err := reg.Register(m.workspaceCreationTimings); err != nil {
 		return err
 	}
-	return reg.Register(m.workspaceClaimTimings)
+	if err := reg.Register(m.workspaceClaimTimings); err != nil {
+		return err
+	}
+	if err := reg.Register(m.jobQueueWait); err != nil {
```
The histogram buckets have some gaps that might not capture typical queue patterns well:
- Large gap between 10ms and 25ms (150% increase)
- Jump from 25ms to 50ms (100% increase)
- Big leap from 10s to 30s
Consider adding intermediate buckets for more granular measurement:
```go
Buckets: []float64{
	0.01,  // 10ms
	0.025, // 25ms
	0.05,  // 50ms
	0.1,   // 100ms
	0.25,  // 250ms
	0.5,   // 500ms
	1,     // 1s
	2.5,   // 2.5s
	5,     // 5s
	10,    // 10s
	20,    // 20s
	30,    // 30s
	60,    // 1m
	120,   // 2m
	300,   // 5m
},
```
We can probably get away without a 5m bucket here, and add another sub-100ms bucket?
I actually think we want stuff on the high end more. Many customers have builds that exceed 15 minutes; I've heard as high as 30 minutes. In that context, 5 minutes of waiting time is not a big deal, and they'd want to know whether 95% of builds start within 30m or whatever.
```go
	},
	audit.WorkspaceBuildBaggage{},
)
if c.provisionerdServerMetrics != nil && provisionerJob != nil {
```
The prebuild reconciliation records workspace build enqueued metrics, but the claim flow in claim.go doesn't. This creates inconsistency where some prebuild builds are tracked and others aren't.
Consider adding metric recording in the claim flow as well for complete coverage of all prebuild build paths.
AFAICT the path in claim.go does not actually result in the enqueuing of a job, it just updates the owner information for a workspace. After the call to Claim in workspace.go we go through the regular process of calling builder.Build where we do track the new metrics.
We could check claimedWorkspace at the point where we observe the metric value here, to set the build reason to prebuild claim instead of initiator (a regular user-initiated build).
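A rough sketch of that suggestion; all names here are hypothetical stand-ins, not the PR's actual code:

```go
package main

import "fmt"

// buildReasonLabel picks the metric's build_reason label at observation
// time: if the workspace was just claimed from the prebuild pool, report
// "prebuild_claim" instead of the generic database reason.
func buildReasonLabel(claimedWorkspace bool, dbReason string) string {
	if claimedWorkspace {
		return "prebuild_claim"
	}
	return dbReason
}

func main() {
	fmt.Println(buildReasonLabel(true, "initiator"))  // prebuild_claim
	fmt.Println(buildReasonLabel(false, "autostart")) // autostart
}
```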
All contributors have signed the CLA ✍️ ✅
Documentation Check: Updates Needed
The metrics documentation table needs to be updated for the new metrics. (Automated review via Coder Tasks)
coderd/provisionerdserver/metrics.go (outdated)

```go
// operations. This is distinct from database.BuildReason values since prebuilds
// use BuildReasonInitiator in the database but we want to track them separately
// in metrics.
const BuildReasonPrebuild = "prebuild"
```
You only use this for workspace_builds_enqueued_total but use the same label name in both cases. It might be confusing that we get prebuilds separated out in one case but not the other.
It would be really nice to get prebuild information for the queue time histogram because non-prebuilds are prioritized over prebuilds, so they'll likely have different distributions. You can determine whether a build is a prebuild via the initiator ID. Prebuilds are always 'c42fdf75-3097-471c-8c33-fb52454d81c0'
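A minimal sketch of the initiator-ID check described above; `isPrebuild` is a hypothetical helper name, and the UUID is the one quoted in the comment:

```go
package main

import "fmt"

// PrebuildsInitiatorID is the fixed initiator UUID used for builds created
// by the prebuilds system, per the reviewer's comment.
const PrebuildsInitiatorID = "c42fdf75-3097-471c-8c33-fb52454d81c0"

// isPrebuild reports whether a build was initiated by the prebuilds system,
// which could be used to label the queue-wait histogram as well.
func isPrebuild(initiatorID string) bool {
	return initiatorID == PrebuildsInitiatorID
}

func main() {
	fmt.Println(isPrebuild(PrebuildsInitiatorID)) // true
	fmt.Println(isPrebuild("some-user-uuid"))     // false
}
```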
coderd/provisionerdserver/metrics.go (outdated)

```go
	NativeHistogramZeroThreshold:    0,
	NativeHistogramMaxZeroThreshold: 0,
}, []string{"provisioner_type", "job_type", "transition", "build_reason"}),
workspaceBuildsEnqueued: prometheus.NewCounterVec(prometheus.CounterOpts{
```
I'm not sure this belongs here. Provisionerdserver doesn't enqueue jobs.
In all the cases where you call RecordWorkspaceBuildEnqueued, it's right after wsbuilder.Build. That's the common code we have for the business logic of creating a workspace build, so the builder should get a reference to this metric (call it buildMetrics?), and it should be responsible for incrementing it on every build.
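The suggested restructuring could look roughly like this; `Builder`, `buildMetrics`, and the stub `counter` type are hypothetical stand-ins (the real code would hold a `*prometheus.CounterVec`):

```go
package main

import "fmt"

// counter is a dependency-free stand-in for a prometheus counter.
type counter struct{ n int }

func (c *counter) Inc() { c.n++ }

// Builder sketches wsbuilder.Builder carrying a buildMetrics reference,
// so the enqueue counter is incremented in one place for every build path.
type Builder struct {
	buildMetrics *counter // may be nil, e.g. in tests
}

func (b *Builder) Build() {
	// ... create the workspace build job ...
	if b.buildMetrics != nil {
		b.buildMetrics.Inc() // one increment per enqueue attempt
	}
}

func main() {
	m := &counter{}
	b := &Builder{buildMetrics: m}
	b.Build()
	b.Build()
	fmt.Println(m.n) // 2
}
```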
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: Callum Styan <[email protected]>
The wsbuilder.Metrics were created but never registered with the prometheus registry in the production path. This meant the workspace_builds_enqueued_total counter was never exported. Register in the main server flow (not enablePrometheus) so metrics are always available, matching the pattern used by notifications. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Callum Styan <[email protected]>
This PR adds some metrics to help identify job enqueue rates and latencies. This work was initiated as a way to help reduce the cost of the observation/measurement itself for autostart scaletests, which impacts our ability to identify/reason about the load caused by autostart. See: coder/internal#1209
I've extended the metrics here to account for regular user-initiated builds, prebuilds, autostarts, etc. IMO there is still the question of whether we want or need the `transition` label, which is only present on workspace builds. Including it does lead to an increase in cardinality, and in the case of the histogram (when not using native histograms) that's at least a few extra series for every bucket. We could remove the transition label there but keep it on the counter. Additionally, the histogram currently observes latencies for other jobs, such as template builds/version imports; those do not have a transition type associated with them.
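The cardinality concern can be made concrete with rough arithmetic; the label counts below are illustrative assumptions, not the PR's actual values:

```go
package main

import "fmt"

// classicHistogramSeries is a back-of-the-envelope estimate of how many
// time series a classic (non-native) histogram produces: one series per
// bucket for each label combination, plus +Inf, _sum, and _count.
func classicHistogramSeries(labelCardinalities []int, buckets int) int {
	combos := 1
	for _, c := range labelCardinalities {
		combos *= c
	}
	return combos * (buckets + 3)
}

func main() {
	// Assumed label cardinalities: 2 provisioner types, 3 job types,
	// 3 transitions, 4 build reasons; 15 buckets as in the suggestion above.
	with := classicHistogramSeries([]int{2, 3, 3, 4}, 15)
	// Dropping the transition label removes a factor of 3.
	without := classicHistogramSeries([]int{2, 3, 4}, 15)
	fmt.Println(with, without) // 1296 432
}
```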
Tested briefly in a workspace, can see metric values like the following:
```
coderd_workspace_builds_enqueued_total{build_reason="autostart",provisioner_type="terraform",status="success",transition="start"} 1
coderd_provisioner_job_queue_wait_seconds_bucket{build_reason="autostart",job_type="workspace_build",provisioner_type="terraform",transition="start",le="0.025"} 1
```