
feat: add provisioner job queue wait time histogram and jobs enqueued counter#21869

Merged
cstyan merged 6 commits into main from callum/job-queue-metrics on Feb 12, 2026

Conversation

Contributor

@cstyan cstyan commented Feb 3, 2026

This PR adds some metrics to help identify job enqueue rates and latencies. This work was initiated as a way to help reduce the cost of the observation/measurement itself for autostart scaletests, which impacts our ability to identify/reason about the load caused by autostart. See: coder/internal#1209

I've extended the metrics here to account for regular user initiated builds, prebuilds, autostarts, etc. IMO there is still the question here of whether we want to include or need the transition label, which is only present on workspace builds. Including it does lead to an increase in cardinality, and in the case of the histogram (when not using native histograms) that's at least a few extra series for every bucket. We could remove the transition label there but keep it on the counter.
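To put a rough number on that cardinality point, here is a back-of-envelope sketch (the 12-bucket and 3-transition counts are assumed figures, not necessarily this PR's configuration): a classic (non-native) histogram exports one series per bucket plus +Inf, _sum, and _count, and every extra transition value multiplies all of them.

```go
package main

import "fmt"

// seriesPerHistogram counts the time series a classic Prometheus
// histogram exports per label combination: one per finite bucket,
// plus the +Inf bucket, _sum, and _count.
func seriesPerHistogram(buckets int) int {
	return buckets + 3
}

func main() {
	buckets := 12    // assumed bucket count
	transitions := 3 // e.g. start, stop, delete
	fmt.Println(seriesPerHistogram(buckets))               // series per label set without transition
	fmt.Println(seriesPerHistogram(buckets) * transitions) // series after adding the transition label
}
```

So dropping the transition label from the histogram (while keeping it on the counter) would shrink the histogram's footprint by roughly that multiplier.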

Additionally, the histogram is currently observing latencies for other jobs, such as template builds/version imports, those do not have a transition type associated with them.

Tested briefly in a workspace, can see metric values like the following:

  • coderd_workspace_builds_enqueued_total{build_reason="autostart",provisioner_type="terraform",status="success",transition="start"} 1
  • coderd_provisioner_job_queue_wait_seconds_bucket{build_reason="autostart",job_type="workspace_build",provisioner_type="terraform",transition="start",le="0.025"} 1
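For context on the le="0.025" sample above: Prometheus histograms are cumulative, so an observation is counted in every bucket whose upper bound is at least the observed value. A minimal sketch of that semantics (the helper is illustrative, not client-library code):

```go
package main

import "fmt"

// bucketsLE mirrors the cumulative semantics of a Prometheus histogram:
// an observation is counted in every bucket whose upper bound le is
// greater than or equal to the observed value.
func bucketsLE(bounds []float64, value float64) []float64 {
	var hit []float64
	for _, le := range bounds {
		if value <= le {
			hit = append(hit, le)
		}
	}
	return hit
}

func main() {
	bounds := []float64{0.01, 0.025, 0.05, 0.1}
	// A 20ms queue wait increments the 0.025, 0.05, and 0.1 buckets
	// (plus +Inf, omitted here), consistent with the le="0.025" sample above.
	fmt.Println(bucketsLE(bounds, 0.020))
}
```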

Contributor

@coder-tasks coder-tasks bot left a comment

Code Review

Reviewed the provisioner job queue metrics implementation. The changes add two new metrics:

  1. coderd_provisioner_job_queue_wait_seconds - histogram tracking job queue wait times
  2. coderd_workspace_builds_enqueued_total - counter tracking workspace build enqueue attempts

The implementation is generally solid with good test coverage. Found 3 issues:

Issues Found

  1. Potential nil pointer dereference in metrics recording - The code checks job != nil but doesn't verify job.Provisioner is valid before converting to string
  2. Missing metric recording for prebuild claims - The prebuild claim flow doesn't record the enqueue metric, creating inconsistency
  3. Histogram bucket configuration could be optimized - The bucket ranges have gaps that may not capture typical queue wait patterns effectively

Positive Aspects

  • Comprehensive test coverage including both direct workspace builds and autostart scenarios
  • Proper nil checks before accessing metrics objects
  • Consistent use of the new BuildReasonPrebuild constant for prebuild tracking
  • Native histogram support configured correctly
  • Good separation of concerns with metrics recording in the appropriate layers

Comment on lines +133 to +145

	Help: "Total number of workspace build enqueue attempts.",
	}, []string{"provisioner_type", "build_reason", "transition", "status"}),
}
}

func (m *Metrics) Register(reg prometheus.Registerer) error {
	if err := reg.Register(m.workspaceCreationTimings); err != nil {
		return err
	}
	if err := reg.Register(m.workspaceClaimTimings); err != nil {
		return err
	}
	if err := reg.Register(m.jobQueueWait); err != nil {

Contributor

The histogram buckets have some gaps that might not capture typical queue patterns well:

  • Large gap between 10ms and 25ms (150% increase)
  • Jump from 25ms to 50ms (100% increase)
  • Big leap from 10s to 30s

Consider adding intermediate buckets for more granular measurement:

Suggested change

	Buckets: []float64{
		0.01,  // 10ms
		0.025, // 25ms
		0.05,  // 50ms
		0.1,   // 100ms
		0.25,  // 250ms
		0.5,   // 500ms
		1,     // 1s
		2.5,   // 2.5s
		5,     // 5s
		10,    // 10s
		20,    // 20s
		30,    // 30s
		60,    // 1m
		120,   // 2m
		300,   // 5m
	},

Contributor Author

We can probably get away without a 5m bucket here, and add another sub-100ms bucket?

Contributor

I actually think we want stuff on the high end more. Many customers have builds that exceed 15 minutes; I've heard as high as 30 minutes. In that context, 5 minutes of waiting time is not a big deal, and they'd want to know whether 95% of builds start within 30m or whatever.

},
audit.WorkspaceBuildBaggage{},
)
if c.provisionerdServerMetrics != nil && provisionerJob != nil {
Contributor

The prebuild reconciliation records workspace build enqueued metrics, but the claim flow in claim.go doesn't. This creates inconsistency where some prebuild builds are tracked and others aren't.

Consider adding metric recording in the claim flow as well for complete coverage of all prebuild build paths.

Contributor Author

AFAICT the path in claim.go does not actually result in the enqueuing of a job, it just updates the owner information for a workspace. After the call to Claim in workspace.go we go through the regular process of calling builder.Build where we do track the new metrics.

We could check claimedWorkspace at the point where we observe the metric value here, and set the build reason to prebuild claim instead of initiator (a regular user-initiated build).
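That suggestion might look roughly like the following; reducing claimedWorkspace to a boolean and the "prebuild_claim" label value are illustrative choices, not the merged code:

```go
package main

import "fmt"

// buildReasonFor picks the metric label for a workspace build: builds
// that follow a prebuild claim are reported separately from ordinary
// user-initiated builds.
func buildReasonFor(claimed bool) string {
	if claimed {
		return "prebuild_claim" // hypothetical label value
	}
	return "initiator"
}

func main() {
	fmt.Println(buildReasonFor(true))
	fmt.Println(buildReasonFor(false))
}
```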


github-actions bot commented Feb 3, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

Contributor

coder-tasks bot commented Feb 3, 2026

Documentation Check

Updates Needed

  • Run make docs/admin/integrations/prometheus.md to regenerate the metrics table with the new metrics:
    • coderd_provisioner_job_queue_wait_seconds - Histogram tracking time from job creation to acquisition by provisioner daemon
    • coderd_workspace_builds_enqueued_total - Counter for workspace build enqueue attempts

The metrics documentation table in docs/admin/integrations/prometheus.md has been updated in commit 6d88dc9 to include the new provisioner queue metrics with their labels (provisioner_type, job_type, transition, build_reason, status).


Automated review via Coder Tasks

@cstyan cstyan force-pushed the callum/job-queue-metrics branch 3 times, most recently from 427db01 to 12d06af on February 3, 2026 at 18:45
@cstyan cstyan requested a review from spikecurtis February 4, 2026 07:24

// operations. This is distinct from database.BuildReason values since prebuilds
// use BuildReasonInitiator in the database but we want to track them separately
// in metrics.
const BuildReasonPrebuild = "prebuild"
Contributor

You only use this for workspace_builds_enqueued_total but use the same label name in both cases. It might be confusing that we get prebuilds separated out in one case but not the other.

It would be really nice to get prebuild information for the queue time histogram because non-prebuilds are prioritized over prebuilds, so they'll likely have different distributions. You can determine whether a build is a prebuild via the initiator ID. Prebuilds are always 'c42fdf75-3097-471c-8c33-fb52454d81c0'
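A sketch of splitting prebuilds out by initiator ID as suggested (the UUID is taken from the comment above; the helper name and shape are hypothetical):

```go
package main

import "fmt"

// PrebuildsInitiatorID is the fixed system user that initiates prebuilds,
// per the review comment above.
const PrebuildsInitiatorID = "c42fdf75-3097-471c-8c33-fb52454d81c0"

// histogramBuildReason derives the build_reason label for the queue wait
// histogram, separating prebuilds from other builds by initiator ID.
func histogramBuildReason(initiatorID, dbReason string) string {
	if initiatorID == PrebuildsInitiatorID {
		return "prebuild"
	}
	return dbReason
}

func main() {
	fmt.Println(histogramBuildReason(PrebuildsInitiatorID, "initiator"))
	fmt.Println(histogramBuildReason("some-user-id", "autostart"))
}
```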

NativeHistogramZeroThreshold: 0,
NativeHistogramMaxZeroThreshold: 0,
}, []string{"provisioner_type", "job_type", "transition", "build_reason"}),
workspaceBuildsEnqueued: prometheus.NewCounterVec(prometheus.CounterOpts{
Contributor

I'm not sure this belongs here. Provisionerdserver doesn't enqueue jobs.

In all the cases where you call RecordWorkspaceBuildEnqueued, it's right after wsbuilder.Build. That's the common code we have for the business logic of creating a workspace build, so the builder should get a reference to this metric (call it buildMetrics?), and it should be responsible for incrementing it on every build.
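The refactor described here might be sketched as follows; the BuildMetrics interface and field names are hypothetical, not the actual wsbuilder API:

```go
package main

import "fmt"

// BuildMetrics is a hypothetical narrow interface so wsbuilder can record
// the counter without depending on provisionerdserver directly.
type BuildMetrics interface {
	RecordWorkspaceBuildEnqueued(provisionerType, buildReason, transition, status string)
}

// fakeMetrics counts calls; a real implementation would wrap the CounterVec.
type fakeMetrics struct{ calls int }

func (f *fakeMetrics) RecordWorkspaceBuildEnqueued(_, _, _, _ string) { f.calls++ }

// Builder stands in for wsbuilder.Builder holding the injected metrics.
type Builder struct{ buildMetrics BuildMetrics }

// Build records the enqueue attempt on every call, success or failure,
// so all build paths are covered in one place rather than at each caller.
func (b *Builder) Build(reason, transition, status string) {
	if b.buildMetrics != nil { // nil-safe for tests that don't set metrics
		b.buildMetrics.RecordWorkspaceBuildEnqueued("terraform", reason, transition, status)
	}
}

func main() {
	m := &fakeMetrics{}
	b := &Builder{buildMetrics: m}
	b.Build("autostart", "start", "success")
	b.Build("initiator", "start", "failure")
	fmt.Println(m.calls)
}
```

Injecting the metric this way also keeps the increment next to the enqueue itself, which is the consistency the earlier prebuild-claim comment was after.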

@cstyan cstyan force-pushed the callum/job-queue-metrics branch 3 times, most recently from afbbea2 to 4f81dca on February 10, 2026 at 17:59
@cstyan cstyan force-pushed the callum/job-queue-metrics branch from 4f81dca to 8cd349b on February 10, 2026 at 22:47
The wsbuilder.Metrics were created but never registered with the
prometheus registry in the production path. This meant the
workspace_builds_enqueued_total counter was never exported.

Register in the main server flow (not enablePrometheus) so metrics
are always available, matching the pattern used by notifications.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@cstyan cstyan force-pushed the callum/job-queue-metrics branch from 8cd349b to 4df44f1 on February 11, 2026 at 02:31
@cstyan cstyan merged commit 5f3be6b into main Feb 12, 2026
29 checks passed
@cstyan cstyan deleted the callum/job-queue-metrics branch February 12, 2026 21:40
@github-actions github-actions bot locked and limited conversation to collaborators Feb 12, 2026