Summary
The `coderd_template_workspace_build_duration_seconds` histogram (added in #21739) records spurious, inflated observations when a devcontainer is rebuilt within a running workspace. Rebuilding a devcontainer causes the sub-agent to restart, which triggers the metric to re-emit with a duration calculated from the original build's creation time to the sub-agent's new `ready_at`, producing a massively inflated value.
This is the primary real-world trigger for this bug: devcontainer rebuilds are a normal, frequent user action.
Root Cause
Two issues compound:
1. `sync.Once` is per-connection, not per-build
`LifecycleAPI` holds an `emitMetricsOnce sync.Once` and is embedded in `API`, which is "instantiated once per agent connection." When a devcontainer is rebuilt, the sub-agent reconnects via a new DRPC connection → new `LifecycleAPI` → fresh `sync.Once` → the metric can fire again for the same build.
2. SQL uses MAX(wa.ready_at) across all agents
The `GetWorkspaceBuildMetricsByResourceID` query computes `MAX(wa.ready_at)` across all `workspace_agents` rows joined to the build. It does not filter out sub-agents (`parent_id IS NOT NULL`). When a devcontainer rebuild occurs:
- The sub-agent transitions to `starting` → its `ready_at` is cleared
- The sub-agent transitions to `ready` → `ready_at` is set to now
- `emitMetricsOnce.Do()` fires on the new `LifecycleAPI` struct
- `MAX(wa.ready_at)` picks up the rebuilt sub-agent's new timestamp
- `duration = MAX(ready_at) - build.created_at` = inflated value
How devcontainer rebuild triggers this
When a user rebuilds a devcontainer (`handleDevcontainerRecreate`):
- The existing sub-agent process is stopped
- The devcontainer is recreated via `devcontainer up`
- A new sub-agent is injected into the container (`maybeInjectSubAgentIntoContainerLocked`)
- The sub-agent runs `coder agent` inside the container, connecting back to coderd
- This new connection gets a fresh `LifecycleAPI` with a new `sync.Once`
- The sub-agent goes through the `starting` → `ready` lifecycle, triggering the metric
The same bug also applies to any agent restart within a build (e.g., agent process crash/restart), but devcontainer rebuilds are the most common real-world trigger.
Reproduction
- Create and start a workspace with a devcontainer. Wait for the build duration metric to emit (`count=1`).
- Query metrics — note the initial observation (e.g., `count=1`, `sum=66.58s`).
- Rebuild the devcontainer (or kill the agent process within the workspace to simulate).
- Wait for the agent to reach `ready` again.
- Query metrics — `count` has incremented with an inflated duration.
Observed data
| Event | count | sum | This observation | Notes |
|---|---|---|---|---|
| Initial build | 1 | 66.58s | 66.58s | Legitimate |
| Workspace restart (new build) | 2 | 91.51s | 24.93s | Legitimate (different build) |
| Agent kill within build #2 | 3 | 576.22s | 484.70s | Bug — same build, ~8min after build creation |
The third observation is entirely spurious: it measures the time from the build's `created_at` to the `ready_at` set when the agent restarted, not an actual build duration.
Possible Fixes
- Track emission per-build rather than per-connection (e.g., a build-scoped deduplication map keyed by build ID) so agent reconnections/sub-agent restarts don't re-emit.
- Only emit on the first time all agents reach a terminal state for a given build, ignoring subsequent agent restarts or devcontainer rebuilds.
- Filter sub-agents from the metric query — add `AND wa.parent_id IS NULL` to the SQL, since sub-agent lifecycle is independent of the build's initial readiness.
- Don't update `ready_at` on restart (or use a separate column) so the metric calculation reflects the original build completion time.
Created on behalf of @rowansmithau