bug: coderd_template_workspace_build_duration_seconds emits inflated duplicate observation on devcontainer rebuild #22696

@blinkagent

Summary

The coderd_template_workspace_build_duration_seconds histogram (added in #21739) records spurious, inflated observations when a devcontainer is rebuilt within a running workspace. Rebuilding a devcontainer causes the sub-agent to restart, which triggers the metric to re-emit with a duration calculated from the original build creation time to the sub-agent's new ready_at — producing a massively inflated value.

This is the primary real-world trigger for this bug: devcontainer rebuilds are a normal, frequent user action.

Root Cause

Two issues compound:

1. sync.Once is per-connection, not per-build

LifecycleAPI holds emitMetricsOnce sync.Once, and is embedded in API which is "instantiated once per agent connection." When a devcontainer is rebuilt, the sub-agent reconnects via a new DRPC connection → new LifecycleAPI → fresh sync.Once → the metric can fire again for the same build.
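The scoping problem can be illustrated with a minimal sketch. The `lifecycleAPI` type, `reportReady`, and `simulateConnections` below are hypothetical stand-ins, not the actual coderd code; only the idea of a `sync.Once` field living on a per-connection struct is taken from the description above.

```go
package main

import (
	"fmt"
	"sync"
)

// lifecycleAPI is a hypothetical stand-in for the real LifecycleAPI. Only the
// sync.Once field mirrors what this issue describes.
type lifecycleAPI struct {
	emitMetricsOnce sync.Once
}

// reportReady simulates an agent reporting "ready". The emit callback runs at
// most once per lifecycleAPI value.
func (a *lifecycleAPI) reportReady(emit func()) {
	a.emitMetricsOnce.Do(emit)
}

// simulateConnections models n agent connections that all belong to the same
// build, each with its own lifecycleAPI, and returns the number of emissions.
func simulateConnections(n int) int {
	observations := 0
	for i := 0; i < n; i++ {
		api := &lifecycleAPI{} // new connection -> fresh sync.Once
		api.reportReady(func() { observations++ })
		api.reportReady(func() { observations++ }) // suppressed: same connection
	}
	return observations
}

func main() {
	fmt.Println(simulateConnections(1)) // 1
	fmt.Println(simulateConnections(2)) // 2: the reconnect emitted again
}
```

The guard deduplicates within one connection but has no memory across connections, so every reconnect for the same build can emit once more.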

2. SQL uses MAX(wa.ready_at) across all agents

The GetWorkspaceBuildMetricsByResourceID query computes MAX(wa.ready_at) across all workspace_agents joined to the build. It does not filter out sub-agents (parent_id IS NOT NULL). When a devcontainer rebuild occurs:

  1. The sub-agent transitions to starting → its ready_at is cleared
  2. The sub-agent transitions to ready → its ready_at is set to now
  3. emitMetricsOnce.Do() fires on the new LifecycleAPI struct
  4. MAX(wa.ready_at) picks up the rebuilt sub-agent's new timestamp
  5. duration = MAX(ready_at) - build.created_at = inflated value
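The effect of taking the maximum across all agents can be sketched in Go. This is an illustration of the arithmetic only, not the real query: the `agent` struct, `buildDuration`, and all timestamps are made up for the example, and `isSubAgent` stands in for `parent_id IS NOT NULL`.

```go
package main

import (
	"fmt"
	"time"
)

// agent is a hypothetical, simplified view of a workspace_agents row.
type agent struct {
	isSubAgent bool      // stands in for parent_id IS NOT NULL
	readyAt    time.Time // stands in for ready_at
}

// buildDuration mimics MAX(ready_at) - created_at, optionally skipping
// sub-agents. It returns 0 when no agent qualifies.
func buildDuration(createdAt time.Time, agents []agent, includeSubAgents bool) time.Duration {
	var latest time.Time
	for _, a := range agents {
		if a.isSubAgent && !includeSubAgents {
			continue
		}
		if a.readyAt.After(latest) {
			latest = a.readyAt
		}
	}
	if latest.IsZero() {
		return 0
	}
	return latest.Sub(createdAt)
}

// exampleDurations uses invented timestamps: the parent agent is ready 25s
// after build creation, and a rebuilt sub-agent is ready again after 480s.
func exampleDurations() (withSubAgents, parentOnly time.Duration) {
	created := time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)
	agents := []agent{
		{isSubAgent: false, readyAt: created.Add(25 * time.Second)},
		{isSubAgent: true, readyAt: created.Add(480 * time.Second)},
	}
	return buildDuration(created, agents, true), buildDuration(created, agents, false)
}

func main() {
	withSubAgents, parentOnly := exampleDurations()
	fmt.Println(withSubAgents) // 8m0s
	fmt.Println(parentOnly)    // 25s
}
```

With these invented values, including the sub-agent turns a 25s build into an 8 minute observation, which is the same shape as the inflated value reported under "Observed data".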

How devcontainer rebuild triggers this

When a user rebuilds a devcontainer (handleDevcontainerRecreate):

  1. The existing sub-agent process is stopped
  2. The devcontainer is recreated via devcontainer up
  3. A new sub-agent is injected into the container (maybeInjectSubAgentIntoContainerLocked)
  4. The sub-agent runs coder agent inside the container, connecting back to coderd
  5. This new connection gets a fresh LifecycleAPI with a new sync.Once
  6. The sub-agent goes through the starting → ready lifecycle, triggering the metric

The same bug also applies to any agent restart within a build (e.g., agent process crash/restart), but devcontainer rebuilds are the most common real-world trigger.

Reproduction

  1. Create and start a workspace with a devcontainer. Wait for the build duration metric to emit (count=1).
  2. Query metrics — note the initial observation (e.g., count=1, sum=66.58s).
  3. Rebuild the devcontainer (or kill the agent process within the workspace to simulate).
  4. Wait for the agent to reach ready again.
  5. Query metrics — count has incremented with an inflated duration.

Observed data

Event                         | count | sum     | This observation | Notes
Initial build                 | 1     | 66.58s  | 66.58s           | Legitimate
Workspace restart (new build) | 2     | 91.51s  | 24.93s           | Legitimate (different build)
Agent kill within build #2    | 3     | 576.22s | 484.70s          | Bug: same build, ~8 min after build creation

The third observation is entirely spurious — it's the time from the build's created_at to the agent's restart ready_at, not an actual build duration.

Possible Fixes

  • Track emission per-build rather than per-connection (e.g., a build-scoped deduplication map keyed by build ID) so agent reconnections/sub-agent restarts don't re-emit.
  • Only emit on the first time all agents reach a terminal state for a given build, ignoring subsequent agent restarts or devcontainer rebuilds.
  • Filter sub-agents from the metric query — add AND wa.parent_id IS NULL to the SQL, since sub-agent lifecycle is independent of the build's initial readiness.
  • Don't update ready_at on restart (or use a separate column) so the metric calculation reflects the original build completion time.
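As a rough sketch of the first option, a build-scoped guard could look like the following. Everything here is hypothetical (`buildEmitTracker`, `markEmitted`, and the string build IDs are illustration only), and it assumes a single coderd process. A real fix would also need to handle multiple replicas, process restarts, and eviction of old build IDs.

```go
package main

import (
	"fmt"
	"sync"
)

// buildEmitTracker is a hypothetical build-scoped replacement for the
// per-connection sync.Once. It is shared by all agent connections.
type buildEmitTracker struct {
	mu      sync.Mutex
	emitted map[string]struct{} // keyed by build ID
}

func newBuildEmitTracker() *buildEmitTracker {
	return &buildEmitTracker{emitted: make(map[string]struct{})}
}

// markEmitted reports true only the first time it is called for a build ID,
// so the caller should observe the histogram only when it returns true.
func (t *buildEmitTracker) markEmitted(buildID string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, seen := t.emitted[buildID]; seen {
		return false
	}
	t.emitted[buildID] = struct{}{}
	return true
}

func main() {
	tracker := newBuildEmitTracker()
	fmt.Println(tracker.markEmitted("build-2")) // true: first ready for this build
	fmt.Println(tracker.markEmitted("build-2")) // false: sub-agent reconnect, same build
	fmt.Println(tracker.markEmitted("build-3")) // true: a new build emits again
}
```

Keying on the build ID means a reconnect or devcontainer rebuild within the same build is ignored, while a genuine new build still records its own observation.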

Created on behalf of @rowansmithau
