Summary
The `coderd_template_workspace_build_duration_seconds` histogram (added in #21739) records spurious, inflated observations when a devcontainer is rebuilt within a running workspace. Rebuilding a devcontainer causes the sub-agent to restart, which triggers the metric to re-emit with a duration calculated from the original build's creation time to the sub-agent's new `ready_at`, producing a massively inflated value.
This is the primary real-world trigger for this bug: devcontainer rebuilds are a normal, frequent user action.
Root Cause
Two issues compound:
1. `sync.Once` is per-connection, not per-build
`LifecycleAPI` holds an `emitMetricsOnce sync.Once` and is embedded in `API`, which is "instantiated once per agent connection." When a devcontainer is rebuilt, the sub-agent reconnects via a new DRPC connection → new `LifecycleAPI` → fresh `sync.Once` → the metric can fire again for the same build.
2. SQL uses MAX(wa.ready_at) across all agents
The `GetWorkspaceBuildMetricsByResourceID` query computes `MAX(wa.ready_at)` across all `workspace_agents` rows joined to the build. It does not filter out sub-agents (`parent_id IS NOT NULL`). When a devcontainer rebuild occurs:
- The sub-agent transitions to `starting` → its `ready_at` is cleared
- The sub-agent transitions to `ready` → `ready_at` is set to now
- `emitMetricsOnce.Do()` fires on the new `LifecycleAPI` struct
- `MAX(wa.ready_at)` picks up the rebuilt sub-agent's new timestamp
- `duration = MAX(ready_at) - build.created_at` = inflated value
How devcontainer rebuild triggers this
When a user rebuilds a devcontainer (`handleDevcontainerRecreate`):
- The existing sub-agent process is stopped
- The devcontainer is recreated via `devcontainer up`
- A new sub-agent is injected into the container (`maybeInjectSubAgentIntoContainerLocked`)
- The sub-agent runs `coder agent` inside the container, connecting back to coderd
- This new connection gets a fresh `LifecycleAPI` with a new `sync.Once`
- The sub-agent goes through the `starting` → `ready` lifecycle, triggering the metric
The same bug also applies to any agent restart within a build (e.g., agent process crash/restart), but devcontainer rebuilds are the most common real-world trigger.
Reproduction
- Create and start a workspace with a devcontainer. Wait for the build duration metric to emit (`count=1`).
- Query metrics — note the initial observation (e.g., `count=1`, `sum=66.58s`).
- Rebuild the devcontainer (or kill the agent process within the workspace to simulate).
- Wait for the agent to reach `ready` again.
- Query metrics — `count` has incremented with an inflated duration.
Observed data
| Event | count | sum | This observation | Notes |
|---|---|---|---|---|
| Initial build | 1 | 66.58s | 66.58s | Legitimate |
| Workspace restart (new build) | 2 | 91.51s | 24.93s | Legitimate (different build) |
| Agent kill within build #2 | 3 | 576.22s | 484.70s | Bug — same build, ~8min after build creation |
The third observation is entirely spurious: it measures the time from the build's `created_at` to the `ready_at` set when the agent restarted, not an actual build duration.
Possible Fixes
- Track emission per-build rather than per-connection (e.g., a build-scoped deduplication map keyed by build ID) so agent reconnections/sub-agent restarts don't re-emit.
- Only emit on the first time all agents reach a terminal state for a given build, ignoring subsequent agent restarts or devcontainer rebuilds.
- Filter sub-agents from the metric query — add `AND wa.parent_id IS NULL` to the SQL, since sub-agent lifecycle is independent of the build's initial readiness.
- Don't update `ready_at` on restart (or use a separate column) so the metric calculation reflects the original build completion time.
Created on behalf of @rowansmithau