fix: make boundary usage telemetry collection atomic by zedkipp · Pull Request #21907 · coder/coder

zedkipp · 2026-02-03T23:25:44Z

Replace separate GetBoundaryUsageSummary and ResetBoundaryUsageStats calls with a single atomic GetAndResetBoundaryUsageSummary query. This uses DELETE...RETURNING in a common table expression to ensure the rows we sum are exactly the rows we delete, eliminating the race condition where a replica could flush boundary usage stats between the read and reset operations.

The race could cause one flush interval (1 minute) of usage data to be lost when a replica flushed between GetBoundaryUsageSummary and ResetBoundaryUsageStats, as the newly written data would be deleted before the next telemetry collection.

#21770

Replace separate GetBoundaryUsageSummary and ResetBoundaryUsageStats calls with a single atomic GetAndResetBoundaryUsageSummary query. This uses DELETE...RETURNING in a common table expression to ensure the rows we sum are exactly the rows we delete, eliminating the race condition where a replica could flush stats between the read and reset operations. The race could cause one flush interval (1 minute) of usage data to be lost when a replica flushed between GetBoundaryUsageSummary and ResetBoundaryUsageStats, as the newly written data would be deleted before the next telemetry collection.

hugodutka · 2026-02-05T11:59:59Z

I added a temporary stress test to check if UpsertBoundaryUsageStats and GetAndResetBoundaryUsageSummary could conflict with one another, and it looks like they can run into a deadlock if used concurrently:

$ go test ./coderd/telemetry -run TestTelemetry_BoundaryUsageSummary
--- FAIL: TestTelemetry_BoundaryUsageSummary (0.00s)
    --- FAIL: TestTelemetry_BoundaryUsageSummary/ConcurrentGetAndResetWithWriters (5.77s)
        t.go:111: 2026-02-05 11:50:11.264 [debu]  pubsub: pubsub dialing postgres  network=tcp  address=127.0.0.1:5432  timeout_ms=0
        t.go:111: 2026-02-05 11:50:11.264 [debu]  pubsub: pubsub postgres TCP connection established  network=tcp  address=127.0.0.1:5432  timeout_ms=0  elapsed_ms=0
        t.go:111: 2026-02-05 11:50:11.267 [debu]  pubsub: pubsub connected to postgres
        t.go:111: 2026-02-05 11:50:11.267 [debu]  pubsub: pubsub has started
        telemetry_test.go:1056: 
                Error Trace:    /home/coder/coder/coderd/telemetry/telemetry_test.go:1056
                Error:          Received unexpected error:
                                pq: deadlock detected
                Test:           TestTelemetry_BoundaryUsageSummary/ConcurrentGetAndResetWithWriters
        t.go:111: 2026-02-05 11:50:16.932 [info]  pubsub: pubsub is closing
        t.go:111: 2026-02-05 11:50:16.932 [info]  pubsub: pubsub listen stopped receiving notify
        t.go:111: 2026-02-05 11:50:16.932 [debu]  pubsub: pubsub closed
FAIL
FAIL    github.com/coder/coder/v2/coderd/telemetry      5.814s
FAIL

I'm not a deep PG expert, so I asked Codex to explain what's causing the deadlock:

GetAndResetBoundaryUsageSummary runs DELETE FROM boundary_usage_stats … RETURNING (no WHERE), which takes row locks as it scans/deletes.

Writers are doing INSERT … ON CONFLICT DO UPDATE on the same rows, which takes a unique index lock and then waits on the row lock.

It’s possible for the DELETE to hold the row lock while waiting on an index lock held by the UPSERT, and the UPSERT is waiting on the row lock → cycle → Postgres aborts one
transaction with pq: deadlock detected (SQLSTATE 40P01).

I think one way to avoid this would be to run both UpsertBoundaryUsageStats and GetAndResetBoundaryUsageSummary in transactions and take explicit exclusive table locks on the boundary_usage_stats table at transaction start.

UpsertBoundaryUsageStats (INSERT...ON CONFLICT DO UPDATE) and GetAndResetBoundaryUsageSummary (DELETE...RETURNING) can race during telemetry period cutover. Without serialization, an upsert concurrent with the delete could lose data (deleted right after being written) or commit after the delete (miscounted in the next period). Both operations now acquire LockIDBoundaryUsageStats within a transaction to ensure a clean cutover.

zedkipp · 2026-02-05T22:17:44Z

I ended up going down a bit of a rabbit hole on how that deadlock happens. I think it would not happen in practice because the table has N replica writers (each writing to a single unique row), and 1 replica doing the read+delete. I think they key part is there are never more than 1 side locking multiple rows within a single transaction i.e. only GetAndResetBoundaryUsageSummary() locks multiple rows.

The stress test has 5 writers and 5 read+delete'ers, which exposes a deadlock condition. If I set readerCount in the test to 1, the deadlock doesn't happen.

That said, I think the use of an advisory lock is still needed because the row/table locks acquired by INSERT ... ON CONFLICT DO UPDATE and DELETE are not strong enough to create a clean cutover between periods. If an upsert runs during the delete, the upserted data could be lost (deleted right after being written) or it might commit after the delete (counted in the next period unexpectedly).

github-actions bot assigned zedkipp Feb 3, 2026

zedkipp requested review from evgeniy-scherbina February 4, 2026 20:52

zedkipp marked this pull request as ready for review February 4, 2026 20:52

zedkipp requested review from hugodutka February 4, 2026 20:52

hugodutka approved these changes Feb 6, 2026

View reviewed changes

zedkipp merged commit a31e476 into main Feb 6, 2026
27 of 28 checks passed

zedkipp deleted the zedkipp/snapshot-race branch February 6, 2026 16:52

github-actions bot locked and limited conversation to collaborators Feb 6, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: make boundary usage telemetry collection atomic#21907

fix: make boundary usage telemetry collection atomic#21907
zedkipp merged 2 commits intomainfrom
zedkipp/snapshot-race

zedkipp commented Feb 3, 2026 •

edited

Loading

Uh oh!

hugodutka commented Feb 5, 2026 •

edited

Loading

Uh oh!

zedkipp commented Feb 5, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

zedkipp commented Feb 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hugodutka commented Feb 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zedkipp commented Feb 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

zedkipp commented Feb 3, 2026 •

edited

Loading

hugodutka commented Feb 5, 2026 •

edited

Loading

zedkipp commented Feb 5, 2026 •

edited

Loading