
fix: make boundary usage telemetry collection atomic#21907

Merged
zedkipp merged 2 commits into main from zedkipp/snapshot-race
Feb 6, 2026
Conversation

@zedkipp
Contributor

@zedkipp zedkipp commented Feb 3, 2026

Replace separate GetBoundaryUsageSummary and ResetBoundaryUsageStats calls with a single atomic GetAndResetBoundaryUsageSummary query. This uses DELETE...RETURNING in a common table expression to ensure the rows we sum are exactly the rows we delete, eliminating the race condition where a replica could flush boundary usage stats between the read and reset operations.

The race could cause one flush interval (1 minute) of usage data to be lost when a replica flushed between GetBoundaryUsageSummary and ResetBoundaryUsageStats, as the newly written data would be deleted before the next telemetry collection.

#21770
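The shape of the atomic query can be sketched as follows (a minimal sketch — the column names `user_id` and `usage_seconds` and the grouping are illustrative assumptions, not the actual schema):

```sql
-- Atomically read and clear the stats: the data-modifying CTE's
-- DELETE ... RETURNING guarantees the rows we aggregate are exactly
-- the rows we remove, with no window for a concurrent flush in between.
WITH deleted AS (
    DELETE FROM boundary_usage_stats
    RETURNING user_id, usage_seconds
)
SELECT user_id, SUM(usage_seconds) AS total_usage_seconds
FROM deleted
GROUP BY user_id;
```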

Replace separate GetBoundaryUsageSummary and ResetBoundaryUsageStats
calls with a single atomic GetAndResetBoundaryUsageSummary query. This
uses DELETE...RETURNING in a common table expression to ensure the rows
we sum are exactly the rows we delete, eliminating the race condition
where a replica could flush stats between the read and reset operations.

The race could cause one flush interval (1 minute) of usage data to be
lost when a replica flushed between GetBoundaryUsageSummary and
ResetBoundaryUsageStats, as the newly written data would be deleted before
the next telemetry collection.
@zedkipp zedkipp marked this pull request as ready for review February 4, 2026 20:52
@zedkipp zedkipp requested review from hugodutka February 4, 2026 20:52
@hugodutka
Contributor

hugodutka commented Feb 5, 2026

I added a temporary stress test to check if UpsertBoundaryUsageStats and GetAndResetBoundaryUsageSummary could conflict with one another, and it looks like they can run into a deadlock if used concurrently:

$ go test ./coderd/telemetry -run TestTelemetry_BoundaryUsageSummary
--- FAIL: TestTelemetry_BoundaryUsageSummary (0.00s)
    --- FAIL: TestTelemetry_BoundaryUsageSummary/ConcurrentGetAndResetWithWriters (5.77s)
        t.go:111: 2026-02-05 11:50:11.264 [debu]  pubsub: pubsub dialing postgres  network=tcp  address=127.0.0.1:5432  timeout_ms=0
        t.go:111: 2026-02-05 11:50:11.264 [debu]  pubsub: pubsub postgres TCP connection established  network=tcp  address=127.0.0.1:5432  timeout_ms=0  elapsed_ms=0
        t.go:111: 2026-02-05 11:50:11.267 [debu]  pubsub: pubsub connected to postgres
        t.go:111: 2026-02-05 11:50:11.267 [debu]  pubsub: pubsub has started
        telemetry_test.go:1056: 
                Error Trace:    /home/coder/coder/coderd/telemetry/telemetry_test.go:1056
                Error:          Received unexpected error:
                                pq: deadlock detected
                Test:           TestTelemetry_BoundaryUsageSummary/ConcurrentGetAndResetWithWriters
        t.go:111: 2026-02-05 11:50:16.932 [info]  pubsub: pubsub is closing
        t.go:111: 2026-02-05 11:50:16.932 [info]  pubsub: pubsub listen stopped receiving notify
        t.go:111: 2026-02-05 11:50:16.932 [debu]  pubsub: pubsub closed
FAIL
FAIL    github.com/coder/coder/v2/coderd/telemetry      5.814s
FAIL

I'm not a deep PG expert, so I asked Codex to explain what's causing the deadlock:

  • GetAndResetBoundaryUsageSummary runs DELETE FROM boundary_usage_stats … RETURNING (no WHERE), which takes row locks as it scans/deletes.
  • Writers are doing INSERT … ON CONFLICT DO UPDATE on the same rows, which takes a unique index lock and then waits on the row lock.
  • It's possible for the DELETE to hold the row lock while waiting on an index lock held by the UPSERT, while the UPSERT is waiting on the row lock → cycle → Postgres aborts one transaction with pq: deadlock detected (SQLSTATE 40P01).

I think one way to avoid this would be to run both UpsertBoundaryUsageStats and GetAndResetBoundaryUsageSummary in transactions and take explicit exclusive table locks on the boundary_usage_stats table at transaction start.
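That suggestion might look roughly like this (a sketch only — the lock mode shown is one plausible choice, not a prescription from this PR):

```sql
BEGIN;
-- EXCLUSIVE mode conflicts with ROW EXCLUSIVE (taken by INSERT/UPDATE/DELETE),
-- so concurrent upserts and deletes on this table serialize here,
-- while plain SELECTs remain unblocked.
LOCK TABLE boundary_usage_stats IN EXCLUSIVE MODE;
-- ... run UpsertBoundaryUsageStats or GetAndResetBoundaryUsageSummary ...
COMMIT;
```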

UpsertBoundaryUsageStats (INSERT...ON CONFLICT DO UPDATE) and
GetAndResetBoundaryUsageSummary (DELETE...RETURNING) can race during
telemetry period cutover. Without serialization, an upsert concurrent
with the delete could lose data (deleted right after being written) or
commit after the delete (miscounted in the next period). Both operations
now acquire LockIDBoundaryUsageStats within a transaction to ensure a
clean cutover.
@zedkipp
Copy link
Contributor Author

zedkipp commented Feb 5, 2026

I ended up going down a bit of a rabbit hole on how that deadlock happens. I think it would not happen in practice, because the table has N replica writers (each writing to a single unique row) and 1 replica doing the read+delete. The key part is that only one side ever locks multiple rows within a single transaction, i.e. only GetAndResetBoundaryUsageSummary() locks multiple rows.

The stress test has 5 writers and 5 read+deleters, which exposes the deadlock condition. If I set readerCount in the test to 1, the deadlock doesn't happen.

That said, I think the advisory lock is still needed, because the row/table locks acquired by INSERT ... ON CONFLICT DO UPDATE and DELETE are not strong enough to create a clean cutover between periods. If an upsert runs during the delete, the upserted data could be lost (deleted right after being written) or it might commit after the delete (counted in the next period unexpectedly).
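The cutover described in the commit message above can be sketched with a transaction-scoped advisory lock (a sketch assuming Postgres; the numeric lock ID is a placeholder for LockIDBoundaryUsageStats, and the column names are illustrative):

```sql
BEGIN;
-- pg_advisory_xact_lock blocks until the lock is free and releases it
-- automatically at COMMIT/ROLLBACK. Both the upsert path and this
-- read+delete path take the same lock, so neither can interleave with
-- the other across a period cutover.
SELECT pg_advisory_xact_lock(/* LockIDBoundaryUsageStats */ 123456);
WITH deleted AS (
    DELETE FROM boundary_usage_stats
    RETURNING user_id, usage_seconds
)
SELECT user_id, SUM(usage_seconds) AS total_usage_seconds
FROM deleted
GROUP BY user_id;
COMMIT;
```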

@zedkipp zedkipp merged commit a31e476 into main Feb 6, 2026
27 of 28 checks passed
@zedkipp zedkipp deleted the zedkipp/snapshot-race branch February 6, 2026 16:52
@github-actions github-actions bot locked and limited conversation to collaborators Feb 6, 2026