fix: make boundary usage telemetry collection atomic#21907
Conversation
Replace separate GetBoundaryUsageSummary and ResetBoundaryUsageStats calls with a single atomic GetAndResetBoundaryUsageSummary query. This uses DELETE...RETURNING in a common table expression to ensure the rows we sum are exactly the rows we delete, eliminating the race condition where a replica could flush stats between the read and reset operations. The race could cause one flush interval (1 minute) of usage data to be lost when a replica flushed between GetBoundaryUsageSummary and ResetBoundaryUsageStats, as the newly written data would be deleted before the next telemetry collection.
|
I added a temporary stress test to check if I'm not a deep PG expert, so I asked Codex to explain what's causing the deadlock:
I think one way to avoid this would be to run both |
UpsertBoundaryUsageStats (INSERT...ON CONFLICT DO UPDATE) and GetAndResetBoundaryUsageSummary (DELETE...RETURNING) can race during telemetry period cutover. Without serialization, an upsert concurrent with the delete could lose data (deleted right after being written) or commit after the delete (miscounted in the next period). Both operations now acquire LockIDBoundaryUsageStats within a transaction to ensure a clean cutover.
|
I ended up going down a bit of a rabbit hole on how that deadlock happens. I think it would not happen in practice because the table has N replica writers (each writing to a single unique row), and 1 replica doing the read+delete. I think they key part is there are never more than 1 side locking multiple rows within a single transaction i.e. only The stress test has 5 writers and 5 read+delete'ers, which exposes a deadlock condition. If I set That said, I think the use of an advisory lock is still needed because the row/table locks acquired by |
Replace separate GetBoundaryUsageSummary and ResetBoundaryUsageStats calls with a single atomic GetAndResetBoundaryUsageSummary query. This uses DELETE...RETURNING in a common table expression to ensure the rows we sum are exactly the rows we delete, eliminating the race condition where a replica could flush boundary usage stats between the read and reset operations.
The race could cause one flush interval (1 minute) of usage data to be lost when a replica flushed between GetBoundaryUsageSummary and ResetBoundaryUsageStats, as the newly written data would be deleted before the next telemetry collection.
#21770