Closed
Description
When reporting the boundary usage summary for telemetry, there's a race that can result in usage data loss.
The race occurs if a replica calls UpsertBoundaryUsageStats after GetBoundaryUsageSummary but before ResetBoundaryUsageStats. When this happens, telemetry under-reports by whatever delta accumulated between GetBoundaryUsageSummary and ResetBoundaryUsageStats. Impact: approximately 3% of the replica's usage data for the interval could be lost.
Detailed race steps:
- T1: GetBoundaryUsageSummary reads stats (e.g., 100 workspaces, 1000 requests)
- T2: Replica flushes → UPDATE → writes (105, 1050) → newPeriod=false → in-memory preserved
- T3: ResetBoundaryUsageStats deletes all rows
- T4: Replica flushes → INSERT (row gone) → writes (105, 1050) → newPeriod=true → in-memory reset to zero
- T5 (1 minute later): Replica flushes → UPDATE → writes (2, 10) ← only ~1 minute of new activity
- T6 (29 minutes later): Next telemetry collection reads (2, 10)
The stats written at T4 are overwritten at T5 before telemetry can collect them. With flushes every minute and telemetry snapshots every 30 minutes, there are ~29 flushes that will overwrite the preserved data.
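The timeline above can be reproduced with a small in-memory model. This is only a sketch under stated assumptions: `Store`, `Replica`, `record`, and `simulate_race` are hypothetical names, not the real implementation, and both values are treated as plain counters. It assumes the upsert replaces the row and reports `new_period=True` only when it had to insert.

```python
class Store:
    """Hypothetical stand-in for the boundary usage stats table."""

    def __init__(self):
        self.rows = {}  # replica_id -> (workspaces, requests)

    def upsert(self, replica_id, stats):
        # INSERT when the row is missing (new period), otherwise UPDATE.
        new_period = replica_id not in self.rows
        self.rows[replica_id] = stats
        return new_period

    def get_summary(self):
        # Models GetBoundaryUsageSummary: sum over all replica rows.
        workspaces = sum(w for w, _ in self.rows.values())
        requests = sum(r for _, r in self.rows.values())
        return (workspaces, requests)

    def reset(self):
        # Models ResetBoundaryUsageStats: delete all rows.
        self.rows.clear()


class Replica:
    """Hypothetical replica that accumulates stats and flushes them."""

    def __init__(self, replica_id, store):
        self.id = replica_id
        self.store = store
        self.workspaces = 0
        self.requests = 0

    def record(self, workspaces, requests):
        self.workspaces += workspaces
        self.requests += requests

    def flush(self):
        new_period = self.store.upsert(self.id, (self.workspaces, self.requests))
        if new_period:
            # Assumed behavior from the issue: in-memory stats reset on INSERT.
            self.workspaces = 0
            self.requests = 0
        return new_period


def simulate_race():
    store = Store()
    replica = Replica("replica-1", store)
    # Seed the state just before T1: row and in-memory stats both at (100, 1000).
    store.rows[replica.id] = (100, 1000)
    replica.workspaces, replica.requests = 100, 1000

    first_report = store.get_summary()   # T1: telemetry reads (100, 1000)
    replica.record(5, 50)                # activity between T1 and T3
    t2 = replica.flush()                 # T2: UPDATE, in-memory preserved
    store.reset()                        # T3: all rows deleted
    t4 = replica.flush()                 # T4: INSERT, in-memory reset to zero
    replica.record(2, 10)                # ~1 minute of new activity
    t5 = replica.flush()                 # T5: UPDATE overwrites the T4 row
    second_report = store.get_summary()  # T6: telemetry reads (2, 10)
    return first_report, second_report, (t2, t4, t5)
```

In this model the two reports add up to (102, 1010) while the replica actually recorded (107, 1060), so the (5, 50) delta that arrived between the read and the reset never reaches telemetry.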