Boundary usage telemetry data loss #21770

@zedkipp

When reporting the boundary usage summary for telemetry, there's a race that can result in usage data loss.

The race occurs if a replica calls UpsertBoundaryUsageStats after GetBoundaryUsageSummary but before ResetBoundaryUsageStats. When this happens, telemetry under-reports by whatever delta accumulated between GetBoundaryUsageSummary and ResetBoundaryUsageStats. Impact: approximately 3% of the replica's usage data for the interval could be lost.
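
As a rough sanity check on the ~3% figure, assume the loss is bounded by about one flush interval of activity per telemetry interval. That bound is an assumption, not something measured here. With the one-minute flush and 30-minute snapshot intervals stated later in this issue:

$$
\frac{t_{\text{flush}}}{t_{\text{telemetry}}} = \frac{1\ \text{min}}{30\ \text{min}} \approx 3.3\%
$$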

Detailed race steps:

  1. T1: GetBoundaryUsageSummary reads stats (e.g., 100 workspaces, 1000 requests)
  2. T2: Replica flushes → UPDATE → writes (105, 1050) → newPeriod=false → in-memory preserved
  3. T3: ResetBoundaryUsageStats deletes all rows
  4. T4: Replica flushes → INSERT (row gone) → writes (105, 1050) → newPeriod=true → in-memory reset to zero
  5. T5 (1 minute later): Replica flushes → UPDATE → writes (2, 10) ← only ~1 minute of new activity
  6. T6 (29 minutes later): Next telemetry collection reads (2, 10)

The stats written at T4 are overwritten at T5 before telemetry can collect them. With flushes every minute and telemetry snapshots every 30 minutes, there are ~29 flushes that will overwrite the preserved data.
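
The T1 to T6 sequence can be replayed deterministically with a small in-memory model. This is only a sketch to make the lost delta concrete: `FakeStore`, `FakeReplica`, and `simulate` are hypothetical stand-ins, not the real query or tracker code, and the model assumes the upsert overwrites the row with the replica's in-memory totals and that the replica zeroes its counters only when the upsert turns out to be an INSERT.

```python
from dataclasses import dataclass, field


@dataclass
class FakeStore:
    """Hypothetical in-memory stand-in for the boundary usage stats table."""
    rows: dict = field(default_factory=dict)

    def upsert(self, replica_id, stats):
        """Overwrite the replica's row; return True if it was an INSERT."""
        new_period = replica_id not in self.rows
        self.rows[replica_id] = stats
        return new_period

    def get_summary(self):
        """Sum all rows, as the telemetry snapshot would."""
        workspaces = sum(w for w, _ in self.rows.values())
        requests = sum(r for _, r in self.rows.values())
        return (workspaces, requests)

    def reset(self):
        """Delete all rows, as the reset after a snapshot would."""
        self.rows.clear()


@dataclass
class FakeReplica:
    """Hypothetical replica that accumulates usage and flushes periodically."""
    replica_id: str
    store: FakeStore
    workspaces: int = 0
    requests: int = 0

    def record(self, workspaces, requests):
        self.workspaces += workspaces
        self.requests += requests

    def flush(self):
        new_period = self.store.upsert(
            self.replica_id, (self.workspaces, self.requests)
        )
        if new_period:
            # Row was gone, so treat this as a new period and start from zero.
            self.workspaces = 0
            self.requests = 0
        return new_period


def simulate():
    store = FakeStore()
    replica = FakeReplica("replica-1", store)

    # Setup: create the row first so the following flushes are UPDATEs.
    replica.flush()
    replica.record(100, 1000)
    replica.flush()

    first = store.get_summary()   # T1: telemetry reads (100, 1000)
    replica.record(5, 50)
    replica.flush()               # T2: UPDATE writes (105, 1050)
    store.reset()                 # T3: rows deleted
    replica.flush()               # T4: INSERT writes (105, 1050), counters zeroed
    replica.record(2, 10)
    replica.flush()               # T5: UPDATE overwrites the row with (2, 10)
    second = store.get_summary()  # T6: next telemetry read

    recorded = (100 + 5 + 2, 1000 + 50 + 10)
    reported = (first[0] + second[0], first[1] + second[1])
    lost = (recorded[0] - reported[0], recorded[1] - reported[1])
    return first, second, lost


if __name__ == "__main__":
    first, second, lost = simulate()
    print("first snapshot:", first)    # (100, 1000)
    print("second snapshot:", second)  # (2, 10)
    print("lost delta:", lost)         # (5, 50)
```

In this model the (5, 50) accumulated between the snapshot and the reset never reaches either telemetry read, which matches the under-reporting described above.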
