
improve performance supporting 30k+ workspace connections #23398

Draft
sreya wants to merge 11 commits into main from agent-connections

Conversation


sreya commented Mar 20, 2026

not worth reviewing by anyone yet

sreya added 11 commits March 20, 2026 21:33
Add PGCoordOptions struct to configure the number of workers for
querier, binder, tunneler, and handshaker components. Zero values
fall back to the existing hardcoded defaults.
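The zero-value fallback described above might look roughly like this; the `PGCoordOptions` field names match the flags below, but the default of 16 workers and the helper are illustrative assumptions, not the PR's actual code.

```go
package main

import "fmt"

// PGCoordOptions is a hypothetical sketch of the struct described above;
// the exact fields and defaults are assumptions, not the PR's definitions.
type PGCoordOptions struct {
	QuerierWorkers    int
	BinderWorkers     int
	TunnelerWorkers   int
	HandshakerWorkers int
}

// withDefault returns def when v is the zero value, so unset options
// fall back to the existing hardcoded defaults.
func withDefault(v, def int) int {
	if v == 0 {
		return def
	}
	return v
}

func main() {
	opts := PGCoordOptions{QuerierWorkers: 32} // other fields left zero
	fmt.Println(withDefault(opts.QuerierWorkers, 16)) // uses the configured value
	fmt.Println(withDefault(opts.BinderWorkers, 16))  // falls back to the default
}
```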

Wire four new hidden deployment options:
- CODER_TAILNET_QUERIER_WORKERS
- CODER_TAILNET_BINDER_WORKERS
- CODER_TAILNET_TUNNELER_WORKERS
- CODER_TAILNET_HANDSHAKER_WORKERS

…peer updates

Add a reverse tunnel index (tunnelsByPeer) to the pgcoord querier that
tracks which remote peers have tunnels with local mappers. This allows
listenPeer to skip peer updates for peers with no local interest,
avoiding unnecessary DB queries and work queue items.

The index is maintained in:
- listenTunnel: when a tunnel update arrives, register each peer as a
  tunnel partner for the other if the other has a local mapper.
- peerUpdate: when querying tunnel peers from DB, populate the index
  for peers that have local mappers.
- cleanupConn: remove the mapper from all index entries when cleaned up.
- resyncPeerMappings: clear and let the index repopulate organically.
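Assuming the index maps a remote peer ID to the set of local mappers interested in it, the maintenance rules above can be sketched like this (types simplified to strings; all names here are hypothetical, not the pgcoord code):

```go
package main

import "fmt"

// tunnelIndex is a simplified sketch of the reverse tunnel index:
// remote peer ID -> set of local mappers with a tunnel to that peer.
type tunnelIndex struct {
	tunnelsByPeer map[string]map[string]struct{}
}

// register records that localMapper has a tunnel to remotePeer
// (the listenTunnel / peerUpdate maintenance paths).
func (ix *tunnelIndex) register(remotePeer, localMapper string) {
	if ix.tunnelsByPeer[remotePeer] == nil {
		ix.tunnelsByPeer[remotePeer] = map[string]struct{}{}
	}
	ix.tunnelsByPeer[remotePeer][localMapper] = struct{}{}
}

// hasLocalInterest is what lets listenPeer skip updates for peers
// that no local mapper cares about.
func (ix *tunnelIndex) hasLocalInterest(remotePeer string) bool {
	return len(ix.tunnelsByPeer[remotePeer]) > 0
}

// removeMapper is the cleanupConn path: drop one mapper from every
// entry, deleting entries that become empty.
func (ix *tunnelIndex) removeMapper(mapper string) {
	for peer, set := range ix.tunnelsByPeer {
		delete(set, mapper)
		if len(set) == 0 {
			delete(ix.tunnelsByPeer, peer)
		}
	}
}

func main() {
	ix := &tunnelIndex{tunnelsByPeer: map[string]map[string]struct{}{}}
	ix.register("peer-a", "mapper-1")
	fmt.Println(ix.hasLocalInterest("peer-a")) // tunnel partner registered
	ix.removeMapper("mapper-1")
	fmt.Println(ix.hasLocalInterest("peer-a")) // no local interest remains
}
```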

Add GetTailnetTunnelPeerIDsBatch and GetTailnetTunnelPeerBindingsBatch
SQL queries that accept UUID arrays, reducing DB round trips when many
peers need tunnel lookups simultaneously.

Add workQ.acquireBatch() to opportunistically grab up to 49 additional
same-type pending keys after the initial blocking acquire(). The worker
loop now batches peerUpdate and mappingQuery work items, falling back to
single-key queries when only one item is pending.

At 10k agents this reduces per-update DB round trips from O(n) to
O(n/50), significantly lowering coordinator DB load.
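The acquire-then-drain behavior can be sketched with a plain channel (an assumption; the real workQ's internals are not reproduced here): block for the first key, then opportunistically pull whatever same-type keys are already pending, up to the batch cap.

```go
package main

import "fmt"

// acquireBatch blocks for the first pending key, then opportunistically
// drains up to max-1 more without blocking, mirroring the workQ behavior
// described above.
func acquireBatch(pending <-chan string, max int) []string {
	batch := []string{<-pending} // initial blocking acquire
	for len(batch) < max {
		select {
		case k := <-pending:
			batch = append(batch, k)
		default:
			// Nothing else pending: fall back to a smaller batch,
			// possibly a single key.
			return batch
		}
	}
	return batch
}

func main() {
	pending := make(chan string, 64)
	for i := 0; i < 3; i++ {
		pending <- fmt.Sprintf("peer-%d", i)
	}
	fmt.Println(len(acquireBatch(pending, 50))) // grabs all 3 pending keys
}
```

With a cap of 50 (1 blocking + up to 49 opportunistic), a burst of n pending updates turns into roughly n/50 batched queries instead of n single-key ones.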

Introduces a new agentconnectionbatcher package that batches agent
connection heartbeat updates (lastConnectedAt) into a single
BatchUpdateWorkspaceAgentConnections query instead of one
UpdateWorkspaceAgentConnectionByID per agent per tick.

Initial connect and disconnect writes remain as direct DB calls to
ensure immediate state visibility. Only periodic heartbeat updates
are batched, since they are high-frequency and losing one is
acceptable.

The batcher follows the same pattern as the existing metadata batcher:
buffered channel, dedup by agent ID, periodic flush with capacity
overflow, and a final flush on shutdown.

Add BatchUpsertConnectionLogs SQL query using the unnest pattern to
batch multiple connection log upserts into a single query, reducing
DB lock contention at scale (30k+ connections).

Create connectionlogbatcher package following the same pattern as
agentconnectionbatcher. The batcher buffers UpsertConnectionLogParams
and flushes them every second (or when batch size reaches 500).

Wire the batcher into the enterprise ConnectionLogger as a new
batchDBBackend, replacing direct per-event DB writes.

Changes:
- coderd/database/queries/connectionlogs.sql: new BatchUpsertConnectionLogs query
- coderd/database/dbauthz/dbauthz.go: implement auth wrapper (not panic stub)
- coderd/connectionlogbatcher/batcher.go: new batcher package
- enterprise/coderd/connectionlog/connectionlog.go: add NewBatchDBBackend
- enterprise/coderd/coderd.go: create batcher and use batch backend

- Add workspace identity (name, ID, agent name) to runner logs
- Add trace context injection to WebSocket and workspace app connections
- Wire up --trace-propagate flag to enable trace propagation to coderd
- Copy Trace flag in DupClientCopyingHeaders for runner clients
- Add PrintSummary to harness.Results (limits output to N failures for
  container log visibility)
- Write results summary to stderr (kubectl logs captures stderr)
- Print results before any error return for consistent visibility
- Assert bytes read > 0 on test success to catch silent connection deaths
- Return error when context cancelled before deadline completes
- Log traffic summary (actual bytes read/written) on all exit paths