
improve performance supporting 30k+ workspace connections #23398

Draft
sreya wants to merge 11 commits into main from agent-connections

Conversation


sreya commented Mar 20, 2026

not worth reviewing by anyone yet

sreya added 11 commits March 20, 2026 21:33
Add PGCoordOptions struct to configure the number of workers for
querier, binder, tunneler, and handshaker components. Zero values
fall back to the existing hardcoded defaults.
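The zero-value fallback described above might look roughly like this; the `PGCoordOptions` field names match the flags below, but the default of 16 workers and the helper are illustrative assumptions, not the PR's actual code.

```go
package main

import "fmt"

// PGCoordOptions is a hypothetical sketch of the struct described above;
// the exact fields and defaults are assumptions, not the PR's definitions.
type PGCoordOptions struct {
	QuerierWorkers    int
	BinderWorkers     int
	TunnelerWorkers   int
	HandshakerWorkers int
}

// withDefault returns def when v is the zero value, so unset options
// fall back to the existing hardcoded defaults.
func withDefault(v, def int) int {
	if v == 0 {
		return def
	}
	return v
}

func main() {
	opts := PGCoordOptions{QuerierWorkers: 32} // other fields left zero
	fmt.Println(withDefault(opts.QuerierWorkers, 16)) // uses the configured value
	fmt.Println(withDefault(opts.BinderWorkers, 16))  // falls back to the default
}
```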

Wire four new hidden deployment options:
- CODER_TAILNET_QUERIER_WORKERS
- CODER_TAILNET_BINDER_WORKERS
- CODER_TAILNET_TUNNELER_WORKERS
- CODER_TAILNET_HANDSHAKER_WORKERS

…peer updates

Add a reverse tunnel index (tunnelsByPeer) to the pgcoord querier that
tracks which remote peers have tunnels with local mappers. This allows
listenPeer to skip peer updates for peers with no local interest,
avoiding unnecessary DB queries and work queue items.

The index is maintained in:
- listenTunnel: when a tunnel update arrives, register each peer as a
  tunnel partner for the other if the other has a local mapper.
- peerUpdate: when querying tunnel peers from DB, populate the index
  for peers that have local mappers.
- cleanupConn: remove the mapper from all index entries when cleaned up.
- resyncPeerMappings: clear and let the index repopulate organically.
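Assuming the index maps a remote peer ID to the set of local mappers interested in it, the maintenance rules above can be sketched like this (types simplified to strings; all names here are hypothetical, not the pgcoord code):

```go
package main

import "fmt"

// tunnelIndex is a simplified sketch of the reverse tunnel index:
// remote peer ID -> set of local mappers with a tunnel to that peer.
type tunnelIndex struct {
	tunnelsByPeer map[string]map[string]struct{}
}

// register records that localMapper has a tunnel to remotePeer
// (the listenTunnel / peerUpdate maintenance paths).
func (ix *tunnelIndex) register(remotePeer, localMapper string) {
	if ix.tunnelsByPeer[remotePeer] == nil {
		ix.tunnelsByPeer[remotePeer] = map[string]struct{}{}
	}
	ix.tunnelsByPeer[remotePeer][localMapper] = struct{}{}
}

// hasLocalInterest is what lets listenPeer skip updates for peers
// that no local mapper cares about.
func (ix *tunnelIndex) hasLocalInterest(remotePeer string) bool {
	return len(ix.tunnelsByPeer[remotePeer]) > 0
}

// removeMapper is the cleanupConn path: drop one mapper from every
// entry, deleting entries that become empty.
func (ix *tunnelIndex) removeMapper(mapper string) {
	for peer, set := range ix.tunnelsByPeer {
		delete(set, mapper)
		if len(set) == 0 {
			delete(ix.tunnelsByPeer, peer)
		}
	}
}

func main() {
	ix := &tunnelIndex{tunnelsByPeer: map[string]map[string]struct{}{}}
	ix.register("peer-a", "mapper-1")
	fmt.Println(ix.hasLocalInterest("peer-a")) // tunnel partner registered
	ix.removeMapper("mapper-1")
	fmt.Println(ix.hasLocalInterest("peer-a")) // no local interest remains
}
```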

Add GetTailnetTunnelPeerIDsBatch and GetTailnetTunnelPeerBindingsBatch
SQL queries that accept UUID arrays, reducing DB round trips when many
peers need tunnel lookups simultaneously.

Add workQ.acquireBatch() to opportunistically grab up to 49 additional
same-type pending keys after the initial blocking acquire(). The worker
loop now batches peerUpdate and mappingQuery work items, falling back to
single-key queries when only one item is pending.

At 10k agents this reduces per-update DB round trips from O(n) to
O(n/50), significantly lowering coordinator DB load.
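The acquire-then-drain behavior can be sketched with a plain channel (an assumption; the real workQ's internals are not reproduced here): block for the first key, then opportunistically pull whatever same-type keys are already pending, up to the batch cap.

```go
package main

import "fmt"

// acquireBatch blocks for the first pending key, then opportunistically
// drains up to max-1 more without blocking, mirroring the workQ behavior
// described above.
func acquireBatch(pending <-chan string, max int) []string {
	batch := []string{<-pending} // initial blocking acquire
	for len(batch) < max {
		select {
		case k := <-pending:
			batch = append(batch, k)
		default:
			// Nothing else pending: fall back to a smaller batch,
			// possibly a single key.
			return batch
		}
	}
	return batch
}

func main() {
	pending := make(chan string, 64)
	for i := 0; i < 3; i++ {
		pending <- fmt.Sprintf("peer-%d", i)
	}
	fmt.Println(len(acquireBatch(pending, 50))) // grabs all 3 pending keys
}
```

With a cap of 50 (1 blocking + up to 49 opportunistic), a burst of n pending updates turns into roughly n/50 batched queries instead of n single-key ones.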

Introduces a new agentconnectionbatcher package that batches agent
connection heartbeat updates (lastConnectedAt) into a single
BatchUpdateWorkspaceAgentConnections query instead of one
UpdateWorkspaceAgentConnectionByID per agent per tick.

Initial connect and disconnect writes remain as direct DB calls to
ensure immediate state visibility. Only periodic heartbeat updates
are batched, since they are high-frequency and losing one is
acceptable.

The batcher follows the same pattern as the existing metadata batcher:
buffered channel, dedup by agent ID, periodic flush with capacity
overflow, and a final flush on shutdown.

Add BatchUpsertConnectionLogs SQL query using the unnest pattern to
batch multiple connection log upserts into a single query, reducing
DB lock contention at scale (30k+ connections).

Create connectionlogbatcher package following the same pattern as
agentconnectionbatcher. The batcher buffers UpsertConnectionLogParams
and flushes them every second (or when batch size reaches 500).

Wire the batcher into the enterprise ConnectionLogger as a new
batchDBBackend, replacing direct per-event DB writes.

Changes:
- coderd/database/queries/connectionlogs.sql: new BatchUpsertConnectionLogs query
- coderd/database/dbauthz/dbauthz.go: implement auth wrapper (not panic stub)
- coderd/connectionlogbatcher/batcher.go: new batcher package
- enterprise/coderd/connectionlog/connectionlog.go: add NewBatchDBBackend
- enterprise/coderd/coderd.go: create batcher and use batch backend

- Add workspace identity (name, ID, agent name) to runner logs
- Add trace context injection to WebSocket and workspace app connections
- Wire up --trace-propagate flag to enable trace propagation to coderd
- Copy Trace flag in DupClientCopyingHeaders for runner clients
- Add PrintSummary to harness.Results (limits output to N failures for
  container log visibility)
- Write results summary to stderr (kubectl logs captures stderr)
- Print results before any error return for consistent visibility
- Assert bytes read > 0 on test success to catch silent connection deaths
- Return error when context cancelled before deadline completes
- Log traffic summary (actual bytes read/written) on all exit paths