improve performance supporting 30k+ workspace connections #23398
Draft
Add a PGCoordOptions struct to configure the number of workers for the querier, binder, tunneler, and handshaker components. Zero values fall back to the existing hardcoded defaults. Wire up four new hidden deployment options:
- CODER_TAILNET_QUERIER_WORKERS
- CODER_TAILNET_BINDER_WORKERS
- CODER_TAILNET_TUNNELER_WORKERS
- CODER_TAILNET_HANDSHAKER_WORKERS
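A zero-value fallback like the one described can be sketched as follows. The field names mirror the description, but the default worker counts here are made-up placeholders, not the real hardcoded values:

```go
package main

// Illustrative defaults only; the real hardcoded values differ.
const (
	defaultQuerierWorkers    = 10
	defaultBinderWorkers     = 10
	defaultTunnelerWorkers   = 10
	defaultHandshakerWorkers = 5
)

// PGCoordOptions configures worker counts; zero means "use the default".
type PGCoordOptions struct {
	QuerierWorkers    int
	BinderWorkers     int
	TunnelerWorkers   int
	HandshakerWorkers int
}

// withDefaults returns a copy with zero values replaced by the defaults,
// so callers (and the new env vars) only override what they set explicitly.
func (o PGCoordOptions) withDefaults() PGCoordOptions {
	if o.QuerierWorkers == 0 {
		o.QuerierWorkers = defaultQuerierWorkers
	}
	if o.BinderWorkers == 0 {
		o.BinderWorkers = defaultBinderWorkers
	}
	if o.TunnelerWorkers == 0 {
		o.TunnelerWorkers = defaultTunnelerWorkers
	}
	if o.HandshakerWorkers == 0 {
		o.HandshakerWorkers = defaultHandshakerWorkers
	}
	return o
}
```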
…peer updates

Add a reverse tunnel index (tunnelsByPeer) to the pgcoord querier that tracks which remote peers have tunnels with local mappers. This allows listenPeer to skip peer updates for peers with no local interest, avoiding unnecessary DB queries and work-queue items. The index is maintained in:
- listenTunnel: when a tunnel update arrives, register each peer as a tunnel partner of the other if the other has a local mapper.
- peerUpdate: when querying tunnel peers from the DB, populate the index for peers that have local mappers.
- cleanupConn: remove the mapper from all index entries when the connection is cleaned up.
- resyncPeerMappings: clear the index and let it repopulate organically.
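The reverse index can be pictured as a map from remote peer ID to the set of interested local mappers. This is an illustrative sketch with string IDs (the real code uses UUIDs and lives inside the querier, guarded by its mutex):

```go
package main

// tunnelIndex is a hypothetical sketch of a tunnelsByPeer-style reverse
// index: remote peer ID -> set of local mapper IDs with a tunnel to it.
type tunnelIndex struct {
	byPeer map[string]map[string]struct{}
}

func newTunnelIndex() *tunnelIndex {
	return &tunnelIndex{byPeer: make(map[string]map[string]struct{})}
}

// addInterest registers a local mapper as interested in updates from peer
// (the listenTunnel / peerUpdate maintenance paths).
func (ti *tunnelIndex) addInterest(peer, mapper string) {
	set, ok := ti.byPeer[peer]
	if !ok {
		set = make(map[string]struct{})
		ti.byPeer[peer] = set
	}
	set[mapper] = struct{}{}
}

// removeMapper drops a mapper from every entry, as cleanupConn would.
func (ti *tunnelIndex) removeMapper(mapper string) {
	for peer, set := range ti.byPeer {
		delete(set, mapper)
		if len(set) == 0 {
			delete(ti.byPeer, peer)
		}
	}
}

// hasLocalInterest reports whether any local mapper cares about peer,
// letting a listenPeer-style handler skip updates with no local interest.
func (ti *tunnelIndex) hasLocalInterest(peer string) bool {
	return len(ti.byPeer[peer]) > 0
}
```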
Add GetTailnetTunnelPeerIDsBatch and GetTailnetTunnelPeerBindingsBatch SQL queries that accept UUID arrays, reducing DB round trips when many peers need tunnel lookups simultaneously.

Add workQ.acquireBatch() to opportunistically grab up to 49 additional same-type pending keys after the initial blocking acquire(). The worker loop now batches peerUpdate and mappingQuery work items, falling back to single-key queries when only one item is pending.

At 10k agents this reduces per-update DB round trips from O(n) to O(n/50), significantly lowering coordinator DB load.
Introduces a new agentconnectionbatcher package that batches agent connection heartbeat updates (lastConnectedAt) into a single BatchUpdateWorkspaceAgentConnections query instead of one UpdateWorkspaceAgentConnectionByID per agent per tick.

Initial connect and disconnect writes remain direct DB calls to ensure immediate state visibility. Only periodic heartbeat updates are batched, since they are high-frequency and losing one is acceptable.

The batcher follows the same pattern as the existing metadata batcher: buffered channel, dedup by agent ID, periodic flush with capacity overflow, and a final flush on shutdown.
Add a BatchUpsertConnectionLogs SQL query using the unnest pattern to batch multiple connection log upserts into a single query, reducing DB lock contention at scale (30k+ connections). Create a connectionlogbatcher package following the same pattern as agentconnectionbatcher. The batcher buffers UpsertConnectionLogParams and flushes them every second (or when batch size reaches 500). Wire the batcher into the enterprise ConnectionLogger as a new batchDBBackend, replacing direct per-event DB writes.

Changes:
- coderd/database/queries/connectionlogs.sql: new BatchUpsertConnectionLogs query
- coderd/database/dbauthz/dbauthz.go: implement auth wrapper (not panic stub)
- coderd/connectionlogbatcher/batcher.go: new batcher package
- enterprise/coderd/connectionlog/connectionlog.go: add NewBatchDBBackend
- enterprise/coderd/coderd.go: create batcher and use batch backend
- Add workspace identity (name, ID, agent name) to runner logs
- Add trace context injection to WebSocket and workspace app connections
- Wire up --trace-propagate flag to enable trace propagation to coderd
- Copy Trace flag in DupClientCopyingHeaders for runner clients
- Add PrintSummary to harness.Results (limits output to N failures for container log visibility)
- Write results summary to stderr (kubectl logs captures stderr)
- Print results before any error return for consistent visibility
- Assert bytes read > 0 on test success to catch silent connection deaths
- Return error when context cancelled before deadline completes
- Log traffic summary (actual bytes read/written) on all exit paths
Not worth reviewing by anyone yet.