fix(tailnet): retry after transport dial timeouts #22977
Merged
ethanndickson merged 1 commit into main on Mar 12, 2026
Conversation
This fixes the desktop reconnect bug where the vpn daemon can stay
"connected" after a VPN or Wi-Fi transition but silently stop talking
to coderd until the user toggles Coder Connect off and on.
Problem
The tailnet controller owns a long-lived retry loop in Controller.Run.
That loop already had an important graceful-shutdown guard added in
November 2024 to prevent a phantom redial after cancellation:
```go
if c.ctx.Err() != nil { return }
```
That guard was correct. It made controller lifetime depend on the
controller's own context rather than on retry timing races.
But the post-dial error path had since grown a broader terminal check:
```go
if xerrors.Is(err, context.Canceled) ||
	xerrors.Is(err, context.DeadlineExceeded) {
	return
}
```
That turns out to be too broad for desktop reconnects. A dial attempt can
fail with a wrapped context.DeadlineExceeded even while the controller's
own context is still live.
Why that happens
The workspace tailnet dialer uses the SDK HTTP client, which inherits
http.DefaultTransport. That transport uses a net.Dialer with a 30s
Timeout. Go implements that timeout by creating an internal deadline-
bound sub-context for the TCP connect.
So during a control-plane blackhole, the reconnect path can look like
this:
- the existing control-plane connection dies,
- Controller.Run re-enters the retry path,
- the next websocket/TCP dial hangs against unreachable coderd,
- net.Dialer times out the connect after ~30s,
- the returned error unwraps to context.DeadlineExceeded,
- Controller.Run treats that as terminal and returns,
- the retry goroutine exits forever even though c.ctx is still alive.
At that point the data plane can remain partially alive, the desktop app
can still look online, and unblocking coderd does nothing because the
process is no longer trying to redial.
How this was found
We reproduced the issue in the macOS vpn-daemon process with temporary
diagnostics, blackholed coderd with pfctl, and captured multiple
goroutine dumps while the daemon was wedged.
Those dumps showed:
- manageGracefulTimeout was still blocked on <-c.ctx.Done(), proving the
controller context was not canceled,
- the Controller.Run retry goroutine was missing from later dumps,
- control-plane consumers stayed idle longer over time,
- and once coderd became reachable again the daemon still did not dial
it.
That narrowed the failure from "slow retry" to "retry loop exited", and
tracing the dial path back through http.DefaultTransport and net.Dialer
explained why a transport timeout was being mistaken for controller
shutdown.
Fix
Keep the November 2024 graceful-shutdown guard exactly as-is, but narrow
the post-dial exit condition so it keys off the controller's own context
instead of the error unwrap chain.
Before:

```go
if xerrors.Is(err, context.Canceled) ||
	xerrors.Is(err, context.DeadlineExceeded) {
	return
}
```

After:

```go
if c.ctx.Err() != nil {
	return
}
```
Why this is the right behavior
This preserves the original graceful-shutdown invariant from the 2024
fix while restoring retryability for transient transport failures:
- if c.ctx is canceled before dialing, the pre-dial guard still prevents
a phantom redial,
- if c.ctx is canceled during a dial attempt, the error path still exits
cleanly because c.ctx.Err() is non-nil,
- if a live controller hits a wrapped transport timeout, the loop no
longer dies and instead retries as intended.
In other words, controller state remains the only authoritative signal
for loop shutdown.
Non-regression coverage
This also preserves the earlier flaky-test fix from ba21ba8:
- pipeDialer still returns errors instead of asserting from background
goroutines,
- TestController_Disconnects still waits for uut.Closed() before the
test exits.
On top of that, this change adds focused controller tests that assert:
- a wrapped net.OpError(context.DeadlineExceeded) under a live
controller causes another dial attempt instead of closing the
controller, and
- cancellation still shuts the controller down without an extra redial.
Validation
Ran focused validation in the touched package:
- go test ./tailnet -run 'TestController_(Disconnects|RetriesWrappedDeadlineExceeded|DoesNotRedialAfterCancel)$'
- go test ./tailnet
Manual validation also matched the intended behavior: after blocking TCP
connections to coderd for 20 minutes to force the controller into the
retry path, unblocking coderd allowed the daemon to recover on its own
without toggling Coder Connect.
deansheather approved these changes on Mar 12, 2026
Generated with mux but reviewed by a human
This PR fixes a bug where Coder Desktop could stop retrying connections to coderd after a prolonged network interruption. When that happened, the client would no longer recoordinate or receive workspace updates, even after connectivity returned.
This is likely the long-standing “stale connection” issue that has been reported both internally and by customers. In practice, it would cause all Coder Desktop workspaces to appear yellow or red in the UI and become unreachable.
The underlying behavior matches the reports: peers are removed after 15 minutes without a handshake. So if network connectivity is lost for that long, the client must recoordinate to recover. This bug prevented that recoordination from happening.
For that reason, I’m marking this as:
Closes coder/coder-desktop-macos#227