
fix(tailnet): retry after transport dial timeouts#22977

Merged: ethanndickson merged 1 commit into main from vpn-reconnect-5yna-fork-1 on Mar 12, 2026
Conversation


ethanndickson (Member) commented on Mar 12, 2026

Generated with mux but reviewed by a human

This PR fixes a bug where Coder Desktop could stop retrying connections to coderd after a prolonged network interruption. When that happened, the client would no longer recoordinate or receive workspace updates, even after connectivity returned.

This is likely the long-standing “stale connection” issue that has been reported both internally and by customers. In practice, it would cause all Coder Desktop workspaces to appear yellow or red in the UI and become unreachable.

The underlying behavior matches the reports: peers are removed after 15 minutes without a handshake. So if network connectivity is lost for that long, the client must recoordinate to recover. This bug prevented that recoordination from happening.

For that reason, I’m marking this as:

Closes coder/coder-desktop-macos#227

Problem

The tailnet controller owns a long-lived retry loop in Controller.Run. That loop already had an important graceful-shutdown guard added in ba21ba87 to prevent a phantom redial after cancellation:

if c.ctx.Err() != nil {
    return
}

That guard was correct. It made controller lifetime depend on the controller's own context rather than on retry timing races.

But the post-dial error path had since grown a broader terminal check:

if xerrors.Is(err, context.Canceled) ||
   xerrors.Is(err, context.DeadlineExceeded) {
    return
}

That turns out to be too broad for desktop reconnects. A dial attempt can fail with a wrapped context.DeadlineExceeded even while the controller's own context is still live.

Why that happens

The workspace tailnet dialer uses the SDK HTTP client, which inherits http.DefaultTransport. That transport uses a net.Dialer with a 30s Timeout. Go implements that timeout by creating an internal deadline-bound sub-context for the TCP connect.

So during a control-plane blackhole, the reconnect path can look like this:

  • the existing control-plane connection dies
  • Controller.Run re-enters the retry path
  • the next websocket/TCP dial hangs against unreachable coderd
  • net.Dialer times out the connect after ~30s
  • the returned error unwraps to context.DeadlineExceeded
  • Controller.Run treats that as terminal and returns
  • the retry goroutine exits forever even though c.ctx is still alive

At that point the data plane can remain partially alive, the desktop app can still look online, and unblocking coderd does nothing because the process is no longer trying to redial.

How this was found

We reproduced the issue in the macOS vpn-daemon process with temporary diagnostics, blackholed coderd with pfctl, and captured multiple goroutine dumps while the daemon was wedged.

Those dumps showed:

  • manageGracefulTimeout was still blocked on <-c.ctx.Done(), proving the controller context was not canceled
  • the Controller.Run retry goroutine was missing from later dumps
  • control-plane consumers stayed idle longer over time
  • once coderd became reachable again the daemon still did not dial it

That narrowed the failure from "slow retry" to "retry loop exited", and tracing the dial path back through http.DefaultTransport and net.Dialer explained why a transport timeout was being mistaken for controller shutdown.

Fix

Keep the graceful-shutdown guard from ba21ba87 exactly as-is, but narrow the post-dial exit condition so it keys off the controller's own context instead of the error unwrap chain.

Before:

if xerrors.Is(err, context.Canceled) ||
   xerrors.Is(err, context.DeadlineExceeded) {
    return
}

After:

if c.ctx.Err() != nil {
    return
}

Why this is the right behavior

This preserves the original graceful-shutdown invariant from ba21ba87 while restoring retryability for transient transport failures:

  • if c.ctx is canceled before dialing, the pre-dial guard still prevents a phantom redial
  • if c.ctx is canceled during a dial attempt, the error path still exits cleanly because c.ctx.Err() is non-nil
  • if a live controller hits a wrapped transport timeout, the loop no longer dies and instead retries as intended

In other words, controller state remains the only authoritative signal for loop shutdown.

Non-regression coverage

This also preserves the earlier flaky-test fix from ba21ba87:

  • pipeDialer still returns errors instead of asserting from background goroutines
  • TestController_Disconnects still waits for uut.Closed() before the test exits

On top of that, this change adds focused controller tests that assert:

  • a wrapped net.OpError(context.DeadlineExceeded) under a live controller causes another dial attempt instead of closing the controller
  • cancellation still shuts the controller down without an extra redial

Validation

Ran focused tests in the touched package:

  • go test ./tailnet -run 'TestController_(Disconnects|RetriesWrappedDeadlineExceeded|DoesNotRedialAfterCancel)$'
  • go test ./tailnet

Manual validation matched the intended behavior: after blocking TCP connections to coderd for 20 minutes to force the retry path, unblocking coderd allowed the daemon to recover on its own without toggling Coder Connect.

ethanndickson merged commit 5130404 into main on Mar 12, 2026. 60 checks passed.
ethanndickson deleted the vpn-reconnect-5yna-fork-1 branch on March 12, 2026 at 07:05.
Development

Successfully merging this pull request may close these issues.

Workspaces unreachable after switching networks
