
Silent outbound connection failure — addnode returns success but connection never established #6

@b3y0urs3lf

Description

Tested locally on the main branch at commit 285d9ad.

Summary

When a node initiates an outbound connection via the addnode RPC, the RPC returns {"success": true} immediately — before the TCP connection actually completes. If the async connect
subsequently fails (due to system load, unreachable peer, timing), the failure is completely silent: no log message at any level, no retry attempt, no error surfaced to the user. The node
appears healthy but has 0 peers and never syncs.

Environment

  • OS: Linux 6.17.0-19-generic
  • unicity-node: regtest and testnet

Steps to Reproduce

Manual reproduction

  1. Start a node
    ./build/bin/unicityd --regtest &

  2. Connect to a port where nothing is listening
    ./build/bin/unicity-cli --regtest addnode 127.0.0.1:99999 add

  3. Observe: RPC returns success!
    Output: {"success": true, "message": "Connection initiated to 127.0.0.1:99999"}

  4. Wait a few seconds, then check peers
    sleep 3
    ./build/bin/unicity-cli --regtest getconnectioncount
    Output: 0 (no peers connected)

  5. Check debug.log for any error about the failed connection
    grep -iE "fail|error|connect" ~/.unicity/regtest/debug.log
    Nothing — the failure is completely invisible

Automated reproduction

A comprehensive test, bug_silent_connection_failure.py, is available:

    cd test/functional
    # copy bug_silent_connection_failure.py into this directory
    python3 bug_silent_connection_failure.py

The test runs 3 phases:

  1. addnode to unreachable port — confirms RPC returns success, node has 0 peers, no error in debug.log
  2. 3-node fork resolution under CPU load — reproduces the intermittent sync failure from feature_fork_resolution.py (Node1 at height 10 connects to Node2 at height 15 but connection
    silently fails, Node1 never syncs)
  3. Post-IBD stall timeout verification — confirms the stall timeout in ProcessTimers() is disabled post-IBD, leaving no recovery mechanism

Example output:

=== Test 1: addnode reports success before connection completes ===

BUG CONFIRMED: addnode returned success for unreachable address
Result: {'success': true, 'message': 'Connection initiated to 127.0.0.1:54139'}

Node0 connections after 3s: 0
BUG CONFIRMED: Node has 0 peers but no error was reported

BUG CONFIRMED: No failure message in debug.log at any level
The connection failure is completely invisible.

Root Cause

Four issues combine to create this bug:

1. Async connect failure is silent

File: src/network/connection_manager.cpp:1192-1196

if (!success || !connection_cb) {
    // Connection failed - no peer created, no ID allocated
    ++metrics_outbound_failures_;  // Only increments a counter
    return;                        // No log, no callback, nothing visible
}

When the async TCP connect callback receives success=false, the only action is incrementing a metric counter. No log message is emitted at any level (not even DEBUG). The failure is
completely invisible to operators.

2. addnode RPC returns before connection completes

File: src/network/rpc_server.cpp:987-991

LOG_INFO("RPC addnode: calling connect_to() with Manual|NoBan flags");
auto result = network_manager_.connect_to(addr, ...);
LOG_INFO("RPC addnode: connect_to() returned result");
if (result != network::ConnectionResult::Success) {
    return util::JsonError("Failed to connect to node");
}
// Returns {"success": true} — but the connection hasn't completed yet!

connect_to() returns ConnectionResult::Success after initiating the async TCP connect, not after it completes. The RPC tells the user "success" when the connection may fail moments later
with no notification.

3. No retry mechanism

addnode with command "add" initiates exactly one connection attempt. If the async connect fails, no retry is scheduled. The node permanently fails to connect to that peer.

4. Post-IBD stall timeout disabled

File: src/network/header_sync_manager.cpp

// ProcessTimers() — stall detection:
if (last_us > 0 && (now_us - last_us) > kHeadersSyncTimeoutUs
&& chainstate_manager_.IsInitialBlockDownload()) { // Only during IBD!

On regtest, nodes exit IBD after mining just 1 block (because nMinimumChainWork = 0). Post-IBD, there is no stall detection or recovery. If all peers disconnect or connections silently
fail, the node has no mechanism to detect the problem or retry.

Proposed Fix

Priority 1: Log async connection failures (one-line fix)

File: src/network/connection_manager.cpp:1192-1196

if (!success || !connection_cb) {
    ++metrics_outbound_failures_;
    LOG_NET_INFO("Outbound connection to {}:{} failed asynchronously", address, port);
    return;
}

Priority 2: Retry logic for addnode connections

When addnode "add" initiates a connection and the async callback fails, schedule a retry (e.g., 3 attempts with exponential backoff). Currently a single async failure permanently prevents
the connection.
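One possible shape for that retry policy, sketched in Python for clarity (the attempt count, delays, and the try_connect callable are assumptions, not existing unicity-node API):

```python
import time
from typing import Callable

def backoff_delays(attempts: int = 3, base: float = 1.0, factor: float = 2.0) -> list:
    """Delay in seconds before each retry: 1s, 2s, 4s with the defaults."""
    return [base * factor ** i for i in range(attempts)]

def connect_with_retry(try_connect: Callable[[], bool],
                       attempts: int = 3,
                       sleep=time.sleep) -> bool:
    """Call try_connect up to `attempts` times, backing off between failures.

    try_connect stands in for the async-connect-and-wait-for-result step: it
    returns True on an established connection, False on an async failure.
    """
    for delay in backoff_delays(attempts):
        if try_connect():
            return True
        sleep(delay)  # exponential backoff before the next attempt
    return False
```

The key property is that a single transient failure (e.g. under CPU load during a test run) no longer permanently loses the peer.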

Priority 3: Consider post-IBD stall detection

Add a lighter post-IBD check that detects when no outbound peers are connected for an extended period and triggers reconnection attempts.
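A lighter check could look something like the following sketch (the class name, grace period, and how it would be wired into ProcessTimers() are all assumptions):

```python
import time

class OutboundPeerWatchdog:
    """Fires when the node has had zero outbound peers for longer than grace_s.

    Unlike the IBD-gated headers-sync timeout, this runs for the whole
    lifetime of the node, so silently failed connections get noticed.
    """
    def __init__(self, grace_s: float = 60.0, now=time.monotonic):
        self._grace_s = grace_s
        self._now = now                    # injectable clock, for testing
        self._last_seen_peer = now()

    def check(self, outbound_peer_count: int) -> bool:
        """Return True if reconnection attempts should be triggered."""
        now = self._now()
        if outbound_peer_count > 0:
            self._last_seen_peer = now     # healthy: reset the timer
            return False
        return (now - self._last_seen_peer) > self._grace_s
```

Called periodically from the timer loop, check() would trigger reconnection to addnode peers (or DNS seeds) after an extended zero-peer period.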

Impact

- Operator confusion: Node appears healthy (no errors) but has 0 peers and never syncs. No indication of what's wrong.
- Flaky test: feature_fork_resolution.py intermittently fails because Node1's connection to Node2 silently fails under system load. The test passed 5/5 in isolation but failed during the full 34-test suite run.

- Testnet/production: If a node's initial peer connections silently fail, it will mine blocks in isolation that are never relayed, wasting energy and diverging from the network.

Metadata

  • Labels: bug (Something isn't working)
  • Status: Todo
  • Milestone: none
  • Development: no branches or pull requests
Issue actions