Silent outbound connection failure — addnode returns success but connection never established #6
Description
Tested locally on the main branch at commit 285d9ad.
Summary
When a node initiates an outbound connection via the addnode RPC, the RPC returns {"success": true} immediately — before the TCP connection actually completes. If the async connect
subsequently fails (due to system load, unreachable peer, timing), the failure is completely silent: no log message at any level, no retry attempt, no error surfaced to the user. The node
appears healthy but has 0 peers and never syncs.
Environment
- OS: Linux 6.17.0-19-generic
- unicity-node: regtest and testnet
Steps to Reproduce
Manual reproduction
- Start a node
./build/bin/unicityd --regtest &
- Connect to a port where nothing is listening
./build/bin/unicity-cli --regtest addnode 127.0.0.1:99999 add
- Observe: RPC returns success!
Output: {"success": true, "message": "Connection initiated to 127.0.0.1:99999"}
- Wait a few seconds, then check peers
sleep 3
./build/bin/unicity-cli --regtest getconnectioncount
Output: 0 (no peers connected)
- Check debug.log for any error about the failed connection
grep -iE "fail|error|connect" ~/.unicity/regtest/debug.log
Output: nothing. The failure is completely invisible.
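Because addnode reports success as soon as the attempt is initiated, a script that needs a real connection currently has to poll for it. The following is a minimal Python sketch of such a check, not part of unicity-node. It assumes a caller-supplied `get_connection_count` callable (hypothetical; in practice it could wrap `unicity-cli getconnectioncount`), and the `sleep` and `clock` parameters exist only so the helper can be exercised without real waiting.

```python
import time
from typing import Callable


def wait_for_peers(
    get_connection_count: Callable[[], int],
    min_peers: int = 1,
    timeout_s: float = 10.0,
    poll_interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once the peer count reaches min_peers, False on timeout.

    get_connection_count is a hypothetical hook supplied by the caller;
    it is not an API of unicity-node.
    """
    deadline = clock() + timeout_s
    while True:
        if get_connection_count() >= min_peers:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

A caller would treat a False return as "the connection was initiated but never established", which is exactly the case the RPC response does not distinguish today.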
Automated reproduction
A comprehensive test is available: bug_silent_connection_failure.py. Copy it into test/functional and run it:
cd test/functional
python3 bug_silent_connection_failure.py
The test runs 3 phases:
- addnode to unreachable port: confirms RPC returns success, node has 0 peers, no error in debug.log
- 3-node fork resolution under CPU load: reproduces the intermittent sync failure from feature_fork_resolution.py (Node1 at height 10 connects to Node2 at height 15 but connection silently fails, Node1 never syncs)
- Post-IBD stall timeout verification: confirms the stall timeout in ProcessTimers() is disabled post-IBD, leaving no recovery mechanism
Example output:
=== Test 1: addnode reports success before connection completes ===
BUG CONFIRMED: addnode returned success for unreachable address
Result: {'success': true, 'message': 'Connection initiated to 127.0.0.1:54139'}
Node0 connections after 3s: 0
BUG CONFIRMED: Node has 0 peers but no error was reported
BUG CONFIRMED: No failure message in debug.log at any level
The connection failure is completely invisible.
Root Cause
Four issues combine to create this bug:
1. Async connect failure is silent
File: src/network/connection_manager.cpp:1192-1196
if (!success || !connection_cb) {
// Connection failed - no peer created, no ID allocated
++metrics_outbound_failures_; // Only increments a counter
return; // No log, no callback, nothing visible
}
When the async TCP connect callback receives success=false, the only action is incrementing a metric counter. No log message is emitted at any level (not even DEBUG). The failure is
completely invisible to operators.
2. addnode RPC returns before connection completes
File: src/network/rpc_server.cpp:987-991
LOG_INFO("RPC addnode: calling connect_to() with Manual|NoBan flags");
auto result = network_manager_.connect_to(addr, ...);
LOG_INFO("RPC addnode: connect_to() returned result");
if (result != network::ConnectionResult::Success) {
return util::JsonError("Failed to connect to node");
}
// Returns {"success": true} — but connection hasn't completed yet!
connect_to() returns ConnectionResult::Success after initiating the async TCP connect, not after it completes. The RPC tells the user "success" when the connection may fail moments later
with no notification.
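The gap between "initiated" and "established" can be illustrated with a small asyncio model. This is a Python sketch of the pattern only, not the node's C++ code: `connect`, `addnode`, and the `stats` dictionary are stand-ins invented for illustration, and no real socket is opened.

```python
import asyncio


async def connect(reachable: bool) -> None:
    """Stand-in for an async TCP connect; fails when the peer is unreachable."""
    await asyncio.sleep(0)  # yield once, as a real connect would
    if not reachable:
        raise ConnectionRefusedError("nothing listening")


async def addnode(reachable: bool, stats: dict) -> dict:
    """Mimic the reported behaviour: reply before the connect finishes."""
    task = asyncio.create_task(connect(reachable))

    def on_done(t: asyncio.Task) -> None:
        if t.exception() is not None:
            stats["outbound_failures"] += 1  # counter only, nothing logged

    task.add_done_callback(on_done)
    stats["pending"] = task
    return {"success": True}  # reflects initiation, not the outcome


async def demo() -> tuple:
    stats = {"outbound_failures": 0}
    reply = await addnode(reachable=False, stats=stats)
    failures_at_reply = stats["outbound_failures"]
    await asyncio.wait([stats["pending"]])  # let the connect attempt finish
    return reply, failures_at_reply, stats["outbound_failures"]
```

Running `asyncio.run(demo())` shows the reply claiming success while the failure counter is still zero. The failure is recorded only afterwards, and only in the counter.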
3. No retry mechanism
addnode with command "add" initiates exactly one connection attempt. If the async connect fails, no retry is scheduled. The node permanently fails to connect to that peer.
4. Post-IBD stall timeout disabled
File: src/network/header_sync_manager.cpp
// ProcessTimers() — stall detection:
if (last_us > 0 && (now_us - last_us) > kHeadersSyncTimeoutUs
&& chainstate_manager_.IsInitialBlockDownload()) { // Only during IBD!
On regtest, nodes exit IBD after mining just 1 block (because nMinimumChainWork = 0). Post-IBD, there is no stall detection or recovery. If all peers disconnect or connections silently
fail, the node has no mechanism to detect the problem or retry.
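The effect of the IBD condition is easier to see when the check is transcribed into a small predicate. The sketch below is an illustrative Python translation of the condition quoted above, not the node's code. The `require_ibd` switch is hypothetical and exists only to compare the two behaviours, and the timeout is a parameter because the value of kHeadersSyncTimeoutUs is not given here.

```python
def stall_detected(last_us: int, now_us: int, timeout_us: int,
                   in_ibd: bool, require_ibd: bool = True) -> bool:
    """Mirror of the ProcessTimers() stall condition quoted above."""
    timed_out = last_us > 0 and (now_us - last_us) > timeout_us
    if require_ibd:
        # Current behaviour: the timeout only counts during IBD.
        return timed_out and in_ibd
    # Hypothetical variant: the timeout counts regardless of IBD state.
    return timed_out
```

With `require_ibd=True`, a node that has left IBD never reports a stall, no matter how long the peer has been silent.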
Proposed Fix
Priority 1: Log async connection failures (one-line fix)
File: src/network/connection_manager.cpp:1192-1196
if (!success || !connection_cb) {
    ++metrics_outbound_failures_;
    LOG_NET_INFO("Outbound connection to {}:{} failed asynchronously", address, port);
    return;
}
Priority 2: Retry logic for addnode connections
When addnode "add" initiates a connection and the async callback fails, schedule a retry (e.g., 3 attempts with exponential backoff). Currently a single async failure permanently prevents
the connection.
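The retry policy described above could look roughly like the following Python sketch. It is an illustration of the logic under stated assumptions, not a proposed patch: `try_connect` is a hypothetical callable that returns True once a connection is established, "3 attempts" is read as three attempts in total, and a real implementation in the node would schedule retries on its async timers rather than block in `sleep`.

```python
import time
from typing import Callable


def connect_with_retry(
    try_connect: Callable[[], bool],
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to connect up to max_attempts times, doubling the wait each time.

    try_connect is a hypothetical hook; it should return True only when
    the connection is actually established.
    """
    for attempt in range(max_attempts):
        if try_connect():
            return True
        if attempt < max_attempts - 1:
            # Waits of base, 2*base, 4*base, ... between attempts.
            sleep(base_delay_s * 2 ** attempt)
    return False
```

With the defaults this waits 1 s and then 2 s between the three attempts, and returns False if all of them fail so the caller can log or surface the error.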
Priority 3: Consider post-IBD stall detection
Add a lighter post-IBD check that detects when no outbound peers are connected for an extended period and triggers reconnection attempts.
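One possible shape for such a check is a small watchdog driven from a periodic timer. The Python sketch below is an assumption-laden illustration rather than a design for the node: the class name, the `reconnect` callback, and the `max_idle_s` threshold are all hypothetical.

```python
from typing import Callable, Optional


class OutboundPeerWatchdog:
    """Trigger a reconnect when there have been no outbound peers for too long."""

    def __init__(self, max_idle_s: float, reconnect: Callable[[], None]) -> None:
        self.max_idle_s = max_idle_s
        self.reconnect = reconnect  # hypothetical hook into the connection logic
        self._zero_since: Optional[float] = None

    def tick(self, outbound_peers: int, now_s: float) -> bool:
        """Call periodically; returns True when a reconnect was triggered."""
        if outbound_peers > 0:
            self._zero_since = None  # healthy again, reset the window
            return False
        if self._zero_since is None:
            self._zero_since = now_s  # first observation with zero peers
            return False
        if now_s - self._zero_since >= self.max_idle_s:
            self.reconnect()
            self._zero_since = now_s  # restart the window to avoid a reconnect storm
            return True
        return False
```

The check does not depend on IBD state, so it would also cover the post-IBD case in which all connections have silently failed.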
Impact
- Operator confusion: Node appears healthy (no errors) but has 0 peers and never syncs. No indication of what's wrong.
- Flaky test: feature_fork_resolution.py intermittently fails because Node1's connection to Node2 silently fails under system load. The test passed 5/5 in isolation but failed during the full 34-test suite run.
- Testnet/production: If a node's initial peer connections silently fail, it will mine blocks in isolation that are never relayed, wasting energy and diverging from the network.