
Silent outbound connection failure — addnode returns success but connection never established #6

@b3y0urs3lf

Description

Tested locally on the main branch at commit 285d9ad.

Summary

When a node initiates an outbound connection via the addnode RPC, the RPC returns {"success": true} immediately — before the TCP connection actually completes. If the async connect
subsequently fails (due to system load, unreachable peer, timing), the failure is completely silent: no log message at any level, no retry attempt, no error surfaced to the user. The node
appears healthy but has 0 peers and never syncs.

Environment

  • OS: Linux 6.17.0-19-generic
  • unicity-node: regtest and testnet

Steps to Reproduce

Manual reproduction

  1. Start a node
    ./build/bin/unicityd --regtest &

  2. Connect to a port where nothing is listening
    ./build/bin/unicity-cli --regtest addnode 127.0.0.1:99999 add

  3. Observe: RPC returns success!
    Output: {"success": true, "message": "Connection initiated to 127.0.0.1:99999"}

  4. Wait a few seconds, then check peers
    sleep 3
    ./build/bin/unicity-cli --regtest getconnectioncount
    Output: 0 (no peers connected)

  5. Check debug.log for any error about the failed connection
    grep -iE "fail|error|connect" ~/.unicity/regtest/debug.log
    Nothing — the failure is completely invisible

Automated reproduction

A comprehensive test, bug_silent_connection_failure.py, is available:

    cd test/functional
    # copy bug_silent_connection_failure.py into this directory
    python3 bug_silent_connection_failure.py

The test runs 3 phases:

  1. addnode to unreachable port — confirms RPC returns success, node has 0 peers, no error in debug.log
  2. 3-node fork resolution under CPU load — reproduces the intermittent sync failure from feature_fork_resolution.py (Node1 at height 10 connects to Node2 at height 15 but connection
    silently fails, Node1 never syncs)
  3. Post-IBD stall timeout verification — confirms the stall timeout in ProcessTimers() is disabled post-IBD, leaving no recovery mechanism

Example output:

=== Test 1: addnode reports success before connection completes ===

BUG CONFIRMED: addnode returned success for unreachable address
Result: {'success': true, 'message': 'Connection initiated to 127.0.0.1:54139'}

Node0 connections after 3s: 0
BUG CONFIRMED: Node has 0 peers but no error was reported

BUG CONFIRMED: No failure message in debug.log at any level
The connection failure is completely invisible.

Root Cause

Four issues combine to create this bug:

1. Async connect failure is silent

File: src/network/connection_manager.cpp:1192-1196

if (!success || !connection_cb) {
    // Connection failed - no peer created, no ID allocated
    ++metrics_outbound_failures_;  // Only increments a counter
    return;                        // No log, no callback, nothing visible
}

When the async TCP connect callback receives success=false, the only action is incrementing a metric counter. No log message is emitted at any level (not even DEBUG). The failure is
completely invisible to operators.

2. addnode RPC returns before connection completes

File: src/network/rpc_server.cpp:987-991

LOG_INFO("RPC addnode: calling connect_to() with Manual|NoBan flags");
auto result = network_manager_.connect_to(addr, ...);
LOG_INFO("RPC addnode: connect_to() returned result");
if (result != network::ConnectionResult::Success) {
    return util::JsonError("Failed to connect to node");
}
// Returns {"success": true} — but the connection hasn't completed yet!

connect_to() returns ConnectionResult::Success after initiating the async TCP connect, not after it completes. The RPC tells the user "success" when the connection may fail moments later
with no notification.

3. No retry mechanism

addnode with command "add" initiates exactly one connection attempt. If the async connect fails, no retry is scheduled. The node permanently fails to connect to that peer.

4. Post-IBD stall timeout disabled

File: src/network/header_sync_manager.cpp

// ProcessTimers() — stall detection:
if (last_us > 0 && (now_us - last_us) > kHeadersSyncTimeoutUs
&& chainstate_manager_.IsInitialBlockDownload()) { // Only during IBD!

On regtest, nodes exit IBD after mining just 1 block (because nMinimumChainWork = 0). Post-IBD, there is no stall detection or recovery. If all peers disconnect or connections silently
fail, the node has no mechanism to detect the problem or retry.

Proposed Fix

Priority 1: Log async connection failures (one-line fix)

File: src/network/connection_manager.cpp:1192-1196

if (!success || !connection_cb) {
    ++metrics_outbound_failures_;
    LOG_NET_INFO("Outbound connection to {}:{} failed asynchronously", address, port);
    return;
}

Priority 2: Retry logic for addnode connections

When addnode "add" initiates a connection and the async callback fails, schedule a retry (e.g., 3 attempts with exponential backoff). Currently a single async failure permanently prevents
the connection.
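One possible shape for that retry policy, sketched in Python for clarity (the attempt count, delays, and the try_connect callable are assumptions, not existing unicity-node API):

```python
import time
from typing import Callable

def backoff_delays(attempts: int = 3, base: float = 1.0, factor: float = 2.0) -> list:
    """Delay in seconds before each retry: 1s, 2s, 4s with the defaults."""
    return [base * factor ** i for i in range(attempts)]

def connect_with_retry(try_connect: Callable[[], bool],
                       attempts: int = 3,
                       sleep=time.sleep) -> bool:
    """Call try_connect up to `attempts` times, backing off between failures.

    try_connect stands in for the async-connect-and-wait-for-result step: it
    returns True on an established connection, False on an async failure.
    """
    for delay in backoff_delays(attempts):
        if try_connect():
            return True
        sleep(delay)  # exponential backoff before the next attempt
    return False
```

The key property is that a single transient failure (e.g. under CPU load during a test run) no longer permanently loses the peer.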

Priority 3: Consider post-IBD stall detection

Add a lighter post-IBD check that detects when no outbound peers are connected for an extended period and triggers reconnection attempts.
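A lighter check could look something like the following sketch (the class name, grace period, and how it would be wired into ProcessTimers() are all assumptions):

```python
import time

class OutboundPeerWatchdog:
    """Fires when the node has had zero outbound peers for longer than grace_s.

    Unlike the IBD-gated headers-sync timeout, this runs for the whole
    lifetime of the node, so silently failed connections get noticed.
    """
    def __init__(self, grace_s: float = 60.0, now=time.monotonic):
        self._grace_s = grace_s
        self._now = now                    # injectable clock, for testing
        self._last_seen_peer = now()

    def check(self, outbound_peer_count: int) -> bool:
        """Return True if reconnection attempts should be triggered."""
        now = self._now()
        if outbound_peer_count > 0:
            self._last_seen_peer = now     # healthy: reset the timer
            return False
        return (now - self._last_seen_peer) > self._grace_s
```

Called periodically from the timer loop, check() would trigger reconnection to addnode peers (or DNS seeds) after an extended zero-peer period.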

Impact

- Operator confusion: Node appears healthy (no errors) but has 0 peers and never syncs. No indication of what's wrong.
- Flaky test: feature_fork_resolution.py intermittently fails because Node1's connection to Node2 silently fails under system load. The test passed 5/5 in isolation but failed during the full 34-test suite run.

- Testnet/production: If a node's initial peer connections silently fail, it will mine blocks in isolation that are never relayed, wasting energy and diverging from the network.

Metadata

  • Labels: bug (Something isn't working)
  • Status: Todo
  • Milestone: none
  • Development: no branches or pull requests
Issue actions