IPC Socket drops connections under concurrent access (listen backlog overflow) #5
Description
Tested locally on the main branch at commit 285d9ad.
Multiple clients connecting to the unicityd Unix domain socket (node.sock) experience connection reset by peer and broken pipe errors under concurrent access. The kernel rejects
connections when the listen() backlog overflows because the single-threaded accept() loop can't drain the queue fast enough, especially when the miner is consuming CPU.
Measured failure rate: 2.5% (5 of 200 connections) against a live testnet node with mining active.
This bug is the root trigger for a cascading failure in the Finality Gadget Partition (FGP) cluster, where a failed IPC call causes the T1 timer to not reschedule, permanently stalling the
cluster.
Environment
- OS: Linux 6.17.0-19-generic
- unicity-node: testnet, mining active (~36 H/s, ~90% CPU)
- Clients: 3 FGP nodes querying same node.sock concurrently (Go net/http over Unix socket, no connection pooling)
Steps to Reproduce
Manual reproduction:
- Start the node and begin mining:
./build/bin/unicityd --testnet &
./build/bin/unicity-cli startmining
- Burst 20 simultaneous connections to the live socket (repeat 10 rounds):
python3 -c "
import socket, json, threading, time
SOCKET_PATH = '$HOME/.unicity/node.sock'

def call(cid, results):
    req = json.dumps({'jsonrpc':'2.0','method':'getbestblockhash','params':[],'id':1})
    http = f'POST / HTTP/1.1\r\nHost: localhost\r\nContent-Type: application/json\r\nContent-Length: {len(req)}\r\n\r\n{req}'
    try:
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.settimeout(5)
        s.connect(SOCKET_PATH)
        s.sendall(http.encode())
        s.recv(4096)
        s.close()
        results.append('OK')
    except Exception as e:
        results.append(str(e))

for r in range(10):
    barrier = threading.Barrier(20)
    results = []
    def worker(cid):
        barrier.wait()
        call(cid, results)
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(20)]
    for t in threads: t.start()
    for t in threads: t.join(10)
    ok = sum(1 for x in results if x == 'OK')
    fail = len(results) - ok
    print(f'Round {r+1}: {ok}/20 OK' + (f', {fail} FAILED' if fail else ''))
    time.sleep(0.3)
"
Automated reproduction:
A comprehensive test, bug_ipc_concurrent_connections.py, is available. Copy it into test/functional and run it:
cd test/functional
python3 bug_ipc_concurrent_connections.py
The test runs 5 phases:
- Single client baseline (should always pass)
- 3 concurrent FGP nodes x 4 calls (real deployment scenario)
- 6 concurrent clients x 4 calls (stress test)
- 10 simultaneous burst connections
- Live testnet node — 20 burst connections x 10 rounds against ~/.unicity/node.sock (requires running node with mining)
Phase 5 is where the bug reproduces. Example output:
Round 1: 18/20 OK, 2 FAILED
call 2: broken pipe
call 14: connection reset by peer
Round 8: 19/20 OK, 1 FAILED
call 13: broken pipe
Round 9: 18/20 OK, 2 FAILED
call 0: broken pipe
call 15: connection reset by peer
TOTAL: 195/200 OK (5 failed = 2.5%)
BUG CONFIRMED
Real-world reproduction (FGP cluster)
The FGP team consistently observed failures within 1-4 rounds:
time=18:24:40 level=INFO msg="new PoW block certified" fgp_round=1 pow_height=827 ✓
time=18:25:40 level=WARN msg="Leader failed to apply block state transition" ✗
err="...read unix @->/home/dmytro/.unicity/node.sock: read: connection reset by peer"
Root Cause
File: src/network/rpc_server.cpp:209
listen(server_fd_, 20); // Backlog of 20 — insufficient under mining CPU load
The RPC server architecture:
- Single-threaded accept loop (ServerThread() at line ~253) blocks on accept()
- Thread-per-request — spawns a detached std::thread for each connection (line ~279)
- Thread creation overhead delays the return to accept()
- Mining thread consumes ~90% CPU (RandomX), starving the accept loop via scheduler
- When the backlog fills, the kernel RSTs new connections before accept() picks them up
The application-level DoS check (MAX_CONCURRENT_REQUESTS = 10) at line ~267 is never reached — connections are rejected at the kernel level. The client sees connection reset by peer
instead of the graceful "Server busy" JSON error.
Evidence: The error is a kernel-level RST (ECONNRESET/EPIPE), not an application-level rejection. This only happens under concurrent access with mining active.
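The kernel-level rejection can be demonstrated in isolation with a throwaway AF_UNIX socket whose backlog is never drained. This is a minimal sketch, independent of unicityd: the path, backlog, and burst size are illustrative, and the exact errno a full backlog produces is Linux-specific.

```python
# Minimal demonstration (Linux-specific errno behaviour): with a tiny
# backlog and no accept() loop, burst connects to an AF_UNIX socket fail
# in the kernel, before any application code runs.
import os
import socket
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.sock")
server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(path)
server.listen(1)                  # tiny backlog; accept() is never called

ok, failed, clients = 0, 0, []
for _ in range(10):
    c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    c.setblocking(False)          # a full backlog then fails fast (EAGAIN)
    try:
        c.connect(path)
        ok += 1
        clients.append(c)
    except OSError:
        failed += 1
        c.close()

print(f"{ok} queued by the kernel, {failed} rejected with the backlog full")
for c in clients:
    c.close()
server.close()
```

The application never sees the rejected connections, which mirrors why the MAX_CONCURRENT_REQUESTS check is never reached.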
Proposed Fix
Priority 1: Increase listen backlog (one-line fix)
- if (listen(server_fd_, 20) < 0) {
+ if (listen(server_fd_, 128) < 0) {
128 (or SOMAXCONN) is standard for servers expecting concurrent connections and handles bursts from 10+ FGP validators.
Priority 2: Replace detached threads with thread pool
// Current (rpc_server.cpp:~279):
std::thread([this, client_fd]() {
    HandleClient(client_fd);
    close(client_fd);
    active_requests_--;
}).detach();
// Proposed: pre-allocated thread pool of 4-8 workers
// Eliminates thread creation latency, returns to accept() faster
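The pool-versus-detached-threads trade-off can be sketched in Python rather than C++. Names such as handle_client and the pool size of 4 are illustrative, not taken from rpc_server.cpp; the point is that handing the fd to a pre-started worker lets the listener return to accept() immediately.

```python
# Sketch of the Priority 2 pattern: a pre-allocated worker pool instead of
# one detached thread per connection. All names here are illustrative.
import os
import socket
import tempfile
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def handle_client(conn):
    # Worker: serve one connection, then close it.
    with conn:
        conn.sendall(conn.recv(1024).upper())

def serve(path, pool_size, stop):
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(128)            # larger backlog, per Priority 1
    server.settimeout(0.2)        # so the loop can notice `stop`
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        while not stop.is_set():
            try:
                conn, _ = server.accept()
            except socket.timeout:
                continue
            # submit() returns at once, so accept() is re-entered without
            # waiting for a new thread, unlike std::thread(...).detach().
            pool.submit(handle_client, conn)
    server.close()

def call(path, retries=200):
    # Client: retry until the server socket is bound and listening.
    for _ in range(retries):
        c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            c.connect(path)
        except OSError:
            c.close()
            time.sleep(0.01)
            continue
        with c:
            c.sendall(b"ping")
            return c.recv(1024)

path = os.path.join(tempfile.mkdtemp(), "pool.sock")
stop = threading.Event()
server_thread = threading.Thread(target=serve, args=(path, 4, stop))
server_thread.start()
results = []
clients = [threading.Thread(target=lambda: results.append(call(path)))
           for _ in range(20)]
for c in clients: c.start()
for c in clients: c.join()
stop.set()
server_thread.join()
print(sum(r == b"PING" for r in results), "of 20 served")
```

With a pool of 4 workers and a backlog of 128, a 20-connection burst queues in the kernel instead of being reset, which is the behaviour the fix aims for.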
Priority 3: Support HTTP keep-alive
Currently each RPC call opens a new connection. With keep-alive, FGP clients could reuse a single connection for all 6 calls per round — reducing connection count from ~18 to ~3 per round.
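A sketch of what keep-alive buys the client side, assuming both peers honour Connection: keep-alive. The tiny in-process server below exists only to make the example self-contained; it is not the unicityd RPC server, and everything except the JSON-RPC method names (taken from the repro script) is illustrative.

```python
# Sketch: several HTTP/1.1 JSON-RPC calls over ONE Unix-socket connection.
# The stand-in server lets the example run on its own.
import json
import os
import socket
import tempfile
import threading
import time

def read_message(conn):
    # Read one HTTP message (headers plus Content-Length body); None on EOF.
    buf = b""
    while b"\r\n\r\n" not in buf:
        chunk = conn.recv(4096)
        if not chunk:
            return None
        buf += chunk
    head, body = buf.split(b"\r\n\r\n", 1)
    length = 0
    for line in head.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value)
    while len(body) < length:
        body += conn.recv(4096)
    return body

def serve_keepalive(path, replies):
    # Accept one client and answer requests until it hangs up.
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(8)
    conn, _ = server.accept()
    with conn:
        while read_message(conn) is not None:   # connection is reused
            body = json.dumps({"jsonrpc": "2.0", "result": "ok", "id": 1})
            conn.sendall((
                "HTTP/1.1 200 OK\r\nConnection: keep-alive\r\n"
                f"Content-Length: {len(body)}\r\n\r\n{body}").encode())
            replies.append(1)
    server.close()

def rpc(conn, method):
    # One JSON-RPC call over an already-open connection.
    req = json.dumps({"jsonrpc": "2.0", "method": method,
                      "params": [], "id": 1})
    conn.sendall((
        "POST / HTTP/1.1\r\nHost: localhost\r\nConnection: keep-alive\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(req)}\r\n\r\n{req}").encode())
    return read_message(conn)  # response framing mirrors request framing

path = os.path.join(tempfile.mkdtemp(), "ka.sock")
replies = []
t = threading.Thread(target=serve_keepalive, args=(path, replies))
t.start()
client = None
for _ in range(200):              # wait for the server to bind and listen
    try:
        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        client.connect(path)
        break
    except OSError:
        client.close()
        time.sleep(0.01)
responses = [rpc(client, m) for m in
             ("getbestblockhash", "getblockcount", "getblockchaininfo")]
client.close()
t.join()
print(len(responses), "calls over 1 connection")
```

Three calls, one connect(): under burst load this is what keeps the pending-connection queue short regardless of the backlog value.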
Impact
- FGP clusters: A connection reset causes the leader's proposal to fail. Combined with a separate FGP bug, in which the T1 timer is not rescheduled after a failure (registered as finality-gadget#6, "IPC client has no timeout; hung PoW node blocks finality gadget indefinitely"), this permanently stalls the entire FGP cluster and requires a manual restart.
- Production: With 2.4-hour block intervals, a wasted round = 2.4 hours of delay.
- Scaling: With 10+ validators, burst load increases proportionally, making the issue worse.