IPC Socket drops connections under concurrent access (listen backlog overflow) #5
Description
Tested locally on the main branch at commit 285d9ad.
Multiple clients connecting to the unicityd Unix domain socket (node.sock) experience connection reset by peer and broken pipe errors under concurrent access. The kernel rejects
connections when the listen() backlog overflows because the single-threaded accept() loop can't drain the queue fast enough, especially when the miner is consuming CPU.
Measured failure rate: 2.5% (5 of 200 connections) against a live testnet node with mining active.
This bug is the root trigger for a cascading failure in the Finality Gadget Partition (FGP) cluster, where a failed IPC call causes the T1 timer to not reschedule, permanently stalling the
cluster.
Environment
- OS: Linux 6.17.0-19-generic
- unicity-node: testnet, mining active (~36 H/s, ~90% CPU)
- Clients: 3 FGP nodes querying same node.sock concurrently (Go net/http over Unix socket, no connection pooling)
Steps to Reproduce
Manual reproduction:
- Start the node and begin mining:
./build/bin/unicityd --testnet &
./build/bin/unicity-cli startmining
- Burst 20 simultaneous connections to the live socket (repeat 10 rounds):
python3 -c "
import socket, json, threading, time
SOCKET_PATH = '$HOME/.unicity/node.sock'

def call(cid, results):
    req = json.dumps({'jsonrpc':'2.0','method':'getbestblockhash','params':[],'id':1})
    http = f'POST / HTTP/1.1\r\nHost: localhost\r\nContent-Type: application/json\r\nContent-Length: {len(req)}\r\n\r\n{req}'
    try:
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.settimeout(5)
        s.connect(SOCKET_PATH)
        s.sendall(http.encode())
        s.recv(4096)
        s.close()
        results.append('OK')
    except Exception as e:
        results.append(str(e))

for r in range(10):
    barrier = threading.Barrier(20)
    results = []
    def worker(cid):
        barrier.wait()
        call(cid, results)
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(20)]
    for t in threads: t.start()
    for t in threads: t.join(10)
    ok = sum(1 for x in results if x == 'OK')
    fail = len(results) - ok
    print(f'Round {r+1}: {ok}/20 OK' + (f', {fail} FAILED' if fail else ''))
    time.sleep(0.3)
"
Automated reproduction:
A comprehensive test, bug_ipc_concurrent_connections.py, is available. Copy it into test/functional and run it:
cd test/functional
python3 bug_ipc_concurrent_connections.py
The test runs 5 phases:
- Single client baseline (should always pass)
- 3 concurrent FGP nodes x 4 calls (real deployment scenario)
- 6 concurrent clients x 4 calls (stress test)
- 10 simultaneous burst connections
- Live testnet node — 20 burst connections x 10 rounds against ~/.unicity/node.sock (requires running node with mining)
Phase 5 is where the bug reproduces. Example output:
Round 1: 18/20 OK, 2 FAILED
call 2: broken pipe
call 14: connection reset by peer
Round 8: 19/20 OK, 1 FAILED
call 13: broken pipe
Round 9: 18/20 OK, 2 FAILED
call 0: broken pipe
call 15: connection reset by peer
TOTAL: 195/200 OK (5 failed = 2.5%)
BUG CONFIRMED
Real-world reproduction (FGP cluster)
The FGP team consistently observed failures within 1-4 rounds:
time=18:24:40 level=INFO msg="new PoW block certified" fgp_round=1 pow_height=827 ✓
time=18:25:40 level=WARN msg="Leader failed to apply block state transition" ✗
err="...read unix @->/home/dmytro/.unicity/node.sock: read: connection reset by peer"
Root Cause
File: src/network/rpc_server.cpp:209
listen(server_fd_, 20); // Backlog of 20 — insufficient under mining CPU load
The RPC server architecture:
- Single-threaded accept loop (ServerThread() at line ~253) blocks on accept()
- Thread-per-request — spawns a detached std::thread for each connection (line ~279)
- Thread creation overhead delays the return to accept()
- Mining thread consumes ~90% CPU (RandomX), starving the accept loop via scheduler
- When the backlog fills, the kernel RSTs new connections before accept() picks them up
The application-level DoS check (MAX_CONCURRENT_REQUESTS = 10) at line ~267 is never reached — connections are rejected at the kernel level. The client sees connection reset by peer
instead of the graceful "Server busy" JSON error.
Evidence: The error is a kernel-level RST (ECONNRESET/EPIPE), not an application-level rejection. This only happens under concurrent access with mining active.
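The kernel-level rejection can be demonstrated in isolation with a throwaway AF_UNIX socket whose backlog is never drained. This is a minimal sketch, independent of unicityd: the path, backlog, and burst size are illustrative, and the exact errno a full backlog produces is Linux-specific.

```python
# Minimal demonstration (Linux-specific errno behaviour): with a tiny
# backlog and no accept() loop, burst connects to an AF_UNIX socket fail
# in the kernel, before any application code runs.
import os
import socket
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.sock")
server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(path)
server.listen(1)                  # tiny backlog; accept() is never called

ok, failed, clients = 0, 0, []
for _ in range(10):
    c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    c.setblocking(False)          # a full backlog then fails fast (EAGAIN)
    try:
        c.connect(path)
        ok += 1
        clients.append(c)
    except OSError:
        failed += 1
        c.close()

print(f"{ok} queued by the kernel, {failed} rejected with the backlog full")
for c in clients:
    c.close()
server.close()
```

The application never sees the rejected connections, which mirrors why the MAX_CONCURRENT_REQUESTS check is never reached.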
Proposed Fix
Priority 1: Increase listen backlog (one-line fix)
- if (listen(server_fd_, 20) < 0) {
+ if (listen(server_fd_, 128) < 0) {
128 (or SOMAXCONN) is standard for servers expecting concurrent connections and handles bursts from 10+ FGP validators.
Priority 2: Replace detached threads with thread pool
// Current (rpc_server.cpp:~279):
std::thread([this, client_fd]() {
    HandleClient(client_fd);
    close(client_fd);
    active_requests_--;
}).detach();
// Proposed: pre-allocated thread pool of 4-8 workers
// Eliminates thread creation latency, returns to accept() faster
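The pool-versus-detached-threads trade-off can be sketched in Python rather than C++. Names such as handle_client and the pool size of 4 are illustrative, not taken from rpc_server.cpp; the point is that handing the fd to a pre-started worker lets the listener return to accept() immediately.

```python
# Sketch of the Priority 2 pattern: a pre-allocated worker pool instead of
# one detached thread per connection. All names here are illustrative.
import os
import socket
import tempfile
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def handle_client(conn):
    # Worker: serve one connection, then close it.
    with conn:
        conn.sendall(conn.recv(1024).upper())

def serve(path, pool_size, stop):
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(128)            # larger backlog, per Priority 1
    server.settimeout(0.2)        # so the loop can notice `stop`
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        while not stop.is_set():
            try:
                conn, _ = server.accept()
            except socket.timeout:
                continue
            # submit() returns at once, so accept() is re-entered without
            # waiting for a new thread, unlike std::thread(...).detach().
            pool.submit(handle_client, conn)
    server.close()

def call(path, retries=200):
    # Client: retry until the server socket is bound and listening.
    for _ in range(retries):
        c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            c.connect(path)
        except OSError:
            c.close()
            time.sleep(0.01)
            continue
        with c:
            c.sendall(b"ping")
            return c.recv(1024)

path = os.path.join(tempfile.mkdtemp(), "pool.sock")
stop = threading.Event()
server_thread = threading.Thread(target=serve, args=(path, 4, stop))
server_thread.start()
results = []
clients = [threading.Thread(target=lambda: results.append(call(path)))
           for _ in range(20)]
for c in clients: c.start()
for c in clients: c.join()
stop.set()
server_thread.join()
print(sum(r == b"PING" for r in results), "of 20 served")
```

With a pool of 4 workers and a backlog of 128, a 20-connection burst queues in the kernel instead of being reset, which is the behaviour the fix aims for.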
Priority 3: Support HTTP keep-alive
Currently each RPC call opens a new connection. With keep-alive, FGP clients could reuse a single connection for all 6 calls per round — reducing connection count from ~18 to ~3 per round.
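A sketch of what keep-alive buys the client side, assuming both peers honour Connection: keep-alive. The tiny in-process server below exists only to make the example self-contained; it is not the unicityd RPC server, and everything except the JSON-RPC method names (taken from the repro script) is illustrative.

```python
# Sketch: several HTTP/1.1 JSON-RPC calls over ONE Unix-socket connection.
# The stand-in server lets the example run on its own.
import json
import os
import socket
import tempfile
import threading
import time

def read_message(conn):
    # Read one HTTP message (headers plus Content-Length body); None on EOF.
    buf = b""
    while b"\r\n\r\n" not in buf:
        chunk = conn.recv(4096)
        if not chunk:
            return None
        buf += chunk
    head, body = buf.split(b"\r\n\r\n", 1)
    length = 0
    for line in head.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value)
    while len(body) < length:
        body += conn.recv(4096)
    return body

def serve_keepalive(path, replies):
    # Accept one client and answer requests until it hangs up.
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(8)
    conn, _ = server.accept()
    with conn:
        while read_message(conn) is not None:   # connection is reused
            body = json.dumps({"jsonrpc": "2.0", "result": "ok", "id": 1})
            conn.sendall((
                "HTTP/1.1 200 OK\r\nConnection: keep-alive\r\n"
                f"Content-Length: {len(body)}\r\n\r\n{body}").encode())
            replies.append(1)
    server.close()

def rpc(conn, method):
    # One JSON-RPC call over an already-open connection.
    req = json.dumps({"jsonrpc": "2.0", "method": method,
                      "params": [], "id": 1})
    conn.sendall((
        "POST / HTTP/1.1\r\nHost: localhost\r\nConnection: keep-alive\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(req)}\r\n\r\n{req}").encode())
    return read_message(conn)  # response framing mirrors request framing

path = os.path.join(tempfile.mkdtemp(), "ka.sock")
replies = []
t = threading.Thread(target=serve_keepalive, args=(path, replies))
t.start()
client = None
for _ in range(200):              # wait for the server to bind and listen
    try:
        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        client.connect(path)
        break
    except OSError:
        client.close()
        time.sleep(0.01)
responses = [rpc(client, m) for m in
             ("getbestblockhash", "getblockcount", "getblockchaininfo")]
client.close()
t.join()
print(len(responses), "calls over 1 connection")
```

Three calls, one connect(): under burst load this is what keeps the pending-connection queue short regardless of the backlog value.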
Impact
- FGP clusters: A connection reset causes the leader's proposal to fail. Combined with a separate FGP bug, in which the T1 timer is not rescheduled after a failure (registered as finality-gadget#6, "IPC client has no timeout; hung PoW node blocks finality gadget indefinitely"), this permanently stalls the entire FGP cluster and requires a manual restart.
- Production: With 2.4-hour block intervals, a wasted round = 2.4 hours of delay.
- Scaling: With 10+ validators, burst load increases proportionally, making the issue worse.