Fix ping confusion #8344
Conversation
Force-pushed 40d111b to 1640fd5
Force-pushed 1640fd5 to 255a291
Force-pushed 255a291 to b18929a
Rebasing to run through CI again as the old logs are expired.

Unfortunately this PR did not have any effect on what I observed previously:
connectd/multiplex.c (outdated)

```c
 * *non-channel* messages (gossip, pings) either. So as a compromise,
 * we allow a handful of messages to be queued before we ignore the
 * peer until we've drained the outgoing queue. */
if (msg_queue_length(subd->outq) > 5) {
```
I think this code here is breaking the emergency_recover test. Previously each message was followed by an io_wait, but now we only io_wait after 5 messages have been queued. So while the messages stay in order, the timing expectations of the test are different.
@rustyrussell I think I was able to reproduce the issue where we respond to a different ping mentioned here in #8308. Here's how I set it up:
- 20 concurrent workers
- 5 pings per worker
- Different ping sizes
- Very short delays (1 - 10ms) between pings
So for each ping we expected the pong to be of the same (similar-ish) size. We received the corresponding pong (with the correct size) most of the time (98/100), but the other 2 times we hit a race condition: we received the pong for a separate ping, and CLN's protection did not kick in?
So the 3 scenarios were:
Clean Success:
- Worker 19.1 sends ping → CLN processes → correct pong returned
Race Condition:
- Worker 12.1 sends ping requesting 2000 bytes
- Worker X sends ping requesting 1000 bytes
- Pongs get mixed up
- Worker 12.1 gets Worker X's pong (1004 bytes instead of 2004)
Complete Rejection:
- Worker 1.2 sends ping command
- CLN's protection finally kicks in: "expecting_pong != PONG_UNEXPECTED"
- lightning-cli gets failure response
I think the issue could be that the check on line 1423 is not atomic, so multiple workers can pass this check?
```c
if (peer->expecting_pong != PONG_UNEXPECTED) {
	daemon_conn_send(daemon->master,
			 take(towire_connectd_ping_reply(NULL,
							 false, 0)));
	return;
}
```
And while the message queue itself is thread safe in `inject_peer_msg(peer, take(ping));`, I imagine multiple pings get queued to the same peer; since we don't track which ping belongs to which pong, do we just deliver the first pong that arrives to whoever is waiting?
No, that's fine. This is the "we're already sending a ping, refuse to send another" case.
I have also reproduced this issue. I cannot (and neither can ChatGPT) see where we would dispatch more than one ping at once. I will have to put some data into the pings to try to trace what is happening!
Found it. subd_req() is inappropriate here, since replies from connectd are NOT FIFO. See final commit.
Signed-off-by: Rusty Russell <[email protected]>
Force-pushed b18929a to 1289e69
One reason why ping processing could be slow is that, once we receive a message from the peer to send to a subdaemon, we don't listen for others until we've drained that subdaemon queue entirely. This can happen for reestablish: slow machines can take a while to set that subdaemon up. Signed-off-by: Rusty Russell <[email protected]>
We cannot use subd_req() here: replies will come out of order, and we should not simply assign the responses in FIFO order. Changelog-Fixed: lightningd: don't get confused with parallel ping commands. Signed-off-by: Rusty Russell <[email protected]>
Force-pushed 1289e69 to 19c174a
I can report that it is working great now, good job rusty!
First patch was to help me test pings across the network. But the real fix is the final one: we would get confused with sending out multiple pings.
See-also: #7742
See-also: #8308