Fix ping confusion #8344
Conversation
Force-pushed 40d111b to 1640fd5
Force-pushed 1640fd5 to 255a291
Force-pushed 255a291 to b18929a
Rebasing to run through CI again as the old logs are expired.

Unfortunately this PR did not have any effect on what I observed previously:
connectd/multiplex.c (outdated)

```c
 * *non-channel* messages (gossip, pings) either. So as a compromise,
 * we allow a handful of messages to be queued before we ignore the
 * peer until we've drained the outgoing queue. */
if (msg_queue_length(subd->outq) > 5) {
```
I think this code here is breaking the emergency_recover test. Previously each message was followed by an io_wait, but now we only io_wait after 5 messages have been queued. So while the messages stay in order, the timing expectations of the test are different.
@rustyrussell I think I was able to reproduce the issue where we respond to a different ping mentioned here in #8308. Here's how I set it up:
- 20 concurrent workers
- 5 pings per worker
- Different ping sizes
- Very short delays (1 - 10ms) between pings
So for each ping we expected the pong to be of the same (similar-ish) size. We received the corresponding pong (with the correct size) most of the time (98/100), but the other 2 times we hit a race condition: we received the pong for a separate ping, and CLN's protection did not kick in?
So the 3 scenarios were:
Clean Success:
- Worker 19.1 sends ping → CLN processes → correct pong returned
Race Condition:
- Worker 12.1 sends ping requesting 2000 bytes
- Worker X sends ping requesting 1000 bytes
- Pongs get mixed up
- Worker 12.1 gets Worker X's pong (1004 bytes instead of 2004)
Complete Rejection:
- Worker 1.2 sends ping command
- CLN's protection finally kicks in: "expecting_pong != PONG_UNEXPECTED"
- lightning-cli gets failure response
I think the issue could be that the check on line 1423 is not atomic, so multiple workers can pass this check?
```c
if (peer->expecting_pong != PONG_UNEXPECTED) {
	daemon_conn_send(daemon->master,
			 take(towire_connectd_ping_reply(NULL,
							 false, 0)));
	return;
}
```
And while the message queue itself is thread safe in `inject_peer_msg(peer, take(ping));`, I imagine multiple pings get queued to the same peer; since we don't track which ping belongs to which pong, do we just deliver the first pong that arrives to whoever is waiting?
No, that's fine. This is the "we're already sending a ping, refuse to send another" case.
I have also reproduced this issue. I cannot (and neither can ChatGPT) see where we would dispatch more than one ping at once. I will have to put some data into the pings to try to trace what is happening!
Found it. subd_req() is inappropriate here, since replies from connectd are NOT FIFO. See final commit.
Signed-off-by: Rusty Russell <[email protected]>
Force-pushed b18929a to 1289e69
One reason why ping processing could be slow is that, once we receive a message from the peer to send to a subdaemon, we don't listen for others until we've drained that subdaemon queue entirely. This can happen for reestablish: slow machines can take a while to set that subdaemon up. Signed-off-by: Rusty Russell <[email protected]>
We cannot use subd_req() here: replies will come out of order, and we should not simply assign the responses in FIFO order. Changelog-Fixed: lightningd: don't get confused with parallel ping commands. Signed-off-by: Rusty Russell <[email protected]>
Force-pushed 1289e69 to 19c174a
I can report that it is working great now, good job rusty!
First patch was to help me test pings across the network. But the real fix is the final one: we would get confused with sending out multiple pings.
See-also: #7742
See-also: #8308