
Fix ping confusion #8344

Merged
rustyrussell merged 3 commits into ElementsProject:master from rustyrussell:guilt/debug-ping on Aug 14, 2025

Conversation

@rustyrussell
Contributor

@rustyrussell rustyrussell commented Jun 11, 2025

First patch was to help me test pings across the network. But the real fix is the final one: we would get confused with sending out multiple pings.

See-also: #7742
See-also: #8308

@rustyrussell rustyrussell added this to the v25.09 milestone Jun 11, 2025
@madelinevibes madelinevibes added the Status::Ready for Review and PLEASE clear CI 🫠 labels Aug 6, 2025
@endothermicdev endothermicdev self-requested a review August 8, 2025 19:14
@endothermicdev
Collaborator

Rebasing to run through CI again as the old logs are expired.

@daywalker90
Collaborator

Unfortunately this PR did not have any effect on what I observed previously:
CLN ping times get high very quickly, while the control group (plain TCP handshakes) stays fast.

* *non-channel* messages (gossip, pings) either. So as a compromise,
* we allow a handful of messages to be queued before we ignore the
* peer until we've drained the outgoing queue. */
if (msg_queue_length(subd->outq) > 5) {
Collaborator

I think this code here is breaking the emergency_recover test. Previously each message was followed by an io_wait, but now we only io_wait after 5 messages have been queued. So while the messages stay in order, the timing expectations of the test are different.

Collaborator

@sangbida sangbida Aug 10, 2025

@rustyrussell I think I was able to reproduce the issue mentioned in #8308 where we respond to a different ping. Here's how I set it up:

  • 20 concurrent workers
  • 5 pings per worker
  • Different ping sizes
  • Very short delays (1 - 10ms) between pings

So for each ping we expected the pong to be of the same (similar-ish) size. We received the corresponding pong (with the correct size) most of the time (98/100), but in the other 2 cases we hit a race condition where we received the pong for a separate ping and CLN's protection did not kick in.

So the 3 scenarios were:

Clean Success:

  • Worker 19.1 sends ping → CLN processes → correct pong returned

Race Condition:

  • Worker 12.1 sends ping requesting 2000 bytes
  • Worker X sends ping requesting 1000 bytes
  • Pongs get mixed up
  • Worker 12.1 gets Worker X's pong (1004 bytes instead of 2004)

Complete Rejection:

  • Worker 1.2 sends ping command
  • CLN's protection finally kicks in: "expecting_pong != PONG_UNEXPECTED"
  • lightning-cli gets failure response

I think the issue could be that the check on line 1423 is not atomic, so multiple workers can pass this check?

	if (peer->expecting_pong != PONG_UNEXPECTED) {
		daemon_conn_send(daemon->master,
				 take(towire_connectd_ping_reply(NULL,
								 false, 0)));
		return;
	}

And while the message queue itself is thread-safe in inject_peer_msg(peer, take(ping)), I imagine multiple pings get queued to the same peer; since we don't track which ping belongs to which pong, do we just deliver the first pong that arrives to whoever is waiting?

Contributor Author

No, that's fine. This is the "we're already sending a ping, refuse to send another" case.

I have also reproduced this issue. I cannot (and neither can ChatGPT) see where we would dispatch more than one ping at once. I will have to put some data into the pings to try to trace what is happening!

Contributor Author

Found it. subd_req() is inappropriate here, since replies from connectd are NOT FIFO. See final commit.

One reason why ping processing could be slow is that, once we receive
a message from the peer to send to a subdaemon, we don't listen for
others until we've drained that subdaemon queue entirely.

This can happen during reestablish: slow machines can take a while to
set that subdaemon up.

Signed-off-by: Rusty Russell <[email protected]>
We cannot use subd_req() here: replies can come out of order, and we
should not simply assign the responses in FIFO order.

Changelog-Fixed: lightningd: don't get confused with parallel ping commands.
Signed-off-by: Rusty Russell <[email protected]>
@rustyrussell rustyrussell changed the title Reduce ping latency Fix ping confusion Aug 14, 2025
@daywalker90
Collaborator

I can report that it is working well now, good job Rusty!

037310ee65744f9c4a37cba7dae742a881f7bcc57f6be70b69a437860ffec410ee@170.75.166.48:9735                                 ln-ping:   588ms tcp-ping: true    110ms
0251c1c3c167ff1bcdaef55cbfb61cf49f1c0212e3122aaf9493aa3d06351cdea4@93.46.54.4:9736                                    ln-ping:    44ms tcp-ping: true     47ms
0202e7727ee46d943a44c7bf0dce14220a839630dee67d03446bf935178b55a0df@65.21.175.152:9735                                 ln-ping:    33ms tcp-ping: true     34ms
03b0c12adf10591e05deb3afd150dec9d123ac7238f30bd3f1dddeec2c35f6dbf8@216.219.95.22:9735                                 ln-ping:   478ms tcp-ping: true    164ms
032312e5e15e89211df36e473d76af60672e3efebfe42a9b2113acbb456050b502@16.170.28.230:9735                                 ln-ping:    25ms tcp-ping: true     27ms
02fc6bf390208dda96f489491b5c79830c39dfc07c998cc5f469ce3801cd01c5b2@54.161.230.113:9735                                ln-ping:    95ms tcp-ping: true     97ms
035736692722a47389d6be965d73a50b1ba63233a1b975aa55db7f90446f5708cb@91.205.175.156:9735                                ln-ping:   312ms tcp-ping: true     26ms
03d6f80df785288de2fe5de19f24ba8a1db3d20647a88d0a903be9de3e7bb8fce1@146.19.143.201:62143                               ln-ping:   237ms tcp-ping: true    241ms
025f1456582e70c4c06b61d5c8ed3ce229e6d0db538be337a2dc6d163b0ebc05a5@52.86.210.65:9735                                  ln-ping:    96ms tcp-ping: true     96ms
020003b9499a97c8dfbbab6b196319db37ba9c37bccb60477f3c867175f417988e@143.202.162.130:9735                               ln-ping:   185ms tcp-ping: true    181ms

@rustyrussell rustyrussell enabled auto-merge (rebase) August 14, 2025 06:54
@rustyrussell rustyrussell removed the Status::Ready for Review and PLEASE clear CI 🫠 labels Aug 14, 2025
@rustyrussell rustyrussell merged commit 5f54403 into ElementsProject:master Aug 14, 2025
75 of 78 checks passed
Collaborator

@sangbida sangbida left a comment

Nice work Rusty!
