Correct `msgs_sent_since_pong` tracking for gossip forwards #4093
Conversation
When we forward gossip messages, we use the already-encoded copy we have, pushing it onto the `gossip_broadcast_buffer`. When we have free socket buffer space and there are no non-gossip messages pending, we remove it from the `gossip_broadcast_buffer` and push it onto the normal `pending_outbound_buffer` socket queue.

Separately, we use `msgs_sent_since_pong` to ensure that our peer is being responsive and our messages are getting through with minimal latency. After we send 32 messages (if we've already received the `pong` for the last `ping` we sent), we send an extra `ping` and will only queue up an additional 32 forwarded gossip messages before we start dropping forwarded gossip on the floor.

Previously, we (arguably) incorrectly incremented `msgs_sent_since_pong` when we pushed a message onto the `gossip_broadcast_buffer`, not when it was actually queued up in our `pending_outbound_buffer` immediate socket queue. This is fairly strange: the point of `msgs_sent_since_pong` is to keep peer latency low, and we already independently enforce memory limits to ensure we drop forwarded gossip messages if a peer's message queues have grown too large. This means we could disconnect a peer for not draining its backlog of forwarded gossip from the socket queue fast enough, rather than simply dropping gossip.

While this shouldn't be a big deal in practice, I do see (mostly Tor) peers disconnect on my node occasionally, and while it's likely due to Tor circuits hanging and a disconnect is needed, the incorrect tracking here makes state analysis difficult, as there is nearly always a flow of enough gossip messages to "send to the peer". It's also possible that this change allows incredibly low-bandwidth connections to stay afloat more durably.
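To make the fix concrete, here is a minimal sketch (not the actual LDK code) of the two-queue flow described above. The struct and field names mirror the PR's description; the `Peer` type, method names, and the `MSGS_PER_PING` constant are illustrative assumptions. The key point is that `msgs_sent_since_pong` is incremented when a message moves onto the `pending_outbound_buffer` socket queue, not when it first lands in the `gossip_broadcast_buffer`:

```rust
use std::collections::VecDeque;

// Illustrative threshold: after this many queued-to-socket messages we'd
// send an extra ping, and at double it we drop forwarded gossip.
const MSGS_PER_PING: usize = 32;

struct Peer {
    // Pre-encoded gossip awaiting free socket buffer space.
    gossip_broadcast_buffer: VecDeque<Vec<u8>>,
    // The immediate socket queue.
    pending_outbound_buffer: VecDeque<Vec<u8>>,
    msgs_sent_since_pong: usize,
}

impl Peer {
    fn new() -> Self {
        Peer {
            gossip_broadcast_buffer: VecDeque::new(),
            pending_outbound_buffer: VecDeque::new(),
            msgs_sent_since_pong: 0,
        }
    }

    /// Queue a pre-encoded gossip message for forwarding. Note that the
    /// counter is NOT incremented here -- doing so was the bug being fixed.
    /// If the peer is too far behind on pongs, drop the gossip instead.
    fn forward_broadcast_msg(&mut self, encoded: Vec<u8>) {
        if self.msgs_sent_since_pong >= 2 * MSGS_PER_PING {
            return; // dropped on the floor rather than queued
        }
        self.gossip_broadcast_buffer.push_back(encoded);
    }

    /// When the socket has room and no non-gossip messages are pending,
    /// move one gossip message onto the real socket queue. The counter is
    /// incremented here, when the message is actually queued to be sent.
    fn maybe_flush_one_gossip_msg(&mut self) {
        if let Some(msg) = self.gossip_broadcast_buffer.pop_front() {
            self.pending_outbound_buffer.push_back(msg);
            self.msgs_sent_since_pong += 1;
        }
    }
}

fn main() {
    let mut peer = Peer::new();
    for _ in 0..5 {
        peer.forward_broadcast_msg(vec![0u8; 16]);
    }
    // Nothing has reached the socket queue yet, so the counter stays at 0.
    assert_eq!(peer.msgs_sent_since_pong, 0);
    peer.maybe_flush_one_gossip_msg();
    assert_eq!(peer.msgs_sent_since_pong, 1);
    assert_eq!(peer.pending_outbound_buffer.len(), 1);
    println!("counter after one flush: {}", peer.msgs_sent_since_pong);
}
```

With the old placement, the five `forward_broadcast_msg` calls alone would have pushed the counter toward the ping threshold even though nothing had been handed to the socket yet, which is exactly the state-analysis confusion the PR describes.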
👋 I see @tankyleo was un-assigned.
Codecov Report

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #4093      +/-   ##
==========================================
- Coverage   88.61%   88.60%   -0.02%
==========================================
  Files         176      176
  Lines      132099   132120      +21
==========================================
+ Hits       117060   117065       +5
- Misses      12365    12386      +21
+ Partials     2674     2669       -5
==========================================
tnull
left a comment
Makes sense, simple enough, landing.
Do you happen to recall why we originally incremented it when adding to the `gossip_broadcast_buffer`?
I do not. I assume it was just convenient: the message is now in a "message queue", so it looks kinda like we should increment the "messages queued" counter.