Improve access control in peer connections#9880
Improve access control in peer connections#9880Roasbeef merged 17 commits intolightningnetwork:masterfrom
Conversation
|
Important Review skippedAuto reviews are limited to specific labels. 🏷️ Labels to auto review (1)
Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
c82c790 to
3b089cf
Compare
|
@ziggie1984: review reminder |
|
@Crypt-iQ could you take a look at this new design where we remove outbound restrictions, are there some downsides with the current LND codebase which we do not see, referring to your comment here: |
ziggie1984
left a comment
There was a problem hiding this comment.
Thank you for working on this, left some comments.
I think we also need to think about the two functions:
notifyOpenChannelPeerEventnotifyPendingOpenChannelPeerEvent
These are now coupled with the accessManager however I think other subsystems are not aware that these functions can now fail when peers are not connected. So I would highly recommend making those calls independent for the accessMan.
Example: THe funding manager calls the newOpenchannel notification when a channel hits the required amount of confirmations. Now in case the peer is disconnected we won't notify all the dependent services that a new channel opened. And the funding manager will not send it again. So I think we should not couple those two things.
| // peer. | ||
| acsmLog.DebugS(ctx, "Peer has no channels, assigning restricted access") | ||
|
|
||
| // If this is an existing peer, there's no need to check for slot limit. |
There was a problem hiding this comment.
why do we call this after the disconnect check, I think it makes sense to do it before ?
There was a problem hiding this comment.
if the peer is messing with us we should disconnect?
There was a problem hiding this comment.
Yeah I was more looking into the shouldDisconnect logic and saw there that we are already do not disconnect if we have a channel so yeah maybe it is better for the accessMan not assume stuff happening in other subsystems:
Line 3710 in 31c74f2
Just wanted to bring it up.
There was a problem hiding this comment.
it is better for the accessMan not assume stuff happening in other subsystems
yeah good thought - that's exactly the decoupling we are aiming for, which means we will need a design doc for this, to explicitly picture how these subsystems interact with each other.
server.go
Outdated
| nodeKeyECDH, listenAddr.String(), | ||
| // TODO(yy): remove this check and unify the inbound | ||
| // connection check inside `InboundPeerConnected`. | ||
| s.peerAccessMan.checkIncomingConnBanScore, |
There was a problem hiding this comment.
@Crypt-iQ what was our intention again to put this check earlier into the HandShake method rather than into the InboundPeerConnected ?
| // | ||
| // TODO(yy): Consider perform this check in | ||
| // `peerAccessMan.addPeerAccess`. | ||
| access, err := s.peerAccessMan.assignPeerPerms(pubKey) |
There was a problem hiding this comment.
I think we need to unify addPeerAccess and assignPeerPerms because peerConnected can be called concurrently for many peers and we should increment the restricted Peers number if we are connecting the peer otherwise because we need to hold thebanMutex it can be possible that we allow more restricted peers than possible.
2 peers call peerConnected both of these calls succeed with assignPeerPerms before one of them has called addPeerAccess to increase the counter of the restictive peer.
There was a problem hiding this comment.
I think whenever we review PRs it's good to keep in mind of two things - first, what is the scope of the PR, which the author usually states it in the PR description. Second, what is the issue identified - is it preexisting, or it's introduced by this PR. If a new PR introduced a buggy behavior, that def needs to be addressed. If it's preexisting, then there's no need to fix everything in one PR, as long as the original scope is achieved. This is to make sure we move forward with progress, and not putting all the task under one author.
As for this issue, I think it's preexisting, otherwise I won't bother creating that TODO. In fact, since we are now moving the assignPeerPerms here instead of outside of peerConnected, it already makes the situation better.
There was a problem hiding this comment.
Agree we should not fix it in this PR. But I think reviews like this are still a good forum to discuss potential problems of the current code. I was not aware of this race condition beforehand that's why I brought it up and wanted to get your opinion here as well.
There was a problem hiding this comment.
Yeah def - we need to bring awareness so it's good to discuss the potential issue. Tho based on offline discussion it seems to me that this is a blocking comment, which is why I brought it up.
Cool think this is now fixed. |
This flag is added so we can use it in the itest to mimic racing inbound and outbound connections.
When a peer already has a connection with us, there's no need to check for available slots as we will either close the old conn or refuse the new conn.
We need to also check this map to make sure the peer exists or not.
When a peer is already existing, we should skip incrementing the `numRestricted` count. Also patched unit test for method `addPeerAccess`.
ziggie1984
left a comment
There was a problem hiding this comment.
Just have a final question otherwise it is gtg:
- Should we remove the Peer from the PeerCount map in case it has not open/pending channel. Otherwise Peers will always be able to reconnect as inbound although we disconnected to them ?
| // peer. | ||
| acsmLog.DebugS(ctx, "Peer has no channels, assigning restricted access") | ||
|
|
||
| // If this is an existing peer, there's no need to check for slot limit. |
There was a problem hiding this comment.
Yeah I was more looking into the shouldDisconnect logic and saw there that we are already do not disconnect if we have a channel so yeah maybe it is better for the accessMan not assume stuff happening in other subsystems:
Line 3710 in 31c74f2
Just wanted to bring it up.
| } | ||
|
|
||
| // Assigning to peer1 should not return an error. | ||
| status, err := a.assignPeerPerms(peer1) |
There was a problem hiding this comment.
Ok now I understand it, but shouldn't we make sure we remove peers from the peerCount as well which have no channels in case we remove the peer in removePeerAccess.
Otherwise all peers we ever connected outbound wise will have access when connecting inbound ?
| // | ||
| // TODO(yy): Consider perform this check in | ||
| // `peerAccessMan.addPeerAccess`. | ||
| access, err := s.peerAccessMan.assignPeerPerms(pubKey) |
There was a problem hiding this comment.
Agree we should not fix it in this PR. But I think reviews like this are still a good forum to discuss potential problems of the current code. I was not aware of this race condition beforehand that's why I brought it up and wanted to get your opinion here as well.
| } | ||
|
|
||
| // Assigning to peer1 should not return an error. | ||
| status, err := a.assignPeerPerms(peer1) |
There was a problem hiding this comment.
IIUC, we remove a peer's score when they disconnect:
Lines 4960 to 4961 in 168d63c
So if we connect to a peer that we have no channels with, then we disconnect and they reconnect inbound, they'll start with a blank score.
If there's an error occured when updating the peer's status after the channel status is changed, we now make sure we log the error instead of letting it interrupt the channel open/close flow.
Yeah good catch - added a commit plus some tests to remove it, but only if the peer has no channels with us. |
ziggie1984
left a comment
There was a problem hiding this comment.
LGTM 🥽
sample.conf would need an update now as well which currently states:
; The number of restricted slots the server will allocate for peers.
; num-restricted-slots=100
Something along the lines like:
This restriction only counts for inbound however outbound are still counted as restricted, so you can have more than 100 restricted peers, but not more inbound peers if the number exceeded the max-restricted.
|
|
||
| // numRestrictedExpected specifies the final value to expect once the | ||
| // test finishes. | ||
| var numRestrictedExpected int |
There was a problem hiding this comment.
Nit: Maybe start from 1 here so we do not decrement to -1 ?
There was a problem hiding this comment.
yeah i kinda wanna show that this num can be negative as we need underflow protection.
We now remove the peer from `peerChanInfo` if this peer doesn't have channels with us. Also patched a unit test for `removePeerAccess`.
@ziggie1984 Updated - note that outbound connections are not restricted, so they should not increment the slots count |
Roasbeef
left a comment
There was a problem hiding this comment.
Reviewed 2 of 11 files at r1, 9 of 9 files at r2, 8 of 11 files at r5, 5 of 5 files at r6, all commit messages.
Reviewable status: all files reviewed, 26 unresolved discussions (waiting on @Crypt-iQ, @yyforyongyu, and @ziggie1984)
|
Accidentally dismissed the existing approval, merging bsaed on that. |
We now remove the restriction placed on outbound connections - previously we would account them for
numRestricted, which is no longer the case, as for user-initiated connections, it should not be put under the same restriction as the inbound ones - we may choose to eventually put a restriction on the outbound too, but it should be a very different category from the inbound.In addition more tests are patched to fix the case that an existing connection may be counted multiple times, causing the
numRestrictedto be falsely reached.Depends on,
TODOs:
This change is