SWIM membership: failure detector, gossip dissemination, UDP transport #34
Merged
farhan-syah merged 5 commits into main on Apr 15, 2026
Add the core SWIM/Lifeguard failure detection runtime to nodedb-cluster. The detector drives a tokio::select! probe loop that separates I/O from logic via a Transport trait, allowing deterministic in-process testing with InMemoryTransport and edge-drop/partition injection.

- FailureDetector: main runtime task owning the probe loop, suspicion timer, and inflight-probe registry
- ProbeRound: Lifeguard ping → ping-req indirect probe sequence with per-probe-id oneshot correlation
- ProbeScheduler: random-permutation epoch scheduler (every peer probed once per epoch, per Lifeguard §4.3)
- SuspicionTimer: timeout math per Lifeguard §3.1 (max(min, mult * log2(n) * probe_interval)) with drain_expired polling
- Transport / InMemoryTransport / TransportFabric: Send+Sync async trait with mpsc-backed in-memory impl for unit tests
- SwimError variants TransportClosed and ProbeInflightOverflow

async-trait added to nodedb-cluster dependencies to support the Transport trait definition.
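The suspicion-timeout formula and the once-per-epoch probe order described in this commit can be sketched in plain Rust. This is a minimal illustration, not the code from the PR: `suspicion_timeout`, `EpochScheduler`, the `u32` peer ids, and the small xorshift generator (used only to avoid an external RNG crate) are all assumptions made for the example.

```rust
use std::time::Duration;

/// Suspicion timeout: max(min, mult * log2(n) * probe_interval).
/// Parameter names are illustrative; `mult` is assumed non-negative.
pub fn suspicion_timeout(
    min_timeout: Duration,
    mult: f64,
    n: usize,
    probe_interval: Duration,
) -> Duration {
    // Clamp n to 1 so an empty or single-node view yields log2 = 0
    // and the result falls back to `min_timeout`.
    let log_n = (n.max(1) as f64).log2();
    probe_interval.mul_f64(mult * log_n).max(min_timeout)
}

/// Hypothetical epoch scheduler: every peer is returned exactly once
/// per epoch, and the order is reshuffled when an epoch ends.
pub struct EpochScheduler {
    peers: Vec<u32>,
    cursor: usize,
    rng_state: u64,
}

impl EpochScheduler {
    pub fn new(peers: Vec<u32>, seed: u64) -> Self {
        // xorshift must never be seeded with zero.
        let mut s = Self { peers, cursor: 0, rng_state: seed.max(1) };
        s.shuffle();
        s
    }

    /// xorshift64: a stand-in for a real RNG, adequate for a sketch only.
    fn next_u64(&mut self) -> u64 {
        let mut x = self.rng_state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.rng_state = x;
        x
    }

    /// Fisher-Yates shuffle of the peer list.
    fn shuffle(&mut self) {
        for i in (1..self.peers.len()).rev() {
            let j = (self.next_u64() % (i as u64 + 1)) as usize;
            self.peers.swap(i, j);
        }
    }

    /// Next probe target, or None when there are no peers.
    pub fn next_target(&mut self) -> Option<u32> {
        if self.peers.is_empty() {
            return None;
        }
        if self.cursor == self.peers.len() {
            // Epoch finished: draw a fresh permutation.
            self.shuffle();
            self.cursor = 0;
        }
        let target = self.peers[self.cursor];
        self.cursor += 1;
        Some(target)
    }
}
```

With `mult = 4.0`, `n = 8`, and a 1 s probe interval, the scaled term is roughly 4 × 3 × 1 s = 12 s, which dominates a 2 s minimum. With `n = 1` the logarithm is zero and the minimum applies.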
Reflects the async-trait addition to nodedb-cluster and advances all workspace member versions from 0.0.2 to 0.0.3.
Introduce the dissemination module (DisseminationQueue, PendingUpdate, apply_and_disseminate) that carries membership deltas as piggyback payloads on every outgoing probe datagram.

Each outgoing Ping, PingReq, and Ack now attaches up to max_piggyback rumours selected from the queue. A rumour is retired after it has been forwarded ceil(lambda * log2(n+1)) times, matching the bound from the SWIM paper (Das et al. §4.3) that guarantees, with high probability, that every live member receives each delta.

Inbound piggyback is ingested before dispatching the message so that a self-refutation incarnation bump is reflected in the outgoing Ack of the same round-trip.

SwimConfig gains max_piggyback and fanout_lambda fields with validation; ProbeRound and DetectorRunner are wired to pass them through. Integration tests cover cross-node delta propagation and self-refutation via piggyback.
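The retirement bound and the per-datagram selection can be illustrated with a small queue. This is a sketch under stated assumptions, not the PR's `DisseminationQueue`: rumours are plain strings, the method name `take_piggyback` is hypothetical, and the least-forwarded-first ordering is a choice made for the example rather than something the commit message specifies.

```rust
/// Number of times a rumour is forwarded before it is retired:
/// ceil(lambda * log2(n + 1)), where n is the cluster size.
pub fn retransmit_limit(lambda: f64, n: usize) -> u32 {
    (lambda * ((n + 1) as f64).log2()).ceil() as u32
}

/// One queued membership delta and how often it has been sent so far.
#[derive(Debug, Clone, PartialEq)]
pub struct PendingUpdate {
    pub rumour: String,
    pub sent: u32,
}

/// Hypothetical piggyback queue (illustrative, not the PR's type).
#[derive(Default)]
pub struct DisseminationQueue {
    pending: Vec<PendingUpdate>,
}

impl DisseminationQueue {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn push(&mut self, rumour: &str) {
        self.pending.push(PendingUpdate { rumour: rumour.to_string(), sent: 0 });
    }

    pub fn len(&self) -> usize {
        self.pending.len()
    }

    /// Select up to `max_piggyback` rumours for one outgoing datagram,
    /// count the send, and retire any rumour that has reached `limit`.
    pub fn take_piggyback(&mut self, max_piggyback: usize, limit: u32) -> Vec<String> {
        // Assumption: prefer rumours that have been forwarded the fewest
        // times (stable sort keeps insertion order among ties).
        self.pending.sort_by_key(|u| u.sent);
        let mut out = Vec::new();
        for update in self.pending.iter_mut().take(max_piggyback) {
            update.sent += 1;
            out.push(update.rumour.clone());
        }
        self.pending.retain(|u| u.sent < limit);
        out
    }
}
```

For a sense of scale, `lambda = 3.0` in a 9-node cluster gives 3 × log2(10) ≈ 9.97, so each rumour would ride on 10 datagrams before leaving the queue.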
Extract `InMemoryTransport` into its own file and add a real `UdpTransport` alongside it, with the shared `Transport` trait and `TransportFabric` living in a `transport/mod.rs` hub.

The old monolithic `transport.rs` held the trait, the in-memory fabric, and placeholder comments for UDP; keeping them together made the file grow as soon as UDP landed. The new layout mirrors the established one-concern-per-file rule:

  transport/mod.rs        re-exports + shared trait
  transport/in_memory.rs  test fabric (mpsc channels, drop injection)
  transport/udp.rs        production UDP transport (tokio UdpSocket)

`UdpTransport` is now re-exported from the `swim` public API so cluster startup and integration tests can construct one without reaching into internal modules.
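To show why a shared trait plus an in-memory fabric makes drop injection easy to test, here is a simplified, synchronous sketch built on `std::sync::mpsc`. It is not the PR's async `Transport` trait: the method names, the `NodeId` alias, the `String` error type, and `drop_edge` are assumptions chosen for the example.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::{Arc, Mutex};

pub type NodeId = u32;
type Envelope = (NodeId, Vec<u8>); // (sender, payload)

/// Simplified, blocking stand-in for the async trait in the PR.
pub trait Transport {
    fn send(&self, to: NodeId, payload: Vec<u8>) -> Result<(), String>;
    fn try_recv(&self) -> Option<Envelope>;
}

#[derive(Default)]
struct FabricInner {
    inboxes: HashMap<NodeId, Sender<Envelope>>,
    dropped_edges: HashSet<(NodeId, NodeId)>,
}

/// Shared switchboard that routes datagrams between in-memory endpoints.
#[derive(Clone, Default)]
pub struct TransportFabric {
    inner: Arc<Mutex<FabricInner>>,
}

impl TransportFabric {
    pub fn new() -> Self {
        Self::default()
    }

    /// Register a node and hand back its endpoint.
    pub fn join(&self, id: NodeId) -> InMemoryTransport {
        let (tx, rx) = channel();
        self.inner.lock().unwrap().inboxes.insert(id, tx);
        InMemoryTransport { id, fabric: self.clone(), inbox: rx }
    }

    /// Inject a one-directional fault: datagrams from `from` to `to` vanish.
    pub fn drop_edge(&self, from: NodeId, to: NodeId) {
        self.inner.lock().unwrap().dropped_edges.insert((from, to));
    }
}

pub struct InMemoryTransport {
    id: NodeId,
    fabric: TransportFabric,
    inbox: Receiver<Envelope>,
}

impl Transport for InMemoryTransport {
    fn send(&self, to: NodeId, payload: Vec<u8>) -> Result<(), String> {
        let inner = self.fabric.inner.lock().unwrap();
        if inner.dropped_edges.contains(&(self.id, to)) {
            // Behaves like a lost UDP datagram: the sender sees success.
            return Ok(());
        }
        match inner.inboxes.get(&to) {
            Some(tx) => tx
                .send((self.id, payload))
                .map_err(|_| "transport closed".to_string()),
            None => Err(format!("unknown peer {to}")),
        }
    }

    fn try_recv(&self) -> Option<Envelope> {
        self.inbox.try_recv().ok()
    }
}
```

Because detector logic would only see the trait, a test can swap this fabric for a real socket-backed transport without touching the probe code, and a partition is just a set of dropped edges.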
…g field

Callers previously had to wire up MembershipList, DisseminationQueue, FailureDetector, and the run-loop task themselves, a four-step sequence with easy ordering mistakes. `spawn_swim` collapses this into one call: it seeds the membership list from the provided address list, validates config, starts the detector task, and returns a `SwimHandle` that owns shutdown plumbing and exposes the shared membership and dissemination queue.

`BootstrapConfig` gains `swim_udp_addr: Option<SocketAddr>` so operators can opt into SWIM by supplying a bind address. `None` keeps the existing behaviour (cluster boots without SWIM; membership is observed only through Raft). All existing callsites are updated with `swim_udp_addr: None`.

`SwimHandle` and `spawn_swim` are re-exported from `nodedb_cluster` so dependent crates do not need to reach into swim internals.

Adds a real-UDP integration test (`swim_udp_convergence`) that boots three SWIM nodes on ephemeral loopback ports and asserts they converge to a full Alive view before a member is shut down and the remainder observe it as Suspect/Dead.
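The opt-in shape described here (an `Option<SocketAddr>` that gates a single spawn call returning a handle) can be sketched with the standard library alone. The real `spawn_swim` and `SwimHandle` signatures are not shown in this PR text, so everything below is hypothetical: a thread stands in for the tokio detector task, the membership list is a plain `Vec<SocketAddr>`, no socket is actually bound, and config validation is omitted.

```rust
use std::net::SocketAddr;
use std::sync::mpsc::{channel, RecvTimeoutError, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Only the field named in the commit message is modelled here.
pub struct BootstrapConfig {
    pub swim_udp_addr: Option<SocketAddr>,
}

/// Hypothetical handle: shared membership plus shutdown plumbing.
pub struct SwimHandle {
    pub membership: Arc<Mutex<Vec<SocketAddr>>>,
    stop_tx: Sender<()>,
    task: JoinHandle<()>,
}

impl SwimHandle {
    /// Signal the background loop to stop and wait for it to exit.
    pub fn shutdown(self) {
        let _ = self.stop_tx.send(());
        let _ = self.task.join();
    }
}

/// Returns None when SWIM is not enabled, mirroring `swim_udp_addr: None`.
pub fn spawn_swim(cfg: &BootstrapConfig, seeds: &[SocketAddr]) -> Option<SwimHandle> {
    let bind = cfg.swim_udp_addr?;

    // Seed the membership view with ourselves and the provided peers.
    let mut members = vec![bind];
    members.extend_from_slice(seeds);
    let membership = Arc::new(Mutex::new(members));

    let (stop_tx, stop_rx) = channel::<()>();
    let task = thread::spawn(move || loop {
        match stop_rx.recv_timeout(Duration::from_millis(10)) {
            // A probe round would run here on each tick.
            Err(RecvTimeoutError::Timeout) => continue,
            // Shutdown requested, or the handle was dropped.
            _ => break,
        }
    });

    Some(SwimHandle { membership, stop_tx, task })
}
```

Keeping the seed, spawn, and shutdown steps behind one function is what removes the ordering mistakes the commit message mentions: a caller either gets a fully wired handle or nothing at all.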