
SWIM membership: failure detector, gossip dissemination, UDP transport#34

Merged

farhan-syah merged 5 commits into main from cluster on Apr 15, 2026

Conversation

@farhan-syah
Contributor

Summary

  • Add a SWIM failure detector on top of the existing membership primitives: probe scheduler, direct + indirect probes, suspicion timer, inflight registry, and a Tokio run loop with cooperative shutdown.
  • Implement Lifeguard-style gossip dissemination with a bounded piggyback queue (`max_piggyback`, `fanout_lambda`), least-disseminated-first selection, and integration into every outgoing Ping/PingReq/Ack.
  • Split the transport module into a `Transport` trait plus per-impl files, and add a production `UdpTransport` using zerompk-encoded datagrams with typed decode errors.
  • Introduce `spawn_swim` / `SwimHandle` as the single entry-point for standing up the detector, and add `swim_udp_addr: Option<SocketAddr>` to `BootstrapConfig` (defaults to `None`, fully backward compatible).
  • Bump workspace crates to 0.0.3 and refresh the lockfile.

Test plan

  • `cargo fmt --all --check`
  • `cargo clippy --workspace --all-targets --all-features -- -D warnings`
  • `cargo nextest run -p nodedb-cluster --all-features`
  • Real-UDP 3-node convergence integration test (`tests/swim_udp_convergence.rs`)

Add the core SWIM/Lifeguard failure detection runtime to nodedb-cluster.
The detector drives a tokio::select! probe loop that separates I/O from
logic via a Transport trait, allowing deterministic in-process testing
with InMemoryTransport and edge-drop/partition injection.

- FailureDetector: main runtime task owning the probe loop, suspicion
  timer, and inflight-probe registry
- ProbeRound: Lifeguard ping → ping-req indirect probe sequence with
  per-probe-id oneshot correlation
- ProbeScheduler: random-permutation epoch scheduler (every peer probed
  once per epoch, per Lifeguard §4.3)
- SuspicionTimer: timeout math per Lifeguard §3.1
  (max(min, mult * log2(n) * probe_interval)) with drain_expired polling
- Transport / InMemoryTransport / TransportFabric: Send+Sync async trait
  with mpsc-backed in-memory impl for unit tests
- SwimError variants TransportClosed and ProbeInflightOverflow
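
For concreteness, the Lifeguard timeout formula above works out as in this small sketch (the function name and parameters are illustrative, not the crate's actual API):

```rust
use std::time::Duration;

/// max(min, mult * log2(n) * probe_interval), per Lifeguard §3.1.
/// n is clamped to 2 so log2 never goes non-positive for tiny clusters
/// (a choice made for this sketch; the crate may handle n=1 differently).
fn suspicion_timeout(n: usize, probe_interval: Duration, mult: f64, min: Duration) -> Duration {
    probe_interval
        .mul_f64(mult * (n.max(2) as f64).log2())
        .max(min)
}
```

With an 8-node cluster, a 1 s probe interval, and `mult = 3.0`, the timeout is `3 * log2(8) * 1s = 9s`; the `min` floor only kicks in for very small clusters, where `log2(n)` would otherwise produce an aggressively short suspicion window.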

async-trait added to nodedb-cluster dependencies to support the
Transport trait definition.

Reflects the async-trait addition to nodedb-cluster and advances all
workspace member versions from 0.0.2 to 0.0.3.

Introduce the dissemination module (DisseminationQueue, PendingUpdate,
apply_and_disseminate) that carries membership deltas as piggyback
payloads on every outgoing probe datagram.

Each outgoing Ping, PingReq, and Ack now attaches up to max_piggyback
rumours selected from the queue. A rumour is retired after it has been
forwarded ceil(lambda * log2(n+1)) times, matching the bound from the
SWIM paper (Das et al. §4.3) that guarantees with high probability
every live member receives each delta.

Inbound piggyback is ingested before dispatching the message so that a
self-refutation incarnation bump is reflected in the outgoing Ack of the
same round-trip. SwimConfig gains max_piggyback and fanout_lambda fields
with validation; ProbeRound and DetectorRunner are wired to pass them
through. Integration tests cover cross-node delta propagation and
self-refutation via piggyback.
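
A minimal sketch of the retirement bound and least-disseminated-first selection described above (the types here are hypothetical stand-ins, not the crate's actual `DisseminationQueue`):

```rust
/// A rumour pending dissemination; `payload` stands in for an
/// encoded membership delta.
#[derive(Debug, Clone)]
struct PendingUpdate {
    payload: String,
    transmits: u32, // how many times this rumour has been piggybacked
}

struct DisseminationQueue {
    updates: Vec<PendingUpdate>,
    retire_after: u32, // ceil(lambda * log2(n + 1))
}

impl DisseminationQueue {
    fn new(fanout_lambda: f64, n: usize) -> Self {
        let retire_after = (fanout_lambda * ((n + 1) as f64).log2()).ceil() as u32;
        Self { updates: Vec::new(), retire_after }
    }

    fn push(&mut self, payload: String) {
        self.updates.push(PendingUpdate { payload, transmits: 0 });
    }

    /// Select up to `max_piggyback` least-disseminated rumours, bump
    /// their transmit counts, and retire any that hit the bound.
    fn select(&mut self, max_piggyback: usize) -> Vec<String> {
        self.updates.sort_by_key(|u| u.transmits);
        let picked: Vec<String> = self
            .updates
            .iter_mut()
            .take(max_piggyback)
            .map(|u| {
                u.transmits += 1;
                u.payload.clone()
            })
            .collect();
        let bound = self.retire_after;
        self.updates.retain(|u| u.transmits < bound);
        picked
    }
}
```

Sorting by transmit count before taking `max_piggyback` is what makes fresh rumours win contention for datagram space, which is the property the `ceil(lambda * log2(n+1))` bound depends on.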

Extract `InMemoryTransport` into its own file and add a real
`UdpTransport` alongside it, with the shared `Transport` trait and
`TransportFabric` living in a `transport/mod.rs` hub.

The old monolithic `transport.rs` held the trait, the in-memory
fabric, and placeholder comments for UDP; keeping them together made
the file grow as soon as UDP landed. The new layout mirrors the
established one-concern-per-file rule:
  transport/mod.rs       — re-exports + shared trait
  transport/in_memory.rs — test fabric (mpsc channels, drop injection)
  transport/udp.rs       — production UDP transport (tokio UdpSocket)

`UdpTransport` is now re-exported from the `swim` public API so cluster
startup and integration tests can construct one without reaching into
internal modules.
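
A simplified synchronous sketch of the trait/fabric split (the real `Transport` trait is async via async-trait and addresses are `SocketAddr`s; every name below is illustrative):

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

type Addr = u16;          // stands in for SocketAddr
type Datagram = Vec<u8>;

/// The shared trait: one send path, one receive path, nothing else,
/// so UDP and in-memory impls are interchangeable to the detector.
trait Transport {
    fn send(&self, to: Addr, msg: Datagram);
    fn recv(&self) -> Option<Datagram>;
}

/// In-memory fabric node: one mpsc inbox per node plus a shared
/// sender map. Dropping an entry from `senders` simulates a partition.
struct InMemoryTransport {
    senders: HashMap<Addr, Sender<Datagram>>,
    inbox: Receiver<Datagram>,
}

impl Transport for InMemoryTransport {
    fn send(&self, to: Addr, msg: Datagram) {
        if let Some(tx) = self.senders.get(&to) {
            let _ = tx.send(msg); // an edge-drop hook would intercept here
        }
    }
    fn recv(&self) -> Option<Datagram> {
        self.inbox.try_recv().ok()
    }
}

/// Build a fully connected fabric over the given addresses.
fn fabric(addrs: &[Addr]) -> Vec<InMemoryTransport> {
    let pairs: Vec<(Sender<Datagram>, Receiver<Datagram>)> =
        addrs.iter().map(|_| channel()).collect();
    let senders: HashMap<Addr, Sender<Datagram>> = addrs
        .iter()
        .zip(pairs.iter())
        .map(|(a, (tx, _))| (*a, tx.clone()))
        .collect();
    pairs
        .into_iter()
        .map(|(_, rx)| InMemoryTransport { senders: senders.clone(), inbox: rx })
        .collect()
}
```

The point of the split is exactly what this sketch shows: the detector only ever talks to the trait, so swapping `InMemoryTransport` for `UdpTransport` changes no probe-loop logic.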

…g field

Callers previously had to wire up MembershipList, DisseminationQueue,
FailureDetector, and the run-loop task themselves — a four-step
sequence with easy ordering mistakes. `spawn_swim` collapses this into
one call: it seeds the membership list from the provided address list,
validates config, starts the detector task, and returns a `SwimHandle`
that owns shutdown plumbing and exposes the shared membership and
dissemination queue.

`BootstrapConfig` gains `swim_udp_addr: Option<SocketAddr>` so
operators can opt into SWIM by supplying a bind address. `None` keeps
the existing behaviour (cluster boots without SWIM; membership is
observed only through Raft). All existing callsites are updated with
`swim_udp_addr: None`.
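
As a sketch, the opt-in field looks roughly like this (a hypothetical minimal struct; the real `BootstrapConfig` carries more fields, and the port below is arbitrary):

```rust
use std::net::SocketAddr;

/// Minimal shape of the opt-in: `None` means "boot without SWIM",
/// which is also the Default, so existing callers are untouched.
#[derive(Default)]
struct BootstrapConfig {
    swim_udp_addr: Option<SocketAddr>,
}

/// Opt into SWIM by supplying a bind address.
fn with_swim(bind: &str) -> BootstrapConfig {
    BootstrapConfig {
        swim_udp_addr: Some(bind.parse().expect("valid socket address")),
    }
}
```

Using `Option` (rather than a separate enable flag plus an address) keeps the invalid state "enabled but no address" unrepresentable.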

`SwimHandle` and `spawn_swim` are re-exported from `nodedb_cluster` so
dependent crates do not need to reach into swim internals.

Adds a real-UDP integration test (`swim_udp_convergence`) that boots
three SWIM nodes on ephemeral loopback ports and asserts they converge
to a full Alive view before a member is shut down and the remainder
observe it as Suspect/Dead.
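
The ephemeral-port pattern that test depends on can be shown in isolation (plain-std sketch, no SWIM involved): bind to port 0 so the OS picks a free port, then read the assigned address back before exchanging datagrams.

```rust
use std::net::UdpSocket;

/// Bind two ephemeral loopback sockets, send one datagram between
/// them, and return the received bytes.
fn udp_roundtrip() -> Vec<u8> {
    let a = UdpSocket::bind("127.0.0.1:0").expect("bind a");
    let b = UdpSocket::bind("127.0.0.1:0").expect("bind b");
    // Port 0 asks the OS for any free port; local_addr() reveals it.
    let b_addr = b.local_addr().expect("addr b");
    a.send_to(b"ping", b_addr).expect("send");
    let mut buf = [0u8; 16];
    let (n, _from) = b.recv_from(&mut buf).expect("recv");
    buf[..n].to_vec()
}
```

Binding to port 0 is what lets three nodes run in one test process without fixed-port collisions across parallel test runs.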
farhan-syah merged commit d9d34f9 into main on Apr 15, 2026
2 checks passed