# Block Storage for Kubernetes
seaweed-block is a standalone block-storage experiment built around a
deterministic semantic core. The current repository is the first public
runnable slice of that work: a narrow block sparrow that proves one clean
recovery route end-to-end.
```
facts -> engine decision -> adapter command -> runtime execution -> session close
```
seaweed-block aims to make block storage for Kubernetes much lighter, easier,
and more flexible than traditional storage stacks.
The product direction is:
- simpler to understand and operate than heavyweight systems such as Ceph
- lighter to start and iterate on for developers and platform teams
- flexible enough to grow from a small cluster service into a practical Kubernetes block platform
- easier to reason about during failure and recovery, without a maze of hidden control-plane behavior
In short:
- easier than heavyweight storage systems
- more direct than control-plane-heavy designs
- still structured enough to grow into serious replicated block storage
The technical design is intentionally shaped around a few strict choices:
- semantic core first: recovery meaning is defined in a deterministic engine before broad system growth
- one route only: observation, decision, execution, and terminal close follow one explicit path
- strict authority boundaries: engine decides, adapter normalizes, runtime executes
- terminal truth is narrow: recovery is not "successful" until explicit session close says so
- reviewable growth: the project is phase-driven so new features do not silently pollute the core
- runtime must not silently redefine semantics, including widening engine-issued recovery targets
Many storage systems become difficult because recovery semantics, transport mechanics, retries, and product features get mixed together.
seaweed-block is an attempt to separate those layers more cleanly:
- facts determine semantics
- transport does not silently redefine policy
- the system can be tested and reviewed from the semantic contract outward
- the same semantic model should later support broader runtime work without changing its meaning
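The layering above can be sketched in Go. This is a minimal illustration under assumed semantics; every name here (`Facts`, `Decision`, `Command`, and the three functions) is hypothetical, not the repository's real API.

```go
package main

import "fmt"

// Facts are the only input the engine sees; transport details never reach it.
type Facts struct{ ReplicaLSN, PrimaryLSN uint64 }

// Decision is the engine's semantic output; its target is fixed by the engine.
type Decision struct{ TargetLSN uint64 }

// Command is the wire-level form of the same decision, meaning unchanged.
type Command struct{ TargetLSN uint64 }

// The engine decides from facts alone and is deterministic.
func decide(f Facts) Decision { return Decision{TargetLSN: f.PrimaryLSN} }

// The adapter normalizes a decision into a command without altering it.
func toCommand(d Decision) Command { return Command{TargetLSN: d.TargetLSN} }

// The runtime executes the command; it may fail, but may not reinterpret it.
func execute(c Command) error {
	fmt.Println("replicating to LSN", c.TargetLSN)
	return nil
}

func main() {
	cmd := toCommand(decide(Facts{ReplicaLSN: 40, PrimaryLSN: 100}))
	if err := execute(cmd); err == nil {
		// Terminal truth comes only from an explicit close, not from execution.
		fmt.Println("session closed at LSN", cmd.TargetLSN)
	}
}
```

The point of the sketch is the direction of authority: facts flow forward, and no later stage may rewrite what an earlier stage decided.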
Current status:
- semantic core: present
- replay/conformance runtime: present
- adapter-backed route: present
- runnable block sparrow: present
- operations UX: not yet built out
- persistence / crash recovery: not yet built out
The current repository can run one narrow block slice end-to-end through real TCP transport.
Demonstrated paths:
- healthy: replica is already caught up
- catch-up: replica is behind within retained window
- rebuild: replica is behind beyond retained window
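The three paths can be read as one deterministic classification over LSN facts. The sketch below is illustrative only; the function name, fields, and thresholds are assumptions, not the repository's real engine API.

```go
package main

import "fmt"

// Route is the recovery path chosen by the engine.
type Route string

const (
	Healthy Route = "healthy"  // replica is already caught up
	CatchUp Route = "catch-up" // behind, but within the retained window
	Rebuild Route = "rebuild"  // behind beyond the retained window
)

// classify picks the route deterministically from three facts:
// where the replica is, where the primary is, and the oldest retained LSN.
func classify(replicaLSN, primaryLSN, retainedFrom uint64) Route {
	switch {
	case replicaLSN >= primaryLSN:
		return Healthy
	case replicaLSN >= retainedFrom:
		return CatchUp
	default:
		return Rebuild
	}
}

func main() {
	fmt.Println(classify(100, 100, 60)) // healthy
	fmt.Println(classify(80, 100, 60))  // catch-up
	fmt.Println(classify(40, 100, 60))  // rebuild
}
```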
The current route stays intentionally narrow:
- one semantic route
- one active session at a time
- one terminal-close authority
- executor honors the engine-issued `targetLSN`
This repository is not yet:
- production-ready storage
- a full SeaweedFS block product
- a complete frontend protocol implementation
- a broad operations shell or UI
- a replacement claim over the current V2 baseline
Current limitations:
- storage is in-memory only
- no persistence or crash recovery
- no master service; assignment is hardcoded in the demo
- no iSCSI or NVMe-oF frontend
- no concurrent write path during replication
- no broad timeout / reconnect / hardening logic
```
cmd/
  sparrow/      runnable Phase 04 demo entry point
core/
  engine/       deterministic semantic core
  schema/       conformance case schema and conversion
  runtime/      replay runner
  conformance/  YAML semantic cases
adapter/        single-route adapter boundary
storage/        minimal in-memory block store
transport/      minimal TCP transport for the runnable sparrow
```
`core/` is the public-facing semantic center of this repository.
For a full taxonomy of events, commands, truth domains, and operator enums — and why there is no single unified state diagram — see docs/surface.md.
Requirements:
- Go 1.23+
Run the runnable sparrow:
```
go run ./cmd/sparrow
```

Expected outcome:
- healthy demo passes
- catch-up demo passes
- rebuild demo passes
Run the test suite:
```
go test ./...
```

The sparrow supports optional flags for repeatable validation and read-only inspection. Defaults are unchanged from the Phase 04 demo:
```
go run ./cmd/sparrow                                  # three demos, text output (default)
go run ./cmd/sparrow --help                           # authoritative scope statement
go run ./cmd/sparrow --json                           # machine-readable output for CI
go run ./cmd/sparrow --runs 10                        # repeat the full demo N times
go run ./cmd/sparrow --http :9090                     # add read-only HTTP inspection
go run ./cmd/sparrow --calibrate                      # Phase 06 calibration pass (C1-C5)
go run ./cmd/sparrow --calibrate --json               # machine-readable calibration report
go run ./cmd/sparrow --persist-demo --persist-dir DIR # Phase 07 single-node persistence demo
```

HTTP endpoints (read-only): `GET /` returns the self-describing surface map; `GET /status`, `GET /projection`, `GET /trace`, `GET /watchdog`, and `GET /diagnose` expose the bounded single-node inspection surface. Every mutation verb returns 501 with an explicit read-only ops-surface body.
See docs/single-node-surface.md for the bounded single-node product surface, or docs/bootstrap-validation.md for the full list of supported flags, endpoints, and exit codes.
This binary is a development and validation entry point only. The
production operations surface is weed shell after integration.
Phase 06 adds a small calibration set that drives the accepted route through five scenario families (C1-C5) and records expected-versus-observed evidence. Run it with:
```
go run ./cmd/sparrow --calibrate        # text report
go run ./cmd/sparrow --calibrate --json # machine-readable report
```

Evidence artifacts:
If a case diverges, record it in divergence-log.md before changing
the route or the expectations.
A bounded local data process owns read, write, flush, checkpoint, and recovery on one node, behind the `LogicalStorage` interface.
Acked writes survive abrupt process kill; recovery is deterministic;
a background flusher drains the WAL into the extent and advances
the on-disk checkpoint.
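The write -> WAL -> flush -> checkpoint flow can be sketched as below. This is an illustrative shape only: the real `LogicalStorage` interface lives in the repository, its actual methods may differ, and the real backend would fsync the WAL before acking.

```go
package main

import "fmt"

// entry is one acked write record in the WAL.
type entry struct {
	lsn  uint64
	data []byte
}

// walStore acks a write once it is in the WAL; a flusher later drains
// the WAL into the extent and advances the on-disk checkpoint.
type walStore struct {
	wal        []entry
	extent     map[uint64][]byte
	checkpoint uint64 // highest LSN durably applied to the extent
	nextLSN    uint64
}

func newWalStore() *walStore {
	return &walStore{extent: map[uint64][]byte{}, nextLSN: 1}
}

// Write appends to the WAL and returns the assigned LSN. In the real
// system the ack would follow an fsync of the WAL record.
func (s *walStore) Write(data []byte) uint64 {
	lsn := s.nextLSN
	s.nextLSN++
	s.wal = append(s.wal, entry{lsn: lsn, data: data})
	return lsn
}

// Flush mirrors the background flusher: drain the WAL into the extent,
// then advance the checkpoint past everything applied.
func (s *walStore) Flush() {
	for _, e := range s.wal {
		s.extent[e.lsn] = e.data
		s.checkpoint = e.lsn
	}
	s.wal = nil
}

func main() {
	s := newWalStore()
	s.Write([]byte("a"))
	s.Write([]byte("b"))
	s.Flush()
	fmt.Println("checkpoint at LSN", s.checkpoint) // checkpoint at LSN 2
}
```

On crash, anything in the WAL but not yet in the extent is replayed from the last checkpoint, which is what makes recovery deterministic across reopens.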
```
go run ./cmd/sparrow --persist-demo --persist-dir /tmp/sparrow-persist
```

What's proven (single-node):
- Acked writes survive process kill (verified by simulated-kill tests that bypass `Close()` and by a crash family across four windows).
- Recovery is deterministic across reopens of the same on-disk state.
- Unacked writes may vanish but never corrupt acked data.
What's NOT in scope: distributed durability across nodes;
power-loss durability beyond what fsync guarantees at the
OS+device boundary; bit-rot detection in the extent.
For details:
- docs/local-data-process.md — the institution, what it owns, the crash model, carry-forward
- docs/persistence.md — backend implementation details, on-disk format, exit codes, NVMe/raw-device path
The current replicated path is documented as two bounded lower institutions:
- docs/data-sync-institution.md — byte movement, wire protocol, lineage gate, achieved-frontier report
- docs/recovery-execution-institution.md — command admission, real execution start, invalidation, close-path lifecycle truth
Above the three lower institutions (local data, data sync, recovery execution) sits one bounded single-node operator surface — start / inspect / validate / diagnose — exposed as six read-only HTTP endpoints plus the sparrow CLI. No cluster-shaped wording; no mutation authority.
- docs/single-node-surface.md — surface map, workflow, honesty rules, carry-forward
The first bounded product capability beyond single-node operation: one old-primary → new-primary → rejoin path that converges with explicit fencing and stale-lineage rejection. Mechanism, not policy — who becomes primary and when to fail over belong to later phases.
- docs/replicated-slice.md — the bounded route, authority boundary, durability claim, known limitations, carry-forward
The current implementation is intentionally shaped around a few strict rules:
- timers trigger observation; facts determine semantics
- the engine owns semantic recovery decisions
- the adapter/runtime may execute, but may not redefine policy
- terminal truth comes only from explicit session close
- session `targetLSN` is fixed by the engine and must not be silently widened by the executor
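The last rule can be made concrete as an admission check in the executor. This is a hypothetical guard, not the repository's real code: the executor may run up to the engine-issued `targetLSN`, but must refuse any request that would widen it.

```go
package main

import (
	"errors"
	"fmt"
)

var errWidenedTarget = errors.New("executor may not widen the engine-issued targetLSN")

// admit checks a requested execution frontier against the fixed target.
// Widening the target is a semantic change, not an execution detail,
// so it is rejected rather than silently honored.
func admit(engineTargetLSN, requestedLSN uint64) error {
	if requestedLSN > engineTargetLSN {
		return errWidenedTarget
	}
	return nil
}

func main() {
	fmt.Println(admit(100, 100)) // exactly the fixed target: admitted
	fmt.Println(admit(100, 120)) // wider than the fixed target: rejected
}
```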
The next planned steps are:
- freeze and stabilize the first public runnable shape under `core/`
- add minimal operations and test interfaces
- calibrate against selected high-value scenarios from the existing benchmark path
- expand only after the semantic boundary stays clean
This repository should currently be read as:
- a runnable semantic-core-first block prototype
- a clean recovery-route reference
- a base for future operations, calibration, and storage work
It should not yet be read as:
- a finished block product
- production-ready replicated storage
- complete protocol or deployment surface