
fix: eliminate idle CPU burn + missing system-table warnings + Docker volume UX (closes #20) #21

Merged
farhan-syah merged 8 commits into main from fix/idle-cpu-burn
Apr 14, 2026

Conversation

@farhan-syah
Contributor

Closes #20.

Full context, reproduction, and before/after numbers on the issue: #20 (comment)

Summary

Reported symptom: a fresh NodeDB container (or native binary) idles at ~150-175% CPU with no schema, no clients, and no workload. This branch fixes the root cause plus the four related side-findings in the same report.

1. Idle CPU burn (primary fix — commit c10ee61)

Two busy-poll loops in the Control / Event Plane had no idle backoff:

  • Response poller (main.rs) was loop { poll_and_route_responses(); yield_now().await; }. yield_now() immediately re-schedules, so one Tokio worker was pegged at 100% with zero clients connected.
  • Event consumer (event/consumer.rs) empty-ring poll was fixed at 1ms per core × 23 Data Plane cores ≈ 23k task wakes/sec even at idle.

Fix:

  • poll_and_route_responses now returns the number of routed responses so the loop can detect activity.
  • Response poller uses adaptive backoff: yield_now while routing or within 256 idle iters (sub-ms burst recovery), then 1ms, then 10ms.
  • Event consumer empty-poll ramps from 1ms → 50ms after 32 consecutive empty polls, resets on the first batch.

The hot path still uses yield_now while responses are flowing, so request latency under load is unchanged.
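The schedule above can be sketched as a small dependency-free helper; the function name and the exact match arms are illustrative (the real loop lives in main.rs), but the thresholds follow the description: yield within 256 idle iterations, 1 ms up to 1024, then 10 ms.

```rust
use std::time::Duration;

/// Idle-backoff schedule for the response poller (illustrative names).
/// `None` means "just yield_now(), don't sleep" — the hot path.
fn idle_backoff(idle_iters: u32) -> Option<Duration> {
    match idle_iters {
        0..=256 => None,                               // burst recovery: stay hot
        257..=1024 => Some(Duration::from_millis(1)),  // first ramp
        _ => Some(Duration::from_millis(10)),          // deep idle
    }
}

fn main() {
    assert_eq!(idle_backoff(10), None);
    assert_eq!(idle_backoff(500), Some(Duration::from_millis(1)));
    assert_eq!(idle_backoff(5000), Some(Duration::from_millis(10)));
    println!("ok");
}
```

Any routed response resets the idle counter to zero, which is why latency under load is unaffected: the loop never sleeps while work is flowing.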

2. Missing system-table warnings (commit 1bd04da)

The SystemCatalog::open init transaction was missing 7 tables, so fresh DBs spewed Table '_system.alert_rules' does not exist etc. on startup. Added ALERT_RULES, RETENTION_POLICIES, SEQUENCES, SEQUENCE_STATE, COLUMN_STATS, VECTOR_MODEL_METADATA, CHECKPOINTS to the init path.

3. Docker volume permission UX (commit 751e804)

  • New docker-entrypoint.sh: runs as root just long enough to chown the data volume to nodedb:nodedb, then exec gosu nodedb drops privileges. -v nodedb-data:/var/lib/nodedb now works without --user 0:0. If the container is started with --user 10001:10001 the entrypoint detects it and skips the chown.
  • Clear actionable error instead of cryptic WAL I/O error: Permission denied (os error 13) when the volume is unwritable.
  • Dockerfile: added gosu, moved USER switch into the entrypoint, added COPY docker-entrypoint.sh.

4. Docs / mount-path mismatch (commit 751e804)

  • nodedb-docs/docs/introduction/docker.rdx now uses /var/lib/nodedb everywhere (was -v nodedb-data:/data in 5 places, which silently lost data on docker rm because the image doesn't set NODEDB_DATA_DIR=/data).
  • New installation.rdx and getting-started.md structure: prebuilt Linux binary recommended first (best performance), Docker for macOS/Windows/WSL2, source for development. All share one Configuration section. Binary download command resolves latest tag and arch dynamically — no hardcoded version.

5. Misc improvements that landed while diagnosing

  • 06d3b07 — nextest config serialising cluster tests + retry-on-flake.
  • 3ce34b8 — cluster test harness hardened against shutdown/replication flakes (including the wait_for for rolling-upgrade compat-mode exit at the bottom of TestCluster::spawn_three).
  • 48b1081 — nodedb-cluster join retry policy is now configurable via JoinRetryConfig instead of hardcoded constants.
  • a88356b — closes a post-apply race between the in-memory metadata cache and the applied-index watcher (small, unrelated to idle CPU but caught by flaky tests during this work).

Each of these is its own commit for a reason — happy to split the branch into multiple PRs if reviewers prefer, but they're all small and interrelated enough that I kept them together.

Verification

| Build | Idle CPU | Startup warnings |
| --- | --- | --- |
| v0.0.0 release binary (bare Linux) | 149% | yes |
| v0.0.0 Docker image | 175% | yes |
| Post-fix binary (bare Linux) | 0.0% | none |
| Post-fix Docker image (this branch) | 0.86% | none |
  • cargo nextest run — green.
  • cargo clippy --all-targets --all-features -- -D warnings — clean.
  • cargo fmt --all — clean.
  • Manual pgwire smoke test against the post-fix Docker image: connect, CREATE COLLECTION, INSERT, SELECT all work; CPU drops back to <1% immediately after the query.

Test plan

  • Reproduce idle CPU on v0.0.0 binary and Docker image
  • Verify fix on native binary (0.0% idle)
  • Verify fix in Docker image (0.86% idle)
  • Confirm psql connect + CRUD still works after the fix
  • Confirm no _system.* warnings on fresh DB start
  • Confirm docker run -v nodedb-data:/var/lib/nodedb works without --user 0:0
  • cargo nextest run
  • cargo clippy --all-targets --all-features -- -D warnings
  • cargo fmt --all

…sumer

The response poller loop unconditionally called yield_now() even when
no responses were in flight, keeping a tokio worker pinned at ~100%
CPU on an idle server. Similarly the Event Plane consumer woke every
1ms regardless of ring buffer activity.

response_poller now uses adaptive backoff: yield_now() while active,
ramp to sleep(1ms) after 256 idle iterations, then sleep(10ms) after
1024 (roughly one second of idleness). This bounds idle CPU to ~0.1%
of one core while preserving sub-millisecond latency under load.

The Event Plane consumer gains the same adaptive ramp: it stays at
1ms for the first 32 empty polls then backs off to 50ms, capping
idle wakeups to ~20/sec per core rather than 1000/sec.

poll_and_route_responses now returns the routed-response count so the
poller can distinguish active from idle iterations.

The data-plane tick loops in test harnesses (and session.rs) are
tightened to exit on Disconnected as well as on the stop signal —
previously a panic-induced drop of the sender left spawn_blocking
threads spinning forever on a closed channel, which blocked tokio
runtime shutdown and wasted CI time at slow-timeout.
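The Disconnected handling can be sketched with a plain std::sync::mpsc loop; the function name and channel payload are illustrative, not the real harness code:

```rust
use std::sync::mpsc::{Receiver, TryRecvError};
use std::thread;
use std::time::Duration;

/// Tick loop in the spirit of the fix: keep ticking while the channel is
/// alive, but exit as soon as the sender side is gone. Returns how many
/// work items were processed.
fn tick_until_disconnected(rx: Receiver<u32>) -> u32 {
    let mut processed = 0;
    loop {
        match rx.try_recv() {
            Ok(_work) => processed += 1,
            Err(TryRecvError::Empty) => thread::sleep(Duration::from_millis(1)),
            // Without this arm the loop spins forever once the sender is
            // dropped (e.g. by a panicking test), blocking runtime shutdown.
            Err(TryRecvError::Disconnected) => break,
        }
    }
    processed
}

fn main() {
    let (tx, rx) = std::sync::mpsc::channel();
    let worker = thread::spawn(move || tick_until_disconnected(rx));
    tx.send(1).unwrap();
    tx.send(2).unwrap();
    drop(tx); // simulates a panic-induced drop of the sender
    assert_eq!(worker.join().unwrap(), 2);
    println!("ok");
}
```

Note that try_recv drains any buffered items before reporting Disconnected, so no queued work is lost on exit.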
…d-index watcher

Previously all post-apply side effects ran inside a tokio::spawn task.
The metadata applier then bumped the applied-index watcher, meaning a
reader that woke on the watcher bump (e.g. waiting for applied_index
to advance past N) could query the in-memory credential or permission
cache before install_replicated_user / install_replicated_owner had
run — a scheduler-order race that caused sporadic test failures.

Split post-apply into two phases:

- apply_post_apply_side_effects_sync runs inline on the applier thread
  BEFORE the watcher bump, covering all in-memory cache updates (users,
  roles, permissions, API keys, sequences, etc.). Any reader observing
  applied_index >= N is now guaranteed to see every sync side-effect
  of every entry up to N.

- spawn_post_apply_async_side_effects spawns the genuinely async work
  (Data Plane Register dispatch for PutCollection). Correctness does
  not depend on this completing before the watcher advances.
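The ordering guarantee can be sketched with plain atomics standing in for the real watch channel and metadata cache; all names here are illustrative:

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Publish the sync side effect (cache write) BEFORE the applied-index
/// bump; a reader that observes applied_index >= 1 must also see the
/// cache entry. Returns what the reader saw.
fn run_scenario() -> bool {
    let cache_installed = Arc::new(AtomicBool::new(false));
    let applied_index = Arc::new(AtomicU64::new(0));

    let (c, a) = (cache_installed.clone(), applied_index.clone());
    let reader = thread::spawn(move || {
        // Stand-in for waking on the applied-index watcher.
        while a.load(Ordering::Acquire) < 1 {
            std::hint::spin_loop();
        }
        c.load(Ordering::Acquire)
    });

    // Applier, phase 1: sync side effects run before the watcher bump.
    cache_installed.store(true, Ordering::Release);
    // Watcher bump: Release pairs with the reader's Acquire load above.
    applied_index.store(1, Ordering::Release);
    // Phase 2 (genuinely async work) could be spawned here; correctness
    // does not depend on it finishing first.

    reader.join().unwrap()
}

fn main() {
    assert!(run_scenario(), "reader must observe the sync side effect");
    println!("ok");
}
```

The old code was the mirror image — the cache write raced in a spawned task after the bump — which is exactly the window the flaky tests kept hitting.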

Also tighten the cluster-mode CREATE USER path: if the user entry is
missing after propose_catalog_entry returns (which can happen when a
leader change truncates the log entry between assignment and quorum
commit), return a retryable 40001 error so exec_ddl_on_any_leader
re-proposes on the current leader rather than silently succeeding with
a phantom log index.

Single-node mode is unchanged: it still writes to redb and installs
the cache entry inline when a catalog is present, and works correctly
without one (test fixtures).
…nRetryPolicy

The join loop's backoff schedule was a hard-coded match arm table with
a fixed attempt count. This made integration tests that exercise
join-failure paths (e.g. cluster_join_leader_crash) wait up to ~64
seconds of cumulative backoff per run.

Extract the policy into JoinRetryPolicy { max_attempts, max_backoff_secs }
with a Default that preserves the production schedule (8 attempts, 32 s
ceiling). The per-attempt delay is now derived from a single ceiling
value: delay = max_backoff_secs >> (max_attempts - attempt), so the
schedule grows exponentially from ~ceiling/2^max_attempts up to the
ceiling. The formula is tested directly.
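A sketch of the formula under the stated defaults; beyond the two field names given above, the struct layout and method name are illustrative:

```rust
/// Join backoff policy as described above (field names from the commit
/// message; the rest is a sketch).
#[derive(Clone, Copy)]
struct JoinRetryPolicy {
    max_attempts: u32,
    max_backoff_secs: u64,
}

impl Default for JoinRetryPolicy {
    fn default() -> Self {
        // Production schedule: 8 attempts, 32 s ceiling.
        Self { max_attempts: 8, max_backoff_secs: 32 }
    }
}

impl JoinRetryPolicy {
    /// delay = max_backoff_secs >> (max_attempts - attempt)
    fn delay_secs(&self, attempt: u32) -> u64 {
        let shift = self.max_attempts.saturating_sub(attempt);
        self.max_backoff_secs >> shift.min(63)
    }
}

fn main() {
    let p = JoinRetryPolicy::default();
    assert_eq!(p.delay_secs(1), 0);  // 32 >> 7
    assert_eq!(p.delay_secs(6), 8);  // 32 >> 2
    assert_eq!(p.delay_secs(8), 32); // 32 >> 0, the ceiling
    println!("ok");
}
```

With the defaults the per-attempt delays come out to 0, 0, 1, 2, 4, 8, 16, 32 seconds — about 63 s cumulative, matching the "~64 seconds" the old hard-coded table cost the failing tests.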

ClusterConfig gains a join_retry field. nodedb's cluster init reads
NODEDB_JOIN_RETRY_MAX_ATTEMPTS and NODEDB_JOIN_RETRY_MAX_BACKOFF_SECS
from the environment so CI and integration test harnesses can override
the schedule without recompiling.

The raft_loop match arm for Ok(idx) was incorrectly structured as a
statement; fixed to return the value directly.
…replication races

Several independent sources of CI flakiness in the cluster integration
suite are addressed together since they compound each other:

Panic-safe teardown: TestClusterNode now implements Drop, firing all
watch shutdown senders and aborting every JoinHandle synchronously.
Previously a panicking test dropped the node without signalling
shutdown, leaving background tasks alive, redb file handles open, and
the tokio runtime blocked until nextest killed the process at
slow-timeout (~2 minutes per flaky test).

Applied-index convergence barrier: exec_ddl_on_any_leader now waits
for every follower's applied_index to reach the proposer's current
watermark before returning. propose_catalog_entry already waits for
the entry to commit on the proposing node, but followers apply
asynchronously. Without this barrier, subsequent visibility checks on
followers would race the applier queue and trip their timeouts on the
cold-start attempt.

Rolling-upgrade compat-mode guard: TestCluster::spawn_three now waits
for all three nodes to exit rolling-upgrade compat mode before
returning. While in compat mode, propose_catalog_entry returns Ok(0)
without going through Raft, taking a non-replicated legacy path. Tests
that issued DDL immediately after join convergence would silently get
a leader-only write and then find the record missing on followers.

Test transports use a 4-second RPC timeout instead of the production
5-second default, cutting join-failure test wall time substantially.

Wait budgets for all convergence checks are widened from 5s to 10s to
absorb cold-start election lag on loaded CI runners without masking
genuine regressions.

Descriptor lease renewal test creates its collection before acquiring
the lease so the renewal loop's lookup_current_version finds it and
does not prematurely release the lease as orphaned.
…e renewal

system_catalog now opens all declared redb tables during the init
transaction. Tables that were referenced later but never opened in the
migration block caused a redb schema mismatch on the first write after
an upgrade (alert_rules, retention_policies, sequences, sequence_state,
column_stats, vector_model_metadata, checkpoints).

JWT test RSA keygen switched from 2048-bit to 1024-bit keys. The tests
exercise signing and verification logic, not key strength; the reduced
size cuts per-test keygen time ~10x without changing coverage.

Lease renewal code drops inline comments that merely restated the logic
they annotated; the ClusterTransportTuning construction in the unit test
now uses struct-update syntax so it reads clearly.

drop(lock_guard) before await in the peer warm-up path in main.rs is
replaced with a scoped block to satisfy clippy::await_holding_lock.
…nded getting-started guide

Docker: the image no longer runs as root. A new docker-entrypoint.sh
(using gosu) fixes ownership on the data volume when started as root,
then drops to uid 10001 (nodedb) before exec-ing the server. When
already started as a non-root user (--user 10001:10001) the entrypoint
passes through directly. This makes named-volume mounts work on Linux
hosts where Docker initialises volumes as root-owned.

CI: the test workflow now installs cargo-nextest via taiki-e/install-action
and runs cargo nextest run. Plain cargo test ignores the nextest.toml
cluster test-group that serialises 3-node integration tests and would
hang on the cluster suite. JUnit output is uploaded as an artifact on
every run for post-mortem analysis.

Docs: getting-started gains a prebuilt binary install section for
Linux (x64 and arm64), plain docker run instructions alongside the
existing Compose block, a systemd unit example, and a unified
configuration reference that applies to all install methods.
README test command updated to reflect nextest.
Cluster integration tests spin up 3-node Raft clusters with per-node
Tokio runtimes; running them alongside the rest of the suite caused
port/fd exhaustion and starved Raft heartbeats on high-core machines.
Pin them to a single-threaded test-group that claims all test slots,
and allow one retry for startup jitter. CI profile adds more retries
and JUnit output.
Two independent but compounding issues caused cluster join to hang for
tens of seconds on every startup when seeds were not yet bootstrapped:

1. The QUIC RPC timeout only covered the response-read phase. A
   handshake attempt against an unreachable or not-yet-listening peer
   blocked for the transport's internal idle timeout (~30 s), not the
   configured RPC timeout. In a 5-node race where every non-bootstrapper
   seed redirects to another non-bootstrapper, this multiplied to
   (N-1) × 30 s of wasted wall time per join attempt.

   Fixed by wrapping the entire send_rpc_to_addr operation — handshake,
   stream open, write, and read — in a single tokio::time::timeout
   bounded by self.rpc_timeout, and extracting the inner work into
   send_rpc_to_addr_inner so the public interface stays clean.
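   A sketch of the whole-operation timeout, assuming tokio; the inner
   function here is a stand-in that simulates an unresponsive peer, not
   the real QUIC code:

   ```rust
   use std::time::Duration;
   use tokio::time::{sleep, timeout};

   // Stand-in for send_rpc_to_addr_inner: a peer that never answers
   // within any reasonable window (transport idle-timeout territory).
   async fn send_rpc_to_addr_inner() -> Result<Vec<u8>, String> {
       sleep(Duration::from_secs(30)).await;
       Ok(vec![])
   }

   /// One timeout bounds the whole operation — handshake, stream open,
   /// write, and read — so a dead peer costs rpc_timeout, not ~30 s.
   async fn send_rpc_to_addr(rpc_timeout: Duration) -> Result<Vec<u8>, String> {
       timeout(rpc_timeout, send_rpc_to_addr_inner())
           .await
           .map_err(|_elapsed| "rpc timed out".to_string())?
   }

   #[tokio::main(flavor = "current_thread")]
   async fn main() {
       let res = send_rpc_to_addr(Duration::from_millis(50)).await;
       assert_eq!(res, Err("rpc timed out".to_string()));
       println!("ok");
   }
   ```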

2. The seed work-list was a Vec used as a stack (pop), so seed order
   was unspecified. Under the single-elected-bootstrapper rule the
   lexicographically smallest address is the one peer that can actually
   answer during the initial race; hitting it last meant exhausting
   timeouts against every other seed first.

   Fixed by sorting seeds at the start of the join loop so the
   designated bootstrapper surfaces first, and switching to VecDeque
   so leader redirects are pushed to the front (push_front / pop_front)
   and consumed before unvisited seeds.
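   The seed-ordering fix can be sketched with std's VecDeque (the
   addresses and helper name are made up):

   ```rust
   use std::collections::VecDeque;

   /// Sort seeds so the lexicographically smallest address — the one
   /// elected bootstrapper — surfaces first, then use a deque so leader
   /// redirects can jump the queue.
   fn ordered_seeds(mut seeds: Vec<&'static str>) -> VecDeque<&'static str> {
       seeds.sort();
       seeds.into_iter().collect()
   }

   fn main() {
       let mut work = ordered_seeds(vec!["node-c:7000", "node-a:7000", "node-b:7000"]);
       assert_eq!(work.pop_front(), Some("node-a:7000")); // bootstrapper first

       // A leader redirect is pushed to the front and consumed before
       // any unvisited seed.
       work.push_front("node-leader:7000");
       assert_eq!(work.pop_front(), Some("node-leader:7000"));
       assert_eq!(work.pop_front(), Some("node-b:7000"));
       println!("ok");
   }
   ```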
@farhan-syah farhan-syah merged commit 1ce86d0 into main Apr 14, 2026
2 checks passed
@farhan-syah farhan-syah deleted the fix/idle-cpu-burn branch April 14, 2026 14:19

Successfully merging this pull request may close these issues.

Fresh container idles at ~144% CPU (v0.0.0) — no schema, no clients, no workload
