fix: eliminate idle CPU burn + missing system-table warnings + Docker volume UX (closes #20) #21
Merged
farhan-syah merged 8 commits into main on Apr 14, 2026
Conversation
…sumer

The response poller loop unconditionally called yield_now() even when no responses were in flight, keeping a tokio worker pinned at ~100% CPU on an idle server. Similarly, the Event Plane consumer woke every 1ms regardless of ring buffer activity.

response_poller now uses adaptive backoff: yield_now() while active, ramping to sleep(1ms) after 256 idle iterations, then sleep(10ms) after 1024 (roughly one second of idleness). This bounds idle CPU to ~0.1% of one core while preserving sub-millisecond latency under load. The Event Plane consumer gains the same adaptive ramp: it stays at 1ms for the first 32 empty polls, then backs off to 50ms, capping idle wakeups at ~20/sec per core rather than 1000/sec. poll_and_route_responses now returns the routed-response count so the poller can distinguish active from idle iterations.

The data-plane tick loops in test harnesses (and session.rs) are tightened to exit on Disconnected as well as on the stop signal — previously a panic-induced drop of the sender left spawn_blocking threads spinning forever on a closed channel, which blocked tokio runtime shutdown and wasted CI time at slow-timeout.
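The adaptive ramp can be sketched as a pure backoff decision. This is a minimal illustration using the thresholds from the commit message (256 and 1024 idle iterations); `idle_backoff` is an illustrative name, not the actual code:

```rust
use std::time::Duration;

/// Decide how the poller should wait, given how many consecutive
/// iterations routed zero responses. `None` means stay on yield_now().
fn idle_backoff(idle_iters: u32) -> Option<Duration> {
    match idle_iters {
        0..=255 => None,                              // active or sub-ms burst recovery
        256..=1023 => Some(Duration::from_millis(1)), // short idle
        _ => Some(Duration::from_millis(10)),         // deep idle (~1s+ with no traffic)
    }
}

fn main() {
    // While responses are flowing, idle_iters resets to 0 and the loop only yields.
    assert_eq!(idle_backoff(0), None);
    assert_eq!(idle_backoff(255), None);
    assert_eq!(idle_backoff(256), Some(Duration::from_millis(1)));
    assert_eq!(idle_backoff(1024), Some(Duration::from_millis(10)));
    println!("ok");
}
```

The poller loop would call this each iteration, resetting `idle_iters` to zero whenever poll_and_route_responses reports routed responses.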
…d-index watcher

Previously all post-apply side effects ran inside a tokio::spawn task. The metadata applier then bumped the applied-index watcher, meaning a reader that woke on the watcher bump (e.g. waiting for applied_index to advance past N) could query the in-memory credential or permission cache before install_replicated_user / install_replicated_owner had run — a scheduler-order race that caused sporadic test failures.

Split post-apply into two phases:

- apply_post_apply_side_effects_sync runs inline on the applier thread BEFORE the watcher bump, covering all in-memory cache updates (users, roles, permissions, API keys, sequences, etc.). Any reader observing applied_index >= N is now guaranteed to see every sync side effect of every entry up to N.
- spawn_post_apply_async_side_effects spawns the genuinely async work (Data Plane Register dispatch for PutCollection). Correctness does not depend on this completing before the watcher advances.

Also tighten the cluster-mode CREATE USER path: if the user entry is missing after propose_catalog_entry returns (which can happen when a leader change truncates the log entry between assignment and quorum commit), return a retryable 40001 error so exec_ddl_on_any_leader re-proposes on the current leader rather than silently succeeding with a phantom log index.

Single-node mode is unchanged: it still writes to redb and installs the cache entry inline when a catalog is present, and works correctly without one (test fixtures).
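The ordering invariant can be sketched with std primitives. This is a hedged sketch, not the actual code: the real applier uses a tokio watch channel, for which an AtomicU64 stands in here, and `Applier` / `apply_entry` are hypothetical names:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

struct Applier {
    user_cache: Mutex<Vec<String>>, // stands in for the credential/permission caches
    applied_index: AtomicU64,       // stands in for the applied-index watcher
}

impl Applier {
    fn apply_entry(&self, index: u64, user: &str) {
        // Phase 1: sync side effects run inline, BEFORE the watcher bump.
        self.user_cache.lock().unwrap().push(user.to_string());
        // Phase 2: bump the watcher. Genuinely async work (e.g. Data Plane
        // Register dispatch) would be spawned here; correctness does not
        // depend on it finishing before readers wake.
        self.applied_index.store(index, Ordering::Release);
    }
}

fn main() {
    let applier = Applier {
        user_cache: Mutex::new(Vec::new()),
        applied_index: AtomicU64::new(0),
    };
    applier.apply_entry(1, "alice");
    // A reader that observes applied_index >= 1 is guaranteed to see the
    // cache entry installed for log entry 1.
    if applier.applied_index.load(Ordering::Acquire) >= 1 {
        assert!(applier.user_cache.lock().unwrap().contains(&"alice".to_string()));
    }
    println!("ok");
}
```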
…nRetryPolicy
The join loop's backoff schedule was a hard-coded match arm table with
a fixed attempt count. This made integration tests that exercise
join-failure paths (e.g. cluster_join_leader_crash) wait up to ~64
seconds of cumulative backoff per run.
Extract the policy into JoinRetryPolicy { max_attempts, max_backoff_secs }
with a Default that preserves the production schedule (8 attempts, 32 s
ceiling). The per-attempt delay is now derived from a single ceiling
value: delay = max_backoff_secs >> (max_attempts - attempt), so the
schedule grows exponentially from ~ceiling/2^max_attempts up to the
ceiling. The formula is tested directly.
ClusterConfig gains a join_retry field. nodedb's cluster init reads
NODEDB_JOIN_RETRY_MAX_ATTEMPTS and NODEDB_JOIN_RETRY_MAX_BACKOFF_SECS
from the environment so CI and integration test harnesses can override
the schedule without recompiling.
The raft_loop match arm for Ok(idx) was incorrectly structured as a
statement; fixed to return the value directly.
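The policy and its delay formula can be sketched directly from the description above. A minimal sketch: the struct fields and the Default schedule (8 attempts, 32 s ceiling) are from the commit message, while `delay_secs` is an illustrative method name:

```rust
#[derive(Clone)]
struct JoinRetryPolicy {
    max_attempts: u32,
    max_backoff_secs: u64,
}

impl Default for JoinRetryPolicy {
    /// Production schedule: 8 attempts with a 32-second ceiling.
    fn default() -> Self {
        Self { max_attempts: 8, max_backoff_secs: 32 }
    }
}

impl JoinRetryPolicy {
    /// delay = max_backoff_secs >> (max_attempts - attempt): the delay
    /// doubles each attempt and reaches the ceiling on the final attempt.
    fn delay_secs(&self, attempt: u32) -> u64 {
        let shift = self.max_attempts.saturating_sub(attempt);
        self.max_backoff_secs >> shift
    }
}

fn main() {
    let p = JoinRetryPolicy::default();
    assert_eq!(p.delay_secs(1), 0);  // 32 >> 7: early attempts retry immediately
    assert_eq!(p.delay_secs(4), 2);  // 32 >> 4
    assert_eq!(p.delay_secs(8), 32); // 32 >> 0: ceiling on the last attempt
    println!("ok");
}
```

Shrinking `max_backoff_secs` (e.g. via the environment overrides) compresses the whole schedule proportionally, which is what lets CI cut the ~64 s cumulative backoff.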
…replication races

Several independent sources of CI flakiness in the cluster integration suite are addressed together since they compound each other.

Panic-safe teardown: TestClusterNode now implements Drop, firing all watch shutdown senders and aborting every JoinHandle synchronously. Previously a panicking test dropped the node without signalling shutdown, leaving background tasks alive, redb file handles open, and the tokio runtime blocked until nextest killed the process at slow-timeout (~2 minutes per flaky test).

Applied-index convergence barrier: exec_ddl_on_any_leader now waits for every follower's applied_index to reach the proposer's current watermark before returning. propose_catalog_entry already waits for the entry to commit on the proposing node, but followers apply asynchronously. Without this barrier, subsequent visibility checks on followers would race the applier queue and trip their timeouts on the cold-start attempt.

Rolling-upgrade compat-mode guard: TestCluster::spawn_three now waits for all three nodes to exit rolling-upgrade compat mode before returning. While in compat mode, propose_catalog_entry returns Ok(0) without going through Raft, taking a non-replicated legacy path. Tests that issued DDL immediately after join convergence would silently get a leader-only write and then find the record missing on followers.

Test transports use a 4-second RPC timeout instead of the production 5-second default, cutting join-failure test wall time substantially. Wait budgets for all convergence checks are widened from 5s to 10s to absorb cold-start election lag on loaded CI runners without masking genuine regressions.

The descriptor lease renewal test creates its collection before acquiring the lease so the renewal loop's lookup_current_version finds it and does not prematurely release the lease as orphaned.
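The panic-safe teardown pattern can be sketched with std threads standing in for tokio tasks. This is a hedged sketch, not the actual harness code: the real Drop impl fires watch shutdown senders and calls JoinHandle::abort, whereas this stand-in signals over mpsc and joins:

```rust
use std::sync::mpsc;
use std::thread;

// Illustrative stand-in for TestClusterNode's teardown state.
struct TestNode {
    shutdown_txs: Vec<mpsc::Sender<()>>,
    handles: Vec<thread::JoinHandle<()>>,
}

impl Drop for TestNode {
    fn drop(&mut self) {
        // Runs even when the owning test panics, so background work is
        // never left alive until the harness slow-timeout kicks in.
        for tx in self.shutdown_txs.drain(..) {
            let _ = tx.send(()); // receiver may already be gone; ignore
        }
        for handle in self.handles.drain(..) {
            let _ = handle.join();
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        // Stand-in for a background task: block until shutdown is signalled.
        let _ = rx.recv();
    });
    let node = TestNode { shutdown_txs: vec![tx], handles: vec![handle] };
    drop(node); // signals shutdown and joins; returns promptly
    println!("ok");
}
```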
…e renewal

system_catalog now opens all declared redb tables during the init transaction. Tables that were referenced later but never opened in the migration block caused a redb schema mismatch on the first write after an upgrade (alert_rules, retention_policies, sequences, sequence_state, column_stats, vector_model_metadata, checkpoints).

JWT test RSA keygen switched from 2048-bit to 1024-bit keys. The tests exercise signing and verification logic, not key strength; the reduced size cuts per-test keygen time ~10x without changing coverage.

Lease renewal code drops inline comments that duplicated the logic they annotated verbatim, and the unit test now uses struct-update syntax for ClusterTransportTuning so it reads clearly.

drop(lock_guard) before an await in the peer warm-up path in main.rs is replaced with a scoped block to satisfy clippy::await_holding_lock.
…nded getting-started guide

Docker: the image no longer runs as root. A new docker-entrypoint.sh (using gosu) fixes ownership on the data volume when started as root, then drops to uid 10001 (nodedb) before exec-ing the server. When started as a non-root user (--user 10001:10001) the entrypoint passes through directly. This makes named-volume mounts work on Linux hosts where Docker initialises volumes as root-owned.

CI: the test workflow now installs cargo-nextest via taiki-e/install-action and runs cargo nextest run. Plain cargo test ignores the nextest.toml cluster test-group that serialises 3-node integration tests, and would hang on the cluster suite. JUnit output is uploaded as an artifact on every run for post-mortem analysis.

Docs: getting-started gains a prebuilt-binary install section for Linux (x64 and arm64), plain docker run instructions alongside the existing Compose block, a systemd unit example, and a unified configuration reference that applies to all install methods. The README test command is updated to reflect nextest.
Cluster integration tests spin up 3-node Raft clusters with per-node Tokio runtimes; running them alongside the rest of the suite caused port/fd exhaustion and starved Raft heartbeats on high-core machines. Pin them to a single-threaded test-group that claims all test slots, and allow one retry for startup jitter. CI profile adds more retries and JUnit output.
Two independent but compounding issues caused cluster join to hang for tens of seconds on every startup when seeds were not yet bootstrapped:

1. The QUIC RPC timeout only covered the response-read phase. A handshake attempt against an unreachable or not-yet-listening peer blocked for the transport's internal idle timeout (~30 s), not the configured RPC timeout. In a 5-node race where every non-bootstrapper seed redirects to another non-bootstrapper, this multiplied to (N-1) × 30 s of wasted wall time per join attempt. Fixed by wrapping the entire send_rpc_to_addr operation — handshake, stream open, write, and read — in a single tokio::time::timeout bounded by self.rpc_timeout, and extracting the inner work into send_rpc_to_addr_inner so the public interface stays clean.

2. The seed work-list was a Vec used as a stack (pop), so seed order was unspecified. Under the single-elected-bootstrapper rule the lexicographically smallest address is the one peer that can actually answer during the initial race; hitting it last meant exhausting timeouts against every other seed first. Fixed by sorting seeds at the start of the join loop so the designated bootstrapper surfaces first, and switching to VecDeque so leader redirects are pushed to the front (push_front / pop_front) and consumed before unvisited seeds.
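The seed-ordering half of the fix can be sketched with std types. A minimal sketch, with `seed_queue` as an illustrative helper name (the real code sorts inside the join loop):

```rust
use std::collections::VecDeque;

// Sort seeds so the lexicographically smallest address (the elected
// bootstrapper under the single-elected-bootstrapper rule) is tried
// first, and use a VecDeque so redirects can jump the queue.
fn seed_queue(mut seeds: Vec<String>) -> VecDeque<String> {
    seeds.sort();
    seeds.into_iter().collect()
}

fn main() {
    let mut q = seed_queue(vec![
        "node-c:7000".to_string(),
        "node-a:7000".to_string(),
        "node-b:7000".to_string(),
    ]);
    // The designated bootstrapper surfaces first.
    assert_eq!(q.pop_front().as_deref(), Some("node-a:7000"));
    // A leader redirect is pushed to the front, ahead of unvisited seeds.
    q.push_front("node-b:7000".to_string());
    assert_eq!(q.pop_front().as_deref(), Some("node-b:7000"));
    println!("ok");
}
```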
Closes #20.
Full context, reproduction, and before/after numbers on the issue: #20 (comment)
Summary
Reported symptom: a fresh NodeDB container (or native binary) idles at ~150-175% CPU with no schema, no clients, and no workload. This branch fixes the root cause plus the four related side-findings in the same report.
### 1. Idle CPU burn (primary fix — commit `c10ee61`)

Two busy-poll loops in the Control / Event Plane had no idle backoff:

- The response poller loop (`main.rs`) was `loop { poll_and_route_responses(); yield_now().await; }`. `yield_now()` immediately re-schedules, so one Tokio worker was pegged at 100% with zero clients connected.
- The Event Plane consumer (`event/consumer.rs`) empty-ring poll was fixed at 1ms per core × 23 Data Plane cores ≈ 23k task wakes/sec even at idle.

Fix: `poll_and_route_responses` now returns the number of routed responses so the loop can detect activity, and the loop backs off adaptively: `yield_now` while routing or within 256 idle iters (sub-ms burst recovery), then 1ms, then 10ms. The hot path still uses `yield_now` while responses are flowing, so request latency under load is unchanged.

### 2. Missing system-table warnings (commit `1bd04da`)

The `SystemCatalog::open` init transaction was missing 7 tables, so fresh DBs spewed `Table '_system.alert_rules' does not exist` etc. on startup. Added `ALERT_RULES`, `RETENTION_POLICIES`, `SEQUENCES`, `SEQUENCE_STATE`, `COLUMN_STATS`, `VECTOR_MODEL_METADATA`, `CHECKPOINTS` to the init path.

### 3. Docker volume permission UX (commit `751e804`)

A new `docker-entrypoint.sh` runs as root just long enough to `chown` the data volume to `nodedb:nodedb`, then `exec gosu nodedb` drops privileges. `-v nodedb-data:/var/lib/nodedb` now works without `--user 0:0`; if the container is started with `--user 10001:10001` the entrypoint detects it and skips the chown. This fixes the `WAL I/O error: Permission denied (os error 13)` seen when the volume is unwritable. Dockerfile changes: install `gosu`, move the `USER` switch into the entrypoint, add `COPY docker-entrypoint.sh`.

### 4. Docs / mount-path mismatch (commit `751e804`)

`nodedb-docs/docs/introduction/docker.rdx` now uses `/var/lib/nodedb` everywhere (it was `-v nodedb-data:/data` in 5 places, which silently lost data on `docker rm` because the image doesn't set `NODEDB_DATA_DIR=/data`). New `installation.rdx` and `getting-started.md` structure: prebuilt Linux binary recommended first (best performance), Docker for macOS/Windows/WSL2, source for development; all share one Configuration section. The binary download command resolves the `latest` tag and arch dynamically — no hardcoded version.

### 5. Misc improvements that landed while diagnosing

- `06d3b07` — nextest config serialising cluster tests + retry-on-flake.
- `3ce34b8` — cluster test harness hardened against shutdown/replication flakes (including the `wait_for` for rolling-upgrade compat-mode exit at the bottom of `TestCluster::spawn_three`).
- `48b1081` — `nodedb-cluster` join retry policy is now configurable via `JoinRetryConfig` instead of hardcoded constants.
- `a88356b` — closes a post-apply race between the in-memory metadata cache and the applied-index watcher (small, unrelated to idle CPU but caught by flaky tests during this work).

Each of these is its own commit for a reason — happy to split the branch into multiple PRs if reviewers prefer, but they're all small and interrelated enough that I kept them together.
Verification
- `cargo nextest run` — green.
- `cargo clippy --all-targets --all-features -- -D warnings` — clean.
- `cargo fmt --all` — clean.
- `CREATE COLLECTION`, `INSERT`, `SELECT` all work; CPU drops back to <1% immediately after the query.

Test plan

- No `_system.*` warnings on fresh DB start
- `docker run -v nodedb-data:/var/lib/nodedb` works without `--user 0:0`
- `cargo nextest run`
- `cargo clippy --all-targets --all-features -- -D warnings`
- `cargo fmt --all`