
fix: eliminate idle CPU burn + missing system-table warnings + Docker volume UX (closes #20) #21

Merged
farhan-syah merged 8 commits into main from fix/idle-cpu-burn
Apr 14, 2026

Conversation

@farhan-syah
Contributor

Closes #20.

Full context, reproduction, and before/after numbers on the issue: #20 (comment)

Summary

Reported symptom: a fresh NodeDB container (or native binary) idles at ~150-175% CPU with no schema, no clients, and no workload. This branch fixes the root cause plus the four related side-findings in the same report.

1. Idle CPU burn (primary fix — commit c10ee61)

Two busy-poll loops in the Control / Event Plane had no idle backoff:

  • Response poller (main.rs) was loop { poll_and_route_responses(); yield_now().await; }. yield_now() immediately re-schedules, so one Tokio worker was pegged at 100% with zero clients connected.
  • Event consumer (event/consumer.rs) empty-ring poll was fixed at 1ms per core × 23 Data Plane cores ≈ 23k task wakes/sec even at idle.

Fix:

  • poll_and_route_responses now returns the number of routed responses so the loop can detect activity.
  • Response poller uses adaptive backoff: yield_now while routing or within 256 idle iters (sub-ms burst recovery), then 1ms, then 10ms.
  • Event consumer empty-poll ramps from 1ms → 50ms after 32 consecutive empty polls, resets on the first batch.

The hot path still uses yield_now while responses are flowing, so request latency under load is unchanged.
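The schedule above can be sketched as a small dependency-free helper; the function name and the exact match arms are illustrative (the real loop lives in main.rs), but the thresholds follow the description: yield within 256 idle iterations, 1 ms up to 1024, then 10 ms.

```rust
use std::time::Duration;

/// Idle-backoff schedule for the response poller (illustrative names).
/// `None` means "just yield_now(), don't sleep" — the hot path.
fn idle_backoff(idle_iters: u32) -> Option<Duration> {
    match idle_iters {
        0..=256 => None,                               // burst recovery: stay hot
        257..=1024 => Some(Duration::from_millis(1)),  // first ramp
        _ => Some(Duration::from_millis(10)),          // deep idle
    }
}

fn main() {
    assert_eq!(idle_backoff(10), None);
    assert_eq!(idle_backoff(500), Some(Duration::from_millis(1)));
    assert_eq!(idle_backoff(5000), Some(Duration::from_millis(10)));
    println!("ok");
}
```

Any routed response resets the idle counter to zero, which is why latency under load is unaffected: the loop never sleeps while work is flowing.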

2. Missing system-table warnings (commit 1bd04da)

The SystemCatalog::open init transaction was missing 7 tables, so fresh DBs spewed Table '_system.alert_rules' does not exist etc. on startup. Added ALERT_RULES, RETENTION_POLICIES, SEQUENCES, SEQUENCE_STATE, COLUMN_STATS, VECTOR_MODEL_METADATA, CHECKPOINTS to the init path.

3. Docker volume permission UX (commit 751e804)

  • New docker-entrypoint.sh: runs as root just long enough to chown the data volume to nodedb:nodedb, then exec gosu nodedb drops privileges. -v nodedb-data:/var/lib/nodedb now works without --user 0:0. If the container is started with --user 10001:10001 the entrypoint detects it and skips the chown.
  • Clear actionable error instead of cryptic WAL I/O error: Permission denied (os error 13) when the volume is unwritable.
  • Dockerfile: added gosu, moved USER switch into the entrypoint, added COPY docker-entrypoint.sh.

4. Docs / mount-path mismatch (commit 751e804)

  • nodedb-docs/docs/introduction/docker.rdx now uses /var/lib/nodedb everywhere (was -v nodedb-data:/data in 5 places, which silently lost data on docker rm because the image doesn't set NODEDB_DATA_DIR=/data).
  • New installation.rdx and getting-started.md structure: prebuilt Linux binary recommended first (best performance), Docker for macOS/Windows/WSL2, source for development. All share one Configuration section. Binary download command resolves latest tag and arch dynamically — no hardcoded version.

5. Misc improvements that landed while diagnosing

  • 06d3b07 — nextest config serialising cluster tests + retry-on-flake.
  • 3ce34b8 — cluster test harness hardened against shutdown/replication flakes (including the wait_for for rolling-upgrade compat-mode exit at the bottom of TestCluster::spawn_three).
  • 48b1081 — nodedb-cluster join retry policy is now configurable via JoinRetryConfig instead of hardcoded constants.
  • a88356b — closes a post-apply race between the in-memory metadata cache and the applied-index watcher (small, unrelated to idle CPU but caught by flaky tests during this work).

Each of these is its own commit for a reason — happy to split the branch into multiple PRs if reviewers prefer, but they're all small and interrelated enough that I kept them together.

Verification

| Build | Idle CPU | Startup warnings |
| --- | --- | --- |
| v0.0.0 release binary (bare Linux) | 149% | yes |
| v0.0.0 Docker image | 175% | yes |
| Post-fix binary (bare Linux) | 0.0% | none |
| Post-fix Docker image (this branch) | 0.86% | none |
  • cargo nextest run — green.
  • cargo clippy --all-targets --all-features -- -D warnings — clean.
  • cargo fmt --all — clean.
  • Manual pgwire smoke test against the post-fix Docker image: connect, CREATE COLLECTION, INSERT, SELECT all work; CPU drops back to <1% immediately after the query.

Test plan

  • Reproduce idle CPU on v0.0.0 binary and Docker image
  • Verify fix on native binary (0.0% idle)
  • Verify fix in Docker image (0.86% idle)
  • Confirm psql connect + CRUD still works after the fix
  • Confirm no _system.* warnings on fresh DB start
  • Confirm docker run -v nodedb-data:/var/lib/nodedb works without --user 0:0
  • cargo nextest run
  • cargo clippy --all-targets --all-features -- -D warnings
  • cargo fmt --all

…sumer

The response poller loop unconditionally called yield_now() even when
no responses were in flight, keeping a tokio worker pinned at ~100%
CPU on an idle server. Similarly the Event Plane consumer woke every
1ms regardless of ring buffer activity.

response_poller now uses adaptive backoff: yield_now() while active,
ramp to sleep(1ms) after 256 idle iterations, then sleep(10ms) after
1024 (roughly one second of idleness). This bounds idle CPU to ~0.1%
of one core while preserving sub-millisecond latency under load.

The Event Plane consumer gains the same adaptive ramp: it stays at
1ms for the first 32 empty polls then backs off to 50ms, capping
idle wakeups to ~20/sec per core rather than 1000/sec.

poll_and_route_responses now returns the routed-response count so the
poller can distinguish active from idle iterations.

The data-plane tick loops in test harnesses (and session.rs) are
tightened to exit on Disconnected as well as on the stop signal —
previously a panic-induced drop of the sender left spawn_blocking
threads spinning forever on a closed channel, which blocked tokio
runtime shutdown and wasted CI time at slow-timeout.
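The Disconnected handling can be sketched with a plain std::sync::mpsc loop; the function name and channel payload are illustrative, not the real harness code:

```rust
use std::sync::mpsc::{Receiver, TryRecvError};
use std::thread;
use std::time::Duration;

/// Tick loop in the spirit of the fix: keep ticking while the channel is
/// alive, but exit as soon as the sender side is gone. Returns how many
/// work items were processed.
fn tick_until_disconnected(rx: Receiver<u32>) -> u32 {
    let mut processed = 0;
    loop {
        match rx.try_recv() {
            Ok(_work) => processed += 1,
            Err(TryRecvError::Empty) => thread::sleep(Duration::from_millis(1)),
            // Without this arm the loop spins forever once the sender is
            // dropped (e.g. by a panicking test), blocking runtime shutdown.
            Err(TryRecvError::Disconnected) => break,
        }
    }
    processed
}

fn main() {
    let (tx, rx) = std::sync::mpsc::channel();
    let worker = thread::spawn(move || tick_until_disconnected(rx));
    tx.send(1).unwrap();
    tx.send(2).unwrap();
    drop(tx); // simulates a panic-induced drop of the sender
    assert_eq!(worker.join().unwrap(), 2);
    println!("ok");
}
```

Note that try_recv drains any buffered items before reporting Disconnected, so no queued work is lost on exit.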
…d-index watcher

Previously all post-apply side effects ran inside a tokio::spawn task.
The metadata applier then bumped the applied-index watcher, meaning a
reader that woke on the watcher bump (e.g. waiting for applied_index
to advance past N) could query the in-memory credential or permission
cache before install_replicated_user / install_replicated_owner had
run — a scheduler-order race that caused sporadic test failures.

Split post-apply into two phases:

- apply_post_apply_side_effects_sync runs inline on the applier thread
  BEFORE the watcher bump, covering all in-memory cache updates (users,
  roles, permissions, API keys, sequences, etc.). Any reader observing
  applied_index >= N is now guaranteed to see every sync side-effect
  of every entry up to N.

- spawn_post_apply_async_side_effects spawns the genuinely async work
  (Data Plane Register dispatch for PutCollection). Correctness does
  not depend on this completing before the watcher advances.
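The ordering guarantee can be sketched with plain atomics standing in for the real watch channel and metadata cache; all names here are illustrative:

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Publish the sync side effect (cache write) BEFORE the applied-index
/// bump; a reader that observes applied_index >= 1 must also see the
/// cache entry. Returns what the reader saw.
fn run_scenario() -> bool {
    let cache_installed = Arc::new(AtomicBool::new(false));
    let applied_index = Arc::new(AtomicU64::new(0));

    let (c, a) = (cache_installed.clone(), applied_index.clone());
    let reader = thread::spawn(move || {
        // Stand-in for waking on the applied-index watcher.
        while a.load(Ordering::Acquire) < 1 {
            std::hint::spin_loop();
        }
        c.load(Ordering::Acquire)
    });

    // Applier, phase 1: sync side effects run before the watcher bump.
    cache_installed.store(true, Ordering::Release);
    // Watcher bump: Release pairs with the reader's Acquire load above.
    applied_index.store(1, Ordering::Release);
    // Phase 2 (genuinely async work) could be spawned here; correctness
    // does not depend on it finishing first.

    reader.join().unwrap()
}

fn main() {
    assert!(run_scenario(), "reader must observe the sync side effect");
    println!("ok");
}
```

The old code was the mirror image — the cache write raced in a spawned task after the bump — which is exactly the window the flaky tests kept hitting.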

Also tighten the cluster-mode CREATE USER path: if the user entry is
missing after propose_catalog_entry returns (which can happen when a
leader change truncates the log entry between assignment and quorum
commit), return a retryable 40001 error so exec_ddl_on_any_leader
re-proposes on the current leader rather than silently succeeding with
a phantom log index.

Single-node mode is unchanged: it still writes to redb and installs
the cache entry inline when a catalog is present, and works correctly
without one (test fixtures).
…nRetryPolicy

The join loop's backoff schedule was a hard-coded match arm table with
a fixed attempt count. This made integration tests that exercise
join-failure paths (e.g. cluster_join_leader_crash) wait up to ~64
seconds of cumulative backoff per run.

Extract the policy into JoinRetryPolicy { max_attempts, max_backoff_secs }
with a Default that preserves the production schedule (8 attempts, 32 s
ceiling). The per-attempt delay is now derived from a single ceiling
value: delay = max_backoff_secs >> (max_attempts - attempt), so the
schedule grows exponentially from ~ceiling/2^max_attempts up to the
ceiling. The formula is tested directly.
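A sketch of the formula under the stated defaults; beyond the two field names given above, the struct layout and method name are illustrative:

```rust
/// Join backoff policy as described above (field names from the commit
/// message; the rest is a sketch).
#[derive(Clone, Copy)]
struct JoinRetryPolicy {
    max_attempts: u32,
    max_backoff_secs: u64,
}

impl Default for JoinRetryPolicy {
    fn default() -> Self {
        // Production schedule: 8 attempts, 32 s ceiling.
        Self { max_attempts: 8, max_backoff_secs: 32 }
    }
}

impl JoinRetryPolicy {
    /// delay = max_backoff_secs >> (max_attempts - attempt)
    fn delay_secs(&self, attempt: u32) -> u64 {
        let shift = self.max_attempts.saturating_sub(attempt);
        self.max_backoff_secs >> shift.min(63)
    }
}

fn main() {
    let p = JoinRetryPolicy::default();
    assert_eq!(p.delay_secs(1), 0);  // 32 >> 7
    assert_eq!(p.delay_secs(6), 8);  // 32 >> 2
    assert_eq!(p.delay_secs(8), 32); // 32 >> 0, the ceiling
    println!("ok");
}
```

With the defaults the per-attempt delays come out to 0, 0, 1, 2, 4, 8, 16, 32 seconds — about 63 s cumulative, matching the "~64 seconds" the old hard-coded table cost the failing tests.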

ClusterConfig gains a join_retry field. nodedb's cluster init reads
NODEDB_JOIN_RETRY_MAX_ATTEMPTS and NODEDB_JOIN_RETRY_MAX_BACKOFF_SECS
from the environment so CI and integration test harnesses can override
the schedule without recompiling.

The raft_loop match arm for Ok(idx) was incorrectly structured as a
statement; fixed to return the value directly.
…replication races

Several independent sources of CI flakiness in the cluster integration
suite are addressed together since they compound each other:

Panic-safe teardown: TestClusterNode now implements Drop, firing all
watch shutdown senders and aborting every JoinHandle synchronously.
Previously a panicking test dropped the node without signalling
shutdown, leaving background tasks alive, redb file handles open, and
the tokio runtime blocked until nextest killed the process at
slow-timeout (~2 minutes per flaky test).

Applied-index convergence barrier: exec_ddl_on_any_leader now waits
for every follower's applied_index to reach the proposer's current
watermark before returning. propose_catalog_entry already waits for
the entry to commit on the proposing node, but followers apply
asynchronously. Without this barrier, subsequent visibility checks on
followers would race the applier queue and trip their timeouts on the
cold-start attempt.

Rolling-upgrade compat-mode guard: TestCluster::spawn_three now waits
for all three nodes to exit rolling-upgrade compat mode before
returning. While in compat mode, propose_catalog_entry returns Ok(0)
without going through Raft, taking a non-replicated legacy path. Tests
that issued DDL immediately after join convergence would silently get
a leader-only write and then find the record missing on followers.

Test transports use a 4-second RPC timeout instead of the production
5-second default, cutting join-failure test wall time substantially.

Wait budgets for all convergence checks are widened from 5s to 10s to
absorb cold-start election lag on loaded CI runners without masking
genuine regressions.

Descriptor lease renewal test creates its collection before acquiring
the lease so the renewal loop's lookup_current_version finds it and
does not prematurely release the lease as orphaned.
…e renewal

system_catalog now opens all declared redb tables during the init
transaction. Tables that were referenced later but never opened in the
migration block caused a redb schema mismatch on the first write after
an upgrade (alert_rules, retention_policies, sequences, sequence_state,
column_stats, vector_model_metadata, checkpoints).

JWT test RSA keygen switched from 2048-bit to 1024-bit keys. The tests
exercise signing and verification logic, not key strength; the reduced
size cuts per-test keygen time ~10x without changing coverage.

Lease renewal code drops inline comments that merely restated the logic
they annotated; the ClusterTransportTuning construction in the unit test
now uses struct-update syntax so it reads clearly.

drop(lock_guard) before await in the peer warm-up path in main.rs is
replaced with a scoped block to satisfy clippy::await_holding_lock.
…nded getting-started guide

Docker: the image no longer runs as root. A new docker-entrypoint.sh
(using gosu) fixes ownership on the data volume when started as root,
then drops to uid 10001 (nodedb) before exec-ing the server. When
already started as a non-root user (--user 10001:10001) the entrypoint
passes through directly. This makes named-volume mounts work on Linux
hosts where Docker initialises volumes as root-owned.

CI: the test workflow now installs cargo-nextest via taiki-e/install-action
and runs cargo nextest run. Plain cargo test ignores the nextest.toml
cluster test-group that serialises 3-node integration tests and would
hang on the cluster suite. JUnit output is uploaded as an artifact on
every run for post-mortem analysis.

Docs: getting-started gains a prebuilt binary install section for
Linux (x64 and arm64), plain docker run instructions alongside the
existing Compose block, a systemd unit example, and a unified
configuration reference that applies to all install methods.
README test command updated to reflect nextest.
Cluster integration tests spin up 3-node Raft clusters with per-node
Tokio runtimes; running them alongside the rest of the suite caused
port/fd exhaustion and starved Raft heartbeats on high-core machines.
Pin them to a single-threaded test-group that claims all test slots,
and allow one retry for startup jitter. CI profile adds more retries
and JUnit output.
Two independent but compounding issues caused cluster join to hang for
tens of seconds on every startup when seeds were not yet bootstrapped:

1. The QUIC RPC timeout only covered the response-read phase. A
   handshake attempt against an unreachable or not-yet-listening peer
   blocked for the transport's internal idle timeout (~30 s), not the
   configured RPC timeout. In a 5-node race where every non-bootstrapper
   seed redirects to another non-bootstrapper, this multiplied to
   (N-1) × 30 s of wasted wall time per join attempt.

   Fixed by wrapping the entire send_rpc_to_addr operation — handshake,
   stream open, write, and read — in a single tokio::time::timeout
   bounded by self.rpc_timeout, and extracting the inner work into
   send_rpc_to_addr_inner so the public interface stays clean.
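   A sketch of the whole-operation timeout, assuming tokio; the inner
   function here is a stand-in that simulates an unresponsive peer, not
   the real QUIC code:

   ```rust
   use std::time::Duration;
   use tokio::time::{sleep, timeout};

   // Stand-in for send_rpc_to_addr_inner: a peer that never answers
   // within any reasonable window (transport idle-timeout territory).
   async fn send_rpc_to_addr_inner() -> Result<Vec<u8>, String> {
       sleep(Duration::from_secs(30)).await;
       Ok(vec![])
   }

   /// One timeout bounds the whole operation — handshake, stream open,
   /// write, and read — so a dead peer costs rpc_timeout, not ~30 s.
   async fn send_rpc_to_addr(rpc_timeout: Duration) -> Result<Vec<u8>, String> {
       timeout(rpc_timeout, send_rpc_to_addr_inner())
           .await
           .map_err(|_elapsed| "rpc timed out".to_string())?
   }

   #[tokio::main(flavor = "current_thread")]
   async fn main() {
       let res = send_rpc_to_addr(Duration::from_millis(50)).await;
       assert_eq!(res, Err("rpc timed out".to_string()));
       println!("ok");
   }
   ```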

2. The seed work-list was a Vec used as a stack (pop), so seed order
   was unspecified. Under the single-elected-bootstrapper rule the
   lexicographically smallest address is the one peer that can actually
   answer during the initial race; hitting it last meant exhausting
   timeouts against every other seed first.

   Fixed by sorting seeds at the start of the join loop so the
   designated bootstrapper surfaces first, and switching to VecDeque
   so leader redirects are pushed to the front (push_front / pop_front)
   and consumed before unvisited seeds.
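   The seed-ordering fix can be sketched with std's VecDeque (the
   addresses and helper name are made up):

   ```rust
   use std::collections::VecDeque;

   /// Sort seeds so the lexicographically smallest address — the one
   /// elected bootstrapper — surfaces first, then use a deque so leader
   /// redirects can jump the queue.
   fn ordered_seeds(mut seeds: Vec<&'static str>) -> VecDeque<&'static str> {
       seeds.sort();
       seeds.into_iter().collect()
   }

   fn main() {
       let mut work = ordered_seeds(vec!["node-c:7000", "node-a:7000", "node-b:7000"]);
       assert_eq!(work.pop_front(), Some("node-a:7000")); // bootstrapper first

       // A leader redirect is pushed to the front and consumed before
       // any unvisited seed.
       work.push_front("node-leader:7000");
       assert_eq!(work.pop_front(), Some("node-leader:7000"));
       assert_eq!(work.pop_front(), Some("node-b:7000"));
       println!("ok");
   }
   ```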
@farhan-syah farhan-syah merged commit 1ce86d0 into main Apr 14, 2026
2 checks passed
@farhan-syah farhan-syah deleted the fix/idle-cpu-burn branch April 14, 2026 14:19

Successfully merging this pull request may close these issues.

Fresh container idles at ~144% CPU (v0.0.0) — no schema, no clients, no workload
