Description
Problem
Container restarts are occurring with termination reason "Completed" and exit code 0, even though the agent process is likely crashing. Because the real exit status is lost, root cause analysis is impossible.
Observed Behavior
- Container restarts correlate with `yamux: Failed to read header: failed to read frame header: EOF` on the server
- All restarts show exit code 0 with "Completed", regardless of actual cause
- No OOM pattern (memory usage varied from 9% to 89% at crash time)
- No server-side disconnect or timeout logs precede the crashes
Root Cause Analysis
The agent runs as a child of a reaper process (PID 1). In agent/reaper/reaper_unix.go, when the child exits, the reaper calls Wait4 but discards the exit status entirely and exits 0. This masks:
- Panic crashes (exit code 2)
- SIGKILL from cgroup limits
- Any other termination signal
Additionally, there is no `recover()` in the agent's production code, so any goroutine panic crashes the process immediately with no captured output.
Proposed Fixes
- Reaper logging: Patch the reaper to log the child's exit status (`wstatus` from `Wait4`) before exiting. This would immediately reveal on the next crash whether it's a panic, SIGKILL, or other cause.
- Panic recovery: Ensure every goroutine has a deferred `recover()` declared at the top of its function. This would catch panics, log the stack trace, and prevent silent crashes that are impossible to diagnose.
Environment
- Kubernetes deployment with `singleProcessOOMKill` enabled
- 220GB memory limit