bug: reaper discards child exit status, masking agent crash root cause #21661

@blinkagent

Problem

Container restarts are occurring with termination reason "Completed" and exit code 0, despite the agent process likely crashing. This makes root cause analysis impossible.

Observed Behavior

  • Container restarts correlate with yamux: Failed to read header: failed to read frame header: EOF on the server
  • All restarts show exit code 0 with "Completed" — regardless of actual cause
  • No OOM pattern (memory usage varied from 9% to 89% at crash time)
  • No server-side disconnect or timeout logs precede the crashes

Root Cause Analysis

The agent runs as a child of a reaper process (PID 1). In agent/reaper/reaper_unix.go, when the child exits, the reaper calls Wait4 but discards the exit status entirely and exits 0. This masks:

  • Panic crashes (exit code 2)
  • SIGKILL from cgroup limits
  • Any other termination signal

Additionally, there is no recover() in the agent's production code, so a panic in any goroutine terminates the whole process immediately. Combined with the reaper exiting 0, the crash leaves no captured output to diagnose it from.
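To illustrate what the discarded status would reveal, here is a minimal Go sketch that decodes the wstatus filled in by Wait4. It is not the code in agent/reaper/reaper_unix.go: the names describeWaitStatus and runAndDescribe are hypothetical, the /bin/sh child is only a stand-in for the agent, and mapping a signal to 128+signal is the common shell convention rather than anything this issue specifies.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// describeWaitStatus (hypothetical helper) turns the raw wait status into a
// readable cause plus an exit code a reaper could propagate.
func describeWaitStatus(ws syscall.WaitStatus) (string, int) {
	switch {
	case ws.Exited():
		// Normal exit, including the exit code 2 of an unrecovered Go panic.
		return fmt.Sprintf("exited with code %d", ws.ExitStatus()), ws.ExitStatus()
	case ws.Signaled():
		// Killed by a signal, e.g. SIGKILL. 128+signal is an assumed
		// convention borrowed from shells.
		sig := ws.Signal()
		return fmt.Sprintf("killed by signal %d (%s)", int(sig), sig), 128 + int(sig)
	default:
		return fmt.Sprintf("unexpected wait status %#x", uint32(ws)), 1
	}
}

// runAndDescribe (hypothetical helper) forks a child, waits for it with
// Wait4, and decodes the status instead of discarding it.
func runAndDescribe(argv0 string, argv []string) (string, int, error) {
	pid, err := syscall.ForkExec(argv0, argv, &syscall.ProcAttr{
		Env:   os.Environ(),
		Files: []uintptr{0, 1, 2}, // inherit stdin, stdout, stderr
	})
	if err != nil {
		return "", 0, fmt.Errorf("fork/exec: %w", err)
	}

	var ws syscall.WaitStatus
	for {
		_, err = syscall.Wait4(pid, &ws, 0, nil)
		if err != syscall.EINTR {
			break // retry only when the wait was interrupted
		}
	}
	if err != nil {
		return "", 0, fmt.Errorf("wait4: %w", err)
	}

	msg, code := describeWaitStatus(ws)
	return msg, code, nil
}

func main() {
	// The child here simulates a process that exits with code 2.
	msg, code, err := runAndDescribe("/bin/sh", []string{"sh", "-c", "exit 2"})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	// A real reaper would log this and then call os.Exit(code) rather than
	// exiting 0 unconditionally.
	fmt.Fprintf(os.Stderr, "reaper: child %s, would exit with %d\n", msg, code)
}
```

With something along these lines in place, the container's termination reason would distinguish a panic from a signal kill, assuming the reaper also exits with the decoded code.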

Proposed Fixes

  1. Reaper logging: Patch the reaper to log the child's exit status (wstatus from Wait4) before exiting. This would immediately reveal on the next crash whether it's a panic, SIGKILL, or other cause.

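  2. Panic recovery: Give every goroutine the agent starts a deferred function that calls recover(). A bare "defer recover()" is not enough, because recover() only stops a panic when it is called directly inside the deferred function, and a handler in one goroutine cannot catch a panic raised in another. The handler should log the panic value and stack trace so that crashes are no longer silent and impossible to diagnose.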
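The second fix could look roughly like the Go sketch below. The wrapper name runRecovered and the goroutine name "example-worker" are hypothetical, and the standard log package stands in for whatever logger the agent actually uses.

```go
package main

import (
	"fmt"
	"log"
	"runtime/debug"
)

// runRecovered (hypothetical helper) runs fn and converts a panic into a
// logged stack trace plus an error, instead of letting it kill the process.
func runRecovered(name string, fn func()) (err error) {
	defer func() {
		// recover() must be called directly in the deferred function.
		if r := recover(); r != nil {
			log.Printf("goroutine %q panicked: %v\n%s", name, r, debug.Stack())
			err = fmt.Errorf("goroutine %q panicked: %v", name, r)
		}
	}()
	fn()
	return nil
}

func main() {
	errCh := make(chan error, 1)

	// Each goroutine needs its own wrapper: a recover() in main would not
	// see a panic raised inside this goroutine.
	go func() {
		errCh <- runRecovered("example-worker", func() {
			var m map[string]int
			m["x"] = 1 // writing to a nil map panics
		})
	}()

	if err := <-errCh; err != nil {
		fmt.Println("recovered:", err)
	}
}
```

Whether the agent should keep running after a recovered panic or log and then exit non-zero is a separate design choice. The sketch only shows how the stack trace can be captured before the process goes away.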

Environment

  • Kubernetes deployment with singleProcessOOMKill enabled
  • 220GB memory limit
