bug: reaper discards child exit status, masking agent crash root cause #21661

## Problem

Containers are restarting with termination reason "Completed" and exit code 0, even though the agent process is most likely crashing. Because the real exit status is never reported, root cause analysis is not possible.

## Observed Behavior

- Container restarts correlate with `yamux: Failed to read header: failed to read frame header: EOF` on the server
- All restarts show exit code 0 with "Completed", regardless of the actual cause
- No OOM pattern (memory usage varied from 9% to 89% at crash time)
- No server-side disconnect or timeout logs precede the crashes

## Root Cause Analysis

The agent runs as a child of a reaper process (PID 1). In `agent/reaper/reaper_unix.go`, when the child exits, the reaper calls `Wait4` but **discards the exit status entirely** and exits 0. This masks:

- Panic crashes (exit code 2)
- SIGKILL from cgroup limits
- Any other termination signal

Additionally, there is no `recover()` in the agent's production code, so an unrecovered panic in any goroutine terminates the whole process immediately with no captured output.

## Proposed Fixes

1. **Reaper logging:** Patch the reaper to log the child's exit status (`wstatus` from `Wait4`) before exiting. The next crash would then show immediately whether the cause is a panic, SIGKILL, or another signal.

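2. **Panic recovery:** Ensure every goroutine defers a function that calls `recover()` at the top of its entry function. A panic is only recoverable from a deferred function in the goroutine that panicked, so each goroutine needs its own handler. This would catch panics and log the stack trace, so crashes are no longer silent and impossible to diagnose.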

## Environment

- Kubernetes deployment with `singleProcessOOMKill` enabled
- 220GB memory limit
