Benchmark the real Pilot Go binary on Terminal-Bench 2.0. Thin Python shim wraps the production pilot task --local command — same executor pipeline as production, minus git workflow.
Harbor Orchestrator → PilotAgent (Python shim) → pilot binary (Go) → Claude Code
↓ ↓
Install & upload binary Full executor pipeline:
Upload test files prompt builder, model routing,
Write config.yaml hooks, quality gates, retries
Parse result JSON
Not a from-scratch agent. Claude Code is the executor. Pilot's Go binary handles prompt construction, model routing, quality gates, effort routing, and retries. The Python shim just installs everything in the container and collects results.
| Decision | Chosen | Why |
|---|---|---|
| Agent architecture | Go binary shim (not Python pipeline) | Python v3 agent scored 36.2% — worse than stock Claude Code (58%). Real binary uses production-tested prompt builder. |
| Prompt strategy | LocalMode problem-solving prompt | No restrictive PR constraints. Test-first: "check /tests/test_outputs.py before anything else." |
| Quality gates | pytest on test_outputs.py, retry 2x | Game-changer for tasks where first attempt is close but not exact (chess, data transforms). |
| Heartbeat timeout | 15min (configurable in v2.76+) | Complex tasks (image analysis, large codebases) need more than 5min between events. |
| Container setup | Root, bypassPermissions | Non-root caused 8 task failures (pip site-packages invisible, apt denied). |
| Run | break-filter | chess | gcode | Score | Key Change |
|---|---|---|---|---|---|
| val2-3 | 1.0 | 0.0 | 0.0 | 33% | Baseline — verifier broken |
| val4 | 1.0 | 1.0 | 0.0 | 67% | Quality gates + retry |
| val5 | 1.0 | 0.0 | 0.0 | 33% | Quality gate PATH broken |
| val8 | err | 1.0 | 0.0 | 50% | Sandbox deleted |
| val9 | 1.0 | 0.0 | 0.0 | 33% | Navigator prompt hijack |
| val10 | 1.0 | 1.0 | 1.0 | 100% | LocalMode priority + test-first prompt |
| Agent | Score | Notes |
|---|---|---|
| Pilot Python v3 | 36.2% | 5-step pipeline, 9KB prompt bloat |
| Claude Code (stock) | 58.0% | No optimizations |
| KIRA (best Claude agent) | 74.7% | Opus 4.6, 3 tools |
| Forge Code (#1) | 78.4% | Gemini 3.1 Pro |
| Pilot (target) | ≥58% | Must beat stock Claude Code |
# Prerequisites
pip install harbor-ai
source .env # DAYTONA_API_KEY, DAYTONA_BASE_URL, CLAUDE_CODE_OAUTH_TOKEN
# Build the binary (from project root)
make bench-binary
# 3-task validation
harbor run --job-name pilot-val -o jobs \
-d [email protected] --agent-import-path "pilot_agent:PilotAgent" \
-m "anthropic/claude-opus-4-6" -e daytona -n 1 --timeout-multiplier 5.0 \
--ae "CLAUDE_CODE_OAUTH_TOKEN=$CLAUDE_CODE_OAUTH_TOKEN" \
-t chess-best-move -t break-filter-js-from-html -t gcode-to-text
# Full 89-task run
harbor run --job-name pilot-full -o jobs \
-d [email protected] --agent-import-path "pilot_agent:PilotAgent" \
-m "anthropic/claude-opus-4-6" -e daytona -n 1 --timeout-multiplier 5.0 \
--ae "CLAUDE_CODE_OAUTH_TOKEN=$CLAUDE_CODE_OAUTH_TOKEN"
# Check results
python3 pilot_agent/scripts/analyze-results.py jobs/<job-name>pilot-bench/
├── pilot_agent/
│ ├── __init__.py # Exports PilotAgent
│ ├── agent.py # PilotAgent — install, run, parse results
│ ├── templates/
│ │ └── install-pilot-agent.sh.j2 # Container bootstrap (Node, Claude, uv)
│ └── scripts/
│ └── analyze-results.py # Post-run failure analysis
├── bin/
│ └── pilot-linux-amd64 # Pre-built static binary (19MB)
├── pyproject.toml # Python package config
├── WORKLOG.md # Development log with all validation runs
├── CHANGELOG-v4.md # Historical: Python v4 optimization notes
└── README.md # This file
The agent writes a production-grade ~/.pilot/config.yaml in the container:
- Quality gates:
pytest test_outputs.pywith 2 retries on failure - Effort routing: haiku classifier → effort level mapping
- Model routing: all tiers set to the run model (single-model bench)
- Hooks:
run_tests_on_stop,block_destructive - Retry: rate limit (3x, 30s backoff), API error (3x, 5s), timeout (2x, extend 1.5x)
- Intent judge: haiku (requires ANTHROPIC_API_KEY)
Bench findings that became real Pilot improvements:
| Finding | Issue | Status |
|---|---|---|
| LocalMode prompt priority bug | GH-2103 | Merged |
| Configurable HeartbeatTimeout | GH-2104 | Merged |
| Heartbeat cancel on result event | GH-2103 | Merged |
- Sequential only (
n=1) — Daytona free tier: 10GiB, ~4GiB per sandbox - DNS unreliable in Daytona sandboxes — mitigated with
8.8.8.8+ uv pre-install - Intent judge needs
ANTHROPIC_API_KEY(bench uses OAuth token only) - Full run takes 12-20h at
n=1 - Cost:
$1-1.30 per task ($90-115 for full 89-task run)