pilot-bench

Pilot Terminal-Bench Agent

Benchmark the real Pilot Go binary on Terminal-Bench 2.0. Thin Python shim wraps the production pilot task --local command — same executor pipeline as production, minus git workflow.

Architecture

Harbor Orchestrator → PilotAgent (Python shim) → pilot binary (Go) → Claude Code
                      ↓                          ↓
                      Install & upload binary     Full executor pipeline:
                      Upload test files           prompt builder, model routing,
                      Write config.yaml           hooks, quality gates, retries
                      Parse result JSON

Not a from-scratch agent. Claude Code is the executor. Pilot's Go binary handles prompt construction, model routing, quality gates, effort routing, and retries. The Python shim just installs everything in the container and collects results.

Key Design Decisions

Decision	Chosen	Why
Agent architecture	Go binary shim (not Python pipeline)	Python v3 agent scored 36.2% — worse than stock Claude Code (58%). Real binary uses production-tested prompt builder.
Prompt strategy	LocalMode problem-solving prompt	No restrictive PR constraints. Test-first: "check /tests/test_outputs.py before anything else."
Quality gates	pytest on test_outputs.py, retry 2x	Game-changer for tasks where first attempt is close but not exact (chess, data transforms).
Heartbeat timeout	15min (configurable in v2.76+)	Complex tasks (image analysis, large codebases) need more than 5min between events.
Container setup	Root, bypassPermissions	Non-root caused 8 task failures (pip site-packages invisible, apt denied).

Validation History

Run	break-filter	chess	gcode	Score	Key Change
val2-3	1.0	0.0	0.0	33%	Baseline — verifier broken
val4	1.0	1.0	0.0	67%	Quality gates + retry
val5	1.0	0.0	0.0	33%	Quality gate PATH broken
val8	err	1.0	0.0	50%	Sandbox deleted
val9	1.0	0.0	0.0	33%	Navigator prompt hijack
val10	1.0	1.0	1.0	100%	LocalMode priority + test-first prompt

Score Targets

Agent	Score	Notes
Pilot Python v3	36.2%	5-step pipeline, 9KB prompt bloat
Claude Code (stock)	58.0%	No optimizations
KIRA (best Claude agent)	74.7%	Opus 4.6, 3 tools
Forge Code (#1)	78.4%	Gemini 3.1 Pro
Pilot (target)	≥58%	Must beat stock Claude Code

Quick Start

# Prerequisites
pip install harbor-ai
source .env  # DAYTONA_API_KEY, DAYTONA_BASE_URL, CLAUDE_CODE_OAUTH_TOKEN

# Build the binary (from project root)
make bench-binary

# 3-task validation
harbor run --job-name pilot-val -o jobs \
  -d [email protected] --agent-import-path "pilot_agent:PilotAgent" \
  -m "anthropic/claude-opus-4-6" -e daytona -n 1 --timeout-multiplier 5.0 \
  --ae "CLAUDE_CODE_OAUTH_TOKEN=$CLAUDE_CODE_OAUTH_TOKEN" \
  -t chess-best-move -t break-filter-js-from-html -t gcode-to-text

# Full 89-task run
harbor run --job-name pilot-full -o jobs \
  -d [email protected] --agent-import-path "pilot_agent:PilotAgent" \
  -m "anthropic/claude-opus-4-6" -e daytona -n 1 --timeout-multiplier 5.0 \
  --ae "CLAUDE_CODE_OAUTH_TOKEN=$CLAUDE_CODE_OAUTH_TOKEN"

# Check results
python3 pilot_agent/scripts/analyze-results.py jobs/<job-name>

File Structure

pilot-bench/
├── pilot_agent/
│   ├── __init__.py                        # Exports PilotAgent
│   ├── agent.py                           # PilotAgent — install, run, parse results
│   ├── templates/
│   │   └── install-pilot-agent.sh.j2      # Container bootstrap (Node, Claude, uv)
│   └── scripts/
│       └── analyze-results.py             # Post-run failure analysis
├── bin/
│   └── pilot-linux-amd64                  # Pre-built static binary (19MB)
├── pyproject.toml                         # Python package config
├── WORKLOG.md                             # Development log with all validation runs
├── CHANGELOG-v4.md                        # Historical: Python v4 optimization notes
└── README.md                              # This file

Container Config

The agent writes a production-grade ~/.pilot/config.yaml in the container:

Quality gates: pytest test_outputs.py with 2 retries on failure
Effort routing: haiku classifier → effort level mapping
Model routing: all tiers set to the run model (single-model bench)
Hooks: run_tests_on_stop, block_destructive
Retry: rate limit (3x, 30s backoff), API error (3x, 5s), timeout (2x, extend 1.5x)
Intent judge: haiku (requires ANTHROPIC_API_KEY)

Findings → Production

Bench findings that became real Pilot improvements:

Finding	Issue	Status
LocalMode prompt priority bug	GH-2103	Merged
Configurable HeartbeatTimeout	GH-2104	Merged
Heartbeat cancel on result event	GH-2103	Merged

Known Limitations

Sequential only (n=1) — Daytona free tier: 10GiB, ~4GiB per sandbox
DNS unreliable in Daytona sandboxes — mitigated with 8.8.8.8 + uv pre-install
Intent judge needs ANTHROPIC_API_KEY (bench uses OAuth token only)
Full run takes 12-20h at n=1
Cost: ~~$1-1.30 per task (~~$90-115 for full 89-task run)

Name		Name	Last commit message	Last commit date
parent directory ..
aws		aws
pilot_agent		pilot_agent
.gitignore		.gitignore
CHANGELOG-v4.md		CHANGELOG-v4.md
README.md		README.md
WORKLOG.md		WORKLOG.md
pyproject.toml		pyproject.toml
refresh-token.sh		refresh-token.sh
run-benchmark.sh		run-benchmark.sh
run.sh		run.sh
submit.sh		submit.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

README.md

Pilot Terminal-Bench Agent

Architecture

Key Design Decisions

Validation History

Score Targets

Quick Start

File Structure

Container Config

Findings → Production

Known Limitations

Uh oh!

FilesExpand file tree

pilot-bench

Directory actions

More options

Directory actions

More options

Latest commit

History

pilot-bench

Folders and files

parent directory

README.md

Pilot Terminal-Bench Agent

Architecture

Key Design Decisions

Validation History

Score Targets

Quick Start

File Structure

Container Config

Findings → Production

Known Limitations