Ralph is a harness for AI coding agents. You hand it a feature spec and walk away. It steers the agent with codebase context, verifies the output with structured checks, retries with actionable feedback, and learns from its mistakes across runs.
The problem it solves: AI coding agents are powerful, but they work on a single prompt at a time. If the agent doesn't finish in one shot, you're back to manually re-prompting, checking progress, and deciding what to try next. And even when the agent says "done," there's no guarantee the code actually works. Ralph automates the outer loop - iteration, verification, and improvement - so the agent produces working code, not just code that claims to work.
Most agent wrappers are retry loops: run the agent, check if it's done, retry if not. Ralph applies harness engineering - a combination of feedforward controls (steer the agent before it acts) and feedback sensors (verify after it acts) - to systematically increase confidence in agent output.
```mermaid
flowchart TD
PRD["PRD + Prompt"] --> FF["Phase 0: Feedforward\nModule map, interfaces,\ndependency graph, conventions"]
FF --> Agent["AI Coding Agent\nClaude Code, Codex, or custom"]
Agent --> P1["Phase 1: Mechanical verification\nTests, typecheck, lint,\nscope, secrets, fixtures"]
P1 -->|fail| Retry["Structured retry context\nSource lines + fix hints"]
Retry --> Agent
P1 -->|pass| P2["Phase 2: Second-opinion review\nSeparate agent reviews diff\nagainst acceptance criteria"]
P2 -->|fail| Retry
P2 -->|pass| P3["Phase 3: Contract testing\nTier-by-tier merge +\nintegration tests"]
P3 -->|fail| Retry
P3 -->|pass| Done["Done"]
P1 --> Journal["Evolution Journal\nTrack patterns across runs,\npropose harness improvements"]
P2 --> Journal
```
```shell
uv tool install ralph-cli   # install (requires Python 3.11+, uv)
cd your-project
ralph init .                # scaffold config and prompt templates
ralph prd create            # define what to build
ralph run 25                # let the agent work for up to 25 iterations
```

You need at least one AI coding agent CLI:
| Agent | Install | Models |
|---|---|---|
| Claude Code (recommended) | claude.ai/code | sonnet, opus, haiku |
| OpenAI Codex | github.com/openai/codex | o3, o4-mini |
| Custom | Any command that reads stdin | - |
Before the agent writes a single line, ralph computationally analyzes the codebase and injects structural context into the prompt. No LLM calls, no token cost - pure static analysis:
- Module map - directory tree with file counts and lines of code
- Public interfaces - classes and function signatures extracted via Python's `ast` module
- Dependency graph - internal import relationships between modules
- Active conventions - line length, quote style, type checking mode from pyproject.toml, ruff.toml, .editorconfig
This reduces wasted iterations. The agent knows "this project uses httpx, not requests" before it starts, instead of learning it from a linter failure on iteration 3.
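The public-interface extraction can be sketched with the standard `ast` module. This is a minimal illustration of the idea, not Ralph's actual implementation; the function name is ours:

```python
import ast

def public_interfaces(source: str) -> list[str]:
    """Sketch: collect top-level public signatures for feedforward context."""
    sigs = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and not node.name.startswith("_"):
            args = ", ".join(a.arg for a in node.args.args)
            sigs.append(f"def {node.name}({args})")
        elif isinstance(node, ast.ClassDef) and not node.name.startswith("_"):
            sigs.append(f"class {node.name}")
    return sigs

print(public_interfaces("def fetch(url, timeout): ...\nclass _Hidden: ...\nclass Client: ..."))
# → ['def fetch(url, timeout)', 'class Client']
```

Because this is pure parsing, it costs no tokens and runs in milliseconds even on large codebases.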
When the agent signals completion, ralph doesn't just trust it. Every run goes through mechanical verification:
Phase 1 - Mechanical checks (computational, fast):
- Test suite passes
- Type checker passes
- Linter passes
- No changes outside allowed paths
- No leaked secrets or syntax errors
- Optional: mutation testing
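Mechanically, Phase 1 amounts to running a battery of commands and treating any non-zero exit code as a failure. A minimal sketch; the specific check commands here are assumptions, not Ralph's actual configuration:

```python
import subprocess

# Hypothetical check commands; a real project configures its own.
CHECKS = [
    ("tests", ["uv", "run", "pytest", "-q"]),
    ("typecheck", ["uv", "run", "mypy", "src"]),
    ("lint", ["uv", "run", "ruff", "check", "."]),
]

def run_checks(checks=CHECKS):
    """Run each mechanical check; a non-zero exit code marks it failed."""
    results = []
    for name, cmd in checks:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results.append((name, proc.returncode == 0, proc.stdout + proc.stderr))
    return results
```

The captured output is what gets parsed into structured retry context when a check fails.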
Phase 2 - Second-opinion review (inferential, LLM-based):
- A separate agent reviews the diff against the acceptance criteria
- Modes: `hard` (failures block), `advisory` (warn only), `skip`
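The mode switch reduces to a small gating decision; a sketch with illustrative names:

```python
def review_blocks(mode: str, review_failed: bool) -> bool:
    """Decide whether a failed second-opinion review blocks the run."""
    if mode == "skip":
        return False          # review not run at all
    if mode == "advisory":
        return False          # failures surface as warnings only
    return review_failed      # "hard": failures block

print(review_blocks("advisory", True), review_blocks("hard", True))
# → False True
```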
Phase 3 - Contract testing (for multi-component runs):
- Merges component branches tier-by-tier
- Runs integration tests at each tier
- Bisects to identify which component broke integration
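The bisection step can be sketched as a scan in merge order: add one component at a time and re-run integration tests until they break. This is a simplified illustration; `integration_passes` is a hypothetical hook:

```python
def find_breaker(components, integration_passes):
    """Merge components in tier order; return the first one whose addition breaks integration."""
    merged = []
    for component in components:
        merged.append(component)
        if not integration_passes(merged):
            return component
    return None

# Usage: pretend component "b" breaks integration once merged.
print(find_breaker(["a", "b", "c"], lambda merged: "b" not in merged))
# → b
```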
When verification fails, ralph doesn't dump raw stderr into the retry prompt. It parses tool output into structured failures with file paths, source context, and fix hints:
```text
1. src/api/auth.py:23
   error: Argument 1 to "verify_password" has incompatible type "str | None"
     21 | password = request.form.get("password")
     22 | user = get_user(username)
   > 23 | if verify_password(password, user.password_hash):
     24 | return create_token(user)
   FIX: Add a None check before calling verify_password, or provide a default value.
```
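Building that context boils down to matching tool output against a known error format and pulling the surrounding source lines. A minimal sketch, assuming a mypy-style `path:line: error: message` format; the regex and dict shape are illustrative:

```python
import re

ERROR = re.compile(r"^(?P<path>[^:\n]+):(?P<line>\d+): error: (?P<msg>.+)$", re.MULTILINE)

def parse_failures(tool_output: str, sources: dict[str, str], context: int = 2):
    """Turn raw type-checker output into structured failures with source context."""
    failures = []
    for m in ERROR.finditer(tool_output):
        path, lineno = m["path"], int(m["line"])
        lines = sources.get(path, "").splitlines()
        lo, hi = max(0, lineno - 1 - context), lineno + context
        snippet = [
            f"{'>' if i == lineno else ' '} {i} | {text}"
            for i, text in enumerate(lines[lo:hi], start=lo + 1)
        ]
        failures.append({"path": path, "line": lineno, "message": m["msg"], "context": snippet})
    return failures
```

Structured failures like these make the retry prompt actionable: the agent sees exactly which line failed and why, instead of a wall of stderr.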
After each factory run, ralph records outcomes to an evolution journal. Over multiple runs, it identifies recurring failure patterns and proposes harness improvements.
```mermaid
flowchart LR
Run1["Factory run N"] --> Record["Record outcomes\n.ralph/evolution.jsonl\n.ralph/experiments.tsv"]
Record --> Extract["Extract patterns\nGroup by error signature"]
Extract --> Propose["Generate proposals\n.ralph/proposals/*.md"]
Propose --> Review["Human review"]
Review -->|approve| Apply["Update CLAUDE.md,\npyproject.toml,\nfeedforward config"]
Apply --> Run2["Factory run N+1\nBenefits from\nimproved harness"]
```
```shell
ralph evolve            # analyze recent runs, find patterns
ralph evolve --status   # show experiment trends (retry rate over time)
```

If the agent keeps triggering the same linter rule across components, `ralph evolve` proposes adding a convention to CLAUDE.md. If typecheck failures recur on Optional types, it proposes a mypy config change. Proposals are written as markdown files for human review.
This is the meta-loop: ralph doesn't just retry - it learns what causes failures and updates its own controls to prevent them.
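The pattern-extraction step can be sketched as counting error signatures across JSONL journal entries. The `signature` field and entry shape here are assumptions for illustration:

```python
import json
from collections import Counter

def recurring_patterns(journal_lines, min_frequency=2):
    """Count error signatures across runs; keep those recurring often enough to propose a fix."""
    counts = Counter(json.loads(line)["signature"] for line in journal_lines)
    return {sig: n for sig, n in counts.items() if n >= min_frequency}

journal = [
    '{"run": 1, "signature": "ruff:E501"}',
    '{"run": 2, "signature": "ruff:E501"}',
    '{"run": 2, "signature": "mypy:optional"}',
]
print(recurring_patterns(journal))
# → {'ruff:E501': 2}
```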
For large features, ralph decomposes a spec into independent components and runs them in parallel:
```shell
ralph decompose --spec features.md --project-name myproject
ralph factory --manifest scripts/ralph/manifest.json --max-parallel 4
```

Each component runs in an isolated git worktree with its own PRD. `ralph run` is actually factory mode with a single component - the same verification pipeline runs whether you're building one feature or twenty.
```mermaid
flowchart TD
Spec["Markdown spec"] --> Decompose["ralph decompose\nLLM-driven spec decomposition"]
Decompose --> Manifest["Manifest\nComponent DAG with dependencies"]
Manifest --> Validate["Validate DAG\nTopological sort, cycle detection"]
Validate --> Schedule["Schedule components\nRespect dependency order"]
Schedule --> WT1["Worktree A\nComponent A"]
Schedule --> WT2["Worktree B\nComponent B"]
Schedule --> WT3["Worktree C\nComponent C"]
WT1 --> V1["Phase 0-2\nFeedforward + verify + review"]
WT2 --> V2["Phase 0-2\nFeedforward + verify + review"]
WT3 --> V3["Phase 0-2\nFeedforward + verify + review"]
V1 --> PR1["PR + merge"]
V2 --> PR2["PR + merge"]
V3 --> PR3["PR + merge"]
PR1 --> Contract["Phase 3: Contract testing\nTier-by-tier merge + integration tests"]
PR2 --> Contract
PR3 --> Contract
Contract -->|pass| Done["Done"]
Contract -->|fail| Bisect["Bisect breaker\nIdentify which component\nbroke integration"]
Bisect --> Schedule
```
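The DAG validation and tier scheduling shown above map directly onto the standard library's `graphlib`. A sketch, assuming the manifest reduces to a component-to-dependencies mapping:

```python
from graphlib import TopologicalSorter

def schedule_tiers(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group components into tiers: each tier depends only on earlier tiers.
    prepare() raises CycleError if the manifest's DAG has a cycle."""
    ts = TopologicalSorter(deps)
    ts.prepare()
    tiers = []
    while ts.is_active():
        ready = sorted(ts.get_ready())  # all components whose deps are satisfied
        tiers.append(ready)
        ts.done(*ready)
    return tiers

print(schedule_tiers({"api": {"models"}, "cli": {"api"}, "models": set()}))
# → [['models'], ['api'], ['cli']]
```

Everything within one tier is independent, so those components can run in parallel worktrees; contract testing then merges tier by tier in this order.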
Agent-generated tests can be written to pass trivially. Approved fixtures are human-written input/output pairs that the agent's code must satisfy:
```json
{
  "branchName": "ralph/auth",
  "fixtures": [
    {
      "description": "Login returns token",
      "fixture_type": "cli",
      "input_data": {"command": "curl -s localhost:8000/api/login -d '{\"user\":\"test\"}'"},
      "expected": {"exit_code": 0, "stdout_contains": ["token"]}
    },
    {
      "description": "Config is importable",
      "fixture_type": "function",
      "input_data": {"module": "src.config", "function": "get_settings", "args": []},
      "expected": {"returns": {"debug": false}}
    },
    {
      "description": "Migration file exists",
      "fixture_type": "file",
      "input_data": {"path": "migrations/001_users.sql"},
      "expected": {"exists": true, "contains": ["CREATE TABLE users"]}
    }
  ],
  "userStories": [...]
}
```

Three fixture types: `cli` (run a command, check output), `function` (import and call, check return), `file` (check existence and content). Fixtures run during Phase 1 alongside tests and typecheck. Snapshot regression detects when a previously-passing fixture breaks.
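A minimal runner for the `file` fixture type might look like this; it is an illustrative sketch of the semantics, not Ralph's implementation:

```python
from pathlib import Path

def check_file_fixture(fixture: dict) -> bool:
    """Evaluate a `file` fixture: existence must match, and required substrings must appear."""
    path = Path(fixture["input_data"]["path"])
    expected = fixture["expected"]
    if path.exists() != expected.get("exists", True):
        return False
    if not path.exists():
        return True  # correctly absent; nothing more to check
    text = path.read_text()
    return all(needle in text for needle in expected.get("contains", []))
```

Because the expected values are human-written, the agent cannot satisfy a fixture by weakening its own tests.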
You can just run the agent directly, and for small tasks you should. Ralph is for when you want to:
- Define success criteria before starting - acceptance criteria, golden fixtures, path restrictions - not just "make it work"
- Walk away - Ralph runs unattended with structured verification, not just a completion marker
- Give the agent context - feedforward injection means fewer wasted iterations discovering the codebase
- Get structured retries - parsed failures with source context and fix hints, not raw stderr
- Build multiple components in parallel - factory mode with worktree isolation and contract testing
- Improve over time - the evolution journal tracks patterns so the same mistakes don't keep recurring
- Plan before building - interactive mode stress-tests your spec with an AI PM before any code is written
```text
ralph                         Launch TUI
ralph init [DIR]              Set up Ralph in a project
ralph run [N]                 Run with verification (factory pipeline)
ralph run [N] --no-verify     Run without verification (faster, less safe)
ralph run [N] --legacy        Run with old direct loop (no factory)
ralph understand [N]          Run read-only codebase mapping
ralph feature                 Two-phase: understand then implement
ralph decompose --spec FILE   Decompose spec into component DAG
ralph factory                 Run multi-component factory
ralph evolve                  Analyze runs, propose harness improvements
ralph evolve --status         Show experiment trends
ralph prd create              PRD creation wizard
ralph prd import FILE         Generate PRD from a spec document
ralph prd validate            Check prd.json schema
ralph config show             Print current config
ralph status                  Project overview
```
Ralph uses ralph.toml at the project root:

```toml
[agent]
type = "claude"    # "claude", "codex", or "custom"
model = ""         # model override
command = ""       # shell command for custom agents

[run]
max_iterations = 10
sleep_seconds = 2
interactive = false

[paths]
allowed = []       # restrict which files the agent can change

[git]
branch = ""        # override branch (empty = use PRD)
auto_checkout = true

# Feedforward controls (Phase 0)
[feedforward]
enabled = true
module_map = true              # directory tree with LOC counts
public_interfaces = true       # extract public symbols via ast
dependency_graph = true        # internal import analysis
conventions = true             # extract from pyproject.toml, ruff.toml, etc.
max_context_tokens = 4000      # cap to avoid prompt bloat

# Sensor output optimization
[sensors]
parse_output = true            # structured parsing of test/lint output
include_source_context = true  # include source lines around failures
max_failures_per_check = 10    # cap failures per check in retry context

# Continuous learning
[evolution]
enabled = true
journal_path = ".ralph/evolution.jsonl"
experiments_path = ".ralph/experiments.tsv"
min_pattern_frequency = 2      # pattern must recur N times before proposal
lookback_runs = 10             # how many past runs to analyze
auto_propose = true            # generate proposals after each factory run

# Approved fixtures
[fixtures]
enabled = false                # opt-in
snapshot_on_success = true     # auto-snapshot outputs after verification pass
snapshot_dir = ".ralph/snapshots"
```

Environment variables override ralph.toml: `AGENT_CMD`, `MODEL`, `INTERACTIVE`, `SLEEP_SECONDS`, `ALLOWED_PATHS`, `RALPH_BRANCH`.
The PRD (prd.json) is a list of user stories with testable acceptance criteria:
```json
{
  "branchName": "ralph/login-feature",
  "userStories": [
    {
      "id": "US-001",
      "title": "User can log in with email",
      "acceptanceCriteria": [
        "Login form accepts email and password",
        "Invalid credentials show error message",
        "Tests pass: uv run pytest tests/test_auth.py"
      ],
      "priority": 1,
      "passes": false,
      "notes": ""
    }
  ]
}
```

The agent updates `passes` and `notes` as it works. Ralph reads these between iterations to decide whether to continue. Acceptance criteria should be concrete and testable - commands the agent can run, behavior it can verify.
Optionally, add a fixtures array for behavioral verification (see Approved fixtures).
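The continue/stop decision between iterations reduces to scanning the story statuses; a minimal sketch with an illustrative function name:

```python
def should_continue(prd: dict) -> bool:
    """Keep iterating while any user story in the PRD still has passes == false."""
    return any(not story["passes"] for story in prd["userStories"])

print(should_continue({"userStories": [{"passes": True}, {"passes": False}]}))
# → True
```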
This is what happens inside each component's execution loop:
```mermaid
flowchart TD
subgraph Init["Initialization"]
A1["Load config\ntoml + env vars + CLI flags"] --> A2["Load PRD"]
A2 --> A3["Checkout branch"]
A3 --> A4["Run scaffold\n(if configured)"]
A4 --> A5["Build feedforward context\nModule map, interfaces,\ndependency graph, conventions"]
end
subgraph Iteration["Iteration (repeats up to N times)"]
B1["Build prompt\nfeedforward + retry context + instructions"] --> B2["Run agent\nStream output line by line"]
B2 --> B3{"COMPLETE\nmarker?"}
B3 -->|No| B4["Enforce allowed paths\nRevert out-of-scope changes"]
B4 --> B1
B3 -->|Yes| B5["Phase 1: Mechanical verification\nTests, typecheck, lint, fixtures"]
end
subgraph Verify["Verification"]
B5 -->|fail| B6["Parse failures\nSource context + fix hints"]
B6 --> B1
B5 -->|pass| B7["Phase 2: Review\nSecond-opinion agent"]
B7 -->|fail| B6
B7 -->|pass| B8["Complete"]
end
A5 --> B1
```
```mermaid
flowchart TB
subgraph Input
Spec["Feature spec / PRD"]
Fixtures["Approved fixtures"]
end
subgraph Phase0["Phase 0: Feedforward"]
ModMap["Module map"]
Interfaces["Public interfaces"]
DepGraph["Dependency graph"]
Conventions["Conventions"]
end
subgraph Execution["Agent execution"]
Loop["Agentic loop\n(iterate until COMPLETE)"]
end
subgraph Phase1["Phase 1: Mechanical verification"]
Tests["Test suite"]
Types["Type checker"]
Lint["Linter"]
Scope["Diff scope check"]
Patterns["Bad pattern scan"]
FixtureCheck["Fixture checks"]
end
subgraph Phase2["Phase 2: Review"]
Review["Second-opinion agent\nreviews diff against spec"]
end
subgraph Phase3["Phase 3: Contract testing"]
Contract["Tier-by-tier merge\n+ integration tests"]
end
subgraph Learning["Continuous learning"]
Journal["Evolution journal"]
Experiments["Experiment tracker"]
Proposals["Harness proposals"]
end
Spec --> Phase0
Fixtures --> FixtureCheck
Phase0 --> Loop
Loop --> Phase1
Phase1 -->|pass| Phase2
Phase1 -->|fail| Loop
Phase2 -->|pass| Phase3
Phase2 -->|fail| Loop
Phase3 -->|pass| Done["Done"]
Phase3 -->|fail| Loop
Phase1 --> Journal
Phase2 --> Journal
Journal --> Experiments
Experiments --> Proposals
```
For multi-component factory runs, each component goes through this pipeline independently in parallel git worktrees, with contract testing merging them tier-by-tier after individual verification.
```shell
git clone https://github.com/0xfauzi/ralph-loop.git
cd ralph-loop
uv sync
uv tool install -e .
uv run pytest   # 362 tests
```

MIT