Agentic optimization loop with high watermarking. Ratchet uses Claude to iteratively improve a single file (the "lever") against a scorer, accepting only changes that beat the current high watermark. Like a ratchet wrench — it only moves forward.
┌─────────┐ ┌───────┐ ┌────────┐ ┌──────────┐
│ Agent │────▶│ Lever │────▶│ Scorer │────▶│ Watermark│
│ (Claude) │ │ (file)│ │ (sh) │ │ check │
└─────────┘ └───────┘ └────────┘ └──────────┘
▲ │
│ ✓ kept (score > watermark) │
└────────────────────────────────────────────┘
✗ discarded (rollback)
Each iteration, Claude proposes a single targeted change to the lever file. The scorer evaluates the result and produces a numeric score. If the score exceeds the current watermark by at least --min-delta, the change is kept and the watermark ratchets up. Otherwise, the change is rolled back. Progress is logged and snapshots are saved for every accepted iteration.
- Bun v1.0+
- An Anthropic API key
git clone [email protected]:hackerrdave/ratchet.git
cd ratchet
bun installRun directly with Bun:
bun run src/cli.ts <command>Or compile to a standalone binary:
bun run build
./dist/ratchet <command>bun run build:all
# Produces:
# dist/ratchet-macos (darwin-arm64)
# dist/ratchet-linux (linux-x64)
# dist/ratchet-windows.exe| Variable | Required | Description |
|---|---|---|
ANTHROPIC_API_KEY |
Yes | Your Anthropic API key |
Run ratchet init in your project root to set everything up interactively. It walks you through:
-
Scorer type — how you'll measure improvement:
- Objective — a test suite, build script, latency benchmark, or any command that outputs a score
- Labeled — you label example pairs (match/no-match), Ratchet builds a scorer from them
- Live signal — a metrics endpoint or DB query that returns a live score
- LLM judge — uses a model to evaluate outputs (weakest anchor — scores may drift)
-
Goal — one sentence describing what you're optimizing for
-
Lever — the file path Claude is allowed to modify (e.g.,
prompts/classifier.md) -
Constraints — rules Claude must follow (e.g., "Must remain under 2000 tokens")
-
Context files — additional files Claude should read for background
You can run multiple independent ratchets in the same project using --name:
ratchet init --name classifier
ratchet init --name system-prompt
ratchet init --name config
ratchet start --name classifier
ratchet watch --name system-prompt
ratchet log --name configEach named ratchet gets its own spec, scorer, watermark, and history. Omitting --name uses default.
| Path | Purpose |
|---|---|
ratchet/<name>/RATCHET.md |
Optimization spec — goal, lever, constraints, context |
ratchet/<name>/scorer.sh |
Shell script that outputs a single float (higher = better) |
ratchet/<name>/watermark.txt |
Current high watermark score |
ratchet/<name>/progress.log |
JSONL log of every iteration |
ratchet/<name>/learnings.md |
Tactical learnings extracted from runs |
ratchet/<name>/best/ |
Latest best version of the lever |
ratchet/<name>/snapshots/ |
Snapshot of the lever at each kept iteration |
The generated spec file looks like this:
# Goal
Improve prompt accuracy for product-recall matching
# Lever
The file at prompts/classifier.md is the only thing you may change.
One targeted improvement per iteration. Do not rewrite wholesale.
# Constraints
- Must remain under 2000 tokens
- Must not remove the routing section
# Context
- Read TOPOLOGY.md for codebase structure
- The scorer measures how well the lever achieves the goal aboveYour scorer must output a single float to stdout. Higher is better. Example:
#!/bin/bash
# Run tests and report pass rate
TOTAL=$(bun test 2>&1 | grep -oP '\d+ pass' | grep -oP '\d+')
FAILED=$(bun test 2>&1 | grep -oP '\d+ fail' | grep -oP '\d+')
echo "scale=4; $TOTAL / ($TOTAL + $FAILED)" | bcAll commands accept --name <name> to target a specific ratchet. Defaults to default.
# Run 20 iterations with defaults
ratchet start
# Target a specific ratchet
ratchet start --name classifier
# Customize the run
ratchet start --name classifier --iterations 50 --min-delta 0.01 --model claude-haiku-4-5-20251001| Option | Default | Description |
|---|---|---|
--name <name> |
default |
Which ratchet to run |
-n, --iterations <n> |
20 |
Number of iterations to run |
--min-delta <delta> |
0.001 |
Minimum score improvement to accept a change |
--model <model> |
claude-haiku-4-5-20251001 |
Claude model to use |
--schedule <cron> |
— | Cron schedule for recurring runs (coming soon) |
Output looks like:
Starting ratchet loop (classifier)
Lever: prompts/classifier.md
Model: claude-haiku-4-5-20251001
Iterations: 20
Min delta: 0.001
Current watermark: 0.72
[1/20] Running agent... scoring... ✓ kept score=0.7400 (+0.0200) — Added few-shot example for edge case
[2/20] Running agent... scoring... ✗ disc score=0.7350 (-0.0050) — Restructured output format
[3/20] Running agent... scoring... ✓ kept score=0.7650 (+0.0250) — Clarified matching criteria
...
Done. 8 kept, 12 discarded. Final watermark: 0.8420
ratchet watch --name classifierDisplays a live staircase chart in the terminal showing score progression, watermark level, and accept/discard history. Refreshes every 2 seconds.
ratchet log --name classifier # Status Score Delta Timestamp Summary
────────────────────────────────────────────────────────────────────────────
1 ✓ kept 0.7400 +0.0200 2026-03-10T12:00:00.000Z Added few-shot example
2 ✗ disc 0.7350 -0.0050 2026-03-10T12:01:30.000Z Restructured output format
3 ✓ kept 0.7650 +0.0250 2026-03-10T12:03:00.000Z Clarified matching criteria
3 iterations, 2 kept, 1 discarded
# Diff the lever between two kept iterations
ratchet diff 1 3 --name classifierShows a unified diff of the lever file between any two snapshots.
ratchet show 3 --name classifier# Roll the lever back to iteration 3's state
ratchet checkout 3 --name classifierAfter each run, Ratchet extracts tactical learnings from the progress log — what worked, what didn't, and why. These are fed back to the agent on subsequent runs.
ratchet learnings --name classifier## What Works
- **Few-shot examples outperform explicit rules** — iterations 1,3 tried verbose instructions
and scored lower; iteration 4 added a single worked example and jumped +0.0234
## What Doesn't
- **Explicit classification criteria degrade accuracy** — prescriptive rules overconstrain the model
## Tactics
- Use 2-3 worked examples covering diverse edge cases
- Preserve JSON-only output constraint in all changesLearnings are stored in ratchet/<name>/learnings.md and updated after each run.
# Pause a running optimization (checked between iterations)
ratchet pause --name classifier
# Resume where you left off
ratchet resume --name classifier# 1. Set up your project
mkdir my-project && cd my-project
echo "Classify the product..." > prompts/classifier.md
# 2. Write a scorer
cat > scorer.sh << 'EOF'
#!/bin/bash
python3 eval.py prompts/classifier.md | tail -1
EOF
chmod +x scorer.sh
# 3. Initialize ratchet
export ANTHROPIC_API_KEY="sk-ant-..."
ratchet init --name classifier
# → Select "Objective" scorer type
# → Goal: "Improve classification accuracy"
# → Lever: prompts/classifier.md
# 4. Run the optimization loop
ratchet start --name classifier --iterations 30
# 5. Watch it climb
ratchet watch --name classifieryour-project/
├── prompts/
│ └── classifier.md # The lever (file being optimized)
└── ratchet/
├── classifier/
│ ├── RATCHET.md # Optimization spec
│ ├── scorer.sh # Scoring script
│ ├── watermark.txt # Current high watermark
│ ├── progress.log # JSONL iteration history
│ ├── learnings.md # Tactical learnings from runs
│ ├── best/
│ │ └── classifier.md # Best lever state so far
│ └── snapshots/
│ ├── 1/
│ ├── 3/
│ └── ... # One per kept iteration
└── system-prompt/
├── RATCHET.md
├── scorer.sh
└── ... # Another independent ratchet
MIT