A coding benchmark for evaluating LLM agents on multi-phase programming tasks. Tests three critical capabilities:
- Hidden requirement discovery — inferring undisclosed constraints from structured feedback
- Long-context retention — maintaining state and hypotheses across many iterations
- Iterative refinement — systematically improving solutions based on violation signals
+-----------------------------------------------------------+
| SaotriBench Flow |
+-----------------------------------------------------------+
| |
| +-------+ +------------+ +--------+ +-----------+ |
| | Agent |-->| solution.py|-->| Runner |-->| Evaluator | |
| +-------+ +------------+ +--------+ +-----------+ |
| ^ | | |
| | | v |
| | +-------------+ | +----------+ |
| +-------|feedback.json|<-----+ |Test Cases| |
| +-------------+ | (hidden) | |
| | +----------+ |
| v |
| +-------------+ |
| | Violations | |
| | + Coverage | |
| +-------------+ |
| |
+-----------------------------------------------------------+
| Phase 0 Phase 1 Phase 2 ... Phase N |
| +------+ +------+ +------+ +------+ |
| | Rule | | Rule | | Rule | | Rule | |
| | A | + | A | + | A | + ... + | A | |
| +------+ | B | | B | | B | |
| +------+ | C | | ... | |
| +------+ | Z | |
| +------+ |
| <------- Rules accumulate across phases -----------> |
+-----------------------------------------------------------+
The agent receives only minimal initial information (input/output types, basic problem description). The actual correctness constraints are not fully disclosed — the agent must infer them from structured feedback on failed attempts.
Key properties:
- Agent starts with incomplete specification
- Hidden constraints are revealed indirectly through violation feedback
- Each phase introduces new undisclosed requirements that break the previous solution
- Success requires systematic exploration, not just code generation
Install with:

```bash
pip install -e .
```

After installation, the `saotri-bench` command becomes available. Alternatively, you can run without installing:

```bash
python -m saotri_bench.cli <command>
```

| Task | Difficulty | Phases | Description |
|---|---|---|---|
| `task_00_fizzbuzz` | Easy | 3 | FizzBuzz with hidden divisor rules (%7, combinations) |
| `task_01_transform_list` | Easy | 3 | List transformation with evolving number handling |
| `task_02_merge_dicts` | Easy | 4 | Dict merge with type-aware conflict resolution |
| `task_03_validate_brackets` | Medium | 5 | Bracket validation with changing contract (bool → exception) |
| `task_04_sort_objects` | Medium | 6 | Object sorting with evolving key format and edge cases |
| `task_05_text_processor` | Medium | 7 | Text processing with unicode, quoting, and escape handling |
| `task_06_cache_eviction` | Medium | 8 | LRU cache with TTL, priority, and dirty-write tracking |
| `task_07_expression_parser` | Medium | 9 | Math expression parser with variables, implicit multiply, right-associativity |
| `task_08_access_control` | Medium | 10 | RBAC system with ownership, deny-priority, and role inheritance |
| `task_09_schedule_optimizer` | Hard | 12 | Task scheduler with dependencies, parallelism, and resource constraints |
| `task_10_data_pipeline` | Hard | 12 | Data transformation pipeline with filtering, aggregation, and joins |
| `task_11_version_resolver` | Hard | 15 | Semver dependency resolver with ranges, transitive deps, and conflicts |
> All examples below use `saotri-bench` (requires `pip install -e .`). Without installing, replace `saotri-bench` with `python -m saotri_bench.cli`.
```bash
saotri-bench list --tasks-dir tasks
saotri-bench validate --task tasks/task_00_fizzbuzz
saotri-bench run --task tasks/task_00_fizzbuzz --workspace ./workspace --poll-interval 2
saotri-bench run --task tasks/task_00_fizzbuzz --workspace ./workspace --single
```

1. The runner starts and creates the workspace with `problem.md`, `task.json`, `phase.json`, and an empty `solution.py`
2. The agent reads `problem.md` to understand the task and `phase.json` to see the current rules
3. The agent writes its solution to `workspace/solution.py`
4. The runner detects the file change and evaluates the solution
5. The runner writes structured feedback to `workspace/feedback.json`
6. The agent reads the feedback, identifies violations, and refines its solution
7. Steps 3–6 repeat until all phases pass or attempt limits are reached (a programmatic sketch of this loop follows)
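Below is a minimal sketch of the agent side of this loop, using only the workspace files described above. `improve_solution` is a hypothetical hook where an LLM call (or a human edit) would go; the real protocol may differ in detail.

```python
import json
import time
from pathlib import Path

WORKSPACE = Path("./workspace")


def improve_solution(problem: str, feedback: dict | None) -> str:
    """Hypothetical hook: prompt an LLM with the problem description and the
    latest structured feedback, and return new solution source code."""
    raise NotImplementedError


def agent_loop(poll_interval: float = 2.0) -> None:
    problem = (WORKSPACE / "problem.md").read_text()
    feedback = None
    last_attempt = -1
    while True:
        # Step 3: write the next attempt; the runner detects the change.
        (WORKSPACE / "solution.py").write_text(improve_solution(problem, feedback))
        # Steps 4-5: poll until the runner emits feedback for this attempt.
        while True:
            time.sleep(poll_interval)
            fb_path = WORKSPACE / "feedback.json"
            if not fb_path.exists():
                continue
            candidate = json.loads(fb_path.read_text())
            if candidate.get("attempt_id", -1) > last_attempt:
                feedback, last_attempt = candidate, candidate["attempt_id"]
                break
        # Step 6: inspect the result; a "valid" status means the phase passed.
        print(feedback["status"], feedback.get("violations", []))
```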
You can simulate an agent by manually editing `workspace/solution.py` while the runner is active:

```bash
# Terminal 1: Start the runner
saotri-bench run --task tasks/task_00_fizzbuzz --workspace ./workspace --poll-interval 2

# Terminal 2: Write your solution
# Edit workspace/solution.py with your code
# The runner will auto-detect changes and evaluate
```

Stopping the runner:

- Type `q` + Enter in the runner terminal
- Or press `Ctrl+C`
**Phase 0** — You read `problem.md` and write classic FizzBuzz:

```python
def fizzbuzz(n):
    if n % 15 == 0: return "FizzBuzz"
    if n % 3 == 0: return "Fizz"
    if n % 5 == 0: return "Buzz"
    return str(n)
```

Feedback: `status: "valid"` — Phase 0 passes, and the runner advances to Phase 1.
**Phase 1** — The runner implicitly re-evaluates your existing solution, and the feedback shows violations:

```json
{"rule_id": "correct_output", "scope": "divisible_by_7", "count": 3}
```

You infer that there's a hidden rule for multiples of 7 and add "Bazz" handling:
```python
def fizzbuzz(n):
    result = ""
    if n % 3 == 0: result += "Fizz"
    if n % 5 == 0: result += "Buzz"
    if n % 7 == 0: result += "Bazz"
    return result if result else str(n)
```

Phase 1 passes. Phase 2 reveals combination violations (`divisible_by_21`, `divisible_by_35`, `divisible_by_105`). Your Phase 1 solution already handles these — Phase 2 passes too. Task complete!
When running a task, the runner creates a workspace directory with:
| File | Description |
|---|---|
| `problem.md` | Problem description (agent-visible) |
| `task.json` | Task metadata, interface, and limits |
| `phase.json` | Current phase info, rules, and previous feedback |
| `solution.py` | Agent writes solution here (runner watches for changes) |
| `feedback.json` | Evaluation feedback after each attempt |
| `report.json` | Final metrics report (written on session end) |

> **Note:** The `workspace/` directory is gitignored. Each run starts fresh — delete old workspace files before starting a new session if needed.
Each evaluation returns structured JSON feedback:
```json
{
  "phase_id": 1,
  "attempt_id": 5,
  "status": "partially_valid",
  "status_reason": "Fails checks: no_mutation",
  "violations": [
    {"rule_id": "no_mutation", "scope": "direct", "count": 2}
  ],
  "summary": {
    "rules_total": 3,
    "rules_passed": 2,
    "rules_failed": 1,
    "coverage": 0.85
  },
  "delta": {
    "coverage_change": 0.15,
    "new_failures": [],
    "fixed_failures": ["correct_output"]
  }
}
```

Status values:

- `valid` — all rules pass, phase advances
- `partially_valid` — some rules pass, some fail
- `invalid` — no rules pass
- `error` — code failed to execute (syntax error, timeout, import violation)
Violation scopes hint at what went wrong without revealing test cases. For example, `"scope": "divisible_by_7"` tells the agent that something related to sevens is failing.
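To make this feedback actionable, an agent can triage the payload before editing any code. A small sketch, using only the fields shown in the example above:

```python
import json
from pathlib import Path


def triage(feedback_path: str = "workspace/feedback.json") -> None:
    fb = json.loads(Path(feedback_path).read_text())
    if fb["status"] == "valid":
        print("Phase passed, nothing to fix.")
        return
    # New failures flagged in the delta usually indicate a regression on a
    # rule that passed before the last edit; fix those first.
    for rule in fb.get("delta", {}).get("new_failures", []):
        print(f"regression: {rule}")
    # Then work through violations by descending count, so the scope with
    # the most failing hidden test cases is addressed first.
    for v in sorted(fb.get("violations", []), key=lambda v: -v["count"]):
        print(f"{v['rule_id']} [{v['scope']}]: {v['count']} failing case(s)")
```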
Each task is a directory with 4 files:
```
tasks/task_00_fizzbuzz/
├── task.yaml      # Task metadata, phases, rules, limits
├── problem.md     # Agent-visible problem description
├── evaluator.py   # Evaluation logic (check_* methods)
└── tests.py       # Test cases with expected values (hidden from agent)
```
To create a new task:

- Create a new directory under `tasks/` (convention: `task_XX_name`)
- Define `task.yaml` with phases and rules
- Write `problem.md` (what the agent sees — keep it minimal for harder tasks)
- Implement `evaluator.py` with a class `Evaluator(BaseEvaluator)` and `check_{rule_id}` methods
- Create `tests.py` with a `TEST_CASES` list of `TestCase` objects (see the sketch after this list)
- Validate with `saotri-bench validate --task tasks/your_task`
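The sketch below shows the shape of the evaluator and test files. The real `TestCase` and `BaseEvaluator` come from the benchmark package; minimal stand-ins are used here rather than guessing import paths, so the exact signatures may differ.

```python
from dataclasses import dataclass


@dataclass
class TestCase:  # stand-in; the real class may carry additional fields
    input: object
    expected: object
    phase: int
    scope: str


# tests.py: hidden from the agent, one case per (phase, scope) of interest.
TEST_CASES = [
    TestCase(input=3, expected="Fizz", phase=0, scope="divisible_by_3"),
    TestCase(input=7, expected="Bazz", phase=1, scope="divisible_by_7"),
]


class Evaluator:  # the real class subclasses BaseEvaluator
    # One check_{rule_id} method per rule id declared in task.yaml.
    def check_correct_output(self, solution, case: TestCase) -> bool:
        return solution(case.input) == case.expected

    def check_correct_type(self, solution, case: TestCase) -> bool:
        return isinstance(solution(case.input), str)
```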
Design principles:

- Each phase must break the previous solution — a naive solution passing Phase N should fail on Phase N+1
- Violation scopes are hints, not answers — they tell the agent what area failed, not what the answer is
- Expected values must be consistent across all phases — the evaluator runs ALL prior-phase tests on the current solution (see the sketch after this list)
- Easy tasks include examples in `problem.md`; harder tasks give only the function signature
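The third principle implies a cumulative evaluation loop. A sketch of the idea, reusing the stand-in types above (the real evaluator likely maps rules to scopes more precisely):

```python
def evaluate_phase(solution, evaluator, test_cases, current_phase: int) -> list[dict]:
    """Run every test case from phases 0..current_phase, so a solution that
    regresses on an earlier rule fails even if the new rules pass."""
    counts: dict[tuple[str, str], int] = {}
    for case in test_cases:
        if case.phase > current_phase:
            continue  # later phases stay hidden from the agent
        for name in dir(evaluator):
            if name.startswith("check_") and not getattr(evaluator, name)(solution, case):
                key = (name.removeprefix("check_"), case.scope)
                counts[key] = counts.get(key, 0) + 1
    return [
        {"rule_id": rule, "scope": scope, "count": n}
        for (rule, scope), n in counts.items()
    ]
```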
id: "task_00_fizzbuzz"
name: "FizzBuzz Extended"
description: "Implement FizzBuzz with evolving divisor rules"
difficulty: "easy"
interface:
function_name: "fizzbuzz"
signature: "def fizzbuzz(n: int) -> str"
allowed_imports: []
execution:
timeout_seconds: 10
phases:
- id: 0
description: "Classic FizzBuzz"
rules:
- id: "correct_output"
description: "Output matches expected string"
scopes: ["divisible_by_3", "divisible_by_5", "divisible_by_15", "plain_number"]
- id: 1
description: "New divisor rule"
rules:
- id: "correct_output"
description: "Output matches expected string"
scopes: ["divisible_by_3", "divisible_by_5", "divisible_by_15", "plain_number", "divisible_by_7"]
- id: "correct_type"
description: "Return value must be a string"
scopes: ["type_check"]
limits:
max_attempts_per_phase: 5
max_total_attempts: 15Solutions run in a sandboxed environment:
- Restricted imports — only explicitly allowed modules can be imported
- Restricted builtins — dangerous functions (
eval,exec,open,__import__) are blocked or controlled - Timeout enforcement — code execution is killed after the configured timeout
- Input immutability — evaluators use deep copies to prevent test case corruption
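One common way to implement the first two restrictions in Python is to execute the solution source with a curated builtins table. The sketch below shows the general technique; it is not necessarily how SaotriBench implements it:

```python
import builtins

ALLOWED_IMPORTS = {"math"}  # would mirror allowed_imports from task.yaml
BLOCKED_BUILTINS = {"eval", "exec", "open", "__import__"}


def guarded_import(name, *args, **kwargs):
    # Reject any module outside the task's allow-list (submodules included).
    if name.split(".")[0] not in ALLOWED_IMPORTS:
        raise ImportError(f"import of '{name}' is not allowed")
    return __import__(name, *args, **kwargs)


def run_sandboxed(source: str) -> dict:
    """Execute solution source with restricted builtins and return its globals.
    Timeout enforcement would wrap this call (e.g., a killable subprocess)."""
    safe = {k: v for k, v in vars(builtins).items() if k not in BLOCKED_BUILTINS}
    safe["__import__"] = guarded_import
    namespace = {"__builtins__": safe}
    exec(source, namespace)  # restrictions apply inside `namespace` only
    return namespace
```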
| Tier | Phases | Description |
|---|---|---|
| Easy | 3–5 | Basic transformations, simple rules |
| Medium | 5–15 | Moderate complexity, multiple interacting rules |
| Hard | 16–30 | Complex algorithms, many edge cases |
| Expert | 31–50 | Deep challenges, extensive hidden states |
```bash
# List all tasks
saotri-bench list [--tasks-dir PATH] [--json]

# Validate a task definition
saotri-bench validate --task PATH

# Run interactively (watch for file changes)
saotri-bench run --task PATH [--workspace PATH] [--agent-id ID] [--poll-interval SEC]

# Run single evaluation
saotri-bench run --task PATH --workspace PATH --single
```

The `agents/` module provides automated benchmarking of LLM models against SaotriBench tasks via OpenRouter.
```bash
# Install dependencies
pip install -r requirements.txt

# Set your OpenRouter API key
cp agents/.env.example agents/.env
# edit agents/.env with your key
```

```bash
# Run all models on all tasks (sequential)
python -m agents.run_benchmark

# Run a specific model tier on a specific task
python -m agents.run_benchmark --tier strong --task task_00_fizzbuzz

# Run selected models
python -m agents.run_benchmark --models claude-opus,gpt,deepseek

# Run models in parallel (up to 4 at a time)
python -m agents.run_benchmark --models claude-opus,gpt,deepseek --parallel 4

# Run all models in parallel
python -m agents.run_benchmark --parallel 5

# List configured models
python -m agents.run_benchmark --list-models
```

| Tier | Model | OpenRouter ID |
|---|---|---|
| `weak` | Gemma 2 9B | `google/gemma-2-9b-it` |
| `medium` | Llama 3.3 70B | `meta-llama/llama-3.3-70b-instruct` |
| `strong` | Claude Sonnet | `anthropic/claude-sonnet-4` |
| `claude-opus` | Claude Opus 4.6 | `anthropic/claude-opus-4.6` |
| `gemini-pro` | Gemini 3 Pro | `google/gemini-3-pro-preview` |
| `kimi` | Kimi K2.5 | `moonshotai/kimi-k2.5` |
| `gpt` | GPT-5.2 Codex | `openai/gpt-5.2-codex` |
| `minimax` | MiniMax M2.5 | `minimax/minimax-m2.5` |
| `glm` | GLM 5 | `z-ai/glm-5` |
| `claude-sonnet` | Claude Sonnet 4.6 | `anthropic/claude-sonnet-4.6` |
| `gemini-3.1` | Gemini 3.1 Pro | `google/gemini-3.1-pro-preview` |
| `deepseek` | DeepSeek V3.2 | `deepseek/deepseek-v3.2` |
| `grok` | Grok 4.1 Fast | `x-ai/grok-4.1-fast` |
| `trinity` | Trinity Large | `arcee-ai/trinity-large-preview` |
12 tasks, 94 total phases. Each model runs all tasks sequentially, with a per-phase limit of 5–8 refinement attempts depending on the task.
| # | Model | Tasks Passed | Phases Completed | Phase % |
|---|---|---|---|---|
| 1 | Claude Opus 4.6 | 6/12 | 53/94 | 56% |
| 2 | GPT-5.2 Codex | 4/12 | 45/94 | 48% |
| 3 | GLM 5 | 4/12 | 41/94 | 44% |
| 4 | MiniMax M2.5 | 4/12 | 40/94 | 43% |
| 5 | Kimi K2.5 | 4/12 | 37/94 | 39% |
| 6 | Gemini 3 Pro | 4/12 | 32/94 | 34% |
| Task | Difficulty | Claude Opus | GPT-5.2 | GLM 5 | MiniMax | Kimi K2.5 | Gemini 3 |
|---|---|---|---|---|---|---|---|
| FizzBuzz | Easy | **3/3** | **3/3** | **3/3** | **3/3** | **3/3** | **3/3** |
| Transform List | Easy | **3/3** | **3/3** | **3/3** | **3/3** | **3/3** | **3/3** |
| Merge Dicts | Easy | **4/4** | 3/4 | 3/4 | 3/4 | 3/4 | **4/4** |
| Validate Brackets | Medium | **5/5** | **5/5** | **5/5** | **5/5** | **5/5** | **5/5** |
| Sort Objects | Medium | **6/6** | 4/6 | **6/6** | 3/6 | 2/6 | 3/6 |
| Text Processor | Medium | **4/7** | 3/7 | 3/7 | 2/7 | 2/7 | 3/7 |
| Cache Eviction | Medium | **4/8** | **4/8** | 2/8 | 2/8 | 2/8 | 2/8 |
| Expression Parser | Medium | **9/9** | **9/9** | 7/9 | **9/9** | **9/9** | 3/9 |
| Access Control | Medium | 2/10 | **4/10** | 3/10 | **4/10** | 3/10 | 2/10 |
| Schedule Optimizer | Hard | **2/12** | **2/12** | 1/12 | **2/12** | 1/12 | 1/12 |
| Data Pipeline | Hard | **3/12** | **3/12** | **3/12** | 2/12 | 2/12 | 2/12 |
| Version Resolver | Hard | **8/15** | 2/15 | 2/15 | 2/15 | 2/15 | 1/15 |
Bold = best result for that task.
- Easy tasks (3–4 phases): All models pass FizzBuzz and Transform List, and every model also clears Validate Brackets (nominally Medium). These serve as baseline sanity checks.
- Medium tasks (5–10 phases): Significant differentiation begins. Expression Parser is the standout — 4 of 6 models achieve 9/9 phases, indicating strong recursive parsing ability across frontier models.
- Hard tasks (12–15 phases): All models struggle. Only Claude Opus 4.6 reaches 8/15 on Version Resolver; all others plateau at 2–3 phases.
- Common failure points: `list_merge` in Merge Dicts (Phase 3), `escape_handling` in Text Processor, `ttl_expiry` in Cache Eviction, `transitive` dependencies in Version Resolver.
- EmptyResponseError is the dominant infrastructure issue — models hit `max_tokens=4096` on complex tasks. See TECHNICAL_ERRORS.md for details.
Reports are saved to `reports/` as JSON files. See TECHNICAL_ERRORS.md for infrastructure errors and BENCHMARK_ANALYSIS.md for quality analysis.

See `agents/README.md` for full documentation.
A real-time web dashboard shows benchmark progress as results come in:
```bash
python serve_dashboard.py
# Opens at http://localhost:8050
```

The dashboard auto-refreshes every 10 seconds and displays:

- Completion Summary — models ranked by pass rate with token/duration stats
- Pass/Fail Matrix — every task × model result at a glance
- By Difficulty — easy/medium/hard breakdown per model

To use a different port:

```bash
python serve_dashboard.py --port 9000
```

The benchmark's stated goal is to determine whether LLM agents build internal models of projects — inferring hidden structure from feedback rather than just pattern-matching. Results from 7 strong-tier models (Feb 2026) suggest:
What the benchmark successfully measures:
- **Hidden requirement discovery** (Phases 0–2 across all tasks): All models demonstrate the ability to read violation feedback and adjust code accordingly. The Phase 0→1 transition reliably differentiates models — weaker models fail to infer the meaning of scopes like `divisible_by_7`, while stronger ones update their mental model of the problem.
- **Iterative refinement under constraints**: The benchmark clearly separates models that can iteratively refine from those that thrash. Claude Opus 4.6 achieves 8/15 on Version Resolver by systematically addressing violations, while most models plateau at 2/15 — unable to maintain all prior constraints while adding new ones.
- **Phase accumulation stress-testing**: The core design — rules accumulate across phases, so Phase N solutions must satisfy all Phase 0..N-1 rules — is effective. This is where models break: they can add new behavior but often regress on previously passing tests.
Where results are less conclusive:
- **Hard tasks may test algorithmic knowledge more than model-building.** Version Resolver Phase 2 (`transitive` dependencies) blocks every model except Claude Opus. This requires implementing a specific graph algorithm, not inferring requirements from feedback. Similarly, Schedule Optimizer Phase 2 (parallelism) requires topological sort knowledge.
- **The `max_tokens` ceiling creates a confound.** 5 of 7 models hit EmptyResponseError on complex tasks, terminating them prematurely. It's impossible to know whether GLM 5 would have passed Expression Parser Phase 8 without the token limit (it achieved 9/9 in one run but 7/9 when constrained). The benchmark may be measuring token efficiency as much as reasoning ability.
- **Medium tasks show ceiling effects.** Expression Parser is fully solved by 4 of 6 models (9/9), yet the task has 9 phases — this suggests the phases aren't calibrated to differentiate top-tier models on this task type.
Relevance to the "internal model" question:
The benchmark provides indirect evidence. Models that score higher demonstrate behavior consistent with maintaining an internal model:
- They don't regress when adding new features (phase accumulation test)
- They infer correct semantics from scope names (`divisible_by_7` → "add Bazz for multiples of 7")
- They maintain architectural coherence across refinements
However, the benchmark cannot definitively prove internal model building vs. sophisticated pattern matching. A model could pass by:
- Memorizing common patterns (FizzBuzz variants, bracket validators)
- Using the violation scope name as a direct hint (the scope name often reveals the solution)
- Applying generic "add a conditional" strategies without deeper understanding
Recommendations for improving benchmark signal:
- Increase `max_tokens` to 8192+ for all models to remove the token ceiling confound
- Add obfuscated scope names — use codes like `violation_A3` instead of `divisible_by_7` to test whether models can infer meaning from test case patterns alone
- Add regression-detection phases — phases that don't add new rules but test whether the model's solution is robust to edge cases of existing rules
- Run multiple trials per model to account for LLM non-determinism (GLM 5 scored 9/9 and 7/9 on the same task across two runs)
SAOTRI is an acronym that captures the core dimensions of the benchmark evaluation model:
| Letter | Stands for | Description |
|---|---|---|
| S | Hidden State | The concealed environment state — infrastructure, load, constraints invisible to the agent |
| A | Actions | The agent's actions, manifested as code patches submitted each attempt |
| O | Observations | What the agent perceives — logs, metrics, error signals, structured feedback |
| T | Non-stationary Transitions | The environment dynamics that shift across phases — rules accumulate, constraints tighten |
| R | Resilience function | The reward signal measuring solution robustness and survival under evolving requirements |
| I | Invariants | The safety and correctness guarantees — state integrity, data safety, behavioral contracts |
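Read formally, the acronym suggests a POMDP-like tuple. One possible formalization, as a sketch (the benchmark itself does not define this notation):

```latex
\[
  \mathcal{M} = (S, A, O, T, R, I), \qquad T_k : S \times A \to \Delta(S)
\]
```

where `S` is the hidden state (undisclosed rules), `A` the space of code patches, `O` the structured-feedback observations, `T_k` a phase-indexed (hence non-stationary) transition kernel, `R` a resilience reward over surviving constraints, and `I` the invariants every accepted solution must preserve.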
Learn more at saotri.com
MIT
