Two experiments that together produce publishable evidence for Guardian's value as a codebase orientation tool for LLM agents.
Already complete: Guardian-Bench 4-metric suite.
Results: F1@5=38.6%, any-hit=93.5%, recall@5=77.4%, context coverage=100%.
Run: guardian benchmark --tasks tests/benchmark/multi-codebase-tasks.jsonl
Does access to guardian_search + guardian_orient help an LLM agent correctly
identify the files and symbols needed to complete a task, compared to raw file
exploration?
Two conditions per task:
| Condition | Tools available |
|---|---|
| Baseline | list_dir, read_file |
| Guardian | list_dir, read_file, guardian_search, guardian_orient |
Model: gemma4:latest (8B, local via ollama — fully reproducible, no API cost)
Tasks: 31 tasks from tests/benchmark/multi-codebase-tasks.jsonl
Each task gives the agent:
- A natural language task description (the
descriptionfield) - The repo root directory
- Tools (condition-dependent)
- A stop condition: declare found files + symbols
What the agent is asked to do:
"You are navigating a codebase. Your task: {description}. Find the files and symbols that implement this. When confident, call
declare_answer(files, symbols). Do not write any code."
This is a navigation-only task — no code execution needed, making it safe to run locally and easy to evaluate objectively against ground truth.
| Metric | Definition |
|---|---|
files_hit@k |
Ground truth files found in declared answer |
symbols_hit@k |
Ground truth symbols found in declared answer |
tool_calls |
Total tool invocations before declare_answer |
tokens_in |
Total input tokens consumed |
tokens_out |
Total output tokens generated |
turns |
Number of agent turns |
first_hit_turn |
Turn at which first correct file was found |
declared_correctly |
Boolean: ≥1 correct file in declared answer |
- Mean files_hit: Guardian vs Baseline
- Mean turns to first hit: Guardian vs Baseline
- Mean total tokens: Guardian vs Baseline (the efficiency claim)
- Pass rate (declared_correctly): Guardian vs Baseline
Papers cite this as: "Using guardian_search, agents found the correct file in X fewer turns and Y% fewer tokens compared to raw file exploration, with Z% higher declaration accuracy."
At what codebase size does guardian's token compression become critical — i.e., at what point does raw file exploration exhaust the context window before finding the answer?
Synthetic scale ladder: Test the same navigation task against repos of increasing size:
- Small: ~20 files (test-fixtures)
- Medium: ~100 files (guardian self)
- Large: ~500 files (ghostwriter, yowi_app)
- Very large: ~2000+ files (fastapi)
For each scale point:
- Measure how many tokens raw exploration consumes before finding the target
- Measure how many tokens guardian orientation consumes
- Record whether gemma4's 128K context window would be exhausted
Key comparison: At what file count does the Baseline agent hit context limits before finding the answer? Guardian's compression advantage grows non-linearly with codebase size.
raw_orient_tokens(n): tokens to find target file via list_dir + read_file at scale nguardian_orient_tokens(n): tokens for guardian_search + guardian_orient at scale ncompression_ratio(n): raw/guardian at each scale pointcontext_exhaustion_threshold: file count at which raw exploration fails
AgentHarness(
model="gemma4:latest", # ollama model
condition="baseline|guardian", # which tools
task=BenchmarkTask, # from multi-codebase-tasks.jsonl
repo_root=str, # absolute path to repo
specs_dir=str, # path to .specs for guardian condition
max_turns=15, # prevent infinite loops
max_tokens=60_000 # context budget
)
Tools available in baseline condition:
list_dir(path)→ directory listingread_file(path)→ file contents (truncated at 3000 chars)
Tools added in guardian condition:
guardian_search(query)→ callsnode dist/cli.js search --queryguardian_orient()→ returns architecture-context.md guardian block
Stop tool (both conditions):
declare_answer(files: list[str], symbols: list[str])→ terminates episode
python experiments/run_experiment.py \
--tasks tests/benchmark/multi-codebase-tasks.jsonl \
--condition baseline # or guardian \
--model gemma4:latest \
--output experiments/results/baseline.jsonl
Runs all 31 tasks, writes per-task results. Takes ~2-4 hours on a Mac M-series for 31 tasks × 2 conditions with gemma4.
Loads both result files, computes all metrics, outputs:
- Paper-ready table (LaTeX + markdown)
- Per-task breakdown CSV
- Statistical significance (Wilcoxon signed-rank on paired tasks)
experiments/
README.md ← this file
agent_harness.py ← core agent loop with tool dispatch
run_experiment.py ← CLI runner for batch execution
analyze.py ← results analysis + table generation
prompts/
system_baseline.txt ← system prompt for baseline condition
system_guardian.txt ← system prompt for guardian condition
results/
baseline.jsonl ← per-task results, baseline condition
guardian.jsonl ← per-task results, guardian condition
analysis.md ← generated comparison report
"Guardian reduces the number of agent turns and tokens required to correctly identify relevant code locations by providing a compressed, pre-indexed orientation layer. On 31 navigation tasks across 6 programming languages, agents equipped with Guardian found correct files in X fewer turns and consumed Y% fewer tokens, while achieving Z% higher declaration accuracy compared to agents using only direct file exploration."
If Z ≥ 10pp and Y ≥ 30%, that's a publishable result at a workshop venue (MSR, ICSE NIER, FSE Ideas). If Z ≥ 20pp, it's a full paper contribution.