This benchmark harness compares AIDD CLI stacks on a fixed set of frequent tasks.
- Compare each CLI in its native-best stack
- Compare overlapping subsets after preflight model resolution
- Keep tasks repeatable by running against disposable fixture copies
Fixtures:

- `preflight` - minimal scratch project for model resolution checks
- `interview` - one-question interview fixture
- `audit` - seeded audit target with obvious smells
- `remediation` - seeded bugfix fixture with a deterministic post-run check
- `validate` - partial project used for control and validate tasks
Runs write machine-readable artifacts under benchmarks/results/:
- `session.json` - session metadata and preflight outcomes
- `runs.jsonl` - one JSON object per replicate
- `leaderboard.json` - aggregated native-best and cohort summaries
- `leaderboard.csv` - flat export of aggregated task scores
- `report.md` - human-readable summary
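As a sketch of how a consumer might fold `runs.jsonl` into per-task mean scores (the field names `stack`, `task`, and `score` are assumptions for illustration, not the harness's documented schema):

```javascript
// Aggregate replicate scores per (stack, task) pair from JSONL text.
// Field names (stack, task, score) are assumed, not guaranteed to
// match the harness's actual schema.
function aggregateRuns(jsonlText) {
  const sums = new Map();
  for (const line of jsonlText.split("\n")) {
    if (!line.trim()) continue; // tolerate trailing blank lines
    const run = JSON.parse(line);
    const key = `${run.stack}/${run.task}`;
    const entry = sums.get(key) ?? { total: 0, n: 0 };
    entry.total += run.score;
    entry.n += 1;
    sums.set(key, entry);
  }
  // Reduce to mean score per key.
  return new Map([...sums].map(([k, { total, n }]) => [k, total / n]));
}
```

Because each replicate is a self-contained JSON object on its own line, partial runs still leave `runs.jsonl` parseable up to the last completed replicate.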
Disposable workspaces are created under benchmarks/workspaces/.
Run the harness with:

`node ./tools/run-benchmark.mjs --manifest ./benchmarks/manifest.json`

Useful modes:
- `--dry-run` - validate the manifest and print the planned run matrix
- `--report-only` - rebuild leaderboard/report files from existing `runs.jsonl`
- `--stack <label>` - limit execution to one or more stack labels
- `--task <id>` - limit execution to one or more task IDs
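The `--stack`/`--task` filters amount to pruning a stack-by-task cross product before execution; a sketch of that planned run matrix (the `replicates` option and the plan entry shape are assumptions, not the tool's documented output):

```javascript
// Build the planned run matrix: every (stack, task) pair, optionally
// filtered by allow-lists, repeated `replicates` times. This mirrors
// what --dry-run would print, under assumed names.
function planMatrix(stacks, tasks, { onlyStacks, onlyTasks, replicates = 1 } = {}) {
  const plan = [];
  for (const stack of stacks) {
    if (onlyStacks && !onlyStacks.includes(stack)) continue;
    for (const task of tasks) {
      if (onlyTasks && !onlyTasks.includes(task)) continue;
      for (let rep = 0; rep < replicates; rep++) {
        plan.push({ stack, task, rep });
      }
    }
  }
  return plan;
}
```

Passing both filters narrows the matrix multiplicatively, which is why a dry run is useful for confirming the replicate count before a long benchmark.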
- A single universal model does not exist across all five AIDD adapters.
- Fairness cohorts are computed after preflight and may exclude stacks that cannot resolve a target model family.
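Computing a fairness cohort is effectively a membership filter over preflight outcomes; a sketch under the assumption that preflight yields, per stack label, the set of model families it resolved (the shapes and names here are ours):

```javascript
// Given preflight outcomes mapping stack label -> model families it
// resolved, return the stacks eligible for a given family's cohort.
// Stacks that could not resolve the family are excluded, matching the
// "may exclude stacks" caveat above.
function cohortFor(preflight, family) {
  return Object.entries(preflight)
    .filter(([, families]) => families.includes(family))
    .map(([stack]) => stack);
}
```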