This benchmark harness compares AIDD CLI stacks on a fixed set of frequent tasks.
- Compare each CLI in its native-best stack
- Compare overlapping subsets after preflight model resolution
- Keep tasks repeatable by running against disposable fixture copies
Fixtures:

- `preflight` - minimal scratch project for model resolution checks
- `interview` - one-question interview fixture
- `audit` - seeded audit target with obvious smells
- `remediation` - seeded bugfix fixture with a deterministic post-run check
- `validate` - partial project used for control and validate tasks
Runs write machine-readable artifacts under benchmarks/results/:
- `session.json` - session metadata and preflight outcomes
- `runs.jsonl` - one JSON object per replicate
- `leaderboard.json` - aggregated native-best and cohort summaries
- `leaderboard.csv` - flat export of aggregated task scores
- `report.md` - human-readable summary
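As a sketch of how a consumer might fold `runs.jsonl` into per-task mean scores (the field names `stack`, `task`, and `score` are assumptions for illustration, not the harness's documented schema):

```javascript
// Aggregate replicate scores per (stack, task) pair from JSONL text.
// Field names (stack, task, score) are assumed, not guaranteed to
// match the harness's actual schema.
function aggregateRuns(jsonlText) {
  const sums = new Map();
  for (const line of jsonlText.split("\n")) {
    if (!line.trim()) continue; // tolerate trailing blank lines
    const run = JSON.parse(line);
    const key = `${run.stack}/${run.task}`;
    const entry = sums.get(key) ?? { total: 0, n: 0 };
    entry.total += run.score;
    entry.n += 1;
    sums.set(key, entry);
  }
  // Reduce to mean score per key.
  return new Map([...sums].map(([k, { total, n }]) => [k, total / n]));
}
```

Because each replicate is a self-contained JSON object on its own line, partial runs still leave `runs.jsonl` parseable up to the last completed replicate.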
Disposable workspaces are created under benchmarks/workspaces/.
Run the harness with:

`node ./tools/run-benchmark.mjs --manifest ./benchmarks/manifest.json`

Useful modes:
- `--dry-run` - validate the manifest and print the planned run matrix
- `--report-only` - rebuild leaderboard/report files from existing `runs.jsonl`
- `--stack <label>` - limit execution to one or more stack labels
- `--task <id>` - limit execution to one or more task IDs
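The `--stack`/`--task` filters amount to pruning a stack-by-task cross product before execution; a sketch of that planned run matrix (the `replicates` option and the plan entry shape are assumptions, not the tool's documented output):

```javascript
// Build the planned run matrix: every (stack, task) pair, optionally
// filtered by allow-lists, repeated `replicates` times. This mirrors
// what --dry-run would print, under assumed names.
function planMatrix(stacks, tasks, { onlyStacks, onlyTasks, replicates = 1 } = {}) {
  const plan = [];
  for (const stack of stacks) {
    if (onlyStacks && !onlyStacks.includes(stack)) continue;
    for (const task of tasks) {
      if (onlyTasks && !onlyTasks.includes(task)) continue;
      for (let rep = 0; rep < replicates; rep++) {
        plan.push({ stack, task, rep });
      }
    }
  }
  return plan;
}
```

Passing both filters narrows the matrix multiplicatively, which is why a dry run is useful for confirming the replicate count before a long benchmark.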
- A single universal model does not exist across all five AIDD adapters.
- Fairness cohorts are computed after preflight and may exclude stacks that cannot resolve a target model family.
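Computing a fairness cohort is effectively a membership filter over preflight outcomes; a sketch under the assumption that preflight yields, per stack label, the set of model families it resolved (the shapes and names here are ours):

```javascript
// Given preflight outcomes mapping stack label -> model families it
// resolved, return the stacks eligible for a given family's cohort.
// Stacks that could not resolve the family are excluded, matching the
// "may exclude stacks" caveat above.
function cohortFor(preflight, family) {
  return Object.entries(preflight)
    .filter(([, families]) => families.includes(family))
    .map(([stack]) => stack);
}
```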