AIDD Benchmark Harness

This benchmark harness compares AIDD CLI stacks on a fixed set of frequently run tasks.

Goals

  • Compare each CLI in its native-best stack
  • Compare overlapping subsets after preflight model resolution
  • Keep tasks repeatable by running against disposable fixture copies

Included Fixtures

  • preflight - minimal scratch project for model resolution checks
  • interview - one-question interview fixture
  • audit - seeded audit target with obvious smells
  • remediation - seeded bugfix fixture with a deterministic post-run check
  • validate - partial project used for control and validate tasks

Outputs

Runs write machine-readable artifacts under benchmarks/results/:

  • session.json - session metadata and preflight outcomes
  • runs.jsonl - one JSON object per replicate
  • leaderboard.json - aggregated native-best and cohort summaries
  • leaderboard.csv - flat export of aggregated task scores
  • report.md - human-readable summary

Disposable workspaces are created under benchmarks/workspaces/.
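Because runs.jsonl holds one JSON object per line, it can be processed a record at a time without loading the whole file. The sketch below shows the idea on in-memory lines; the field names (stack, task, score) are illustrative assumptions, not the harness's actual schema.

```javascript
// Sketch: aggregate a per-stack mean score from JSONL records.
// The field names (stack, task, score) are hypothetical examples.
const lines = [
  '{"stack":"alpha","task":"audit","score":0.8}',
  '{"stack":"alpha","task":"remediation","score":0.6}',
  '{"stack":"beta","task":"audit","score":0.9}',
];

const byStack = new Map();
for (const line of lines) {
  const run = JSON.parse(line); // each line is an independent JSON object
  const agg = byStack.get(run.stack) ?? { total: 0, count: 0 };
  agg.total += run.score;
  agg.count += 1;
  byStack.set(run.stack, agg);
}

for (const [stack, { total, count }] of byStack) {
  console.log(`${stack}: mean score ${(total / count).toFixed(2)}`);
}
```

The same line-by-line pattern applies when streaming the real file, e.g. with Node's readline module over a file stream.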

Usage

```shell
node ./tools/run-benchmark.mjs --manifest ./benchmarks/manifest.json
```

Useful modes:

  • --dry-run - validate the manifest and print the planned run matrix
  • --report-only - rebuild leaderboard/report files from existing runs.jsonl
  • --stack <label> - limit execution to one or more stack labels
  • --task <id> - limit execution to one or more task IDs

Notes

  • A single universal model does not exist across all five AIDD adapters.
  • Fairness cohorts are computed after preflight and may exclude stacks that cannot resolve a target model family.
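The cohort rule above amounts to a filter over preflight results: keep only the stacks whose resolved models include the target family. A minimal sketch, where the stack names and the resolvedFamilies field are invented for illustration:

```javascript
// Sketch: compute a fairness cohort after preflight.
// Stack names and the resolvedFamilies field are hypothetical examples.
const preflight = [
  { stack: "alpha", resolvedFamilies: ["gpt", "claude"] },
  { stack: "beta", resolvedFamilies: ["claude"] },
  { stack: "gamma", resolvedFamilies: ["gemini"] },
];

function cohortFor(targetFamily, results) {
  // Exclude stacks that could not resolve the target model family.
  return results
    .filter((r) => r.resolvedFamilies.includes(targetFamily))
    .map((r) => r.stack);
}

console.log(cohortFor("claude", preflight));
```

A stack excluded from one cohort can still appear in its native-best comparison, so exclusion here narrows the overlapping subset rather than dropping the stack from the benchmark entirely.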