Model Failure Lab is a CLI-first system for structured LLM failure analysis.
It lets you run prompt datasets against a model, classify failures with simple heuristic logic, save deterministic local artifacts, and compare runs without needing a database or extra infrastructure.
Use Python 3.11 or newer.

```bash
python3 -m pip install .
failure-lab demo
failure-lab run --dataset reasoning-failures-v1 --model demo
failure-lab report --run <run-id>
failure-lab compare <baseline-run-id> <candidate-run-id>
```

Use the run ID from `failure-lab demo` as `<baseline-run-id>` and the run ID from the bundled `run` command as `<candidate-run-id>` in the later commands.
That standard install path exercises the real `failure-lab` console script: the demo writes a bundled dataset snapshot plus one run and report, the bundled `run` command gives you a second run to inspect, `report` rebuilds summaries from saved artifacts, and `compare` writes a saved comparison artifact for those two run IDs. Because the demo dataset and the reasoning dataset differ, that final quickstart comparison is expected to report `incompatible_dataset` while still writing the comparison files; rerun `failure-lab run` against the same dataset twice when you want a fully compatible comparison.
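The compatibility rule above boils down to a check on the two runs' dataset IDs. A minimal sketch of that behavior; the `dataset_id` field and the dataset ID values are illustrative assumptions, not the documented `run.json` schema:

```python
def comparison_status(baseline_run: dict, candidate_run: dict) -> str:
    """Illustrative check: two runs only compare cleanly when they were
    produced from the same dataset snapshot (field name is an assumption)."""
    if baseline_run.get("dataset_id") != candidate_run.get("dataset_id"):
        return "incompatible_dataset"
    return "ok"

# The quickstart pairs a demo run with a reasoning run, so the check trips:
print(comparison_status({"dataset_id": "demo-dataset"},
                        {"dataset_id": "reasoning-failures-v1"}))
# → incompatible_dataset
```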
By default, `failure-lab` writes `datasets/`, `runs/`, and `reports/` under your current working directory. Pass `--root /path/to/workspace` when you want the artifacts somewhere else.
If your shell does not expose the console script on PATH, use the module entrypoint instead:

```bash
python3 -m model_failure_lab demo
```

List bundled datasets shipped with the installed package:

```bash
failure-lab datasets list
```

All commands accept explicit paths and `--root` as well, so you can keep datasets, runs, and reports in an isolated workspace instead of the current directory.
The base install above is enough for the shipped artifact-backed engine loop:

- bundled datasets
- `failure-lab demo`, `failure-lab run`, `failure-lab report`, and `failure-lab compare`
- local Ollama routing such as `--model ollama:llama3.2`
Install extras only when you need those optional surfaces from a repo checkout:
| Need | Install |
|---|---|
| Anthropic adapter support | `python3 -m pip install '.[anthropic]'` |
| OpenAI adapter support | `python3 -m pip install '.[openai]'` |
| Legacy benchmark, training, and old reporting surfaces | `python3 -m pip install '.[legacy]'` |
| Legacy Streamlit results explorer | `python3 -m pip install '.[ui]'` |
| Test and lint tools | `python3 -m pip install '.[dev]'` |
If you are installing from a built wheel or published distribution instead of a local checkout, the equivalent package form is `model-failure-lab[anthropic]`, `model-failure-lab[openai]`, `model-failure-lab[legacy]`, or `model-failure-lab[ui]`.
The repo currently ships bundled packs for:

- `reasoning-failures-v1`
- `hallucination-failures-v1`
- `rag-failures-v1`
Bundled datasets default to the core slice for fast local runs. Use `--full` to include the extended tail:

```bash
failure-lab run --dataset rag-failures-v1 --model demo --full
```

The engine writes simple filesystem artifacts:

```
datasets/
runs/<run-id>/
  run.json
  results.json
reports/<report-id>/
  report.json
  report_details.json
```
The main contracts are:

- `PromptCase`
- `RunResult`
- `Report`

Everything stays inspectable by hand. There is no database layer.
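Because the artifacts are plain JSON, a few lines of stdlib Python are enough to poke at a run by hand. A minimal sketch that fabricates a run directory first so it is self-contained; the run ID and the fields shown (`run_id`, `model`, `case_id`, `failure`) are illustrative assumptions, not the documented schema:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical run layout mirroring runs/<run-id>/ above; field names are illustrative.
root = Path(tempfile.mkdtemp())
run_dir = root / "runs" / "example-run-id"
run_dir.mkdir(parents=True)
(run_dir / "run.json").write_text(json.dumps({"run_id": "example-run-id", "model": "demo"}))
(run_dir / "results.json").write_text(json.dumps([{"case_id": "case-1", "failure": None}]))

# Hand inspection is just JSON loading; no database or client library involved.
run = json.loads((run_dir / "run.json").read_text())
results = json.loads((run_dir / "results.json").read_text())
print(run["model"], len(results))   # → demo 1
```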
The React debugger reads an existing artifact workspace through one supported seam: `FAILURE_LAB_ARTIFACT_ROOT`. Point it at the directory that contains `runs/`, `reports/`, and optional `datasets/`:

```bash
export FAILURE_LAB_ARTIFACT_ROOT=/path/to/failure-lab-workspace
npm --prefix frontend run dev
```

That contract is the same whether the artifacts were written from this repo checkout, from a normal installed-package workflow, or from an Ollama-backed run. The debugger does not have an in-app artifact-root picker; the server-side environment variable is the supported handoff.
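On the server side, honoring that seam is ordinary environment-variable resolution. A minimal sketch, assuming a fallback to the current directory when the variable is unset (the real debugger's fallback behavior is not specified here):

```python
import os
from pathlib import Path

def resolve_artifact_root() -> Path:
    """Resolve the artifact workspace from FAILURE_LAB_ARTIFACT_ROOT,
    falling back to the current directory (an assumption for this sketch)."""
    raw = os.environ.get("FAILURE_LAB_ARTIFACT_ROOT")
    return Path(raw).expanduser() if raw else Path.cwd()

os.environ["FAILURE_LAB_ARTIFACT_ROOT"] = "/tmp/failure-lab-workspace"
print(resolve_artifact_root())   # → /tmp/failure-lab-workspace
```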
`failure-lab run` supports:

- `demo` for deterministic local execution
- Anthropic models through explicit routing such as `anthropic:claude-sonnet-4-0` after installing `.[anthropic]`
- OpenAI model names such as `gpt-4.1-mini` after installing `.[openai]`
- Ollama models through explicit routing such as `ollama:llama3.2`
- explicit adapter routing with `<adapter>:<model>`
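The `<adapter>:<model>` convention can be illustrated with a small parser. This sketches the shape of the routing string only, not the package's actual resolver; the `"default"` fallback for bare names is an assumption:

```python
def parse_model_spec(spec: str) -> tuple[str, str]:
    """Split an explicit '<adapter>:<model>' spec; bare model names fall
    through to a default adapter (assumed here, not documented behavior)."""
    adapter, sep, model = spec.partition(":")
    if sep:
        return adapter, model
    return "default", spec

print(parse_model_spec("ollama:llama3.2"))   # → ('ollama', 'llama3.2')
print(parse_model_spec("gpt-4.1-mini"))      # → ('default', 'gpt-4.1-mini')
```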
One explicit Anthropic example:

```bash
failure-lab run \
  --dataset reasoning-failures-v1 \
  --model anthropic:claude-sonnet-4-0 \
  --anthropic-base-url http://127.0.0.1:8000 \
  --system-prompt "Be concise." \
  --model-option max_tokens=256
```

One explicit local Ollama example:
```bash
failure-lab run \
  --dataset reasoning-failures-v1 \
  --model ollama:llama3.2 \
  --ollama-host http://localhost:11434 \
  --system-prompt "Be concise." \
  --model-option temperature=0
```

That same surface supports the normal saved-artifact loop:
```bash
failure-lab run --dataset reasoning-failures-v1 --model ollama:baseline-model --ollama-host http://localhost:11434 --system-prompt "Be concise." --model-option temperature=0
failure-lab run --dataset reasoning-failures-v1 --model ollama:candidate-model --ollama-host http://localhost:11434 --system-prompt "Be concise." --model-option temperature=0
failure-lab report --run <baseline-run-id>
failure-lab compare <baseline-run-id> <candidate-run-id>
```

The package also exposes simple registration seams for future extension:

- `register_model(...)`
- `register_classifier(...)`
If you want a reusable local workspace for `/analysis`, comparison explanation, and insight drillthrough without depending on external models, generate the checked-in fixture workspace:

```bash
python3 scripts/generate_insight_fixture.py
```

By default that writes a deterministic artifact root at `artifacts/insight-fixture-workspace` with:

- 1 dataset snapshot
- 4 compatible runs
- 4 run reports
- 3 comparison reports
- a rebuilt local query index

Then point the debugger at it:

```bash
export FAILURE_LAB_ARTIFACT_ROOT="$(pwd)/artifacts/insight-fixture-workspace"
npm --prefix frontend run dev
```

Useful smoke commands against that workspace:
```bash
failure-lab query --root artifacts/insight-fixture-workspace --failure-type hallucination --last-n 4 --summarize
failure-lab compare <baseline-run-id> <candidate-run-id> --root artifacts/insight-fixture-workspace --explain
```

Closed-loop harvest replay over the same workspace:
```bash
failure-lab harvest --root artifacts/insight-fixture-workspace --comparison <comparison-report-id> --delta regression --out artifacts/insight-fixture-workspace/datasets/harvested/regression-pack.json
failure-lab dataset review artifacts/insight-fixture-workspace/datasets/harvested/regression-pack.json
failure-lab dataset promote artifacts/insight-fixture-workspace/datasets/harvested/regression-pack.json --dataset-id fixture-regression-pack-v1 --root artifacts/insight-fixture-workspace
failure-lab run --root artifacts/insight-fixture-workspace --dataset fixture-regression-pack-v1 --model insight_fixture_v1:candidate-model --classifier insight_fixture_classifier_v1
failure-lab run --root artifacts/insight-fixture-workspace --dataset fixture-regression-pack-v1 --model insight_fixture_v1:stable-model --classifier insight_fixture_classifier_v1
failure-lab report --root artifacts/insight-fixture-workspace --run <candidate-rerun-id>
failure-lab report --root artifacts/insight-fixture-workspace --run <stable-rerun-id>
failure-lab compare <candidate-rerun-id> <stable-rerun-id> --root artifacts/insight-fixture-workspace --explain
failure-lab query --root artifacts/insight-fixture-workspace --dataset fixture-regression-pack-v1 --summarize
```

Editable install:
```bash
python3 -m pip install -e '.[dev]'
```

Add extras as needed from a checkout:

```bash
python3 -m pip install -e '.[anthropic]'
python3 -m pip install -e '.[openai]'
python3 -m pip install -e '.[ui]'
python3 -m pip install -e '.[legacy]'
```

If you need the old benchmark and research surfaces while developing, install both dev tooling and the legacy runtime stack together:

```bash
python3 -m pip install -e '.[dev,legacy]'
```

Focused checks:

```bash
python3 -m pytest -q
python3 -m ruff check src tests
```

This repo still contains the earlier benchmark and UI work that focused on CivilComments,
distribution shift, and the React failure debugger. That material remains useful as legacy
reference, but the current product direction is the engine-first failure-lab workflow above.
Useful legacy references: