LLM Failure Atlas is a failure-analysis platform for mapping where language models break, not just how often they succeed. The repository is organized around taxonomy stability, prompt versioning, deterministic scoring, auditable run metadata, and comparative inspection across multiple model targets.
Created by Robin.
The repository already includes:
- a versioned taxonomy and prompt dataset across `reasoning`, `hallucination`, and `tool_use`
- prompt variants as first-class benchmark executions
- deterministic evaluation with failure labels
- SQLAlchemy-backed persistence for runs and evaluations
- FastAPI endpoints for benchmark execution, comparison, prompts, taxonomy, and run inspection
- a Next.js UI for configuring multiple targets and running the suite against all of them
- provider adapters for `mock`, `openai_compatible`, and `ollama`
- automatic Markdown report generation on benchmark execution
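As a rough illustration of what "deterministic evaluation with failure labels" means here: the same prompt, expected answer, and response always produce the same label. The sketch below is an assumption for illustration only; the label names and the naive string-match rule are not the repository's actual scoring logic.

```python
from dataclasses import dataclass

# Hypothetical labels; the real taxonomy is versioned in the prompt dataset.
PASS = "pass"
HALLUCINATION = "hallucination"

@dataclass(frozen=True)
class Evaluation:
    prompt_id: str
    label: str

def evaluate(prompt_id: str, expected: str, response: str) -> Evaluation:
    """Deterministic scoring: identical inputs always yield the same label."""
    label = PASS if expected.lower() in response.lower() else HALLUCINATION
    return Evaluation(prompt_id=prompt_id, label=label)
```

Because the scorer has no randomness, re-running a benchmark against stored responses reproduces the same evaluation records, which is what makes runs auditable.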
This project is for:
- failure classification
- benchmark reproducibility
- cross-model comparison
- prompt-variant robustness analysis
- evidence-first inspection of failure clusters
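Operationally, "evidence-first inspection of failure clusters" starts from counting how often each failure label occurs so the largest clusters can be inspected first. A minimal sketch, assuming `"pass"` marks a non-failure (the repository's analysis code may differ):

```python
from collections import Counter

def failure_clusters(labels: list[str]) -> Counter:
    """Count evaluations per failure label, excluding passes."""
    return Counter(label for label in labels if label != "pass")
```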
This project is not for:
- generic chat playgrounds
- hidden prompt engineering tricks
- leaderboard-only reporting
- opaque evaluation pipelines
```shell
python -m venv .venv
.venv\Scripts\activate
pip install -e .[dev]
uvicorn app.main:app --app-dir backend --reload
```

```shell
cd frontend
npm install
npm run dev
```

- API docs: http://127.0.0.1:8000/docs
- UI: http://localhost:3000/
- Runs page: http://localhost:3000/runs
```
GET  /api/health
GET  /api/taxonomy
GET  /api/prompts
GET  /api/runs
GET  /api/overview
GET  /api/comparison
POST /api/benchmarks/run
POST /api/benchmarks/run-batch
```
Supported target contracts today:
- `mock`
- `openai_compatible`
- `ollama`
This is the current generic compatibility boundary. It intentionally does not claim to cover every vendor API. If a provider does not expose an OpenAI-compatible chat surface or a local Ollama-compatible route, it should receive a dedicated adapter.
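A dedicated adapter only has to satisfy a small contract. The sketch below is a hypothetical version of that contract plus a mock implementation; the project's actual adapter interface and module layout may differ.

```python
from typing import Protocol

class ProviderAdapter(Protocol):
    """Hypothetical adapter contract: turn a prompt into a completion."""
    def complete(self, prompt: str) -> str: ...

class MockAdapter:
    """Returns canned responses, making benchmark runs fully reproducible."""
    def __init__(self, canned: dict[str, str]):
        self.canned = canned

    def complete(self, prompt: str) -> str:
        return self.canned.get(prompt, "UNKNOWN")
```

Keeping the contract this narrow is what allows provider logic to stay separate from scoring logic, as the contributing guidelines below require.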
```shell
curl -X POST http://127.0.0.1:8000/api/benchmarks/run ^
  -H "Content-Type: application/json" ^
  -d "{\"provider\":\"openai_compatible\",\"model_name\":\"gpt-4.1-mini\",\"model_version\":\"api\",\"base_url\":\"https://your-endpoint/v1\",\"api_key\":\"YOUR_KEY\"}"
```

```shell
curl -X POST http://127.0.0.1:8000/api/benchmarks/run-batch ^
  -H "Content-Type: application/json" ^
  -d "{\"targets\":[{\"id\":\"mock-baseline\",\"label\":\"Mock Baseline\",\"provider\":\"mock\",\"model_name\":\"atlas-mock-1\",\"model_version\":\"2026-03-12\"},{\"id\":\"local-qwen\",\"label\":\"Qwen local\",\"provider\":\"openai_compatible\",\"model_name\":\"qwen2.5-0.5b-instruct-q4_k_m.gguf\",\"model_version\":\"local\",\"base_url\":\"http://127.0.0.1:8088/v1\"}],\"include_variants\":true}"
```

Environment variables:

- `OPENAI_COMPATIBLE_BASE_URL`
- `OPENAI_COMPATIBLE_API_KEY`
- `OLLAMA_BASE_URL`
- `OLLAMA_API_KEY`
- `ATLAS_CORS_ORIGINS`
- `DATABASE_URL`
Benchmarks write Markdown reports automatically:
- `reports/latest-report.md`
- `reports/<provider-model>-latest-report.md`
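The per-target filename is derived from the provider and model name. A hypothetical slug helper along those lines (the project's actual naming logic may differ):

```python
import re

def report_filename(provider: str, model_name: str) -> str:
    """Build a filesystem-safe report path from provider and model name."""
    slug = re.sub(r"[^a-z0-9]+", "-", f"{provider}-{model_name}".lower()).strip("-")
    return f"reports/{slug}-latest-report.md"
```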
The manual report helper remains available:
```shell
set PYTHONPATH=backend
python scripts/generate_report.py
```

```
backend/
  app/
    api/
    analysis/
    dataset/
    db/
    evaluation/
    models/
    runner/
frontend/
  app/
  lib/
prompts/
  reasoning/
  hallucination/
  tool_use/
docs/
reports/
scripts/
tests/
```
Before opening a PR, read CONTRIBUTING.md. The short version:
- keep schema and evaluator integrity ahead of UI polish
- update docs when contracts change
- avoid mixing provider logic with scoring logic
- do not commit credentials, local model weights, or generated artifacts
Run before submitting:
```shell
pytest
cd frontend
npm run build
```