RobinsonBeato/LLM-Failure-Atlas

LLM Failure Atlas

LLM Failure Atlas is a failure-analysis platform for mapping where language models break, not just how often they succeed. The repository is organized around taxonomy stability, prompt versioning, deterministic scoring, auditable run metadata, and comparative inspection across multiple model targets.

Created by Robin.

Current state

The repository already includes:

  • a versioned taxonomy and prompt dataset across reasoning, hallucination, and tool_use
  • prompt variants as first-class benchmark executions
  • deterministic evaluation with failure labels
  • SQLAlchemy-backed persistence for runs and evaluations
  • FastAPI endpoints for benchmark execution, comparison, prompts, taxonomy, and run inspection
  • a Next.js UI for configuring multiple targets and running the suite against all of them
  • provider adapters for mock, openai_compatible, and ollama
  • automatic Markdown report generation on benchmark execution

Product boundaries

This project is for:

  • failure classification
  • benchmark reproducibility
  • cross-model comparison
  • prompt-variant robustness analysis
  • evidence-first inspection of failure clusters

This project is not for:

  • generic chat playgrounds
  • hidden prompt engineering tricks
  • leaderboard-only reporting
  • opaque evaluation pipelines

Quickstart

Backend

python -m venv .venv
.venv\Scripts\activate        # Windows; on macOS/Linux: source .venv/bin/activate
pip install -e .[dev]
uvicorn app.main:app --app-dir backend --reload

Frontend

cd frontend
npm install
npm run dev

Main URLs

  • API docs: http://127.0.0.1:8000/docs
  • UI: http://localhost:3000/
  • Runs page: http://localhost:3000/runs

Current API surface

  • GET /api/health
  • GET /api/taxonomy
  • GET /api/prompts
  • GET /api/runs
  • GET /api/overview
  • GET /api/comparison
  • POST /api/benchmarks/run
  • POST /api/benchmarks/run-batch
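
The endpoints above can be driven from any HTTP client. As a minimal sketch, the helper below assembles the JSON body for POST /api/benchmarks/run using the field names from the curl examples further down; the helper name and the omit-when-empty handling of optional fields are illustrative assumptions, not part of the repository's code.

```python
import json

API_BASE = "http://127.0.0.1:8000"  # default uvicorn address from the Quickstart


def build_run_payload(provider, model_name, model_version, base_url=None, api_key=None):
    """Assemble the body for POST /api/benchmarks/run.

    Field names mirror the curl examples in this README; optional fields
    are omitted rather than sent as null (an assumption about the API).
    """
    payload = {
        "provider": provider,
        "model_name": model_name,
        "model_version": model_version,
    }
    if base_url:
        payload["base_url"] = base_url
    if api_key:
        payload["api_key"] = api_key
    return payload


body = json.dumps(build_run_payload("mock", "atlas-mock-1", "2026-03-12"))
# Send with requests/httpx/urllib to f"{API_BASE}/api/benchmarks/run"
```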

Provider support

Supported target contracts today:

  • mock
  • openai_compatible
  • ollama

This is the current generic compatibility boundary. It intentionally does not claim to cover every vendor API. If a provider does not expose an OpenAI-compatible chat surface or a local Ollama-compatible route, it should receive a dedicated adapter.
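
The repository's actual adapter interface is not reproduced here; as a rough sketch of what a dedicated adapter could satisfy, the contract below uses a single prompt-to-completion method. All names (ProviderAdapter, Completion, complete) are hypothetical.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Completion:
    """Minimal completion record (illustrative, not the repo's schema)."""
    text: str
    model_name: str


class ProviderAdapter(Protocol):
    """Hypothetical contract: turn a prompt into a completion."""
    def complete(self, prompt: str) -> Completion: ...


class MockAdapter:
    """Deterministic stand-in that echoes the prompt, as a mock provider might."""
    def complete(self, prompt: str) -> Completion:
        return Completion(text=f"echo: {prompt}", model_name="atlas-mock-1")


def run_prompt(adapter: ProviderAdapter, prompt: str) -> str:
    # Works with any object satisfying the protocol: mock, OpenAI-compatible, Ollama, ...
    return adapter.complete(prompt).text
```

Keeping provider logic behind a contract like this is what lets scoring stay separate from vendor details, per the contributing guidelines below.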

Single-target example

curl -X POST http://127.0.0.1:8000/api/benchmarks/run ^
  -H "Content-Type: application/json" ^
  -d "{\"provider\":\"openai_compatible\",\"model_name\":\"gpt-4.1-mini\",\"model_version\":\"api\",\"base_url\":\"https://your-endpoint/v1\",\"api_key\":\"YOUR_KEY\"}"

Batch example

curl -X POST http://127.0.0.1:8000/api/benchmarks/run-batch ^
  -H "Content-Type: application/json" ^
  -d "{\"targets\":[{\"id\":\"mock-baseline\",\"label\":\"Mock Baseline\",\"provider\":\"mock\",\"model_name\":\"atlas-mock-1\",\"model_version\":\"2026-03-12\"},{\"id\":\"local-qwen\",\"label\":\"Qwen local\",\"provider\":\"openai_compatible\",\"model_name\":\"qwen2.5-0.5b-instruct-q4_k_m.gguf\",\"model_version\":\"local\",\"base_url\":\"http://127.0.0.1:8088/v1\"}],\"include_variants\":true}"

Environment variables

  • OPENAI_COMPATIBLE_BASE_URL
  • OPENAI_COMPATIBLE_API_KEY
  • OLLAMA_BASE_URL
  • OLLAMA_API_KEY
  • ATLAS_CORS_ORIGINS
  • DATABASE_URL
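
One way to consume these variables is to collect them into a single settings mapping at startup. The defaults below are illustrative assumptions (the Ollama URL is that tool's usual local default); the application's real fallbacks may differ.

```python
import os


def load_config(env):
    """Collect the Atlas-related settings from an environment mapping.

    Defaults here are assumptions for illustration, not the app's actual
    fallbacks.
    """
    return {
        "openai_base_url": env.get("OPENAI_COMPATIBLE_BASE_URL", ""),
        "openai_api_key": env.get("OPENAI_COMPATIBLE_API_KEY", ""),
        "ollama_base_url": env.get("OLLAMA_BASE_URL", "http://127.0.0.1:11434"),
        "ollama_api_key": env.get("OLLAMA_API_KEY", ""),
        # Comma-separated origin list, whitespace-tolerant.
        "cors_origins": [o.strip() for o in env.get("ATLAS_CORS_ORIGINS", "").split(",") if o.strip()],
        "database_url": env.get("DATABASE_URL", "sqlite:///./atlas.db"),
    }


config = load_config(os.environ)
```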

Reports

Benchmarks write Markdown reports automatically:

  • reports/latest-report.md
  • reports/<provider-model>-latest-report.md

The manual report helper remains available:

set PYTHONPATH=backend        # Windows; on macOS/Linux: export PYTHONPATH=backend
python scripts/generate_report.py

Repository layout

backend/
  app/
    api/
    analysis/
    dataset/
    db/
    evaluation/
    models/
    runner/
frontend/
  app/
  lib/
prompts/
  reasoning/
  hallucination/
  tool_use/
docs/
reports/
scripts/
tests/

Documentation index

Contributing

Before opening a PR, read CONTRIBUTING.md. The short version:

  • keep schema and evaluator integrity ahead of UI polish
  • update docs when contracts change
  • avoid mixing provider logic with scoring logic
  • do not commit credentials, local model weights, or generated artifacts

Run before submitting:

pytest
cd frontend
npm run build
