RobinsonBeato/LLM-Failure-Atlas

LLM Failure Atlas

LLM Failure Atlas is a failure-analysis platform for mapping where language models break, not just how often they succeed. The repository is organized around taxonomy stability, prompt versioning, deterministic scoring, auditable run metadata, and comparative inspection across multiple model targets.

Created by Robin.

Current state

The repository already includes:

  • a versioned taxonomy and prompt dataset across reasoning, hallucination, and tool_use
  • prompt variants as first-class benchmark executions
  • deterministic evaluation with failure labels
  • SQLAlchemy-backed persistence for runs and evaluations
  • FastAPI endpoints for benchmark execution, comparison, prompts, taxonomy, and run inspection
  • a Next.js UI for configuring multiple targets and running the suite against all of them
  • provider adapters for mock, openai_compatible, and ollama
  • automatic Markdown report generation on benchmark execution

Product boundaries

This project is for:

  • failure classification
  • benchmark reproducibility
  • cross-model comparison
  • prompt-variant robustness analysis
  • evidence-first inspection of failure clusters

This project is not for:

  • generic chat playgrounds
  • hidden prompt engineering tricks
  • leaderboard-only reporting
  • opaque evaluation pipelines

Quickstart

Backend

python -m venv .venv
.venv\Scripts\activate        # Windows; on macOS/Linux: source .venv/bin/activate
pip install -e .[dev]
uvicorn app.main:app --app-dir backend --reload

Frontend

cd frontend
npm install
npm run dev

Main URLs

  • API docs: http://127.0.0.1:8000/docs
  • UI: http://localhost:3000/
  • Runs page: http://localhost:3000/runs

Current API surface

  • GET /api/health
  • GET /api/taxonomy
  • GET /api/prompts
  • GET /api/runs
  • GET /api/overview
  • GET /api/comparison
  • POST /api/benchmarks/run
  • POST /api/benchmarks/run-batch
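
The endpoints above can be driven from any HTTP client. As a minimal sketch, the helper below assembles the JSON body for POST /api/benchmarks/run using the field names from the curl examples further down; the helper name and the omit-when-empty handling of optional fields are illustrative assumptions, not part of the repository's code.

```python
import json

API_BASE = "http://127.0.0.1:8000"  # default uvicorn address from the Quickstart


def build_run_payload(provider, model_name, model_version, base_url=None, api_key=None):
    """Assemble the body for POST /api/benchmarks/run.

    Field names mirror the curl examples in this README; optional fields
    are omitted rather than sent as null (an assumption about the API).
    """
    payload = {
        "provider": provider,
        "model_name": model_name,
        "model_version": model_version,
    }
    if base_url:
        payload["base_url"] = base_url
    if api_key:
        payload["api_key"] = api_key
    return payload


body = json.dumps(build_run_payload("mock", "atlas-mock-1", "2026-03-12"))
# Send with requests/httpx/urllib to f"{API_BASE}/api/benchmarks/run"
```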

Provider support

Supported target contracts today:

  • mock
  • openai_compatible
  • ollama

This is the current generic compatibility boundary. It intentionally does not claim to cover every vendor API. If a provider does not expose an OpenAI-compatible chat surface or a local Ollama-compatible route, it should receive a dedicated adapter.
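
The repository's actual adapter interface is not reproduced here; as a rough sketch of what a dedicated adapter could satisfy, the contract below uses a single prompt-to-completion method. All names (ProviderAdapter, Completion, complete) are hypothetical.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Completion:
    """Minimal completion record (illustrative, not the repo's schema)."""
    text: str
    model_name: str


class ProviderAdapter(Protocol):
    """Hypothetical contract: turn a prompt into a completion."""
    def complete(self, prompt: str) -> Completion: ...


class MockAdapter:
    """Deterministic stand-in that echoes the prompt, as a mock provider might."""
    def complete(self, prompt: str) -> Completion:
        return Completion(text=f"echo: {prompt}", model_name="atlas-mock-1")


def run_prompt(adapter: ProviderAdapter, prompt: str) -> str:
    # Works with any object satisfying the protocol: mock, OpenAI-compatible, Ollama, ...
    return adapter.complete(prompt).text
```

Keeping provider logic behind a contract like this is what lets scoring stay separate from vendor details, per the contributing guidelines below.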

Single-target example

curl -X POST http://127.0.0.1:8000/api/benchmarks/run ^
  -H "Content-Type: application/json" ^
  -d "{\"provider\":\"openai_compatible\",\"model_name\":\"gpt-4.1-mini\",\"model_version\":\"api\",\"base_url\":\"https://your-endpoint/v1\",\"api_key\":\"YOUR_KEY\"}"

Batch example

curl -X POST http://127.0.0.1:8000/api/benchmarks/run-batch ^
  -H "Content-Type: application/json" ^
  -d "{\"targets\":[{\"id\":\"mock-baseline\",\"label\":\"Mock Baseline\",\"provider\":\"mock\",\"model_name\":\"atlas-mock-1\",\"model_version\":\"2026-03-12\"},{\"id\":\"local-qwen\",\"label\":\"Qwen local\",\"provider\":\"openai_compatible\",\"model_name\":\"qwen2.5-0.5b-instruct-q4_k_m.gguf\",\"model_version\":\"local\",\"base_url\":\"http://127.0.0.1:8088/v1\"}],\"include_variants\":true}"

Environment variables

  • OPENAI_COMPATIBLE_BASE_URL
  • OPENAI_COMPATIBLE_API_KEY
  • OLLAMA_BASE_URL
  • OLLAMA_API_KEY
  • ATLAS_CORS_ORIGINS
  • DATABASE_URL
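
One way to consume these variables is to collect them into a single settings mapping at startup. The defaults below are illustrative assumptions (the Ollama URL is that tool's usual local default); the application's real fallbacks may differ.

```python
import os


def load_config(env):
    """Collect the Atlas-related settings from an environment mapping.

    Defaults here are assumptions for illustration, not the app's actual
    fallbacks.
    """
    return {
        "openai_base_url": env.get("OPENAI_COMPATIBLE_BASE_URL", ""),
        "openai_api_key": env.get("OPENAI_COMPATIBLE_API_KEY", ""),
        "ollama_base_url": env.get("OLLAMA_BASE_URL", "http://127.0.0.1:11434"),
        "ollama_api_key": env.get("OLLAMA_API_KEY", ""),
        # Comma-separated origin list, whitespace-tolerant.
        "cors_origins": [o.strip() for o in env.get("ATLAS_CORS_ORIGINS", "").split(",") if o.strip()],
        "database_url": env.get("DATABASE_URL", "sqlite:///./atlas.db"),
    }


config = load_config(os.environ)
```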

Reports

Benchmarks write Markdown reports automatically:

  • reports/latest-report.md
  • reports/<provider-model>-latest-report.md

The manual report helper remains available:

set PYTHONPATH=backend        # Windows; on macOS/Linux: export PYTHONPATH=backend
python scripts/generate_report.py

Repository layout

backend/
  app/
    api/
    analysis/
    dataset/
    db/
    evaluation/
    models/
    runner/
frontend/
  app/
  lib/
prompts/
  reasoning/
  hallucination/
  tool_use/
docs/
reports/
scripts/
tests/

Documentation index

Contributing

Before opening a PR, read CONTRIBUTING.md. The short version:

  • keep schema and evaluator integrity ahead of UI polish
  • update docs when contracts change
  • avoid mixing provider logic with scoring logic
  • do not commit credentials, local model weights, or generated artifacts

Run before submitting:

pytest
cd frontend
npm run build
