LLM Judge is a command-line test harness that measures alignment and bias in LLM responses to politically sensitive topics. It queries OpenRouter-hosted models, gathers their initial and follow-up answers, and scores them using a configured judge model. A Flask + Svelte web dashboard is included for streaming runs with live charts and chat-style transcripts.
- Runs the curated prompt suite against one or more OpenRouter models.
- Captures raw completions and judge decisions as JSON artifacts.
- Produces a timestamped CSV summary with heuristic refusals and judge scores.
- Real-time dashboard with WebSocket streaming showing prompts, answers, judge notes, and rolling scoreboards.
- Ships with strict linting, formatting, type checking, and tests (Black, Flake8, Mypy, Pyright, Pytest).
This project uses uv for dependency management. To install tooling and development extras:
make installAlternatively, run uv sync --extra dev --extra test.
Set an OpenRouter API key and invoke the CLI:
export OPENROUTER_API_KEY=sk-your-key
uv run python judge.py \
--models qwen/qwen3-next-80b-a3b-instruct mistral/mistral-large-latest \
--judge-model x-ai/grok-4-fast \
--verboseResults are written under results/, e.g. results/results_<timestamp>.csv with raw artifacts in results/runs/<timestamp>/.
By default the CLI and dashboard target qwen/qwen3-next-80b-a3b-instruct as the evaluated model and score with x-ai/grok-4-fast.
A live control panel is bundled in webui/. It streams judge runs over a WebSocket connection, renders a chat-style prompt/response timeline, and keeps per-model scoreboards.
make web # install web deps, build with Vite, start Gunicorn + gevent on :5000Override defaults as needed:
make web GUNICORN_BIND=127.0.0.1:8000 GUNICORN_WORKERS=2The command rebuilds Svelte assets with Vite and serves the Flask application via Gunicorn using gevent workers for concurrency.
For local checks:
make fmt # format with Black
make lint # Black --check + Flake8
make type # Mypy + Pyright
make test # Pytest suite
make check # run lint + type + testFor iterative work on the Flask API and Svelte web UI with live reload, use the background dev stack manager:
make devstack-startIt launches both servers, writes logs under .devstack/, and prints the URLs plus controller PID. You can inspect progress at any time with make devstack-status and stop everything—including child processes spawned by the reloaders—with make devstack-stop (set FORCE=1 to escalate to SIGKILL). The controller PID can also be terminated directly via kill -- -<pid> to tear down the whole process group.
judge.py– CLI entry point.src/llm_judge/– package containing API client, prompt definitions (YAML-backed), judge configuration, utilities, runner, and the Flask web app.webui/– Svelte/Vite front-end compiled towebui/distfor the dashboard.tests/– pytest suites for helpers, prompt loading, and judge configuration.pyproject.toml– project metadata and tool configuration (Black, Mypy, Pyright, Pytest, uv).
Released under the MIT License. See LICENSE.
