LLM Judge

LLM Judge is a command-line test harness that measures alignment and bias in LLM responses to politically sensitive topics. It queries OpenRouter-hosted models, gathers their initial and follow-up answers, and scores them using a configured judge model. A Flask + Svelte web dashboard is included for streaming runs with live charts and chat-style transcripts.

Features

Runs the curated prompt suite against one or more OpenRouter models.
Captures raw completions and judge decisions as JSON artifacts.
Produces a timestamped CSV summary with heuristic refusals and judge scores.
Streams runs to a real-time dashboard over WebSocket, showing prompts, answers, judge notes, and rolling scoreboards.
Ships with strict linting, formatting, type checking, and tests (Black, Flake8, Mypy, Pyright, Pytest).

Installation

This project uses uv for dependency management. To install tooling and development extras:

make install

Alternatively, run uv sync --extra dev --extra test.

Usage

Set an OpenRouter API key and invoke the CLI:

export OPENROUTER_API_KEY=sk-your-key
uv run python judge.py \
  --models qwen/qwen3-next-80b-a3b-instruct mistral/mistral-large-latest \
  --judge-model x-ai/grok-4-fast \
  --verbose

Results are written under results/: the CSV summary goes to results/results_<timestamp>.csv and the raw artifacts go to results/runs/<timestamp>/.
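To inspect the latest summary from a script, something like the following minimal sketch may help. It assumes only the results/results_<timestamp>.csv naming described above. The helper names latest_results_csv and load_rows are hypothetical and not part of the project. Because the exact timestamp format is not documented here, the sketch picks the newest file by modification time rather than by parsing the name, and it makes no assumption about column names.

```python
import csv
from pathlib import Path
from typing import Optional


def latest_results_csv(results_dir: Path) -> Optional[Path]:
    """Return the most recently modified results_*.csv, or None if none exist."""
    # Hypothetical helper: sorts by modification time, not by the timestamp in the name.
    candidates = sorted(
        results_dir.glob("results_*.csv"),
        key=lambda p: p.stat().st_mtime,
    )
    return candidates[-1] if candidates else None


def load_rows(csv_path: Path) -> list[dict[str, str]]:
    """Read every row into a dict keyed by whatever header the CSV declares."""
    with csv_path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    path = latest_results_csv(Path("results"))
    if path is None:
        print("No results CSV found under results/")
    else:
        rows = load_rows(path)
        columns = list(rows[0].keys()) if rows else []
        print(f"{path}: {len(rows)} rows, columns: {columns}")
```

Printing the header first is a cheap way to discover which columns a given run produced before writing any analysis against them.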

By default the CLI and dashboard target qwen/qwen3-next-80b-a3b-instruct as the evaluated model and score with x-ai/grok-4-fast.
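If you want to drive the CLI from another Python script, for example to sweep several model lists, a thin wrapper along these lines is one option. This is a sketch, not part of the project: build_command and run_judge are hypothetical names, and the only flags used are the ones shown in the usage example above (--models, --judge-model, --verbose).

```python
import os
import subprocess
from typing import Sequence


def build_command(
    models: Sequence[str], judge_model: str, verbose: bool = False
) -> list[str]:
    """Assemble the argv for judge.py using only the flags shown in the README."""
    if not models:
        raise ValueError("at least one model is required")
    cmd = ["uv", "run", "python", "judge.py", "--models", *models]
    cmd += ["--judge-model", judge_model]
    if verbose:
        cmd.append("--verbose")
    return cmd


def run_judge(
    models: Sequence[str], judge_model: str, verbose: bool = False
) -> int:
    """Run the CLI and return its exit code; fail early if the API key is missing."""
    if not os.environ.get("OPENROUTER_API_KEY"):
        raise RuntimeError("OPENROUTER_API_KEY is not set")
    completed = subprocess.run(build_command(models, judge_model, verbose), check=False)
    return completed.returncode
```

Passing the arguments as a list avoids shell quoting issues with model identifiers that contain slashes or other special characters.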

Web dashboard

A live control panel is bundled in webui/. It streams judge runs over a WebSocket connection, renders a chat-style prompt/response timeline, and keeps per-model scoreboards.

make web  # install web deps, build with Vite, start Gunicorn + gevent on :5000

Override defaults as needed:

make web GUNICORN_BIND=127.0.0.1:8000 GUNICORN_WORKERS=2

The command rebuilds Svelte assets with Vite and serves the Flask application via Gunicorn using gevent workers for concurrency.

For local checks:

make fmt     # format with Black
make lint    # Black --check + Flake8
make type    # Mypy + Pyright
make test    # Pytest suite
make check   # run lint + type + test

Development stack

For iterative work on the Flask API and Svelte web UI with live reload, use the background dev stack manager:

make devstack-start

It launches both servers, writes logs under .devstack/, and prints the URLs plus the controller PID. Check progress at any time with make devstack-status. Stop everything, including child processes spawned by the reloaders, with make devstack-stop (set FORCE=1 to escalate to SIGKILL). You can also terminate the controller PID directly with kill -- -<pid>, which signals the whole process group.
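The process-group teardown behind kill -- -<pid> can be illustrated with a short, POSIX-only Python sketch. It is not the dev stack manager's actual implementation: start_in_new_group and stop_group are hypothetical helpers that only show the general technique of making a child the leader of its own group and then signalling that group, so that grandchildren such as reloader workers receive the signal too.

```python
import os
import signal
import subprocess


def start_in_new_group(argv: list[str]) -> subprocess.Popen:
    """Start a child in a new session, so its PID is also its process group ID."""
    return subprocess.Popen(argv, start_new_session=True)


def stop_group(proc: subprocess.Popen, force: bool = False, timeout: float = 5.0) -> int:
    """Signal the child's whole process group, then wait for the child to exit."""
    # SIGTERM by default; SIGKILL mirrors the FORCE=1 escalation described above.
    sig = signal.SIGKILL if force else signal.SIGTERM
    try:
        # Equivalent to `kill -s <sig> -- -<pid>` because the child leads its group.
        os.killpg(proc.pid, sig)
    except ProcessLookupError:
        pass  # the group has already exited
    return proc.wait(timeout=timeout)
```

Signalling the group rather than the single PID matters here because reloaders fork their own children, which would otherwise be left running after the parent exits.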

Project Layout

judge.py – CLI entry point.
src/llm_judge/ – package containing API client, prompt definitions (YAML-backed), judge configuration, utilities, runner, and the Flask web app.
webui/ – Svelte/Vite front-end compiled to webui/dist for the dashboard.
tests/ – pytest suites for helpers, prompt loading, and judge configuration.
pyproject.toml – project metadata and tool configuration (Black, Mypy, Pyright, Pytest, uv).

License

Released under the MIT License. See LICENSE.
