An AI-powered application that evaluates and iteratively improves a grading rubric using an LLM council. Multiple LLM judges independently analyze the rubric against real student answers, debate improvements through a structured voting protocol, and converge on an optimized rubric through consensus or synthesis.
- Prerequisites
- Quick Start
- Approach
- Algorithm
- Design Choices
- How to Run
- Configuration
- Output Format
- Testing
- Dependencies
- Python 3.11+
- uv package manager
- An API key for at least one LLM provider (OpenAI, Anthropic, Google, or Azure)
```bash
cd rubric_evaluator
make setup   # installs deps, copies config templates (⚠️ overwrites config.yaml and .env)
# Edit .env with your API keys (default config uses both OpenAI and Anthropic)
make run     # runs the pipeline, output lands in output/<timestamp>/
```

Output: `output/<timestamp>/result.json` (improved rubric + explanation) and `output/<timestamp>/metadata.json` (full pipeline trace).
| Command | Description |
|---|---|
| `make help` | Show all available targets |
| `make install` | Install dependencies via `uv sync` |
| `make setup` | Install + copy sample configs to `config.yaml` and `.env` |
| `make run` | Run the pipeline (override config with `CONFIG=path.yaml`) |
| `make test` | Run unit and integration tests with coverage |
| `make lint` | Check linting and formatting (no changes) |
| `make lint-fix` | Auto-fix linting and formatting issues |
| `make clean` | Remove build artifacts, caches, and output |
The core idea is to treat rubric improvement as a multi-agent deliberation problem. Rather than asking a single LLM to improve the rubric in one shot, the application assembles a council of N independently configured LLM judges that each bring a different perspective (different models, temperatures, or provider-specific features like extended thinking). Each student answer acts as a stress test that may expose different rubric weaknesses.
The rubric evolves through sequential rounds (one per student answer), where each round runs parallel evaluation, parallel voting, and optional synthesis. This mimics how a real teaching team would refine a rubric: review it against real answers, discuss, vote, and iterate.
The rubric is evaluated on three dimensions:
| Criterion | Key Question | Goal |
|---|---|---|
| Ambiguity | Is the rubric formulated with objective, unambiguous criteria? | All graders reach the same interpretation independently |
| Applicability | Does the rubric cover the diversity of possible student responses? | No valid answer type is left unaddressed |
| Discrimination Power | Does the rubric clearly separate excellent from poor work? | High-quality answers score significantly better than weak ones |
```mermaid
flowchart TD
    Load["Load config, rubric,\nexam question, student answers"] --> R1
    subgraph R1 ["Council Round — one per student answer"]
        direction TB
        E1["Each judge independently\nevaluates rubric vs. answer"] --> V1["Each judge votes\nfor best proposal"] --> C1{"Consensus\nreached?"}
        C1 -->|Yes| W1["Accept winning\nproposal as new rubric"]
        C1 -->|No| S1{"Synthesis\nattempts left?"}
        S1 -->|Yes| Syn1["Merge top proposals\ninto single rubric"] -.->|re-evaluate| E1
        S1 -->|No| D1["Drop round,\nkeep previous rubric"]
    end
    R1 -- "improved rubric" --> R2["Round 2 — Student 2"]
    R2 --> dots["⋯"]
    dots --> RS["Round S — Student S"]
    RS -- "final rubric" --> Diff["Compare original vs.\nfinal rubric"]
    Diff --> Out["Export result.json\nand metadata.json"]
    style dots fill:none,stroke:none
```
Each round feeds the improved rubric from the previous round into the next. Student answers are processed sequentially because each round's output is the next round's input.
Each round runs the following protocol for one student answer:
```mermaid
flowchart TD
    Start(["Begin Round"]) --> Eval
    Eval["All N judges evaluate rubric\nagainst student answer (parallel)"]
    Eval --> NoChange{"All judges propose\nno changes?"}
    NoChange -->|Yes| Unchanged["Rubric unchanged —\nskip to next round"]
    NoChange -->|No| OneJudge{"Only one\njudge?"}
    OneJudge -->|"Yes — auto-accept\nsole proposal"| Winner
    OneJudge -->|No| Vote
    Vote["All judges vote on\nbest proposal (parallel)"]
    Vote --> Validate["Discard invalid votes\n(unrecognized proposal IDs)"]
    Validate --> Tally{"Any proposal meets\nconsensus threshold?"}
    Tally -->|"Yes — clear winner"| Winner
    Tally -->|"Yes — tie"| Tie["Break tie randomly\n(avoid ordering bias)"] --> Winner
    Tally -->|No| SynthLeft{"Synthesis attempts\nremaining?"}
    SynthLeft -->|Yes| Synth["Select top-ranked proposals,\nsynthesize into merged rubric"] -.->|"re-evaluate with\nmerged rubric"| Eval
    SynthLeft -->|No| Drop["Drop round — rollback rubric,\nflag for manual review"]
    Winner["Accept winning proposal\nas new rubric"]
    Winner --> Done(["Round Complete"])
    Unchanged --> Done
    Drop --> Done
```
1. Parallel Evaluation. All N judges receive the same inputs (exam question, teaching resource, optimization dimensions, current rubric, student answer) and independently produce a Proposal containing:
- Per-criterion analysis (issues found + suggested improvements) for ambiguity, applicability, and discrimination power.
- A complete rewritten rubric, or `null` if no changes are needed.
- Reasoning explaining the analysis (generated before the structured fields to improve quality).
All evaluations run concurrently via asyncio.gather().
2. No-Change Short-Circuit. If every judge returns improved_rubric = null (i.e., the rubric adequately handles this student answer), the round ends immediately. No voting occurs, and the rubric passes through unchanged.
3. Single-Judge Fast Path. With N=1, the sole proposal wins automatically. No voting phase is needed.
4. Parallel Voting. Each judge reviews all proposals and votes for the best one (self-voting is allowed). Votes run concurrently. Each vote is validated: if a judge returns a voted_for ID that doesn't match any proposal, the vote is silently discarded. If all votes are invalid, the pipeline raises an error (fail-fast rather than silently producing biased results).
5. Consensus Check. Votes are tallied. If any proposal receives votes / total >= consensus_threshold, it wins. If multiple proposals meet the threshold (a tie), the winner is chosen randomly to avoid implicit bias toward earlier-listed judges. The winning proposal's rubric replaces the current rubric.
6. Synthesis Fallback. If no proposal meets the threshold, the top-ranked proposals (top two tiers by vote count) are sent to a separate synthesizer LLM, which merges the best elements into a single improved rubric. The round then restarts from step 1 with this synthesized rubric, giving judges a chance to converge.
7. Outlier Detection. If synthesis fails to produce consensus after max_synthesis_attempts retries, the round is dropped as an outlier. The rubric rolls back to its state before the round, and the round is flagged in the output metadata for manual review. This prevents a single adversarial or ambiguous student answer from corrupting the rubric.
8. Final Explanation. After all rounds, a diff explainer (separate LLM) compares the original rubric against the final improved rubric and produces a structured explanation of all changes, organized by the three evaluation criteria.
A single LLM call risks model-specific biases and blind spots. The council pattern provides:
- Reduced bias: Multiple models/temperatures surface different issues.
- Built-in quality gate: Voting ensures only changes that multiple judges endorse are accepted.
- Graceful degradation: Works with N=1 (single judge, auto-consensus) up to any N.
- Transparency: The voting record and per-judge reasoning are preserved in metadata.
- LangChain provides `with_structured_output()` for native Pydantic schema validation: the LLM returns validated Python objects directly, eliminating manual JSON parsing and error handling.
- LiteLLM (via `ChatLiteLLM`) provides provider-agnostic routing: a single interface handles OpenAI, Anthropic, Google, Azure, and others. Provider-specific features (reasoning effort for o-series models, extended thinking for Claude) are passed through `model_kwargs` without conditional logic.
The council algorithm is a sequential round loop with parallel fan-out/fan-in. This maps directly to asyncio.gather() inside a for loop. LangGraph would add a graph abstraction layer without simplifying the control flow, since the pipeline's topology is fixed (not dynamic or conditional at the graph level).
Judge evaluations and votes are embarrassingly parallel within a round: same inputs, no dependencies between judges. asyncio.gather() with LangChain's native ainvoke() is the simplest correct solution. Rounds are sequential because each round's rubric depends on the previous round's output.
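The fan-out/fan-in shape can be sketched with stdlib asyncio alone; the `evaluate` coroutine below is a stand-in for a judge's real LLM call, not the project's code:

```python
import asyncio

async def evaluate(judge_id: str, rubric: str, answer: str) -> dict:
    """Stand-in for one judge's structured LLM evaluation call."""
    await asyncio.sleep(0)  # real code awaits the provider's ainvoke() here
    return {"judge": judge_id, "proposal": f"{rubric} (reviewed vs. {answer})"}

async def run_round(judges: list[str], rubric: str, answer: str) -> list[dict]:
    # Fan out: all judges get the same inputs and run concurrently.
    # Fan in: gather() preserves input order, so results align with `judges`.
    return await asyncio.gather(*(evaluate(j, rubric, answer) for j in judges))

proposals = asyncio.run(run_round(["judge_1", "judge_2"], "rubric v1", "student1"))
```

Rounds stay in a plain `for` loop around `run_round`, since each round consumes the previous round's rubric.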
All LLM response schemas place the reasoning field before the analysis and rubric fields. This is intentional: LLMs generate tokens sequentially, so producing the reasoning first means the model has already articulated its analysis when it reaches the structured output fields. This consistently improves the quality of the downstream fields.
The input student answers are handwritten PDFs. Rather than coupling the rubric evaluation pipeline to an OCR or vision-model transcription step (which introduces its own error modes), the student answers are pre-transcribed to plain text files. This ensures:
- Reproducibility: The same input produces the same output regardless of OCR model availability.
- Separation of concerns: Transcription quality doesn't confound rubric evaluation quality.
- Simplicity: The pipeline focuses on its core task.
When multiple proposals meet the consensus threshold with equal vote shares, the winner is chosen uniformly at random. Deterministic tie-breaking (e.g., first proposal wins) would introduce implicit ordering bias. Randomness ensures fairness across runs.
During the voting phase, judges are not told which proposal is their own. All proposals are presented with only the judge ID as a label, and the voting prompt does not indicate authorship. This prevents self-preference bias: judges evaluate proposals on merit rather than defaulting to their own output. For the same reason, the sample config uses neutral judge IDs (judge_1, judge_2) instead of model names, since labels like GPT_5_4 or Claude_Sonnet_4_6 could trigger pretraining-based model affinity.
If a judge returns a voted_for value that doesn't match any proposal ID, the vote is discarded entirely. Defaulting to self-vote would silently inflate self-preference. Discarding preserves vote integrity: only intentional, valid votes count.
If judges cannot agree even after synthesis, it likely means the student answer is ambiguous or the proposed changes are contentious. Forcing a winner would introduce noise. Dropping the round and rolling back preserves rubric quality. Dropped rounds are flagged in metadata so a human reviewer can investigate.
The per-round analysis captures incremental changes. The diff explainer provides a holistic comparison between the original and final rubric, producing a structured explanation of all changes. Using a separate LLM call (with its own model configuration) ensures the explanation is coherent and not biased by any single round's perspective.
LLM API calls are expensive and slow. The cache stores successfully parsed responses keyed by a SHA-256 hash of the full request parameters (model, messages, schema). Key properties:
- Only valid responses are cached: Parse failures are never stored, preventing cache poisoning.
- Deterministic keys: Same request always produces the same cache key.
- Synchronous writes: Intentional; prevents interleaved writes from concurrent `asyncio.gather()` coroutines (single-threaded, so sync calls are atomic from the event loop's perspective).
- Configurable: Set `cache_path: null` in config to disable entirely.
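The deterministic keying can be sketched with stdlib hashing. This assumes request parameters are JSON-serializable; the project's real key layout may differ:

```python
import hashlib
import json

def cache_key(model: str, messages: list[dict], schema_name: str) -> str:
    """SHA-256 over a canonical JSON encoding of the full request parameters.

    sort_keys + compact separators make the encoding canonical, so the same
    request always hashes to the same key regardless of dict insertion order.
    """
    payload = json.dumps(
        {"model": model, "messages": messages, "schema": schema_name},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

k1 = cache_key("gpt-5.4", [{"role": "user", "content": "hi"}], "Proposal")
k2 = cache_key("gpt-5.4", [{"content": "hi", "role": "user"}], "Proposal")
# sort_keys normalizes dict ordering, so k1 == k2
```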
```bash
cd rubric_evaluator
make setup
```

This will:
- Install all dependencies via `uv sync`.
- Copy `config.sample.yaml` to `config.yaml`.
- Copy `.env.sample` to `.env`.

Then edit `.env` with your API keys (the default config uses both OpenAI and Anthropic):

```bash
OPENAI_API_KEY=sk-proj-...
ANTHROPIC_API_KEY=sk-ant-api03-...
# GOOGLE_API_KEY=...
```

Optionally edit `config.yaml` to change models, number of judges, temperatures, consensus threshold, etc.

```bash
make run
```

Or with a custom config:

```bash
make run CONFIG=my_config.yaml
```

Or directly:

```bash
uv run rubric-evaluator config.yaml
```

All parameters are in `config.yaml`. See `config.sample.yaml` for annotated defaults.
```yaml
# Each judge is an independently configured LLM
judges:
  - id: "judge_1"
    llm:
      model: "gpt-5.4"                # Any LiteLLM-supported model
      api_key: "${OPENAI_API_KEY}"    # Resolved from environment
      temperature: 0.3
      reasoning_effort: "medium"      # OpenAI GPT-5+ / o-series only
      max_completion_tokens: 8192
  - id: "judge_2"
    llm:
      model: "anthropic/claude-sonnet-4-6"
      api_key: "${ANTHROPIC_API_KEY}"
      temperature: 0.1
      max_completion_tokens: 8192
      # thinking_budget: 5000        # Anthropic extended thinking

# Council voting parameters
council:
  consensus_threshold: 0.5      # Fraction of votes needed to win (0 < t <= 1)
  max_synthesis_attempts: 2     # Retries before dropping a round (0 = no synthesis, drop immediately on disagreement)

# Synthesis and diff explanation use separate LLM configs
synthesizer:
  model: "gpt-5.4"
  api_key: "${OPENAI_API_KEY}"
  temperature: 0.3
  max_completion_tokens: 8192

diff_explainer:
  model: "anthropic/claude-sonnet-4-6"
  api_key: "${ANTHROPIC_API_KEY}"
  temperature: 0.3
  max_completion_tokens: 8192

# Input file paths (relative to working directory)
inputs:
  exam_question: "data/transcribed/exam_question.txt"
  grading_rubric: "data/transcribed/rubric.txt"
  teaching_resource: "data/transcribed/teaching_resource.txt"
  optimization_dimensions: "data/transcribed/optimization_dimensions.txt"
  student_answers_dir: "data/transcribed/student_answers"  # all .txt files loaded (must contain at least one)

output:
  dir: "output"

# LLM cache (set to null to disable)
cache_path: ".cache"
```

Environment variables are resolved at load time via `${VAR_NAME}` syntax. A missing variable raises a clear error.
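The `${VAR_NAME}` resolution can be sketched with a small stdlib helper; this is illustrative, not the project's actual config loader:

```python
import os
import re

_VAR = re.compile(r"\$\{([A-Z0-9_]+)\}")

def resolve_env(value: str) -> str:
    """Replace each ${VAR_NAME} with its environment value; fail loudly if unset."""
    def _sub(match: re.Match) -> str:
        name = match.group(1)
        if name not in os.environ:
            raise KeyError(f"environment variable {name} is not set")
        return os.environ[name]
    return _VAR.sub(_sub, value)

os.environ["OPENAI_API_KEY"] = "sk-test"
resolved = resolve_env("${OPENAI_API_KEY}")
```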
The pipeline writes two files to a timestamped subdirectory under the configured output directory:
```
output/
  2026-04-11T09-42-21-152946/
    result.json     # Improved rubric + explanation
    metadata.json   # Pipeline run details
```
A sample run output is included in examples/ for reference.
Contains the improved rubric and a structured explanation of the changes:
```json
{
  "improved_rubric": "1 point per bad actor action (total = 3 points)\n...",
  "evaluation_explanation": {
    "ambiguity": {
      "issues": [
        "The term 'sufficiently described' is subjective..."
      ],
      "improvements": [
        "Replaced with explicit criteria: must name a specific stakeholder..."
      ]
    },
    "applicability": {
      "issues": ["..."],
      "improvements": ["..."]
    },
    "discrimination_power": {
      "issues": ["..."],
      "improvements": ["..."]
    }
  }
}
```

Contains the full pipeline trace for auditability:
```json
{
  "original_rubric": "1 point per bad actor action...",
  "rounds_processed": 3,
  "rounds_dropped": [],
  "rubric_versions": [
    {
      "round_index": 1,
      "answer_path": "data/transcribed/student_answers/student1.txt",
      "consensus_reached": true,
      "num_attempts": 1,
      "dropped": false,
      "winning_judge_id": "judge_1",
      "rubric_before": "...",
      "rubric_after": "...",
      "issues": {
        "ambiguity": ["..."],
        "applicability": ["..."],
        "discrimination_power": ["..."]
      },
      "improvements": {
        "ambiguity": ["..."],
        "applicability": ["..."],
        "discrimination_power": ["..."]
      }
    }
  ]
}
```

```bash
make test
```

Runs unit and integration tests with coverage reporting. Tests are structured as:
- Unit tests for each module (models, helpers, cache, judge, council, synthesizer, output, LLM client).
- Integration test (`test_pipeline_integration.py`) that exercises the full pipeline end-to-end with mocked LLM responses: config loading, file I/O, multi-round council execution, diff explanation, and JSON output verification.
All LLM calls are mocked in tests. No API keys are required to run the test suite.
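A mocked judge call might look like this minimal sketch with `unittest.mock.AsyncMock`; the real suite's fixtures and helper names (e.g. `run_evaluation` here) will differ:

```python
import asyncio
from unittest.mock import AsyncMock

async def run_evaluation(llm_call, rubric: str, answer: str) -> dict:
    """Stand-in for the pipeline step that awaits one judge's structured output."""
    return await llm_call(rubric=rubric, answer=answer)

def test_judge_evaluation_is_mocked() -> None:
    # AsyncMock lets the test await the "LLM" without any network or API key
    fake_llm = AsyncMock(return_value={"improved_rubric": None, "reasoning": "ok"})
    result = asyncio.run(run_evaluation(fake_llm, "rubric v1", "student answer"))
    assert result["improved_rubric"] is None
    fake_llm.assert_awaited_once_with(rubric="rubric v1", answer="student answer")

test_judge_evaluation_is_mocked()
```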
Linting and formatting:
```bash
make lint      # Check only
make lint-fix  # Auto-fix
```

| Package | Purpose |
|---|---|
| `langchain-core` / `langchain-community` | LLM abstraction layer, `with_structured_output()` for Pydantic schema validation |
| `litellm` | Provider-agnostic LLM routing (OpenAI, Anthropic, Google, Azure, etc.) |
| `pydantic` | Configuration validation and LLM response schemas |
| `structlog` | Structured logging with timestamps and key-value context |
| `pyyaml` | YAML configuration loading |
| `python-dotenv` | Load API keys from `.env` files |
Dev dependencies: `ruff` (linting/formatting), `coverage` (test coverage).