
Rubric Evaluator

An AI-powered application that evaluates and iteratively improves a grading rubric using an LLM council. Multiple LLM judges independently analyze the rubric against real student answers, debate improvements through a structured voting protocol, and converge on an optimized rubric through consensus or synthesis.

Prerequisites

  • Python 3.11+
  • uv package manager
  • An API key for at least one LLM provider (OpenAI, Anthropic, Google, or Azure)

Quick Start

cd rubric_evaluator
make setup                # installs deps, copies config templates (⚠️ overwrites config.yaml and .env)
# Edit .env with your API keys (default config uses both OpenAI and Anthropic)
make run                  # runs the pipeline, output lands in output/<timestamp>/

Output: output/<timestamp>/result.json (improved rubric + explanation) and metadata.json (full pipeline trace).

Available Make Targets

| Command | Description |
| --- | --- |
| make help | Show all available targets |
| make install | Install dependencies via uv sync |
| make setup | Install + copy sample configs to config.yaml and .env |
| make run | Run the pipeline (override config with CONFIG=path.yaml) |
| make test | Run unit and integration tests with coverage |
| make lint | Check linting and formatting (no changes) |
| make lint-fix | Auto-fix linting and formatting issues |
| make clean | Remove build artifacts, caches, and output |

Approach

The core idea is to treat rubric improvement as a multi-agent deliberation problem. Rather than asking a single LLM to improve the rubric in one shot, the application assembles a council of N independently configured LLM judges that each bring a different perspective (different models, temperatures, or provider-specific features like extended thinking). Each student answer acts as a stress test that may expose different rubric weaknesses.

The rubric evolves through sequential rounds (one per student answer), where each round runs parallel evaluation, parallel voting, and optional synthesis. This mimics how a real teaching team would refine a rubric: review it against real answers, discuss, vote, and iterate.

Evaluation Criteria

The rubric is evaluated on three dimensions:

| Criterion | Key Question | Goal |
| --- | --- | --- |
| Ambiguity | Is the rubric formulated with objective, unambiguous criteria? | All graders reach the same interpretation independently |
| Applicability | Does the rubric cover the diversity of possible student responses? | No valid answer type is left unaddressed |
| Discrimination Power | Does the rubric clearly separate excellent from poor work? | High-quality answers score significantly better than weak ones |

Algorithm

Pipeline Overview

```mermaid
flowchart TD
    Load["Load config, rubric,\nexam question, student answers"] --> R1

    subgraph R1 ["Council Round — one per student answer"]
        direction TB
        E1["Each judge independently\nevaluates rubric vs. answer"] --> V1["Each judge votes\nfor best proposal"] --> C1{"Consensus\nreached?"}
        C1 -->|Yes| W1["Accept winning\nproposal as new rubric"]
        C1 -->|No| S1{"Synthesis\nattempts left?"}
        S1 -->|Yes| Syn1["Merge top proposals\ninto single rubric"] -.->|re-evaluate| E1
        S1 -->|No| D1["Drop round,\nkeep previous rubric"]
    end

    R1 -- "improved rubric" --> R2["Round 2 — Student 2"]
    R2 --> dots["⋯"]
    dots --> RS["Round S — Student S"]

    RS -- "final rubric" --> Diff["Compare original vs.\nfinal rubric"]
    Diff --> Out["Export result.json\nand metadata.json"]

    style dots fill:none,stroke:none
```

Each round feeds the improved rubric from the previous round into the next. Student answers are processed sequentially because each round's output is the next round's input.
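The sequential chaining can be sketched as a plain loop. This is an illustrative simplification, not the project's actual API; `run_round` stands in for the full council protocol and returns `None` when a round is dropped.

```python
# Hypothetical sketch: each round's output rubric becomes the next round's
# input; a dropped round (None) leaves the rubric unchanged (rollback).
def run_rounds(rubric: str, student_answers: list[str], run_round) -> str:
    for answer in student_answers:
        result = run_round(rubric, answer)
        if result is not None:
            rubric = result
    return rubric
```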

Council Round (Detail)

Each round runs the following protocol for one student answer:

```mermaid
flowchart TD
    Start(["Begin Round"]) --> Eval

    Eval["All N judges evaluate rubric\nagainst student answer (parallel)"]
    Eval --> NoChange{"All judges propose\nno changes?"}

    NoChange -->|Yes| Unchanged["Rubric unchanged —\nskip to next round"]
    NoChange -->|No| OneJudge{"Only one\njudge?"}

    OneJudge -->|"Yes — auto-accept\nsole proposal"| Winner
    OneJudge -->|No| Vote

    Vote["All judges vote on\nbest proposal (parallel)"]
    Vote --> Validate["Discard invalid votes\n(unrecognized proposal IDs)"]
    Validate --> Tally{"Any proposal meets\nconsensus threshold?"}

    Tally -->|"Yes — clear winner"| Winner
    Tally -->|"Yes — tie"| Tie["Break tie randomly\n(avoid ordering bias)"] --> Winner
    Tally -->|No| SynthLeft{"Synthesis attempts\nremaining?"}

    SynthLeft -->|Yes| Synth["Select top-ranked proposals,\nsynthesize into merged rubric"] -.->|"re-evaluate with\nmerged rubric"| Eval
    SynthLeft -->|No| Drop["Drop round — rollback rubric,\nflag for manual review"]

    Winner["Accept winning proposal\nas new rubric"]
    Winner --> Done(["Round Complete"])
    Unchanged --> Done
    Drop --> Done
```

Step-by-Step Walkthrough

1. Parallel Evaluation. All N judges receive the same inputs (exam question, teaching resource, optimization dimensions, current rubric, student answer) and independently produce a Proposal containing:

  • Per-criterion analysis (issues found + suggested improvements) for ambiguity, applicability, and discrimination power.
  • A complete rewritten rubric, or null if no changes are needed.
  • Reasoning explaining the analysis (generated before the structured fields to improve quality).

All evaluations run concurrently via asyncio.gather().

2. No-Change Short-Circuit. If every judge returns improved_rubric = null (i.e., the rubric adequately handles this student answer), the round ends immediately. No voting occurs, and the rubric passes through unchanged.

3. Single-Judge Fast Path. With N=1, the sole proposal wins automatically. No voting phase is needed.

4. Parallel Voting. Each judge reviews all proposals and votes for the best one (self-voting is allowed). Votes run concurrently. Each vote is validated: if a judge returns a voted_for ID that doesn't match any proposal, the vote is silently discarded. If all votes are invalid, the pipeline raises an error (fail-fast rather than silently producing biased results).
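The validation rule can be sketched as follows. This is an illustrative implementation, not the project's actual code; function and parameter names are assumptions.

```python
# Hypothetical sketch of vote validation: votes whose voted_for ID matches
# no proposal are discarded; if every vote is invalid, fail fast rather
# than silently producing a biased tally.
def validate_votes(votes: dict[str, str], proposal_ids: set[str]) -> dict[str, str]:
    """votes maps judge_id -> voted_for proposal ID."""
    valid = {judge: pid for judge, pid in votes.items() if pid in proposal_ids}
    if votes and not valid:
        raise ValueError("all votes were invalid; refusing to proceed")
    return valid
```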

5. Consensus Check. Votes are tallied. If any proposal receives votes / total >= consensus_threshold, it wins. If multiple proposals meet the threshold (a tie), the winner is chosen randomly to avoid implicit bias toward earlier-listed judges. The winning proposal's rubric replaces the current rubric.

6. Synthesis Fallback. If no proposal meets the threshold, the top-ranked proposals (top two tiers by vote count) are sent to a separate synthesizer LLM, which merges the best elements into a single improved rubric. The round then restarts from step 1 with this synthesized rubric, giving judges a chance to converge.

7. Outlier Detection. If synthesis fails to produce consensus after max_synthesis_attempts retries, the round is dropped as an outlier. The rubric rolls back to its state before the round, and the round is flagged in the output metadata for manual review. This prevents a single adversarial or ambiguous student answer from corrupting the rubric.

8. Final Explanation. After all rounds, a diff explainer (separate LLM) compares the original rubric against the final improved rubric and produces a structured explanation of all changes, organized by the three evaluation criteria.

Design Choices

Why an LLM Council?

A single LLM call risks model-specific biases and blind spots. The council pattern provides:

  • Reduced bias: Multiple models/temperatures surface different issues.
  • Built-in quality gate: Voting ensures only changes that multiple judges endorse are accepted.
  • Graceful degradation: Works with N=1 (single judge, auto-consensus) up to any N.
  • Transparency: The voting record and per-judge reasoning are preserved in metadata.

Why LangChain + LiteLLM?

  • LangChain provides with_structured_output() for native Pydantic schema validation: the LLM returns validated Python objects directly, eliminating manual JSON parsing and error handling.
  • LiteLLM (via ChatLiteLLM) provides provider-agnostic routing: a single interface handles OpenAI, Anthropic, Google, Azure, and others. Provider-specific features (reasoning effort for o-series models, extended thinking for Claude) are passed through model_kwargs without conditional logic.

Why Not LangGraph?

The council algorithm is a sequential round loop with parallel fan-out/fan-in. This maps directly to asyncio.gather() inside a for loop. LangGraph would add a graph abstraction layer without simplifying the control flow, since the pipeline's topology is fixed (not dynamic or conditional at the graph level).

Why Plain asyncio for Parallelism?

Judge evaluations and votes are embarrassingly parallel within a round: same inputs, no dependencies between judges. asyncio.gather() with LangChain's native ainvoke() is the simplest correct solution. Rounds are sequential because each round's rubric depends on the previous round's output.
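The fan-out/fan-in pattern can be sketched like this (a simplification of the actual pipeline; `evaluate` stands in for a judge's `ainvoke()` call):

```python
import asyncio

# Minimal sketch: all judge coroutines receive the same inputs and run
# concurrently; gather() preserves the judges' input order in its results.
async def evaluate(judge_id: str, rubric: str, answer: str) -> str:
    await asyncio.sleep(0)  # placeholder for the actual LLM call
    return f"{judge_id}: proposal"

async def run_council(judges: list[str], rubric: str, answer: str) -> list[str]:
    return await asyncio.gather(*(evaluate(j, rubric, answer) for j in judges))
```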

Why Structured Output with Reasoning First?

All LLM response schemas place the reasoning field before the analysis and rubric fields. This is intentional: LLMs generate tokens sequentially, so producing the reasoning first means the model has already articulated its analysis when it reaches the structured output fields. This consistently improves the quality of the downstream fields.

Why Pre-Transcribed Student Answers?

The input student answers are handwritten PDFs. Rather than coupling the rubric evaluation pipeline to an OCR or vision-model transcription step (which introduces its own error modes), the student answers are pre-transcribed to plain text files. This ensures:

  • Reproducibility: The same input produces the same output regardless of OCR model availability.
  • Separation of concerns: Transcription quality doesn't confound rubric evaluation quality.
  • Simplicity: The pipeline focuses on its core task.

Why Random Tie-Breaking?

When multiple proposals meet the consensus threshold with equal vote shares, the winner is chosen uniformly at random. Deterministic tie-breaking (e.g., first proposal wins) would introduce implicit ordering bias. Randomness ensures fairness across runs.

Why Blind Voting?

During the voting phase, judges are not told which proposal is their own. All proposals are presented with only the judge ID as a label, and the voting prompt does not indicate authorship. This prevents self-preference bias: judges evaluate proposals on merit rather than defaulting to their own output. For the same reason, the sample config uses neutral judge IDs (judge_1, judge_2) instead of model names, since labels like GPT_5_4 or Claude_Sonnet_4_6 could trigger pretraining-based model affinity.

Why Discard Invalid Votes (Not Default to Self-Vote)?

If a judge returns a voted_for value that doesn't match any proposal ID, the vote is discarded entirely. Defaulting to self-vote would silently inflate self-preference. Discarding preserves vote integrity: only intentional, valid votes count.

Why Drop Rounds Instead of Forcing Consensus?

If judges cannot agree even after synthesis, it likely means the student answer is ambiguous or the proposed changes are contentious. Forcing a winner would introduce noise. Dropping the round and rolling back preserves rubric quality. Dropped rounds are flagged in metadata so a human reviewer can investigate.

Why a Separate Diff Explainer?

The per-round analysis captures incremental changes. The diff explainer provides a holistic comparison between the original and final rubric, producing a structured explanation of all changes. Using a separate LLM call (with its own model configuration) ensures the explanation is coherent and not biased by any single round's perspective.

Why a JSON File Cache?

LLM API calls are expensive and slow. The cache stores successfully parsed responses keyed by a SHA-256 hash of the full request parameters (model, messages, schema). Key properties:

  • Only valid responses are cached: Parse failures are never stored, preventing cache poisoning.
  • Deterministic keys: Same request always produces the same cache key.
  • Synchronous writes: Intentional; prevents interleaved writes from concurrent asyncio.gather() coroutines (single-threaded, so sync calls are atomic from the event loop's perspective).
  • Configurable: Set cache_path: null in config to disable entirely.
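The deterministic keying can be sketched as a SHA-256 hash over a canonical JSON serialization of the request. The exact fields the project hashes may differ; this is an assumption-laden illustration.

```python
import hashlib
import json

# Hypothetical cache key: canonical JSON (sorted keys, no whitespace)
# ensures the same request always hashes to the same key.
def cache_key(model: str, messages: list[dict], schema_name: str) -> str:
    payload = json.dumps(
        {"model": model, "messages": messages, "schema": schema_name},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```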

How to Run

Setup

cd rubric_evaluator
make setup

This will:

  1. Install all dependencies via uv sync.
  2. Copy config.sample.yaml to config.yaml.
  3. Copy .env.sample to .env.

Then edit .env with your API keys (the default config uses both OpenAI and Anthropic):

OPENAI_API_KEY=sk-proj-...
ANTHROPIC_API_KEY=sk-ant-api03-...
# GOOGLE_API_KEY=...

Optionally edit config.yaml to change models, number of judges, temperatures, consensus threshold, etc.

Run

make run

Or with a custom config:

make run CONFIG=my_config.yaml

Or directly:

uv run rubric-evaluator config.yaml

Configuration

All parameters are in config.yaml. See config.sample.yaml for annotated defaults.

# Each judge is an independently configured LLM
judges:
  - id: "judge_1"
    llm:
      model: "gpt-5.4"              # Any LiteLLM-supported model
      api_key: "${OPENAI_API_KEY}"   # Resolved from environment
      temperature: 0.3
      reasoning_effort: "medium"     # OpenAI GPT-5+ / o-series only
      max_completion_tokens: 8192

  - id: "judge_2"
    llm:
      model: "anthropic/claude-sonnet-4-6"
      api_key: "${ANTHROPIC_API_KEY}"
      temperature: 0.1
      max_completion_tokens: 8192
      # thinking_budget: 5000       # Anthropic extended thinking

# Council voting parameters
council:
  consensus_threshold: 0.5   # Fraction of votes needed to win (0 < t <= 1)
  max_synthesis_attempts: 2  # Retries before dropping a round (0 = no synthesis, drop immediately on disagreement)

# Synthesis and diff explanation use separate LLM configs
synthesizer:
  model: "gpt-5.4"
  api_key: "${OPENAI_API_KEY}"
  temperature: 0.3
  max_completion_tokens: 8192

diff_explainer:
  model: "anthropic/claude-sonnet-4-6"
  api_key: "${ANTHROPIC_API_KEY}"
  temperature: 0.3
  max_completion_tokens: 8192

# Input file paths (relative to working directory)
inputs:
  exam_question: "data/transcribed/exam_question.txt"
  grading_rubric: "data/transcribed/rubric.txt"
  teaching_resource: "data/transcribed/teaching_resource.txt"
  optimization_dimensions: "data/transcribed/optimization_dimensions.txt"
  student_answers_dir: "data/transcribed/student_answers"  # all .txt files loaded (must contain at least one)

output:
  dir: "output"

# LLM cache (set to null to disable)
cache_path: ".cache"

Environment variables are resolved at load time via ${VAR_NAME} syntax. A missing variable raises a clear error.
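A minimal resolver in this spirit (illustrative, not the project's actual loader):

```python
import os
import re

_PATTERN = re.compile(r"\$\{([A-Za-z0-9_]+)\}")

# Substitutes ${VAR_NAME} with the environment value; a missing variable
# raises immediately with a clear message instead of passing through.
def resolve_env(value: str) -> str:
    def _sub(match: re.Match) -> str:
        name = match.group(1)
        if name not in os.environ:
            raise KeyError(f"environment variable {name} is not set")
        return os.environ[name]
    return _PATTERN.sub(_sub, value)
```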

Output Format

The pipeline writes two files to a timestamped subdirectory under the configured output directory:

output/
  2026-04-11T09-42-21-152946/
    result.json      # Improved rubric + explanation
    metadata.json    # Pipeline run details

A sample run output is included in examples/ for reference.

result.json

Contains the improved rubric and a structured explanation of the changes:

{
  "improved_rubric": "1 point per bad actor action (total = 3 points)\n...",
  "evaluation_explanation": {
    "ambiguity": {
      "issues": [
        "The term 'sufficiently described' is subjective..."
      ],
      "improvements": [
        "Replaced with explicit criteria: must name a specific stakeholder..."
      ]
    },
    "applicability": {
      "issues": ["..."],
      "improvements": ["..."]
    },
    "discrimination_power": {
      "issues": ["..."],
      "improvements": ["..."]
    }
  }
}

metadata.json

Contains the full pipeline trace for auditability:

{
  "original_rubric": "1 point per bad actor action...",
  "rounds_processed": 3,
  "rounds_dropped": [],
  "rubric_versions": [
    {
      "round_index": 1,
      "answer_path": "data/transcribed/student_answers/student1.txt",
      "consensus_reached": true,
      "num_attempts": 1,
      "dropped": false,
      "winning_judge_id": "judge_1",
      "rubric_before": "...",
      "rubric_after": "...",
      "issues": {
        "ambiguity": ["..."],
        "applicability": ["..."],
        "discrimination_power": ["..."]
      },
      "improvements": {
        "ambiguity": ["..."],
        "applicability": ["..."],
        "discrimination_power": ["..."]
      }
    }
  ]
}

Testing

make test

Runs unit and integration tests with coverage reporting. Tests are structured as:

  • Unit tests for each module (models, helpers, cache, judge, council, synthesizer, output, LLM client).
  • Integration test (test_pipeline_integration.py) that exercises the full pipeline end-to-end with mocked LLM responses: config loading, file I/O, multi-round council execution, diff explanation, and JSON output verification.

All LLM calls are mocked in tests. No API keys are required to run the test suite.

Linting and formatting:

make lint       # Check only
make lint-fix   # Auto-fix

Dependencies

| Package | Purpose |
| --- | --- |
| langchain-core / langchain-community | LLM abstraction layer, with_structured_output() for Pydantic schema validation |
| litellm | Provider-agnostic LLM routing (OpenAI, Anthropic, Google, Azure, etc.) |
| pydantic | Configuration validation and LLM response schemas |
| structlog | Structured logging with timestamps and key-value context |
| pyyaml | YAML configuration loading |
| python-dotenv | Load API keys from .env files |

Dev dependencies: ruff (linting/formatting), coverage (test coverage).
