eigenhelm¶

Catch low-quality AI-generated code before it lands.

The problem¶

AI coding agents produce working code fast. But "working" isn't the same as "good." Tests pass, the diff looks plausible, and it gets merged — but the structure is off. Functions do too much. Patterns repeat where they should be abstracted. Complexity concentrates in the wrong places.

Humans catch this in review — when they have time. As agents write more code faster, review gets shallower. Quality drifts toward whatever the model's training data looks like, which is GitHub average.

LLM-based reviewers (CodeRabbit, Copilot review) help, but they reason from text, not structure. They share the same blind spots as the agent that wrote the code. And their feedback is non-deterministic — run the same review twice, get different comments.

Why "good enough" isn't¶

The case against caring about code quality is familiar: it works, tests pass, we have deadlines. But structural quality isn't about aesthetics — it's about what happens next.

Defect density correlates with structural complexity. Code with high cyclomatic density and concentrated Halstead effort produces more post-merge defects. This isn't theory — it's been measured across decades of empirical software engineering research. The modules that "work fine" but have 75-line functions and repeated validation blocks are the ones that break when requirements change.

AI-generated code accelerates the problem. An agent can produce 500 lines in seconds. If 10% of that is structurally unsound, you're accumulating technical debt at machine speed. Without measurement, you won't notice until the cost of changing that code exceeds the cost of rewriting it.

Review doesn't scale to agent output volume. A team that reviews 200 lines per PR carefully can't maintain the same rigor when an agent opens 10 PRs a day. The human eye glazes over. Structural issues that would have been caught in a 50-line diff pass unnoticed in a 500-line one.

The fix is cheap when it's early. An agent that receives a [high] reduce_complexity directive and refactors before the PR exists costs nothing — it's a few seconds of compute. The same structural problem discovered six months later during an incident costs days of debugging.

eigenhelm doesn't ask you to write perfect code. It asks you to measure, so you can make informed tradeoffs instead of uninformed ones.

What eigenhelm does¶

eigenhelm scores code structure using information theory — not an LLM. It parses the AST, extracts a structural fingerprint, and measures how closely the code resembles high-quality open-source projects. The output is a deterministic score, a ranking, and actionable directives pointing to specific code locations.

uv tool install eigenhelm
eh evaluate src/ --rank              # rank files best-to-worst
eh evaluate src/mymodule.py --classify  # single-file classification

Before and after¶

An agent writes a module. eigenhelm evaluates it:

src/pipeline.py
  decision: reject
  score:    0.72 (p12 — worse than 88% of training corpus)
  confidence: high
  contributions:
    manifold_drift           0.22  (weight: 0.30, normalized: 0.73)
    manifold_alignment       0.18  (weight: 0.30, normalized: 0.60)
    token_entropy            0.10  (weight: 0.15, normalized: 0.67)
    compression_structure    0.13  (weight: 0.15, normalized: 0.87)
    ncd_exemplar_distance    0.09  (weight: 0.10, normalized: 0.90)
  directives:
    [high] reduce_complexity → process_batch (lines 15-89)
      #1 cyclomatic_density: contribution=-1.2, deviation=+2.8σ
    [high] extract_repeated_logic → validate_row (lines 42-67)
      #1 wl_hash_bin_44: contribution=-0.9, deviation=+2.1σ
    [medium] review_token_distribution → Pipeline.__init__ (lines 3-14)
      #1 halstead_effort: contribution=-0.6, deviation=+1.4σ

The agent reads the directives: process_batch is too complex, validate_row has repeated structure, the constructor is doing too much. It refactors — splits the batch processor, extracts validation, simplifies init. Tests still pass. Re-evaluate:

src/pipeline.py
  decision: accept
  score:    0.35 (p70 — better than 70% of training corpus)
  confidence: high
  contributions:
    manifold_drift           0.09  (weight: 0.30, normalized: 0.30)
    manifold_alignment       0.08  (weight: 0.30, normalized: 0.27)
    token_entropy            0.06  (weight: 0.15, normalized: 0.40)
    compression_structure    0.07  (weight: 0.15, normalized: 0.47)
    ncd_exemplar_distance    0.05  (weight: 0.10, normalized: 0.50)

Score dropped from 0.72 to 0.35. The code is structurally sound. No human reviewed it.

Important: eigenhelm is a signal, not a judge. Don't loop until accept. Don't optimize for the score. Don't hard-gate merges with default thresholds. Use it to focus attention on the code that needs it most.

In a controlled benchmark (3 scenarios, 6 builds, scored by a separate reviewer not involved in generation), agents using eigenhelm's skill contract produced code rated 46% higher on design, robustness, and spec compliance — with zero correctness regressions. Full benchmark results →

See it on real code: We ran eigenhelm against the official FastAPI full-stack template and found a grab-bag utils.py mixing email, JWT tokens, and SMTP configuration. Splitting the original file produced a new tokens.py scoring 0.59 (marginal), versus 0.89 (reject) for the original mixed-concern file — with zero behavior change. Read the full case study →

Why not just use an LLM reviewer?¶

	eigenhelm	LLM reviewer
Input	AST structure (69-dim vector)	Source text
Deterministic	Yes — same code, same score, every time	No
Trainable on your corpus	Yes — `eh train` on your best code	No
Hard CI gate	Yes — with calibrated thresholds	Suggestions only
Tracks quality over time	Yes — scores are comparable across runs	No stable metric
Catches structural decay	Yes — entropy, compression, manifold distance	Guesses from text
Catches logic bugs	No	Yes
Reviews naming/docs	No	Yes
Cost per evaluation	Zero (local, no API calls)	Per-token LLM cost

They're complementary. eigenhelm runs first — in the agent's inner loop, before a PR exists. LLM reviewers run second, on the PR, where contextual reasoning adds the most value. Full comparison →

How it works¶

eigenhelm extracts a 69-dimensional structural fingerprint from each file using tree-sitter and projects it into a PCA eigenspace trained on high-quality open-source code. The score combines five dimensions:

Dimension	What it measures
Manifold drift	Distance from the learned code quality manifold
Manifold alignment	Alignment with principal quality axes
Token entropy	Information density of the byte stream
Compression structure	Structural regularity (Birkhoff aesthetic measure)
NCD exemplar distance	Similarity to nearest high-quality exemplar

Learn more about the scoring model →

Integrate in 30 seconds¶

GitHub ActionPre-commitHTTP APIAgent skill

- uses: actions/checkout@v4
  with:
    fetch-depth: 0
- uses: metacogdev/eigenhelm@v0
  with:
    diff: origin/main...HEAD
    format: sarif

repos:
  - repo: https://github.com/metacogdev/eigenhelm
    rev: v0.5.0
    hooks:
      - id: eigenhelm-check

eh serve --port 8080
curl -X POST http://localhost:8080/v1/evaluate \
  -H "Content-Type: application/json" \
  -d '{"source": "def add(a, b): return a + b", "language": "python"}'

npx skills add metacogdev/skills
# or: eh skill --install

Outputs¶

Human — readable terminal output with color and classification
JSON — machine-readable for scripting and dashboards
SARIF 2.1.0 — upload to GitHub Code Scanning, VS Code, or any SARIF viewer

Get started¶

Installation

Install via pip, uv, or Docker
Quick Start

Evaluate your first file in 60 seconds
Configuration

Customize thresholds, models, and paths
GitHub Action

Add eigenhelm to your CI pipeline

Supported languages¶

Trained models: Python, JavaScript, TypeScript, Go, Rust.

Parser support (feature extraction available, bring your own model): Java, C, C++, Ruby, Kotlin.

License¶

eigenhelm is licensed under AGPL-3.0. A commercial license is available for proprietary use.