Go
stable
matchspec
Define correctness as code.
go get github.com/greynewell/matchspec
Datasets as code
Define evaluation inputs and expected outputs in Go or YAML. Version them alongside your prompts and models.
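A minimal sketch of the idea in plain Go. The `Example` struct and dataset shape here are illustrative, not matchspec's actual types; the point is that the dataset lives in the repo next to the prompt it evaluates:

```go
package main

import "fmt"

// Example pairs an input with its expected output. The struct name and
// fields are illustrative; matchspec's real types may differ.
type Example struct {
	Input    string
	Expected string
}

// A dataset is just a slice of examples, versioned alongside the prompt
// and model config it evaluates.
var summarization = []Example{
	{Input: "Long article about Go generics...", Expected: "Go generics summary"},
	{Input: "Release notes for Go 1.22...", Expected: "Go 1.22 highlights"},
}

func main() {
	fmt.Printf("dataset: %d examples\n", len(summarization))
}
```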
Composable graders
Exact match, regex, semantic similarity, and LLM-as-judge. Combine graders with weighted scoring.
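A sketch of how weighted grader composition can work. The single-method `Grader` interface is an assumption based on the description above, and `tokenOverlap` is a toy stand-in for real semantic similarity:

```go
package main

import (
	"fmt"
	"strings"
)

// Grader scores an output against an expected value in [0, 1].
// This single-method shape is assumed; matchspec's real interface may differ.
type Grader interface {
	Grade(output, expected string) float64
}

type exactMatch struct{}

func (exactMatch) Grade(out, want string) float64 {
	if out == want {
		return 1
	}
	return 0
}

// tokenOverlap is a toy stand-in for semantic similarity: the fraction
// of expected tokens that appear in the output, case-insensitively.
type tokenOverlap struct{}

func (tokenOverlap) Grade(out, want string) float64 {
	seen := map[string]bool{}
	for _, t := range strings.Fields(strings.ToLower(out)) {
		seen[t] = true
	}
	wantTokens := strings.Fields(strings.ToLower(want))
	if len(wantTokens) == 0 {
		return 0
	}
	hits := 0
	for _, t := range wantTokens {
		if seen[t] {
			hits++
		}
	}
	return float64(hits) / float64(len(wantTokens))
}

// weighted combines graders into a single normalized score.
type weighted struct {
	graders []Grader
	weights []float64
}

func (w weighted) Grade(out, want string) float64 {
	var score, total float64
	for i, g := range w.graders {
		score += w.weights[i] * g.Grade(out, want)
		total += w.weights[i]
	}
	return score / total
}

func main() {
	g := weighted{
		graders: []Grader{exactMatch{}, tokenOverlap{}},
		weights: []float64{0.3, 0.7},
	}
	// Case differs, so exact match scores 0 but token overlap scores 1.
	fmt.Printf("%.2f\n", g.Grade("go generics summary", "Go generics summary"))
}
```

Because `weighted` itself satisfies `Grader`, composed graders nest like any other grader.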
Statistical rigor
Set confidence intervals and minimum sample sizes. Know when your pass rate is meaningful.
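To see why sample size matters, here is a self-contained sketch using a 95% Wilson score interval, which behaves well at small n. This is one standard method, not necessarily the one matchspec uses:

```go
package main

import (
	"fmt"
	"math"
)

// wilson returns a 95% Wilson score confidence interval for a pass rate.
func wilson(passes, n int) (lo, hi float64) {
	if n == 0 {
		return 0, 1
	}
	const z = 1.96 // z-score for 95% confidence
	p := float64(passes) / float64(n)
	nf := float64(n)
	denom := 1 + z*z/nf
	center := (p + z*z/(2*nf)) / denom
	half := z * math.Sqrt(p*(1-p)/nf+z*z/(4*nf*nf)) / denom
	return center - half, center + half
}

func main() {
	// Roughly the same pass rate at two sample sizes: the smaller run
	// produces a much wider interval, so its 0.75 means far less.
	lo1, hi1 := wilson(89, 120)
	lo2, hi2 := wilson(15, 20)
	fmt.Printf("n=120: [%.2f, %.2f]\n", lo1, hi1)
	fmt.Printf("n=20:  [%.2f, %.2f]\n", lo2, hi2)
}
```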
Deployment gates
Run evals in CI. Fail the build when pass rate drops below threshold. No more vibes-based deploys.
HTTP API
Trigger eval runs and retrieve results over HTTP. Integrate with any pipeline.
Zero dependencies
Built on mist-go. No runtime dependencies beyond the standard library.
Example
matchspec
$ matchspec run ./evals/summarization/
loading dataset: 120 examples
running graders: exact_match, semantic_similarity
suite: summarization-v2
─────────────────────────────────────
exact_match 0.74 ✓ (≥0.70)
semantic_similarity 0.91 ✓ (≥0.85)
─────────────────────────────────────
overall PASS
$ echo $?
0
Ship evals before you ship features. matchspec is the enforcement layer between your prompt and your production deployment: define correctness, measure it reproducibly, fail the build when it regresses.
Graders compose. Run `exact_match` and `semantic_similarity` on the same output, weight them, add an `llm_judge` step for anything that needs nuance. The `Grader` interface is a single method, so custom graders are easy to write and test in isolation.
`matchspec run` exits non-zero on failure. No custom CI logic required.