Go
stable
matchspec
Define correctness as code.
go get github.com/greynewell/matchspec
Datasets as code
Define evaluation inputs and expected outputs in Go or YAML. Version them alongside your prompts and models.
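A minimal sketch of the idea in plain Go. The `Example` struct and dataset shape here are illustrative, not matchspec's actual types; the point is that the dataset lives in the repo next to the prompt it evaluates:

```go
package main

import "fmt"

// Example pairs an input with its expected output. The struct name and
// fields are illustrative; matchspec's real types may differ.
type Example struct {
	Input    string
	Expected string
}

// A dataset is just a slice of examples, versioned alongside the prompt
// and model config it evaluates.
var summarization = []Example{
	{Input: "Long article about Go generics...", Expected: "Go generics summary"},
	{Input: "Release notes for Go 1.22...", Expected: "Go 1.22 highlights"},
}

func main() {
	fmt.Printf("dataset: %d examples\n", len(summarization))
}
```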
Composable graders
Exact match, regex, semantic similarity, and LLM-as-judge. Combine graders with weighted scoring.
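A sketch of how weighted grader composition can work. The single-method `Grader` interface is an assumption based on the description above, and `tokenOverlap` is a toy stand-in for real semantic similarity:

```go
package main

import (
	"fmt"
	"strings"
)

// Grader scores an output against an expected value in [0, 1].
// This single-method shape is assumed; matchspec's real interface may differ.
type Grader interface {
	Grade(output, expected string) float64
}

type exactMatch struct{}

func (exactMatch) Grade(out, want string) float64 {
	if out == want {
		return 1
	}
	return 0
}

// tokenOverlap is a toy stand-in for semantic similarity: the fraction
// of expected tokens that appear in the output, case-insensitively.
type tokenOverlap struct{}

func (tokenOverlap) Grade(out, want string) float64 {
	seen := map[string]bool{}
	for _, t := range strings.Fields(strings.ToLower(out)) {
		seen[t] = true
	}
	wantTokens := strings.Fields(strings.ToLower(want))
	if len(wantTokens) == 0 {
		return 0
	}
	hits := 0
	for _, t := range wantTokens {
		if seen[t] {
			hits++
		}
	}
	return float64(hits) / float64(len(wantTokens))
}

// weighted combines graders into a single normalized score.
type weighted struct {
	graders []Grader
	weights []float64
}

func (w weighted) Grade(out, want string) float64 {
	var score, total float64
	for i, g := range w.graders {
		score += w.weights[i] * g.Grade(out, want)
		total += w.weights[i]
	}
	return score / total
}

func main() {
	g := weighted{
		graders: []Grader{exactMatch{}, tokenOverlap{}},
		weights: []float64{0.3, 0.7},
	}
	// Case differs, so exact match scores 0 but token overlap scores 1.
	fmt.Printf("%.2f\n", g.Grade("go generics summary", "Go generics summary"))
}
```

Because `weighted` itself satisfies `Grader`, composed graders nest like any other grader.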
Statistical rigor
Set confidence intervals and minimum sample sizes. Know when your pass rate is meaningful.
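To see why sample size matters, here is a self-contained sketch using a 95% Wilson score interval, which behaves well at small n. This is one standard method, not necessarily the one matchspec uses:

```go
package main

import (
	"fmt"
	"math"
)

// wilson returns a 95% Wilson score confidence interval for a pass rate.
func wilson(passes, n int) (lo, hi float64) {
	if n == 0 {
		return 0, 1
	}
	const z = 1.96 // z-score for 95% confidence
	p := float64(passes) / float64(n)
	nf := float64(n)
	denom := 1 + z*z/nf
	center := (p + z*z/(2*nf)) / denom
	half := z * math.Sqrt(p*(1-p)/nf+z*z/(4*nf*nf)) / denom
	return center - half, center + half
}

func main() {
	// Roughly the same pass rate at two sample sizes: the smaller run
	// produces a much wider interval, so its 0.75 means far less.
	lo1, hi1 := wilson(89, 120)
	lo2, hi2 := wilson(15, 20)
	fmt.Printf("n=120: [%.2f, %.2f]\n", lo1, hi1)
	fmt.Printf("n=20:  [%.2f, %.2f]\n", lo2, hi2)
}
```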
Deployment gates
Run evals in CI. Fail the build when pass rate drops below threshold. No more vibes-based deploys.
HTTP API
Trigger eval runs and retrieve results over HTTP. Integrate with any pipeline.
Zero dependencies
Built on mist-go. No runtime dependencies beyond the standard library.
Example
matchspec
$ matchspec run ./evals/summarization/
loading dataset: 120 examples
running graders: exact_match, semantic_similarity
suite: summarization-v2
─────────────────────────────────────
exact_match 0.74 ✓ (≥0.70)
semantic_similarity 0.91 ✓ (≥0.85)
─────────────────────────────────────
overall PASS
$ echo $?
0
Ship evals before you ship features. matchspec is the enforcement layer between your prompt and your production deployment: define correctness, measure it reproducibly, fail the build when it regresses.
Graders compose. Run `exact_match` and `semantic_similarity` on the same output, weight them, add an `llm_judge` step for anything that needs nuance. The `Grader` interface is a single method, so custom graders are easy to write and test in isolation.
`matchspec run` exits non-zero on failure. No custom CI logic required.