See through the mist.
Eval-driven infrastructure for AI systems.
go get github.com/greynewell/mist-go
matchspec
Eval framework for AI systems. Define datasets, write graders, run suites, gate deployments on passing thresholds.
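A minimal sketch of the gating idea, in plain Go with no dependencies. The Case, Grader, RunSuite, and Gate names are hypothetical illustrations, not matchspec's actual API. A suite grades each case, computes a pass rate, and blocks deployment when that rate falls below a threshold.

```go
package main

import (
	"fmt"
	"strings"
)

// Case is one dataset entry: an input and the expected output.
// Hypothetical type for illustration, not matchspec's API.
type Case struct {
	Input    string
	Expected string
}

// Grader scores a model output against a case; true means pass.
type Grader func(c Case, output string) bool

// exactMatch passes when output equals Expected, ignoring surrounding
// whitespace and letter case.
func exactMatch(c Case, output string) bool {
	return strings.EqualFold(strings.TrimSpace(output), strings.TrimSpace(c.Expected))
}

// RunSuite runs every case through the model, grades it, and returns the
// fraction of cases that passed (0 for an empty suite).
func RunSuite(cases []Case, model func(string) string, grade Grader) float64 {
	if len(cases) == 0 {
		return 0
	}
	passed := 0
	for _, c := range cases {
		if grade(c, model(c.Input)) {
			passed++
		}
	}
	return float64(passed) / float64(len(cases))
}

// Gate reports whether the pass rate meets the deployment threshold.
func Gate(passRate, threshold float64) bool {
	return passRate >= threshold
}

func main() {
	cases := []Case{
		{Input: "2+2", Expected: "4"},
		{Input: "capital of France", Expected: "Paris"},
		{Input: "3*3", Expected: "9"},
		{Input: "opposite of hot", Expected: "cold"},
	}
	// A stub "model" that answers three of the four cases correctly.
	answers := map[string]string{"2+2": "4", "capital of France": " paris ", "3*3": "9", "opposite of hot": "warm"}
	model := func(in string) string { return answers[in] }

	rate := RunSuite(cases, model, exactMatch)
	fmt.Printf("pass rate %.2f, deploy=%v\n", rate, Gate(rate, 0.9))
}
```

With the stub model, three of four cases pass (0.75), so a 0.9 threshold blocks the deploy.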
infermux
Inference router for AI systems. Load balance across providers, fail over automatically, cost-optimize model calls.
schemaflux
Structured data compiler. Parse markdown and frontmatter, run a 12-pass IR pipeline, emit typed output.
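A sketch of only the first step, splitting `---`-delimited frontmatter from the markdown body. SplitFrontmatter and ParseFields are hypothetical helpers, not schemaflux's API. ParseFields handles flat `key: value` lines only, where a real implementation would use a YAML parser, and the 12 IR passes are not shown.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// SplitFrontmatter separates a leading "---"-delimited frontmatter block
// from the markdown body. Documents without frontmatter are returned
// unchanged as the body; an opening "---" with no closing line is an error.
func SplitFrontmatter(doc string) (front, body string, err error) {
	lines := strings.Split(strings.ReplaceAll(doc, "\r\n", "\n"), "\n")
	if strings.TrimSpace(lines[0]) != "---" {
		return "", doc, nil
	}
	for i := 1; i < len(lines); i++ {
		if strings.TrimSpace(lines[i]) == "---" {
			front = strings.Join(lines[1:i], "\n")
			body = strings.Join(lines[i+1:], "\n")
			return front, body, nil
		}
	}
	return "", "", errors.New("unterminated frontmatter")
}

// ParseFields reads flat "key: value" lines into a map. Nested YAML,
// lists, and quoting are deliberately not handled in this sketch.
func ParseFields(front string) map[string]string {
	fields := map[string]string{}
	for _, line := range strings.Split(front, "\n") {
		key, val, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		fields[strings.TrimSpace(key)] = strings.TrimSpace(val)
	}
	return fields
}

func main() {
	doc := "---\ntitle: Hello\ntags: go, ai\n---\n# Heading\nBody text.\n"
	front, body, err := SplitFrontmatter(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(ParseFields(front)["title"])
	fmt.Print(body)
}
```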
tokentrace
Observability for AI inference. Track cost, latency, and quality across your stack. Alert on regressions.
AI Agents
Cascading errors, context rot, and agents that ignore instructions. What the benchmarks don't show about production agents.
Model Harnesses
Bad data, silent failures, and catastrophic forgetting. The real problems behind fine-tuning that tutorials skip.
RL Environments
Reward hacking, reproducibility crises, and the debugging abyss. Why RL training fails and what's missing from the toolchain.
swe-bench-fast
SWE-bench eval harness with native ARM64 containers. 6.3x test runner speedup on Apple Silicon and AWS Graviton.