Cryptographic provenance for LLM inference
You have no proof your LLM provider ran the model they claim. CommitLLM is a cryptographic commit-and-audit protocol that closes that gap: the provider serves normally on GPU and returns a compact receipt. A verifier checks the receipt and opened trace on CPU.
- Routine audit (Llama 70B) 1.3 ms/tok
- Online tracing overhead ~12–14%
- Full audit (1 tok, 70B) ~10 ms
- Within 1 quant bucket >99.8%
- Verifier CPU only
- Provider Normal GPU
Between fingerprints and zero-knowledge proofs
Two unsatisfying extremes—and a design point between them where real deployments need to live.
Fingerprinting
Statistical heuristics provide evidence but not exact per-response verification. A determined provider can game them.
Commit-and-audit
Commitment-bound end-to-end. Information-theoretically sound algebraic checks for large linear layers, canonical replay for supported nonlinear components, CPU-only verification.
ZK proofs
Strong proof objects, but prover costs remain too high for production LLM serving. Impractical at scale today.
Setup once. Commit every response. Verify on challenge.
The verifier holds a secret key derived from public weights. The provider commits during normal inference. Expensive work happens only when challenged.
Build the verifier key
From a public checkpoint, the verifier computes a Merkle root over weights, secret Freivalds vectors for eight matrix families (Wq, Wk, Wv, Wo, Wgate, Wup, Wdown, LM_head), and the model configuration needed for canonical replay.
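The weight-commitment half of this setup can be pictured as a toy Merkle tree; SHA-256 and the duplicate-last-node rule are illustrative assumptions here, not necessarily the keygen crate's actual layout, and the secret Freivalds vectors for the eight matrix families would be drawn in the same pass:

```python
import hashlib

def sha256(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def merkle_root(weight_shards: list) -> bytes:
    """Merkle root over weight shards; odd levels duplicate the last node.
    Illustrative tree shape, not necessarily verilm-keygen's exact layout."""
    level = [sha256(s) for s in weight_shards]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Toy checkpoint: four byte strings stand in for real tensor data.
root = merkle_root([b"Wq", b"Wk", b"Wv", b"Wo"])
```

Any change to any shard changes the root, so the verifier key pins one exact checkpoint.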
Serve normally, return a receipt
The provider runs inference on the normal GPU path with a tracing sidecar that captures retained state. It returns the response plus a compact receipt binding the execution trace, KV state, deployment manifest, prompt, sampling randomness, and token count.
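One way to picture the receipt is as a single commitment over its bound fields via a canonical encoding; the field names and the JSON encoding below are illustrative stand-ins, not CommitLLM's actual wire format:

```python
import hashlib
import json

def receipt_commitment(receipt: dict) -> str:
    """Bind all receipt fields into one hash via canonical JSON
    (sorted keys, fixed separators), so any field change is detectable.
    Illustrative encoding, not CommitLLM's real wire format."""
    canon = json.dumps(receipt, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(canon).hexdigest()

# Hypothetical field names; the real receipt's schema is not reproduced here.
receipt = {
    "trace_root": "a1" * 32,     # commitment to the execution trace
    "kv_root": "b2" * 32,        # commitment to retained KV state
    "manifest_hash": "c3" * 32,  # deployment manifest
    "prompt_hash": "d4" * 32,
    "sampling_seed": 1234,
    "token_count": 57,
}
commitment = receipt_commitment(receipt)
```

Because the encoding is canonical, two receipts with the same fields always hash identically, and tampering with any bound field changes the commitment.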
Challenge specific positions and layers
The verifier challenges token positions and layers after the commitment. The provider opens the requested region. Routine audit samples prefix state; deep audit opens everything.
CPU-only checks
Embedding Merkle proof. Freivalds on shell matmuls. Exact INT8 bridge recomputation. KV provenance. Attention replay against committed post-attention output. Final-token tail from captured residual. LM-head binding. Decode and output policy replay.
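Of these checks, the INT8 bridge is the simplest to picture: dequantization is a deterministic integer-to-float map, so a CPU verifier can recompute it exactly. A sketch assuming a symmetric per-tensor scale (the real quantization layout is receipt-bound and may differ):

```python
def dequantize_int8(q: list, scale: float) -> list:
    """Exact INT8 -> float bridge. Deterministic, so the verifier can
    recompute it bit-for-bit on CPU. Symmetric per-tensor scale assumed."""
    return [qi * scale for qi in q]

def check_bridge(q, scale, claimed) -> bool:
    """Reject if the provider's claimed dequantized values differ at all."""
    return dequantize_int8(q, scale) == claimed
```

The point is that this boundary admits an exact equality check, unlike the floating-point attention interior.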
What is exact, approximate, and statistical
Commitment-bound end-to-end, with explicit boundaries for each verification class. Not “uniformly exact”—honestly delineated.
The attention interior remains approximate because native GPU FP16/BF16 attention is not bit-reproducible across devices or even across runs. CommitLLM constrains it strongly—shell-verified Q, K, and V on both sides, commitment-verified prefix state, independent verifier replay, cross-layer consistency through the residual stream—but does not pretend it is exact. In routine audit mode, prefix/KV provenance is statistical: Merkle binding is exact, sampled positions are shell-verified exactly, but unopened positions are covered probabilistically. Deep audit upgrades this to exact full-prefix verification. The honest claim is not “uniformly exact end-to-end” but a precisely delineated guarantee boundary.
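The bounded replay can be pictured as a per-position tolerance test against the committed post-attention output, with the tolerance expressed in quantization buckets; the exact corridor rule is a protocol detail not reproduced here:

```python
def attention_corridor_check(replayed, committed, scale, max_buckets=1):
    """Accept iff every replayed activation lands within max_buckets
    quantization buckets of the committed value (bucket width = scale).
    Illustrative tolerance rule, not the protocol's exact corridor."""
    return all(abs(r - c) <= max_buckets * scale
               for r, c in zip(replayed, committed))
```

A widening corridor, rather than exact equality, is what lets a CPU replay bound a non-bit-reproducible GPU computation.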
Routine audit stays cheap. Deep audit upgrades coverage.
CommitLLM uses the same receipt in both modes. Routine audit keeps steady-state verification light; deep audit opens the full retained window and upgrades prefix provenance to exact verification.
Low-friction spot checks
Designed for normal operation when you want frequent verification without opening the full trace every time.
- Freivalds-based checks on large linear layers
- Canonical replay for supported nonlinear subcomputations
- Sampled prefix and KV provenance with statistical coverage
- Bounded approximate attention replay on CPU
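The statistical coverage of the sampled prefix checks is easy to quantify: if a dishonest provider tampers with a fraction f of prefix positions and the verifier opens k positions uniformly at random, the tampering escapes detection with probability (1 − f)^k. A quick calculation with illustrative parameters:

```python
def miss_probability(tampered_fraction: float, sampled_positions: int) -> float:
    """Chance that uniform sampling opens no tampered position."""
    return (1.0 - tampered_fraction) ** sampled_positions

# Tampering with 10% of positions survives 32 uniform samples ~3.4% of the time;
# doubling the sample count squares the miss probability.
p_miss = miss_probability(0.10, 32)
```

This is the gap that deep audit closes by opening the full retained window.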
Escalate when the stakes are higher
Use the same commitment, but require a larger opening. This removes the routine-audit statistical gap on the retained prefix window.
- Full-prefix and KV openings across the retained audit window
- Exact prefix provenance instead of sampled coverage
- The same algebraic, replay, and decode checks as routine audit
- Higher bandwidth and storage cost, not a different serving path
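The bandwidth cost is easy to estimate from the model shape. A back-of-the-envelope sketch, with Llama-70B-like shape numbers (80 layers, 8 KV heads under GQA, head dimension 128) and an 8-bit KV cache as assumptions; an FP16 cache would double the figure:

```python
def kv_opening_bytes(layers: int, window_tokens: int, kv_heads: int,
                     head_dim: int, bytes_per_elt: int = 1) -> int:
    """Rough size of a full KV opening over the retained window.
    Factor 2 covers K and V; bytes_per_elt=1 assumes an 8-bit cache."""
    return 2 * layers * window_tokens * kv_heads * head_dim * bytes_per_elt

# Llama-70B-like shape: a 4096-token window opens to roughly 640 MiB.
size = kv_opening_bytes(80, 4096, 8, 128)
```

Large, but a one-off transfer on challenge rather than a per-response cost.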
Verify huge matrix multiplies cheaply
The provider claims z = Wx for a public weight matrix W.
Recomputing the full product is expensive. Freivalds' algorithm gives a much cheaper check:
the verifier precomputes v = rᵀW with a secret random vector r,
then checks v·x = rᵀz in the finite field
F_p, where p = 2^32 − 5.
If z ≠ Wx, the check detects the error with probability ≥ 1 − 1/p.
This is information-theoretically sound. Transformers are mostly matrix multiplication;
once those multiplies are cheap to audit, the verifier can check model identity
without rerunning the full model.
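The check above can be sketched end to end with plain Python integers (which avoid overflow in the modular arithmetic); matrix shapes are toy-sized, whereas the real protocol runs this over the eight committed matrix families:

```python
import random

P = 2**32 - 5  # prime modulus of the finite field F_p

def keygen(W, rng, p=P):
    """Verifier setup: secret r and precomputed fingerprint v = r^T W (mod p)."""
    m, n = len(W), len(W[0])
    r = [rng.randrange(p) for _ in range(m)]
    v = [sum(r[i] * W[i][j] for i in range(m)) % p for j in range(n)]
    return r, v

def freivalds_check(r, v, x, z, p=P):
    """Accept iff v.x == r.z (mod p); a wrong z slips through w.p. <= 1/p."""
    lhs = sum(vj * xj for vj, xj in zip(v, x)) % p
    rhs = sum(ri * zi for ri, zi in zip(r, z)) % p
    return lhs == rhs

# Honest provider: z = Wx passes; a tampered z is caught.
W = [[3, 1, 4], [1, 5, 9]]
x = [2, 7, 1]
z = [sum(wij * xj for wij, xj in zip(row, x)) % P for row in W]  # [17, 46]
r, v = keygen(W, random.Random(0))
ok = freivalds_check(r, v, x, z)                   # True
bad = freivalds_check(r, v, x, [z[0] + 1, z[1]])   # False unless r[0] = 0 (prob. 1/p)
```

Note that v·x costs one inner product, independent of the matrix height, which is why the verifier stays CPU-only.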
Performance on the corrected replay path
Measured on Qwen2.5-7B-W8A8 and Llama-3.1-8B-W8A8. Attention replay mismatch is bounded and single-digit in quantization buckets.
Figures: verifier cost (Llama 70B), online tracing overhead, and attention corridors for Qwen2.5-7B-W8A8 and Llama-3.1-8B-W8A8.
Built to sit beside real serving stacks
CommitLLM is not a replacement inference engine. The provider keeps the normal GPU path and produces request-scoped evidence alongside it.
Continuous batching and paged attention
Many user requests can share the same GPU microbatch. CommitLLM still produces per-request receipts and per-request audits.
Tensor parallelism and fused kernels
The tracing layer follows the existing execution path instead of replacing production kernels with proof-friendly substitutes.
Quantized serving
Quantization metadata is receipt-bound, and the kept path is measured on production-style W8A8 checkpoints.
Cross-request cache reuse and shortcut decoding
Cross-request prefix caching, speculative decoding, and other semantics-changing shortcuts need more protocol work. Unsupported paths should fail closed.
Four specs, one receipt
CommitLLM binds the entire deployment surface that affects outputs—not just “some model ran.”
| Spec | What it binds |
|---|---|
| input_spec_hash | Tokenizer, chat template, BOS/EOS, truncation, padding, system prompt |
| model_spec_hash | Checkpoint identity (weight Merkle root), quantization, LoRA/adapter, RoPE config, RMSNorm ε |
| decode_spec_hash | Sampler, temperature, top-k/p, penalties, logit bias, grammar, stop rules |
| output_spec_hash | Detokenization, cleanup, whitespace normalization |
Provenance for every deployment
Enterprise procurement
Paying for Llama 70B? Get proof the provider actually served that checkpoint, not a smaller distillation.
Regulated deployments
Banks, hospitals, legal teams—auditable chain from decision to model version, decode policy, and output.
Decentralized compute
Networks like Gensyn, Ritual, or Bittensor cannot rely on “trust the node.” CommitLLM provides the missing layer.
Agent systems
When an agent takes action, which model produced the decision becomes a liability and governance question.
Paper summary
Large language models are increasingly used in settings where integrity matters, but users still lack technical assurance that a provider actually ran the claimed model, decode policy, and output behavior. Fingerprinting and statistical heuristics can provide signals, but not exact per-response verification. Zero-knowledge proof systems provide stronger guarantees, but at prover costs that remain impractical for production LLM serving.
We present CommitLLM, a cryptographic commit-and-audit protocol for open-weight LLM inference. CommitLLM keeps the provider on the normal serving path and keeps verifier work fast and CPU-only. It combines commitment binding, direct audit, and randomized algebraic fingerprints, including Freivalds-style checks for large matrix products, rather than per-response proof generation or full re-execution. Its main costs are retained-state memory over the audit window and audit bandwidth, not per-response proving.
The protocol is commitment-bound end-to-end. Within that binding, large linear layers are verified by verifier-secret, information-theoretically sound algebraic checks, quantization/dequantization boundaries and supported nonlinear subcomputations are checked by canonical re-execution, attention is verified by bounded approximate replay, and routine prefix-state provenance is statistical unless deep audit is used. Unsupported semantics fail closed.
Code layout
The public project name is CommitLLM. Some internal crate and package paths still use the legacy verilm-* prefix while the rename is being completed.
| Component | Path |
|---|---|
| Core types and traits | crates/verilm-core |
| Key generation | crates/verilm-keygen |
| Verifier | crates/verilm-verify |
| Prover (Rust) | crates/verilm-prover |
| Python sidecar | sidecar/ |
| Python bindings | crates/verilm-py |
| Test vectors | crates/verilm-test-vectors |
| Lean formalization | lean/ |
| Paper | paper/main.pdf |