Active development · Lean 4 formalization in progress

Cryptographic provenance for LLM inference

You have no proof your LLM provider ran the model they claim. CommitLLM is a cryptographic commit-and-audit protocol that closes that gap: the provider serves normally on GPU and returns a compact receipt. A verifier checks the receipt and opened trace on CPU.

  • Linear shell: algebraic checks
  • Nonlinear shell: canonical replay
  • Attention: bounded approximate replay
  • Prefix/KV: statistical unless deep audit
Measured on the kept path
  • Routine audit (Llama 70B): 1.3 ms/tok
  • Online tracing overhead: ~12–14%
  • Full audit (1 tok, 70B): ~10 ms
  • Within 1 quant bucket: >99.8%
  • Verifier: CPU only
  • Provider: normal GPU path

Between fingerprints and zero-knowledge proofs

Two unsatisfying extremes—and a design point between them where real deployments need to live.

Insufficient

Fingerprinting

Statistical heuristics provide evidence but not exact per-response verification. A determined provider can game them.

CommitLLM

Commit-and-audit

Commitment-bound end-to-end. Information-theoretically sound algebraic checks for large linear layers, canonical replay for supported nonlinear components, CPU-only verification.

Impractical

ZK proofs

Strong proof objects, but prover costs remain too high for production LLM serving at today's scale.


Setup once. Commit every response. Verify on challenge.

The verifier holds a secret key derived from public weights. The provider commits during normal inference. Expensive work happens only when challenged.

Phase 0 · Setup

Build the verifier key

From a public checkpoint, the verifier computes a Merkle root over weights, secret Freivalds vectors for eight matrix families (Wq, Wk, Wv, Wo, Wgate, Wup, Wdown, LM_head), and the model configuration needed for canonical replay.
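Phase 0 can be sketched as follows. The Merkle layout, the hashing of matrices via their text encoding, and the function names are illustrative assumptions, not the project's API (the real key generation lives in crates/verilm-keygen and operates on tensor shards):

```python
# Hypothetical sketch of Phase 0: a Merkle root over weight leaves plus one
# secret Freivalds vector per matrix family. Names and encodings are
# illustrative, not CommitLLM's actual keygen.
import hashlib
import random

P = 2**32 - 5  # prime field modulus used for the algebraic checks

def merkle_root(leaves):
    """Binary Merkle tree over raw leaf byte strings."""
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:               # duplicate last node on odd levels
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

def freivalds_vector(W):
    """Secret r and precomputed v = r^T W over F_p for one matrix family."""
    r = [random.randrange(P) for _ in W]
    v = [sum(r[i] * W[i][j] for i in range(len(W))) % P
         for j in range(len(W[0]))]
    return r, v

# The eight matrix families named in the text above.
FAMILIES = ["Wq", "Wk", "Wv", "Wo", "Wgate", "Wup", "Wdown", "LM_head"]

def build_verifier_key(weights):
    """weights: dict mapping family name -> matrix (list of rows of ints)."""
    leaves = [repr(weights[f]).encode() for f in FAMILIES]
    return {
        "weights_root": merkle_root(leaves),
        "freivalds": {f: freivalds_vector(weights[f]) for f in FAMILIES},
    }
```

The key is built once per checkpoint; the secret `r` vectors never leave the verifier, which is what makes the later algebraic checks sound.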

Phase 1 · Commit

Serve normally, return a receipt

The provider runs inference on the normal GPU path with a tracing sidecar that captures retained state. It returns the response plus a compact receipt binding the execution trace, KV state, deployment manifest, prompt, sampling randomness, and token count.
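A receipt of this shape could look like the sketch below; the field names and hash layout are illustrative assumptions, not the on-wire format:

```python
# Hypothetical receipt layout for Phase 1, showing what the commitment binds.
# Field names are illustrative, not CommitLLM's wire format.
import hashlib
from dataclasses import dataclass

def h(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

@dataclass
class Receipt:
    trace_root: bytes       # Merkle root over the retained execution trace
    kv_root: bytes          # Merkle root over retained KV state
    manifest_hash: bytes    # deployment manifest (the four spec hashes)
    prompt_hash: bytes      # the exact prompt served
    rng_commitment: bytes   # sampling randomness
    token_count: int

    def commitment(self) -> bytes:
        """Single hash binding the whole receipt; audits open against this."""
        return h(self.trace_root, self.kv_root, self.manifest_hash,
                 self.prompt_hash, self.rng_commitment,
                 self.token_count.to_bytes(4, "big"))
```

The point of the single commitment is that any later opening (Phase 2) is checked against a value the provider fixed before it knew what would be challenged.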

Phase 2 · Audit

Challenge specific positions and layers

The verifier challenges token positions and layers after the commitment. The provider opens the requested region. Routine audit samples prefix state; deep audit opens everything.
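One way to draw challenges only after the commitment is fixed is to seed them from the commitment plus a verifier nonce; this derivation scheme is an illustrative assumption, not the protocol's specification:

```python
# Hypothetical Phase 2 challenge sampler: (position, layer) pairs are derived
# from SHA-256 of (commitment || nonce || counter), so they cannot be known
# before the commitment exists. Names are illustrative.
import hashlib

def sample_challenge(commitment: bytes, nonce: bytes,
                     n_tokens: int, n_layers: int, k: int = 4):
    """Pick k distinct (token position, layer) pairs for the provider to open."""
    assert k <= n_tokens * n_layers
    picks, counter = [], 0
    while len(picks) < k:
        digest = hashlib.sha256(
            commitment + nonce + counter.to_bytes(4, "big")).digest()
        pos = int.from_bytes(digest[:4], "big") % n_tokens
        layer = int.from_bytes(digest[4:8], "big") % n_layers
        if (pos, layer) not in picks:
            picks.append((pos, layer))
        counter += 1
    return picks
```

Because the draw is deterministic given (commitment, nonce), both parties can reproduce exactly which region was challenged.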

Phase 3 · Verify

CPU-only checks

  • Embedding Merkle proof
  • Freivalds on shell matmuls
  • Exact INT8 bridge recomputation
  • KV provenance
  • Attention replay against committed post-attention output
  • Final-token tail from captured residual
  • LM-head binding
  • Decode and output policy replay


What is exact, approximate, and statistical

Commitment-bound end-to-end, with explicit boundaries for each verification class. Not “uniformly exact”—honestly delineated.

Component      Verification class
Input          Exact
Embedding      Exact
Shell matmuls  Freivalds
INT8 bridges   Exact
Prefix/KV      Statistical*
Attention      Approximate (FP16/BF16)
Final tail     Exact
LM head        Freivalds
Decode         Fail-closed

* Statistical in routine audit; upgraded to exact in deep audit.

The attention interior remains approximate because native GPU FP16/BF16 attention is not bit-reproducible across devices or even across runs. CommitLLM constrains it strongly—shell-verified Q, K, and V on both sides, commitment-verified prefix state, independent verifier replay, cross-layer consistency through the residual stream—but does not pretend it is exact. In routine audit mode, prefix/KV provenance is statistical: Merkle binding is exact, sampled positions are shell-verified exactly, but unopened positions are covered probabilistically. Deep audit upgrades this to exact full-prefix verification. The honest claim is not “uniformly exact end-to-end” but a precisely delineated guarantee boundary.
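The "within 1 quant bucket" acceptance rule from the stats above can be sketched as follows; the symmetric quantizer and tolerance policy here are illustrative assumptions, not the protocol's exact corridor definition:

```python
# Sketch of a bucket-corridor check: the verifier's CPU attention replay is
# compared to the committed output after INT8-style quantization, tolerating
# at most one bucket of disagreement per element. Illustrative, not the
# protocol's exact rule.
def quantize(x: float, scale: float) -> int:
    """Symmetric INT8-style quantization to a bucket index in [-128, 127]."""
    q = round(x / scale)
    return max(-128, min(127, q))

def corridor_check(committed, replayed, scale, max_bucket_diff=1):
    """Accept iff every element lands within max_bucket_diff buckets."""
    diffs = [abs(quantize(c, scale) - quantize(r, scale))
             for c, r in zip(committed, replayed)]
    return max(diffs) <= max_bucket_diff, diffs
```

A per-element bucket bound is much stronger than a norm bound: a single wildly wrong activation fails the check even if the average error is tiny.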


Routine audit stays cheap. Deep audit upgrades coverage.

CommitLLM uses the same receipt in both modes. Routine audit keeps steady-state verification light; deep audit opens the full retained window and upgrades prefix provenance to exact verification.

Routine audit

Low-friction spot checks

Designed for normal operation when you want frequent verification without opening the full trace every time.

  • Freivalds-based checks on large linear layers
  • Canonical replay for supported nonlinear subcomputations
  • Sampled prefix and KV provenance with statistical coverage
  • Bounded approximate attention replay on CPU
Deep audit

Escalate when the stakes are higher

Use the same commitment, but require a larger opening. This removes the routine-audit statistical gap on the retained prefix window.

  • Full-prefix and KV openings across the retained audit window
  • Exact prefix provenance instead of sampled coverage
  • The same algebraic, replay, and decode checks as routine audit
  • Higher bandwidth and storage cost, not a different serving path
Operationally: routine audit is the default posture; deep audit is the escalation path when a response is high value, disputed, or randomly selected for full review.

Verify huge matrix multiplies cheaply

The provider claims z = W @ x for a public weight matrix W. Recomputing the full product is expensive. Freivalds’ algorithm gives a much cheaper check: the verifier precomputes v = rᵀW with a secret random vector r, then checks v·x ≟ rᵀz in the finite field Fp, where p = 2^32 − 5.

If z ≠ Wx, the check fails with probability ≥ 1−1/p. This is information-theoretically sound. Transformers are mostly matrix multiplication; once those multiplies are cheap to audit, the verifier can check model identity without rerunning the full model.
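The check above fits in a few lines of Python (toy sizes; function names are illustrative, not the project's API):

```python
# Toy Freivalds check over F_p with p = 2^32 - 5, as described in the text.
import random

P = 2**32 - 5  # prime modulus

def matvec_mod(M, x, p=P):
    """z = M @ x reduced mod p (the provider's claimed product)."""
    return [sum(m * xi for m, xi in zip(row, x)) % p for row in M]

def keygen(W, p=P):
    """One-time verifier setup: secret r and precomputed v = r^T W."""
    rows, cols = len(W), len(W[0])
    r = [random.randrange(p) for _ in range(rows)]
    v = [sum(r[i] * W[i][j] for i in range(rows)) % p for j in range(cols)]
    return r, v

def freivalds_check(r, v, x, z, p=P):
    """Accept iff v . x == r^T z (mod p): O(n) work instead of O(n^2)."""
    lhs = sum(vj * xj for vj, xj in zip(v, x)) % p
    rhs = sum(ri * zi for ri, zi in zip(r, z)) % p
    return lhs == rhs

W = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
x = [10, 20, 30]
r, v = keygen(W)              # done once per matrix family
z_honest = matvec_mod(W, x)   # the provider's (honest) claim
```

The O(n²) precompute in `keygen` is paid once and amortized over every later O(n) check, which is why the verifier never needs the GPU.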

Interactive demo · 3×3 Freivalds check over Fp

Performance on the corrected replay path

Measured on Qwen2.5-7B-W8A8 and Llama-3.1-8B-W8A8. Attention mismatch is single-digit and bounded.

Measured today Qwen2.5-7B-W8A8 and Llama-3.1-8B-W8A8
Verifier hardware CPU only
Provider path Normal GPU serving with tracing

Verifier cost · Llama 70B

  • Routine audit: 1.3 ms
  • Full audit: ~10 ms

Online tracing overhead

  • Base: baseline (no tracing)
  • +Trace: +12–14% overhead

Attention corridor · Qwen2.5-7B-W8A8

  • L∞: 8
  • frac_eq: >92%
  • frac≤1: >99.8%

Attention corridor · Llama-3.1-8B-W8A8

  • L∞: 9
  • frac_eq: 94–96%
  • frac≤1: >99.9%

Built to sit beside real serving stacks

CommitLLM is not a replacement inference engine. The provider keeps the normal GPU path and produces request-scoped evidence alongside it.

Supported now

Continuous batching and paged attention

Many user requests can share the same GPU microbatch. CommitLLM still produces per-request receipts and per-request audits.

Supported now

Tensor parallelism and fused kernels

The tracing layer follows the existing execution path instead of replacing production kernels with proof-friendly substitutes.

Supported now

Quantized serving

Quantization metadata is receipt-bound, and the kept path is measured on production-style W8A8 checkpoints.

Not the current story

Cross-request cache reuse and shortcut decoding

Cross-request prefix caching, speculative decoding, and other semantics-changing shortcuts need more protocol work. Unsupported paths should fail closed.


Four specs, one receipt

CommitLLM binds the entire deployment surface that affects outputs—not just “some model ran.”

Spec              What it binds
input_spec_hash   Tokenizer, chat template, BOS/EOS, truncation, padding, system prompt
model_spec_hash   Checkpoint identity (Merkle root over weights), quantization, LoRA/adapter, RoPE config, RMSNorm ε
decode_spec_hash  Sampler, temperature, top-k/p, penalties, logit bias, grammar, stop rules
output_spec_hash  Detokenization, cleanup, whitespace normalization
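One way the four spec hashes could roll up into a single receipt-bound manifest hash is sketched below; the canonical-JSON encoding and function names are illustrative assumptions, not the project's serialization:

```python
# Hypothetical derivation of a manifest binding from the four spec hashes.
# Encoding and names are illustrative, not CommitLLM's actual format.
import hashlib
import json

def spec_hash(spec: dict) -> bytes:
    """Hash a spec under a canonical JSON encoding (sorted keys, no spaces)."""
    blob = json.dumps(spec, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(blob).digest()

def manifest_hash(input_spec, model_spec, decode_spec, output_spec) -> bytes:
    """Bind all four surfaces into one value: change any knob (temperature,
    chat template, stop rules, ...) and the manifest hash changes."""
    return hashlib.sha256(
        spec_hash(input_spec) + spec_hash(model_spec)
        + spec_hash(decode_spec) + spec_hash(output_spec)
    ).digest()
```

Binding all four together is what rules out the "right model, silently different sampler" failure mode: the receipt commits to the whole deployment surface, not just the checkpoint.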

Provenance for every deployment

Enterprise procurement

Paying for Llama 70B? Get proof the provider actually served that checkpoint, not a smaller distillation.

Regulated deployments

Banks, hospitals, legal teams—auditable chain from decision to model version, decode policy, and output.

Decentralized compute

Networks like Gensyn, Ritual, or Bittensor cannot rely on “trust the node.” CommitLLM provides the missing layer.

Agent systems

When an agent takes action, which model produced the decision becomes a liability and governance question.


Paper summary

Federico Carrone, Diego Kingston, Manuel Puebla, Mauro Toscano
Lambda Class · Centro de Criptografía y Seguridad Digital, UBA

Large language models are increasingly used in settings where integrity matters, but users still lack technical assurance that a provider actually ran the claimed model, decode policy, and output behavior. Fingerprinting and statistical heuristics can provide signals, but not exact per-response verification. Zero-knowledge proof systems provide stronger guarantees, but at prover costs that remain impractical for production LLM serving.

We present CommitLLM, a cryptographic commit-and-audit protocol for open-weight LLM inference. CommitLLM keeps the provider on the normal serving path and keeps verifier work fast and CPU-only. It combines commitment binding, direct audit, and randomized algebraic fingerprints, including Freivalds-style checks for large matrix products, rather than per-response proof generation or full re-execution. Its main costs are retained-state memory over the audit window and audit bandwidth, not per-response proving.

The protocol is commitment-bound end-to-end. Within that binding, large linear layers are verified by verifier-secret, information-theoretically sound algebraic checks, quantization/dequantization boundaries and supported nonlinear subcomputations are checked by canonical re-execution, attention is verified by bounded approximate replay, and routine prefix-state provenance is statistical unless deep audit is used. Unsupported semantics fail closed.


Code layout

The public project name is CommitLLM. Some internal crate and package paths still use the legacy verilm-* prefix while the rename is being completed.

ComponentPath
Core types and traitscrates/verilm-core
Key generationcrates/verilm-keygen
Verifiercrates/verilm-verify
Prover (Rust)crates/verilm-prover
Python sidecarsidecar/
Python bindingscrates/verilm-py
Test vectorscrates/verilm-test-vectors
Lean formalizationlean/
Paperpaper/main.pdf