Pure Go ML framework -- inference, training, and serving. Embed any GGUF model in your Go application with a plain `go build ./...`.
235 tok/s on Gemma 3 1B Q4_K_M -- 25% faster than Ollama. Zero CGo. 28 model architectures (16 families). EAGLE speculative decoding with built-in head training, QuaRot quantization, Q4_K fused GEMV (14x faster), Multi-LoRA serving, BitNet ternary inference. CUDA graph capture, Apple Metal kernels. Time-series inference 21x faster than Python. Tabular ML and time-series forecasting built in.
Decode throughput comparison against Ollama on NVIDIA DGX Spark GB10 (Grace Blackwell, sm_121, 128 GB LPDDR5x).
| Model | Size | Quant | Zerfoo (tok/s) | Ollama (tok/s) | Ratio |
|---|---|---|---|---|---|
| Gemma 3 1B | 1B | Q4_K_M | 235 | 188 | 1.25x |
| DeepSeek R1 1.5B | 1.5B | Q4_K_M | 186 | 167 | 1.11x |
| Llama 3.2 3B | 3B | Q4_K_M | 92 | 93 | 0.99x |
| Mistral 7B | 7B | Q5_K_M | 44 | 44 | 1.00x |
25% faster on small models, parity at 7B. All models produce coherent, verified output.
Methodology
- Hardware: NVIDIA DGX Spark GB10 (Grace Blackwell, sm_121, 128 GB LPDDR5x unified memory)
- Prompt: "Explain the theory of relativity in simple terms."
- Tokens: 128 decode tokens per run
- Sampling: greedy (temperature = 0)
- Runs: 3-run median
- Date: 2026-03-27
- Ollama version: 0.17.7
- Notes: All results verified for coherent output. Zerfoo uses CUDA graph capture with flash attention decode. GQA repeat fix applied (ztensor v0.6.3, zerfoo v1.25.5).
Raw results: results/benchmark-2026-03-27.json
Self-speculative decoding using a lightweight prediction head -- no draft model needed. Based on EAGLE-3.
```go
m, _ := zerfoo.Load("google/gemma-3-1b")
defer m.Close()

result, _ := m.Generate(ctx, "Explain quantum computing.",
    zerfoo.WithEAGLE("eagle-head.gguf"),
)
```

Train your own EAGLE head:
```bash
zerfoo eagle-train --model model.gguf --corpus data.txt --output eagle-head.gguf --epochs 5
```

Hadamard rotation fused into weights at load time for uniform 4-bit quantization. Based on QuaRot.
```bash
zerfoo run --quarot model.gguf "Hello world"
```

Reduce KV cache memory by 6-7x with Q4 or Q3 quantization:
```go
result, _ := m.Generate(ctx, prompt,
    zerfoo.WithKVDtype("q4"), // 7.5x memory reduction
)
```

Convert any MHA/GQA model to Multi-Head Latent Attention via SVD decomposition. Reduces KV cache by 80%+. Based on TransMLA.
```bash
zerfoo transmla --rank 512 --input model.gguf --output model-mla.gguf
```

Serve multiple LoRA adapters from a single base model. Per-request adapter selection via the OpenAI-compatible API:
```bash
curl http://localhost:8080/v1/chat/completions \
  -d '{"model": "gemma3-1b:my-lora", "messages": [{"role": "user", "content": "Hello"}]}'
```

Native support for ternary weight models ({-1, 0, 1}), where matrix multiplication becomes integer addition/subtraction. Based on BitNet b1.58.
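The core trick can be sketched in a few lines of plain Go (a toy illustration, not Zerfoo's actual kernel): with every weight restricted to -1, 0, or +1, a matrix-vector product needs no multiplications at all, only adds, subtracts, and skips.

```go
package main

import "fmt"

// ternaryGEMV computes y = W*x where every weight is -1, 0, or +1,
// so each term is an addition, a subtraction, or a skip -- no multiplies.
func ternaryGEMV(w [][]int8, x []float32) []float32 {
    y := make([]float32, len(w))
    for i, row := range w {
        var acc float32
        for j, wij := range row {
            switch wij {
            case 1:
                acc += x[j]
            case -1:
                acc -= x[j]
            }
        }
        y[i] = acc
    }
    return y
}

func main() {
    w := [][]int8{{1, -1, 0}, {0, 1, 1}}
    x := []float32{2, 3, 5}
    fmt.Println(ternaryGEMV(w, x)) // [-1 8]
}
```

Packing each ternary weight into 2 bits (as BitNet b1.58 does) then also shrinks the weight matrix by 8x relative to FP16.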
Hardware-aligned three-path sparse attention: coarse compression, fine-grained selection, and sliding window. Fused CUDA kernel. Based on NSA.
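The selection logic behind the three paths can be sketched as follows (a simplified toy, assuming per-block importance scores are already available; NSA's real compression path derives them from learned pooling over key blocks, and the actual kernel is fused CUDA):

```go
package main

import (
    "fmt"
    "sort"
)

// selectKeys returns the key positions a query at position q attends to:
// the union of the top-k highest-scoring blocks (fine-grained selection)
// and the last `window` positions (sliding window). blockScores stands in
// for the coarse compressed-attention scores.
func selectKeys(blockScores []float32, blockSize, topK, window, q int) []int {
    // Rank blocks by importance and keep the top-k.
    order := make([]int, len(blockScores))
    for i := range order {
        order[i] = i
    }
    sort.Slice(order, func(a, b int) bool {
        return blockScores[order[a]] > blockScores[order[b]]
    })
    keep := map[int]bool{}
    for _, b := range order[:topK] {
        for p := b * blockSize; p < (b+1)*blockSize && p <= q; p++ {
            keep[p] = true
        }
    }
    // Always attend to the sliding window ending at the query.
    for p := q - window + 1; p <= q; p++ {
        if p >= 0 {
            keep[p] = true
        }
    }
    keys := make([]int, 0, len(keep))
    for p := range keep {
        keys = append(keys, p)
    }
    sort.Ints(keys)
    return keys
}

func main() {
    // 16 positions in 4 blocks of 4; query at position 15.
    scores := []float32{0.9, 0.1, 0.7, 0.2}
    fmt.Println(selectKeys(scores, 4, 2, 4, 15)) // [0 1 2 3 8 9 10 11 12 13 14 15]
}
```

Selecting whole contiguous blocks rather than scattered individual keys is what keeps the memory access pattern hardware-aligned.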
Place shared MoE experts on GPU, offload routed experts to CPU with SIMD kernels. Predictive prefetching achieves 98% hit rate. Based on KTransformers.
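Per-token dispatch can be sketched like this (illustrative only; Zerfoo's scheduler, prefetcher, and kernel names are internal): shared experts always run on the GPU, while a routed expert takes the GPU path only if the prefetcher already made it resident, falling back to the CPU SIMD path otherwise.

```go
package main

import (
    "fmt"
    "sort"
)

// route picks the top-k routed experts for a token from the router logits
// and reports where each one runs: on the GPU if resident (prefetched),
// on the CPU otherwise.
func route(logits []float32, topK int, onGPU map[int]bool) (gpu, cpu []int) {
    idx := make([]int, len(logits))
    for i := range idx {
        idx[i] = i
    }
    sort.Slice(idx, func(a, b int) bool { return logits[idx[a]] > logits[idx[b]] })
    for _, e := range idx[:topK] {
        if onGPU[e] {
            gpu = append(gpu, e)
        } else {
            cpu = append(cpu, e)
        }
    }
    sort.Ints(gpu)
    sort.Ints(cpu)
    return gpu, cpu
}

func main() {
    logits := []float32{0.1, 0.8, 0.3, 0.9, 0.2, 0.6}
    resident := map[int]bool{1: true, 3: true} // prefetched onto the GPU
    gpu, cpu := route(logits, 3, resident)
    fmt.Println(gpu, cpu) // [1 3] [5]
}
```

A 98% prefetch hit rate means the CPU fallback branch is the rare case, so most routed experts still execute on the GPU.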
```go
m, _ := zerfoo.Load("google/gemma-3-4b") // downloads from HuggingFace
defer m.Close()

response, _ := m.Chat("Explain Go interfaces in one sentence.")
fmt.Println(response)
```

```bash
go get github.com/zerfoo/zerfoo
```

Load accepts HuggingFace model IDs. Models are downloaded and cached automatically:
```go
// Download by repo ID (defaults to Q4_K_M quantization)
m, err := zerfoo.Load("google/gemma-3-4b")

// Specify a quantization variant
m, err = zerfoo.Load("google/gemma-3-4b/Q8_0")

// Or load a local GGUF file
m, err = zerfoo.Load("./models/gemma-3-1b.gguf")
```

Stream tokens as they are generated via a channel:
```go
m, _ := zerfoo.Load("google/gemma-3-4b")
defer m.Close()

ch, err := m.ChatStream(context.Background(), "Tell me a joke.")
if err != nil {
    log.Fatal(err)
}
for tok := range ch {
    if !tok.Done {
        fmt.Print(tok.Text)
    }
}
fmt.Println()
```

Extract L2-normalized embeddings and compute similarity:
```go
m, _ := zerfoo.Load("google/gemma-3-4b")
defer m.Close()

embeddings, _ := m.Embed([]string{
    "Go is a statically typed language.",
    "Rust has a borrow checker.",
})
score := embeddings[0].CosineSimilarity(embeddings[1])
fmt.Printf("similarity: %.4f\n", score)
```

Constrain model output to valid JSON matching a schema:
```go
import "github.com/zerfoo/zerfoo/generate/grammar"

m, _ := zerfoo.Load("google/gemma-3-4b")
defer m.Close()

schema := grammar.JSONSchema{
    Type: "object",
    Properties: map[string]*grammar.JSONSchema{
        "name": {Type: "string"},
        "age":  {Type: "number"},
    },
    Required: []string{"name", "age"},
}
result, _ := m.Generate(context.Background(),
    "Generate a person named Alice who is 30.",
    zerfoo.WithSchema(schema),
)
fmt.Println(result.Text) // {"name": "Alice", "age": 30}
```

Detect tool/function calls in model output (OpenAI-compatible):
```go
import "github.com/zerfoo/zerfoo/serve"

m, _ := zerfoo.Load("google/gemma-3-4b")
defer m.Close()

tools := []serve.Tool{{
    Type: "function",
    Function: serve.ToolFunction{
        Name:        "get_weather",
        Description: "Get the current weather for a city",
        Parameters:  json.RawMessage(`{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}`),
    },
}}
result, _ := m.Generate(context.Background(),
    "What is the weather in Paris?",
    zerfoo.WithTools(tools...),
)
for _, tc := range result.ToolCalls {
    fmt.Printf("call %s(%s)\n", tc.FunctionName, tc.Arguments)
}
```

| Architecture | Format | Special Features |
|---|---|---|
| Gemma 3 | GGUF Q4_K | Production. CUDA graph capture, 235 tok/s |
| Gemma 3n | GGUF | Mobile-optimized variant |
| Llama 3 | GGUF | RoPE theta=500K |
| Llama 4 | GGUF | Latest generation |
| Mistral | GGUF | Sliding window attention, 44 tok/s (7B Q4_K_M) |
| Mixtral | GGUF | Mixture of Experts |
| Qwen 2 | GGUF | Attention bias, RoPE theta=1M |
| Phi 3/4 | GGUF | Partial rotary factor, Q2_K/Q3_K support |
| DeepSeek V3 | GGUF | MLA + MoE (batched) |
| Command R | GGUF | Cohere architecture |
| Falcon | GGUF | Multi-query attention |
| RWKV | GGUF | Linear attention |
| GPT-2 | GGUF | TinyStories, learned position embeddings |
| Nemotron-H | GGUF | Hybrid Mamba-2 + Attention (NVIDIA) |
| Nemotron-Cascade-2 | GGUF | Hybrid Mamba-2 + Attention + MoE (30B-A3B) |
| MiniMax M2 | GGUF | Sigmoid MoE (256 experts), QK norm |
| Mamba / Mamba 3 | GGUF | State space models (MIMO SSM) |
| Jamba | GGUF | Hybrid Mamba-Transformer |
| Whisper | GGUF | Audio transcription |
| LLaVA | GGUF | Vision-language |
| Qwen-VL | GGUF | Vision-language |
New architectures are auto-detected from GGUF metadata.
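Auto-detection typically keys off the standard `general.architecture` field in the GGUF metadata; a minimal sketch of such a lookup (the registry contents and builder type here are illustrative, not Zerfoo's internals):

```go
package main

import "fmt"

// builder stands in for a model-constructor; here it just names the family.
type builder func() string

// registry maps the GGUF `general.architecture` value to a builder.
var registry = map[string]builder{
    "gemma3":  func() string { return "Gemma 3" },
    "llama":   func() string { return "Llama" },
    "mistral": func() string { return "Mistral" },
}

// detect looks up the architecture declared in the file's metadata.
func detect(metadata map[string]string) (builder, error) {
    arch := metadata["general.architecture"]
    b, ok := registry[arch]
    if !ok {
        return nil, fmt.Errorf("unsupported architecture %q", arch)
    }
    return b, nil
}

func main() {
    b, err := detect(map[string]string{"general.architecture": "gemma3"})
    if err != nil {
        panic(err)
    }
    fmt.Println(b()) // Gemma 3
}
```

Keeping detection table-driven like this is what lets a new architecture ship as one registry entry plus its layer code.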
| Architecture | Package | Use Case |
|---|---|---|
| MLP / Ensemble | tabular | Baseline tabular prediction |
| FTTransformer | tabular | Attention-based tabular |
| TabNet | tabular | Attentive feature selection |
| SAINT | tabular | Self-attention + inter-sample |
| TabResNet | tabular | Residual tabular networks |
| Architecture | Package | Use Case |
|---|---|---|
| TFT | timeseries | Temporal Fusion Transformer |
| N-BEATS | timeseries | Basis expansion forecasting |
| PatchTST | timeseries | Patch-based transformer |
| Architecture | Format | Use Case |
|---|---|---|
| Granite TTM | GGUF | Zero-shot/few-shot time series forecasting |
| Granite FlowState | GGUF | Continuous forecasting, timescale-invariant |
| Granite TSPulse | GGUF | Anomaly detection, classification, imputation |
Granite Time Series models are converted from HuggingFace using granite2gguf
(part of zonnx). Supported tasks: forecasting, anomaly detection,
classification, imputation, and embedding extraction.
Train tabular and time-series models with built-in AdamW, learning rate schedulers, and early stopping:
```go
import "github.com/zerfoo/zerfoo/tabular"

model := tabular.NewEnsemble[float32](engine, tabular.EnsembleConfig{
    InputDim:  10,
    OutputDim: 1,
    Models:    3,
})
trainer := tabular.NewTrainer(model, engine, tabular.TrainerConfig{
    LR:     0.001,
    Epochs: 50,
})
trainer.Fit(ctx, trainX, trainY)
predictions, _ := model.Predict(ctx, testX)
```

```bash
go install github.com/zerfoo/zerfoo/cmd/zerfoo@latest
```
```bash
zerfoo pull gemma-3-1b-q4                                                # download a model
zerfoo run gemma-3-1b-q4 "Hello"                                         # generate text
zerfoo run --quarot model.gguf "Hello"                                   # QuaRot weight fusion
zerfoo serve gemma-3-1b-q4                                               # OpenAI-compatible API server
zerfoo eagle-train --model m.gguf --corpus data.txt --output eagle.gguf  # train EAGLE head
zerfoo transmla --input m.gguf --output m-mla.gguf                       # MHA→MLA conversion
zerfoo transmla-validate --original m.gguf --converted m-mla.gguf        # perplexity comparison
zerfoo train -backend tabular ...                                        # train a tabular model
zerfoo list                                                              # list cached models
```

See the examples/ directory for runnable programs:
- chat -- interactive chatbot CLI
- rag -- retrieval-augmented generation with embeddings
- json-output -- grammar-guided structured JSON output
- embedding -- embed inference in an HTTP server
- api-server -- standalone API server
- inference -- basic text generation
- streaming -- token streaming
- fine-tuning -- LoRA fine-tuning
- automl -- automated model selection
- timeseries -- time-series forecasting
- distributed-training -- multi-node training
- agentic-tool-use -- function calling agent
- audio-transcription -- Whisper transcription
Full documentation at zerfoo.feza.ai/docs/
- Getting Started -- install, pull a model, run inference
- Tutorials -- step-by-step guides
- API Reference -- generate, inference, serve APIs
- Cookbooks -- 12 runnable code recipes
- Architecture -- GPU setup, architecture overview
- Benchmarks -- throughput numbers
- Blog -- development updates and deep dives
- CONTRIBUTING.md -- how to contribute
Apache 2.0