zerfoo

Pure Go ML framework -- inference, training, and serving. Embed any GGUF model in your Go application with `go build ./...`.

Requires Go 1.26+. Licensed under Apache 2.0.

235 tok/s on Gemma 3 1B Q4_K_M -- 25% faster than Ollama. Zero CGo. 28 model architectures (16 families). EAGLE speculative decoding with built-in head training, QuaRot quantization, Q4_K fused GEMV (14x faster), Multi-LoRA serving, BitNet ternary inference. CUDA graph capture, Apple Metal kernels. Time-series inference 21x faster than Python. Tabular ML and time-series forecasting built in.

Benchmarks

Decode throughput comparison against Ollama on NVIDIA DGX Spark GB10 (Grace Blackwell, sm_121, 128 GB LPDDR5x).

| Model | Size | Quant | Zerfoo (tok/s) | Ollama (tok/s) | Ratio |
|---|---|---|---|---|---|
| Gemma 3 1B | 1B | Q4_K_M | 235 | 188 | 1.25x |
| DeepSeek R1 1.5B | 1.5B | Q4_K_M | 186 | 167 | 1.11x |
| Llama 3.2 3B | 3B | Q4_K_M | 92 | 93 | 0.99x |
| Mistral 7B | 7B | Q5_K_M | 44 | 44 | 1.00x |

25% faster on small models, parity at 7B. All models produce coherent, verified output.

Methodology
  • Hardware: NVIDIA DGX Spark GB10 (Grace Blackwell, sm_121, 128 GB LPDDR5x unified memory)
  • Prompt: "Explain the theory of relativity in simple terms."
  • Tokens: 128 decode tokens per run
  • Sampling: greedy (temperature = 0)
  • Runs: 3-run median
  • Date: 2026-03-27
  • Ollama version: 0.17.7
  • Notes: All results verified for coherent output. Zerfoo uses CUDA graph capture with flash attention decode. GQA repeat fix applied (ztensor v0.6.3, zerfoo v1.25.5).

Raw results: results/benchmark-2026-03-27.json

Advanced Inference Features

EAGLE Speculative Decoding

Self-speculative decoding using a lightweight prediction head — no draft model needed. Based on EAGLE-3.

m, _ := zerfoo.Load("google/gemma-3-1b")
defer m.Close()
result, _ := m.Generate(ctx, "Explain quantum computing.",
    zerfoo.WithEAGLE("eagle-head.gguf"),
)

Train your own EAGLE head:

zerfoo eagle-train --model model.gguf --corpus data.txt --output eagle-head.gguf --epochs 5

QuaRot Weight Fusion

Hadamard rotation fused into weights at load time for uniform 4-bit quantization. Based on QuaRot.

zerfoo run --quarot model.gguf "Hello world"

Quantized KV Cache

Reduce KV cache memory by 6-7x with Q4 or Q3 quantization:

result, _ := m.Generate(ctx, prompt,
    zerfoo.WithKVDtype("q4"),  // Q4: ~6-7x memory reduction
)

TransMLA — MHA-to-MLA Conversion

Convert any MHA/GQA model to Multi-Head Latent Attention via SVD decomposition. Reduces KV cache by 80%+. Based on TransMLA.

zerfoo transmla --rank 512 --input model.gguf --output model-mla.gguf

Multi-LoRA Serving

Serve multiple LoRA adapters from a single base model. Per-request adapter selection via the OpenAI-compatible API:

curl http://localhost:8080/v1/chat/completions \
  -d '{"model": "gemma3-1b:my-lora", "messages": [{"role": "user", "content": "Hello"}]}'

BitNet Ternary Inference

Native support for ternary weight models ({-1, 0, 1}) where matrix multiplication becomes integer addition/subtraction. Based on BitNet b1.58.
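With weights restricted to {-1, 0, +1}, the inner product needs no multiplies at all: each weight either adds, subtracts, or skips an activation. A minimal sketch of a ternary matrix-vector product, for intuition only (this is not zerfoo's kernel):

```go
package main

import "fmt"

// ternaryMatVec computes y = W·x where W holds only {-1, 0, +1}, so every
// "multiplication" reduces to an add, a subtract, or a skip.
func ternaryMatVec(w [][]int8, x []float32) []float32 {
	y := make([]float32, len(w))
	for i, row := range w {
		var acc float32
		for j, wij := range row {
			switch wij {
			case 1:
				acc += x[j]
			case -1:
				acc -= x[j]
			} // 0: contributes nothing
		}
		y[i] = acc
	}
	return y
}

func main() {
	w := [][]int8{
		{1, -1, 0},
		{0, 1, 1},
	}
	x := []float32{2, 3, 5}
	fmt.Println(ternaryMatVec(w, x)) // [-1 8]
}
```

Real kernels pack the ternary weights at ~1.58 bits each and vectorize the add/subtract passes, but the arithmetic shortcut is exactly the one shown above.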

Native Sparse Attention (NSA)

Hardware-aligned three-path sparse attention: coarse compression, fine-grained selection, and sliding window. Fused CUDA kernel. Based on NSA.

Hybrid CPU/GPU MoE

Place shared MoE experts on GPU, offload routed experts to CPU with SIMD kernels. Predictive prefetching achieves 98% hit rate. Based on KTransformers.

Quick Start

m, _ := zerfoo.Load("google/gemma-3-4b")  // downloads from HuggingFace
defer m.Close()
response, _ := m.Chat("Explain Go interfaces in one sentence.")
fmt.Println(response)

Installation

go get github.com/zerfoo/zerfoo

HuggingFace Download

Load accepts HuggingFace model IDs. Models are downloaded and cached automatically:

// Download by repo ID (defaults to Q4_K_M quantization)
m, err := zerfoo.Load("google/gemma-3-4b")

// Specify a quantization variant
m, err := zerfoo.Load("google/gemma-3-4b/Q8_0")

// Or load a local GGUF file
m, err := zerfoo.Load("./models/gemma-3-1b.gguf")

Streaming

Stream tokens as they are generated via a channel:

m, _ := zerfoo.Load("google/gemma-3-4b")
defer m.Close()

ch, err := m.ChatStream(context.Background(), "Tell me a joke.")
if err != nil {
    log.Fatal(err)
}
for tok := range ch {
    if !tok.Done {
        fmt.Print(tok.Text)
    }
}
fmt.Println()

Embeddings

Extract L2-normalized embeddings and compute similarity:

m, _ := zerfoo.Load("google/gemma-3-4b")
defer m.Close()

embeddings, _ := m.Embed([]string{
    "Go is a statically typed language.",
    "Rust has a borrow checker.",
})
score := embeddings[0].CosineSimilarity(embeddings[1])
fmt.Printf("similarity: %.4f\n", score)

Structured Output

Constrain model output to valid JSON matching a schema:

import "github.com/zerfoo/zerfoo/generate/grammar"

m, _ := zerfoo.Load("google/gemma-3-4b")
defer m.Close()

schema := grammar.JSONSchema{
    Type: "object",
    Properties: map[string]*grammar.JSONSchema{
        "name": {Type: "string"},
        "age":  {Type: "number"},
    },
    Required: []string{"name", "age"},
}

result, _ := m.Generate(context.Background(),
    "Generate a person named Alice who is 30.",
    zerfoo.WithSchema(schema),
)
fmt.Println(result.Text) // {"name": "Alice", "age": 30}

Tool Calling

Detect tool/function calls in model output (OpenAI-compatible):

import (
    "encoding/json"

    "github.com/zerfoo/zerfoo/serve"
)

m, _ := zerfoo.Load("google/gemma-3-4b")
defer m.Close()

tools := []serve.Tool{{
    Type: "function",
    Function: serve.ToolFunction{
        Name:        "get_weather",
        Description: "Get the current weather for a city",
        Parameters:  json.RawMessage(`{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}`),
    },
}}

result, _ := m.Generate(context.Background(),
    "What is the weather in Paris?",
    zerfoo.WithTools(tools...),
)

for _, tc := range result.ToolCalls {
    fmt.Printf("call %s(%s)\n", tc.FunctionName, tc.Arguments)
}

Supported Models

LLM Inference (28 architectures, 16 model families)

| Architecture | Format | Special Features |
|---|---|---|
| Gemma 3 | GGUF | Q4_K production; CUDA graph capture, 235 tok/s |
| Gemma 3n | GGUF | Mobile-optimized variant |
| Llama 3 | GGUF | RoPE theta=500K |
| Llama 4 | GGUF | Latest generation |
| Mistral | GGUF | Sliding window attention, 44 tok/s (7B Q5_K_M) |
| Mixtral | GGUF | Mixture of Experts |
| Qwen 2 | GGUF | Attention bias, RoPE theta=1M |
| Phi 3/4 | GGUF | Partial rotary factor, Q2_K/Q3_K support |
| DeepSeek V3 | GGUF | MLA + MoE (batched) |
| Command R | GGUF | Cohere architecture |
| Falcon | GGUF | Multi-query attention |
| RWKV | GGUF | Linear attention |
| GPT-2 | GGUF | TinyStories, learned position embeddings |
| Nemotron-H | GGUF | Hybrid Mamba-2 + Attention (NVIDIA) |
| Nemotron-Cascade-2 | GGUF | Hybrid Mamba-2 + Attention + MoE (30B-A3B) |
| MiniMax M2 | GGUF | Sigmoid MoE (256 experts), QK norm |
| Mamba / Mamba 3 | GGUF | State space models (MIMO SSM) |
| Jamba | GGUF | Hybrid Mamba-Transformer |
| Whisper | GGUF | Audio transcription |
| LLaVA | GGUF | Vision-language |
| Qwen-VL | GGUF | Vision-language |

New architectures are auto-detected from GGUF metadata.

Tabular ML

| Architecture | Package | Use Case |
|---|---|---|
| MLP / Ensemble | tabular | Baseline tabular prediction |
| FTTransformer | tabular | Attention-based tabular |
| TabNet | tabular | Attentive feature selection |
| SAINT | tabular | Self-attention + inter-sample attention |
| TabResNet | tabular | Residual tabular networks |

Time-Series Forecasting

| Architecture | Package | Use Case |
|---|---|---|
| TFT | timeseries | Temporal Fusion Transformer |
| N-BEATS | timeseries | Basis expansion forecasting |
| PatchTST | timeseries | Patch-based transformer |

IBM Granite Time Series

| Architecture | Format | Use Case |
|---|---|---|
| Granite TTM | GGUF | Zero-shot/few-shot time series forecasting |
| Granite FlowState | GGUF | Continuous forecasting, timescale-invariant |
| Granite TSPulse | GGUF | Anomaly detection, classification, imputation |

Granite Time Series models are converted from HuggingFace using granite2gguf (part of zonnx). Supported tasks: forecasting, anomaly detection, classification, imputation, and embedding extraction.

Training

Train tabular and time-series models with built-in AdamW, learning rate schedulers, and early stopping:

import "github.com/zerfoo/zerfoo/tabular"

model := tabular.NewEnsemble[float32](engine, tabular.EnsembleConfig{
    InputDim:  10,
    OutputDim: 1,
    Models:    3,
})
trainer := tabular.NewTrainer(model, engine, tabular.TrainerConfig{
    LR:     0.001,
    Epochs: 50,
})
trainer.Fit(ctx, trainX, trainY)
predictions, _ := model.Predict(ctx, testX)

CLI

go install github.com/zerfoo/zerfoo/cmd/zerfoo@latest

zerfoo pull gemma-3-1b-q4              # download a model
zerfoo run gemma-3-1b-q4 "Hello"       # generate text
zerfoo run --quarot model.gguf "Hello" # QuaRot weight fusion
zerfoo serve gemma-3-1b-q4             # OpenAI-compatible API server
zerfoo eagle-train --model m.gguf --corpus data.txt --output eagle.gguf  # train EAGLE head
zerfoo transmla --input m.gguf --output m-mla.gguf  # MHA→MLA conversion
zerfoo transmla-validate --original m.gguf --converted m-mla.gguf  # perplexity comparison
zerfoo train -backend tabular ...      # train a tabular model
zerfoo list                             # list cached models

Examples

See the examples/ directory for runnable programs.

Documentation

Full documentation at zerfoo.feza.ai/docs/

License

Apache 2.0