Pure Go ML framework -- inference, training, and serving. Embed any GGUF model in your Go application with `go build ./...`.

244 tok/s on Gemma 3 1B Q4_K_M (95% memory bandwidth utilization) -- 20% faster than Ollama. Zero cgo. 20 model architectures. Tabular ML and time-series forecasting built in.
```go
m, _ := zerfoo.Load("google/gemma-3-4b") // downloads from HuggingFace
defer m.Close()

response, _ := m.Chat("Explain Go interfaces in one sentence.")
fmt.Println(response)
```

Install with:

```shell
go get github.com/zerfoo/zerfoo
```

`Load` accepts HuggingFace model IDs. Models are downloaded and cached automatically:
```go
// Download by repo ID (defaults to Q4_K_M quantization)
m, err := zerfoo.Load("google/gemma-3-4b")

// Specify a quantization variant
m, err := zerfoo.Load("google/gemma-3-4b/Q8_0")

// Or load a local GGUF file
m, err := zerfoo.Load("./models/gemma-3-1b.gguf")
```

Stream tokens as they are generated via a channel:
```go
m, _ := zerfoo.Load("google/gemma-3-4b")
defer m.Close()

ch, err := m.ChatStream(context.Background(), "Tell me a joke.")
if err != nil {
	log.Fatal(err)
}
for tok := range ch {
	if !tok.Done {
		fmt.Print(tok.Text)
	}
}
fmt.Println()
```

Extract L2-normalized embeddings and compute similarity:
```go
m, _ := zerfoo.Load("google/gemma-3-4b")
defer m.Close()

embeddings, _ := m.Embed([]string{
	"Go is a statically typed language.",
	"Rust has a borrow checker.",
})
score := embeddings[0].CosineSimilarity(embeddings[1])
fmt.Printf("similarity: %.4f\n", score)
```

Because the embeddings are unit-length, cosine similarity is equivalent to a dot product.

Constrain model output to valid JSON matching a schema:
```go
import "github.com/zerfoo/zerfoo/generate/grammar"

m, _ := zerfoo.Load("google/gemma-3-4b")
defer m.Close()

schema := grammar.JSONSchema{
	Type: "object",
	Properties: map[string]*grammar.JSONSchema{
		"name": {Type: "string"},
		"age":  {Type: "number"},
	},
	Required: []string{"name", "age"},
}

result, _ := m.Generate(context.Background(),
	"Generate a person named Alice who is 30.",
	zerfoo.WithSchema(schema),
)
fmt.Println(result.Text) // {"name": "Alice", "age": 30}
```

Detect tool/function calls in model output (OpenAI-compatible):
```go
import (
	"encoding/json"

	"github.com/zerfoo/zerfoo/serve"
)

m, _ := zerfoo.Load("google/gemma-3-4b")
defer m.Close()

tools := []serve.Tool{{
	Type: "function",
	Function: serve.ToolFunction{
		Name:        "get_weather",
		Description: "Get the current weather for a city",
		Parameters:  json.RawMessage(`{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}`),
	},
}}

result, _ := m.Generate(context.Background(),
	"What is the weather in Paris?",
	zerfoo.WithTools(tools...),
)
for _, tc := range result.ToolCalls {
	fmt.Printf("call %s(%s)\n", tc.FunctionName, tc.Arguments)
}
```

| Architecture | Format | Special Features |
|---|---|---|
| Gemma 3 | GGUF Q4_K | Production. CUDA graph capture, 244 tok/s |
| Gemma 3n | GGUF | Mobile-optimized variant |
| Llama 3 | GGUF | RoPE theta=500K |
| Llama 4 | GGUF | Latest generation |
| Mistral | GGUF | Sliding window attention |
| Mixtral | GGUF | Mixture of Experts |
| Qwen 2 | GGUF | Attention bias, RoPE theta=1M |
| Phi 3/4 | GGUF | Partial rotary factor |
| DeepSeek V3 | GGUF | MLA + MoE (batched) |
| Command R | GGUF | Cohere architecture |
| Falcon | GGUF | Multi-query attention |
| RWKV | GGUF | Linear attention |
| Mamba / Mamba 3 | GGUF | State space models (MIMO SSM) |
| Jamba | GGUF | Hybrid Mamba-Transformer |
| Whisper | GGUF | Audio transcription |
| LLaVA | GGUF | Vision-language |
| Qwen-VL | GGUF | Vision-language |
New architectures are auto-detected from GGUF metadata.
| Architecture | Package | Use Case |
|---|---|---|
| MLP / Ensemble | tabular | Baseline tabular prediction |
| FTTransformer | tabular | Attention-based tabular |
| TabNet | tabular | Attentive feature selection |
| SAINT | tabular | Self-attention + inter-sample |
| TabResNet | tabular | Residual tabular networks |
| Architecture | Package | Use Case |
|---|---|---|
| TFT | timeseries | Temporal Fusion Transformer |
| N-BEATS | timeseries | Basis expansion forecasting |
| PatchTST | timeseries | Patch-based transformer |
Train tabular and time-series models with built-in AdamW, learning rate schedulers, and early stopping:

```go
import "github.com/zerfoo/zerfoo/tabular"

model := tabular.NewEnsemble[float32](engine, tabular.EnsembleConfig{
	InputDim:  10,
	OutputDim: 1,
	Models:    3,
})

trainer := tabular.NewTrainer(model, engine, tabular.TrainerConfig{
	LR:     0.001,
	Epochs: 50,
})
trainer.Fit(ctx, trainX, trainY)

predictions, _ := model.Predict(ctx, testX)
```

The CLI is installed with `go install`:

```shell
go install github.com/zerfoo/zerfoo/cmd/zerfoo@latest

zerfoo pull gemma-3-1b-q4         # download a model
zerfoo run gemma-3-1b-q4 "Hello"  # generate text
zerfoo serve gemma-3-1b-q4        # OpenAI-compatible API server
zerfoo train -backend tabular ... # train a tabular model
zerfoo list                       # list cached models
```

See the examples/ directory for runnable programs:
- chat -- interactive chatbot CLI
- rag -- retrieval-augmented generation with embeddings
- json-output -- grammar-guided structured JSON output
- embedding -- embed inference in an HTTP server
- api-server -- standalone API server
- inference -- basic text generation
- streaming -- token streaming
- fine-tuning -- LoRA fine-tuning
- automl -- automated model selection
- timeseries -- time-series forecasting
- distributed-training -- multi-node training
- agentic-tool-use -- function calling agent
- audio-transcription -- Whisper transcription
- Getting Started -- full walkthrough: install, pull a model, run inference via CLI and library
- GPU Setup -- configure CUDA, ROCm, or OpenCL for hardware-accelerated inference
- Benchmarks -- throughput numbers across models and hardware
- Design -- architecture overview and key design decisions
- Blog -- development updates and deep dives
- CONTRIBUTING.md -- how to contribute
Apache 2.0