kin-infer

Universal transformer inference engine -- pure Rust, GPU-accelerated.

kin-infer is a standalone inference engine extracted from the Kin ecosystem. It loads HuggingFace safetensors models for the architectures listed below and runs both encoder and decoder-only models with custom GPU compute shaders (Metal on macOS, CUDA on Linux/Windows). It depends on no external ML frameworks: the GPU paths use custom MSL shaders and PTX kernels with direct platform API calls.

Alpha -- APIs will evolve. The engine is proven: it powers Kin's embedding pipeline and has been validated against reference implementations across all supported architectures.

What kin-infer Does

Encoder architectures: BERT, RoBERTa, ALBERT, DeBERTa, T5 (encoder), Jina, nomic-embed, BGE, GTE
Decoder-only architectures: LLaMA, Mistral, Gemma, GPT-2, Phi, Qwen2
Weight formats: safetensors (single or sharded), F32/F16/BF16/Q8_0/Q4_0
Positional encodings: learned, ALiBi, RoPE, relative bias (T5), disentangled (DeBERTa)
Attention: MHA, GQA, MQA
Normalization: LayerNorm, RMSNorm
FFN: GELU, SwiGLU, GeGLU, ReGLU
Generation: autoregressive decoding with KV cache, temperature, top-k, top-p, repetition penalty
SIMD: ARM NEON and x86 AVX2+FMA accelerated dot products
GPU: Apple Metal (M1/M2/M3), NVIDIA CUDA (via driver API, no toolkit needed)
GPU ops: matmul, softmax, layer_norm, rms_norm, GELU, SiLU, RoPE -- custom shaders
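
The generation controls listed above (temperature, top-k, top-p) can be sketched in plain Rust. This illustrates the general technique only; it is not kin-infer's implementation, and the function names `softmax_with_temperature` and `filter_top_k_top_p` are hypothetical. Repetition penalty and the final random draw are left out to keep the sketch short.

```rust
/// Softmax over logits divided by a temperature (assumed to be > 0).
/// Lower temperatures sharpen the distribution, higher ones flatten it.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    // Subtract the maximum logit first for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Keep the `top_k` most probable tokens, then the shortest prefix of those
/// whose cumulative probability reaches `top_p`, and renormalize the rest.
/// Returns (token_id, probability) pairs in descending probability order.
fn filter_top_k_top_p(probs: &[f32], top_k: usize, top_p: f32) -> Vec<(usize, f32)> {
    let mut indexed: Vec<(usize, f32)> = probs.iter().cloned().enumerate().collect();
    indexed.sort_by(|a, b| b.1.total_cmp(&a.1));
    indexed.truncate(top_k.max(1)); // always keep at least one candidate

    // Nucleus (top-p) cut: stop once the kept mass reaches `top_p`.
    let mut cumulative = 0.0f32;
    let mut keep = 0;
    for &(_, p) in &indexed {
        cumulative += p;
        keep += 1;
        if cumulative >= top_p {
            break;
        }
    }
    indexed.truncate(keep);

    // Renormalize so the surviving probabilities sum to 1.
    let total: f32 = indexed.iter().map(|&(_, p)| p).sum();
    indexed.into_iter().map(|(i, p)| (i, p / total)).collect()
}
```

A sampler would then draw one token ID from the returned pairs according to their probabilities and append it to the sequence before the next decoding step.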

Quick Start

# Prerequisites: Rust stable
git clone https://github.com/firelock-ai/kin-infer.git
cd kin-infer

# CPU only
cargo build --release

# macOS with Metal GPU acceleration (M1/M2/M3)
cargo build --release --features metal

# Linux/Windows with NVIDIA CUDA GPU
cargo build --release --features cuda

# Run tests
cargo test --features metal   # macOS
cargo test --features cuda    # Linux/Windows with NVIDIA GPU
cargo test                    # CPU only

As a dependency

[dependencies]
kin-infer = { git = "https://github.com/firelock-ai/kin-infer" }

# With Metal GPU (macOS)
kin-infer = { git = "https://github.com/firelock-ai/kin-infer", features = ["metal"] }

# With CUDA GPU (Linux/Windows)
kin-infer = { git = "https://github.com/firelock-ai/kin-infer", features = ["cuda"] }

Usage

use kin_infer::{BertConfig, BertModel, SamplingParams};
use std::path::Path;

// Load config
let config: BertConfig = serde_json::from_str(
    &std::fs::read_to_string("model/config.json").unwrap()
).unwrap();

// Load model
let model = BertModel::load(Path::new("model/model.safetensors"), config).unwrap();

// Encoder: generate embeddings
let token_ids = vec![vec![101, 2023, 2003, 1037, 3231, 102]];
let attention_masks = vec![vec![1, 1, 1, 1, 1, 1]];
let embeddings = model.forward(&token_ids, &attention_masks).unwrap();

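// Decoder: generate text (here `model` is assumed to be loaded from a
// decoder-only checkpoint such as LLaMA, not the encoder loaded above)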
let prompt = vec![1, 15043, 29892]; // "<s>Hello,"
let mut params = SamplingParams::default();
let generated = model.generate(&prompt, 64, &mut params).unwrap();
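
The encoder example above uses a single sequence, so every mask entry is 1. With several sequences of different lengths, the shorter ones need padding and a mask that marks the padded positions. The following is a minimal sketch under stated assumptions: `pad_batch` is a hypothetical helper that is not part of kin-infer, `u32` is an assumed element type (the README does not say what `forward` expects), and the padding ID depends on the tokenizer in use.

```rust
/// Pad variable-length token sequences to a common length and build the
/// matching attention masks (1 = real token, 0 = padding).
/// Hypothetical helper; the element type `u32` is an assumption.
fn pad_batch(sequences: &[Vec<u32>], pad_id: u32) -> (Vec<Vec<u32>>, Vec<Vec<u32>>) {
    // The longest sequence sets the padded length of the whole batch.
    let max_len = sequences.iter().map(Vec::len).max().unwrap_or(0);

    let mut token_ids = Vec::with_capacity(sequences.len());
    let mut attention_masks = Vec::with_capacity(sequences.len());

    for seq in sequences {
        let mut ids = seq.clone();
        let mut mask = vec![1u32; seq.len()];
        ids.resize(max_len, pad_id); // append padding tokens
        mask.resize(max_len, 0); // padded positions are masked out
        token_ids.push(ids);
        attention_masks.push(mask);
    }

    (token_ids, attention_masks)
}
```

The two returned vectors have the same shape as `token_ids` and `attention_masks` in the example above.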

GPU device discovery

use kin_infer::gpu;

// Discover all available devices
for device in gpu::discover_devices() {
    println!("{}", device);  // e.g. "Apple M1 Pro (Metal, 10922MB, unified)"
}

// Auto-select best backend (Metal > CUDA > CPU)
let compute = gpu::create_compute();
println!("Using: {} on {}", compute.backend(), compute.device_name());
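
The preference order named in the comment above (Metal > CUDA > CPU) amounts to picking the highest-priority backend among those available and falling back to the CPU when none is found. The sketch below shows that selection rule only; the `Backend` enum and `select_backend` function are hypothetical and are not kin-infer's types.

```rust
/// Hypothetical backend tag, used only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Metal,
    Cuda,
    Cpu,
}

/// Lower value means higher preference: Metal > CUDA > CPU.
fn priority(backend: Backend) -> u8 {
    match backend {
        Backend::Metal => 0,
        Backend::Cuda => 1,
        Backend::Cpu => 2,
    }
}

/// Pick the most preferred backend, defaulting to the CPU when the
/// list of discovered backends is empty.
fn select_backend(available: &[Backend]) -> Backend {
    available
        .iter()
        .copied()
        .min_by_key(|&b| priority(b))
        .unwrap_or(Backend::Cpu)
}
```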

Status

Proven now:

Full encoder forward pass with mean pooling and L2 normalization
Full decoder-only forward pass with KV cache
Autoregressive generation with configurable sampling
All major attention variants (MHA, GQA, MQA) and positional schemes (ALiBi, RoPE, T5 relative bias, DeBERTa disentangled)
SIMD-accelerated math primitives (ARM NEON, x86 AVX2+FMA)
Apple Metal GPU compute (custom MSL shaders, tested on M1/M2/M3)
NVIDIA CUDA GPU compute (PTX kernels, driver API -- no toolkit required)
Auto device discovery and transparent CPU fallback
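
Mean pooling with L2 normalization, mentioned in the first item above, turns per-token hidden states into one unit-length sentence embedding. The sketch below shows the standard technique on plain `Vec<f32>` data. It is not taken from kin-infer's source, and the names `mean_pool` and `l2_normalize` are hypothetical.

```rust
/// Average the token vectors whose mask entry is non-zero.
/// `hidden` holds one vector per token; `mask` has one entry per token.
fn mean_pool(hidden: &[Vec<f32>], mask: &[u32]) -> Vec<f32> {
    let dim = hidden.first().map_or(0, Vec::len);
    let mut pooled = vec![0.0f32; dim];
    let mut count = 0.0f32;

    for (row, &m) in hidden.iter().zip(mask) {
        if m != 0 {
            for (acc, &x) in pooled.iter_mut().zip(row) {
                *acc += x;
            }
            count += 1.0;
        }
    }

    // Guard against an all-padding sequence to avoid dividing by zero.
    if count > 0.0 {
        for v in &mut pooled {
            *v /= count;
        }
    }
    pooled
}

/// Scale a vector in place to unit Euclidean length (no-op for a zero vector).
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}
```

Once embeddings are unit length, the cosine similarity between two of them reduces to a plain dot product.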

Still hardening:

GPU-accelerated forward pass integration (shaders work, wiring in progress)
Batch decoding
Streaming generation

Ecosystem

Component	Description
kin	Semantic VCS -- primary consumer
kin-db	Graph engine substrate
kin-infer	Inference engine (this repo)
kin-vfs	Virtual filesystem

Contributing

Contributions welcome. Please open an issue before submitting large changes.

License

Apache-2.0.

Created by Troy Fortin at Firelock, LLC.

"So neither the one who plants nor the one who waters is anything, but only God, who makes things grow." -- 1 Corinthians 3:7
