High-Performance AI Inference Runtime
Geodessical is a C-based inference runtime for GGUF language models. It can run as a normal host application (Windows/Linux) and shares core inference code with TensorOS. The focus is straightforward: predictable runtime behavior, low overhead, and transparent performance tuning.
- GGUF model loading — Qwen, LLaMA, Gemma, SmolLM, Mistral, Phi-2/3/3.5
- Quantization — Q4_0, Q8_0, F16, F32 weight formats
- JIT-compiled kernels — Native x86_64 SSE2/AVX2 forward pass kernels
- SMP parallel GEMV — Multi-threaded matrix-vector multiply across all CPU cores
- Host-mode runtime — Memory-mapped model loading, native threads, cross-platform
- Bare-metal mode — Still boots as a standalone OS via Multiboot1
The numbers below are from an actual local run on April 13, 2026.
- CPU: AMD Ryzen 9 7940HS (8 cores / 16 threads)
- GPU: NVIDIA GeForce RTX 4070 Laptop GPU (8 GB class; runtime reported ~7052 MB free)
- RAM: 32 GB
- OS: Windows (host mode)
- Prompt: "Write a 500-word explanation of how compilers optimize loops, in plain English."
- Max generation: 256 tokens
- Single run per engine/model (no averaging in this table)
| Engine | Model | Throughput Metric | Measured Value |
|---|---|---|---|
| Geodessical | google_gemma-4-E2B-it-Q4_0.gguf | End-to-end generation rate | 92.5 tok/s |
| Geodessical | google_gemma-4-E2B-it-Q4_0.gguf | Decode-only rate | 107.7 tok/s |
| Ollama | gemma3:4b | Eval rate (eval_count / eval_duration) | 75.36 tok/s |
| Ollama | gemma4:latest | Eval rate (eval_count / eval_duration) | 30.21 tok/s |
- Geodessical end-to-end (92.5 tok/s) vs Ollama gemma3:4b (75.36 tok/s): +22.7%
- Geodessical end-to-end (92.5 tok/s) vs Ollama gemma4:latest (30.21 tok/s): +206.2%
Geodessical:

```powershell
.\build_host\geodessical.exe "C:\Users\legom\TensorOS\models\google_gemma-4-E2B-it-Q4_0.gguf" -p "Write a 500-word explanation of how compilers optimize loops, in plain English." -n 256
```

Ollama (gemma3:4b):

```powershell
$body = @{ model = 'gemma3:4b'; prompt = 'Write a 500-word explanation of how compilers optimize loops, in plain English.'; stream = $false; options = @{ num_predict = 256; temperature = 0.7 } } | ConvertTo-Json -Depth 6
$r = Invoke-RestMethod -Uri 'http://localhost:11434/api/generate' -Method Post -ContentType 'application/json' -Body $body
[math]::Round(($r.eval_count / ($r.eval_duration / 1e9)), 2)
```

Ollama (gemma4:latest):

```powershell
$body = @{ model = 'gemma4:latest'; prompt = 'Write a 500-word explanation of how compilers optimize loops, in plain English.'; stream = $false; options = @{ num_predict = 256; temperature = 0.7 } } | ConvertTo-Json -Depth 6
$r = Invoke-RestMethod -Uri 'http://localhost:11434/api/generate' -Method Post -ContentType 'application/json' -Body $body
[math]::Round(($r.eval_count / ($r.eval_duration / 1e9)), 2)
```

Notes:
- This is a practical runtime comparison, not a strict model-equivalence benchmark.
- Geodessical and Ollama model packages are not byte-identical here, so use these results as operational guidance, not a canonical leaderboard.
```
$ ./geodessical phi3.5-mini-q4_0.gguf -p "What is an operating system?"
Geodessical v0.4.0 "Axon"
High-Performance AI Inference Runtime
[CPU] SSE2=1 AVX2=1 FMA=1 AVX512=0
[SMP] 8 CPUs online (7 workers + BSP)
[GD] Loading model: phi3.5-mini-q4_0.gguf
[GD] Mapped 2081 MB
[LLM] Model: Phi 3.5 Mini Instruct (phi3)
[LLM] 32 layers, 3072-dim, 32064 vocab, 32 heads
[GD] Model loaded in 1240 ms
[GD] Prompt: "What is an operating system?"
An operating system (OS) is a complex piece of software that manages...
```
| Tool | Purpose | Install |
|---|---|---|
| zig (0.15+) | C compiler | ziglang.org/download |
```powershell
# Build the hosted runtime
.\build_host.ps1

# Run with a GGUF model
.\build_host\geodessical.exe phi3.5.gguf -p "Hello world"

# Interactive chat mode
.\build_host\geodessical.exe phi3.5.gguf -i
```

Or with CMake (if GCC/Clang available):
```shell
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
./geodessical phi3.5.gguf -i
```

```powershell
# Build the bare-metal kernel + run in QEMU
.\build.ps1 -Run

# QEMU flags: -machine q35,accel=whpx -cpu EPYC-v4 -smp 4 -m 8G
#             -drive file=phi3.5.gguf,format=raw,if=virtio
```

```
Geodessical <model.gguf> [options]

Options:
  -p, --prompt <text>    Prompt text (default: interactive)
  -n, --tokens <num>     Max tokens to generate (default: 128)
  -t, --threads <num>    Thread count (default: all CPUs)
  --temp <float>         Temperature (default: 0.7)
  --top-k <int>          Top-K sampling (default: 40)
  --top-p <float>        Nucleus sampling (default: 0.9)
  -i, --interactive      Interactive chat mode
  -h, --help             Show this help
```
Geodessical operates in two modes:
```
┌─────────────────────────────────────────────────┐
│ geodessical.exe / Geodessical                   │
│ CLI: model load, prompt, interactive chat       │
├─────────────────────────────────────────────────┤
│ HAL (Hardware Abstraction Layer)                │
│ ┌───────────┬───────────┬──────────────────┐    │
│ │ Memory    │ Threading │ CPU Detection    │    │
│ │ malloc    │ Win32/    │ CPUID: SSE2,     │    │
│ │ aligned   │ pthreads  │ AVX2, FMA,       │    │
│ │ mmap      │ workers   │ AVX-512          │    │
│ └───────────┴───────────┴──────────────────┘    │
├─────────────────────────────────────────────────┤
│ Inference Engine (shared with bare-metal)       │
│ ┌──────┬─────────┬──────┬──────┬───────────┐    │
│ │ GGUF │ BPE     │ JIT  │ SMP  │ Forward   │    │
│ │parse │tokenize │ x86  │GEMV  │ pass      │    │
│ └──────┴─────────┴──────┴──────┴───────────┘    │
└─────────────────────────────────────────────────┘
```
The full TensorOS kernel boots via Multiboot1, runs on x86_64/ARM64, and includes the AI shell, tensor scheduler, native git, GPU drivers, and everything documented in the TensorOS README.
```
Geodessical/
├── host/              # Host-mode runtime (NEW)
│   ├── hal.h          # Hardware Abstraction Layer header
│   ├── hal.c          # Cross-platform HAL implementation
│   ├── main.c         # CLI entry point
│   └── shims/         # Include shims (kernel→HAL redirect)
│       └── kernel/... # Shim headers for all kernel includes
├── runtime/
│   ├── nn/
│   │   ├── llm.c      # Full LLM inference engine
│   │   ├── llm.h      # Model types and API
│   │   └── gguf.c     # GGUF format parser
│   └── jit/
│       ├── x86_jit.c  # x86_64 JIT code emitter
│       └── llm_jit.c  # JIT forward kernels
├── kernel/            # Bare-metal kernel (TensorOS heritage)
├── boot/              # Bootloader (Multiboot1, ARM64)
├── build_host.ps1     # Host-mode build script (Zig CC)
├── build.ps1          # Bare-metal build script
└── CMakeLists.txt     # CMake build (GCC/Clang)
```
- Model Loading: Memory-maps the GGUF file (no copy), parses metadata, maps tensor pointers directly into the file.
- Tokenization: BPE tokenizer built from GGUF vocabulary with an O(1) hash table lookup and merge-based encoding.
- Forward Pass: Full transformer forward pass with RMSNorm → QKV projection → RoPE → GQA attention → SwiGLU FFN → LM head.
- JIT Compilation: On first inference, six x86_64 SIMD kernels are JIT-compiled (vadd, dot, axpy, fused_silu_mul, rope, rmsnorm) — eliminating per-element function call overhead.
- SMP Dispatch: Matrix-vector multiplies are partitioned across all CPU cores via the HAL's thread pool.
- Sampling: Temperature-scaled softmax with top-k/top-p nucleus sampling and optional greedy decoding.
Current GGUF coverage in this runtime includes:
| Model | Architecture | Tested |
|---|---|---|
| Gemma 4 E2B It | gemma4 | ✅ |
| Phi-3.5 Mini Instruct | phi3 | ✅ |
| Qwen2.5 | qwen2 | ✅ |
| LLaMA 3 | llama | ✅ |
| Gemma 2 | gemma | ✅ |
| SmolLM 2 | llama | ✅ |
| Mistral | llama | ✅ |
| Phi-2 | phi2 | ✅ |
Quantization: Q4_0, Q8_0, F16, F32
Geodessical evolved from TensorOS, a bare-metal AI operating system. The core inference engine, GGUF parser, BPE tokenizer, JIT compiler, and SMP parallel GEMV are shared between both projects. Geodessical adds the HAL layer to run the same inference code as a native application on Windows and Linux.
MIT