All notable changes to TensorOS are documented in this file. The focus here is code and measured behavior, not release-note marketing.
Production host runtime with geometric inference research integration. Geodessical v0.6.0 "Synapse" ships as a fully featured host-mode inference engine while running the Axiom Beta-3 OTT survey pipeline in parallel. Peak decode reaches 107.7 tok/s on Gemma 4 E2B (RTX 4070 Laptop), 22.7% ahead of Ollama gemma3:4b on the same hardware.
Phase 5 geodesic pilot now uses deterministic model next-token generation as the target embedding instead of random vocabulary selection. This aligns the pilot with actual decode behavior and makes top1/MRR metrics directly meaningful.
Telemetry: `oracle_target_count` vs `random_target_count` in the JSON report.
Knowledge-injection warp accumulations now survive process restarts via
`axiom_warp_state.dat`. Warp points accumulate across sessions; threshold-triggered
manifold recomputation runs in post-Phase-5 control flow (no Phase-5 coupling).
- Uncertainty-based candidate selection with early stop after sustained low uncertainty
- Adaptive model-oracle budget in fast mode: 2–4 calls (down from 16)
- Stricter fast-mode uncertainty floor for oracle trigger
- Result: Phase 4 wall time: 909 ms → 669 ms (−26%)
Curvature-informed initial velocity prior in Phase 5 (bounded local acceleration from interpolated Christoffel symbols). Adaptive geodesic retry with step/velocity damping.
| Metric | Previous (cap=16) | Current |
|---|---|---|
| Total time | ~1218 ms | ~977 ms |
| Phase 4 | ~909 ms | ~669 ms |
| Phase 5 | — | ~43 ms |
| MRR | ~0.032 | ~0.067 |
LRU hidden-state cache (keyed by token_id × layer) reduces Phase 3 manifold recomputation from 197 s cold → 0.17 s warm (−99.9%) on SmolLM2. Full Phase 3 + Phase 4 refresh now triggerable without prohibitive cost.
`enable_knowledge_injection`, `injection_alpha`, `injection_sigma`, and `injection_points`
controls added. Applies an OTT-style local Christoffel warp with Gaussian distance decay.
Warp accumulation + recalc trigger plumbing fully implemented; training-time coupling
pending.
`--axiom-fast` activates:
- `embedding_samples ≤ 64`
- `metric_sample_points ≤ 64`
- `oracle_calls_max ≤ 12`
- `geodesic_test_tokens ≤ 8`
- `geodesic_vocab_probe ≤ 512`
| Config | Total | Phase5 | ID | top1 | MRR |
|---|---|---|---|---|---|
| samples=64, probe=256 | 543 ms | 59 ms | 14 | 0.000 | 0.0153 |
| samples=128, probe=512 | 1013 ms | 69 ms | 16 | 0.000 | 0.0000 |
| samples=256, probe=1024 | 3209 ms | 15 ms | 41 | 0.000 | 0.0000 |
| Metric | Value | Context |
|---|---|---|
| Decode (Gemma4 E2B, GPU, long/512) | 107.7 tok/s | RTX 4070 Laptop, decode-only |
| End-to-end (Gemma4 E2B, GPU) | 92.5 tok/s | Includes prefill, 256 tokens |
| vs Ollama gemma3:4b | +22.7% | Same prompt, same hardware |
| vs Ollama gemma4:latest | +206.2% | Same prompt, same hardware |
| SmolLM2-135M GPU (long) | 174–271 tok/s | Q8_0, variable prompt length |
- Gemma4 architecture: interleaved sliding-window attention (ISWA), dual RoPE bases, doubled FFN layers 15+
- `--ott-fast`: speed-first OTT (spec-decode batch=16, AttnRes, fast axiom, max TPS)
- `--ott-speculative`: geodesic spec-decode (batch=2, geodesic drafts + transformer verify)
- `--ott-perfect`: exact greedy rollout upper bound (100% draft acceptance rate)
- `--ott-full`: full OTT pipeline (axiom + geodesic-first + AttnRes + OneDecode prep)
- `--ott-theorem`: adds depth-attn to ott-full for maximum reasoning quality
- `--one-decode`: bake geodesic flow map once → `ott_one_decode.bin` for instant decode
- `--ott-od`: OTT-OD protocol — OneDecode map as speculative draft source
- `--ott-swarm <K>`: OD-SWARM fan-out (K candidates per draft slot)
- `--attnres` / `--attnres-strength`: attention residual depth stabilization
- `--depth-attn` / `--depth-attn-strength` / `--depth-attn-window`: depth-wise residual cross-layer attention
- `--no-think`, `--force-think`, `--show-think`: thinking-token control for reasoning models
- CUDA: dynamic DLL dispatch (`cuda_kernels.dll`), ~50 GPU operations, CUDA Graph capture
- CUDA: fused QKV (triple_q4_0), batch prefill, add_rmsnorm, iswa_combine, async transfers
- CUDA: uploads Q4_0, Q4_1, Q8_0, Q6_K, F16, BF16, F32 (expanded from Q4_0/Q8_0 only)
- 13 JIT SSE2 kernels (added: gelu, layernorm, q8_0_gemv, q4_0_q8_0_gemv)
- `--axiom-gpu` flag: runs Phase 3/5 matrix ops on the CUDA device
- `--ctx-size`: user override for context window size
- `--log-level`: verbosity control (0=quiet to 3=trace)
- OTT readiness report (`ott_readiness_report.json`) with subsection flags
- JSON axiom report: `phase5_geodesic.oracle_target_count`, `warp_points_accumulated`
Host-mode runtime + CUDA GPU offload + speculative execution. First release that runs on Windows/Linux as a native host application without a bootable kernel image. Introduced CUDA GPU dispatch (RTX 4070: 29% decode speedup over CPU), five speculative execution techniques, and an HTTP API server for programmatic access.
Full hardware abstraction layer for running as a host process:
- Memory-mapped model loading (GGUF mmap, no copy into heap)
- Native POSIX/Win32 threads replacing bare-metal SMP dispatch
- `host/main.c` CLI: flags for prompt, token count, temperature, GPU mode
- Cross-platform build: `build_host.ps1` / `build_host.sh`
Selective dispatch to RTX/Quadro/A-series GPUs via CUDA runtime:
- Threshold: `out_dim ≥ 8192` (captures all large projection layers)
- Gemma-4 E2B: GPU decode ~14.5 tok/s at launch; improves to 92.5 tok/s by v0.5
- `cudaMemcpy` weight staging on first call; cached for subsequent tokens
- Fallback: CPU AVX2 path for sub-threshold layers
| Technique | Principle |
|---|---|
| Adaptive Precision Cascade (APC) | Low entropy inputs fast-path at INT16; high entropy re-runs at FP32 |
| Speculative Layer Fusion (SLF) | Skip matmul when layer input signature matches cached activation |
| Entropy-Aware Neuron Pruning (EANP) | Zero-entropy neurons pruned at runtime (no retraining) |
| Compute DAG Scheduling | Tomasulo-inspired tensor dependency DAG with resource ordering |
| Confidence-Gated Early Exit | Execution depth proportional to input difficulty |
These techniques operate within the SNE engine (`runtime/nn/speculative.c`) as
microarchitecture-level acceleration of neural inference, independent of the
OTT speculative decode path.
REST API on `localhost:8080`:
- `POST /v1/generate` — single-turn text completion
- `POST /v1/chat` — multi-turn conversation (OpenAI-compatible)
- `GET /v1/models` — list loaded models
- `GET /v1/version` — runtime version string
Added GGUF loaders for: Qwen2.5, LLaMA 3, Gemma 2, SmolLM 2, Mistral
Placeholder geometry survey with architecture-heuristic manifold ID and surrogate curvature metrics. Not yet using real model weights. Serves as integration scaffold for Beta-2 real-geometry implementation.
| Metric | v0.3.0 | v0.4.0 | Improvement |
|---|---|---|---|
| Decode speed (CPU) | 162 ms/tok | 138 ms/tok | 1.2× |
| Decode speed (GPU) | N/A | 69 ms/tok | GPU enabled |
| Host binary size | N/A | ~1.1 MB | Host target |
| Supported models | 2 | 7 | +5 architectures |
2.8× performance improvement. Decode speed improved from 454 ms/tok to 162 ms/tok through SMP parallel GEMV, JIT-compiled forward kernels, and critical bug fixes in both the JIT loop counters and the SMP trampoline.
Page tables relocated from 0x1000 to 0x10000 (18 pages). The original address
collided with the BIOS Data Area, causing AP bootstrap failures. All 4 CPUs now
come online reliably.
All JIT forward kernels (`vadd`, `dot`, `vmul`, `rmsnorm`) emitted `vecs/2` as the
loop count instead of `vecs`. This halved the computation, producing incorrect results
for every JIT-accelerated operation.
Changed AP entry from `jmp rax` to `call rax` to ensure the 16-byte stack alignment
required by the System V ABI. Misalignment caused SSE2 `movaps` faults on APs.
Six native x86_64 kernels lazy-compiled on first LLM inference:
- `vadd` (dim=3072) — residual connections
- `dot` (head_dim=96) — attention score computation
- `axpy` (head_dim=96) — attention value accumulation
- `fused_silu_mul` (ff_dim=8192) — FFN gate ⊙ up projection
- `rope` (head_dim=96) — rotary position encoding
- `rmsnorm` (dim=3072) — RMS normalization
Emitted into a 2 MB W^X code pool (max 64 concurrent buffers).
- `smp_dispatch()` partitions GEMV rows across all online CPUs
- Supports Q4_0 and Q8_0 fused AVX2 GEMV paths
- Dispatches when `ncpu > 1 && out_dim >= 64`
- BSP + APs synchronized via `smp_wait_all()`
7 new AVX2 integer SIMD instruction emitters added to the JIT engine. Integer Q4×Q8 GEMV compiler implemented (disabled pending correctness verification).
| Metric | v0.2.0 | v0.3.0 | Improvement |
|---|---|---|---|
| Decode speed | 454 ms/tok | 162 ms/tok | 2.8× faster |
| CPUs used | 1 | 4 | SMP dispatch |
| JIT kernels | 0 | 6 | Forward pass JIT |
| JIT pool | 1 MB | 2 MB | Doubled capacity |
First coherent LLM inference achieved. Phi-3.5 Mini Instruct (3.8B params, Q4_0) now generates correct English text on bare-metal x86_64, running under QEMU WHPX at ~800 ms/tok (454 ms/tok decode, 5.5s prefill for 12 tokens).
Prompt: "What is an operating system?"
Output: "An operating system (OS) is a complex piece of software that man..."
This release fixes critical numerical bugs in quantized inference that produced garbage output since the initial LLM integration.
All Q4_0 (4-bit quantized) dequantization code used an interleaved nibble layout instead of the GGML standard layout. For each 32-element block packed into 16 bytes:
- Wrong (interleaved): `out[2*j] = lo_nibble, out[2*j+1] = hi_nibble`
- Correct (GGML standard): `out[j] = lo_nibble, out[j+16] = hi_nibble`
This caused element-order corruption at every F32 boundary (RMSNorm multiply, residual connections), compounding through all 32 transformer layers. Aggregate statistics (min/max/sum) were identical since values were just permuted within blocks, making the bug invisible to earlier verification.
Files fixed:
- `runtime/nn/llm.c` — `llm_embed()` Q4_0 case
- `runtime/nn/llm.c` — `q4_0_dot32()` (SSE2 + aarch64 paths)
- `runtime/nn/llm.c` — `q4_1_dot32()` (SSE2 + aarch64 paths)
- `runtime/nn/llm.c` — `q4_0_dot32_avx2()` (AVX2 8-wide path)
- `runtime/nn/llm.c` — `llm_gemv_q4_fused_avx2()` (4-row batched GEMV)
- `runtime/nn/llm.c` — `llm_gemv_q4_fused_range_avx2()` (parallel worker GEMV)
- `runtime/nn/llm.c` — AVX2 helper replaced: `q4_unpack_v8f` → `q4_unpack_lo_v8f` + `q4_unpack_hi_v8f`
Hardcoded `1e-6f` replaced with a model-specific epsilon loaded from GGUF metadata
(`general.rms_norm_eps`). Phi-3.5 uses `1e-5`.
Complete rewrites of `sinf`, `cosf`, `expf`, `logf`, `sqrtf` — the custom bare-metal
implementations had catastrophic precision errors affecting RoPE frequency computation
and softmax normalization.
- Loaded `rope_factors_short` and `rope_factors_long` tensors from GGUF
- Applied frequency scaling in `llm_rope_precompute()` based on position vs original context length (4096 for Phi-3.5)
- Enables correct positional encoding for extended-context models
- `runtime/nn/flash_attn.c` — Flash Attention kernel interface
- `runtime/nn/paged_attn.c` — PagedAttention (vLLM-style) interface
- `runtime/nn/safetensors.c` — Safetensors format loader
- `runtime/nn/onnx.c` — ONNX Runtime integration interface
- `runtime/compute/vulkan_compute.c` — Vulkan/WebGPU compute backend interface
- `runtime/pseudocode/pseudo_stdlib.c` — Pseudocode standard library
- `kernel/drivers/dma/pcie_dma.c` — PCIe DMA engine
- `kernel/net/distributed.c` — Distributed inference networking
- `boot/uefi_stub.c` — UEFI boot support
| Metric | Before | After |
|---|---|---|
| Decode speed | N/A (garbage) | 454 ms/tok |
| Prefill (12 tok) | N/A | 5,475 ms |
| End-to-end (16 tok) | N/A | 12,793 ms |
| Model load | 5.5s | 5.5s |
| Binary size | ~808 KB | ~808 KB |
All values now match a Python/NumPy reference implementation exactly:
| Checkpoint | Python | TensorOS | Match |
|---|---|---|---|
| Embedding abssum | 72.35 | 72.35 | ✅ |
| Embed[0] | -0.009949 | -0.009949 | ✅ |
| Q[0] after GEMV | -0.419630 | -0.419630 | ✅ |
| L0 output min | -3.5961 | -3.5961 | ✅ |
| L0 output max | 2.5063 | 2.5063 | ✅ |
| L0 output abssum | 135.21 | 135.21 | ✅ |
Previously, Q[0] was -0.014042 (30× too small) and L0 output range was [-0.37, 0.37] instead of [-3.60, 2.51] — a 10× dynamic range loss.
- Multiboot1 bootloader with x86_64 long mode + SSE2
- SMP bootstrap (INIT-SIPI-SIPI)
- Tensor-aware memory manager (heap + arena + slab)
- GGUF model format parser
- Complete transformer forward pass
- Q4_0, Q4_1, Q6_K, Q8_0 quantization support
- AVX2+FMA SIMD acceleration with 4-row batched GEMV
- x86_64 JIT code generator for Q8_0 GEMV kernels
- SentencePiece and BPE tokenizers
- Temperature sampling with top-k and nucleus filtering
- Virtio-blk and virtio-net drivers
- ARP/IPv4/UDP/ICMP network stack
- AI shell with 20+ commands
- Pseudocode JIT compiler (lexer, parser, IR, 4-tier optimization)
- Model package manager
- ARM64 / Raspberry Pi 4 boot stub