I build CPU-first systems for running large language models - reducing inference cost, memory usage, and latency.
→ 8.6× faster than PyTorch CPU (INT8 + AVX2)
→ 4× lower memory footprint
→ Pure C (no ML frameworks)
All benchmarks are reproducible.
I focus on making LLM inference deployable everywhere - not just on GPUs.
LLM inference runs at the scale of millions to billions of requests.
Even small per-request efficiency gains translate directly into millions of dollars saved.
Example:
- GPU inference: $0.002 / request
- Optimized CPU inference: $0.0003 / request
At 10M requests/day:
→ ~$17,000 saved per day
→ ~$6M saved per year
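A quick back-of-the-envelope check of those numbers (the per-request costs above are illustrative, not measurements from a specific deployment):

```c
/* Savings estimate from the example above. */
#include <stdio.h>

int main(void) {
    double gpu_cost_per_req = 0.002;    /* $ per request, GPU baseline       */
    double cpu_cost_per_req = 0.0003;   /* $ per request, optimized CPU path */
    double requests_per_day = 10e6;     /* 10M requests/day                  */

    double saved_per_day  = (gpu_cost_per_req - cpu_cost_per_req) * requests_per_day;
    double saved_per_year = saved_per_day * 365.0;

    printf("saved per day:  $%.0f\n", saved_per_day);   /* ~$17,000 */
    printf("saved per year: $%.0f\n", saved_per_year);  /* ~$6.2M   */
    return 0;
}
```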
I work on making this shift possible.
- Transformer inference engine in C (forward + backward pass)
- Cache-aware attention kernels (tiled, memory-optimized)
- INT8/low-bit quantization pipelines
- AVX2 SIMD optimized matmul & kernels
- Arena-based memory allocator (zero fragmentation; see the sketch after this list)
- KV-cache optimized for long sequence inference
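To give a flavor of the memory-management piece, here is a minimal arena-allocator sketch; the names and the 64-byte alignment are illustrative assumptions, not the engine's actual API:

```c
/* Minimal bump/arena allocator sketch (illustrative; a real implementation
 * also handles growth, per-layer scratch arenas, and overflow reporting). */
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t *base;   /* start of the arena's backing buffer */
    size_t   size;   /* total capacity in bytes             */
    size_t   used;   /* bytes handed out so far             */
} Arena;

/* Bump-allocate n bytes, 64-byte aligned for SIMD-friendly tensor rows. */
static void *arena_alloc(Arena *a, size_t n) {
    size_t aligned = (a->used + 63) & ~(size_t)63;
    if (aligned + n > a->size) return NULL;   /* out of arena memory */
    a->used = aligned + n;
    return a->base + aligned;
}

/* Free everything at once: one pointer reset, no per-tensor free() calls. */
static void arena_reset(Arena *a) { a->used = 0; }
```

Because every scratch tensor comes out of one arena and is released with a single reset, there is no per-allocation bookkeeping and nothing left behind to fragment.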
Contributor to ggml-org/llama.cpp:
- Fixed integer type inconsistencies in split helpers (PR: ggml-org/llama.cpp#18894)

Built a CPU-first LLM engine:
- Explicit memory layout control
- Quantized inference (see the INT8 sketch after this list)
- Benchmarked against baseline implementations
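For the quantized-inference bullet, here is a minimal sketch of per-row symmetric INT8 quantization, one common scheme; the function name and details are illustrative, not the engine's exact code:

```c
/* Illustrative per-row symmetric INT8 quantization (simplified; real
 * pipelines also pick block sizes and quantize activations at runtime). */
#include <math.h>
#include <stdint.h>

/* Quantize one row of FP32 weights to INT8 with a single scale factor. */
static float quantize_row_int8(const float *src, int8_t *dst, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; i++) {
        float v = fabsf(src[i]);
        if (v > amax) amax = v;
    }
    float scale = amax / 127.0f;                 /* map [-amax, amax] -> [-127, 127] */
    float inv   = scale > 0.0f ? 1.0f / scale : 0.0f;
    for (int i = 0; i < n; i++) {
        dst[i] = (int8_t)lrintf(src[i] * inv);
    }
    return scale;                                 /* kept for dequantization          */
}
```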
Areas I go deep on:
- Attention performance on CPUs
- Memory bandwidth vs compute bottlenecks
- Cache locality (L1/L2/L3 behavior)
- SIMD utilization efficiency
- Operator fusion & kernel optimization
- Auto-vectorization vs hand-written intrinsics (see the sketch below)
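On that last point, this is the kind of side-by-side I mean: a scalar dot product left to the compiler to auto-vectorize versus a hand-written AVX2/FMA version. An illustrative sketch that assumes n is a multiple of 8:

```c
#include <immintrin.h>

/* Scalar reference: relies on the compiler to auto-vectorize (-O3 -mavx2 -mfma). */
static float dot_scalar(const float *a, const float *b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) sum += a[i] * b[i];
    return sum;
}

/* Hand-written AVX2: 8 floats per iteration, explicit FMA accumulation. */
static float dot_avx2(const float *a, const float *b, int n) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc);
    }
    /* Horizontal sum of the 8 accumulator lanes. */
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}
```

Comparing the two (generated assembly and measured throughput) is how I decide when intrinsics are worth the maintenance cost.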
Making LLM systems:
- Cheaper
- Faster
- Deployable on commodity hardware
If you work on ML systems, inference engines, or compilers - let’s connect.
