Hi, I’m Mahesh 👋

I build CPU-first systems for running large language models - reducing inference cost, memory usage, and latency.

→ 8.6× faster than PyTorch CPU (INT8 + AVX2)
→ 4× lower memory footprint
→ Pure C (no ML frameworks)

All benchmarks are reproducible.

I focus on making LLM inference deployable everywhere - not just on GPUs.


💰 Why this matters

LLM inference runs millions to billions of times a day in production.

Even small efficiency gains translate directly into millions of dollars saved.

Example:

  • GPU inference: $0.002 / request
  • Optimized CPU inference: $0.0003 / request

At 10M requests/day:

→ ~$17,000 saved per day
→ ~$6M saved per year
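
(The arithmetic: ($0.002 - $0.0003) × 10,000,000 requests ≈ $17,000/day; × 365 days ≈ $6.2M/year.)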

I work on making this shift possible.


⚙️ Systems I build

  • Transformer inference engine in C (forward + backward pass)
  • Cache-aware attention kernels (tiled, memory-optimized)
  • INT8/low-bit quantization pipelines
  • AVX2 SIMD-optimized matmul & kernels
  • Arena-based memory allocator (zero fragmentation - sketched after this list)
  • KV-cache optimized for long sequence inference
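
To make the arena idea concrete, here is a minimal bump-pointer sketch in C. It is illustrative only - `arena_t`, `arena_alloc`, and the 64-byte alignment are assumptions of this sketch, not the engine's actual API:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// Bump-pointer arena: one slab up front, no per-allocation bookkeeping.
typedef struct {
    uint8_t *base;  // single upfront allocation
    size_t   cap;   // capacity in bytes (multiple of 64)
    size_t   off;   // current bump offset
} arena_t;

static int arena_init(arena_t *a, size_t cap) {
    cap = (cap + 63) & ~(size_t)63;    // C11 aligned_alloc wants size % align == 0
    a->base = aligned_alloc(64, cap);
    a->cap  = cap;
    a->off  = 0;
    return a->base ? 0 : -1;
}

// 64-byte-aligned bump allocation: tensors start on cache-line boundaries.
static void *arena_alloc(arena_t *a, size_t n) {
    size_t aligned = (a->off + 63) & ~(size_t)63;
    if (n > a->cap - aligned) return NULL;   // out of arena space
    a->off = aligned + n;
    return a->base + aligned;
}

// Reclaim every allocation at once in O(1) - hence zero fragmentation.
static void arena_reset(arena_t *a) { a->off = 0; }

static void arena_destroy(arena_t *a) { free(a->base); a->base = NULL; }
```

Every intermediate tensor comes out of one upfront slab, and `arena_reset` between requests replaces thousands of individual frees with a single pointer reset - which is where the zero-fragmentation property comes from.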

🧪 Proof of work

  • Contributor to ggml-org/llama.cpp

  • Built a CPU-first LLM engine:

    • Explicit memory layout control
    • Quantized inference (see the INT8 sketch after this list)
    • Benchmarked against baseline implementations
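
For a flavor of what the quantized path involves, here is a minimal symmetric per-tensor INT8 quantizer in C. It is a sketch under stated assumptions - absmax scaling, no zero point, per-tensor granularity - and the engine's actual scheme (per-row scales, etc.) may differ:

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

// Symmetric per-tensor INT8 quantization: map the observed absmax onto
// [-127, 127]. Dequantization is then x ~= q[i] * scale.
static float quantize_int8(const float *x, int8_t *q, size_t n) {
    float absmax = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float v = fabsf(x[i]);
        if (v > absmax) absmax = v;
    }
    float scale = absmax / 127.0f;
    float inv   = (scale > 0.0f) ? 1.0f / scale : 0.0f;
    for (size_t i = 0; i < n; i++) {
        float r = roundf(x[i] * inv);
        if (r >  127.0f) r =  127.0f;   // clamp to INT8 range
        if (r < -127.0f) r = -127.0f;
        q[i] = (int8_t)r;
    }
    return scale;  // caller keeps the scale for dequantization
}
```

Storing an int8_t instead of a 32-bit float per weight is where a 4× memory reduction comes from.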

🔬 Current focus

  • Attention performance on CPUs
  • Memory bandwidth vs compute bottlenecks
  • Cache locality (L1/L2/L3 behavior)
  • SIMD utilization efficiency
  • Operator fusion & kernel optimization
  • Auto-vectorization vs hand-written intrinsics (contrasted in the sketch below)
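
The contrast is easiest to see side by side: the same dot product written once for the compiler's auto-vectorizer and once with explicit intrinsics. A minimal sketch, assuming AVX2 + FMA and (for brevity) n divisible by 8:

```c
#include <immintrin.h>
#include <stddef.h>

// Scalar form: correctness reference, relies on the compiler's
// auto-vectorizer (e.g. -O3 -mavx2 -mfma -ffast-math).
static float dot_scalar(const float *a, const float *b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

// Hand-written form: explicit 8-wide FMA lanes, horizontal sum at the end.
static float dot_avx2(const float *a, const float *b, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc);  // acc += va * vb, 8 lanes at once
    }
    // Reduce the 8 lanes to one float.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    lo = _mm_add_ps(lo, hi);
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    return _mm_cvtss_f32(lo);
}
```

With -O3 -mavx2 -mfma -ffast-math a compiler will often auto-vectorize the scalar loop into comparable code; the interesting cases are the ones where it can't prove the reduction is safe to reorder.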

🎯 What I care about

Making LLM systems:

  • Cheaper
  • Faster
  • Deployable on commodity hardware

If you work on ML systems, inference engines, or compilers - let’s connect.

📌 Pinned repositories

  1. llm-c-transformer (C)

     🚀 Fastest CPU LLM Inference Engine (C + INT8 + AVX2) -- 8.6× Faster than PyTorch CPU, 4× Less Memory: custom tensor lib, INT8 post-training quantization, AVX2 SIMD matmul (3.1× faster, 4× less memo…

  2. cpu-transformer-inference-c (C)

     CPU-first Transformer inference engine in pure C with quantization, tiled attention, arena memory allocator, operator fusion, and reproducible benchmarking. A systems-level study of memory-efficien…

  3. llama.cpp (C++; forked from ggml-org/llama.cpp)

     LLM inference in C/C++

  4. cache22-c-server (C)

     High-performance key-value store in C - optimized from 15K to 145K ops/sec using epoll (~10× improvement).