TurboQuant 3-bit KV Cache Compression for vLLM on x86 Discrete GPUs
CUDA-accelerated 3-bit Lloyd-Max KV cache compression for x86 discrete GPUs. Achieves a 4.92x compression ratio at head dimension 128 (5.12x at 256) with 0.983 cosine similarity, and encodes 20x faster than the PyTorch fallback via fused CUDA kernels. Works with any vLLM model whose KV cache writes go through reshape_and_cache_flash(): standard attention, MoE, and GDN hybrids today, with MLA and Mamba/SSM state compression planned.
This repo is separate from the GB10 ARM version, which uses pure numpy on unified memory. It targets x86 discrete GPUs (RTX 6000, RTX 4090, A100, H100, etc.), where the data lives in GPU VRAM and compression must run on-GPU.
| Metric | Value |
|---|---|
| Compression ratio | 4.92x at D=128 (256 B bf16 → 52 B); 5.12x at D=256 (512 B → 100 B) |
| Reconstruction quality | 0.983 cosine similarity |
| Encode speed (CUDA) | 0.006 ms / 331M vectors/s |
| Decode speed (CUDA) | 0.004 ms / 500M+ vectors/s |
| CUDA vs PyTorch speedup | 20x encode, 27x decode |
| Live test | 114,600 cache writes, 0 errors |
| Throughput (RTX 6000) | 32.0 tok/s with compression (46% faster than GB10) |
| Model tested | Qwen3.5-35B-A3B (MoE + GDN hybrid) |
Based on TurboQuant (Zandieh et al., Google, 2025):
KV vector [128 dim, bf16] → L2 norm → √D scale → Lloyd-Max 3-bit → bit-pack
256 bytes (original) → 4 bytes radius + 48 bytes packed = 52 bytes
Compression: 4.92x | Cosine similarity: 0.983 | MSE: 0.034
For D=256 (GB10 Qwen3.5): 512B → 100B = 5.12x
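The pipeline above (norm, √D scale, 3-bit Lloyd-Max) can be sketched in plain Python. This is an illustrative reference, not the repo's CUDA or PyTorch code: the function names `encode`, `decode`, and `cosine` are hypothetical, the centroid and boundary tables are the ones listed in this README, and bit-packing is left out.

```python
import math
import random

# Lloyd-Max levels for N(0,1) with 8 levels, as listed in this README.
CENTROIDS = [-2.152, -1.344, -0.756, -0.245, 0.245, 0.756, 1.344, 2.152]
BOUNDARIES = [-1.748, -1.050, -0.501, 0.000, 0.501, 1.050, 1.748]


def encode(vec):
    """Return (radius, codes): the L2 norm and one 3-bit index per coordinate."""
    d = len(vec)
    radius = math.sqrt(sum(x * x for x in vec))
    if radius == 0.0:
        return 0.0, [0] * d  # decode() maps a zero radius back to zeros
    scale = math.sqrt(d) / radius  # unit-normalize, then stretch to ~N(0,1)
    codes = []
    for x in vec:
        y = x * scale
        # Index = number of boundaries strictly below y (0..7).
        codes.append(sum(1 for b in BOUNDARIES if y > b))
    return radius, codes


def decode(radius, codes):
    """Look up each centroid and undo the sqrt(D) scaling."""
    scale = radius / math.sqrt(len(codes))
    return [CENTROIDS[c] * scale for c in codes]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


random.seed(0)
v = [random.gauss(0.0, 1.0) for _ in range(128)]
radius, codes = encode(v)
similarity = cosine(v, decode(radius, codes))
```

For a Gaussian-like input, `similarity` should land near the 0.983 reported above; the exact value depends on the vector.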
| Method | Bits | Ratio | Cosine | Reference |
|---|---|---|---|---|
| KIVI (2024) | 2 | 4x | ~0.95 | arxiv:2402.02750 |
| PolarQuant (2025) | 3 | 4.2x | ~0.98 | arxiv:2502.02617 |
| TurboQuant (2025) | 3 | 5.12x | 0.983 | arxiv:2504.19874 |
| ManthanQuant x86 | 3 | 5.12x | 0.983 | This work (CUDA kernels on discrete GPUs) |
| | GB10 (ARM Unified Memory) | x86 (Discrete VRAM) |
|---|---|---|
| Compression runs on | ARM CPU (numpy) | GPU (CUDA kernels) |
| `.cpu()` cost | Free (shared memory) | Expensive (PCIe) |
| CUDA kernel conflicts | Yes (Triton/Flash) | No (separate VRAM) |
| Encode speed | ~22 tok/s (numpy) | 331M vec/s (CUDA) |
| Repository | manthanquant | This repo |
┌─────────────────────────────────────┐
│ vLLM Model Forward (any model) │
│ Llama, Qwen, Gemma, Mamba, MLA... │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│ Attention Backend (any) │
│ Flash / Triton / GDN / Mamba / MLA │
└──────────────┬──────────────────────┘
               │ writes K,V to paged cache
               ▼
┌─────────────────────────────────────┐
│ reshape_and_cache_flash() │ ← ALL standard attention writes here
│ (vllm._custom_ops) │
└──────────────┬──────────────────────┘
               │
               ▼
╔═════════════════════════════════════╗
║ ManthanQuant TurboQuant (HOOK) ║ ← WE INTERCEPT HERE
║ ║
║ HOT tier: bf16 paged cache ║ vLLM uses this for attention
║ COLD tier: 3-bit compressed ║ 5.12x smaller shadow cache
║ ║
║ Block written → compress to COLD ║
║ (on GPU, 0.006ms per block) ║
╚═════════════════════════════════════╝
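The hook sits below the attention backends, so any architecture whose backend writes K/V through reshape_and_cache_flash() is covered without per-backend patches. MLA models write through concat_and_cache_mla instead and are not hooked yet.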
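A minimal sketch of the interception idea, using a stand-in namespace rather than the real `vllm._custom_ops` module. The helper name `install_write_hook`, the `_mq_hooked` marker, and the argument layout of the fake op are assumptions made for illustration; the actual patch lives in `patch.py` and may differ.

```python
import types


def install_write_hook(ops_module, on_write, name="reshape_and_cache_flash"):
    """Wrap ops_module.<name> so on_write sees every call after the original runs."""
    original = getattr(ops_module, name)
    if getattr(original, "_mq_hooked", False):
        return original  # already wrapped; avoid double interception

    def wrapper(*args, **kwargs):
        result = original(*args, **kwargs)  # HOT tier: the normal bf16 cache write
        on_write(*args, **kwargs)           # COLD tier: shadow compression
        return result

    wrapper._mq_hooked = True
    setattr(ops_module, name, wrapper)
    return wrapper


# Stand-in for the real ops module (hypothetical signature: key, value, ...).
hot_writes = []
fake_ops = types.SimpleNamespace(
    reshape_and_cache_flash=lambda key, value, *rest: hot_writes.append((key, value))
)

cold_writes = []
install_write_hook(fake_ops, lambda key, value, *rest: cold_writes.append((key, value)))
fake_ops.reshape_and_cache_flash("k0", "v0", "key_cache", "value_cache", "slots")
```

One general Python caveat: a module that ran `from ... import reshape_and_cache_flash` before the patch keeps its own reference to the unwrapped function, so the wrapper has to be installed before such imports happen.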
# Clone
git clone https://github.com/atcuality2021/manthanquant-x86.git
cd manthanquant-x86
# Install (PyTorch-only, works immediately)
pip install -e .
# Install with CUDA kernels (20x faster, recommended)
MANTHANQUANT_BUILD_CUDA=1 pip install -e . --no-build-isolation
# If your system GCC is too new for nvcc:
CC=gcc-14 CXX=g++-14 CUDA_HOME=/usr/local/cuda-12.9 \
MANTHANQUANT_BUILD_CUDA=1 TORCH_CUDA_ARCH_LIST="8.9 12.0" \
pip install -e . --no-build-isolation
python -m manthanquant.serve serve /path/to/model \
--port 8200 \
--trust-remote-code \
--enforce-eager \
--gpu-memory-utilization 0.79 \
--max-model-len 16384 \
--api-key YOUR_KEY
The launcher automatically sets up `sitecustomize.py` deferred patching so the compression hooks activate in vLLM's EngineCore child process.
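Once the server is running, a request can be sent with the standard library alone. The sketch below assumes the launcher exposes vLLM's usual OpenAI-compatible `/v1/completions` endpoint on the port and API key from the command above, and that the served model name defaults to the model path; the helper names `build_completion_request` and `send` are hypothetical.

```python
import json
import urllib.request


def build_completion_request(prompt, model="/path/to/model",
                             base_url="http://localhost:8200",
                             api_key="YOUR_KEY", max_tokens=100):
    """Build (but do not send) a POST request for the completions endpoint."""
    body = json.dumps({
        "model": model,          # assumed: model name equals the served path
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }).encode("utf-8")
    return urllib.request.Request(
        base_url + "/v1/completions",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
    )


def send(request, timeout=60):
    """Send the request and return the generated text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["choices"][0]["text"]


req = build_completion_request("Explain KV cache compression in one sentence.")
```

Calling `send(req)` against a live server returns the completion text; nothing is sent until then.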
Real measurements on NVIDIA RTX PRO 6000 Blackwell. All results from live Qwen3.5-35B-A3B inference.
Both nodes ran Qwen3.5-35B-A3B with unique prompts (no cache hits) at temperature=0.
| Test | RTX 6000 + MQ (tok/s) | GB10 Baseline (tok/s) | Delta |
|---|---|---|---|
| Math (100 tok) | 31.6 | 21.4 | +48% |
| Code Generation (300 tok) | 32.2 | 22.1 | +46% |
| Reasoning (200 tok) | 32.0 | 21.7 | +47% |
| Summarization (300 tok) | 31.6 | 22.1 | +43% |
| Long Generation (500 tok) | 32.4 | 22.1 | +47% |
| Architecture Design (400 tok) | 31.9 | 22.1 | +44% |
| Average | 32.0 | 21.9 | +46% |
| Node | Tokens | Time | Tok/s |
|---|---|---|---|
| RTX 6000 + ManthanQuant | 1,000 | 31.62s | 31.6 |
| GB10 Baseline | 1,000 | 44.83s | 22.3 |
| Node | Completion | Prompt | Time | Tok/s |
|---|---|---|---|---|
| RTX 6000 + MQ | 500 | 125 | 16.06s | 31.1 |
| GB10 Baseline | 500 | 127 | 22.73s | 21.9 |
200 tokens per user, unique prompts, ManthanQuant 3-bit compression active.
| Users | Success | Agg tok/s | Per-user tok/s | Wall Time |
|---|---|---|---|---|
| 1 | 1/1 | 31.4 | 31.4 | 6.3s |
| 2 | 2/2 | 58.9 | 29.4 | 6.8s |
| 4 | 4/4 | 115.9 | 28.9 | 6.9s |
| 6 | 6/6 | 86.8 | 14.4 | 13.8s |
| 8 | 8/8 | 116.6 | 14.5 | 13.7s |
| 10 | 10/10 | 97.1 | 9.7 | 20.6s |
| 15 | 15/15 | 109.9 | 7.3 | 27.3s |
| 20 | 20/20 | 117.6 | 5.8 | 34.0s |
Sweet spot: 4 users, with 116 agg tok/s, 28.9 per-user tok/s, and 0 errors. Peak aggregate throughput: about 117 tok/s, reached at 8 and at 20 users (it dips to 97-110 tok/s at 10-15 users); all requests succeed.
| Users | Per-user tok/s | Time for 100 tok | Time for 500 tok | Time for 1000 tok |
|---|---|---|---|---|
| 1 | 31.4 | 3.2s | 16s | 32s |
| 4 | 28.9 | 3.5s | 17s | 35s |
| 8 | 14.5 | 6.9s | 34s | 69s |
| 15 | 7.3 | 13.7s | 68s | 137s |
| Users | RTX 6000 + MQ (agg tok/s) | GB10 + MQ (agg tok/s) | RTX Advantage |
|---|---|---|---|
| 1 | 31.4 | 21.7 | +45% |
| 4 | 115.9 | 61.1 | +90% |
| 8 | 116.6 | 101.3 | +15% |
| 15 | 109.9 | 95.7 | +15% |
| 20 | 117.6 | 88.5 | +33% |
| Operation | PyTorch (Phase 1) | CUDA Kernel (Phase 2) | Speedup |
|---|---|---|---|
| Encode (N=2048, D=128) | 0.122 ms | 0.006 ms | 19.8x |
| Decode (N=2048, D=128) | 0.110 ms | 0.004 ms | 26.6x |
| Encode throughput | 16.7M vec/s | 331.7M vec/s | 19.8x |
[ManthanQuant pid=491422] calls=152800 compressed=986 ratio=5.12x saved=49.59MB
- 152,800 cache write operations intercepted
- 986 blocks compressed to COLD tier
- 5.12x compression ratio (matches the byte count for D=256: 512 B → 100 B)
- 49.59 MB saved in shadow cache
- 0 errors across entire benchmark session (20+ concurrent users)
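For monitoring, the stats line can be parsed with a regular expression. This sketch assumes the log format is exactly the one shown in the sample line above; `parse_stats` is a hypothetical helper, not part of the package.

```python
import re

# Pattern derived from the single sample line in this README.
LOG_RE = re.compile(
    r"\[ManthanQuant pid=(?P<pid>\d+)\]\s+"
    r"calls=(?P<calls>\d+)\s+"
    r"compressed=(?P<compressed>\d+)\s+"
    r"ratio=(?P<ratio>[\d.]+)x\s+"
    r"saved=(?P<saved_mb>[\d.]+)MB"
)


def parse_stats(line):
    """Return the counters as a dict, or None if the line does not match."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    return {
        "pid": int(m.group("pid")),
        "calls": int(m.group("calls")),
        "compressed": int(m.group("compressed")),
        "ratio": float(m.group("ratio")),
        "saved_mb": float(m.group("saved_mb")),
    }


sample = "[ManthanQuant pid=491422] calls=152800 compressed=986 ratio=5.12x saved=49.59MB"
stats = parse_stats(sample)
```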
Lloyd-Max minimizes MSE for a given source distribution and number of levels. For N(0,1) with 8 levels (3 bits):
- Centroids: `[-2.152, -1.344, -0.756, -0.245, 0.245, 0.756, 1.344, 2.152]`
- Boundaries: `[-1.748, -1.050, -0.501, 0.000, 0.501, 1.050, 1.748]`
- MSE: 0.03455
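These values can be reproduced with the classic Lloyd iteration for a standard normal source: alternate between midpoint boundaries and conditional-mean centroids. The sketch below is a reference calculation under that assumption, not code from this repo; the function name, starting grid, and iteration count are arbitrary choices.

```python
import math


def pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def cdf(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def lloyd_max_gaussian(levels=8, iters=500):
    """Return (centroids, boundaries, mse) for N(0,1) with the given level count."""
    # Start from evenly spaced centroids on [-2, 2].
    cents = [-2.0 + 4.0 * i / (levels - 1) for i in range(levels)]
    bounds, probs = [], []
    for _ in range(iters):
        # Nearest-neighbour rule: boundaries are midpoints between centroids.
        bounds = [(a + b) / 2.0 for a, b in zip(cents, cents[1:])]
        edges = [-math.inf] + bounds + [math.inf]
        cells = list(zip(edges, edges[1:]))
        probs = [cdf(hi) - cdf(lo) for lo, hi in cells]
        # Centroid rule: conditional mean of N(0,1) on each cell.
        cents = [(pdf(lo) - pdf(hi)) / p for (lo, hi), p in zip(cells, probs)]
    # With conditional-mean centroids, MSE = E[x^2] - E[q^2] = 1 - sum(p * c^2).
    mse = 1.0 - sum(p * c * c for p, c in zip(probs, cents))
    return cents, bounds, mse


centroids, boundaries, mse = lloyd_max_gaussian()
```

The results should agree with the listed centroids, boundaries, and MSE to about three decimal places.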
After L2 normalization, each element has std ≈ 1/sqrt(D). Lloyd-Max centroids are optimized for N(0,1). Scaling by sqrt(D) maps elements to the expected distribution.
For D=128, bf16:
Original: 128 × 2 = 256 bytes
Compressed: 4 (radius) + ceil(128×3/32)×4 = 4 + 48 = 52 bytes
Ratio: 256 / 52 = 4.92x
For D=256, bf16:
Original: 256 × 2 = 512 bytes
Compressed: 4 + ceil(256×3/32)×4 = 4 + 96 = 100 bytes
Ratio: 512 / 100 = 5.12x
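The same arithmetic as a small helper, handy for other head dimensions. The function names are hypothetical; the layout (one 4-byte radius plus 3-bit codes padded to whole int32 words) is the one described above.

```python
import math


def compressed_bytes(dim, bits=3, radius_bytes=4):
    """Bytes per vector: radius plus packed codes rounded up to int32 words."""
    words = math.ceil(dim * bits / 32)
    return radius_bytes + 4 * words


def compression_ratio(dim, elem_bytes=2, bits=3, radius_bytes=4):
    """Ratio of the original size (bf16 by default) to the compressed size."""
    return dim * elem_bytes / compressed_bytes(dim, bits, radius_bytes)


ratio_128 = compression_ratio(128)  # 256 / 52
ratio_256 = compression_ratio(256)  # 512 / 100
```

The ratio rises with D because the fixed 4-byte radius is amortized over more coordinates; its ceiling is 16/3 ≈ 5.33x, the ratio of 16-bit to 3-bit elements.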
cos(v, q) ≈ 1 - ε/2 = 1 - 0.0345/2 ≈ 0.983 (ε = per-coordinate Lloyd-Max MSE)
Empirically measured: 0.983 (matches the estimate)
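A short sketch of where the 1 − ε/2 figure comes from. It assumes the scaled coordinates x have unit variance, that Q satisfies the Lloyd-Max centroid condition (each output level is the conditional mean of its cell), and that D is large enough for sums to be close to their expectations.

```latex
\begin{aligned}
\mathbb{E}[x\,Q(x)] &= \mathbb{E}[Q(x)^2] = 1 - \varepsilon
  && \text{(centroid condition, } \mathbb{E}[x^2] = 1\text{)} \\
\cos(v, q) &\approx \frac{D\,(1-\varepsilon)}{\sqrt{D}\,\sqrt{D\,(1-\varepsilon)}}
  = \sqrt{1-\varepsilon} \approx 1 - \frac{\varepsilon}{2} \\
\varepsilon = 0.0345 &\;\Rightarrow\; \sqrt{0.9655} \approx 0.9826,
  \qquad 1 - \tfrac{0.0345}{2} = 0.98275
\end{aligned}
```

Both expressions round to 0.983, in line with the measured value.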
3-bit values can straddle int32 word boundaries. For example, coordinate 10 starts at bit_pos=30 and needs bits 30-32: bits 30-31 of the first word and bit 0 of the next. The encoder splits boundary-crossing values:
- Lower bits → primary word via `atomicOr`
- Upper bits → next word via `atomicOr`

This was a critical bug fix: without it, ~12 out of 128 coordinates get corrupted, dropping cosine from 0.98 to 0.86.
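The split can be illustrated with a sequential Python sketch. It uses plain `|=` where the CUDA kernel uses `atomicOr` (needed there because several threads write into the same word); `pack3` and `unpack3` are hypothetical names, and the bit layout (coordinate i at bit 3·i, least significant bit first) is an assumption consistent with the bit_pos=30 example above.

```python
MASK32 = 0xFFFFFFFF


def pack3(codes):
    """Pack 3-bit codes into a list of 32-bit words, splitting straddlers."""
    n_words = (len(codes) * 3 + 31) // 32
    words = [0] * n_words
    for i, c in enumerate(codes):
        w, off = divmod(i * 3, 32)
        words[w] |= (c << off) & MASK32       # lower bits into the primary word
        if off > 29:                          # value crosses the word boundary
            words[w + 1] |= c >> (32 - off)   # upper bits into the next word
    return words


def unpack3(words, n):
    """Inverse of pack3: recover n 3-bit codes."""
    codes = []
    for i in range(n):
        w, off = divmod(i * 3, 32)
        v = words[w] >> off
        if off > 29:                          # fetch the spilled upper bits
            v |= words[w + 1] << (32 - off)
        codes.append(v & 0b111)
    return codes


codes = [(5 * i + 3) % 8 for i in range(128)]
words = pack3(codes)
restored = unpack3(words, 128)
```

Dropping the `if off > 29` branch in `pack3` reproduces the class of bug described above: the straddling coordinates lose their upper bits.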
| Architecture | Models | Status |
|---|---|---|
| Standard Attention | Llama, Gemma, Mistral | Supported |
| MoE + GDN Hybrid | Qwen3.5-35B-A3B | Live tested |
| MoE Standard | Mixtral, DBRX | Supported |
| MLA | DeepSeek-V2/V3 | Planned |
| Mamba/SSM | Mamba, Jamba | Planned |
Works with any model that uses vLLM's reshape_and_cache_flash() for KV cache writes.
manthanquant-x86/
├── manthanquant/
│   ├── __init__.py              # Package (v0.2.0)
│   ├── serve.py                 # Drop-in vLLM launcher with compression
│   ├── core/
│   │   ├── __init__.py
│   │   └── quantizer.py         # TurboQuant encoder/decoder (PyTorch + CUDA)
│   ├── vllm_integration/
│   │   ├── __init__.py
│   │   ├── compressed_cache.py  # Two-tier HOT/COLD cache manager
│   │   └── patch.py             # vLLM hooks (deferred patching)
│   └── tests/
│       ├── __init__.py
│       └── test_quantizer.py    # 24 tests (correctness + quality + cache)
├── csrc/
│   ├── turboquant_kernel.cu     # CUDA encode/decode kernels (SM 8.9, 12.0)
│   └── bindings.cpp             # pybind11 bindings
├── scripts/
│   ├── vllm_serve_with_compression.py  # Launcher with sitecustomize patching
│   ├── benchmark_full.py        # Full benchmark suite
│   ├── live_test.py             # Quick live test
│   └── install_autoload.py      # .pth file installer
├── benchmarks/
│   └── benchmark_manthanquant_x86_20260403.md
├── setup.py                     # With optional CUDA build
├── LICENSE                      # Apache 2.0
└── README.md
- 3-bit Lloyd-Max encode/decode with CUDA kernels (20x speedup)
- Shadow compressed cache with HOT/COLD two-tier architecture
- vLLM integration via `sitecustomize` deferred patching (works in EngineCore child process)
- 114,600 cache writes, 458 blocks compressed, 0 errors on Qwen3.5-35B-A3B
- 24/24 unit tests passing with CUDA backend
- Supports SM 8.9 (Ada/RTX 4090) and SM 12.0 (Blackwell/RTX 6000)
- Memory savings: Shadow cache runs alongside bf16 paged cache (no blocks freed yet)
- Compressed decode: Attention still reads from bf16; compressed cache is not used for attention
- MLA model support: DeepSeek models need a `concat_and_cache_mla` hook
- Mamba/SSM state compression: GDN/linear attention state not compressed
| Version | Status | Description |
|---|---|---|
| v0.1 | Done | Phase 1: PyTorch encoder/decoder, vLLM hooks, 24 tests |
| v0.2 | Current | Phase 2: CUDA kernels (20x speedup), full benchmark |
| v0.3 | Next | Hot/cold LRU eviction — free bf16 blocks, decompress on demand |
| v0.4 | Planned | Fused decompress+attend kernel (8x over fp32, per TurboQuant paper) |
| v0.5 | Planned | MLA + Mamba/SSM state compression |
| v1.0 | Planned | Production-ready with real memory savings |
| Component | Details |
|---|---|
| Hardware | NVIDIA RTX PRO 6000 Blackwell (96 GB discrete, SM 12.0) |
| Model | Qwen3.5-35B-A3B (MoE + GDN hybrid, 35B total, ~3B active) |
| vLLM | v0.19.0 |
| Python | 3.13 |
| CUDA | 12.9 (nvcc), 12.8 (PyTorch) |
| GCC | 14.3 |
- TurboQuant: Zandieh et al., "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate" (2025). arxiv:2504.19874
- QJL: Zandieh et al., "QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead" (2024). arxiv:2406.03482
- PolarQuant: Han et al., "PolarQuant: Quantizing KV Caches with Polar Transformation" (2025). arxiv:2502.02617
- Lloyd-Max: S.P. Lloyd, "Least squares quantization in PCM" (1982). J. Max, "Quantizing for minimum distortion" (1960).
- PagedAttention: Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (2023).
- Universal vLLM Hook: Intercepts at the `reshape_and_cache_flash()` level, so one hook covers every architecture whose backend writes K/V through that call (standard attention, MoE, GDN hybrid). Previous approaches required per-backend monkey-patching.
- Boundary-Safe Bit-Packing: 3-bit values crossing int32 word boundaries are split across words. Without this, ~10% of coordinates corrupt, dropping cosine from 0.98 to 0.86.
- Fused CUDA Kernels: Single-launch encode (norm→scale→quantize→bitpack) and decode (unpack→lookup→scale) reach 331M vectors/s, 20x faster than vectorized PyTorch ops.
- Sitecustomize Deferred Patching: Patches vLLM in child processes (EngineCore) by installing a temporary `sitecustomize.py` that activates when `_custom_ops` loads. Avoids circular imports and multiprocessing spawn issues.
- Cross-Architecture Testing: Benchmarked on both RTX 6000 Blackwell (x86) and GB10 (ARM) with the same model. The RTX 6000 with compression enabled averaged 46% higher single-stream throughput than the GB10 baseline.
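The deferred-patching idea can be sketched with a one-shot import hook: register a finder on `sys.meta_path` that waits for a target module, lets it load normally, then runs a patch callback. Python imports `sitecustomize` automatically at interpreter startup, which is what lets such a hook reach spawned child processes. This is an illustrative mechanism only; `DeferredPatcher` and `patch_when_imported` are hypothetical names, the repo's `patch.py` may work differently, and the demo targets the stdlib module `colorsys` instead of vLLM.

```python
import importlib.abc
import importlib.util
import sys


class DeferredPatcher(importlib.abc.MetaPathFinder):
    """Run patch(module) right after `target` is imported for the first time."""

    def __init__(self, target, patch):
        self.target = target
        self.patch = patch

    def find_spec(self, fullname, path, target=None):
        if fullname != self.target:
            return None
        # One-shot: step aside so the regular finders resolve the module.
        sys.meta_path.remove(self)
        spec = importlib.util.find_spec(fullname)
        if spec is None or spec.loader is None:
            return None
        original_exec = spec.loader.exec_module

        def exec_then_patch(module):
            original_exec(module)  # run the real module body first
            self.patch(module)     # then apply the hook

        spec.loader.exec_module = exec_then_patch
        return spec


def patch_when_imported(target, patch):
    """Patch now if the module is already loaded, otherwise defer until import."""
    if target in sys.modules:
        patch(sys.modules[target])
    else:
        sys.meta_path.insert(0, DeferredPatcher(target, patch))


# Demo on a small stdlib module standing in for the real target.
sys.modules.pop("colorsys", None)
patched = []
patch_when_imported("colorsys", lambda module: patched.append(module.__name__))
import colorsys  # noqa: E402  (triggers the deferred patch)
```

Deferring the patch this way avoids importing the target early from `sitecustomize.py`, which is one way to sidestep the circular-import problem mentioned above.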
Apache 2.0. See LICENSE.
- Claude Code — AI pair programmer (Anthropic Claude Opus 4.6, 1M context)
- vLLM v0.19 — LLM inference engine
- NVIDIA RTX PRO 6000 Blackwell — 96 GB discrete VRAM
- ManthanQuant GB10 — Sister project for ARM unified memory