
feat(kv): replace inverted tiered cache with vLLM v1 mirror model #638

Merged
sriumcp merged 11 commits into inference-sim:main from sriumcp:kv-tiered-v1-mirror
Mar 13, 2026

@sriumcp sriumcp commented Mar 13, 2026

Summary

Replace BLIS's inverted tiered KV cache model with vLLM v1 OffloadingConnector semantics:

  • GPU→CPU mirroring: after each batch step, in-use full blocks are copied to CPU (GPU prefix cache untouched)
  • Targeted CPU→GPU reload: on GPU allocation failure, only prefix-matching blocks are reloaded using hierarchical hash chaining
  • GPU prefix cache preservation: ReleaseKVBlocks no longer triggers offload; freed blocks stay on GPU with hashes intact

Why this matters

The old model actively removed prefix hashes from GPU when offloading to CPU, artificially suppressing GPU cache hit rates and causing pathological offload-reload thrashing. The new model preserves GPU cache and uses CPU as a secondary tier that extends prefix lifetime beyond GPU eviction — matching production vLLM v1 behavior.

Fixes #510, fixes #511

🤖 Generated with Claude Code

sriumcp and others added 11 commits March 12, 2026 20:04
- Add MirrorToCPU(batch []*Request) to KVStore interface
- Implement no-op on KVCacheState (single-tier)
- Add stub on TieredKVCache (full impl in later task)
- R13: both implementations satisfy interface

Co-Authored-By: Claude <[email protected]>
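
The interface change in this commit can be sketched as follows. Only `KVStore`, `KVCacheState`, `TieredKVCache`, and the `MirrorToCPU(batch []*Request)` signature come from the commit message; the `Request` fields, the `mirrorCalls` counter, and the one-method interface are hypothetical simplifications (the real interface has more methods).

```go
package main

import "fmt"

// Request is a hypothetical stand-in for the simulator's request type.
type Request struct {
	ID string
}

// KVStore is trimmed to the one method this commit adds; the other
// methods of the real interface are omitted.
type KVStore interface {
	MirrorToCPU(batch []*Request)
}

// KVCacheState models the single-tier cache. It has no CPU tier,
// so mirroring is a no-op.
type KVCacheState struct{}

func (s *KVCacheState) MirrorToCPU(batch []*Request) {}

// TieredKVCache models the two-tier cache. The counter is hypothetical
// and only makes the stub observable in this sketch.
type TieredKVCache struct {
	mirrorCalls int
}

func (t *TieredKVCache) MirrorToCPU(batch []*Request) {
	t.mirrorCalls++
}

// Compile-time checks that both implementations satisfy the interface.
var (
	_ KVStore = (*KVCacheState)(nil)
	_ KVStore = (*TieredKVCache)(nil)
)

func main() {
	stores := []KVStore{&KVCacheState{}, &TieredKVCache{}}
	batch := []*Request{{ID: "r1"}}
	for _, s := range stores {
		s.MirrorToCPU(batch) // safe to call on either implementation
	}
	fmt.Println("both implementations accept MirrorToCPU")
}
```

The blank-identifier assignments make a missing method a compile error rather than a runtime surprise, which is one common way to enforce a rule like R13.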
- Replace offloadedBlock/cpuTier with cpuBlock/cpuTier
- Hash-keyed map + doubly-linked list for O(1) operations
- Pre-allocated token slices eliminate per-mirror GC pressure
- store/touch/lookup/evict all O(1)
- R3: newCpuTier validates capacity > 0, blockSize > 0
- BC-7: deprecation warning for KVOffloadThreshold
- Remove old TieredKVCache fields (offloadThreshold, offloadCount,
  thrashingCount, clock). Stub old methods for compilation.
- Delete all 14 old tiered_test.go tests (tested inverted semantics)
- KVThrashingRate repurposed to cpuEvictionCount/mirrorCount

Note: store_test.go setupTieredWithLatency tests expected to fail
until Task 7 rewrites the helper.

Co-Authored-By: Claude <[email protected]>
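
A minimal sketch of the CPU tier layout this commit describes: a hash-keyed map plus a doubly-linked list, giving O(1) store, touch, lookup, and evict. The names `cpuTier`, `cpuBlock`, `newCpuTier`, `store`, `touch`, `lookup`, and `evict` follow the commit message; every field name and signature is an assumption, and the sketch copies token slices on store instead of using the pre-allocated slices the commit mentions.

```go
package main

import (
	"container/list"
	"fmt"
)

// cpuBlock is one mirrored block. Field names are hypothetical.
type cpuBlock struct {
	hash   string
	tokens []int
}

// cpuTier is an LRU keyed by block hash. The map gives O(1) lookup;
// the list gives O(1) recency updates and eviction.
type cpuTier struct {
	capacity  int
	blockSize int
	order     *list.List               // front = most recent, back = least recent
	index     map[string]*list.Element // hash -> element holding a *cpuBlock
	evictions int
}

func newCpuTier(capacity, blockSize int) (*cpuTier, error) {
	if capacity <= 0 || blockSize <= 0 {
		return nil, fmt.Errorf("capacity and blockSize must be > 0, got %d and %d", capacity, blockSize)
	}
	return &cpuTier{
		capacity:  capacity,
		blockSize: blockSize,
		order:     list.New(),
		index:     make(map[string]*list.Element, capacity),
	}, nil
}

// touch marks a block as most recently used and reports whether it exists.
func (c *cpuTier) touch(hash string) bool {
	el, ok := c.index[hash]
	if ok {
		c.order.MoveToFront(el)
	}
	return ok
}

// lookup returns the block without changing its recency.
func (c *cpuTier) lookup(hash string) (*cpuBlock, bool) {
	el, ok := c.index[hash]
	if !ok {
		return nil, false
	}
	return el.Value.(*cpuBlock), true
}

// evict drops the least recently used block, if any.
func (c *cpuTier) evict() {
	el := c.order.Back()
	if el == nil {
		return
	}
	blk := c.order.Remove(el).(*cpuBlock)
	delete(c.index, blk.hash)
	c.evictions++
}

// store inserts a block, or refreshes it if it is already present.
func (c *cpuTier) store(hash string, tokens []int) {
	if c.touch(hash) {
		return
	}
	if c.order.Len() >= c.capacity {
		c.evict()
	}
	blk := &cpuBlock{hash: hash, tokens: append([]int(nil), tokens...)}
	c.index[hash] = c.order.PushFront(blk)
}

func main() {
	tier, err := newCpuTier(2, 16)
	if err != nil {
		panic(err)
	}
	tier.store("a", []int{1})
	tier.store("b", []int{2})
	tier.touch("a")           // "b" becomes the least recent entry
	tier.store("c", []int{3}) // tier is full, so "b" is evicted
	_, hasB := tier.lookup("b")
	fmt.Println(hasB, tier.evictions)
}
```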
- Replace tryReloadFromCPU with reloadPrefixFromCPU
- Compute hierarchical hashes for requesting prefix only
- maxReloads = countFreeBlocks() prevents hash destruction
- CPU blocks touched on reload to refresh LRU recency
- Transfer latency accumulates per reloaded block
- Unrelated CPU blocks are never touched

Co-Authored-By: Claude <[email protected]>
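
The reload path can be illustrated with a small sketch, under several assumptions: blocks are identified by a SHA-256 chain in which each hash covers the previous block's hash plus the current block's tokens, both tiers are modelled as plain sets of hashes, and latency is a fixed cost per block. The helper names `chainHashes` and `reloadPrefix` are hypothetical and are not the PR's `reloadPrefixFromCPU`; the LRU touch on reload is omitted.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// chainHashes returns one hash per full block. Each hash covers the
// previous hash plus this block's tokens, so equal hashes imply equal
// prefixes. A trailing partial block is ignored.
func chainHashes(tokens []int, blockSize int) []string {
	if blockSize <= 0 {
		return nil
	}
	var hashes []string
	prev := ""
	for start := 0; start+blockSize <= len(tokens); start += blockSize {
		h := sha256.New()
		h.Write([]byte(prev))
		var buf [8]byte
		for _, tok := range tokens[start : start+blockSize] {
			binary.LittleEndian.PutUint64(buf[:], uint64(tok))
			h.Write(buf[:])
		}
		prev = hex.EncodeToString(h.Sum(nil))
		hashes = append(hashes, prev)
	}
	return hashes
}

// reloadPrefix walks the request's prefix in order and copies blocks from
// cpu to gpu until the chain breaks or the free-block budget is spent.
// It returns the number of reloaded blocks and the accumulated latency.
func reloadPrefix(tokens []int, blockSize int, gpu, cpu map[string]bool,
	freeBlocks int, perBlockLatency int64) (int, int64) {
	reloaded, latency := 0, int64(0)
	for _, h := range chainHashes(tokens, blockSize) {
		if gpu[h] {
			continue // already cached on GPU
		}
		if !cpu[h] || reloaded >= freeBlocks {
			break // chain broken, or no free GPU block left
		}
		gpu[h] = true
		reloaded++
		latency += perBlockLatency
	}
	return reloaded, latency
}

func main() {
	tokens := []int{1, 2, 3, 4, 5, 6, 7, 8, 9} // two full blocks of 4, one partial
	hashes := chainHashes(tokens, 4)
	gpu := map[string]bool{}
	cpu := map[string]bool{hashes[0]: true, hashes[1]: true, "unrelated": true}
	n, lat := reloadPrefix(tokens, 4, gpu, cpu, 1, 10)
	fmt.Println(n, lat) // one free block, so only the first prefix block is reloaded
}
```

Capping reloads at the number of free blocks mirrors the `maxReloads = countFreeBlocks()` bullet: a reload never has to evict a hashed GPU block to make room. Because the loop only follows the request's own hash chain, unrelated CPU entries are never read or moved.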
- Store newly-completed full blocks to CPU tier
- Touch existing blocks to refresh LRU recency
- GPU HashToBlock never modified (read-only copy)
- Skip partial blocks and unhashed blocks
- Nil/empty batch safe

Co-Authored-By: Claude <[email protected]>
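
The mirroring rules listed in this commit can be sketched as below. All types here (`block`, `request`, `cpuMirror`) are hypothetical simplifications, the CPU tier is an unbounded map rather than a capacity-limited LRU, and `mirrorToCPU` is an illustrative stand-in for the real `MirrorToCPU`.

```go
package main

import "fmt"

// block and request are hypothetical simplifications of the simulator's types.
type block struct {
	hash   string // empty means the block has no prefix hash yet
	tokens int    // number of tokens currently held
}

type request struct {
	blocks []block
}

// cpuMirror records, per hash, the step at which a block was last stored
// or touched. A real tier would bound capacity and evict by recency.
type cpuMirror struct {
	lastUsed map[string]int
	step     int
	stores   int
}

// mirrorToCPU copies full, hashed blocks of the batch to the CPU tier.
// It only reads the requests, so GPU-side state is left untouched.
func (c *cpuMirror) mirrorToCPU(batch []*request, blockSize int) {
	c.step++
	for _, req := range batch { // ranging over a nil batch is a no-op
		if req == nil {
			continue
		}
		for _, b := range req.blocks {
			if b.hash == "" || b.tokens < blockSize {
				continue // skip unhashed and partial blocks
			}
			if _, ok := c.lastUsed[b.hash]; !ok {
				c.stores++ // newly completed block
			}
			c.lastUsed[b.hash] = c.step // store, or refresh recency
		}
	}
}

func main() {
	cpu := &cpuMirror{lastUsed: map[string]int{}}
	req := &request{blocks: []block{{"h0", 16}, {"h1", 16}, {"", 16}, {"h3", 5}}}
	cpu.mirrorToCPU(nil, 16)             // nil batch is safe
	cpu.mirrorToCPU([]*request{req}, 16) // stores h0 and h1
	cpu.mirrorToCPU([]*request{req}, 16) // touches h0 and h1 again
	fmt.Println(cpu.stores, len(cpu.lastUsed))
}
```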
- Verify ReleaseKVBlocks preserves hashes on GPU free list
- GetCachedBlocks still finds prefix after release
- No offload triggered (maybeOffload removed in Task 2)

Co-Authored-By: Claude <[email protected]>
- Insert MirrorToCPU between executeBatchStep and processCompletions
- BC-5 test: CPU extends GPU prefix lifetime through eviction+reload
- KVThrashingRate tests: CPU eviction rate + R11 zero-guard

Co-Authored-By: Claude <[email protected]>
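
A sketch of the repurposed metric with its zero-guard. Reading `cpuEvictionCount/mirrorCount` as a division is an assumption based on the "CPU eviction rate" wording; the function name and the `int64` counter types are hypothetical.

```go
package main

import "fmt"

// thrashingRate returns CPU evictions per mirror, or 0 when nothing has
// been mirrored yet, which avoids a 0/0 NaN in reported metrics.
func thrashingRate(cpuEvictionCount, mirrorCount int64) float64 {
	if mirrorCount == 0 {
		return 0
	}
	return float64(cpuEvictionCount) / float64(mirrorCount)
}

func main() {
	fmt.Println(thrashingRate(0, 0))  // guarded: no mirrors yet
	fmt.Println(thrashingRate(5, 20)) // 5 evictions over 20 mirrors
}
```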
- Add INV-4 conservation test through full mirror+reload lifecycle
- Rewrite setupTieredWithLatency to use mirror+reload (not offload)
- All 27 kv tests pass, full sim suite green

Co-Authored-By: Claude <[email protected]>
- Deprecate KVOffloadThreshold in config.go
- Update tiered.go description in CLAUDE.md (mirror/reload, vLLM v1)
- Update KVStore interface description (12 methods)
- Update register_test.go stale threshold comment

Co-Authored-By: Claude <[email protected]>
- INV-4 conservation test was `total == total` (always true).
  Now walks GPU free list independently of UsedBlockCnt to verify
  UsedBlockCnt + freeListLen == TotalBlocks.
- Add baseLat >= 0 validation in NewTieredKVCache (R3).

Co-Authored-By: Claude <[email protected]>
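
The corrected conservation check can be sketched as follows. `UsedBlockCnt`, `TotalBlocks`, and `freeListLen` are names from the commit message; the singly-linked free list, the `gpuCache` struct, and `checkConservation` are hypothetical. The point of the fix is that the free count comes from walking the list, not from subtracting `UsedBlockCnt` from the total, so the comparison is no longer a tautology.

```go
package main

import "fmt"

// freeBlock is a node in a hypothetical singly-linked GPU free list.
type freeBlock struct {
	next *freeBlock
}

// gpuCache holds only the fields needed for the check.
type gpuCache struct {
	TotalBlocks  int
	UsedBlockCnt int
	freeHead     *freeBlock
}

// freeListLen counts free blocks by walking the list, so the result
// does not depend on UsedBlockCnt.
func (g *gpuCache) freeListLen() int {
	n := 0
	for b := g.freeHead; b != nil; b = b.next {
		n++
	}
	return n
}

// checkConservation returns an error when used + free != total.
func (g *gpuCache) checkConservation() error {
	if free := g.freeListLen(); g.UsedBlockCnt+free != g.TotalBlocks {
		return fmt.Errorf("conservation violated: used=%d free=%d total=%d",
			g.UsedBlockCnt, free, g.TotalBlocks)
	}
	return nil
}

func main() {
	g := &gpuCache{
		TotalBlocks:  3,
		UsedBlockCnt: 1,
		freeHead:     &freeBlock{next: &freeBlock{}},
	}
	fmt.Println(g.checkConservation() == nil)
	g.UsedBlockCnt = 2 // simulate a miscounted block
	fmt.Println(g.checkConservation() == nil)
}
```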
gpu.CacheHits already includes CPU-reloaded blocks (they appear as
GPU cache hits on the retry allocation after reload). Adding
cpuHitCount on top double-counted the same blocks. Pre-existing
bug from the old tryReloadFromCPU path, now corrected.

Co-Authored-By: Claude <[email protected]>
Found during Step 4.75 pre-commit self-audit: baseLat >= 0
validation was added but had no companion test.

Co-Authored-By: Claude <[email protected]>