feat(kv): replace inverted tiered cache with vLLM v1 mirror model #638
Merged
sriumcp merged 11 commits into inference-sim:main on Mar 13, 2026
Conversation
- Add `MirrorToCPU(batch []*Request)` to the KVStore interface
- Implement a no-op on KVCacheState (single-tier)
- Add a stub on TieredKVCache (full implementation in a later task)
- R13: both implementations satisfy the interface

Co-Authored-By: Claude <[email protected]>
- Replace offloadedBlock/cpuTier with cpuBlock/cpuTier
- Hash-keyed map + doubly-linked list for O(1) operations
- Pre-allocated token slices eliminate per-mirror GC pressure
- store/touch/lookup/evict are all O(1)
- R3: newCpuTier validates capacity > 0, blockSize > 0
- BC-7: deprecation warning for KVOffloadThreshold
- Remove old TieredKVCache fields (offloadThreshold, offloadCount, thrashingCount, clock); stub old methods for compilation
- Delete all 14 old tiered_test.go tests (they tested the inverted semantics)
- KVThrashingRate repurposed to cpuEvictionCount/mirrorCount

Note: the store_test.go setupTieredWithLatency tests are expected to fail until Task 7 rewrites the helper.

Co-Authored-By: Claude <[email protected]>
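The hash-keyed map plus doubly-linked list layout described above is the classic O(1) LRU structure. Here is a minimal Go sketch of that design using the standard library's `container/list`; field names, the constructor signature, and the eviction counter are assumptions beyond what the commit message states.

```go
package main

import (
	"container/list"
	"fmt"
)

// cpuBlock is the unit stored in the CPU tier (layout is an assumption).
type cpuBlock struct {
	hash   uint64
	tokens []uint32 // pre-allocated once; reuse avoids per-mirror GC pressure
}

// cpuTier pairs a hash-keyed map with an LRU list so that
// store/touch/lookup/evict are all O(1).
type cpuTier struct {
	capacity  int
	blocks    map[uint64]*list.Element // hash -> node in lru
	lru       *list.List               // front = most recently used
	evictions int
}

// newCpuTier rejects non-positive sizes (R3 in the commit message).
func newCpuTier(capacity, blockSize int) (*cpuTier, error) {
	if capacity <= 0 || blockSize <= 0 {
		return nil, fmt.Errorf("cpuTier: capacity and blockSize must be > 0")
	}
	return &cpuTier{
		capacity: capacity,
		blocks:   make(map[uint64]*list.Element, capacity),
		lru:      list.New(),
	}, nil
}

// store inserts a block, evicting the LRU block if the tier is full. O(1).
func (t *cpuTier) store(hash uint64, tokens []uint32) {
	if el, ok := t.blocks[hash]; ok {
		t.lru.MoveToFront(el) // already present: just refresh recency
		return
	}
	if t.lru.Len() >= t.capacity {
		oldest := t.lru.Back()
		delete(t.blocks, oldest.Value.(*cpuBlock).hash)
		t.lru.Remove(oldest)
		t.evictions++
	}
	t.blocks[hash] = t.lru.PushFront(&cpuBlock{hash: hash, tokens: tokens})
}

// lookup returns the block for hash and refreshes its recency. O(1).
func (t *cpuTier) lookup(hash uint64) (*cpuBlock, bool) {
	el, ok := t.blocks[hash]
	if !ok {
		return nil, false
	}
	t.lru.MoveToFront(el)
	return el.Value.(*cpuBlock), true
}

func main() {
	t, _ := newCpuTier(2, 16)
	t.store(1, nil)
	t.store(2, nil)
	t.lookup(1)     // touch block 1, so block 2 becomes LRU
	t.store(3, nil) // evicts block 2
	_, hit := t.lookup(2)
	fmt.Println(t.evictions, hit) // 1 false
}
```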
- Replace tryReloadFromCPU with reloadPrefixFromCPU
- Compute hierarchical hashes for the requesting prefix only
- maxReloads = countFreeBlocks() prevents hash destruction
- CPU blocks are touched on reload to refresh LRU recency
- Transfer latency accumulates per reloaded block
- Unrelated CPU blocks are never touched

Co-Authored-By: Claude <[email protected]>
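The reload cap is the key safety property above: reloading can consume at most as many GPU blocks as are currently free, so it never evicts hashed GPU blocks. A minimal sketch of that logic, under the assumption that reloads stop at the first CPU miss to keep the prefix contiguous (the function signature and per-block latency constant are illustrative, not the project's):

```go
package main

import "fmt"

// reloadPrefixFromCPU walks the requesting prefix's block hashes in order,
// reloading from CPU until it hits a miss or exhausts the free GPU blocks.
// It returns the number of blocks reloaded and the accumulated transfer
// latency. All names here are assumptions based on the commit message.
func reloadPrefixFromCPU(prefixHashes []uint64, freeBlocks int,
	cpuHas func(uint64) bool) (reloaded int, latency float64) {
	const perBlockLatency = 0.1 // hypothetical transfer cost per block
	maxReloads := freeBlocks    // cap: never destroy hashed GPU blocks
	for _, h := range prefixHashes {
		if reloaded >= maxReloads || !cpuHas(h) {
			break // stop at the first miss: the prefix must stay contiguous
		}
		reloaded++
		latency += perBlockLatency // latency accumulates per reloaded block
	}
	return reloaded, latency
}

func main() {
	cpu := map[uint64]bool{1: true, 2: true, 3: true}
	// Three blocks cached on CPU, but only two free GPU blocks:
	n, lat := reloadPrefixFromCPU([]uint64{1, 2, 3}, 2,
		func(h uint64) bool { return cpu[h] })
	fmt.Println(n, lat) // 2 0.2
}
```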
- Store newly-completed full blocks to the CPU tier
- Touch existing blocks to refresh LRU recency
- GPU HashToBlock is never modified (read-only copy)
- Skip partial blocks and unhashed blocks
- Nil/empty batch is safe

Co-Authored-By: Claude <[email protected]>
- Verify ReleaseKVBlocks preserves hashes on the GPU free list
- GetCachedBlocks still finds the prefix after release
- No offload is triggered (maybeOffload removed in Task 2)

Co-Authored-By: Claude <[email protected]>
- Insert MirrorToCPU between executeBatchStep and processCompletions
- BC-5 test: CPU extends GPU prefix lifetime through eviction+reload
- KVThrashingRate tests: CPU eviction rate + R11 zero-guard

Co-Authored-By: Claude <[email protected]>
- Add INV-4 conservation test through the full mirror+reload lifecycle
- Rewrite setupTieredWithLatency to use mirror+reload (not offload)
- All 27 kv tests pass; full sim suite green

Co-Authored-By: Claude <[email protected]>
- Deprecate KVOffloadThreshold in config.go
- Update the tiered.go description in CLAUDE.md (mirror/reload, vLLM v1)
- Update the KVStore interface description (12 methods)
- Update the stale threshold comment in register_test.go

Co-Authored-By: Claude <[email protected]>
- The INV-4 conservation test was `total == total` (always true); it now walks the GPU free list independently of UsedBlockCnt to verify UsedBlockCnt + freeListLen == TotalBlocks
- Add baseLat >= 0 validation in NewTieredKVCache (R3)

Co-Authored-By: Claude <[email protected]>
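The fixed invariant check can be sketched as below. The point of the fix is that the free-list length is counted by walking the list rather than derived from UsedBlockCnt, so the check can actually fail; the struct fields here are placeholders, not the simulator's real types.

```go
package main

import "fmt"

// gpuState is a placeholder for the GPU tier's bookkeeping (assumption).
type gpuState struct {
	TotalBlocks  int
	UsedBlockCnt int
	freeList     []int // IDs of currently free blocks
}

// checkINV4 verifies block conservation: every block is either in use or
// on the free list. Counting free blocks by walking the list keeps the
// check independent of UsedBlockCnt, unlike the old `total == total` test.
func checkINV4(g *gpuState) error {
	freeLen := 0
	for range g.freeList { // count by traversal, not by subtraction
		freeLen++
	}
	if g.UsedBlockCnt+freeLen != g.TotalBlocks {
		return fmt.Errorf("INV-4 violated: used=%d free=%d total=%d",
			g.UsedBlockCnt, freeLen, g.TotalBlocks)
	}
	return nil
}

func main() {
	g := &gpuState{TotalBlocks: 8, UsedBlockCnt: 5, freeList: []int{0, 1, 2}}
	fmt.Println(checkINV4(g)) // <nil>
}
```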
gpu.CacheHits already includes CPU-reloaded blocks: they appear as GPU cache hits on the retry allocation after reload. Adding cpuHitCount on top double-counted the same blocks. This was a pre-existing bug from the old tryReloadFromCPU path, now corrected.

Co-Authored-By: Claude <[email protected]>
Found during Step 4.75 pre-commit self-audit: baseLat >= 0 validation was added but had no companion test. Co-Authored-By: Claude <[email protected]>
Summary
Replace BLIS's inverted tiered KV cache model with vLLM v1 OffloadingConnector semantics:
- `ReleaseKVBlocks` no longer triggers offload; freed blocks stay on GPU with hashes intact

Why this matters
The old model actively removed prefix hashes from GPU when offloading to CPU, artificially suppressing GPU cache hit rates and causing pathological offload-reload thrashing. The new model preserves GPU cache and uses CPU as a secondary tier that extends prefix lifetime beyond GPU eviction — matching production vLLM v1 behavior.
Fixes #510, fixes #511
🤖 Generated with Claude Code