feat(kv): replace inverted tiered cache with vLLM v1 mirror model #638
Merged
sriumcp merged 11 commits into inference-sim:main on Mar 13, 2026
Conversation
- Add `MirrorToCPU(batch []*Request)` to the KVStore interface
- Implement a no-op on KVCacheState (single-tier)
- Add a stub on TieredKVCache (full implementation in a later task)
- R13: both implementations satisfy the interface

Co-Authored-By: Claude <[email protected]>
- Replace offloadedBlock/cpuTier with cpuBlock/cpuTier
- Hash-keyed map + doubly-linked list for O(1) operations
- Pre-allocated token slices eliminate per-mirror GC pressure
- store/touch/lookup/evict are all O(1)
- R3: newCpuTier validates capacity > 0, blockSize > 0
- BC-7: deprecation warning for KVOffloadThreshold
- Remove old TieredKVCache fields (offloadThreshold, offloadCount, thrashingCount, clock); stub old methods for compilation
- Delete all 14 old tiered_test.go tests (they tested the inverted semantics)
- KVThrashingRate repurposed to cpuEvictionCount/mirrorCount

Note: the store_test.go setupTieredWithLatency tests are expected to fail until Task 7 rewrites the helper.

Co-Authored-By: Claude <[email protected]>
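The hash-keyed map plus doubly-linked list layout described above is the classic O(1) LRU structure. Here is a minimal Go sketch of that design using the standard library's `container/list`; field names, the constructor signature, and the eviction counter are assumptions beyond what the commit message states.

```go
package main

import (
	"container/list"
	"fmt"
)

// cpuBlock is the unit stored in the CPU tier (layout is an assumption).
type cpuBlock struct {
	hash   uint64
	tokens []uint32 // pre-allocated once; reuse avoids per-mirror GC pressure
}

// cpuTier pairs a hash-keyed map with an LRU list so that
// store/touch/lookup/evict are all O(1).
type cpuTier struct {
	capacity  int
	blocks    map[uint64]*list.Element // hash -> node in lru
	lru       *list.List               // front = most recently used
	evictions int
}

// newCpuTier rejects non-positive sizes (R3 in the commit message).
func newCpuTier(capacity, blockSize int) (*cpuTier, error) {
	if capacity <= 0 || blockSize <= 0 {
		return nil, fmt.Errorf("cpuTier: capacity and blockSize must be > 0")
	}
	return &cpuTier{
		capacity: capacity,
		blocks:   make(map[uint64]*list.Element, capacity),
		lru:      list.New(),
	}, nil
}

// store inserts a block, evicting the LRU block if the tier is full. O(1).
func (t *cpuTier) store(hash uint64, tokens []uint32) {
	if el, ok := t.blocks[hash]; ok {
		t.lru.MoveToFront(el) // already present: just refresh recency
		return
	}
	if t.lru.Len() >= t.capacity {
		oldest := t.lru.Back()
		delete(t.blocks, oldest.Value.(*cpuBlock).hash)
		t.lru.Remove(oldest)
		t.evictions++
	}
	t.blocks[hash] = t.lru.PushFront(&cpuBlock{hash: hash, tokens: tokens})
}

// lookup returns the block for hash and refreshes its recency. O(1).
func (t *cpuTier) lookup(hash uint64) (*cpuBlock, bool) {
	el, ok := t.blocks[hash]
	if !ok {
		return nil, false
	}
	t.lru.MoveToFront(el)
	return el.Value.(*cpuBlock), true
}

func main() {
	t, _ := newCpuTier(2, 16)
	t.store(1, nil)
	t.store(2, nil)
	t.lookup(1)     // touch block 1, so block 2 becomes LRU
	t.store(3, nil) // evicts block 2
	_, hit := t.lookup(2)
	fmt.Println(t.evictions, hit) // 1 false
}
```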
- Replace tryReloadFromCPU with reloadPrefixFromCPU
- Compute hierarchical hashes for the requesting prefix only
- maxReloads = countFreeBlocks() prevents hash destruction
- CPU blocks are touched on reload to refresh LRU recency
- Transfer latency accumulates per reloaded block
- Unrelated CPU blocks are never touched

Co-Authored-By: Claude <[email protected]>
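The reload cap is the key safety property above: reloading can consume at most as many GPU blocks as are currently free, so it never evicts hashed GPU blocks. A minimal sketch of that logic, under the assumption that reloads stop at the first CPU miss to keep the prefix contiguous (the function signature and per-block latency constant are illustrative, not the project's):

```go
package main

import "fmt"

// reloadPrefixFromCPU walks the requesting prefix's block hashes in order,
// reloading from CPU until it hits a miss or exhausts the free GPU blocks.
// It returns the number of blocks reloaded and the accumulated transfer
// latency. All names here are assumptions based on the commit message.
func reloadPrefixFromCPU(prefixHashes []uint64, freeBlocks int,
	cpuHas func(uint64) bool) (reloaded int, latency float64) {
	const perBlockLatency = 0.1 // hypothetical transfer cost per block
	maxReloads := freeBlocks    // cap: never destroy hashed GPU blocks
	for _, h := range prefixHashes {
		if reloaded >= maxReloads || !cpuHas(h) {
			break // stop at the first miss: the prefix must stay contiguous
		}
		reloaded++
		latency += perBlockLatency // latency accumulates per reloaded block
	}
	return reloaded, latency
}

func main() {
	cpu := map[uint64]bool{1: true, 2: true, 3: true}
	// Three blocks cached on CPU, but only two free GPU blocks:
	n, lat := reloadPrefixFromCPU([]uint64{1, 2, 3}, 2,
		func(h uint64) bool { return cpu[h] })
	fmt.Println(n, lat) // 2 0.2
}
```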
- Store newly-completed full blocks to the CPU tier
- Touch existing blocks to refresh LRU recency
- GPU HashToBlock is never modified (read-only copy)
- Skip partial blocks and unhashed blocks
- Nil/empty batch is safe

Co-Authored-By: Claude <[email protected]>
- Verify ReleaseKVBlocks preserves hashes on the GPU free list
- GetCachedBlocks still finds the prefix after release
- No offload is triggered (maybeOffload removed in Task 2)

Co-Authored-By: Claude <[email protected]>
- Insert MirrorToCPU between executeBatchStep and processCompletions
- BC-5 test: CPU extends GPU prefix lifetime through eviction+reload
- KVThrashingRate tests: CPU eviction rate + R11 zero-guard

Co-Authored-By: Claude <[email protected]>
- Add INV-4 conservation test through the full mirror+reload lifecycle
- Rewrite setupTieredWithLatency to use mirror+reload (not offload)
- All 27 kv tests pass; full sim suite green

Co-Authored-By: Claude <[email protected]>
- Deprecate KVOffloadThreshold in config.go
- Update the tiered.go description in CLAUDE.md (mirror/reload, vLLM v1)
- Update the KVStore interface description (12 methods)
- Update the stale threshold comment in register_test.go

Co-Authored-By: Claude <[email protected]>
- The INV-4 conservation test was `total == total` (always true); it now walks the GPU free list independently of UsedBlockCnt to verify UsedBlockCnt + freeListLen == TotalBlocks
- Add baseLat >= 0 validation in NewTieredKVCache (R3)

Co-Authored-By: Claude <[email protected]>
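The fixed invariant check can be sketched as below. The point of the fix is that the free-list length is counted by walking the list rather than derived from UsedBlockCnt, so the check can actually fail; the struct fields here are placeholders, not the simulator's real types.

```go
package main

import "fmt"

// gpuState is a placeholder for the GPU tier's bookkeeping (assumption).
type gpuState struct {
	TotalBlocks  int
	UsedBlockCnt int
	freeList     []int // IDs of currently free blocks
}

// checkINV4 verifies block conservation: every block is either in use or
// on the free list. Counting free blocks by walking the list keeps the
// check independent of UsedBlockCnt, unlike the old `total == total` test.
func checkINV4(g *gpuState) error {
	freeLen := 0
	for range g.freeList { // count by traversal, not by subtraction
		freeLen++
	}
	if g.UsedBlockCnt+freeLen != g.TotalBlocks {
		return fmt.Errorf("INV-4 violated: used=%d free=%d total=%d",
			g.UsedBlockCnt, freeLen, g.TotalBlocks)
	}
	return nil
}

func main() {
	g := &gpuState{TotalBlocks: 8, UsedBlockCnt: 5, freeList: []int{0, 1, 2}}
	fmt.Println(checkINV4(g)) // <nil>
}
```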
gpu.CacheHits already includes CPU-reloaded blocks: they appear as GPU cache hits on the retry allocation after reload. Adding cpuHitCount on top double-counted the same blocks. This was a pre-existing bug from the old tryReloadFromCPU path, now corrected.

Co-Authored-By: Claude <[email protected]>
Found during Step 4.75 pre-commit self-audit: baseLat >= 0 validation was added but had no companion test. Co-Authored-By: Claude <[email protected]>
Summary
Replace BLIS's inverted tiered KV cache model with vLLM v1 OffloadingConnector semantics:
- `ReleaseKVBlocks` no longer triggers offload; freed blocks stay on GPU with hashes intact

Why this matters
The old model actively removed prefix hashes from GPU when offloading to CPU, artificially suppressing GPU cache hit rates and causing pathological offload-reload thrashing. The new model preserves GPU cache and uses CPU as a secondary tier that extends prefix lifetime beyond GPU eviction — matching production vLLM v1 behavior.
Fixes #510, fixes #511
🤖 Generated with Claude Code