Tags: inference-sim/inference-sim
feat(training): iter27 — joint CMA-ES 6-param optimization, loss 34.61% (#942)

CMA-ES joint search over 6 parameters (β₁ₐ, β₄, β₅, β₇, β₈, β₂ᵦ) from iter26 warm start. 141 trials, best at trial 62.

Key interaction: β₄ joint-optimal at 0.752 (vs isolated 0.410), allowing β₅ (49.6→32.4) and β₇ (169→126) to decrease — coordinate descent missed the β₄/β₅/β₇ coupling.

Loss: 37.42% → 34.61% (-2.81 points). TTFT RMSE: 24.34% → 22.81%, E2E RMSE: 13.09% → 11.79%.

Also: inner_loop_optimize.py max-workers fix (min 4 instead of min 1) to prevent trial timeouts at high n_jobs settings.

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
feat(cluster): gateway queue with saturation-gated dispatch (#882) (#897)

* feat(cluster): gateway queue with saturation-gated dispatch (#882)

  Add a gateway queue between admission and routing so BLIS can hold admitted requests and dispatch them when cluster capacity opens, modeling GIE flow control behavior.

  New components:
  - SaturationDetector interface with NeverSaturated, UtilizationDetector, and ConcurrencyDetector implementations (sim/saturation.go)
  - GatewayQueue with FIFO and Priority dispatch ordering, capacity shedding (sim/cluster/gateway_queue.go)
  - Completion-triggered dispatch in the cluster event loop
  - Per-request GatewayQueueDelay metric (BC-8)
  - INV-1 conservation extended: gateway_queue_depth + gateway_queue_shed

  When flow control is disabled (default), behavior is identical to the current pipeline (BC-1 pass-through equivalence verified by test).

  Closes #882

* fix(cluster): address review findings — CLI validation, batch dispatch, test coverage

  - Add CLI validation for all 7 flow control flags (R3: validate at the CLI boundary). Checks NaN/Inf for float thresholds, validates detector name and dispatch order before passing to library constructors.
  - Add NaN/Inf guards in the UtilizationDetector constructor (P2-2).
  - Fix BC-4: the dispatch loop now fires up to delta times per batch completion instead of once, matching the contract for batch completions.
  - Strengthen the BC-1 pass-through test: compare StillQueued, StillRunning, TimedOutRequests, and verify GatewayQueueShed==0 and GatewayQueueDepth==0.
  - Fix the BC-8 test: use a utilization detector with tight thresholds instead of NeverSaturated, so requests actually wait in the gateway queue and the test verifies GatewayEnqueueTime is set.

* fix(cluster): include RoutingRejections in flow control conservation test

  The conservation test was missing the RoutingRejections() term, inconsistent with the deferred queue conservation tests. Add it to match the INV-1 formula.

* perf(cluster): early-exit dispatch loop when saturated or queue empty

  tryDispatchFromGatewayQueue now returns bool (dispatched or not). The completion delta loop breaks early when saturated, avoiding redundant buildRouterState calls in batch-completion scenarios.

* fix: INV-1 formula missing routing_rejections + interface{} to any

  - Add routing_rejections to the INV-1 cluster-level formula in CLAUDE.md and invariants.md, matching the conservation test and deferred queue tests.
  - Use 'any' instead of 'interface{}' in gateway_queue.go heap methods for consistency with codebase convention (Go 1.18+).

* fix(cmd): warn on no-op flow control config + add example commands

  - Warn when --flow-control is enabled but --saturation-detector is 'never' (pass-through), since this is likely not what the user intended.
  - Add flow control example commands to the CLAUDE.md Build and Run section.

* fix: scope CLI validation to selected detector + update architecture docs

  - Validate --queue-depth-threshold and --kv-cache-util-threshold only when the detector is 'utilization'; validate --max-concurrency only when the detector is 'concurrency'. Prevents confusing errors for irrelevant params.
  - Add the gateway queue stage to the Online Routing Pipeline Walkthrough in architecture.md (step 4, optional when --flow-control enabled).

* fix: address pre-merge review findings (Issues 1-5)

  1. Fix misleading comment in cluster_event.go — dispatch succeeds when the cluster is not saturated, not just with NeverSaturated.
  2. Add optional Gateway Queue node to the Mermaid flowchart in architecture.md.
  3. Add Trace gap comment at the gateway shed path + reset GatewayEnqueueTime=0 on shed-on-arrival to leave the request struct clean.
  4. BC-8 test: hard assertion + redesigned scenario with ConcurrencyDetector (maxConcurrency=1) and staggered arrivals to genuinely force queuing.
  5. Debug log in tryDispatchFromGatewayQueue when held due to saturation; resolve EC-2 inline comments in saturation.go.

---------

Co-authored-by: Mert Toslali <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>
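The saturation-gated dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the BLIS implementation: the `IsSaturated` method name, the queue internals, and the use of request-ID strings are assumptions; only the type names (SaturationDetector, NeverSaturated, ConcurrencyDetector, GatewayQueue) come from the commit.

```go
package main

import "fmt"

// SaturationDetector mirrors the interface named in the commit;
// the exact method signature is assumed for illustration.
type SaturationDetector interface {
	IsSaturated() bool
}

// NeverSaturated is the pass-through detector (flow control disabled).
type NeverSaturated struct{}

func (NeverSaturated) IsSaturated() bool { return false }

// ConcurrencyDetector reports saturation once in-flight requests
// reach a configured maximum (hypothetical fields).
type ConcurrencyDetector struct {
	InFlight, MaxConcurrency int
}

func (d ConcurrencyDetector) IsSaturated() bool {
	return d.InFlight >= d.MaxConcurrency
}

// GatewayQueue holds admitted requests in FIFO order until capacity opens.
type GatewayQueue struct {
	pending []string // request IDs, FIFO
}

// TryDispatch pops one request when the cluster is not saturated and
// reports whether a dispatch happened, mirroring the early-exit bool
// return described in the perf sub-commit.
func (q *GatewayQueue) TryDispatch(det SaturationDetector) (string, bool) {
	if det.IsSaturated() || len(q.pending) == 0 {
		return "", false
	}
	id := q.pending[0]
	q.pending = q.pending[1:]
	return id, true
}

func main() {
	q := &GatewayQueue{pending: []string{"r1", "r2"}}
	id, ok := q.TryDispatch(ConcurrencyDetector{InFlight: 0, MaxConcurrency: 1})
	fmt.Println(id, ok) // dispatched: capacity available

	_, ok = q.TryDispatch(ConcurrencyDetector{InFlight: 1, MaxConcurrency: 1})
	fmt.Println(ok) // held: saturated
}
```

A completion-triggered dispatch loop would call TryDispatch up to delta times per batch completion and break on the first false, which is exactly the redundant-work fix the perf sub-commit describes.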
fix(observe): calibrate prefix token ratio for BPE tokenizer fidelity (#834)

* fix(observe): calibrate prefix token ratio to match blis run fidelity

  Prefix strings in `blis observe` used a 1:1 word-to-token mapping, but BPE tokenizers split multi-syllable vocabulary words into ~1.67 tokens/word, inflating actual server token counts by ~60% vs what `blis run` simulates.

  Add a calibration request at startup that measures the server's actual tokens-per-word ratio, then scale prefix word counts accordingly so the server tokenizes them to approximately the intended token count.

  Fixes #832

* fix(observe): check json.Encode error return in calibration tests

  Satisfies the errcheck linter — matches the existing test handler pattern.

* fix(observe): add calibration timeout and derive word count from vocabulary

  Address review feedback:
  - Add a 30s timeout to the calibration context to prevent indefinite hangs
  - Derive calibrationWordCount from len(prefixVocabulary) to prevent silent divergence if the vocabulary is expanded

* fix(observe): correct stale comment — prefixLengths stores target tokens, not words

* test(observe): add BC-5 test for empty prefix groups

  Verifies buildPrefixStrings returns empty maps when no prefix groups exist, confirming no calibration work is needed.

* fix(observe): address review — nil safety, defer cancel, R11 guard, diagnostics, tests

  - Split compound nil check: guard record==nil before accessing fields
  - Use defer for calibCancel() to prevent a context leak on panic
  - Add R11 division guard in buildPrefixStrings for tokensPerWord<=0
  - Surface a specific diagnostic when the server returns 0 prompt_tokens
  - Add lower-bound fallback test (ratio < 1.0)
  - Add end-to-end test verifying the suffix uses token count, not word count

* docs(observe): document prefix token calibration in the Prefix sharing box

  Users will see calibration log output at startup; this provides context for what it means and how it affects prefix string generation.

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
Co-authored-by: Srinivasan Parthasarathy <[email protected]>
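The word-count scaling at the heart of this fix can be sketched as below. This is a hedged sketch, not the blis code: `scaledWordCount` is a hypothetical helper, and the interpretation of the lower-bound fallback (clamping ratios below 1.0 back to the 1:1 mapping) is an assumption based on the test description in the commit.

```go
package main

import "fmt"

// scaledWordCount converts a target token count into the number of
// vocabulary words to emit, given a measured tokens-per-word ratio.
// Hypothetical standalone version of the calibration arithmetic.
func scaledWordCount(targetTokens int, tokensPerWord float64) int {
	// Division guard (the R11 guard in the commit): fall back to the
	// old 1:1 word-to-token mapping when the ratio is unusable.
	if tokensPerWord <= 0 {
		return targetTokens
	}
	// Assumed lower-bound fallback for ratio < 1.0: treat it as 1:1
	// rather than emitting more words than target tokens.
	if tokensPerWord < 1.0 {
		tokensPerWord = 1.0
	}
	words := int(float64(targetTokens) / tokensPerWord)
	if words < 1 {
		words = 1
	}
	return words
}

func main() {
	// With the ~1.67 tokens/word ratio cited in the commit, a
	// 100-token prefix target needs only ~60 words.
	fmt.Println(scaledWordCount(100, 1.67))
	fmt.Println(scaledWordCount(100, 0)) // guard path: 1:1 fallback
}
```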
build: Dockerfile + release workflow for ghcr.io/inference-sim/blis (#807)

* build: add multi-stage Dockerfile for blis binary
* ci: build and push ghcr.io/inference-sim/blis on version tag
* fix: TARGETARCH cross-compile, non-root user, stable-only latest tag, improved comments
* fix: preserve v-prefix in semver image tags to match git describe output
docs: add missing observe flags, convert inference-perf, trace-output, and pipeline diagram (#734)

- Add --trace-output run example to CLAUDE.md (BC-3)
- Add rate-mode observe example with distribution flags, auth, concurrency, and streaming options to CLAUDE.md (BC-1)
- Add convert inference-perf --spec example to CLAUDE.md (BC-2)
- Add observe/replay/calibrate mermaid pipeline diagram to architecture.md with cross-references to guide and config reference pages (BC-4)

Fixes #720, fixes #721

Co-authored-by: Claude <[email protected]>
fix(kv): commit reloaded prefix blocks in partial-improvement path (#640) (#649)

* fix(kv): commit reloaded prefix blocks in partial-improvement path (BC-1, BC-2)

  - Before fix: a partial CPU reload left reloaded blocks on the GPU free list with RefCount=0; a subsequent popFreeBlock could evict them and clear their hashes (R1 silent data loss). Fresh blocks also chained prevHash from the pre-reload state, producing incorrect prefix hashes.
  - After fix: commitCachedBlocks() is called for newCached[startBlock:newStartBlock] before delegating fresh allocation to gpu.AllocateKVBlocks, mirroring the full-reload branch (line 207-230) and vLLM v1 commit-before-allocate.
  - Uses the same ceiling-division startBlock as the full-reload branch (line 222) to avoid double-committing the partially-filled last block for running requests (BC-4).

  Fixes #640

* test(kv): add partial-reload block commitment tests (BC-3, BC-5)

  - TestTieredKVCache_PartialReload_NewRequest_Revised: verifies a new request's partial reload commits the prefix from block 0 with a correct hash chain; the committed block persists in RequestMap even when tail allocation fails
  - INV-4 conservation verified in both running and new-request tests

* docs(kv): update commitCachedBlocks docstring for partial-improvement path

  The method is now used in two TieredKVCache paths: the existing full-reload path (returns true immediately, no rollback needed) and the new partial-improvement path (calls AllocateKVBlocks afterwards). Add an inline explanation of why the transactional gap is safe: in BLIS's single-threaded DES, once AllocateKVBlocks' pre-check passes, countFreeBlocks() cannot decrease before the allocation loop runs, making mid-loop failure impossible.

* refactor(kv): replace brittle line reference with descriptive comment

  Replace "same as line 222" with "same ceiling division as the full-range reload path above" to avoid comment rot when line numbers shift.

* docs(kv): strengthen commitCachedBlocks pre-check guarantee claim

  The docstring said "common case" for pre-check failure in the partial-improvement path. This is actually a mathematical guarantee: the original failure condition N > F implies the tail condition (N-R) > (F-R), so the tail pre-check always fails when reached via this path. Clarify the docstring accordingly.

* docs(kv): correct commitCachedBlocks docstring — pre-check can pass

  The previous docstring claimed AllocateKVBlocks 'always fails at its pre-check' in the partial-improvement path. This is incorrect when the last block before commit is partially filled: the effectiveTokens adjustment reduces the needed block count below N-R, allowing the pre-check to pass and the allocation to succeed. Correct the docstring to state that in either case (pre-check failure or success) the committed state is stable, and the single-threaded DES guarantee prevents mid-loop failure.

* docs(plans): add micro-plan for partial-improvement block commitment fix

---------

Co-authored-by: Claude <[email protected]>
feat(kv): replace inverted tiered cache with vLLM v1 mirror model (#638)

* feat(kv): add MirrorToCPU to KVStore interface (BC-8)

  - Add MirrorToCPU(batch []*Request) to the KVStore interface
  - Implement no-op on KVCacheState (single-tier)
  - Add stub on TieredKVCache (full impl in later task)
  - R13: both implementations satisfy the interface

* refactor(kv): rewrite CPU tier with hash-keyed LRU (BC-4)

  - Replace offloadedBlock/cpuTier with cpuBlock/cpuTier
  - Hash-keyed map + doubly-linked list for O(1) operations
  - Pre-allocated token slices eliminate per-mirror GC pressure
  - store/touch/lookup/evict are all O(1)
  - R3: newCpuTier validates capacity > 0, blockSize > 0
  - BC-7: deprecation warning for KVOffloadThreshold
  - Remove old TieredKVCache fields (offloadThreshold, offloadCount, thrashingCount, clock). Stub old methods for compilation.
  - Delete all 14 old tiered_test.go tests (they tested inverted semantics)
  - KVThrashingRate repurposed to cpuEvictionCount/mirrorCount

  Note: the store_test.go setupTieredWithLatency tests are expected to fail until Task 7 rewrites the helper.

* feat(kv): implement targeted CPU→GPU reload (BC-2, BC-6)

  - Replace tryReloadFromCPU with reloadPrefixFromCPU
  - Compute hierarchical hashes for the requested prefix only
  - maxReloads = countFreeBlocks() prevents hash destruction
  - CPU blocks are touched on reload to refresh LRU recency
  - Transfer latency accumulates per reloaded block
  - Unrelated CPU blocks are never touched

* feat(kv): implement MirrorToCPU with touch semantics (BC-1, BC-9)

  - Store newly-completed full blocks to the CPU tier
  - Touch existing blocks to refresh LRU recency
  - GPU HashToBlock is never modified (read-only copy)
  - Skip partial blocks and unhashed blocks
  - Nil/empty batch safe

* test(kv): add BC-3 GPU prefix preservation test

  - Verify ReleaseKVBlocks preserves hashes on the GPU free list
  - GetCachedBlocks still finds the prefix after release
  - No offload triggered (maybeOffload removed in Task 2)

* feat(kv): integrate MirrorToCPU in Step() + BC-5/KVThrashingRate tests

  - Insert MirrorToCPU between executeBatchStep and processCompletions
  - BC-5 test: CPU extends GPU prefix lifetime through eviction+reload
  - KVThrashingRate tests: CPU eviction rate + R11 zero-guard

* test(kv): INV-4 conservation test + rewrite setupTieredWithLatency

  - Add INV-4 conservation test through the full mirror+reload lifecycle
  - Rewrite setupTieredWithLatency to use mirror+reload (not offload)
  - All 27 kv tests pass, full sim suite green

* docs: update CLAUDE.md and config.go for tiered cache v1 model (BC-7)

  - Deprecate KVOffloadThreshold in config.go
  - Update tiered.go description in CLAUDE.md (mirror/reload, vLLM v1)
  - Update KVStore interface description (12 methods)
  - Update register_test.go stale threshold comment

* fix(kv): fix tautological INV-4 test + add baseLat validation (R3)

  - The INV-4 conservation test was `total == total` (always true). It now walks the GPU free list independently of UsedBlockCnt to verify UsedBlockCnt + freeListLen == TotalBlocks.
  - Add baseLat >= 0 validation in NewTieredKVCache (R3).

* fix(kv): remove CacheHitRate double-count of CPU-reloaded blocks

  gpu.CacheHits already includes CPU-reloaded blocks (they appear as GPU cache hits on the retry allocation after reload). Adding cpuHitCount on top double-counted the same blocks. Pre-existing bug from the old tryReloadFromCPU path, now corrected.

* test(kv): add baseLat negative validation test (R3, self-audit)

  Found during the Step 4.75 pre-commit self-audit: baseLat >= 0 validation was added but had no companion test.

---------

Co-authored-by: Claude <[email protected]>
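The hash-keyed LRU that backs the rewritten CPU tier can be sketched with Go's standard `container/list`. The structure below is illustrative, not the TieredKVCache internals: field and method names are assumptions, and real cpuBlock entries would also carry the pre-allocated token slices the commit mentions; only the map-plus-doubly-linked-list design giving O(1) store/touch/lookup/evict comes from the commit.

```go
package main

import (
	"container/list"
	"fmt"
)

// cpuTier: a map from block hash to list element plus a doubly-linked
// list ordered by recency (front = most recently used).
type cpuTier struct {
	capacity int
	order    *list.List
	byHash   map[uint64]*list.Element // element Value holds the hash
}

func newCpuTier(capacity int) *cpuTier {
	return &cpuTier{
		capacity: capacity,
		order:    list.New(),
		byHash:   map[uint64]*list.Element{},
	}
}

// store inserts a block hash, or refreshes its recency if already
// present (the "touch semantics" of MirrorToCPU), evicting the LRU
// entry when the tier is full. All operations are O(1).
func (t *cpuTier) store(h uint64) {
	if e, ok := t.byHash[h]; ok {
		t.order.MoveToFront(e)
		return
	}
	if t.order.Len() >= t.capacity {
		lru := t.order.Back()
		t.order.Remove(lru)
		delete(t.byHash, lru.Value.(uint64))
	}
	t.byHash[h] = t.order.PushFront(h)
}

// lookup reports residency and refreshes recency, as a reload would.
func (t *cpuTier) lookup(h uint64) bool {
	e, ok := t.byHash[h]
	if ok {
		t.order.MoveToFront(e)
	}
	return ok
}

func main() {
	t := newCpuTier(2)
	t.store(1)
	t.store(2)
	t.lookup(1) // touch 1, so 2 becomes the LRU entry
	t.store(3)  // evicts 2
	fmt.Println(t.lookup(1), t.lookup(2), t.lookup(3)) // true false true
}
```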
feat(latency): add trained-roofline backend with roofline basis functions × learned corrections (#616)

* feat(latency): register trained-roofline backend name (BC-1)

* feat(latency): add PostDecodeFixedOverhead to interface + implement TrainedRooflineLatencyModel (BC-3,6,7,8,9,11,15)

  - Add PostDecodeFixedOverhead() int64 to the LatencyModel interface
  - Existing backends (blackbox, roofline, crossmodel) return 0
  - Simulator recordRequestCompletion adds PostDecodeFixedOverhead to E2E
  - TrainedRooflineLatencyModel: 6 roofline basis functions with learned corrections
  - Zero heap allocations in StepTime (19ns/op, 0 allocs/op)

* feat(latency): wire trained-roofline factory in NewLatencyModel (BC-2,13,14)

  - Full validation: TP, NumLayers, NumHeads, HiddenDim, IntermediateDim, TFlopsPeak, BwPeakTBs, NumHeads%TP, NumKVHeads%TP, 7 beta coefficients
  - Derives architecture features at construction: headDim, dKV, dFF, kEff
  - Table-driven error tests for all validation paths

* test(latency): add monotonicity behavioral tests for trained-roofline (BC-4,5)

* feat(latency): add trained-roofline defaults + CLI loading (BC-10,12)

  - Add trained_roofline_defaults section to defaults.yaml with 7 betas + 3 alphas
  - Add TrainedRooflineDefaults struct to cmd/default_config.go
  - CLI handling: 4 sites in cmd/root.go (loading block, zero-coefficients guard, HFConfig parsing, help text)

* docs: add trained-roofline to latency model documentation

  - CLAUDE.md: "Four modes", file tree, Key Data Flow
  - sim/config.go: Backend field comment
  - sim/latency/latency.go: package doc
  - docs/concepts/core-engine.md: "four latency model backends"
  - docs/concepts/glossary.md: "Four modes" + trained-roofline description
  - Plan committed alongside implementation

* fix(sim): guard PostDecodeFixedOverhead for zero-output requests + fix ITL contamination

  - PostDecodeFixedOverhead only applied when len(OutputTokens) > 0
  - RequestITLs computed from itlSum directly (not lat-FirstTokenTime) to avoid contaminating the per-token average ITL with fixed overhead
  - Add zero-alpha warning for the trained-roofline CLI path

  Caught by code review Step 4.5.

* docs(guide): comprehensive trained-roofline section in latency models guide

  - Add trained-roofline section with formula, alpha model, accuracy caveats
  - Update comparison table to 4 backends
  - Update recommendation: trained-roofline is now the default for new models
  - Update pluggable architecture to show 4 interface methods
  - Fix cross-model description accuracy

* fix: convergence Round 1 fixes — extension recipe, slice copy, zero-alloc test, config ref

  - Extension recipe: 3→4 methods, added bundle.go + CLI wiring touch points, added trained-roofline as 4th example
  - Factory: defensive copy of beta/alpha slices to enforce the "frozen" contract
  - Test: add TestTrainedRoofline_StepTime_ZeroAllocs using testing.AllocsPerRun
  - Configuration reference: add trained-roofline to the --latency-model flag description

* docs: add trained-roofline to quickstart + document non-blocking overhead pattern

  - Quickstart: add trained-roofline example (recommended for new models)
  - recordRequestCompletion: document that E2E includes non-blocking PostDecodeFixedOverhead and OutputTokenProcessingTime, explaining why RequestCompletionTimes exceeds the RequestLeftEvent timestamp

* docs: self-audit — update models.md, roofline.md, tutorial.md for trained-roofline

  All documentation working copies now mention trained-roofline consistently. Source-of-truth map: 12/12 working copies updated.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
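The "roofline basis functions × learned corrections" idea can be sketched as a learned linear combination of ideal roofline terms. This is a heavily hedged illustration: the real backend uses 6 basis functions and 7 betas whose exact forms are not given here, so the two-term version below (a compute term bounded by peak TFLOPS and a memory term bounded by peak bandwidth, plus a constant) and all the numbers are assumptions.

```go
package main

import "fmt"

// stepTimeMs predicts a batch step time as learned coefficients applied
// to roofline-style basis terms. Illustrative only; not the
// TrainedRooflineLatencyModel formula.
func stepTimeMs(flops, bytes, tflopsPeak, bwPeakTBs float64, beta [3]float64) float64 {
	computeMs := flops / (tflopsPeak * 1e12) * 1e3 // ideal compute time
	memoryMs := bytes / (bwPeakTBs * 1e12) * 1e3   // ideal memory time
	// Learned corrections scale the ideal roofline terms and add a
	// constant overhead, capturing what the hardware model misses.
	return beta[0] + beta[1]*computeMs + beta[2]*memoryMs
}

func main() {
	// Hypothetical inputs: 1e12 FLOPs, 1e11 bytes moved, H100-ish peaks,
	// made-up betas. With positive betas the prediction is monotone in
	// both workload terms, which is what the BC-4/BC-5 tests check.
	fmt.Printf("%.3f ms\n", stepTimeMs(1e12, 1e11, 989, 3.35, [3]float64{0.05, 1.2, 1.1}))
}
```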
refactor(sim): remove PreemptionProcessingTime from LatencyModel interface (#554) (#555)

Remove dead code: PreemptionProcessingTime() always returned 0 in all three backends, and PreemptionEvent was a no-op event type (Execute() only logged). The actual cost of preemption (re-prefill) is already modeled by the ProgressIndex=0 reset in batch formation.

- Remove PreemptionProcessingTime() from the LatencyModel interface (5→4 methods)
- Remove PreemptionEvent type from event.go
- Remove PreemptionDelay field from the PreemptedRequest struct
- Replace PreemptionEvent scheduling with an inline logrus.Debugf
- Preserve the PreemptionCount++ metric and all preemption mechanics
- Update docs: CLAUDE.md, core-engine.md, latency-models.md, extension-recipes.md, design-guidelines.md, doc.go, FINDINGS.md

Fixes #554

Co-authored-by: Claude <[email protected]>
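Why the removed term was dead code can be shown with a toy sketch: resetting progress to zero is itself the cost model, because the next batch step re-runs prefill over the whole prompt. The struct and helpers below are hypothetical; only the ProgressIndex=0 reset is taken from the commit.

```go
package main

import "fmt"

// Request carries only the fields relevant to this sketch.
type Request struct {
	PromptLen     int
	ProgressIndex int // tokens already processed; 0 means full re-prefill
}

// preempt models eviction: the request loses its progress. No separate
// latency term is charged here, which is why PreemptionProcessingTime
// always returning 0 made it removable.
func preempt(r *Request) {
	r.ProgressIndex = 0
}

// tokensToProcess is the work the latency model charges on the next
// step: after a preemption, the whole prompt again.
func tokensToProcess(r *Request) int {
	return r.PromptLen - r.ProgressIndex
}

func main() {
	r := &Request{PromptLen: 512, ProgressIndex: 512}
	preempt(r)
	fmt.Println(tokensToProcess(r)) // full re-prefill: 512
}
```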
fix(docs): correct cross-model coefficient count from 4 to 7 (4 beta + 3 alpha) (#479)

The cross-model backend uses 7 globally-fitted coefficients, not 4:
- 4 beta coefficients for step time (per-layer, KV bandwidth, MoE dispatch, TP sync)
- 3 alpha coefficients for CPU overhead (pre-scheduling, per-token, output processing)

All 7 are model-independent and stored in crossmodel_defaults. The "4 coefficients" framing incorrectly excluded the alpha parameters, which are equally global.

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>