
Tags: inference-sim/inference-sim


iter27


Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
feat(training): iter27 — joint CMA-ES 6-param optimization, loss 34.61% (#942)

CMA-ES joint search over 6 parameters (β₁ₐ, β₄, β₅, β₇, β₈, β₂ᵦ) from
iter26 warm start. 141 trials, best at trial 62.

Key interaction: β₄ joint-optimal at 0.752 (vs isolated 0.410), allowing
β₅ (49.6→32.4) and β₇ (169→126) to decrease — coordinate descent missed
the β₄/β₅/β₇ coupling.

Loss: 37.42% → 34.61% (-2.81 points).
TTFT RMSE: 24.34% → 22.81%, E2E RMSE: 13.09% → 11.79%.

Also: inner_loop_optimize.py max-workers fix (min 4 instead of min 1)
to prevent trial timeouts at high n_jobs settings.

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>

v0.7.0


feat(cluster): gateway queue with saturation-gated dispatch (#882) (#897)

* feat(cluster): gateway queue with saturation-gated dispatch (#882)

Add a gateway queue between admission and routing so BLIS can hold
admitted requests and dispatch them when cluster capacity opens,
modeling GIE flow control behavior.

New components:
- SaturationDetector interface with NeverSaturated, UtilizationDetector,
  and ConcurrencyDetector implementations (sim/saturation.go)
- GatewayQueue with FIFO and Priority dispatch ordering, capacity
  shedding (sim/cluster/gateway_queue.go)
- Completion-triggered dispatch in the cluster event loop
- Per-request GatewayQueueDelay metric (BC-8)
- INV-1 conservation extended: gateway_queue_depth + gateway_queue_shed

When flow control is disabled (default), behavior is identical to the
current pipeline (BC-1 pass-through equivalence verified by test).
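The detector/queue split described above can be sketched as follows. This is an illustrative shape only, assuming the interface and implementation names from the component list; the real definitions in sim/saturation.go may differ in signature and state inputs.

```go
package main

import "fmt"

// ClusterState is a hypothetical snapshot the detectors inspect.
type ClusterState struct {
	QueueDepth  int
	KVCacheUtil float64
	InFlight    int
}

// SaturationDetector decides whether dispatch from the gateway queue should hold.
type SaturationDetector interface {
	Saturated(s ClusterState) bool
}

// NeverSaturated models the pass-through default: flow control disabled.
type NeverSaturated struct{}

func (NeverSaturated) Saturated(ClusterState) bool { return false }

// UtilizationDetector saturates when either threshold is exceeded.
type UtilizationDetector struct {
	QueueDepthThreshold int
	KVUtilThreshold     float64
}

func (d UtilizationDetector) Saturated(s ClusterState) bool {
	return s.QueueDepth > d.QueueDepthThreshold || s.KVCacheUtil > d.KVUtilThreshold
}

// ConcurrencyDetector saturates at or above a maximum in-flight request count.
type ConcurrencyDetector struct{ MaxConcurrency int }

func (d ConcurrencyDetector) Saturated(s ClusterState) bool {
	return s.InFlight >= d.MaxConcurrency
}

func main() {
	var d SaturationDetector = UtilizationDetector{QueueDepthThreshold: 8, KVUtilThreshold: 0.9}
	fmt.Println(d.Saturated(ClusterState{QueueDepth: 10}))              // true: queue depth over threshold
	fmt.Println(NeverSaturated{}.Saturated(ClusterState{InFlight: 99})) // false: always passes through
}
```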

Closes #882

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cluster): address review findings — CLI validation, batch dispatch, test coverage

- Add CLI validation for all 7 flow control flags (R3: validate at CLI
  boundary). Checks NaN/Inf for float thresholds, validates detector name
  and dispatch order before passing to library constructors.
- Add NaN/Inf guards in UtilizationDetector constructor (P2-2).
- Fix BC-4: dispatch loop now fires up to delta times per batch completion
  instead of once, matching the contract for batch completions.
- Strengthen BC-1 pass-through test: compare StillQueued, StillRunning,
  TimedOutRequests, and verify GatewayQueueShed==0 and GatewayQueueDepth==0.
- Fix BC-8 test: use utilization detector with tight thresholds instead of
  NeverSaturated, so requests actually wait in the gateway queue and the
  test verifies GatewayEnqueueTime is set.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cluster): include RoutingRejections in flow control conservation test

The conservation test was missing the RoutingRejections() term, making it
inconsistent with the deferred queue conservation tests. Add it to match the INV-1 formula.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* perf(cluster): early-exit dispatch loop when saturated or queue empty

tryDispatchFromGatewayQueue now returns bool (dispatched or not).
The completion delta loop breaks early when saturated, avoiding
redundant buildRouterState calls in batch-completion scenarios.
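The BC-4 delta loop plus this early-exit return can be sketched together. Names here are stand-ins for illustration, not the real event-loop API:

```go
package main

import "fmt"

// gateway is a minimal stand-in for the cluster's gateway queue state.
type gateway struct {
	queue     []string // FIFO of held request IDs
	saturated bool
}

// tryDispatchFromGatewayQueue reports whether a request was dispatched, so
// the caller can break early instead of rebuilding router state pointlessly.
func (g *gateway) tryDispatchFromGatewayQueue() bool {
	if g.saturated || len(g.queue) == 0 {
		return false
	}
	g.queue = g.queue[1:] // dispatch head of FIFO
	return true
}

// onBatchCompletion fires up to delta dispatch attempts — one per freed slot,
// per the BC-4 contract — and stops at the first held or empty result.
func (g *gateway) onBatchCompletion(delta int) int {
	dispatched := 0
	for i := 0; i < delta; i++ {
		if !g.tryDispatchFromGatewayQueue() {
			break
		}
		dispatched++
	}
	return dispatched
}

func main() {
	g := &gateway{queue: []string{"r1", "r2", "r3"}}
	fmt.Println(g.onBatchCompletion(2)) // 2: two slots freed, two dispatches
	g.saturated = true
	fmt.Println(g.onBatchCompletion(5)) // 0: saturated, loop exits immediately
}
```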

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: INV-1 formula missing routing_rejections + interface{} to any

- Add routing_rejections to INV-1 cluster-level formula in CLAUDE.md and
  invariants.md, matching the conservation test and deferred queue tests.
- Use 'any' instead of 'interface{}' in gateway_queue.go heap methods
  for consistency with codebase convention (Go 1.18+).
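The extended INV-1 conservation identity can be expressed as a predicate like the one below. The counter names and the exact term set are assumptions pieced together from the commits above (the authoritative formula lives in CLAUDE.md and invariants.md), so treat this as a shape, not the actual invariant:

```go
package main

import "fmt"

// counters is a hypothetical roll-up of the cluster-level request counters.
type counters struct {
	injected, completed, stillQueued, stillRunning, timedOut int
	gatewayQueueDepth, gatewayQueueShed, routingRejections   int
}

// inv1Holds checks that every injected request is accounted for exactly once:
// finished, still pending somewhere, shed at the gateway, or rejected by routing.
func inv1Holds(c counters) bool {
	return c.injected == c.completed+c.stillQueued+c.stillRunning+c.timedOut+
		c.gatewayQueueDepth+c.gatewayQueueShed+c.routingRejections
}

func main() {
	c := counters{injected: 10, completed: 6, stillRunning: 1, gatewayQueueDepth: 2, gatewayQueueShed: 1}
	fmt.Println(inv1Holds(c)) // true: 6+1+2+1 accounts for all 10
	c.gatewayQueueShed = 0    // dropping the shed term (the bug fixed above) breaks conservation
	fmt.Println(inv1Holds(c)) // false
}
```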

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): warn on no-op flow control config + add example commands

- Warn when --flow-control is enabled but --saturation-detector is 'never'
  (pass-through), since this is likely not what the user intended.
- Add flow control example commands to CLAUDE.md Build and Run section.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: scope CLI validation to selected detector + update architecture docs

- Validate --queue-depth-threshold and --kv-cache-util-threshold only
  when detector is 'utilization'; validate --max-concurrency only when
  detector is 'concurrency'. Prevents confusing errors for irrelevant params.
- Add gateway queue stage to Online Routing Pipeline Walkthrough in
  architecture.md (step 4, optional when --flow-control enabled).

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: address pre-merge review findings (Issues 1-5)

1. Fix misleading comment at cluster_event.go — dispatch succeeds when
   cluster is not saturated, not just with NeverSaturated.
2. Add optional Gateway Queue node to Mermaid flowchart in architecture.md.
3. Add Trace gap comment at gateway shed path + reset GatewayEnqueueTime=0
   on shed-on-arrival to leave request struct clean.
4. BC-8 test: hard assertion + redesigned scenario with ConcurrencyDetector
   (maxConcurrency=1) and staggered arrivals to genuinely force queuing.
5. Debug log in tryDispatchFromGatewayQueue when held due to saturation;
   resolve EC-2 inline comments in saturation.go.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: Mert Toslali <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>

v0.6.14


Partially verified

We cannot verify signatures from co-authors, and some of the co-authors attributed to this commit require their commits to be signed.
fix(observe): calibrate prefix token ratio for BPE tokenizer fidelity (#834)

* fix(observe): calibrate prefix token ratio to match blis run fidelity

Prefix strings in `blis observe` used 1:1 word-to-token mapping, but BPE
tokenizers split multi-syllable vocabulary words into ~1.67 tokens/word,
inflating actual server token counts by ~60% vs what `blis run` simulates.

Add a calibration request at startup that measures the server's actual
tokens-per-word ratio, then scale prefix word counts accordingly so the
server tokenizes them to approximately the intended token count.
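The scaling step amounts to dividing the target token count by the measured ratio, with a guard for degenerate measurements. A sketch with illustrative names (the real logic, including the R11 division guard added in a later commit, lives in blis observe):

```go
package main

import (
	"fmt"
	"math"
)

// calibratedWords converts a target token count into a word count, given the
// server's measured tokens-per-word ratio, so BPE expansion lands near target.
func calibratedWords(targetTokens int, tokensPerWord float64) int {
	if tokensPerWord <= 0 { // division guard: fall back to the old 1:1 mapping
		tokensPerWord = 1.0
	}
	return int(math.Round(float64(targetTokens) / tokensPerWord))
}

func main() {
	// A BPE tokenizer splitting vocabulary words into ~1.67 tokens/word:
	fmt.Println(calibratedWords(100, 1.67)) // 60: ~60 words tokenize to ~100 tokens
	fmt.Println(calibratedWords(100, 0))    // 100: guard falls back to 1:1
}
```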

Fixes #832

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(observe): check json.Encode error return in calibration tests

Satisfies errcheck linter — matches existing test handler pattern.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(observe): add calibration timeout and derive word count from vocabulary

Address review feedback:
- Add 30s timeout to calibration context to prevent indefinite hangs
- Derive calibrationWordCount from len(prefixVocabulary) to prevent
  silent divergence if vocabulary is expanded

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(observe): correct stale comment — prefixLengths stores target tokens, not words

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* test(observe): add BC-5 test for empty prefix groups

Verifies buildPrefixStrings returns empty maps when no prefix groups
exist, confirming no calibration work is needed.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(observe): address review — nil safety, defer cancel, R11 guard, diagnostics, tests

- Split compound nil check: guard record==nil before accessing fields
- Use defer for calibCancel() to prevent context leak on panic
- Add R11 division guard in buildPrefixStrings for tokensPerWord<=0
- Surface specific diagnostic when server returns 0 prompt_tokens
- Add lower-bound fallback test (ratio < 1.0)
- Add end-to-end test verifying suffix uses token count not word count

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs(observe): document prefix token calibration in Prefix sharing box

Users will see calibration log output at startup; this provides context
for what it means and how it affects prefix string generation.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
Co-authored-by: Srinivasan Parthasarathy <[email protected]>

v0.6.13


build: Dockerfile + release workflow for ghcr.io/inference-sim/blis (#807)

* build: add multi-stage Dockerfile for blis binary

* ci: build and push ghcr.io/inference-sim/blis on version tag

* fix: TARGETARCH cross-compile, non-root user, stable-only latest tag, improved comments

* fix: preserve v-prefix in semver image tags to match git describe output

v0.6.12


docs: add missing observe flags, convert inference-perf, trace-output, and pipeline diagram (#734)
- Add --trace-output run example to CLAUDE.md (BC-3)
- Add rate-mode observe example with distribution flags, auth, concurrency,
  and streaming options to CLAUDE.md (BC-1)
- Add convert inference-perf --spec example to CLAUDE.md (BC-2)
- Add observe/replay/calibrate mermaid pipeline diagram to architecture.md
  with cross-references to guide and config reference pages (BC-4)

Fixes #720, fixes #721

Co-authored-by: Claude <[email protected]>

v0.6.11


fix(kv): commit reloaded prefix blocks in partial-improvement path (#640) (#649)

* fix(kv): commit reloaded prefix blocks in partial-improvement path (BC-1, BC-2)

- Before fix: partial CPU reload left reloaded blocks on GPU free list
  with RefCount=0; subsequent popFreeBlock could evict them and clear
  their hashes (R1 silent data loss). Fresh blocks also chained prevHash
  from the pre-reload state, producing incorrect prefix hashes.
- After fix: commitCachedBlocks() called for newCached[startBlock:newStartBlock]
  before delegating fresh allocation to gpu.AllocateKVBlocks, mirroring
  the full-reload branch (line 207-230) and vLLM v1 commit-before-allocate.
- Uses same ceiling-division startBlock as full-reload branch (line 222)
  to avoid double-committing the partially-filled last block for running
  requests (BC-4).
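The ceiling-division boundary can be sketched as below (names are illustrative): a running request's last block may be only partially filled, so fresh commits must start past it rather than re-commit it.

```go
package main

import "fmt"

// startBlock computes the first block index eligible for a fresh commit:
// the ceiling of processedTokens / blockSize, so a partially filled last
// block is skipped rather than double-committed (the BC-4 concern).
func startBlock(processedTokens, blockSize int) int {
	return (processedTokens + blockSize - 1) / blockSize
}

func main() {
	fmt.Println(startBlock(32, 16)) // 2: last block exactly full, commit from block 2
	fmt.Println(startBlock(33, 16)) // 3: block 2 partially filled, start past it
}
```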

Fixes #640

Co-Authored-By: Claude <[email protected]>

* test(kv): add partial-reload block commitment tests (BC-3, BC-5)

- TestTieredKVCache_PartialReload_NewRequest_Revised: verifies new request
  partial reload commits prefix from block 0 with correct hash chain;
  committed block persists in RequestMap even when tail allocation fails
- INV-4 conservation verified in both running and new-request tests

Co-Authored-By: Claude <[email protected]>

* docs(kv): update commitCachedBlocks docstring for partial-improvement path

The method is now used in two TieredKVCache paths: the existing full-reload
path (returns true immediately, no rollback needed) and the new
partial-improvement path (calls AllocateKVBlocks afterwards). Add inline
explanation of why the transactional gap is safe: in BLIS's single-threaded
DES, once AllocateKVBlocks' pre-check passes, countFreeBlocks() cannot
decrease before the allocation loop runs, making mid-loop failure impossible.

Co-Authored-By: Claude <[email protected]>

* refactor(kv): replace brittle line reference with descriptive comment

Replace "same as line 222" with "same ceiling division as the full-range
reload path above" to avoid comment rot when line numbers shift.

Co-Authored-By: Claude <[email protected]>

* docs(kv): strengthen commitCachedBlocks pre-check guarantee claim

The docstring said "common case" for pre-check failure in the
partial-improvement path. This is actually a mathematical guarantee:
the original failure condition N > F implies the tail condition
(N-R) > (F-R), so the tail pre-check always fails when reached via
this path. Clarify the docstring accordingly.

Co-Authored-By: Claude <[email protected]>

* docs(kv): correct commitCachedBlocks docstring — pre-check can pass

The previous docstring claimed AllocateKVBlocks 'always fails at its
pre-check' in the partial-improvement path. This is incorrect when the
last block before commit is partially filled: the effectiveTokens
adjustment reduces the needed block count below N-R, allowing the
pre-check to pass and the allocation to succeed. Correct the docstring
to state that in either case (pre-check failure or success) the
committed state is stable, and the single-threaded DES guarantee
prevents mid-loop failure.

Co-Authored-By: Claude <[email protected]>

* docs(plans): add micro-plan for partial-improvement block commitment fix

Co-Authored-By: Claude <[email protected]>

---------

Co-authored-by: Claude <[email protected]>

v0.6.10


feat(kv): replace inverted tiered cache with vLLM v1 mirror model (#638)

* feat(kv): add MirrorToCPU to KVStore interface (BC-8)

- Add MirrorToCPU(batch []*Request) to KVStore interface
- Implement no-op on KVCacheState (single-tier)
- Add stub on TieredKVCache (full impl in later task)
- R13: both implementations satisfy interface

Co-Authored-By: Claude <[email protected]>

* refactor(kv): rewrite CPU tier with hash-keyed LRU (BC-4)

- Replace offloadedBlock/cpuTier with cpuBlock/cpuTier
- Hash-keyed map + doubly-linked list for O(1) operations
- Pre-allocated token slices eliminate per-mirror GC pressure
- store/touch/lookup/evict all O(1)
- R3: newCpuTier validates capacity > 0, blockSize > 0
- BC-7: deprecation warning for KVOffloadThreshold
- Remove old TieredKVCache fields (offloadThreshold, offloadCount,
  thrashingCount, clock). Stub old methods for compilation.
- Delete all 14 old tiered_test.go tests (tested inverted semantics)
- KVThrashingRate repurposed to cpuEvictionCount/mirrorCount

Note: store_test.go setupTieredWithLatency tests expected to fail
until Task 7 rewrites the helper.

Co-Authored-By: Claude <[email protected]>

* feat(kv): implement targeted CPU→GPU reload (BC-2, BC-6)

- Replace tryReloadFromCPU with reloadPrefixFromCPU
- Compute hierarchical hashes for requesting prefix only
- maxReloads = countFreeBlocks() prevents hash destruction
- CPU blocks touched on reload to refresh LRU recency
- Transfer latency accumulates per reloaded block
- Unrelated CPU blocks are never touched

Co-Authored-By: Claude <[email protected]>

* feat(kv): implement MirrorToCPU with touch semantics (BC-1, BC-9)

- Store newly-completed full blocks to CPU tier
- Touch existing blocks to refresh LRU recency
- GPU HashToBlock never modified (read-only copy)
- Skip partial blocks and unhashed blocks
- Nil/empty batch safe

Co-Authored-By: Claude <[email protected]>

* test(kv): add BC-3 GPU prefix preservation test

- Verify ReleaseKVBlocks preserves hashes on GPU free list
- GetCachedBlocks still finds prefix after release
- No offload triggered (maybeOffload removed in Task 2)

Co-Authored-By: Claude <[email protected]>

* feat(kv): integrate MirrorToCPU in Step() + BC-5/KVThrashingRate tests

- Insert MirrorToCPU between executeBatchStep and processCompletions
- BC-5 test: CPU extends GPU prefix lifetime through eviction+reload
- KVThrashingRate tests: CPU eviction rate + R11 zero-guard

Co-Authored-By: Claude <[email protected]>

* test(kv): INV-4 conservation test + rewrite setupTieredWithLatency

- Add INV-4 conservation test through full mirror+reload lifecycle
- Rewrite setupTieredWithLatency to use mirror+reload (not offload)
- All 27 kv tests pass, full sim suite green

Co-Authored-By: Claude <[email protected]>

* docs: update CLAUDE.md and config.go for tiered cache v1 model (BC-7)

- Deprecate KVOffloadThreshold in config.go
- Update tiered.go description in CLAUDE.md (mirror/reload, vLLM v1)
- Update KVStore interface description (12 methods)
- Update register_test.go stale threshold comment

Co-Authored-By: Claude <[email protected]>

* fix(kv): fix tautological INV-4 test + add baseLat validation (R3)

- INV-4 conservation test was `total == total` (always true).
  Now walks GPU free list independently of UsedBlockCnt to verify
  UsedBlockCnt + freeListLen == TotalBlocks.
- Add baseLat >= 0 validation in NewTieredKVCache (R3).
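The non-tautological check described above walks the free list independently of the used-block counter. A sketch with hypothetical state types, following the shape the commit describes:

```go
package main

import "fmt"

// block is a minimal free-list node stand-in.
type block struct{ next *block }

// gpuState holds the counters INV-4 relates: a used-block count maintained
// by allocation code, plus a free list walked independently here.
type gpuState struct {
	totalBlocks  int
	usedBlockCnt int
	freeHead     *block
}

func freeListLen(g *gpuState) int {
	n := 0
	for b := g.freeHead; b != nil; b = b.next {
		n++
	}
	return n
}

// inv4Holds verifies conservation: used + free must equal the total,
// computed from two independent sources so the test cannot be tautological.
func inv4Holds(g *gpuState) bool {
	return g.usedBlockCnt+freeListLen(g) == g.totalBlocks
}

func main() {
	tail := &block{}
	g := &gpuState{totalBlocks: 5, usedBlockCnt: 3, freeHead: &block{next: tail}}
	fmt.Println(inv4Holds(g)) // true: 3 used + 2 free == 5 total
	g.usedBlockCnt = 4        // a leaked or double-counted block breaks conservation
	fmt.Println(inv4Holds(g)) // false
}
```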

Co-Authored-By: Claude <[email protected]>

* fix(kv): remove CacheHitRate double-count of CPU-reloaded blocks

gpu.CacheHits already includes CPU-reloaded blocks (they appear as
GPU cache hits on the retry allocation after reload). Adding
cpuHitCount on top double-counted the same blocks. Pre-existing
bug from the old tryReloadFromCPU path, now corrected.

Co-Authored-By: Claude <[email protected]>

* test(kv): add baseLat negative validation test (R3, self-audit)

Found during Step 4.75 pre-commit self-audit: baseLat >= 0
validation was added but had no companion test.

Co-Authored-By: Claude <[email protected]>

---------

Co-authored-by: Claude <[email protected]>

v0.6.9


feat(latency): add trained-roofline backend with roofline basis functions × learned corrections (#616)

* feat(latency): register trained-roofline backend name (BC-1)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* feat(latency): add PostDecodeFixedOverhead to interface + implement TrainedRooflineLatencyModel (BC-3,6,7,8,9,11,15)

- Add PostDecodeFixedOverhead() int64 to LatencyModel interface
- Existing backends (blackbox, roofline, crossmodel) return 0
- Simulator recordRequestCompletion adds PostDecodeFixedOverhead to E2E
- TrainedRooflineLatencyModel: 6 roofline basis functions with learned corrections
- Zero heap allocations in StepTime (19ns/op, 0 allocs/op)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* feat(latency): wire trained-roofline factory in NewLatencyModel (BC-2,13,14)

- Full validation: TP, NumLayers, NumHeads, HiddenDim, IntermediateDim,
  TFlopsPeak, BwPeakTBs, NumHeads%TP, NumKVHeads%TP, 7 beta coefficients
- Derives architecture features at construction: headDim, dKV, dFF, kEff
- Table-driven error tests for all validation paths

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* test(latency): add monotonicity behavioral tests for trained-roofline (BC-4,5)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* feat(latency): add trained-roofline defaults + CLI loading (BC-10,12)

- Add trained_roofline_defaults section to defaults.yaml with 7 betas + 3 alphas
- Add TrainedRooflineDefaults struct to cmd/default_config.go
- CLI handling: 4 sites in cmd/root.go (loading block, zero-coefficients guard,
  HFConfig parsing, help text)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* docs: add trained-roofline to latency model documentation

- CLAUDE.md: "Four modes", file tree, Key Data Flow
- sim/config.go: Backend field comment
- sim/latency/latency.go: package doc
- docs/concepts/core-engine.md: "four latency model backends"
- docs/concepts/glossary.md: "Four modes" + trained-roofline description
- Plan committed alongside implementation

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix(sim): guard PostDecodeFixedOverhead for zero-output requests + fix ITL contamination

- PostDecodeFixedOverhead only applied when len(OutputTokens) > 0
- RequestITLs computed from itlSum directly (not lat-FirstTokenTime) to
  avoid contaminating per-token average ITL with fixed overhead
- Add zero-alpha warning for trained-roofline CLI path

Caught by code review Step 4.5.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* docs(guide): comprehensive trained-roofline section in latency models guide

- Add trained-roofline section with formula, alpha model, accuracy caveats
- Update comparison table to 4 backends
- Update recommendation: trained-roofline is now the default for new models
- Update pluggable architecture to show 4 interface methods
- Fix cross-model description accuracy

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: convergence Round 1 fixes — extension recipe, slice copy, zero-alloc test, config ref

- Extension recipe: 3→4 methods, added bundle.go + CLI wiring touch points,
  added trained-roofline as 4th example
- Factory: defensive copy of beta/alpha slices to enforce "frozen" contract
- Test: add TestTrainedRoofline_StepTime_ZeroAllocs using testing.AllocsPerRun
- Configuration reference: add trained-roofline to --latency-model flag description
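The zero-allocation test pattern named above is standard Go: testing.AllocsPerRun averages heap allocations over repeated calls, so a hot-path function can be pinned to 0 allocs/op. A sketch with a stand-in step function, not the real TrainedRooflineLatencyModel method:

```go
package main

import (
	"fmt"
	"testing"
)

// stepTime is an illustrative hot-path function: pure arithmetic over
// preallocated inputs, so it performs no per-call heap allocation.
func stepTime(batchTokens int, beta []float64) float64 {
	t := 0.0
	for _, b := range beta {
		t += b * float64(batchTokens)
	}
	return t
}

func main() {
	beta := []float64{1.5, 0.25}
	// AllocsPerRun reports the average allocations across 100 invocations.
	allocs := testing.AllocsPerRun(100, func() {
		_ = stepTime(4096, beta)
	})
	fmt.Println(allocs) // 0: the closure captures beta but allocates nothing per run
}
```

The same call works inside a real test function, which is how a guard like TestTrainedRoofline_StepTime_ZeroAllocs would assert the contract.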

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* docs: add trained-roofline to quickstart + document non-blocking overhead pattern

- Quickstart: add trained-roofline example (recommended for new models)
- recordRequestCompletion: document that E2E includes non-blocking
  PostDecodeFixedOverhead and OutputTokenProcessingTime, explaining why
  RequestCompletionTimes exceeds RequestLeftEvent timestamp

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* docs: self-audit — update models.md, roofline.md, tutorial.md for trained-roofline

All documentation working copies now mention trained-roofline consistently.
Source-of-truth map: 12/12 working copies updated.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>

v0.6.8


refactor(sim): remove PreemptionProcessingTime from LatencyModel interface (#554) (#555)

Remove dead code: PreemptionProcessingTime() always returned 0 in all
three backends, and PreemptionEvent was a no-op event type (Execute()
only logged). The actual cost of preemption (re-prefill) is already
modeled by the ProgressIndex=0 reset in batch formation.

- Remove PreemptionProcessingTime() from LatencyModel interface (5→4 methods)
- Remove PreemptionEvent type from event.go
- Remove PreemptionDelay field from PreemptedRequest struct
- Replace PreemptionEvent scheduling with inline logrus.Debugf
- Preserve PreemptionCount++ metric and all preemption mechanics
- Update docs: CLAUDE.md, core-engine.md, latency-models.md,
  extension-recipes.md, design-guidelines.md, doc.go, FINDINGS.md

Fixes #554

Co-authored-by: Claude <[email protected]>

v0.6.7


fix(docs): correct cross-model coefficient count from 4 to 7 (4 beta + 3 alpha) (#479)

The cross-model backend uses 7 globally-fitted coefficients, not 4:
- 4 beta coefficients for step time (per-layer, KV bandwidth, MoE dispatch, TP sync)
- 3 alpha coefficients for CPU overhead (pre-scheduling, per-token, output processing)

All 7 are model-independent and stored in crossmodel_defaults. The "4 coefficients"
framing incorrectly excluded the alpha parameters which are equally global.

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>