Tags · inference-sim/inference-sim

v0.7.10

fix(workload): per-phase rate normalization for phased workloads (#1146)

* fix(workload): per-phase rate normalization for phased workloads (#1144)

normalizeRateFractions was summing all clients' rate_fractions globally,
halving rates for workloads with non-overlapping lifecycle windows.

Two-part fix:
1. Per-phase normalization: each client's fraction is divided by the sum
   of co-active clients (overlapping lifecycle windows), not the global
   sum. Clients without lifecycle windows are always-on.
2. inference-perf multi-stage: uses CustomSamplerFactory with Poisson at
   exact per-client rate, bypassing fraction normalization entirely.

Golden datasets regenerated due to RNG sequence shift from
CustomSamplerFactory sub-RNG allocation.

Fixes #1144

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* address review: document always-on limitation + add test case

- I-1: Add test for always-on + two non-overlapping phases documenting
  that per-phase totals are < aggregate_rate (known limitation of
  per-client normalization). Add docstring and reference doc note.
- M-1: Add comment in generator.go clarifying clientRate is only used
  for the skip guard when CustomSamplerFactory is set.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: Mert Toslali <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>

Apr 23, 2026
d06d301
zip
tar.gz
Notes

v0.7.9

fix(observe): set status=timeout on HTTP timeout, add --timeout flag (#…

…1119)

* fix(observe): set status=timeout on HTTP timeout, add --timeout flag (#1118)

- Fix silent data corruption: streaming, non-streaming, and HTTP-level
  timeouts now set record.Status="timeout" instead of silent "ok"
- Add isTimeoutError() helper checking os.IsTimeout + context.DeadlineExceeded
- Add --timeout CLI flag (seconds, default 300) for configurable HTTP timeout
- Add WithHTTPTimeout RealClientOption following existing functional-options pattern
- Extract defaultHTTPTimeoutSeconds const shared between constructor and CLI flag

Fixes #1118

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* review(observe): add isTimeoutError unit test, fix plan-code divergence

- Add TestIsTimeoutError: table-driven test covering both detection
  branches (os.IsTimeout, context.DeadlineExceeded) plus edge cases
  (nil, generic error, context.Canceled, wrapped deadline).
- Update plan BC-5 and Task 3 to reflect the 86400 upper bound
  validation that was added during code review.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* review(observe): add io.EOF test case, clarify timeout unit in error message

- Add io.EOF case to TestIsTimeoutError (comment promised it, test was
  missing it — found by comment-analyzer round 2)
- Add "seconds" to --timeout validation error message for clarity

Co-Authored-By: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: Mert Toslali <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>

Apr 22, 2026
87cc70e
zip
tar.gz
Notes

v0.7.8

feat(latency): deprecate trained-roofline, crossmodel, and blackbox b…

…ackends (#1107)

* feat(latency): add deprecation warning for blackbox backend (BC-3)

- Emit logrus.Warn when blackbox backend is selected
- Warning directs users to trained-physics as replacement
- Backend remains fully functional (BC-4, BC-8)

Co-Authored-By: Claude <[email protected]>

* feat(latency): add deprecation warning for crossmodel backend (BC-2)

- Emit logrus.Warn when crossmodel backend is selected
- Warning directs users to trained-physics as replacement
- Backend remains fully functional (BC-4, BC-8)

Co-Authored-By: Claude <[email protected]>

* feat(latency): add deprecation warning for trained-roofline backend (BC-1)

- Emit logrus.Warn when trained-roofline backend is selected
- Warning directs users to trained-physics as replacement
- Backend remains fully functional (BC-4, BC-8)

Co-Authored-By: Claude <[email protected]>

* test(latency): verify non-deprecated backends emit no warnings (BC-5)

- Add test for roofline (no warning)
- Add test for trained-physics (no warning)
- Ensures deprecation warnings are backend-specific

Co-Authored-By: Claude <[email protected]>

* docs(claude): mark trained-roofline, crossmodel, blackbox as deprecated (BC-6)

- Update latency estimation section with deprecation notice
- Recommend trained-physics as replacement
- Reference docs/guide/latency-models.md for migration details

Co-Authored-By: Claude <[email protected]>

* docs(readme): mark trained-roofline, crossmodel, blackbox as deprecated (BC-6)

- Strike through deprecated backends in features bullet
- Add inline deprecation notice to file tree
- Direct users to trained-physics

Co-Authored-By: Claude <[email protected]>

* docs(guide): add deprecation notices to latency model guide (BC-6)

- Mark blackbox, crossmodel, trained-roofline sections as deprecated
- Add admonition blocks with migration guidance
- Update opening paragraph with recommended backend

Co-Authored-By: Claude <[email protected]>

* fix(test): update pre-existing roofline test with required config fields

- Add BytesPerParam, MfuPrefill, MfuDecode to roofline test
- Fixes test that was broken by stricter validation
- Pre-existing test updated for consistency

Co-Authored-By: Claude <[email protected]>

* fix(sim/latency): address PR #1107 review feedback

Fixes all critical and recommended issues from review:

**Critical Issues (Required):**

1. **Per-instance warning spam**: Added sync.Once to emit deprecation
   warnings at most once per process (instead of once per instance in
   multi-instance clusters).

2. **Restored monotonicity tests**: Re-added 4 deleted tests that verify
   system invariants (monotonicity laws + edge case validation):
   - TestBlackboxLatencyModel_StepTime_Monotonic
   - TestRooflineLatencyModel_StepTime_PositiveAndMonotonic
   - TestNewLatencyModel_UnknownBackend_ReturnsError
   - TestNewLatencyModel_NegativeCoefficients_ReturnsError

3. **Removed structural assertions**: Replaced type checks with behavioral
   assertions per BDD principles (refactor-survival test). Models are now
   verified by calling StepTime() and checking output validity, not by
   inspecting concrete types.

**Important Issues (Recommended):**

4. **Naming consistency**: Changed "cross-model" (hyphenated) to "crossmodel"
   in docs/guide/latency-models.md to match CLI flag. Added explicit flag
   values in deprecation warning box for clarity.

5. **Missing deprecation note**: Added deprecation marker to trained_roofline.go
   in README.md file tree.

Note: Removed three deprecation warning emission tests. sync.Once makes
these tests execution-order-dependent and fragile. Warnings are verified
via manual testing and visible in other test output.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* docs: address minor review feedback

Fixes two minor documentation issues from re-review:

1. **README latency.go comment**: Added DEPRECATED markers to
   BlackboxLatencyModel and CrossModelLatencyModel in the file tree
   comment (line 262) for consistency with crossmodel.go and
   trained_roofline.go entries.

2. **Test comment clarity**: Updated latency_test.go:521 comment to
   explicitly state that BC-1/2/3 (deprecated backends emit warnings)
   are not automatable due to sync.Once, while BC-5 (non-deprecated
   backends emit no warnings) IS tested.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* fix(test): add automated BC-1/2/3 tests and restore table-driven negative coeff test

Addresses two critical issues from third review:

1. **Automated deprecation warning tests (BC-1/2/3)**: Added
   resetDeprecationWarningsForTest() function to reset sync.Once vars,
   enabling reliable test isolation. Restored three positive deprecation
   tests that now call reset at test start, making the primary behavioral
   contracts of this PR automatable in CI.

2. **Table-driven negative coefficients test**: Restored original
   table-driven test covering all four negative coefficient positions
   (alpha[0], alpha[2], beta[0], beta[1]) instead of single case. This
   catches bugs that only check specific indices in validateCoeffs.

3. **Documentation fix**: Updated "Choosing the right mode" tip box to
   stop actively recommending deprecated backends. Now clearly marks
   blackbox/trained-roofline/crossmodel as deprecated and directs users
   to trained-physics.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* revert: undo unnecessary changes to TestBlackboxLatencyModel_StepTime_EmptyBatch

Reverted unrelated changes to empty batch test:
- Restored beta0 from 500 back to 1000 (original value)
- Removed unnecessary assertion checking exact beta0 value
- Restored original comment about interface contract

The original test was correct - it tested the interface contract
(StepTime >= 1) not implementation details (exact beta0 value).

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* fix(test): add only required test changes - no unnecessary modifications

Cleanly added ONLY the requested test changes:

**Added:**
1. Two imports: bytes, logrus (required for deprecation tests)
2. TestNewLatencyModel_Blackbox_EmitsDeprecationWarning (BC-1)
3. TestNewLatencyModel_Crossmodel_EmitsDeprecationWarning (BC-2)
4. TestNewLatencyModel_TrainedRoofline_EmitsDeprecationWarning (BC-3)
5. TestNewLatencyModel_Roofline_NoDeprecationWarning (BC-5)
6. TestNewLatencyModel_TrainedPhysics_NoDeprecationWarning (BC-5)

All three positive tests call resetDeprecationWarningsForTest() for isolation.

**NOT changed:**
- No test renames
- No coefficient value changes
- No test logic modifications
- No unnecessary "improvements"

The diff is minimal and focused on exactly what was requested by reviews.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* docs: clarify roofline is default, trained-physics is recommended

Fixed incorrect conflation of "default" and "recommended":
- Roofline is the DEFAULT (what you get without --latency-model flag)
- Trained-physics is RECOMMENDED (best accuracy for new work)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* docs: fix factory docstring and remove deprecated blackbox recommendation

Addresses two Important issues from fourth review:

1. **Factory docstring (latency.go:149-152)**: Added trained-physics to
   dispatch list and noted that deprecated backends emit logrus.Warn
   side-effect once per process. IDE hover and go doc now show complete
   information.

2. **Cross-model section (latency-models.md:162)**: Removed recommendation
   of deprecated blackbox mode for dense prefill workloads. Now recommends
   trained-physics instead, which provides learned corrections without the
   deprecation path.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

---------

Co-authored-by: Claude <[email protected]>

Apr 21, 2026
28e1f49
zip
tar.gz
Notes

v0.7.7

fix(kv): check-then-act allocation replacing rollback (vLLM parity) (#…

…1061) (#1068)

* refactor(kv): replace UsedBlockCnt with direct FreeBlockCnt counter

Mirrors vLLM's FreeKVCacheBlockQueue.num_free_blocks pattern.
Counter is maintained by appendToFreeList/removeFromFreeList/prependToFreeList,
eliminating arithmetic derivation that can drift under partial
mutation bugs (#1061).

UsedBlocks() accessor is now derived as TotalBlocks - FreeBlockCnt
(read-only for callers, not source of truth for allocation).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix(kv): add decode pre-check to prevent #1061 block leak

Mirrors vLLM's universal check-then-act gate
(kv_cache_manager.py:334, single_type_kv_cache_manager.py:95-101).
Returns false before any state mutation when the last block is full
and no free blocks exist. Preserves RequestMap for continuing
requests, preventing the orphaned-block deadlock.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* refactor(kv): remove rollback machinery, convert to check-then-act

Removes rollbackAllocation, cachedBlockMutation, newBlockMutation,
and prependToFreeList (~60 lines). Pre-check now accounts for
cached-block free-list consumption (mirrors vLLM's
num_evictable_blocks). Post-pre-check popFreeBlock nil is a panic
(INV-4 invariant violation, structurally unreachable in
single-threaded DES).

Decode pre-check extended to handle preempted requests with no
RequestMap entry (ProgressIndex past input but blocks released).

Fixes #1061 — the block leak was caused by rollbackAllocation
deleting RequestMap for continuing requests. With rollback removed,
the bug class is eliminated entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* feat(kv): add verifyBlockConservation debug assertion

Walks free list and block InUse flags independently to verify INV-4.
Also detects FreeBlockCnt drift. Unexported method on *KVCacheState
(not on KVStore interface per R13). Intended for step-boundary
assertions in debug mode.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* docs(sim): fix processCompletions comment to reflect check-then-act

The comment claimed AllocateKVBlocks only modifies RequestMap on
success. This was false under rollback (#1061 root cause) but is
now provably true under check-then-act.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* test(kv): add preemption retry regression test for #1061

Reproduces leak path 1: decode failure → eviction → retry.
Verifies RequestMap is preserved through the retry cycle
under check-then-act (would have caught #1061 rollback bug).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* chore(kv): remove unused tiered verifyBlockConservation

Linter flagged unexported method with no callers. Can be re-added
when a debug-mode caller is wired up.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* test(kv): add high-stress regression tests for #1061/#1057/#963

Four integration-level KV cache tests that exercise the deadlock
scenarios fixed by the check-then-act refactor:

- Sustained KV pressure with preemption cycles (#963)
- Decode failure/retry preserves blocks through many cycles (#1061)
- Cached block budget exhaustion under pressure (#1057)
- Block conservation through complete allocation lifecycle

All tests verify INV-4 (block conservation) and INV-1 (request
accounting) as behavioral contracts.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* docs: update extension-recipes.md for check-then-act pattern

Replaces stale rollback references (rollbackAllocation,
cachedBlockMutation, newBlockMutation, UsedBlockCnt) with
check-then-act documentation. Describes pre-check gate,
cachedFromFreeList budget, and tiered cache interaction.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* docs: add design doc and implementation plan for check-then-act KV

Design doc: behavioral equivalence proofs with vLLM source citations.
Plan: 7-task TDD implementation with convergence review results.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: address PR review findings (tautological test, stale docs)

- assertBlockConservation now calls verifyBlockConservation() for
  independent free-list walk (was tautological: free = total - used
  then checking used + free != total)
- Fix stale "UsedBlockCnt" in error message → "UsedBlocks()"
- Update INV-4 verification description: check-then-act, not rollback
- Update R5: check-then-act as preferred strategy over rollback
- Update R5 checklist item

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: address follow-up review suggestions

- Rewrite tiered_test.go:874-926 comments from before/after rollback
  framing to present-tense invariant language (rollbackAllocation no
  longer exists, so references to it were undefined terms)
- Add tests for decode "no existing blocks" path (cache.go:247-252):
  both the failure case (0 free) and success case (free available)
- Simplify assertFullConservation to avoid double-calling
  verifyBlockConservation (now delegates to assertBlockConservation)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>

Apr 16, 2026
83e817d
zip
tar.gz
Notes

v0.7.6

fix(observe): replace uniform hello prompts with diverse vocabulary w…

…ords (#1039)

* fix(observe): replace uniform "hello" prompts with diverse vocabulary words

blis observe generated identical "hello hello hello..." prompts for all
requests when no prefix group was configured, causing artificial KV cache
hits on vLLM servers with enable_prefix_caching=True. This invalidated
sim2real comparisons by making observed latencies artificially low.

- Map each request's random token IDs to prefixVocabulary words via
  modular indexing (tokensToPrompt), ensuring different requests produce
  different prompts with no shared artificial prefix
- Apply same fix to suffix portion of prefix-group requests
- Always calibrate tokensPerWord ratio (PR #834) and scale word count
  so the server tokenizes prompts to the intended token length
- Pass tokensPerWord through runObserveOrchestrator to requestToPending

Fixes #1037

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(observe): address PR review feedback

- Fix BC comment reference: line 730 now references BC-3/BC-6 (not BC-5)
- Add test for unknown prefix group fallback path (PrefixGroup set but
  absent from prefixes map falls back to tokensToPrompt)
- Use diverse token values in suffix word count test instead of all-zeros

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(observe): defensive guards and edge-case tests per review

- Guard negative token IDs in tokensToPrompt: ((idx%vocabLen)+vocabLen)%vocabLen
- Guard non-positive tokensPerWord divisor in requestToPending (R3/R11)
- Add upper-bound guard on suffixStart for symmetry with lower-bound
- Add TestTokensToPrompt_NegativeTokenIDs
- Add TestRequestToPending_WordCountClampedToOne

Co-Authored-By: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: Mert Toslali <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>

Apr 14, 2026
b05ce8f
zip
tar.gz
Notes

v0.7.5

fix(routing): align PPC scorer zero-cache normalization to llm-d (#1028)

* fix(routing): align PPC scorer zero-cache normalization to llm-d (1.0)

Remove the special-case branch that returned 0.5 when all instances had
zero cached prefix blocks. The all-equal path now unconditionally returns
1.0, matching llm-d's indexedScoresToNormalizedScoredPods behavior.

- Implement BC-1: all-zero cache returns 1.0 (llm-d parity)
- Verify BC-2: all-equal nonzero still returns 1.0 (no regression)
- Verify BC-3: divergent caches use min-max normalization (no regression)

Fixes #1007

Co-Authored-By: Claude <[email protected]>

* docs(routing): document PPC all-equal normalization in scorer tables

Address review feedback from #1028:
- Add "all-equal (including all-zero) → 1.0 (llm-d parity)" to
  precise-prefix-cache rows in architecture.md and routing.md,
  matching the style used by active-requests scorer
- Remove stale BC-2 reference from test comment (was referencing
  the original scorer plan's numbering, not this PR's contracts)

Co-Authored-By: Claude <[email protected]>

---------

Co-authored-by: Mert Toslali <[email protected]>
Co-authored-by: Claude <[email protected]>
Co-authored-by: Srinivasan Parthasarathy <[email protected]>

Apr 13, 2026
e7b96e4
zip
tar.gz
Notes

v0.7.4

docs(routing): add active-requests, running-requests, load-aware to a…

…ll scorer docs (#973)

Update 5 documentation files that enumerate scorers to include the
three new scorers added in #966. Adds table rows, updates lists, and
documents signal freshness characteristics for each new scorer.

- docs/guide/routing.md: Available Scorers table (6 → 9 rows)
- docs/concepts/architecture.md: Built-in Scorers table (6 → 9 rows)
- docs/reference/configuration.md: --routing-scorers available list
- docs/concepts/glossary.md: Scorer entry with load-aware sub-range note
- docs/contributing/standards/invariants.md: INV-7 signal freshness

Fixes #972

Co-authored-by: Mert Toslali <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>
Co-authored-by: Srinivasan Parthasarathy <[email protected]>

Apr 9, 2026
5346a76
zip
tar.gz
Notes

v0.7.3

hardening(workload): warn on zero-session closed-loop client; test si…

…ngle-stage multi-user (#983)

* hardening(workload): warn on zero-session closed-loop client; test single-stage multi-user (#976, #979)

Closes #976: add logrus.Warnf in GenerateWorkload after the session-matching
loop when a closed-loop client produces zero SessionBlueprints. This fires
only if req.ClientID is unset on round-0 requests (e.g. a future code path
that bypasses GenerateReasoningRequests). With the current ClientID predicate
from #975 this should never fire, but the warning makes the failure mode
immediately observable if it ever does.

Closes #979 (T1-1): add regression test
TestGenerateWorkload_SingleStageMultiUserMultiTurn_OneSessionPerClient for
the single-stage analog of #974. Single-stage workloads with
NumUsersPerSystemPrompt > 1 share TenantID = prefixGroup across all users
in the same prompt group — the same conflation trigger as the multi-stage
case fixed in #975. The ClientID predicate already handles this correctly;
the test confirms the invariant holds.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>

* hardening(workload): address PR #983 review findings

- Add TestGenerateWorkload_ZeroSessionClosedLoopClient_EmitsWarning (BC-1):
  directly verifies the warning fires when a closed-loop client produces no
  session blueprints. Uses a lifecycle window beyond the horizon to deterministically
  trigger the zero-session path; captures logrus output to assert warning text.

- Expand R1 justification comment in generator.go: explains why warn-only is
  correct (unreachable via public API, blueprint loop is safe no-op on empty map).

- Fix test guard message: "spec changed?" → accurate description of the invariant
  that ExpandInferencePerfSpec must produce one client per (prompt, user, stage).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>

---------

Co-authored-by: Claude Sonnet 4.6 (1M context) <[email protected]>

Apr 8, 2026
ae62d94
zip
tar.gz
Notes

v0.7.2

fix(workload): change inference_perf SLOClass from "batch" to "standa…

…rd" (#965) (#968)

* fix(workload): change inference_perf SLOClass from \"batch\" to \"standard\" (#965)

Commit 8bc7a48 introduced a deferred queue that parks any request with
SLOClass \"batch\" or \"background\" when the cluster is busy. inference_perf.go
had been generating all clients with SLOClass \"batch\" since before the
deferred queue existed, when the label had no scheduling effect.

After 8bc7a48, every inference_perf request after the first was deferred
until the cluster went fully idle, serializing all requests one-by-one
instead of batching them. This inflated TTFT by 6-100x depending on load,
breaking all training experiments that use inference_perf workloads (issue #965).

Fix: change the three SLOClass assignments in ExpandInferencePerfSpec from
\"batch\" to \"standard\". inference_perf models production inference benchmarking
traffic, which should not yield to real-time traffic under the deferred queue
semantics. \"standard\" is the correct tier.

Add TestInferencePerfClients_SLOClass_IsStandard to regression-guard this.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>

* test(cluster): assert standard SLO bypasses deferred queue; batch SLO is serialized (#965)

Add two regression-guard tests to cluster_deferred_test.go:

TestDeferredQueue_StandardSLONotSerialized (BC-2):
  10 standard-class requests arriving every 10µs must produce mean TTFT
  < 15ms. Without the inference_perf fix, this would be ~100ms (full
  serialization). The 15ms bound gives 2.4x margin over the expected
  non-serialized TTFT of ~6.2ms.

TestDeferredQueue_BatchSLOIsSerializedAboveBound (BC-3):
  Guard-validity companion: 10 batch-class requests in the same setup
  must produce mean TTFT >= 15ms, confirming the bound discriminates
  between serialized and non-serialized execution.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>

* docs(plans): add implementation plan for inference_perf SLOClass fix (#965)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>

* feat(golden): trained-physics iter29 golden dataset with behavioral tests (#965)

Three changes in one commit:

1. Remove model_configs/ from .gitignore and add negation for
   model_configs/*/config.json so architecture configs are tracked.
   These files are small JSON specs (no model weights) and are needed
   for roofline/trained-physics tests to run without network access.

2. Add testdata/trained_physics_iter29.json — golden values for all 15
   training experiments under the trained-physics backend with iter29
   coefficients (alpha/beta from sequential golden section search, loss
   34.5675%). Generated from the patched binary after the SLOClass fix.

3. Add TestTrainedPhysics_GoldenDataset in sim/cluster/ — runs all 15
   experiments via ClusterSimulator with the exact iter29 configuration
   (GenerateWorkload + SessionManager for multi-turn chat, same KV offload
   parameters as the training runner) and asserts:
   - Invariant: request conservation, zero dropped, causality (TTFT < E2E)
   - Golden: TTFT/E2E/ITL metrics byte-for-byte identical (relTol=1e-9)

If the trained-physics backend needs behavioral changes in the future, rename
it first (e.g. "trained-physics-v2"). Do not update golden values in place —
that would silently accept regressions.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>

* fix(gitignore): add negation for model_configs/*/config.json

The *.json catch-all at line 44 was preventing model config files from
being staged. Add an explicit negation so architecture specs under
model_configs/ are tracked alongside the golden dataset negations.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>

* fix(golden): address PR review issues — parallelism, nil guard, INV-1, docs

C1: Add testing.Short() gate + t.Parallel() on all 15 sub-tests.
    Wall-clock time drops from ~68s to ~23s (bounded by slowest experiment).

I1: Guard against nil ws.InferencePerf after yaml.Unmarshal — malformed
    golden JSON now fails with a clear diagnostic instead of opaque panic.

I2: Expand TestInferencePerfClients_SLOClass_IsStandard to table-driven with
    3 cases covering all code paths: single-stage/no-multiturn (line 183),
    single-stage/multiturn (line 131), and multi-stage (line 237).

I3: Complete INV-1 conservation check — add StillQueued, StillRunning, and
    RejectedRequests assertions alongside existing DroppedUnservable and
    TimedOutRequests checks. Catches request leaks independently of golden values.

I4: Fix sortedValues doc comment — clarify that key-sort satisfies R2
    (deterministic map iteration) while value-sort serves percentile computation.

I5: Update CLAUDE.md Recent Changes with SLOClass fix, model_configs tracking,
    and trained-physics golden dataset.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>

* docs: fix stale bound description and asymmetric margin comment

PR review (#968) flagged two documentation issues:

1. cluster_deferred_test.go:322: "~2.4× margin each side" was factually wrong
   — the two margins are different. Corrected to "~2.4× above non-serialized
   (6.2ms) and ~6.7× below serialized (100ms)".

2. docs/plans: Section C still referenced the old 5ms bound from the initial
   draft (before alpha/beta coefficients were found to be swapped). Updated
   to match the corrected 15ms bound implemented in Section F, with a
   deviation log note explaining why it changed.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>

* fix(test): remove duplicate docstring header in TestInferencePerfClients_SLOClass_IsStandard

The cat-append operation left the original 3-line comment block when the test
was expanded to table-driven form. Lines 1664-1666 were a verbatim duplicate of
lines 1667-1668. Remove the stale first occurrence.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>

---------

Co-authored-by: Claude Sonnet 4.6 (1M context) <[email protected]>

Apr 7, 2026
50bc797
zip
tar.gz
Notes

v0.7.1

feat(latency): add trained-physics model (#950)

* feat(latency): add evolved model with architecture-aware MoE overhead (BC-2,BC-5,BC-6,BC-7,BC-8,BC-9,BC-11)

- Copy evolved_model.go from training branch
- Implements LatencyModel interface with roofline basis functions
- Architecture-aware β₈ scaling: applies to interleaved MoE, skips uniform MoE
- StepTime, QueueingTime, OutputTokenProcessingTime, PostDecodeFixedOverhead
- Add InterleaveMoELayerStep and DenseIntermediateDim fields to ModelConfig
  (required by evolved model for Scout-style interleaved MoE/dense architectures)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* fix(latency): address code quality issues in evolved model

- Add weightBPP > 0 validation (R3 compliance)
- Change Alpha/Beta to unexported alpha/beta fields (R8 compliance)
- Fix hasInterleavedMoE to require NumLocalExperts > 1 (semantic consistency)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* test(latency): add evolved model coefficient validation tests

- Copy evolved_model_test.go from training branch
- β₁₀ batching inefficiency tests (quadratic scaling, batch size effects)
- β₃' KV sequence length tests (linear scaling with layers)
- β₁₀ physics analysis validating μs-scale coefficient ranges

Co-Authored-By: Claude <[email protected]>

* feat(latency): register evolved backend in factory (BC-10)

- Add case "evolved" to NewLatencyModel factory
- Dispatches to NewEvolvedModel with validation

Co-Authored-By: Claude <[email protected]>

* feat(sim): add evolved to valid latency backends (BC-10)

- Add "evolved" to validLatencyBackends map
- Enables CLI flag validation for --latency-model evolved

Co-Authored-By: Claude <[email protected]>

* feat(config): add evolved model trained coefficients (BC-10)

- Add evolved_coefficients section to defaults.yaml with iter25 coefficients
- Alpha: [15561.96, 776.24, 45.91] (API/framework overheads in µs)
- Beta: 10 coefficients including architecture-aware β₈ (427.3 µs/MoE-layer)
- Add EvolvedDefaults struct to Config for R10 strict YAML parsing
- Full precision preserved from training/iterations/iter25

Co-Authored-By: Claude <[email protected]>

* docs: add evolved latency model to CLAUDE.md

- Update latency estimation section with evolved backend
- Document architecture-aware MoE overhead scaling
- Document 10-beta mode (prefill compute-only, decode memory-only)

Co-Authored-By: Claude <[email protected]>

* docs: add evolved latency backend to README

- Add evolved to list of available latency backends
- Add evolved_model.go to file tree listing
- Users can now discover --latency-model evolved option

Co-Authored-By: Claude <[email protected]>

* fix(cli): add evolved backend CLI wiring and test updates

- Add evolved coefficient loading from defaults.yaml
- Add evolved to analytical backends processing blocks
- Update flag help string to include evolved
- Add evolved assertions to bundle_test.go

Co-Authored-By: Claude <[email protected]>

* fix(roofline): use batch size for MoE nEff calculation in weight bandwidth

Cherry-picked from PR #878 (commit fcad468).

Fixes MoE weight bandwidth calculation by passing totalNewTokens to
calculateMemoryAccessBytes instead of 0. This ensures nEff reflects
the real batch size and MoE expert weights are correctly accounted for.

Without this fix, nEff=0 causes massive underestimation of weight
bandwidth (~7 GB instead of ~39 GB for Scout FP8).

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* fix(roofline): complete Scout MoE interleaved architecture fixes (issue #877)

Addresses all three bugs from issue #877 in roofline.go and config.go:

**Bug 1 - Interleaved MoE Architecture Ignored:**
- Add parsing of `interleave_moe_layer_step` field in config.go
- Split FLOPs calculation into MoE vs dense layers in calculateTransformerFlops
- Scout (48 layers, step=1): 24 MoE + 24 dense correctly calculated

**Bug 2 - DenseIntermediateDim Field Not Parsed:**
- Add parsing of `intermediate_size_mlp` field in config.go
- Use DenseIntermediateDim for dense layer FFN dimensions
- Scout dense layers now use 16384 FFN (not 8192 MoE expert FFN)

**Bug 3 - nEff Applied to All Layers:**
- Split weight bandwidth into MoE (with nEff) vs dense (without nEff)
- nEff expert loading now only applies to MoE layers
- Dense layers contribute full weight bandwidth regardless of batch size

**Test Coverage:**
- Add comprehensive TestRooflineStepTime_Scout_InterleavedMoE
- Validates Scout produces 39.26 GB weight bandwidth (not 7.05 GB)
- Confirms nEff=0 bug fixed: newTokens=0 now produces 7.05 GB (dense only)
- Verifies FLOPs split and both architectures produce positive latencies

Impact: Scout TTFT predictions improve from 24× underestimate to <5× error.
Mixtral and other uniform MoE models unaffected (backward compatible).

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* feat(latency): add trained-physics model with iter29 coefficients (issue #939)

Replaces evolved backend with trained-physics latency model using
coefficients from iteration 29 (sequential golden section search, loss: 34.57%).

**Changes:**
1. **Add trained_physics_model.go** - Physics-informed roofline with learned corrections
   - Copied from training branch iter29 (commit e0e03b3)
   - Renamed EvolvedModel → TrainedPhysicsModel
   - Updated iteration reference: iter15 → iter29
   - Backend name: "trained-physics" (hyphenated, like "trained-roofline")

2. **Add trained_physics_model_test.go** - Behavioral tests from iter29
   - Tests for β₁₀ batching inefficiency, β₃' KV sequence length, physics analysis

3. **Add trained_physics_coefficients to defaults.yaml**
   - Alpha (µs): α₁=15563.199579, α₂=777.3455, α₃=45.907545
   - Beta: β₁=0.152128, β₂=0.0, β₃=1.36252915, β₄=0.752037,
           β₅=32.09546717, β₆=4.41684444, β₇=126.024825,
           β₈=481.8613888, β₉=0.0, β₁₀=1.94710771
   - Replaced evolved_coefficients section

4. **Update cmd/default_config.go**
   - Renamed EvolvedDefaults → TrainedPhysicsDefaults
   - Updated yaml tag: evolved_coefficients → trained_physics_coefficients
   - Updated docstring: iter25 → iter29

5. **Update cmd/root.go**
   - Backend resolution: "evolved" → "trained-physics"
   - Config loading: cfg.EvolvedDefaults → cfg.TrainedPhysicsDefaults
   - Error messages updated

6. **Update sim/bundle.go**
   - validLatencyBackends: "evolved" → "trained-physics"

7. **Update sim/latency/latency.go**
   - Factory case: "evolved" → "trained-physics"
   - Constructor call: NewEvolvedModel → NewTrainedPhysicsModel

8. **Update sim/bundle_test.go**
   - Test assertions: "evolved" → "trained-physics"

9. **Remove evolved files**
   - Deleted evolved_model.go and evolved_model_test.go

**Architecture-Aware MoE Overhead:**
β₈ applies conditionally:
- Interleaved MoE (InterleaveMoELayerStep > 0): β₈ × nMoELayers overhead
- Uniform MoE (InterleaveMoELayerStep = 0): β₈ skipped (moeScaling=0.0)
- Dense models (nMoELayers = 0): β₈ term naturally zero

**Testing:**
- All tests pass: go test ./...
- Backend validation updated
- Builds successfully: go build ./...

Fixes #939

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* docs(latency): improve trained-physics model documentation

Replace iteration-specific training history with architectural documentation
that explains the model design, coefficient meanings, and physical justifications.

**Changes to trained_physics_model.go:**
- Removed training iteration references (iter15, iter29)
- Added comprehensive model architecture overview
- Documented step-time formula with all 3 coefficient modes (8, 9, 10 betas)
- Explained each beta coefficient with:
  - Physical meaning (what it corrects)
  - Units and typical magnitude
  - Why it exists (kernel efficiency, bandwidth contention, etc.)
- Explained alpha coefficients (API/framework overheads)
- Documented architecture-aware features (interleaved MoE, quantization, TP)

**Changes to trained_physics_model_test.go:**
- Removed training iteration references (iter10, iter11)
- Updated test docstrings to explain coefficient behavior
- TestBeta10BatchingInefficiency: explain β₁₀ as decode memory correction
- TestBeta3PrimeKVSeqLen: clarify this tests analytical basis function
- TestBeta10PhysicsAnalysis: replaced iteration comparison with dimensional analysis

**Benefits:**
- Documentation is now timeless (won't become stale with new training runs)
- Explains "why" each coefficient exists (physical justification)
- More useful for users trying to understand model behavior
- Clearer for future maintainers

All tests pass unchanged.

* feat(latency): add trained-physics model (#939)

Add trained-physics latency model backend from training branch iter29.
This trained model applies learned correction factors to roofline basis
functions, generalizing across model architectures, workloads, and TP
configurations without per-model calibration.

Changes:
- Add trained_physics_model.go (10-beta mode with architecture-aware terms)
- Add trained_physics_model_test.go (β₁₀ unit tests)
- Add trained_physics_coefficients to defaults.yaml
- Update latency model factory registration
- Update all documentation (positioned as recommended default)
- Cherry-pick Scout MoE nEff fix from PR #878

Architecture-aware features:
- β₈ per-MoE-layer overhead (applies only to interleaved architectures)
- Quantization-aware weight bandwidth (FP8, W4A16 detection)
- Optional 10-beta mode with prefill/decode split

Pre-trained coefficients: 3 alpha (API overhead) + 10 beta (roofline corrections)
Generalizes across dense, uniform MoE, and interleaved MoE architectures.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* fix(latency): address PR review concerns for trained-physics model

Fixes critical and important issues from PR #950 review:

**Critical fixes (C1-C3):**
- C1: Complete test rewrite — replaced disconnected math tests with proper behavioral tests (BC-1 through BC-7 from issue #939)
  - BC-1: Empty batch returns >= 1
  - BC-2: Positive step time for all valid inputs (prefill, decode, mixed, large batches)
  - BC-3: Monotonicity tests (prefill tokens, decode batch size, sequence length)
  - BC-4: Architecture-aware β₈ scaling (interleaved vs uniform vs dense MoE)
  - BC-5: Overhead methods (QueueingTime, OutputTokenProcessingTime, PostDecodeFixedOverhead)
  - BC-6: Factory construction validation (TP, layers, NaN, negative coefficients)
  - BC-7: Config validation (coefficient length errors)
- C2: Fixed isMoE threshold inconsistency (NumLocalExperts > 1, not > 0) — single-expert models now correctly classified as dense
- C3: Added missing MFU values (MfuPrefill=0.55, MfuDecode=0.30) to Scout test to fix division by zero

**Important fixes (I1):**
- I1: Removed dead `dFF` field from TrainedPhysicsModel struct (written but never read)

**Key insight from BC-4:**
The implementation correctly uses two mechanisms:
1. `numMoELayers` — counts MoE layers for FLOPs/bandwidth calculations
2. `hasInterleavedMoE` — gates β₈ overhead via moeScaling (1.0 if interleaved, 0.0 otherwise)

For uniform MoE (Mixtral-style), numMoELayers = NumLayers (all layers do MoE work for FLOPs), but hasInterleavedMoE = false so β₈ overhead doesn't apply. Uniform MoE is more expensive than interleaved because it does more MoE work (all 48 layers vs 24 MoE + 24 dense), even without β₈ overhead.

All tests now pass. ✓

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* fix(latency): address remaining PR review concerns (I2, I4, I5, I3 note)

Fixes important issues from PR #950 review:

**I2: YAML key naming consistency**
- Changed TrainedPhysicsDefaults struct tags from `yaml:"alpha"` / `yaml:"beta"` to `yaml:"alpha_coeffs"` / `yaml:"beta_coeffs"` to match TrainedRooflineDefaults and CrossModelDefaults
- Updated defaults.yaml to use `alpha_coeffs:` and `beta_coeffs:` keys
- Reduces friction for manual defaults.yaml editing

**I4: MoE consistency validation**
- Added validation in NewTrainedPhysicsModel: if NumLocalExperts > 1 then NumExpertsPerTok must be > 0
- Mirrors ValidateRooflineConfig check (config.go:382)
- Prevents silent misconfiguration where kEff falls through to 1, giving wrong FLOPs
- Added test case: invalid_moe_missing_experts_per_tok

**I5: Documentation correction**
- Updated latency-models.md comparison table: Roofline MoE support changed from "No (dense only)" to "Yes (per-expert FLOPs + effective expert count)"
- Reflects PR #877 fix that added interleaved MoE support to roofline

**I3 note: Accuracy metrics**
- Updated table to note trained-physics accuracy has been "separately tested"
- Full test-split MAPE metrics are available in training artifacts

All tests pass. Verified defaults.yaml loads correctly with new YAML keys.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* docs(latency): remove 'separately tested' phrase from accuracy description

Remove the 'separately tested' phrase from trained-physics accuracy
description in latency-models.md comparison table. Keep it concise
like the other columns.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* fix(latency): address final PR review feedback for trained-physics model

Fixes all outstanding items from PR #950 review comments:

**Critical fixes:**
- README.md: Add required --hardware and --tp flags to trained-physics example
- sim/latency/trained_physics_model.go:136: Fix isMoE field comment (> 1, not > 0)
- sim/latency/trained_physics_model.go:466: Add NumLocalExperts > 1 guard to hasInterleavedMoE

**Documentation completeness:**
- sim/latency/latency.go: Add TrainedPhysicsModel to package doc comment
- sim/latency_model.go: Update "Four" → "Five" implementations
- docs/concepts/glossary.md: Update "Four" → "Five" modes, add Trained-Physics

**User-facing documentation:**
- docs/guide/latency-models.md: Add comprehensive "Generalization Scope" section
  - Supported hardware (H100/A100 with specs)
  - Model architectures (dense/uniform MoE/interleaved MoE)
  - Workload types (prefill-heavy/decode-heavy/mixed)
  - Rationale for "recommended" status vs trained-roofline

Addresses review feedback:
- #950 (comment)
- #950 (comment)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* fix(latency): correct hardware flag from H100-SXM to H100

The hardware_config.json key is 'H100', not 'H100-SXM'.
Valid keys: H100, A100-SXM, A100-80, L40S

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* docs(latency): add L40S to supported hardware, clarify H100 training

- Add L40S (48 GB GDDR6, 362 TFLOPS BF16 / 1466 TFLOPS FP8) to hardware list
- Clarify that coefficients were trained on H100 traces but generalize
  via roofline basis functions to other GPUs without per-GPU calibration

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* fix(latency): correct coefficient count (13 vs 10, not 14 vs 11)

Trained-physics: 10 beta + 3 alpha = 13 coefficients
Trained-roofline: 7 beta + 3 alpha = 10 coefficients

The prefill/decode split (β₁ₐ/β₁ᵦ, β₂ₐ/β₂ᵦ) adds 2 extra beta terms
vs trained-roofline's max() approach, plus β₈ for MoE overhead.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* docs(readme): move hardware/TP defaults note to Quick Start section

Moved the note about --hardware/--tp defaulting to H100/TP=1 from the
trained-physics section up to the Quick Start section where it applies
generally to all latency models. Removed redundant mention from the
trained-physics example.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* docs(readme): soften claim from 'all' to 'most' model architectures

Changed trained-physics accuracy claim from 'across all model architectures'
to 'across most model architectures' for more conservative language.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

---------

Co-authored-by: Claude <[email protected]>

Apr 7, 2026
638a7d3
zip
tar.gz
Notes

PreviousNext

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.7.10

v0.7.9

v0.7.8

v0.7.7

v0.7.6

v0.7.5

v0.7.4

v0.7.3

v0.7.2

v0.7.1

Tags: inference-sim/inference-sim