Tags: inference-sim/inference-sim
Tags
fix(workload): per-phase rate normalization for phased workloads (#1146) * fix(workload): per-phase rate normalization for phased workloads (#1144) normalizeRateFractions was summing all clients' rate_fractions globally, halving rates for workloads with non-overlapping lifecycle windows. Two-part fix: 1. Per-phase normalization: each client's fraction is divided by the sum of co-active clients (overlapping lifecycle windows), not the global sum. Clients without lifecycle windows are always-on. 2. inference-perf multi-stage: uses CustomSamplerFactory with Poisson at exact per-client rate, bypassing fraction normalization entirely. Golden datasets regenerated due to RNG sequence shift from CustomSamplerFactory sub-RNG allocation. Fixes #1144 Co-Authored-By: Claude Opus 4.6 <[email protected]> * address review: document always-on limitation + add test case - I-1: Add test for always-on + two non-overlapping phases documenting that per-phase totals are < aggregate_rate (known limitation of per-client normalization). Add docstring and reference doc note. - M-1: Add comment in generator.go clarifying clientRate is only used for the skip guard when CustomSamplerFactory is set. Co-Authored-By: Claude Opus 4.6 <[email protected]> --------- Co-authored-by: Mert Toslali <[email protected]> Co-authored-by: Claude Opus 4.6 <[email protected]>
fix(observe): set status=timeout on HTTP timeout, add --timeout flag (#… …1119) * fix(observe): set status=timeout on HTTP timeout, add --timeout flag (#1118) - Fix silent data corruption: streaming, non-streaming, and HTTP-level timeouts now set record.Status="timeout" instead of silent "ok" - Add isTimeoutError() helper checking os.IsTimeout + context.DeadlineExceeded - Add --timeout CLI flag (seconds, default 300) for configurable HTTP timeout - Add WithHTTPTimeout RealClientOption following existing functional-options pattern - Extract defaultHTTPTimeoutSeconds const shared between constructor and CLI flag Fixes #1118 Co-Authored-By: Claude Opus 4.6 <[email protected]> * review(observe): add isTimeoutError unit test, fix plan-code divergence - Add TestIsTimeoutError: table-driven test covering both detection branches (os.IsTimeout, context.DeadlineExceeded) plus edge cases (nil, generic error, context.Canceled, wrapped deadline). - Update plan BC-5 and Task 3 to reflect the 86400 upper bound validation that was added during code review. Co-Authored-By: Claude Opus 4.6 <[email protected]> * review(observe): add io.EOF test case, clarify timeout unit in error message - Add io.EOF case to TestIsTimeoutError (comment promised it, test was missing it — found by comment-analyzer round 2) - Add "seconds" to --timeout validation error message for clarity Co-Authored-By: Claude Opus 4.6 <[email protected]> --------- Co-authored-by: Mert Toslali <[email protected]> Co-authored-by: Claude Opus 4.6 <[email protected]>
feat(latency): deprecate trained-roofline, crossmodel, and blackbox b… …ackends (#1107) * feat(latency): add deprecation warning for blackbox backend (BC-3) - Emit logrus.Warn when blackbox backend is selected - Warning directs users to trained-physics as replacement - Backend remains fully functional (BC-4, BC-8) Co-Authored-By: Claude <[email protected]> * feat(latency): add deprecation warning for crossmodel backend (BC-2) - Emit logrus.Warn when crossmodel backend is selected - Warning directs users to trained-physics as replacement - Backend remains fully functional (BC-4, BC-8) Co-Authored-By: Claude <[email protected]> * feat(latency): add deprecation warning for trained-roofline backend (BC-1) - Emit logrus.Warn when trained-roofline backend is selected - Warning directs users to trained-physics as replacement - Backend remains fully functional (BC-4, BC-8) Co-Authored-By: Claude <[email protected]> * test(latency): verify non-deprecated backends emit no warnings (BC-5) - Add test for roofline (no warning) - Add test for trained-physics (no warning) - Ensures deprecation warnings are backend-specific Co-Authored-By: Claude <[email protected]> * docs(claude): mark trained-roofline, crossmodel, blackbox as deprecated (BC-6) - Update latency estimation section with deprecation notice - Recommend trained-physics as replacement - Reference docs/guide/latency-models.md for migration details Co-Authored-By: Claude <[email protected]> * docs(readme): mark trained-roofline, crossmodel, blackbox as deprecated (BC-6) - Strike through deprecated backends in features bullet - Add inline deprecation notice to file tree - Direct users to trained-physics Co-Authored-By: Claude <[email protected]> * docs(guide): add deprecation notices to latency model guide (BC-6) - Mark blackbox, crossmodel, trained-roofline sections as deprecated - Add admonition blocks with migration guidance - Update opening paragraph with recommended backend Co-Authored-By: Claude <[email protected]> * fix(test): update pre-existing roofline test with required config fields - Add BytesPerParam, MfuPrefill, MfuDecode to roofline test - Fixes test that was broken by stricter validation - Pre-existing test updated for consistency Co-Authored-By: Claude <[email protected]> * fix(sim/latency): address PR #1107 review feedback Fixes all critical and recommended issues from review: **Critical Issues (Required):** 1. **Per-instance warning spam**: Added sync.Once to emit deprecation warnings at most once per process (instead of once per instance in multi-instance clusters). 2. **Restored monotonicity tests**: Re-added 4 deleted tests that verify system invariants (monotonicity laws + edge case validation): - TestBlackboxLatencyModel_StepTime_Monotonic - TestRooflineLatencyModel_StepTime_PositiveAndMonotonic - TestNewLatencyModel_UnknownBackend_ReturnsError - TestNewLatencyModel_NegativeCoefficients_ReturnsError 3. **Removed structural assertions**: Replaced type checks with behavioral assertions per BDD principles (refactor-survival test). Models are now verified by calling StepTime() and checking output validity, not by inspecting concrete types. **Important Issues (Recommended):** 4. **Naming consistency**: Changed "cross-model" (hyphenated) to "crossmodel" in docs/guide/latency-models.md to match CLI flag. Added explicit flag values in deprecation warning box for clarity. 5. **Missing deprecation note**: Added deprecation marker to trained_roofline.go in README.md file tree. Note: Removed three deprecation warning emission tests. sync.Once makes these tests execution-order-dependent and fragile. Warnings are verified via manual testing and visible in other test output. Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * docs: address minor review feedback Fixes two minor documentation issues from re-review: 1. **README latency.go comment**: Added DEPRECATED markers to BlackboxLatencyModel and CrossModelLatencyModel in the file tree comment (line 262) for consistency with crossmodel.go and trained_roofline.go entries. 2. **Test comment clarity**: Updated latency_test.go:521 comment to explicitly state that BC-1/2/3 (deprecated backends emit warnings) are not automatable due to sync.Once, while BC-5 (non-deprecated backends emit no warnings) IS tested. Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * fix(test): add automated BC-1/2/3 tests and restore table-driven negative coeff test Addresses two critical issues from third review: 1. **Automated deprecation warning tests (BC-1/2/3)**: Added resetDeprecationWarningsForTest() function to reset sync.Once vars, enabling reliable test isolation. Restored three positive deprecation tests that now call reset at test start, making the primary behavioral contracts of this PR automatable in CI. 2. **Table-driven negative coefficients test**: Restored original table-driven test covering all four negative coefficient positions (alpha[0], alpha[2], beta[0], beta[1]) instead of single case. This catches bugs that only check specific indices in validateCoeffs. 3. **Documentation fix**: Updated "Choosing the right mode" tip box to stop actively recommending deprecated backends. Now clearly marks blackbox/trained-roofline/crossmodel as deprecated and directs users to trained-physics. Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * revert: undo unnecessary changes to TestBlackboxLatencyModel_StepTime_EmptyBatch Reverted unrelated changes to empty batch test: - Restored beta0 from 500 back to 1000 (original value) - Removed unnecessary assertion checking exact beta0 value - Restored original comment about interface contract The original test was correct - it tested the interface contract (StepTime >= 1) not implementation details (exact beta0 value). Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * fix(test): add only required test changes - no unnecessary modifications Cleanly added ONLY the requested test changes: **Added:** 1. Two imports: bytes, logrus (required for deprecation tests) 2. TestNewLatencyModel_Blackbox_EmitsDeprecationWarning (BC-1) 3. TestNewLatencyModel_Crossmodel_EmitsDeprecationWarning (BC-2) 4. TestNewLatencyModel_TrainedRoofline_EmitsDeprecationWarning (BC-3) 5. TestNewLatencyModel_Roofline_NoDeprecationWarning (BC-5) 6. TestNewLatencyModel_TrainedPhysics_NoDeprecationWarning (BC-5) All three positive tests call resetDeprecationWarningsForTest() for isolation. **NOT changed:** - No test renames - No coefficient value changes - No test logic modifications - No unnecessary "improvements" The diff is minimal and focused on exactly what was requested by reviews. Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * docs: clarify roofline is default, trained-physics is recommended Fixed incorrect conflation of "default" and "recommended": - Roofline is the DEFAULT (what you get without --latency-model flag) - Trained-physics is RECOMMENDED (best accuracy for new work) Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * docs: fix factory docstring and remove deprecated blackbox recommendation Addresses two Important issues from fourth review: 1. **Factory docstring (latency.go:149-152)**: Added trained-physics to dispatch list and noted that deprecated backends emit logrus.Warn side-effect once per process. IDE hover and go doc now show complete information. 2. **Cross-model section (latency-models.md:162)**: Removed recommendation of deprecated blackbox mode for dense prefill workloads. Now recommends trained-physics instead, which provides learned corrections without the deprecation path. Co-Authored-By: Claude Sonnet 4.5 <[email protected]> --------- Co-authored-by: Claude <[email protected]>
fix(kv): check-then-act allocation replacing rollback (vLLM parity) (#… …1061) (#1068) * refactor(kv): replace UsedBlockCnt with direct FreeBlockCnt counter Mirrors vLLM's FreeKVCacheBlockQueue.num_free_blocks pattern. Counter is maintained by appendToFreeList/removeFromFreeList/prependToFreeList, eliminating arithmetic derivation that can drift under partial mutation bugs (#1061). UsedBlocks() accessor is now derived as TotalBlocks - FreeBlockCnt (read-only for callers, not source of truth for allocation). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix(kv): add decode pre-check to prevent #1061 block leak Mirrors vLLM's universal check-then-act gate (kv_cache_manager.py:334, single_type_kv_cache_manager.py:95-101). Returns false before any state mutation when the last block is full and no free blocks exist. Preserves RequestMap for continuing requests, preventing the orphaned-block deadlock. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * refactor(kv): remove rollback machinery, convert to check-then-act Removes rollbackAllocation, cachedBlockMutation, newBlockMutation, and prependToFreeList (~60 lines). Pre-check now accounts for cached-block free-list consumption (mirrors vLLM's num_evictable_blocks). Post-pre-check popFreeBlock nil is a panic (INV-4 invariant violation, structurally unreachable in single-threaded DES). Decode pre-check extended to handle preempted requests with no RequestMap entry (ProgressIndex past input but blocks released). Fixes #1061 — the block leak was caused by rollbackAllocation deleting RequestMap for continuing requests. With rollback removed, the bug class is eliminated entirely. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * feat(kv): add verifyBlockConservation debug assertion Walks free list and block InUse flags independently to verify INV-4. Also detects FreeBlockCnt drift. Unexported method on *KVCacheState (not on KVStore interface per R13). Intended for step-boundary assertions in debug mode. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * docs(sim): fix processCompletions comment to reflect check-then-act The comment claimed AllocateKVBlocks only modifies RequestMap on success. This was false under rollback (#1061 root cause) but is now provably true under check-then-act. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * test(kv): add preemption retry regression test for #1061 Reproduces leak path 1: decode failure → eviction → retry. Verifies RequestMap is preserved through the retry cycle under check-then-act (would have caught #1061 rollback bug). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * chore(kv): remove unused tiered verifyBlockConservation Linter flagged unexported method with no callers. Can be re-added when a debug-mode caller is wired up. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * test(kv): add high-stress regression tests for #1061/#1057/#963 Four integration-level KV cache tests that exercise the deadlock scenarios fixed by the check-then-act refactor: - Sustained KV pressure with preemption cycles (#963) - Decode failure/retry preserves blocks through many cycles (#1061) - Cached block budget exhaustion under pressure (#1057) - Block conservation through complete allocation lifecycle All tests verify INV-4 (block conservation) and INV-1 (request accounting) as behavioral contracts. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * docs: update extension-recipes.md for check-then-act pattern Replaces stale rollback references (rollbackAllocation, cachedBlockMutation, newBlockMutation, UsedBlockCnt) with check-then-act documentation. Describes pre-check gate, cachedFromFreeList budget, and tiered cache interaction. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * docs: add design doc and implementation plan for check-then-act KV Design doc: behavioral equivalence proofs with vLLM source citations. Plan: 7-task TDD implementation with convergence review results. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix: address PR review findings (tautological test, stale docs) - assertBlockConservation now calls verifyBlockConservation() for independent free-list walk (was tautological: free = total - used then checking used + free != total) - Fix stale "UsedBlockCnt" in error message → "UsedBlocks()" - Update INV-4 verification description: check-then-act, not rollback - Update R5: check-then-act as preferred strategy over rollback - Update R5 checklist item Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix: address follow-up review suggestions - Rewrite tiered_test.go:874-926 comments from before/after rollback framing to present-tense invariant language (rollbackAllocation no longer exists, so references to it were undefined terms) - Add tests for decode "no existing blocks" path (cache.go:247-252): both the failure case (0 free) and success case (free available) - Simplify assertFullConservation to avoid double-calling verifyBlockConservation (now delegates to assertBlockConservation) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> --------- Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
fix(observe): replace uniform hello prompts with diverse vocabulary w… …ords (#1039) * fix(observe): replace uniform "hello" prompts with diverse vocabulary words blis observe generated identical "hello hello hello..." prompts for all requests when no prefix group was configured, causing artificial KV cache hits on vLLM servers with enable_prefix_caching=True. This invalidated sim2real comparisons by making observed latencies artificially low. - Map each request's random token IDs to prefixVocabulary words via modular indexing (tokensToPrompt), ensuring different requests produce different prompts with no shared artificial prefix - Apply same fix to suffix portion of prefix-group requests - Always calibrate tokensPerWord ratio (PR #834) and scale word count so the server tokenizes prompts to the intended token length - Pass tokensPerWord through runObserveOrchestrator to requestToPending Fixes #1037 Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(observe): address PR review feedback - Fix BC comment reference: line 730 now references BC-3/BC-6 (not BC-5) - Add test for unknown prefix group fallback path (PrefixGroup set but absent from prefixes map falls back to tokensToPrompt) - Use diverse token values in suffix word count test instead of all-zeros Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(observe): defensive guards and edge-case tests per review - Guard negative token IDs in tokensToPrompt: ((idx%vocabLen)+vocabLen)%vocabLen - Guard non-positive tokensPerWord divisor in requestToPending (R3/R11) - Add upper-bound guard on suffixStart for symmetry with lower-bound - Add TestTokensToPrompt_NegativeTokenIDs - Add TestRequestToPending_WordCountClampedToOne Co-Authored-By: Claude Opus 4.6 <[email protected]> --------- Co-authored-by: Mert Toslali <[email protected]> Co-authored-by: Claude Opus 4.6 <[email protected]>
fix(routing): align PPC scorer zero-cache normalization to llm-d (#1028) * fix(routing): align PPC scorer zero-cache normalization to llm-d (1.0) Remove the special-case branch that returned 0.5 when all instances had zero cached prefix blocks. The all-equal path now unconditionally returns 1.0, matching llm-d's indexedScoresToNormalizedScoredPods behavior. - Implement BC-1: all-zero cache returns 1.0 (llm-d parity) - Verify BC-2: all-equal nonzero still returns 1.0 (no regression) - Verify BC-3: divergent caches use min-max normalization (no regression) Fixes #1007 Co-Authored-By: Claude <[email protected]> * docs(routing): document PPC all-equal normalization in scorer tables Address review feedback from #1028: - Add "all-equal (including all-zero) → 1.0 (llm-d parity)" to precise-prefix-cache rows in architecture.md and routing.md, matching the style used by active-requests scorer - Remove stale BC-2 reference from test comment (was referencing the original scorer plan's numbering, not this PR's contracts) Co-Authored-By: Claude <[email protected]> --------- Co-authored-by: Mert Toslali <[email protected]> Co-authored-by: Claude <[email protected]> Co-authored-by: Srinivasan Parthasarathy <[email protected]>
docs(routing): add active-requests, running-requests, load-aware to a… …ll scorer docs (#973) Update 5 documentation files that enumerate scorers to include the three new scorers added in #966. Adds table rows, updates lists, and documents signal freshness characteristics for each new scorer. - docs/guide/routing.md: Available Scorers table (6 → 9 rows) - docs/concepts/architecture.md: Built-in Scorers table (6 → 9 rows) - docs/reference/configuration.md: --routing-scorers available list - docs/concepts/glossary.md: Scorer entry with load-aware sub-range note - docs/contributing/standards/invariants.md: INV-7 signal freshness Fixes #972 Co-authored-by: Mert Toslali <[email protected]> Co-authored-by: Claude Opus 4.6 <[email protected]> Co-authored-by: Srinivasan Parthasarathy <[email protected]>
hardening(workload): warn on zero-session closed-loop client; test si… …ngle-stage multi-user (#983) * hardening(workload): warn on zero-session closed-loop client; test single-stage multi-user (#976, #979) Closes #976: add logrus.Warnf in GenerateWorkload after the session-matching loop when a closed-loop client produces zero SessionBlueprints. This fires only if req.ClientID is unset on round-0 requests (e.g. a future code path that bypasses GenerateReasoningRequests). With the current ClientID predicate from #975 this should never fire, but the warning makes the failure mode immediately observable if it ever does. Closes #979 (T1-1): add regression test TestGenerateWorkload_SingleStageMultiUserMultiTurn_OneSessionPerClient for the single-stage analog of #974. Single-stage workloads with NumUsersPerSystemPrompt > 1 share TenantID = prefixGroup across all users in the same prompt group — the same conflation trigger as the multi-stage case fixed in #975. The ClientID predicate already handles this correctly; the test confirms the invariant holds. Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]> * hardening(workload): address PR #983 review findings - Add TestGenerateWorkload_ZeroSessionClosedLoopClient_EmitsWarning (BC-1): directly verifies the warning fires when a closed-loop client produces no session blueprints. Uses a lifecycle window beyond the horizon to deterministically trigger the zero-session path; captures logrus output to assert warning text. - Expand R1 justification comment in generator.go: explains why warn-only is correct (unreachable via public API, blueprint loop is safe no-op on empty map). - Fix test guard message: "spec changed?" → accurate description of the invariant that ExpandInferencePerfSpec must produce one client per (prompt, user, stage). Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]> --------- Co-authored-by: Claude Sonnet 4.6 (1M context) <[email protected]>
fix(workload): change inference_perf SLOClass from "batch" to "standa… …rd" (#965) (#968) * fix(workload): change inference_perf SLOClass from \"batch\" to \"standard\" (#965) Commit 8bc7a48 introduced a deferred queue that parks any request with SLOClass \"batch\" or \"background\" when the cluster is busy. inference_perf.go had been generating all clients with SLOClass \"batch\" since before the deferred queue existed, when the label had no scheduling effect. After 8bc7a48, every inference_perf request after the first was deferred until the cluster went fully idle, serializing all requests one-by-one instead of batching them. This inflated TTFT by 6-100x depending on load, breaking all training experiments that use inference_perf workloads (issue #965). Fix: change the three SLOClass assignments in ExpandInferencePerfSpec from \"batch\" to \"standard\". inference_perf models production inference benchmarking traffic, which should not yield to real-time traffic under the deferred queue semantics. \"standard\" is the correct tier. Add TestInferencePerfClients_SLOClass_IsStandard to regression-guard this. Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]> * test(cluster): assert standard SLO bypasses deferred queue; batch SLO is serialized (#965) Add two regression-guard tests to cluster_deferred_test.go: TestDeferredQueue_StandardSLONotSerialized (BC-2): 10 standard-class requests arriving every 10µs must produce mean TTFT < 15ms. Without the inference_perf fix, this would be ~100ms (full serialization). The 15ms bound gives 2.4x margin over the expected non-serialized TTFT of ~6.2ms. TestDeferredQueue_BatchSLOIsSerializedAboveBound (BC-3): Guard-validity companion: 10 batch-class requests in the same setup must produce mean TTFT >= 15ms, confirming the bound discriminates between serialized and non-serialized execution. Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]> * docs(plans): add implementation plan for inference_perf SLOClass fix (#965) Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]> * feat(golden): trained-physics iter29 golden dataset with behavioral tests (#965) Three changes in one commit: 1. Remove model_configs/ from .gitignore and add negation for model_configs/*/config.json so architecture configs are tracked. These files are small JSON specs (no model weights) and are needed for roofline/trained-physics tests to run without network access. 2. Add testdata/trained_physics_iter29.json — golden values for all 15 training experiments under the trained-physics backend with iter29 coefficients (alpha/beta from sequential golden section search, loss 34.5675%). Generated from the patched binary after the SLOClass fix. 3. Add TestTrainedPhysics_GoldenDataset in sim/cluster/ — runs all 15 experiments via ClusterSimulator with the exact iter29 configuration (GenerateWorkload + SessionManager for multi-turn chat, same KV offload parameters as the training runner) and asserts: - Invariant: request conservation, zero dropped, causality (TTFT < E2E) - Golden: TTFT/E2E/ITL metrics byte-for-byte identical (relTol=1e-9) If the trained-physics backend needs behavioral changes in the future, rename it first (e.g. "trained-physics-v2"). Do not update golden values in place — that would silently accept regressions. Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]> * fix(gitignore): add negation for model_configs/*/config.json The *.json catch-all at line 44 was preventing model config files from being staged. Add an explicit negation so architecture specs under model_configs/ are tracked alongside the golden dataset negations. Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]> * fix(golden): address PR review issues — parallelism, nil guard, INV-1, docs C1: Add testing.Short() gate + t.Parallel() on all 15 sub-tests. Wall-clock time drops from ~68s to ~23s (bounded by slowest experiment). I1: Guard against nil ws.InferencePerf after yaml.Unmarshal — malformed golden JSON now fails with a clear diagnostic instead of opaque panic. I2: Expand TestInferencePerfClients_SLOClass_IsStandard to table-driven with 3 cases covering all code paths: single-stage/no-multiturn (line 183), single-stage/multiturn (line 131), and multi-stage (line 237). I3: Complete INV-1 conservation check — add StillQueued, StillRunning, and RejectedRequests assertions alongside existing DroppedUnservable and TimedOutRequests checks. Catches request leaks independently of golden values. I4: Fix sortedValues doc comment — clarify that key-sort satisfies R2 (deterministic map iteration) while value-sort serves percentile computation. I5: Update CLAUDE.md Recent Changes with SLOClass fix, model_configs tracking, and trained-physics golden dataset. Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]> * docs: fix stale bound description and asymmetric margin comment PR review (#968) flagged two documentation issues: 1. cluster_deferred_test.go:322: "~2.4× margin each side" was factually wrong — the two margins are different. Corrected to "~2.4× above non-serialized (6.2ms) and ~6.7× below serialized (100ms)". 2. docs/plans: Section C still referenced the old 5ms bound from the initial draft (before alpha/beta coefficients were found to be swapped). Updated to match the corrected 15ms bound implemented in Section F, with a deviation log note explaining why it changed. Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]> * fix(test): remove duplicate docstring header in TestInferencePerfClients_SLOClass_IsStandard The cat-append operation left the original 3-line comment block when the test was expanded to table-driven form. Lines 1664-1666 were a verbatim duplicate of lines 1667-1668. Remove the stale first occurrence. Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]> --------- Co-authored-by: Claude Sonnet 4.6 (1M context) <[email protected]>
feat(latency): add trained-physics model (#950) * feat(latency): add evolved model with architecture-aware MoE overhead (BC-2,BC-5,BC-6,BC-7,BC-8,BC-9,BC-11) - Copy evolved_model.go from training branch - Implements LatencyModel interface with roofline basis functions - Architecture-aware β₈ scaling: applies to interleaved MoE, skips uniform MoE - StepTime, QueueingTime, OutputTokenProcessingTime, PostDecodeFixedOverhead - Add InterleaveMoELayerStep and DenseIntermediateDim fields to ModelConfig (required by evolved model for Scout-style interleaved MoE/dense architectures) Co-Authored-By: Claude <[email protected]> Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * fix(latency): address code quality issues in evolved model - Add weightBPP > 0 validation (R3 compliance) - Change Alpha/Beta to unexported alpha/beta fields (R8 compliance) - Fix hasInterleavedMoE to require NumLocalExperts > 1 (semantic consistency) Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * test(latency): add evolved model coefficient validation tests - Copy evolved_model_test.go from training branch - β₁₀ batching inefficiency tests (quadratic scaling, batch size effects) - β₃' KV sequence length tests (linear scaling with layers) - β₁₀ physics analysis validating μs-scale coefficient ranges Co-Authored-By: Claude <[email protected]> * feat(latency): register evolved backend in factory (BC-10) - Add case "evolved" to NewLatencyModel factory - Dispatches to NewEvolvedModel with validation Co-Authored-By: Claude <[email protected]> * feat(sim): add evolved to valid latency backends (BC-10) - Add "evolved" to validLatencyBackends map - Enables CLI flag validation for --latency-model evolved Co-Authored-By: Claude <[email protected]> * feat(config): add evolved model trained coefficients (BC-10) - Add evolved_coefficients section to defaults.yaml with iter25 coefficients - Alpha: [15561.96, 776.24, 45.91] (API/framework overheads in µs) - Beta: 10 coefficients including architecture-aware β₈ (427.3 µs/MoE-layer) - Add EvolvedDefaults struct to Config for R10 strict YAML parsing - Full precision preserved from training/iterations/iter25 Co-Authored-By: Claude <[email protected]> * docs: add evolved latency model to CLAUDE.md - Update latency estimation section with evolved backend - Document architecture-aware MoE overhead scaling - Document 10-beta mode (prefill compute-only, decode memory-only) Co-Authored-By: Claude <[email protected]> * docs: add evolved latency backend to README - Add evolved to list of available latency backends - Add evolved_model.go to file tree listing - Users can now discover --latency-model evolved option Co-Authored-By: Claude <[email protected]> * fix(cli): add evolved backend CLI wiring and test updates - Add evolved coefficient loading from defaults.yaml - Add evolved to analytical backends processing blocks - Update flag help string to include evolved - Add evolved assertions to bundle_test.go Co-Authored-By: Claude <[email protected]> * fix(roofline): use batch size for MoE nEff calculation in weight bandwidth Cherry-picked from PR #878 (commit fcad468). Fixes MoE weight bandwidth calculation by passing totalNewTokens to calculateMemoryAccessBytes instead of 0. This ensures nEff reflects the real batch size and MoE expert weights are correctly accounted for. Without this fix, nEff=0 causes massive underestimation of weight bandwidth (~7 GB instead of ~39 GB for Scout FP8). Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * fix(roofline): complete Scout MoE interleaved architecture fixes (issue #877) Addresses all three bugs from issue #877 in roofline.go and config.go: **Bug 1 - Interleaved MoE Architecture Ignored:** - Add parsing of `interleave_moe_layer_step` field in config.go - Split FLOPs calculation into MoE vs dense layers in calculateTransformerFlops - Scout (48 layers, step=1): 24 MoE + 24 dense correctly calculated **Bug 2 - DenseIntermediateDim Field Not Parsed:** - Add parsing of `intermediate_size_mlp` field in config.go - Use DenseIntermediateDim for dense layer FFN dimensions - Scout dense layers now use 16384 FFN (not 8192 MoE expert FFN) **Bug 3 - nEff Applied to All Layers:** - Split weight bandwidth into MoE (with nEff) vs dense (without nEff) - nEff expert loading now only applies to MoE layers - Dense layers contribute full weight bandwidth regardless of batch size **Test Coverage:** - Add comprehensive TestRooflineStepTime_Scout_InterleavedMoE - Validates Scout produces 39.26 GB weight bandwidth (not 7.05 GB) - Confirms nEff=0 bug fixed: newTokens=0 now produces 7.05 GB (dense only) - Verifies FLOPs split and both architectures produce positive latencies Impact: Scout TTFT predictions improve from 24× underestimate to <5× error. Mixtral and other uniform MoE models unaffected (backward compatible). Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * feat(latency): add trained-physics model with iter29 coefficients (issue #939) Replaces evolved backend with trained-physics latency model using coefficients from iteration 29 (sequential golden section search, loss: 34.57%). **Changes:** 1. **Add trained_physics_model.go** - Physics-informed roofline with learned corrections - Copied from training branch iter29 (commit e0e03b3) - Renamed EvolvedModel → TrainedPhysicsModel - Updated iteration reference: iter15 → iter29 - Backend name: "trained-physics" (hyphenated, like "trained-roofline") 2. **Add trained_physics_model_test.go** - Behavioral tests from iter29 - Tests for β₁₀ batching inefficiency, β₃' KV sequence length, physics analysis 3. **Add trained_physics_coefficients to defaults.yaml** - Alpha (µs): α₁=15563.199579, α₂=777.3455, α₃=45.907545 - Beta: β₁=0.152128, β₂=0.0, β₃=1.36252915, β₄=0.752037, β₅=32.09546717, β₆=4.41684444, β₇=126.024825, β₈=481.8613888, β₉=0.0, β₁₀=1.94710771 - Replaced evolved_coefficients section 4. **Update cmd/default_config.go** - Renamed EvolvedDefaults → TrainedPhysicsDefaults - Updated yaml tag: evolved_coefficients → trained_physics_coefficients - Updated docstring: iter25 → iter29 5. **Update cmd/root.go** - Backend resolution: "evolved" → "trained-physics" - Config loading: cfg.EvolvedDefaults → cfg.TrainedPhysicsDefaults - Error messages updated 6. **Update sim/bundle.go** - validLatencyBackends: "evolved" → "trained-physics" 7. **Update sim/latency/latency.go** - Factory case: "evolved" → "trained-physics" - Constructor call: NewEvolvedModel → NewTrainedPhysicsModel 8. **Update sim/bundle_test.go** - Test assertions: "evolved" → "trained-physics" 9. **Remove evolved files** - Deleted evolved_model.go and evolved_model_test.go **Architecture-Aware MoE Overhead:** β₈ applies conditionally: - Interleaved MoE (InterleaveMoELayerStep > 0): β₈ × nMoELayers overhead - Uniform MoE (InterleaveMoELayerStep = 0): β₈ skipped (moeScaling=0.0) - Dense models (nMoELayers = 0): β₈ term naturally zero **Testing:** - All tests pass: go test ./... - Backend validation updated - Builds successfully: go build ./... Fixes #939 Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * docs(latency): improve trained-physics model documentation Replace iteration-specific training history with architectural documentation that explains the model design, coefficient meanings, and physical justifications. **Changes to trained_physics_model.go:** - Removed training iteration references (iter15, iter29) - Added comprehensive model architecture overview - Documented step-time formula with all 3 coefficient modes (8, 9, 10 betas) - Explained each beta coefficient with: - Physical meaning (what it corrects) - Units and typical magnitude - Why it exists (kernel efficiency, bandwidth contention, etc.) - Explained alpha coefficients (API/framework overheads) - Documented architecture-aware features (interleaved MoE, quantization, TP) **Changes to trained_physics_model_test.go:** - Removed training iteration references (iter10, iter11) - Updated test docstrings to explain coefficient behavior - TestBeta10BatchingInefficiency: explain β₁₀ as decode memory correction - TestBeta3PrimeKVSeqLen: clarify this tests analytical basis function - TestBeta10PhysicsAnalysis: replaced iteration comparison with dimensional analysis **Benefits:** - Documentation is now timeless (won't become stale with new training runs) - Explains "why" each coefficient exists (physical justification) - More useful for users trying to understand model behavior - Clearer for future maintainers All tests pass unchanged. * feat(latency): add trained-physics model (#939) Add trained-physics latency model backend from training branch iter29. This trained model applies learned correction factors to roofline basis functions, generalizing across model architectures, workloads, and TP configurations without per-model calibration. Changes: - Add trained_physics_model.go (10-beta mode with architecture-aware terms) - Add trained_physics_model_test.go (β₁₀ unit tests) - Add trained_physics_coefficients to defaults.yaml - Update latency model factory registration - Update all documentation (positioned as recommended default) - Cherry-pick Scout MoE nEff fix from PR #878 Architecture-aware features: - β₈ per-MoE-layer overhead (applies only to interleaved architectures) - Quantization-aware weight bandwidth (FP8, W4A16 detection) - Optional 10-beta mode with prefill/decode split Pre-trained coefficients: 3 alpha (API overhead) + 10 beta (roofline corrections) Generalizes across dense, uniform MoE, and interleaved MoE architectures. Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * fix(latency): address PR review concerns for trained-physics model Fixes critical and important issues from PR #950 review: **Critical fixes (C1-C3):** - C1: Complete test rewrite — replaced disconnected math tests with proper behavioral tests (BC-1 through BC-7 from issue #939) - BC-1: Empty batch returns >= 1 - BC-2: Positive step time for all valid inputs (prefill, decode, mixed, large batches) - BC-3: Monotonicity tests (prefill tokens, decode batch size, sequence length) - BC-4: Architecture-aware β₈ scaling (interleaved vs uniform vs dense MoE) - BC-5: Overhead methods (QueueingTime, OutputTokenProcessingTime, PostDecodeFixedOverhead) - BC-6: Factory construction validation (TP, layers, NaN, negative coefficients) - BC-7: Config validation (coefficient length errors) - C2: Fixed isMoE threshold inconsistency (NumLocalExperts > 1, not > 0) — single-expert models now correctly classified as dense - C3: Added missing MFU values (MfuPrefill=0.55, MfuDecode=0.30) to Scout test to fix division by zero **Important fixes (I1):** - I1: Removed dead `dFF` field from TrainedPhysicsModel struct (written but never read) **Key insight from BC-4:** The implementation correctly uses two mechanisms: 1. `numMoELayers` — counts MoE layers for FLOPs/bandwidth calculations 2. `hasInterleavedMoE` — gates β₈ overhead via moeScaling (1.0 if interleaved, 0.0 otherwise) For uniform MoE (Mixtral-style), numMoELayers = NumLayers (all layers do MoE work for FLOPs), but hasInterleavedMoE = false so β₈ overhead doesn't apply. Uniform MoE is more expensive than interleaved because it does more MoE work (all 48 layers vs 24 MoE + 24 dense), even without β₈ overhead. All tests now pass. ✓ Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * fix(latency): address remaining PR review concerns (I2, I4, I5, I3 note) Fixes important issues from PR #950 review: **I2: YAML key naming consistency** - Changed TrainedPhysicsDefaults struct tags from `yaml:"alpha"` / `yaml:"beta"` to `yaml:"alpha_coeffs"` / `yaml:"beta_coeffs"` to match TrainedRooflineDefaults and CrossModelDefaults - Updated defaults.yaml to use `alpha_coeffs:` and `beta_coeffs:` keys - Reduces friction for manual defaults.yaml editing **I4: MoE consistency validation** - Added validation in NewTrainedPhysicsModel: if NumLocalExperts > 1 then NumExpertsPerTok must be > 0 - Mirrors ValidateRooflineConfig check (config.go:382) - Prevents silent misconfiguration where kEff falls through to 1, giving wrong FLOPs - Added test case: invalid_moe_missing_experts_per_tok **I5: Documentation correction** - Updated latency-models.md comparison table: Roofline MoE support changed from "No (dense only)" to "Yes (per-expert FLOPs + effective expert count)" - Reflects PR #877 fix that added interleaved MoE support to roofline **I3 note: Accuracy metrics** - Updated table to note trained-physics accuracy has been "separately tested" - Full test-split MAPE metrics are available in training artifacts All tests pass. Verified defaults.yaml loads correctly with new YAML keys. Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * docs(latency): remove 'separately tested' phrase from accuracy description Remove the 'separately tested' phrase from trained-physics accuracy description in latency-models.md comparison table. Keep it concise like the other columns. Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * fix(latency): address final PR review feedback for trained-physics model Fixes all outstanding items from PR #950 review comments: **Critical fixes:** - README.md: Add required --hardware and --tp flags to trained-physics example - sim/latency/trained_physics_model.go:136: Fix isMoE field comment (> 1, not > 0) - sim/latency/trained_physics_model.go:466: Add NumLocalExperts > 1 guard to hasInterleavedMoE **Documentation completeness:** - sim/latency/latency.go: Add TrainedPhysicsModel to package doc comment - sim/latency_model.go: Update "Four" → "Five" implementations - docs/concepts/glossary.md: Update "Four" → "Five" modes, add Trained-Physics **User-facing documentation:** - docs/guide/latency-models.md: Add comprehensive "Generalization Scope" section - Supported hardware (H100/A100 with specs) - Model architectures (dense/uniform MoE/interleaved MoE) - Workload types (prefill-heavy/decode-heavy/mixed) - Rationale for "recommended" status vs trained-roofline Addresses review feedback: - #950 (comment) - #950 (comment) Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * fix(latency): correct hardware flag from H100-SXM to H100 The hardware_config.json key is 'H100', not 'H100-SXM'. Valid keys: H100, A100-SXM, A100-80, L40S Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * docs(latency): add L40S to supported hardware, clarify H100 training - Add L40S (48 GB GDDR6, 362 TFLOPS BF16 / 1466 TFLOPS FP8) to hardware list - Clarify that coefficients were trained on H100 traces but generalize via roofline basis functions to other GPUs without per-GPU calibration Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * fix(latency): correct coefficient count (13 vs 10, not 14 vs 11) Trained-physics: 10 beta + 3 alpha = 13 coefficients Trained-roofline: 7 beta + 3 alpha = 10 coefficients The prefill/decode split (β₁ₐ/β₁ᵦ, β₂ₐ/β₂ᵦ) adds 2 extra beta terms vs trained-roofline's max() approach, plus β₈ for MoE overhead. Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * docs(readme): move hardware/TP defaults note to Quick Start section Moved the note about --hardware/--tp defaulting to H100/TP=1 from the trained-physics section up to the Quick Start section where it applies generally to all latency models. Removed redundant mention from the trained-physics example. Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * docs(readme): soften claim from 'all' to 'most' model architectures Changed trained-physics accuracy claim from 'across all model architectures' to 'across most model architectures' for more conservative language. Co-Authored-By: Claude Sonnet 4.5 <[email protected]> --------- Co-authored-by: Claude <[email protected]>
PreviousNext