Init commit #1

Merged
sriumcp merged 9 commits into inference-sim:main from sriumcp:initcommit
Jun 10, 2025

Collaborator

@sriumcp sriumcp commented Jun 10, 2025

A bunch of basic files for inference-sim, including the shapes of the request, the batch, etc.

sriumcp added 9 commits June 9, 2025 20:39
Signed-off-by: Srinivasan Parthasarathy <[email protected]>
Signed-off-by: Srinivasan Parthasarathy <[email protected]>
Signed-off-by: Srinivasan Parthasarathy <[email protected]>
Signed-off-by: Srinivasan Parthasarathy <[email protected]>
Signed-off-by: Srinivasan Parthasarathy <[email protected]>
Signed-off-by: Srinivasan Parthasarathy <[email protected]>
Signed-off-by: Srinivasan Parthasarathy <[email protected]>
Signed-off-by: Srinivasan Parthasarathy <[email protected]>
@sriumcp sriumcp merged commit f36f1a1 into inference-sim:main Jun 10, 2025
atantawi added a commit to atantawi/inference-sim that referenced this pull request Mar 14, 2026
…e-sim#1)

Introduces first-class infrastructure entities for cluster simulation:
NodePool/Node/GPU inventory with bin-packing placement, 6-state instance
lifecycle (Scheduling→Loading→WarmingUp→Active→Draining→Terminated),
multi-model request routing with per-model metrics output, warm-up TTFT
penalty for cold instances, and three drain policies (IMMEDIATE/WAIT/REDIRECT).
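
The six-state lifecycle above can be read as a one-way state machine. The following is a minimal sketch, not the repository's code: the `InstanceState` type and the `Advance` function are hypothetical names, and the sketch assumes that forward skips (for example Loading to Terminated) are legal while any backward or in-place move is rejected.

```go
package main

import "fmt"

// InstanceState is a hypothetical enum for the six lifecycle states.
type InstanceState int

const (
	Scheduling InstanceState = iota
	Loading
	WarmingUp
	Active
	Draining
	Terminated
)

var stateNames = [...]string{
	"Scheduling", "Loading", "WarmingUp", "Active", "Draining", "Terminated",
}

func (s InstanceState) String() string {
	if s < 0 || int(s) >= len(stateNames) {
		return fmt.Sprintf("InstanceState(%d)", int(s))
	}
	return stateNames[s]
}

// Advance moves an instance to a later state. It rejects any move that
// stays in place or goes backwards, which is the monotonicity property.
// Assumption: forward skips (for example Loading -> Terminated) are legal.
func Advance(from, to InstanceState) (InstanceState, error) {
	if to < Scheduling || to > Terminated {
		return from, fmt.Errorf("unknown target state %d", int(to))
	}
	if to <= from {
		return from, fmt.Errorf("non-monotonic transition %s -> %s", from, to)
	}
	return to, nil
}

func main() {
	s := Scheduling
	for _, next := range []InstanceState{Loading, WarmingUp, Active, Draining, Terminated} {
		var err error
		if s, err = Advance(s, next); err != nil {
			panic(err)
		}
		fmt.Println("now", s)
	}
	_, err := Advance(Active, Loading)
	fmt.Println("rejected:", err)
}
```

Encoding the states as ordered integers makes the monotonicity check a single comparison, which is also easy to assert in a test.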

Key additions:
- PlacementManager with first-fit bin-packing and GPU conservation invariant (INV-A)
- NodeReadyEvent/NodeDrainedEvent/InstanceLoadedEvent DES lifecycle events
- buildRouterState() model-filtered routing (backward-compatible, empty Model = all)
- inst.Model initialized from ModelHardwareConfig.Model at construction
- NaN/Inf guards on all float64 config fields (R3)
- Empty-snapshot guard in RoutingDecisionEvent prevents routing-policy panic
- warmUpRequestIDs cleared post-aggregateMetrics (memory management)
- Per-model TTFT/E2E/throughput in JSON output via ComputePerModelMetrics()
- 40+ new BDD/TDD tests including state-monotonicity invariant test

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
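
As an illustration of the first-fit placement and GPU conservation check listed under "Key additions", here is a small sketch. The `Node` struct, `PlaceFirstFit`, and `Conserved` are hypothetical names, and the conservation rule shown (free plus allocated equals total) is one plausible reading of INV-A rather than its actual definition.

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a hypothetical inventory record: total and currently free GPUs.
type Node struct {
	Name     string
	TotalGPU int
	FreeGPU  int
}

var ErrNoCapacity = errors.New("no node has enough free GPUs")

// PlaceFirstFit scans nodes in order and reserves gpus on the first node
// that can hold the whole request (first-fit, no splitting across nodes).
func PlaceFirstFit(nodes []*Node, gpus int) (*Node, error) {
	if gpus <= 0 {
		return nil, fmt.Errorf("invalid GPU request: %d", gpus)
	}
	for _, n := range nodes {
		if n.FreeGPU >= gpus {
			n.FreeGPU -= gpus
			return n, nil
		}
	}
	return nil, ErrNoCapacity
}

// Conserved reports whether free + allocated GPUs equals the total inventory
// and no node is over- or under-committed. This is an assumed reading of a
// GPU conservation invariant, not the definition of INV-A.
func Conserved(nodes []*Node, allocated int) bool {
	free, total := 0, 0
	for _, n := range nodes {
		if n.FreeGPU < 0 || n.FreeGPU > n.TotalGPU {
			return false
		}
		free += n.FreeGPU
		total += n.TotalGPU
	}
	return free+allocated == total
}

func main() {
	nodes := []*Node{{"n0", 4, 4}, {"n1", 8, 8}}
	allocated := 0
	for _, req := range []int{6, 4, 4} {
		n, err := PlaceFirstFit(nodes, req)
		if err != nil {
			fmt.Printf("request %d: %v\n", req, err)
			continue
		}
		allocated += req
		fmt.Printf("request %d -> %s\n", req, n.Name)
	}
	fmt.Println("conserved:", Conserved(nodes, allocated))
}
```

First-fit is order-sensitive: the 6-GPU request skips the 4-GPU node, so the later 4-GPU request can still land there.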
atantawi added a commit to atantawi/inference-sim that referenced this pull request Mar 16, 2026
…e-sim#1)

susiejojo added a commit that referenced this pull request Mar 19, 2026
…prove code quality

- Extract quantization parsing into parseQuantizationConfig helper (Issue #4)
- Consolidate duplicated trained-roofline warning into warnTrainedRooflineQuantization helper (Issue #1)
- Add validation warning when WeightBytesPerParam > BytesPerParam (Issue #2)
- Add debug logging for malformed compressed-tensors config_groups (Issue #3)
- Add error handling for invalid bits string-to-int coercion (Issue #5)
- Add comment explaining first-match semantics in compressed-tensors parsing (Issue #7)
- Fix inconsistent spacing in string concatenation (Issue #9)

All changes maintain backward compatibility and pass existing tests.
sriumcp pushed a commit that referenced this pull request Mar 19, 2026
…#698)

* feat(latency): decouple quantized weight precision from compute dtype (#443)

Roofline and KV capacity calculations now correctly use quantized weight
precision (e.g. 0.5 bytes/param for W4A16) for weight memory while keeping
the compute dtype (e.g. 2.0 for bfloat16) for KV cache and activations.

- Add WeightBytesPerParam field to ModelConfig with zero-value sentinel
- Parse quantization_config from HuggingFace config.json (GPTQ, AWQ, FP8)
- Add --weight-bytes-per-param CLI flag for manual override
- Update calculateMemoryAccessBytes() and computeModelWeightBytes()
- Add validation in ValidateRooflineConfig
- 18 new tests covering all behavioral contracts

Co-Authored-By: Claude Opus 4.6 <[email protected]>
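
To make the split between weight precision and compute dtype concrete, here is a worked sketch. Only `WeightBytesPerParam` and `BytesPerParam` come from the commit message; the other fields, the method names, and the 7B-parameter model shape are illustrative assumptions.

```go
package main

import "fmt"

// ModelConfig mirrors only the two precision fields named in the commit;
// the remaining fields are hypothetical and exist for the arithmetic.
type ModelConfig struct {
	NumParams           float64
	NumLayers           int
	NumKVHeads          int
	HeadDim             int
	BytesPerParam       float64 // compute dtype, e.g. 2.0 for bfloat16
	WeightBytesPerParam float64 // 0 means "unset": fall back to BytesPerParam
}

// EffectiveWeightBytes applies the zero-value sentinel.
func (m ModelConfig) EffectiveWeightBytes() float64 {
	if m.WeightBytesPerParam > 0 {
		return m.WeightBytesPerParam
	}
	return m.BytesPerParam
}

// WeightMemoryBytes uses the (possibly quantized) weight precision.
func (m ModelConfig) WeightMemoryBytes() float64 {
	return m.NumParams * m.EffectiveWeightBytes()
}

// KVBytesPerToken counts one key and one value vector per layer and KV head,
// stored in the compute dtype rather than the weight precision.
func (m ModelConfig) KVBytesPerToken() float64 {
	return 2 * float64(m.NumLayers*m.NumKVHeads*m.HeadDim) * m.BytesPerParam
}

func main() {
	base := ModelConfig{NumParams: 7e9, NumLayers: 32, NumKVHeads: 8, HeadDim: 128, BytesPerParam: 2.0}
	w4a16 := base
	w4a16.WeightBytesPerParam = 0.5

	fmt.Printf("bf16 weights:  %.1f GB\n", base.WeightMemoryBytes()/1e9)
	fmt.Printf("W4A16 weights: %.1f GB\n", w4a16.WeightMemoryBytes()/1e9)
	fmt.Printf("KV per token (both): %.0f bytes\n", w4a16.KVBytesPerToken())
}
```

Quantizing the weights shrinks only the weight term; the KV cache per token is unchanged because it stays in the compute dtype.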

* fix(latency): convergence R1 — blackbox path parity, KV validation, quant warnings

Address 3 IMPORTANT findings from convergence review Round 1:

1. Apply --weight-bytes-per-param CLI override in blackbox KV auto-calc
   path (R23: code path parity with analytical backend)
2. Add WeightBytesPerParam validation in CalculateKVBlocks public API
3. Warn when quantization_config is present but weight precision could
   not be auto-detected (both analytical and blackbox paths)

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): convergence R2 — early flag validation, backend-specific logs

Move --weight-bytes-per-param validation before backend switch to avoid
Fatalf inside the best-effort blackbox block (preserves fall-through
contract). Make weight precision log backend-specific: roofline uses it
for step time; trained-roofline/crossmodel only for KV capacity.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* style: gofmt struct/var alignment after WeightBytesPerParam addition

Convergence R3: gofmt -w to re-align ModelConfig struct fields and
cmd/root.go var block after longest field name changed.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* style: gofmt const/map alignment in kv_capacity files

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: document MFU calibration approximation for quantized models

Add code comment in rooflineStepTime noting that MFU values were
calibrated against FP16 measurements. For quantized models (W4A16),
reduced weight bandwidth shifts the roofline crossover, producing
conservative estimates — safe for capacity planning.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
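
The crossover shift mentioned above can be seen with the textbook roofline formula. This sketch is not `rooflineStepTime`; the `StepTime` function and every number in `main` are illustrative assumptions chosen only to show the effect.

```go
package main

import (
	"fmt"
	"math"
)

// StepTime is the generic roofline estimate: a step takes as long as the
// slower of its compute phase and its memory-traffic phase.
// mfu scales the peak FLOP rate down to an achievable rate.
func StepTime(flops, bytes, peakFlops, mfu, bandwidth float64) float64 {
	compute := flops / (peakFlops * mfu)
	memory := bytes / bandwidth
	return math.Max(compute, memory)
}

func main() {
	// Illustrative values only, not calibrated hardware numbers.
	const (
		flops     = 1e12 // FLOPs in one decode step
		peakFlops = 1e15 // peak FLOP/s
		mfu       = 0.5  // assumed utilization
		bandwidth = 2e12 // bytes/s
	)
	fp16 := StepTime(flops, 14e9, peakFlops, mfu, bandwidth)
	w4a16 := StepTime(flops, 3.5e9, peakFlops, mfu, bandwidth)
	fmt.Printf("FP16 weights:  %.5f s\n", fp16)
	fmt.Printf("W4A16 weights: %.5f s\n", w4a16)
}
```

With these numbers the FP16 step is memory-bound, while the W4A16 step moves so few bytes that the compute term becomes the larger one. Past that point the MFU value, which the commit says was calibrated on FP16, determines the estimate.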

* fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

- Use strings.EqualFold for FP8 quant_method matching (case-insensitive)
- Add test for FP8 with bits=8 present (verifies bits-first path agrees)
- Document why WeightBytesPerParam > BytesPerParam is accepted in validation

Co-Authored-By: Claude Opus 4.6 <[email protected]>
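
A sketch of the bits-first path with a case-insensitive FP8 fallback follows. `strings.EqualFold` is the standard-library call named in the commit; the function name `weightBytesFromQuantConfig` and its signature are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

const bitsPerByte = 8.0

// weightBytesFromQuantConfig returns bytes per parameter, or 0 if the
// precision could not be detected. bits <= 0 means the config carried no
// usable "bits" field, in which case the quant method is consulted.
func weightBytesFromQuantConfig(bits int, quantMethod string) float64 {
	if bits > 0 {
		return float64(bits) / bitsPerByte
	}
	if strings.EqualFold(quantMethod, "fp8") {
		return 1.0
	}
	return 0
}

func main() {
	fmt.Println(weightBytesFromQuantConfig(4, "gptq")) // bits field wins
	fmt.Println(weightBytesFromQuantConfig(0, "FP8"))  // method fallback, any casing
	fmt.Println(weightBytesFromQuantConfig(8, "fp8"))  // both paths agree
	fmt.Println(weightBytesFromQuantConfig(0, "awq"))  // undetected
}
```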

* feat(latency): complete three-tier quantized weight detection, remove CLI flag

Add compressed-tensors parsing (config_groups.*.weights.num_bits) to close
the gap where w8a8 models silently fell back to torch_dtype. Add model name
convention detection (w4a16, fp8) as a second-tier fallback. Remove the
--weight-bytes-per-param CLI flag — all three issue #443 options are now
covered by auto-detection, making the manual override unnecessary.

Detection chain: quantization_config → model name → torch_dtype fallback.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
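
The detection chain can be sketched as a simple ordered fallback. `DetectWeightBytes`, its parameters, and the model names in `main` are hypothetical; the sketch assumes the name tier does a lowercase substring match on the two conventions the commit mentions (w4a16, fp8).

```go
package main

import (
	"fmt"
	"strings"
)

// DetectWeightBytes walks the three tiers in order and reports which one won.
// quantConfigBytes is the result of parsing quantization_config
// (0 means nothing was detected there).
func DetectWeightBytes(quantConfigBytes float64, modelName string, torchDtypeBytes float64) (float64, string) {
	if quantConfigBytes > 0 {
		return quantConfigBytes, "quantization_config"
	}
	name := strings.ToLower(modelName)
	switch {
	case strings.Contains(name, "w4a16"):
		return 0.5, "model name"
	case strings.Contains(name, "fp8"):
		return 1.0, "model name"
	}
	return torchDtypeBytes, "torch_dtype"
}

func main() {
	cases := []struct {
		quant float64
		name  string
	}{
		{1.0, "example-org/model-a"},
		{0, "example-org/model-b-W4A16"},
		{0, "example-org/model-c"},
	}
	for _, c := range cases {
		b, tier := DetectWeightBytes(c.quant, c.name, 2.0)
		fmt.Printf("%s: %.1f bytes/param via %s\n", c.name, b, tier)
	}
}
```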

* fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

C1: Extract applyWeightPrecisionFallback shared helper (cmd/hfconfig.go)
    and add model name fallback to cmd/replay.go roofline path (was missing).
    Three call sites (root.go blackbox, root.go roofline, replay.go) now
    share identical fallback + logging logic.

C2: Sort config_groups keys before iteration in compressed-tensors
    parsing (INV-6 determinism).

I2: Add string-to-int coercion for quantization_config.bits field
    (handles "bits": "4" from some HF configs).

I3: Add TestRooflineStepTime_W4A16_LowerThanFP16_MemoryBoundDecode
    end-to-end test verifying quantized model produces lower decode
    step time than FP16 in memory-bound regime.

I4: Add TestValidateRooflineConfig_InfWeightBytesPerParam_ReturnsError
    covering +Inf validation gap.

Minor: Remove double blank line in root.go init().

Co-Authored-By: Claude Opus 4.6 <[email protected]>
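
Items C2 and I2 can be illustrated with two small helpers. Both function names are hypothetical, and `groups` is simplified to a map from group name to `num_bits` rather than the full compressed-tensors structure. Go map iteration order is randomized, so sorting the keys is what makes the first match deterministic.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

// coerceBits accepts the shapes a decoded JSON "bits" value may take:
// a number (encoding/json yields float64) or a numeric string such as "4".
func coerceBits(v any) (int, error) {
	switch b := v.(type) {
	case float64:
		if b != float64(int(b)) {
			return 0, fmt.Errorf("bits is not an integer: %v", b)
		}
		return int(b), nil
	case string:
		n, err := strconv.Atoi(b)
		if err != nil {
			return 0, fmt.Errorf("bits %q is not an integer: %w", b, err)
		}
		return n, nil
	default:
		return 0, fmt.Errorf("unsupported bits type %T", v)
	}
}

// firstNumBits returns the first group, in sorted key order, whose
// num_bits is positive. Sorting removes the dependence on map order.
func firstNumBits(groups map[string]int) (string, int, bool) {
	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		if bits := groups[k]; bits > 0 {
			return k, bits, true
		}
	}
	return "", 0, false
}

func main() {
	for _, v := range []any{"4", float64(8), "four", true} {
		n, err := coerceBits(v)
		fmt.Println(n, err)
	}
	name, bits, ok := firstNumBits(map[string]int{"group_1": 8, "group_0": 4})
	fmt.Println(name, bits, ok)
}
```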

* docs: update quantization support language across guides, references, and extension recipes

Five documentation pages still claimed quantization was "not yet modeled" and
recommended blackbox mode, contradicting the three-tier auto-detection merged
in this branch. Addresses PR #698 review comments D1–D5.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): warn when trained-roofline ignores quantized weight precision (I4)

Trained-roofline hardcodes FP16 bytesPerElement to match its training
pipeline, so WeightBytesPerParam only affects KV capacity, not step time.
Previously the CLI logged "quantized model detected" without mentioning
this limitation, which was misleading. Now emits an explicit warning in
both run and replay paths.

Addresses review item I4 from PR #698 convergence review.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: address cosmetic review comments

- Add bitsPerByte constant to replace magic number 8.0 (issue #6)
- Improve roofline approximation comment with quantitative guidance (issue #8)

* Address PR #698 review feedback: refactor quantization parsing and improve code quality

- Extract quantization parsing into parseQuantizationConfig helper (Issue #4)
- Consolidate duplicated trained-roofline warning into warnTrainedRooflineQuantization helper (Issue #1)
- Add validation warning when WeightBytesPerParam > BytesPerParam (Issue #2)
- Add debug logging for malformed compressed-tensors config_groups (Issue #3)
- Add error handling for invalid bits string-to-int coercion (Issue #5)
- Add comment explaining first-match semantics in compressed-tensors parsing (Issue #7)
- Fix inconsistent spacing in string concatenation (Issue #9)

All changes maintain backward compatibility and pass existing tests.

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
sriumcp pushed a commit that referenced this pull request Mar 20, 2026
* feat(sim): add L40S GPU and FP8 compute support

- Add L40S hardware configuration (362.05 TFLOPS, 48GB, 0.864 TB/s)
- Add TFlopsFP8 field to HardwareCalib for native FP8 tensor core support
- Update H100 with TFlopsFP8=1979.0 (2× FP16 rate) and adjusted MFU values
- Update A100-SXM and A100-80 with TFlopsFP8=0 (no native FP8 support)
- Implement FP8 compute selection in roofline model based on weight precision
- Add comprehensive tests for FP8 compute selection logic

Fixes #762

Co-Authored-By: Claude <[email protected]>

* docs: add MFU justification and validation tests

- Add inline documentation to hardware_config.json linking to Discussion #589
- Add comprehensive MFU validation tests in sim/config_test.go
  - Validates MFU ranges (0 < MFU < 1)
  - Validates MfuDecode < MfuPrefill relationship
  - Tests all GPU types (H100, A100, L40S)
- Update docs/reference/models.md with MFU calibration info box
- All tests pass

Addresses review findings from quick-review
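
The two MFU checks described in this commit can be sketched as a single validator. The struct layout, the `ValidateMFU` name, and the values in `main` are assumptions for illustration and are not the calibrated numbers from hardware_config.json.

```go
package main

import "fmt"

// HardwareCalib here carries only the two MFU fields; the real struct
// presumably has more.
type HardwareCalib struct {
	MfuPrefill float64
	MfuDecode  float64
}

// inOpenUnit reports 0 < x < 1. Comparisons with NaN are false, so NaN fails.
func inOpenUnit(x float64) bool { return x > 0 && x < 1 }

// ValidateMFU enforces the range check and the decode < prefill ordering.
func ValidateMFU(name string, hc HardwareCalib) error {
	if !inOpenUnit(hc.MfuPrefill) {
		return fmt.Errorf("%s: MfuPrefill %v outside (0, 1)", name, hc.MfuPrefill)
	}
	if !inOpenUnit(hc.MfuDecode) {
		return fmt.Errorf("%s: MfuDecode %v outside (0, 1)", name, hc.MfuDecode)
	}
	if hc.MfuDecode >= hc.MfuPrefill {
		return fmt.Errorf("%s: MfuDecode %v must be below MfuPrefill %v",
			name, hc.MfuDecode, hc.MfuPrefill)
	}
	return nil
}

func main() {
	// Illustrative values, not the repository's calibration.
	fmt.Println(ValidateMFU("ok-gpu", HardwareCalib{MfuPrefill: 0.55, MfuDecode: 0.30}))
	fmt.Println(ValidateMFU("swapped-gpu", HardwareCalib{MfuPrefill: 0.30, MfuDecode: 0.55}))
}
```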

* feat(latency): add TFlopsFP8 validation (R3)

Add validation for HardwareCalib.TFlopsFP8 in ValidateRooflineConfig,
following the same optional-field pattern as MemoryGiB: check if non-zero,
then reject NaN/Inf/negative values.

Includes test coverage for NaN and negative TFlopsFP8 cases.
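
The optional-field pattern described here (zero means "not set", anything else must be finite and non-negative) might look like the sketch below. `validateOptionalRate` is a hypothetical helper, not the code inside `ValidateRooflineConfig`.

```go
package main

import (
	"fmt"
	"math"
)

// validateOptionalRate treats 0 as "field not provided" and otherwise
// rejects NaN, +/-Inf, and negative values.
func validateOptionalRate(name string, v float64) error {
	if v == 0 {
		return nil // optional field left unset
	}
	if math.IsNaN(v) || math.IsInf(v, 0) || v < 0 {
		return fmt.Errorf("%s must be a finite positive number, got %v", name, v)
	}
	return nil
}

func main() {
	for _, v := range []float64{0, 1979.0, -1, math.NaN(), math.Inf(1)} {
		fmt.Printf("%v -> %v\n", v, validateOptionalRate("TFlopsFP8", v))
	}
}
```

Note that NaN never compares equal to zero, so it falls through the early return and reaches the explicit `math.IsNaN` check.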

* fix: address review nits

- roofline.go:217: Fix comment wording from 'upcasted' to 'dequantized to FP16 during GEMM'
- hardware_config.json:8: Restore H100 mfuDecode to 0.30 for consistency with other entries

* docs(roofline): clarify FP8 compute rate selection logic

Improve comment to explicitly document that == 1.0 is intentional:
- FP8 models (exactly 1.0 byte/param) use FP8 compute rate on H100
- Sub-FP8 formats (e.g., W4A16 at 0.5 bytes/param) dequantize to FP16 during GEMM

This addresses the review comment about == 1.0 exact equality. The behavior
is correct: only true FP8 models use FP8 tensor cores. W4A16 and other
sub-FP8 formats use FP16 compute after dequantization, as validated by
TestRooflineStepTime_FP8ComputeSelection_EdgeCases.
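
A minimal sketch of the selection rule described above. `TFlopsFP8` is the field named in the commits; `TFlopsFP16` and `computeTFlops` are hypothetical names. The H100 FP16 figure is derived from the "2× FP16 rate" note, and the A100 figure is an illustrative value, not read from hardware_config.json.

```go
package main

import "fmt"

type HardwareCalib struct {
	TFlopsFP16 float64 // hypothetical field name for the FP16/BF16 peak
	TFlopsFP8  float64 // 0 means no native FP8 tensor cores
}

// computeTFlops picks the FP8 rate only for weights at exactly 1.0 byte per
// parameter on hardware that reports an FP8 rate. Sub-FP8 formats such as
// W4A16 (0.5 bytes/param) are dequantized to FP16, so they use the FP16 rate.
func computeTFlops(hw HardwareCalib, weightBytesPerParam float64) float64 {
	if weightBytesPerParam == 1.0 && hw.TFlopsFP8 > 0 {
		return hw.TFlopsFP8
	}
	return hw.TFlopsFP16
}

func main() {
	h100 := HardwareCalib{TFlopsFP16: 989.5, TFlopsFP8: 1979.0}
	a100 := HardwareCalib{TFlopsFP16: 312.0, TFlopsFP8: 0}

	fmt.Println("H100, FP8 weights:  ", computeTFlops(h100, 1.0))
	fmt.Println("H100, W4A16 weights:", computeTFlops(h100, 0.5))
	fmt.Println("A100, FP8 weights:  ", computeTFlops(a100, 1.0))
}
```

The exact comparison is safe here because 1.0 is exactly representable and is assigned rather than computed. A threshold such as `<= 1.0` would wrongly route W4A16 to the FP8 rate.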

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
vishakha-ramani pushed a commit to atantawi/inference-sim that referenced this pull request Mar 20, 2026
…inference-sim#698)

vishakha-ramani pushed a commit to atantawi/inference-sim that referenced this pull request Mar 20, 2026
* feat(latency): decouple quantized weight precision from compute dtype (inference-sim#443)

Roofline and KV capacity calculations now correctly use quantized weight
precision (e.g. 0.5 bytes/param for W4A16) for weight memory while keeping
the compute dtype (e.g. 2.0 for bfloat16) for KV cache and activations.

- Add WeightBytesPerParam field to ModelConfig with zero-value sentinel
- Parse quantization_config from HuggingFace config.json (GPTQ, AWQ, FP8)
- Add --weight-bytes-per-param CLI flag for manual override
- Update calculateMemoryAccessBytes() and computeModelWeightBytes()
- Add validation in ValidateRooflineConfig
- 18 new tests covering all behavioral contracts

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): convergence R1 — blackbox path parity, KV validation, quant warnings

Address 3 IMPORTANT findings from convergence review Round 1:

1. Apply --weight-bytes-per-param CLI override in blackbox KV auto-calc
   path (R23: code path parity with analytical backend)
2. Add WeightBytesPerParam validation in CalculateKVBlocks public API
3. Warn when quantization_config is present but weight precision could
   not be auto-detected (both analytical and blackbox paths)

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): convergence R2 — early flag validation, backend-specific logs

Move --weight-bytes-per-param validation before backend switch to avoid
Fatalf inside the best-effort blackbox block (preserves fall-through
contract). Make weight precision log backend-specific: roofline uses it
for step time; trained-roofline/crossmodel only for KV capacity.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* style: gofmt struct/var alignment after WeightBytesPerParam addition

Convergence R3: gofmt -w to re-align ModelConfig struct fields and
cmd/root.go var block after longest field name changed.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* style: gofmt const/map alignment in kv_capacity files

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: document MFU calibration approximation for quantized models

Add code comment in rooflineStepTime noting that MFU values were
calibrated against FP16 measurements. For quantized models (W4A16),
reduced weight bandwidth shifts the roofline crossover, producing
conservative estimates — safe for capacity planning.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

- Use strings.EqualFold for FP8 quant_method matching (case-insensitive)
- Add test for FP8 with bits=8 present (verifies bits-first path agrees)
- Document why WeightBytesPerParam > BytesPerParam is accepted in validation

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat(latency): complete three-tier quantized weight detection, remove CLI flag

Add compressed-tensors parsing (config_groups.*.weights.num_bits) to close
the gap where w8a8 models silently fell back to torch_dtype. Add model name
convention detection (w4a16, fp8) as a second-tier fallback. Remove the
--weight-bytes-per-param CLI flag — all three issue inference-sim#443 options are now
covered by auto-detection, making the manual override unnecessary.

Detection chain: quantization_config → model name → torch_dtype fallback.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

C1: Extract applyWeightPrecisionFallback shared helper (cmd/hfconfig.go)
    and add model name fallback to cmd/replay.go roofline path (was missing).
    Three call sites (root.go blackbox, root.go roofline, replay.go) now
    share identical fallback + logging logic.

C2: Sort config_groups keys before iteration in compressed-tensors
    parsing (INV-6 determinism).

I2: Add string-to-int coercion for quantization_config.bits field
    (handles "bits": "4" from some HF configs).

I3: Add TestRooflineStepTime_W4A16_LowerThanFP16_MemoryBoundDecode
    end-to-end test verifying quantized model produces lower decode
    step time than FP16 in memory-bound regime.

I4: Add TestValidateRooflineConfig_InfWeightBytesPerParam_ReturnsError
    covering +Inf validation gap.

Minor: Remove double blank line in root.go init().

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: update quantization support language across guides, references, and extension recipes

Five documentation pages still claimed quantization was "not yet modeled" and
recommended blackbox mode, contradicting the three-tier auto-detection merged
in this branch. Addresses PR inference-sim#698 review comments D1–D5.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): warn when trained-roofline ignores quantized weight precision (I4)

Trained-roofline hardcodes FP16 bytesPerElement to match its training
pipeline, so WeightBytesPerParam only affects KV capacity, not step time.
Previously the CLI logged "quantized model detected" without mentioning
this limitation, which was misleading. Now emits an explicit warning in
both run and replay paths.

Addresses review item I4 from PR inference-sim#698 convergence review.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: address cosmetic review comments

- Add bitsPerByte constant to replace magic number 8.0 (issue #6)
- Improve roofline approximation comment with quantitative guidance (issue #8)

* Address PR inference-sim#698 review feedback: refactor quantization parsing and improve code quality

- Extract quantization parsing into parseQuantizationConfig helper (Issue #4)
- Consolidate duplicated trained-roofline warning into warnTrainedRooflineQuantization helper (Issue inference-sim#1)
- Add validation warning when WeightBytesPerParam > BytesPerParam (Issue #2)
- Add debug logging for malformed compressed-tensors config_groups (Issue #3)
- Add error handling for invalid bits string-to-int coercion (Issue #5)
- Add comment explaining first-match semantics in compressed-tensors parsing (Issue #7)
- Fix inconsistent spacing in string concatenation (Issue #9)

All changes maintain backward compatibility and pass existing tests.

* feat(sim): add L40S GPU and FP8 compute support

- Add L40S hardware configuration (362.05 TFLOPS, 48GB, 0.864 TB/s)
- Add TFlopsFP8 field to HardwareCalib for native FP8 tensor core support
- Update H100 with TFlopsFP8=1979.0 (2× FP16 rate) and adjusted MFU values
- Update A100-SXM and A100-80 with TFlopsFP8=0 (no native FP8 support)
- Implement FP8 compute selection in roofline model based on weight precision
- Add comprehensive tests for FP8 compute selection logic

Fixes inference-sim#762

Co-Authored-By: Claude <[email protected]>

* docs: add MFU justification and validation tests

- Add inline documentation to hardware_config.json linking to Discussion inference-sim#589
- Add comprehensive MFU validation tests in sim/config_test.go
  - Validates MFU ranges (0 < MFU < 1)
  - Validates MfuDecode < MfuPrefill relationship
  - Tests all GPU types (H100, A100, L40S)
- Update docs/reference/models.md with MFU calibration info box
- All tests pass
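
The two MFU checks listed above can be sketched as a small validator. This is only an illustration: `validateMFU` is a hypothetical helper, the struct carries just the two fields named in the bullets, and the numeric values in `main` are made up rather than read from hardware_config.json.

```go
package main

import "fmt"

// HardwareCalib carries only the two MFU fields named above; the real
// struct has more fields that are omitted here.
type HardwareCalib struct {
	MfuPrefill float64
	MfuDecode  float64
}

// validateMFU is a hypothetical helper. The negated comparisons also
// reject NaN, because every comparison with NaN is false.
func validateMFU(gpu string, hw HardwareCalib) error {
	if !(hw.MfuPrefill > 0 && hw.MfuPrefill < 1) {
		return fmt.Errorf("%s: MfuPrefill %v outside (0, 1)", gpu, hw.MfuPrefill)
	}
	if !(hw.MfuDecode > 0 && hw.MfuDecode < 1) {
		return fmt.Errorf("%s: MfuDecode %v outside (0, 1)", gpu, hw.MfuDecode)
	}
	if !(hw.MfuDecode < hw.MfuPrefill) {
		return fmt.Errorf("%s: MfuDecode %v must be below MfuPrefill %v",
			gpu, hw.MfuDecode, hw.MfuPrefill)
	}
	return nil
}

func main() {
	// Illustrative values only.
	fmt.Println(validateMFU("ok", HardwareCalib{MfuPrefill: 0.55, MfuDecode: 0.30}))
	fmt.Println(validateMFU("swapped", HardwareCalib{MfuPrefill: 0.30, MfuDecode: 0.55}))
	fmt.Println(validateMFU("out-of-range", HardwareCalib{MfuPrefill: 1.2, MfuDecode: 0.30}))
}
```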

Addresses review findings from quick-review

* feat(latency): add TFlopsFP8 validation (R3)

Add validation for HardwareCalib.TFlopsFP8 in ValidateRooflineConfig,
following the same optional-field pattern as MemoryGiB: check if non-zero,
then reject NaN/Inf/negative values.

Includes test coverage for NaN and negative TFlopsFP8 cases.
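
A minimal sketch of the optional-field pattern described above, assuming zero means "not set". The helper name `validateOptionalTFlops` is hypothetical and is not the repository's ValidateRooflineConfig.

```go
package main

import (
	"fmt"
	"math"
)

// validateOptionalTFlops accepts zero as "field omitted" and otherwise
// requires a finite, non-negative value.
func validateOptionalTFlops(name string, v float64) error {
	if v == 0 {
		return nil // optional field left unset
	}
	if math.IsNaN(v) || math.IsInf(v, 0) || v < 0 {
		return fmt.Errorf("%s must be finite and non-negative, got %v", name, v)
	}
	return nil
}

func main() {
	for _, v := range []float64{0, 1979.0, -1, math.NaN(), math.Inf(1)} {
		fmt.Printf("TFlopsFP8=%v -> %v\n", v, validateOptionalTFlops("TFlopsFP8", v))
	}
}
```

NaN compares unequal to zero, so it falls through the first check and is caught by `math.IsNaN`.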

* fix: address review nits

- roofline.go:217: Fix comment wording from 'upcasted' to 'dequantized to FP16 during GEMM'
- hardware_config.json:8: Restore H100 mfuDecode to 0.30 for consistency with other entries

* docs(roofline): clarify FP8 compute rate selection logic

Improve comment to explicitly document that == 1.0 is intentional:
- FP8 models (exactly 1.0 byte/param) use FP8 compute rate on H100
- Sub-FP8 formats (e.g., W4A16 at 0.5 bytes/param) dequantize to FP16 during GEMM

This addresses the review comment about == 1.0 exact equality. The behavior
is correct: only true FP8 models use FP8 tensor cores. W4A16 and other
sub-FP8 formats use FP16 compute after dequantization, as validated by
TestRooflineStepTime_FP8ComputeSelection_EdgeCases.
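
The selection rule can be sketched as follows. This is an illustration under assumptions: the `TFlopsFP16` field name and the `peakTFlops` helper are hypothetical, and falling back to the FP16 rate when `TFlopsFP8` is zero is inferred from the "no native FP8 support" entries rather than confirmed by the log. The H100 FP16 figure is derived from the stated 2x relationship; the A100 figure is illustrative.

```go
package main

import "fmt"

// HardwareCalib mirrors only the fields needed for this sketch.
type HardwareCalib struct {
	TFlopsFP16 float64 // hypothetical field name
	TFlopsFP8  float64 // 0 means no native FP8 tensor cores
}

// peakTFlops picks the compute rate for a given weight precision.
// Only exactly 1.0 byte/param selects the FP8 rate; sub-FP8 formats
// such as W4A16 (0.5) are dequantized and run at the FP16 rate.
func peakTFlops(hw HardwareCalib, weightBytesPerParam float64) float64 {
	if weightBytesPerParam == 1.0 && hw.TFlopsFP8 > 0 {
		return hw.TFlopsFP8
	}
	return hw.TFlopsFP16
}

func main() {
	h100 := HardwareCalib{TFlopsFP16: 989.5, TFlopsFP8: 1979.0}
	a100 := HardwareCalib{TFlopsFP16: 312.0, TFlopsFP8: 0}

	fmt.Println("H100, FP8 weights:  ", peakTFlops(h100, 1.0))
	fmt.Println("H100, W4A16 weights:", peakTFlops(h100, 0.5))
	fmt.Println("A100, FP8 weights:  ", peakTFlops(a100, 1.0))
}
```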

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
sriumcp added a commit that referenced this pull request Mar 20, 2026
* feat(infra): Phase 1A — nodes, GPUs, and instance lifecycle

Add node/GPU placement, instance lifecycle management, multi-model
routing, and per-model metrics to the cluster simulator.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* chore: add AGENTS.md to .gitignore

- AGENTS.md is a generated file for agent context
- Should not be committed to the repository

* fix(cluster): prevent double-counting in drain redirect policy

Fixes INV-1 conservation violation when DrainRedirect policy re-injects
queued requests. Previously, redirected requests were counted twice in
CompletedRequests: once when initially injected and again when completed
after redirection.

Changes:
- Add Request.Redirected field to track re-injected requests
- Mark requests as redirected in drainRedirect.Drain() before re-injection
- Skip CompletedRequests increment in recordRequestCompletion() for redirected requests
- Add TestInstanceLifecycle_RedirectDrainPreservesConservation to verify fix

The fix ensures INV-1 (request conservation) holds:
  injected = completed + queued + running + dropped + timed_out
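
As a rough illustration, the conservation equation can be checked with a few lines of Go. The `Counters` struct and `CheckINV1` are hypothetical names for this sketch, not the simulator's types.

```go
package main

import "fmt"

// Counters is a hypothetical snapshot of request accounting.
type Counters struct {
	Injected, Completed, Queued, Running, Dropped, TimedOut int
}

// CheckINV1 returns an error when the right-hand side of the equation
// does not add up to the number of injected requests.
func CheckINV1(c Counters) error {
	sum := c.Completed + c.Queued + c.Running + c.Dropped + c.TimedOut
	if c.Injected != sum {
		return fmt.Errorf("INV-1 violated: injected=%d, accounted=%d", c.Injected, sum)
	}
	return nil
}

func main() {
	ok := Counters{Injected: 10, Completed: 7, Queued: 1, Running: 1, Dropped: 1}
	fmt.Println(CheckINV1(ok))

	doubleCounted := ok
	doubleCounted.Completed++ // one request counted as completed twice
	fmt.Println(CheckINV1(doubleCounted))
}
```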

Addresses PR #697 review feedback on drain redirect conservation.

* fix: address PR #697 review comments

- Document DrainPolicy as Phase 1C infrastructure
- Fix warm-up request overcounting by checking warmUpRemaining
- Fix drain callback memory leak in MarkNodeTerminated
- Add named constants for lifecycle event priorities
- Clarify drainWait GPU release and swap-remove logic
- Add .bob/ to .gitignore

* Fix warm-up TTFT penalty implementation

- Initialize warmUpRemaining for all instances in backward-compat mode
- Fix indentation in cluster_event.go
- Update warm-up recording to track first N requests
- Adjust test expectations to account for queueing effects

Fixes build and test failures in PR.

* chore: update .gitignore to ignore .bob/notes instead of entire .bob/ directory

* feat(latency): decouple quantized weight precision from compute dtype (#698)

* feat(latency): decouple quantized weight precision from compute dtype (#443)

Roofline and KV capacity calculations now correctly use quantized weight
precision (e.g. 0.5 bytes/param for W4A16) for weight memory while keeping
the compute dtype (e.g. 2.0 for bfloat16) for KV cache and activations.

- Add WeightBytesPerParam field to ModelConfig with zero-value sentinel
- Parse quantization_config from HuggingFace config.json (GPTQ, AWQ, FP8)
- Add --weight-bytes-per-param CLI flag for manual override
- Update calculateMemoryAccessBytes() and computeModelWeightBytes()
- Add validation in ValidateRooflineConfig
- 18 new tests covering all behavioral contracts
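
To see why the weight precision matters for step time, here is a textbook roofline sketch. It is not the repository's rooflineStepTime: function names and all numbers are illustrative, the FLOP count uses the common approximation of 2 FLOPs per parameter per token, and KV cache and activation traffic are left out.

```go
package main

import "fmt"

// rooflineStepSec returns the classic roofline bound: a step takes as
// long as the slower of arithmetic and memory traffic.
func rooflineStepSec(flops, bytesMoved, peakFlopsPerSec, bytesPerSec float64) float64 {
	compute := flops / peakFlopsPerSec
	memory := bytesMoved / bytesPerSec
	if compute > memory {
		return compute
	}
	return memory
}

// decodeStepSec models one single-token decode step in which every
// weight is read once at weightBytesPerParam bytes.
func decodeStepSec(params, weightBytesPerParam, peakFlopsPerSec, bytesPerSec float64) float64 {
	return rooflineStepSec(2*params, params*weightBytesPerParam, peakFlopsPerSec, bytesPerSec)
}

func main() {
	const (
		params    = 7e9  // illustrative 7B-parameter model
		peak      = 1e15 // illustrative 1 PFLOP/s
		bandwidth = 2e12 // illustrative 2 TB/s
	)
	fp16 := decodeStepSec(params, 2.0, peak, bandwidth)
	w4a16 := decodeStepSec(params, 0.5, peak, bandwidth)
	fmt.Printf("FP16 weights:  %.2f ms\n", fp16*1e3)
	fmt.Printf("W4A16 weights: %.2f ms\n", w4a16*1e3)
}
```

In this toy setup decode is memory-bound, so shrinking the weight bytes shrinks the step time in the same proportion.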

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): convergence R1 — blackbox path parity, KV validation, quant warnings

Address 3 IMPORTANT findings from convergence review Round 1:

1. Apply --weight-bytes-per-param CLI override in blackbox KV auto-calc
   path (R23: code path parity with analytical backend)
2. Add WeightBytesPerParam validation in CalculateKVBlocks public API
3. Warn when quantization_config is present but weight precision could
   not be auto-detected (both analytical and blackbox paths)

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): convergence R2 — early flag validation, backend-specific logs

Move --weight-bytes-per-param validation before backend switch to avoid
Fatalf inside the best-effort blackbox block (preserves fall-through
contract). Make weight precision log backend-specific: roofline uses it
for step time; trained-roofline/crossmodel only for KV capacity.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* style: gofmt struct/var alignment after WeightBytesPerParam addition

Convergence R3: gofmt -w to re-align ModelConfig struct fields and
cmd/root.go var block after longest field name changed.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* style: gofmt const/map alignment in kv_capacity files

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: document MFU calibration approximation for quantized models

Add code comment in rooflineStepTime noting that MFU values were
calibrated against FP16 measurements. For quantized models (W4A16),
reduced weight bandwidth shifts the roofline crossover, producing
conservative estimates — safe for capacity planning.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

- Use strings.EqualFold for FP8 quant_method matching (case-insensitive)
- Add test for FP8 with bits=8 present (verifies bits-first path agrees)
- Document why WeightBytesPerParam > BytesPerParam is accepted in validation

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat(latency): complete three-tier quantized weight detection, remove CLI flag

Add compressed-tensors parsing (config_groups.*.weights.num_bits) to close
the gap where w8a8 models silently fell back to torch_dtype. Add model name
convention detection (w4a16, fp8) as a second-tier fallback. Remove the
--weight-bytes-per-param CLI flag — all three issue #443 options are now
covered by auto-detection, making the manual override unnecessary.

Detection chain: quantization_config → model name → torch_dtype fallback.
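
A minimal sketch of the three-tier chain, assuming the quantization config has already been reduced to a single `bits` value. The function names are hypothetical, and the model-name patterns are limited to the two conventions the message mentions (w4a16, fp8). The `bits` value is accepted as either a JSON number or a numeric string.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const bitsPerByte = 8.0

// bitsToBytes converts a decoded JSON "bits" value into bytes per parameter.
func bitsToBytes(v any) (float64, error) {
	switch b := v.(type) {
	case float64: // encoding/json decodes JSON numbers as float64
		if b <= 0 {
			return 0, fmt.Errorf("invalid bits %v", b)
		}
		return b / bitsPerByte, nil
	case string:
		n, err := strconv.Atoi(strings.TrimSpace(b))
		if err != nil || n <= 0 {
			return 0, fmt.Errorf("invalid bits %q", b)
		}
		return float64(n) / bitsPerByte, nil
	default:
		return 0, fmt.Errorf("unsupported bits type %T", v)
	}
}

// detectWeightBytes tries quantization bits, then the model name, then
// falls back to the bytes implied by torch_dtype.
func detectWeightBytes(quantBits any, modelName string, dtypeBytes float64) float64 {
	if quantBits != nil {
		if b, err := bitsToBytes(quantBits); err == nil {
			return b
		}
	}
	name := strings.ToLower(modelName)
	switch {
	case strings.Contains(name, "w4a16"):
		return 0.5
	case strings.Contains(name, "fp8"):
		return 1.0
	}
	return dtypeBytes
}

func main() {
	fmt.Println(detectWeightBytes("4", "some-model", 2.0))       // tier 1: string bits
	fmt.Println(detectWeightBytes(nil, "Some-Model-W4A16", 2.0)) // tier 2: name
	fmt.Println(detectWeightBytes(nil, "some-model", 2.0))       // tier 3: dtype
}
```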

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

C1: Extract applyWeightPrecisionFallback shared helper (cmd/hfconfig.go)
    and add model name fallback to cmd/replay.go roofline path (was missing).
    Three call sites (root.go blackbox, root.go roofline, replay.go) now
    share identical fallback + logging logic.

C2: Sort config_groups keys before iteration in compressed-tensors
    parsing (INV-6 determinism).

I2: Add string-to-int coercion for quantization_config.bits field
    (handles "bits": "4" from some HF configs).

I3: Add TestRooflineStepTime_W4A16_LowerThanFP16_MemoryBoundDecode
    end-to-end test verifying quantized model produces lower decode
    step time than FP16 in memory-bound regime.

I4: Add TestValidateRooflineConfig_InfWeightBytesPerParam_ReturnsError
    covering +Inf validation gap.

Minor: Remove double blank line in root.go init().

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: update quantization support language across guides, references, and extension recipes

Five documentation pages still claimed quantization was "not yet modeled" and
recommended blackbox mode, contradicting the three-tier auto-detection merged
in this branch. Addresses PR #698 review comments D1–D5.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): warn when trained-roofline ignores quantized weight precision (I4)

Trained-roofline hardcodes FP16 bytesPerElement to match its training
pipeline, so WeightBytesPerParam only affects KV capacity, not step time.
Previously the CLI logged "quantized model detected" without mentioning
this limitation, which was misleading. Now emits an explicit warning in
both run and replay paths.

Addresses review item I4 from PR #698 convergence review.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: address cosmetic review comments

- Add bitsPerByte constant to replace magic number 8.0 (issue #6)
- Improve roofline approximation comment with quantitative guidance (issue #8)

* Address PR #698 review feedback: refactor quantization parsing and improve code quality

- Extract quantization parsing into parseQuantizationConfig helper (Issue #4)
- Consolidate duplicated trained-roofline warning into warnTrainedRooflineQuantization helper (Issue #1)
- Add validation warning when WeightBytesPerParam > BytesPerParam (Issue #2)
- Add debug logging for malformed compressed-tensors config_groups (Issue #3)
- Add error handling for invalid bits string-to-int coercion (Issue #5)
- Add comment explaining first-match semantics in compressed-tensors parsing (Issue #7)
- Fix inconsistent spacing in string concatenation (Issue #9)

All changes maintain backward compatibility and pass existing tests.

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>

* feat(sim): add L40S GPU and FP8 compute support (#765)

* feat(latency): decouple quantized weight precision from compute dtype (#443)

Roofline and KV capacity calculations now correctly use quantized weight
precision (e.g. 0.5 bytes/param for W4A16) for weight memory while keeping
the compute dtype (e.g. 2.0 for bfloat16) for KV cache and activations.

- Add WeightBytesPerParam field to ModelConfig with zero-value sentinel
- Parse quantization_config from HuggingFace config.json (GPTQ, AWQ, FP8)
- Add --weight-bytes-per-param CLI flag for manual override
- Update calculateMemoryAccessBytes() and computeModelWeightBytes()
- Add validation in ValidateRooflineConfig
- 18 new tests covering all behavioral contracts

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): convergence R1 — blackbox path parity, KV validation, quant warnings

Address 3 IMPORTANT findings from convergence review Round 1:

1. Apply --weight-bytes-per-param CLI override in blackbox KV auto-calc
   path (R23: code path parity with analytical backend)
2. Add WeightBytesPerParam validation in CalculateKVBlocks public API
3. Warn when quantization_config is present but weight precision could
   not be auto-detected (both analytical and blackbox paths)

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): convergence R2 — early flag validation, backend-specific logs

Move --weight-bytes-per-param validation before backend switch to avoid
Fatalf inside the best-effort blackbox block (preserves fall-through
contract). Make weight precision log backend-specific: roofline uses it
for step time; trained-roofline/crossmodel only for KV capacity.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* style: gofmt struct/var alignment after WeightBytesPerParam addition

Convergence R3: gofmt -w to re-align ModelConfig struct fields and
cmd/root.go var block after longest field name changed.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* style: gofmt const/map alignment in kv_capacity files

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: document MFU calibration approximation for quantized models

Add code comment in rooflineStepTime noting that MFU values were
calibrated against FP16 measurements. For quantized models (W4A16),
reduced weight bandwidth shifts the roofline crossover, producing
conservative estimates — safe for capacity planning.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

- Use strings.EqualFold for FP8 quant_method matching (case-insensitive)
- Add test for FP8 with bits=8 present (verifies bits-first path agrees)
- Document why WeightBytesPerParam > BytesPerParam is accepted in validation

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat(latency): complete three-tier quantized weight detection, remove CLI flag

Add compressed-tensors parsing (config_groups.*.weights.num_bits) to close
the gap where w8a8 models silently fell back to torch_dtype. Add model name
convention detection (w4a16, fp8) as a second-tier fallback. Remove the
--weight-bytes-per-param CLI flag — all three issue #443 options are now
covered by auto-detection, making the manual override unnecessary.

Detection chain: quantization_config → model name → torch_dtype fallback.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

C1: Extract applyWeightPrecisionFallback shared helper (cmd/hfconfig.go)
    and add model name fallback to cmd/replay.go roofline path (was missing).
    Three call sites (root.go blackbox, root.go roofline, replay.go) now
    share identical fallback + logging logic.

C2: Sort config_groups keys before iteration in compressed-tensors
    parsing (INV-6 determinism).

I2: Add string-to-int coercion for quantization_config.bits field
    (handles "bits": "4" from some HF configs).

I3: Add TestRooflineStepTime_W4A16_LowerThanFP16_MemoryBoundDecode
    end-to-end test verifying quantized model produces lower decode
    step time than FP16 in memory-bound regime.

I4: Add TestValidateRooflineConfig_InfWeightBytesPerParam_ReturnsError
    covering +Inf validation gap.

Minor: Remove double blank line in root.go init().

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: update quantization support language across guides, references, and extension recipes

Five documentation pages still claimed quantization was "not yet modeled" and
recommended blackbox mode, contradicting the three-tier auto-detection merged
in this branch. Addresses PR #698 review comments D1–D5.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): warn when trained-roofline ignores quantized weight precision (I4)

Trained-roofline hardcodes FP16 bytesPerElement to match its training
pipeline, so WeightBytesPerParam only affects KV capacity, not step time.
Previously the CLI logged "quantized model detected" without mentioning
this limitation, which was misleading. Now emits an explicit warning in
both run and replay paths.

Addresses review item I4 from PR #698 convergence review.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: address cosmetic review comments

- Add bitsPerByte constant to replace magic number 8.0 (issue #6)
- Improve roofline approximation comment with quantitative guidance (issue #8)

* Address PR #698 review feedback: refactor quantization parsing and improve code quality

- Extract quantization parsing into parseQuantizationConfig helper (Issue #4)
- Consolidate duplicated trained-roofline warning into warnTrainedRooflineQuantization helper (Issue #1)
- Add validation warning when WeightBytesPerParam > BytesPerParam (Issue #2)
- Add debug logging for malformed compressed-tensors config_groups (Issue #3)
- Add error handling for invalid bits string-to-int coercion (Issue #5)
- Add comment explaining first-match semantics in compressed-tensors parsing (Issue #7)
- Fix inconsistent spacing in string concatenation (Issue #9)

All changes maintain backward compatibility and pass existing tests.

* feat(sim): add L40S GPU and FP8 compute support

- Add L40S hardware configuration (362.05 TFLOPS, 48GB, 0.864 TB/s)
- Add TFlopsFP8 field to HardwareCalib for native FP8 tensor core support
- Update H100 with TFlopsFP8=1979.0 (2× FP16 rate) and adjusted MFU values
- Update A100-SXM and A100-80 with TFlopsFP8=0 (no native FP8 support)
- Implement FP8 compute selection in roofline model based on weight precision
- Add comprehensive tests for FP8 compute selection logic

Fixes #762

Co-Authored-By: Claude <[email protected]>

* docs: add MFU justification and validation tests

- Add inline documentation to hardware_config.json linking to Discussion #589
- Add comprehensive MFU validation tests in sim/config_test.go
  - Validates MFU ranges (0 < MFU < 1)
  - Validates MfuDecode < MfuPrefill relationship
  - Tests all GPU types (H100, A100, L40S)
- Update docs/reference/models.md with MFU calibration info box
- All tests pass

Addresses review findings from quick-review

* feat(latency): add TFlopsFP8 validation (R3)

Add validation for HardwareCalib.TFlopsFP8 in ValidateRooflineConfig,
following the same optional-field pattern as MemoryGiB: check if non-zero,
then reject NaN/Inf/negative values.

Includes test coverage for NaN and negative TFlopsFP8 cases.

* fix: address review nits

- roofline.go:217: Fix comment wording from 'upcasted' to 'dequantized to FP16 during GEMM'
- hardware_config.json:8: Restore H100 mfuDecode to 0.30 for consistency with other entries

* docs(roofline): clarify FP8 compute rate selection logic

Improve comment to explicitly document that == 1.0 is intentional:
- FP8 models (exactly 1.0 byte/param) use FP8 compute rate on H100
- Sub-FP8 formats (e.g., W4A16 at 0.5 bytes/param) dequantize to FP16 during GEMM

This addresses the review comment about == 1.0 exact equality. The behavior
is correct: only true FP8 models use FP8 tensor cores. W4A16 and other
sub-FP8 formats use FP16 compute after dequantization, as validated by
TestRooflineStepTime_FP8ComputeSelection_EdgeCases.

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>

* fix(cluster): address sriumcp review — conservation, replay anomaly, non-variadic CollectRawMetrics

- sim/simulator.go: Remove incorrect `if !req.Redirected` guard on CompletedRequests++.
  The guard caused redirected requests to vanish from INV-1 accounting: source's
  InjectedRequests=0 (drained from WaitQ before completion), destination's
  InjectedRequests=0 (skipped CompletedRequests). Destination is the sole
  completion site so incrementing there preserves conservation.

- cmd/replay.go: Add `|| rawMetrics.RoutingRejections > 0` to anomaly condition.
  Clusters where all failures are routing rejections (no routable instances)
  silently omitted the anomaly summary block (I3 from sriumcp review).

- sim/cluster/metrics.go: Make CollectRawMetrics routingRejections parameter
  explicit (non-variadic). Prevents call sites from silently passing 0 and
  missing routing rejections. Updated all test call sites to pass 0 explicitly.
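
The difference between the variadic and explicit signatures can be illustrated with stand-in functions. These are not the real CollectRawMetrics (which takes other arguments not shown in this log); `collectVariadic`, `collectExplicit`, and the one-field `RawMetrics` are hypothetical.

```go
package main

import "fmt"

// RawMetrics is reduced to the single field relevant here.
type RawMetrics struct {
	RoutingRejections int
}

// collectVariadic compiles even when the caller forgets the argument,
// so a missing count silently becomes 0.
func collectVariadic(routingRejections ...int) RawMetrics {
	m := RawMetrics{}
	if len(routingRejections) > 0 {
		m.RoutingRejections = routingRejections[0]
	}
	return m
}

// collectExplicit forces every call site to state the count, even if
// that count is 0.
func collectExplicit(routingRejections int) RawMetrics {
	return RawMetrics{RoutingRejections: routingRejections}
}

func main() {
	fmt.Println(collectVariadic().RoutingRejections)  // forgotten argument goes unnoticed
	fmt.Println(collectExplicit(3).RoutingRejections) // caller must pass a value
}
```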

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* revert(ci): remove skip-cache from golangci-lint step

Unrelated to Phase 1A changes. skip-cache: true was added during
development but should not be merged to main.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* fix(cluster): address sriumcp Round 2 — test, stale comment, duplicate Metrics.Requests

- sim/cluster/instance_lifecycle_test.go: Rewrite conservation test to
  actually seed requests into inst0's WaitQ before drain. Previous version
  used a manual event loop on an empty queue — DrainWaitQueue() returned []
  and no redirection ever occurred. New version uses inst0.sim.EnqueueRequest
  directly + empty workload so Run() doesn't push duplicate arrivals. Also
  adds pre/post assertions: QueueDepth==0, inFlightRequests==0, and
  clusterEvents non-empty after drain.

- sim/request.go: Update Redirected field comment to reflect actual behavior.
  Previous comment said "completion accounting is skipped" — opposite of what
  simulator.go:recordRequestCompletion now does.

- sim/cluster/infra_lifecycle_event.go: Delete stale Metrics.Requests entry
  for redirected requests before re-injection. Source registered the request
  at EnqueueRequest time; DrainWaitQueue empties WaitQ but left the entry.
  Destination re-registers on re-enqueue, causing a spurious "duplicate
  request ID" WARN in aggregateMetrics() for every redirected request.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* docs: address sriumcp documentation gaps (GAP 1-4)

GAP 1 — configuration.md: Add node_pools and instance_lifecycle YAML
schema to the Policy Bundle section so users can discover and configure
Phase 1A features. Both are YAML-only (no CLI flags). Add a note block
explaining backward compatibility. Update DeploymentConfig row in the
config-to-flag mapping table to note the YAML-only fields.

GAP 2 — results.md Anomaly Counters: Rename "Rejected Requests" to
"Rejected Requests (Admission)" to match actual CLI output label. Add
new "Rejected Requests (Routing)" row explaining when it fires (no
routable instances — all Loading/Draining) and the remediation action.

GAP 3 — results.md Per-Model Metrics: Change mean= to p50= in the
example output block to match printPerModelMetrics which uses m.TTFT.P50.
Add tok/s to the Throughput example line to match actual output format.

GAP 4 — results.md per_model JSON: Add table documenting the per_model
key in --results-path JSON output (omitted when no model tags present),
with field-by-field description.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

---------

Co-authored-by: tantawi <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>
Co-authored-by: Dipanwita Guhathakurta <[email protected]>
Co-authored-by: Srinivasan Parthasarathy <[email protected]>
sriumcp pushed a commit that referenced this pull request Mar 23, 2026
…st flow (PR2) (#805)

* feat(sim/cluster): PD disaggregation — end-to-end disaggregated request flow (PR2)

Working PD pipeline. After this PR, users can run:

  blis run --prefill-instances 2 --decode-instances 2 --pd-decider always

Requests flow through the full prefill → KV transfer → decode lifecycle.
When pools are not configured (default), behavior is identical to today.

New files:
- sim/cluster/pd_events.go: PrefillRoutingEvent (priority 4),
  KVTransferStartedEvent (5), KVTransferCompletedEvent (6),
  DecodeRoutingEvent (7). Simple bandwidth calculation.
- sim/cluster/disaggregation_test.go: Integration tests — always-disaggregate
  E2E, pool conservation, prefill-to-decode lifecycle, phase causality,
  transfer conservation, determinism, backward compatibility, per-pool scorers.
- examples/pd-disaggregation-demo.yaml: Annotated demo configuration.

Modified files:
- sim/cluster/deployment.go: PD config fields (transfer bandwidth, base
  latency, KV bytes per token, per-pool scorer configs).
- sim/cluster/cluster.go: PD state (poolMembership, disaggregationDecider,
  parentRequests, pendingPrefillCompletions, pendingDecodeCompletions,
  transfer counters, per-pool routing policies). Updated constructor
  and Run() with prefill/decode completion detection.
- sim/cluster/cluster_event.go: DisaggregationDecisionEvent (priority 3).
  Bifurcated AdmissionDecisionEvent for pool-configured clusters.
- sim/cluster/instance.go: AllocateTransferredKV(), InjectDecodeOnline().
- sim/simulator.go: EnqueueDecodeSubRequest() — bypasses guards for
  pre-allocated KV, work-conserving (INV-8).
- sim/batch_formation.go: Decode-only batch path in FormBatch() Phase 2.
- cmd/root.go: CLI flags for PD configuration.
- docs/contributing/standards/invariants.md: INV-PD-1 through INV-PD-4.
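
The message only says the transfer events use a "simple bandwidth calculation". One plausible form, assumed here purely for illustration, is a fixed base latency plus payload size divided by bandwidth. The struct, field names, and units below are hypothetical and are not taken from deployment.go.

```go
package main

import "fmt"

// TransferConfig groups the three PD transfer settings listed above
// under hypothetical names and units.
type TransferConfig struct {
	BandwidthBytesPerSec float64
	BaseLatencySec       float64
	KVBytesPerToken      float64
}

// transferSec estimates how long moving the KV cache for a prompt of
// the given length takes under the assumed linear model.
func transferSec(cfg TransferConfig, tokens int) float64 {
	payload := float64(tokens) * cfg.KVBytesPerToken
	return cfg.BaseLatencySec + payload/cfg.BandwidthBytesPerSec
}

func main() {
	cfg := TransferConfig{
		BandwidthBytesPerSec: 1e9,   // illustrative 1 GB/s link
		BaseLatencySec:       0.001, // illustrative 1 ms setup cost
		KVBytesPerToken:      1000,  // illustrative
	}
	for _, n := range []int{0, 1000, 8000} {
		fmt.Printf("%5d tokens -> %.4f s\n", n, transferSec(cfg, n))
	}
}
```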

Part of #793. Depends on PR1 (#794, merged).
Closes #795.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix(lint): remove redundant nil check before len() on map

staticcheck S1009: len() for nil maps is defined as zero.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix(sim/cluster): address PR review — INV-1 conservation, R8, dead code, invariant doc

C1 (Critical): Fix CompletedRequests double-counting in disaggregated mode.
Each disaggregated request produces two sub-requests (prefill + decode) that
complete on separate instances. aggregateMetrics() naively sums completions,
yielding 2N for N requests. Track pdPrefillCompletedCount and subtract after
aggregation to restore correct user-visible completion count.

C2 (Critical): Add TestDisaggregation_INV1Conservation to verify
CompletedRequests == N and full conservation equation in disaggregated mode.

I1 (Important): ParentRequests() now returns a defensive copy of the map
(R8: no exported mutable maps), matching PoolMembership() pattern.

I2 (Important): Replace structurally dead INV-PD-1 defensive check with a
comment documenting the structural guarantee. DecodeEnqueueTime and
TransferCompleteTime are both set from the same event timestamp by
construction (KVTransferCompletedEvent schedules DecodeRoutingEvent at
e.time), so the inequality can never fire.

I3 (Important): Add INV-PD-5 (Pool Stability) to invariants.md — was
referenced in test but absent from the canonical invariant document.

I4 (Minor): Remove unused eventTime parameter from InjectDecodeOnline.
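
The defensive-copy idea in I1 can be sketched as below. The types are hypothetical stand-ins with a single field, and the sketch also copies each struct value, which goes one step beyond what I1 itself states.

```go
package main

import "fmt"

// ParentRequest is a stand-in with one lifecycle timestamp; the field
// type is an assumption.
type ParentRequest struct {
	TransferCompleteTime int64
}

// Cluster keeps its map unexported so callers only see copies.
type Cluster struct {
	parentRequests map[string]*ParentRequest
}

// ParentRequests returns a fresh map whose values are copies, so
// neither map edits nor field writes reach the internal state.
func (c *Cluster) ParentRequests() map[string]*ParentRequest {
	out := make(map[string]*ParentRequest, len(c.parentRequests))
	for id, v := range c.parentRequests {
		cp := *v
		out[id] = &cp
	}
	return out
}

func main() {
	c := &Cluster{parentRequests: map[string]*ParentRequest{
		"req-1": {TransferCompleteTime: 100},
	}}
	snapshot := c.ParentRequests()
	snapshot["req-1"].TransferCompleteTime = 999 // mutates only the copy
	delete(snapshot, "req-1")                    // removes only from the copy

	fmt.Println(len(c.parentRequests), c.parentRequests["req-1"].TransferCompleteTime)
}
```

Pointer fields inside the struct would still be shared after `cp := *v`; only the struct's own fields are duplicated.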

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix(sim/cluster): address automated review — 5 issues (conservation, R8, INV-6, tests, MaxOutputLen)

1. (Critical 93) INV-1 at bounded horizon: track pdDecodeCompletedCount
   and compute pdInFlight = prefillCompleted - decodeCompleted - droppedAtDecodeKV.
   Add pdInFlight to StillRunning so requests mid-transfer aren't lost from
   conservation accounting. Added TestDisaggregation_INV1Conservation_BoundedHorizon.

2. (Critical 92) R8 deep copy: ParentRequests() now copies each ParentRequest
   struct (`cp := *v`) so callers cannot mutate internal lifecycle timestamps.

3. (Important 85) R2/INV-6 determinism: detectPrefillCompletions and
   detectDecodeCompletions now collect completed IDs into a sorted slice
   before processing, ensuring deterministic nextSeqID() assignment
   regardless of Go's random map iteration order.

4. (Important 82) Decode-only batch KV pressure test:
   TestDisaggregation_DecodeOnlyBatchKVPressure exercises the decode-only
   batch path under tight KV cache (50 blocks), verifying INV-1 conservation
   holds when decode-only requests cannot allocate KV.

5. (Important 80) MaxOutputLen parity: decode sub-request now copies
   MaxOutputLen from the original request, maintaining R23 parallel code
   path transformation parity with EnqueueRequest.
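
Item 3 relies on a common Go idiom: collect map keys, sort them, then process in that order. A minimal sketch, with hypothetical function names and a plain counter standing in for nextSeqID():

```go
package main

import (
	"fmt"
	"sort"
)

// sortedIDs returns the map keys in ascending order so follow-up work
// happens in the same order on every run.
func sortedIDs(completed map[string]struct{}) []string {
	ids := make([]string, 0, len(completed))
	for id := range completed {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

// assignSeqIDs hands out increasing sequence numbers in sorted-ID order.
func assignSeqIDs(completed map[string]struct{}, next int) map[string]int {
	out := make(map[string]int, len(completed))
	for _, id := range sortedIDs(completed) {
		out[id] = next
		next++
	}
	return out
}

func main() {
	done := map[string]struct{}{"req-3": {}, "req-1": {}, "req-2": {}}
	fmt.Println(sortedIDs(done))
	fmt.Println(assignSeqIDs(done, 100))
}
```

Ranging over the map directly would assign sequence numbers in an order that can differ between runs.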

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix(sim/cluster): address re-review — pdInTransfer double-count, decode KV drop test, R8 comment

F1 (Critical): Fix pdInFlight double-counting at bounded horizon. When a decode
sub-request has been injected into an instance but not yet completed, it appears
in the instance's StillQueued/StillRunning via Finalize(). The old formula
(pdPrefillCompleted - pdDecodeCompleted - droppedAtDecodeKV) counted these
requests as in-transfer, adding them to StillRunning a second time. Fix:
subtract len(pendingDecodeCompletions) which tracks decode sub-requests already
on instances but not yet completed.

F2 (Important): Update ParentRequests() comment to accurately note that
OriginalRequest is a shared pointer. Callers must not mutate via it.

F3 (Important): Add TestDisaggregation_DroppedAtDecodeKV that actually triggers
the droppedAtDecodeKV path. Uses 1 decode instance with 3 KV blocks (48 tokens)
and 20-token requests — second concurrent transfer fails AllocateTransferredKV.
Verifies droppedAtDecodeKV > 0 and INV-1 conservation holds.

F5 (Minor): Replace hardcoded count 5 with len(requests) in
TestDisaggregation_TransferConservation.
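
The F1 correction can be written out as arithmetic. The struct and function names below are hypothetical; the formulas follow the wording of F1, with the pending-decode count subtracted from the old expression.

```go
package main

import "fmt"

// pdCounters holds the four quantities named in F1.
type pdCounters struct {
	PrefillCompleted  int
	DecodeCompleted   int
	DroppedAtDecodeKV int
	PendingDecode     int // decode sub-requests already on an instance
}

// oldInFlight is the formula before the fix. It also counts requests
// that an instance already reports as queued or running.
func oldInFlight(c pdCounters) int {
	return c.PrefillCompleted - c.DecodeCompleted - c.DroppedAtDecodeKV
}

// inTransfer removes the requests already visible on an instance,
// leaving only those between prefill completion and decode injection.
func inTransfer(c pdCounters) int {
	return oldInFlight(c) - c.PendingDecode
}

func main() {
	c := pdCounters{PrefillCompleted: 10, DecodeCompleted: 4, DroppedAtDecodeKV: 1, PendingDecode: 3}
	fmt.Println("old formula:", oldInFlight(c))
	fmt.Println("corrected:  ", inTransfer(c))
}
```

With these numbers the old formula reports 5 while only 2 requests are actually in transfer; the other 3 would have been added to StillRunning a second time.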

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix(sim/cluster): address PR review — IsRoutable filter, Deadline propagation, graceful rejection

Critical #1: buildPoolFilteredSnapshots now filters by IsRoutable() for
parity with buildRouterState (R23). Prevents routing to Loading/Draining/
Terminated instances when PD disaggregation combines with Phase 1A lifecycle.

Critical #2: Propagate Deadline from original request to both prefill and
decode sub-requests. EnqueueDecodeSubRequest now schedules TimeoutEvent
when Deadline is set (R23: parity with EnqueueRequest). Also copy
MaxOutputLen and PrefixGroup to prefill sub-request for field completeness.

Important #3: Log warning when pdInTransfer is negative, mirroring the
existing inFlightRequests negative-check pattern. Surfaces bookkeeping
bugs instead of silently swallowing conservation gaps.

Important #4: Replace empty-pool panics in PrefillRoutingEvent and
DecodeRoutingEvent with graceful rejection (warn + routingRejections++
or droppedAtDecodeKV++). Now that IsRoutable() filtering can produce
empty pools at runtime, these are no longer pure programming errors.

Important #7: Qualify INV-PD-3 statement for bounded horizons — initiated
can exceed completed when transfers are in-flight at horizon.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix(sim/cluster): address re-review NA-1/NA-2 — model filter comment, conservation equation

NA-1: Update buildPoolFilteredSnapshots comment to document that model
filter is intentionally omitted (all instances in DeploymentConfig share
config.Model). Notes where to add it if multi-model PD clusters are added.

NA-2: Fix INV-1 conservation assertions to include TimedOutRequests per
the canonical INV-1 definition. Extract assertINV1Conservation helper
used by all 5 conservation tests. Document in pdInTransfer comment that
timed-out prefill sub-requests are already counted in instance
TimedOutRequests and need no correction.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>