docs(blog): Physics of simulation blog article#997
Conversation
Blog Post Review — FeedbackGreat article overall — the deeper sections (Layer 1, Layer 2, Layer 3) read clearly and are easy to follow. A few suggestions, mostly focused on the intro and a couple of accuracy points: Intro
Admission Control (Layer 2)
"BLIS in Action" Example
|
Comprehensive revisions addressing PR #997 review feedback: 1. Title & Scope - Change to "Distributed Inference Platform Simulation" - Reframe opening: distributed systems, policy exploration, KPIs - Emphasize end-to-end simulation of routing, admission, autoscaling, engine 2. Remove Exaggerations & Jargon - "overspending by millions" → "wasting budget" - "lockstep" → "batches process together—all requests wait for slowest" - Make "workload analysis" concrete with TP degree example 3. Admission Control Simplification - Remove "token bucket" and "GIE architecture" overclaims - Focus on "saturation-based admit/reject" (what BLIS models today) - Update TL;DR, diagrams, prose to match current implementation 4. Request Journey Diagram - Add P/D disaggregation with Prefill Pool and Decode Pool - Show unified vs disaggregated paths with KV transfer - Metrics from all pools to control plane 5. BLIS in Action Clarity - Title: "Simulating a Configuration Decision" (not "Real Scenario") - "simulate these configurations" (not "test it") - "Simulated Results" with "Predicted P99 TTFT" column - Explicit validation note referencing next article All changes verified against codebase (routing_scorers.go, admission.go, saturation.go) to ensure claims match implementation. Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Add comprehensive outline and writing guidelines for 7-min blog article on BLIS's structural integrity and physics modeling. Target audience: llm-d executives and platform engineers. Sections: - Opening: simulation trust problem - CPU-only breakthrough (physics + learning) - Request journey (engine, data plane, control plane) - Real scenario (PD disaggregation example) - Closing (tease validation article) Includes tone guidelines, coordination notes, and success criteria for multiple contributors. Issue: #993
Changes: - Update confrontational line to discovery framing (Option 2) - Remove standalone Section 2 (CPU-only breakthrough) - Integrate CPU-only approach into Section 1 (brief mention in thesis) - Expand Section 2A (engine level) with CPU-only technical details - Renumber all sections (Section 3 → Section 2, etc.) - Update transitions and coordination notes Total sections now: 4 (was 5) - Section 1: Opening (~350 words) - Section 2: Request Journey with 4 subsections (~1550 words) - Section 3: Real Scenario (~400 words) - Section 4: Closing (~400 words) Smoother narrative flow from problem → journey → example → closing.
Change from academic numbering (Section 2A, 2B, 2C, 2D) to more blog-friendly layer names: - Section 2A → Layer 1: The Engine (vLLM) - Section 2B → Layer 2: The Data Plane (Cluster Orchestration) - Section 2C → Layer 3: The Control Plane (Autoscaling) - Section 2D → The Complete Journey Updated coordination notes and references throughout.
- Create docs/blog/posts/building-trust-physics-of-simulation.md with: - Proper YAML frontmatter (date, authors, categories, draft flag) - Section 1 and Layer 1 content written - Placeholders for Layer 2, Layer 3, Complete Journey, and other sections - MkDocs Material blog format (<!-- more --> tag for excerpt) - Update outline to document MkDocs Material blog requirements: - YAML frontmatter structure - Authors from .authors.yml - Standard markdown heading hierarchy - Reference to existing blog post Ready for multi-author collaboration on remaining sections.
Replace informal contractions with formal equivalents: - isn't → is not - doesn't → does not - don't → do not - can't → cannot - Let's → Let us - Here's → Here is - we're → we are Blog should maintain professional tone suitable for technical executives and platform engineers.
Improvements to Layer 1 (The Engine):
- Add bold subheadings for structure (batch-step paradigm, concrete
example, why models fail, etc.)
- Remove redundant transitions ("Let us start", "So what does",
"Watch what happens")
- More concise language (425 words vs 500 words)
- Clear takeaway box at end
- Consistent format: statement → example → implication
Maintains technical accuracy while improving readability for both
executives and platform engineers.
Rewrite Layer 1 as flowing narrative paragraphs instead of choppy bold-header format. Keep all concrete examples (four requests, 2ms vs 20ms batch evolution). Better transitions between ideas while maintaining same word count and technical accuracy.
…n-real-time claim
…ds for visibility in both modes
…nge section title
…d like reference blog
…lity in both modes
…curacy section, remove PD disagg example
Add ~500 words covering cluster-level orchestration through four gates: - Gate 1: Admission control (token bucket rate limiting) - Gate 2: Gateway queue with saturation-gated dispatch (BLIS innovation) - Gate 3: Routing with three-tier signal freshness hierarchy - Gate 4: P/D orchestration with KV transfer modeling Key additions: - llm-d parity woven naturally (default routing profile, ZMQ delay) - BLIS innovation highlighted (gateway queue 20-40% TTFT improvement) - Signal staleness explained with concrete 2s blind spot example - Compound latency tax demonstrated (2ms + 5ms + 20ms = 27ms) Maintains accessible-but-rigorous tone matching Layer 1, with concrete numbers and clear "why it matters" conclusion. Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Simplified vLLM mechanism description and latency model explanation: - Removed duplicate listing of mechanisms (vLLM uses X, BLIS models X) - Added inline explanations for each mechanism (continuous batching, mixed prefill-decode, etc.) - Simplified basis function explanation from three sentences to two - Emphasized generalization across LLM architectures, hardware, and TP configs - Clarified that accurate forward pass predictions drive end-to-end latency metrics Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Major restructuring of Layer 2 section: - Added pluggable interfaces intro (admission policies, saturation detectors, routing scorers, disaggregation deciders) - Combined admission + flow control into single gate with GIE (Gateway Inference Engine) parity - Condensed routing from 3 paragraphs to 1, focusing on burst arrival challenge and BLIS mechanisms (in-flight tracking, signal staleness with 2s ZMQ delay) - Added prefill/decode disaggregation section explaining compute-bound vs memory-bound separation, KV transfer pipeline, bandwidth contention - Removed made-up latency numbers, complex saturation formulas, and repetitive explanations - Updated mermaid diagram to show full pipeline including disaggregation branch Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Changed "model" to "LLM" in basis function formula and explanation for clarity. Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Removed parenthetical explanations for compute-bound and memory-bound to reduce verbosity. Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Added concise TL;DR blockquotes at the start of Layer 1 (vLLM) and Layer 2 (Data Plane) sections to make the content more accessible for executive readers. Each summary highlights the core problem, what BLIS models, and how it works. Layer 1 emphasizes batch coupling, vLLM pipeline modeling, and CPU-only latency prediction. Layer 2 showcases llm-d GIE architecture parity with composable weighted routing, nine scoring dimensions, and production orchestration features. Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Rewrote the "Building Fidelity" section to accurately represent what BLIS does: - Added explicit **capacity planning** and **configuration search** as key use cases - Replaced "mechanisms that do not yet exist" with accurate description of policy/config experimentation within vLLM+llm-d architecture - Condensed repetitive paragraphs while keeping scannable bullet points - Added links to vLLM and llm-d when first mentioned - Added ADRS (AI-Driven Research Systems) link as concrete use case example - Clarified hybrid approach: physics-based dynamics + learnable latency components Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Completed the blog post with three major additions: **Layer 3: The Control Plane (Autoscaling)** - Added TL;DR explaining autoscaling feedback loop costs (30+ min experiments) - Described autoscaling purpose: balancing SLOs vs. GPU costs - Detailed WVA four-stage pipeline (Collector/Analyzer/Optimizer/Actuator) - Emphasized pluggable interfaces for autoscaling policy research **BLIS in Action: A Real Scenario** - Concrete example: testing prefix-aware routing vs. load-balanced routing - Showed actual CLI commands (observe/replay workflow) - Revealed trade-offs: prefix-aware wins at moderate load, degrades during bursts - Value proposition: test 10 configs in minutes vs. weeks of production A/B tests **From Modeling to Validation** - Recap of three-layer architecture and hybrid modeling approach - Validation teaser: tested against production workloads and commercial simulators - Next article promise: detailed methodology and accuracy results Additional improvements: - Removed "The Complete Journey" section (redundant with BLIS in Action) - Added ADRS (AI-Driven Research Systems) link as use case example - Refined all TL;DR sections for clarity and consistency - Shortened "From Modeling to Validation" section to avoid over-claiming Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
- Fix equation description: map φᵢ parameters 1:1 (batch, LLM, hardware) - Replace complex routing staleness example with simpler hardware comparison - Show H100 round-robin vs prefix-aware (7% improvement) vs A100 (4× slower) - Use verified simulation results: 8 instances, TP=2, chatbot workload at 50 req/s - Emphasize hierarchy: routing optimization (7%) vs hardware choice (4×) Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Comprehensive revisions addressing PR #997 review feedback: 1. Title & Scope - Change to "Distributed Inference Platform Simulation" - Reframe opening: distributed systems, policy exploration, KPIs - Emphasize end-to-end simulation of routing, admission, autoscaling, engine 2. Remove Exaggerations & Jargon - "overspending by millions" → "wasting budget" - "lockstep" → "batches process together—all requests wait for slowest" - Make "workload analysis" concrete with TP degree example 3. Admission Control Simplification - Remove "token bucket" and "GIE architecture" overclaims - Focus on "saturation-based admit/reject" (what BLIS models today) - Update TL;DR, diagrams, prose to match current implementation 4. Request Journey Diagram - Add P/D disaggregation with Prefill Pool and Decode Pool - Show unified vs disaggregated paths with KV transfer - Metrics from all pools to control plane 5. BLIS in Action Clarity - Title: "Simulating a Configuration Decision" (not "Real Scenario") - "simulate these configurations" (not "test it") - "Simulated Results" with "Predicted P99 TTFT" column - Explicit validation note referencing next article All changes verified against codebase (routing_scorers.go, admission.go, saturation.go) to ensure claims match implementation. Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
- "can't" → "cannot" for formal tone - "—" → " - " for dash consistency Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
…tion Replace mathematical equation with intuitive flowchart showing: - 📊 Inputs: batch state, model architecture, hardware specs - ⚡ Physics-Based Roofline: compute and memory bottlenecks - 🎯 Learned Corrections: trained on real vLLM traces - ⏱️ Output: predicted step time Changes: - Remove equation notation (Σ βᵢ φᵢ) for accessibility - Use "Compute Operations / GPU Speed" instead of "FLOPs / TFLOPs" - Use "Bytes Transferred / GPU Bandwidth" for clarity - Add emojis for visual appeal (work in both light/dark modes) - Bold subgraph headers for emphasis Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Tightened prose throughout while preserving key points: 1. Building Fidelity section - Removed redundant listing of platform components (already in opening) - Condensed capabilities to parenthetical format 2. Layer 1 - Simplified "What BLIS captures" by removing redundant explanations - Changed "physics-based roofline" → "physics-based latency model" 3. Layer 2 - Removed paragraph that repeated TL;DR content verbatim - Kept only TL;DR with essential details 4. Layer 3 - Tightened intro to avoid repeating timing details - "Why simulate?" → "What BLIS enables" (more direct) - Removed repetitive compression time claims 5. Physics-based dynamics - Split long sentence into punchier short sentences - Removed redundant "before production deployment" 6. Conclusion - Condensed to avoid repeating platform components - Merged validation tease into single concise statement Result: ~15% shorter with better flow, no circular repetition. Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Remove "validation against production systems (the topic of our next article) confirms BLIS accuracy" - already covered in conclusion. Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Italicize "What does it take to build a simulator accurate enough to guide these decisions?" to emphasize the central question. Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
|
|
||
| ## Building Fidelity from First Principles | ||
|
|
||
| [BLIS](https://github.com/inference-sim/inference-sim) (Blackbox Inference Simulator) models inference serving through discrete-event simulation, advancing from event to event rather than stepping through continuous time. This approach runs orders of magnitude faster than real-time, requires no GPUs, and evaluates hours of production traffic in seconds. |
There was a problem hiding this comment.
| [BLIS](https://github.com/inference-sim/inference-sim) (Blackbox Inference Simulator) models inference serving through discrete-event simulation, advancing from event to event rather than stepping through continuous time. This approach runs orders of magnitude faster than real-time, requires no GPUs, and evaluates hours of production traffic in seconds. | |
| [BLIS](https://github.com/inference-sim/inference-sim) (Blackbox Inference Simulator) models inference serving through discrete-event simulation, advancing from event to event rather than stepping through continuous time. Also known as discrete-even simulation, this approach runs orders of magnitude faster than real-time, requires no GPUs, and evaluates hours of production traffic in seconds. |
|
|
||
| BLIS uses discrete-event simulation to model all three layers. This full-stack fidelity enables **capacity planning** (instance count, GPU type, TP degree) and **configuration search** (routing weights, admission thresholds). Without modeling distributed system couplings, planners predict linear scaling where production saturates, miss SLO violations from routing pile-on, or deploy autoscalers that oscillate. | ||
|
|
||
| By modeling production systems ([vLLM](https://github.com/vllm-project/vllm), [llm-d](https://llm-d.ai)) behavior, BLIS enables safe experimentation before deployment: |
There was a problem hiding this comment.
| By modeling production systems ([vLLM](https://github.com/vllm-project/vllm), [llm-d](https://llm-d.ai)) behavior, BLIS enables safe experimentation before deployment: | |
| By modeling the behavior of production systems at the server ([vLLM](https://github.com/vllm-project/vllm) and platform [llm-d](https://llm-d.ai)) layers, BLIS enables safe experimentation before deployment: |
| - **Capacity planning** — Compare model/GPU/TP configurations | ||
| - **Workload analysis** — Test how switching from TP=2 to TP=4 affects tail latency under production traffic patterns | ||
|
|
||
| Physics-based dynamics with learnable latency components generalize across model architectures and hardware while maintaining production fidelity. Test new configurations on a laptop in seconds without needing production infrastructure. This enables projects like [ADRS](https://sky.cs.berkeley.edu/project/adrs/) (AI-Driven Research Systems) to develop and validate serving policies through fast simulation loops. |
There was a problem hiding this comment.
ADRS... or AI-native systems?
There was a problem hiding this comment.
ADRS/AI-native systems should be "frameworks" or "paradigms" rather than "projects," wdyt?
|
|
||
| ## A Request's Journey: The Hidden Complexity | ||
|
|
||
| A user hits enter, and fifty milliseconds later the first token appears. What happened in between? Three architectural layers working together: the inference engine (vLLM), the data plane (cluster orchestration), and the control plane (autoscaling), all of which high-fidelity simulation must model. |
There was a problem hiding this comment.
Nit: previously you used "50ms" and "200ms", but you're spelling it out here. Maybe just stick to 50ms
|
|
||
| A user hits enter, and fifty milliseconds later the first token appears. What happened in between? Three architectural layers working together: the inference engine (vLLM), the data plane (cluster orchestration), and the control plane (autoscaling), all of which high-fidelity simulation must model. | ||
|
|
||
| ```mermaid |
There was a problem hiding this comment.
Is it possible to also color code each layer?
There was a problem hiding this comment.
The color coding gets mixed up really bad between light/dark mode. Claude tried multiple ways using css/js injection but it wouldn't look good in the dark mode, so I went with flat colors.
| end | ||
|
|
||
| Request([Request]) --> Layer2 | ||
| PD -->|Unified| Layer1 |
| subgraph Layer3["Layer 3: Control Plane"] | ||
| Monitor[Monitor Metrics] | ||
| Decide[Scale Decision] | ||
| Actuate[Add/Remove Instances] | ||
| Monitor --> Decide --> Actuate | ||
| end |
There was a problem hiding this comment.
Should control plane link to Layer 1 for scaling? Rn it only links to Layer 2
| F -->|KV Transfer| G[Decode Pool] | ||
| ``` | ||
|
|
||
| **Admission control** determines whether requests enter the system. BLIS models saturation-based admit/reject decisions: when cluster load exceeds thresholds, incoming requests are rejected or queued rather than overwhelming instances. This prevents queue explosion during traffic spikes and avoids pile-on where burst arrivals flood the same "best" instance. |
There was a problem hiding this comment.
| **Admission control** determines whether requests enter the system. BLIS models saturation-based admit/reject decisions: when cluster load exceeds thresholds, incoming requests are rejected or queued rather than overwhelming instances. This prevents queue explosion during traffic spikes and avoids pile-on where burst arrivals flood the same "best" instance. | |
| **Admission control** determines whether requests enter the system. BLIS models saturation-based admit/reject decisions, the default behavior in llm-d: when cluster load exceeds thresholds, incoming requests are rejected or queued rather than overwhelming instances. This prevents queue explosion during traffic spikes and avoids pile-on where burst arrivals flood the same "best" instance. |
|
|
||
| Autoscaling dynamically adjusts instance count to match demand. In production, this happens through feedback loops where HPA scrapes, Kubernetes scheduling, VM provisioning, and model loading all add latency before a new replica serves traffic. | ||
|
|
||
| **What BLIS captures.** BLIS models llm-d's WVA four-stage pipeline — Collect, Analyze, Optimize, Actuate, with pluggable interfaces. **Collector** observes per-replica metrics, **Analyzer** detects saturation and emits scaling signals, **Optimizer** decides which GPU types to add/remove respecting multi-model inventory constraints, and **Actuator** applies decisions with configurable delay. |
There was a problem hiding this comment.
Maybe link to the llm-d-workload-variant-autoscaler repp here
Summary
Introduces "The Physics of High-Fidelity Inference Simulation" blog article explaining how BLIS achieves production-level accuracy by modeling the mechanisms that determine latency.
What This Article Covers
Building Fidelity from First Principles: BLIS uses discrete-event simulation to model inference serving, advancing from event to event rather than stepping through continuous time. This runs orders of magnitude faster than real-time and enables experimentation with mechanisms that don't yet exist in production.
Layer 1 - The Engine (vLLM): Explains how BLIS models continuous batching, mixed prefill-decode execution, KV cache management, and chunked prefill. Describes the trained-physics latency model that combines basis functions (computational physics) with learned coefficients (hardware-specific corrections), generalizing across LLM architectures, hardware, and tensor parallelism configurations.
Layer 2 - The Data Plane (Cluster Orchestration): Covers admission/flow control with GIE (Gateway Inference Engine) parity, routing with in-flight tracking and signal staleness (2-second cache blind spot matching llm-d's ZMQ propagation delay), and prefill/decode disaggregation with KV transfer modeling. Emphasizes that all mechanisms—admission policies, saturation detectors, routing scorers, disaggregation deciders—are pluggable interfaces for testing custom algorithms against production workloads.
Preview Instructions
To preview the blog locally:
The article is at
docs/blog/posts/building-trust-physics-of-simulation.md.Closes #993
🤖 Generated with Claude Code