docs(blog): Physics of simulation blog article by susiejojo · Pull Request #997 · inference-sim/inference-sim

susiejojo · 2026-04-09T18:45:27Z

Summary

Introduces "The Physics of High-Fidelity Inference Simulation" blog article explaining how BLIS achieves production-level accuracy by modeling the mechanisms that determine latency.

What This Article Covers

Building Fidelity from First Principles: BLIS uses discrete-event simulation to model inference serving, advancing from event to event rather than stepping through continuous time. This runs orders of magnitude faster than real-time and enables experimentation with mechanisms that don't yet exist in production.

Layer 1 - The Engine (vLLM): Explains how BLIS models continuous batching, mixed prefill-decode execution, KV cache management, and chunked prefill. Describes the trained-physics latency model that combines basis functions (computational physics) with learned coefficients (hardware-specific corrections), generalizing across LLM architectures, hardware, and tensor parallelism configurations.

Layer 2 - The Data Plane (Cluster Orchestration): Covers admission/flow control with GIE (Gateway Inference Engine) parity, routing with in-flight tracking and signal staleness (2-second cache blind spot matching llm-d's ZMQ propagation delay), and prefill/decode disaggregation with KV transfer modeling. Emphasizes that all mechanisms—admission policies, saturation detectors, routing scorers, disaggregation deciders—are pluggable interfaces for testing custom algorithms against production workloads.

Preview Instructions

To preview the blog locally:

# Install dependencies (first time only)
pip install mkdocs-material mkdocs-blog-plugin

# Serve locally
mkdocs serve

# Open http://127.0.0.1:8000/blog/ in your browser

The article is at docs/blog/posts/building-trust-physics-of-simulation.md.

Closes #993

🤖 Generated with Claude Code

mtoslalibu · 2026-04-10T14:10:17Z

Blog Post Review — Feedback

Great article overall — the deeper sections (Layer 1, Layer 2, Layer 3) read clearly and are easy to follow. A few suggestions, mostly focused on the intro and a couple of accuracy points:

Intro

"overspending by millions" — This feels like an exaggeration without backing. Consider replacing with a concrete, sourced number (e.g., from published cloud cost analyses or industry reports) to make the claim more credible.
"Inference engines process batches in lockstep" — "Lockstep" is jargon that may not land with all readers. Consider either defining it inline (e.g., "in lockstep — meaning all requests in a batch wait for the slowest operation to finish") or using simpler phrasing.
"Workload analysis — Evaluate architecture changes against realistic traffic" — This bullet is vague compared to the others. What does "architecture changes" mean here concretely? A brief example would help (e.g., "test how switching from TP=2 to TP=4 affects tail latency under production traffic patterns").
General note on the intro: Ironically, the rest of the article is much smoother and easier to understand than the intro, which is the most important section for drawing readers in. Consider simplifying the language and flow to match the clarity of the later sections.

Admission Control (Layer 2)

The current description implies strong parity between BLIS and llm-d on admission/flow control, but we're not fully there yet. Suggest simplifying this to focus on what BLIS actually models well today: saturation-based admit/reject decisions. Drop the flow control and token bucket details for now, or caveat them as in-progress.

"BLIS in Action" Example

The --routing-scorers "prefix-affinity:2,queue-depth:1" flag and the "prefix-aware routing" label appear to use a stale name. Replace with the actual current prefix router name to keep the example accurate and up to date.

Comprehensive revisions addressing PR #997 review feedback: 1. Title & Scope - Change to "Distributed Inference Platform Simulation" - Reframe opening: distributed systems, policy exploration, KPIs - Emphasize end-to-end simulation of routing, admission, autoscaling, engine 2. Remove Exaggerations & Jargon - "overspending by millions" → "wasting budget" - "lockstep" → "batches process together—all requests wait for slowest" - Make "workload analysis" concrete with TP degree example 3. Admission Control Simplification - Remove "token bucket" and "GIE architecture" overclaims - Focus on "saturation-based admit/reject" (what BLIS models today) - Update TL;DR, diagrams, prose to match current implementation 4. Request Journey Diagram - Add P/D disaggregation with Prefill Pool and Decode Pool - Show unified vs disaggregated paths with KV transfer - Metrics from all pools to control plane 5. BLIS in Action Clarity - Title: "Simulating a Configuration Decision" (not "Real Scenario") - "simulate these configurations" (not "test it") - "Simulated Results" with "Predicted P99 TTFT" column - Explicit validation note referencing next article All changes verified against codebase (routing_scorers.go, admission.go, saturation.go) to ensure claims match implementation. Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

Add comprehensive outline and writing guidelines for 7-min blog article on BLIS's structural integrity and physics modeling. Target audience: llm-d executives and platform engineers. Sections: - Opening: simulation trust problem - CPU-only breakthrough (physics + learning) - Request journey (engine, data plane, control plane) - Real scenario (PD disaggregation example) - Closing (tease validation article) Includes tone guidelines, coordination notes, and success criteria for multiple contributors. Issue: #993

Changes: - Update confrontational line to discovery framing (Option 2) - Remove standalone Section 2 (CPU-only breakthrough) - Integrate CPU-only approach into Section 1 (brief mention in thesis) - Expand Section 2A (engine level) with CPU-only technical details - Renumber all sections (Section 3 → Section 2, etc.) - Update transitions and coordination notes Total sections now: 4 (was 5) - Section 1: Opening (~350 words) - Section 2: Request Journey with 4 subsections (~1550 words) - Section 3: Real Scenario (~400 words) - Section 4: Closing (~400 words) Smoother narrative flow from problem → journey → example → closing.

Change from academic numbering (Section 2A, 2B, 2C, 2D) to more blog-friendly layer names: - Section 2A → Layer 1: The Engine (vLLM) - Section 2B → Layer 2: The Data Plane (Cluster Orchestration) - Section 2C → Layer 3: The Control Plane (Autoscaling) - Section 2D → The Complete Journey Updated coordination notes and references throughout.

- Create docs/blog/posts/building-trust-physics-of-simulation.md with: - Proper YAML frontmatter (date, authors, categories, draft flag) - Section 1 and Layer 1 content written - Placeholders for Layer 2, Layer 3, Complete Journey, and other sections - MkDocs Material blog format ( tag for excerpt) - Update outline to document MkDocs Material blog requirements: - YAML frontmatter structure - Authors from .authors.yml - Standard markdown heading hierarchy - Reference to existing blog post Ready for multi-author collaboration on remaining sections.

Replace informal contractions with formal equivalents: - isn't → is not - doesn't → does not - don't → do not - can't → cannot - Let's → Let us - Here's → Here is - we're → we are Blog should maintain professional tone suitable for technical executives and platform engineers.

Improvements to Layer 1 (The Engine): - Add bold subheadings for structure (batch-step paradigm, concrete example, why models fail, etc.) - Remove redundant transitions ("Let us start", "So what does", "Watch what happens") - More concise language (425 words vs 500 words) - Clear takeaway box at end - Consistent format: statement → example → implication Maintains technical accuracy while improving readability for both executives and platform engineers.

Rewrite Layer 1 as flowing narrative paragraphs instead of choppy bold-header format. Keep all concrete examples (four requests, 2ms vs 20ms batch evolution). Better transitions between ideas while maintaining same word count and technical accuracy.

…tions

…allowed list

…mode

…n-real-time claim

…ds for visibility in both modes

…e theme colors

…andle theming

…m layers

…k mode

…nge section title

…d like reference blog

…lity in both modes

…curacy section, remove PD disagg example

…ing dark mode

Add ~500 words covering cluster-level orchestration through four gates: - Gate 1: Admission control (token bucket rate limiting) - Gate 2: Gateway queue with saturation-gated dispatch (BLIS innovation) - Gate 3: Routing with three-tier signal freshness hierarchy - Gate 4: P/D orchestration with KV transfer modeling Key additions: - llm-d parity woven naturally (default routing profile, ZMQ delay) - BLIS innovation highlighted (gateway queue 20-40% TTFT improvement) - Signal staleness explained with concrete 2s blind spot example - Compound latency tax demonstrated (2ms + 5ms + 20ms = 27ms) Maintains accessible-but-rigorous tone matching Layer 1, with concrete numbers and clear "why it matters" conclusion. Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

Simplified vLLM mechanism description and latency model explanation: - Removed duplicate listing of mechanisms (vLLM uses X, BLIS models X) - Added inline explanations for each mechanism (continuous batching, mixed prefill-decode, etc.) - Simplified basis function explanation from three sentences to two - Emphasized generalization across LLM architectures, hardware, and TP configs - Clarified that accurate forward pass predictions drive end-to-end latency metrics Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

Major restructuring of Layer 2 section: - Added pluggable interfaces intro (admission policies, saturation detectors, routing scorers, disaggregation deciders) - Combined admission + flow control into single gate with GIE (Gateway Inference Engine) parity - Condensed routing from 3 paragraphs to 1, focusing on burst arrival challenge and BLIS mechanisms (in-flight tracking, signal staleness with 2s ZMQ delay) - Added prefill/decode disaggregation section explaining compute-bound vs memory-bound separation, KV transfer pipeline, bandwidth contention - Removed made-up latency numbers, complex saturation formulas, and repetitive explanations - Updated mermaid diagram to show full pipeline including disaggregation branch Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

Changed "model" to "LLM" in basis function formula and explanation for clarity. Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

Removed parenthetical explanations for compute-bound and memory-bound to reduce verbosity. Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

Added concise TL;DR blockquotes at the start of Layer 1 (vLLM) and Layer 2 (Data Plane) sections to make the content more accessible for executive readers. Each summary highlights the core problem, what BLIS models, and how it works. Layer 1 emphasizes batch coupling, vLLM pipeline modeling, and CPU-only latency prediction. Layer 2 showcases llm-d GIE architecture parity with composable weighted routing, nine scoring dimensions, and production orchestration features. Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

Rewrote the "Building Fidelity" section to accurately represent what BLIS does: - Added explicit **capacity planning** and **configuration search** as key use cases - Replaced "mechanisms that do not yet exist" with accurate description of policy/config experimentation within vLLM+llm-d architecture - Condensed repetitive paragraphs while keeping scannable bullet points - Added links to vLLM and llm-d when first mentioned - Added ADRS (AI-Driven Research Systems) link as concrete use case example - Clarified hybrid approach: physics-based dynamics + learnable latency components Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

Completed the blog post with three major additions: **Layer 3: The Control Plane (Autoscaling)** - Added TL;DR explaining autoscaling feedback loop costs (30+ min experiments) - Described autoscaling purpose: balancing SLOs vs. GPU costs - Detailed WVA four-stage pipeline (Collector/Analyzer/Optimizer/Actuator) - Emphasized pluggable interfaces for autoscaling policy research **BLIS in Action: A Real Scenario** - Concrete example: testing prefix-aware routing vs. load-balanced routing - Showed actual CLI commands (observe/replay workflow) - Revealed trade-offs: prefix-aware wins at moderate load, degrades during bursts - Value proposition: test 10 configs in minutes vs. weeks of production A/B tests **From Modeling to Validation** - Recap of three-layer architecture and hybrid modeling approach - Validation teaser: tested against production workloads and commercial simulators - Next article promise: detailed methodology and accuracy results Additional improvements: - Removed "The Complete Journey" section (redundant with BLIS in Action) - Added ADRS (AI-Driven Research Systems) link as use case example - Refined all TL;DR sections for clarity and consistency - Shortened "From Modeling to Validation" section to avoid over-claiming Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

- Fix equation description: map φᵢ parameters 1:1 (batch, LLM, hardware) - Replace complex routing staleness example with simpler hardware comparison - Show H100 round-robin vs prefix-aware (7% improvement) vs A100 (4× slower) - Use verified simulation results: 8 instances, TP=2, chatbot workload at 50 req/s - Emphasize hierarchy: routing optimization (7%) vs hardware choice (4×) Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

Comprehensive revisions addressing PR #997 review feedback: 1. Title & Scope - Change to "Distributed Inference Platform Simulation" - Reframe opening: distributed systems, policy exploration, KPIs - Emphasize end-to-end simulation of routing, admission, autoscaling, engine 2. Remove Exaggerations & Jargon - "overspending by millions" → "wasting budget" - "lockstep" → "batches process together—all requests wait for slowest" - Make "workload analysis" concrete with TP degree example 3. Admission Control Simplification - Remove "token bucket" and "GIE architecture" overclaims - Focus on "saturation-based admit/reject" (what BLIS models today) - Update TL;DR, diagrams, prose to match current implementation 4. Request Journey Diagram - Add P/D disaggregation with Prefill Pool and Decode Pool - Show unified vs disaggregated paths with KV transfer - Metrics from all pools to control plane 5. BLIS in Action Clarity - Title: "Simulating a Configuration Decision" (not "Real Scenario") - "simulate these configurations" (not "test it") - "Simulated Results" with "Predicted P99 TTFT" column - Explicit validation note referencing next article All changes verified against codebase (routing_scorers.go, admission.go, saturation.go) to ensure claims match implementation. Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

- "can't" → "cannot" for formal tone - "—" → " - " for dash consistency Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

…tion Replace mathematical equation with intuitive flowchart showing: - 📊 Inputs: batch state, model architecture, hardware specs - ⚡ Physics-Based Roofline: compute and memory bottlenecks - 🎯 Learned Corrections: trained on real vLLM traces - ⏱️ Output: predicted step time Changes: - Remove equation notation (Σ βᵢ φᵢ) for accessibility - Use "Compute Operations / GPU Speed" instead of "FLOPs / TFLOPs" - Use "Bytes Transferred / GPU Bandwidth" for clarity - Add emojis for visual appeal (work in both light/dark modes) - Bold subgraph headers for emphasis Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

Tightened prose throughout while preserving key points: 1. Building Fidelity section - Removed redundant listing of platform components (already in opening) - Condensed capabilities to parenthetical format 2. Layer 1 - Simplified "What BLIS captures" by removing redundant explanations - Changed "physics-based roofline" → "physics-based latency model" 3. Layer 2 - Removed paragraph that repeated TL;DR content verbatim - Kept only TL;DR with essential details 4. Layer 3 - Tightened intro to avoid repeating timing details - "Why simulate?" → "What BLIS enables" (more direct) - Removed repetitive compression time claims 5. Physics-based dynamics - Split long sentence into punchier short sentences - Removed redundant "before production deployment" 6. Conclusion - Condensed to avoid repeating platform components - Merged validation tease into single concise statement Result: ~15% shorter with better flow, no circular repetition. Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

Remove "validation against production systems (the topic of our next article) confirms BLIS accuracy" - already covered in conclusion. Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

Italicize "What does it take to build a simulator accurate enough to guide these decisions?" to emphasize the central question. Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

jgchn · 2026-04-13T15:27:59Z

+
+## Building Fidelity from First Principles
+
+[BLIS](https://github.com/inference-sim/inference-sim) (Blackbox Inference Simulator) models inference serving through discrete-event simulation, advancing from event to event rather than stepping through continuous time. This approach runs orders of magnitude faster than real-time, requires no GPUs, and evaluates hours of production traffic in seconds.


Suggested change

[BLIS](https://github.com/inference-sim/inference-sim) (Blackbox Inference Simulator) models inference serving through discrete-event simulation, advancing from event to event rather than stepping through continuous time. This approach runs orders of magnitude faster than real-time, requires no GPUs, and evaluates hours of production traffic in seconds.

[BLIS](https://github.com/inference-sim/inference-sim) (Blackbox Inference Simulator) models inference serving through discrete-event simulation, advancing from event to event rather than stepping through continuous time. Also known as discrete-even simulation, this approach runs orders of magnitude faster than real-time, requires no GPUs, and evaluates hours of production traffic in seconds.

jgchn · 2026-04-13T15:30:23Z

+
+BLIS uses discrete-event simulation to model all three layers. This full-stack fidelity enables **capacity planning** (instance count, GPU type, TP degree) and **configuration search** (routing weights, admission thresholds). Without modeling distributed system couplings, planners predict linear scaling where production saturates, miss SLO violations from routing pile-on, or deploy autoscalers that oscillate.
+
+By modeling production systems ([vLLM](https://github.com/vllm-project/vllm), [llm-d](https://llm-d.ai)) behavior, BLIS enables safe experimentation before deployment:


Suggested change

By modeling production systems ([vLLM](https://github.com/vllm-project/vllm), [llm-d](https://llm-d.ai)) behavior, BLIS enables safe experimentation before deployment:

By modeling the behavior of production systems at the server ([vLLM](https://github.com/vllm-project/vllm) and platform [llm-d](https://llm-d.ai)) layers, BLIS enables safe experimentation before deployment:

jgchn · 2026-04-13T15:34:21Z

+- **Capacity planning** — Compare model/GPU/TP configurations
+- **Workload analysis** — Test how switching from TP=2 to TP=4 affects tail latency under production traffic patterns
+
+Physics-based dynamics with learnable latency components generalize across model architectures and hardware while maintaining production fidelity. Test new configurations on a laptop in seconds without needing production infrastructure. This enables projects like [ADRS](https://sky.cs.berkeley.edu/project/adrs/) (AI-Driven Research Systems) to develop and validate serving policies through fast simulation loops.


ADRS... or AI-native systems?

ADRS/AI-native systems should be "frameworks" or "paradigms" rather than "projects," wdyt?

jgchn · 2026-04-13T16:15:50Z

+
+## A Request's Journey: The Hidden Complexity
+
+A user hits enter, and fifty milliseconds later the first token appears. What happened in between? Three architectural layers working together: the inference engine (vLLM), the data plane (cluster orchestration), and the control plane (autoscaling), all of which high-fidelity simulation must model.


Nit: previously you used "50ms" and "200ms", but you're spelling it out here. Maybe just stick to 50ms

jgchn · 2026-04-13T16:16:26Z

+
+A user hits enter, and fifty milliseconds later the first token appears. What happened in between? Three architectural layers working together: the inference engine (vLLM), the data plane (cluster orchestration), and the control plane (autoscaling), all of which high-fidelity simulation must model.
+
+```mermaid


Is it possible to also color code each layer?

The color coding gets mixed up really bad between light/dark mode. Claude tried multiple ways using css/js injection but it wouldn't look good in the dark mode, so I went with flat colors.

jgchn · 2026-04-13T16:17:18Z

+    end
+
+    Request([Request]) --> Layer2
+    PD -->|Unified| Layer1


Unified or "Aggregate"?

jgchn · 2026-04-13T16:18:19Z

+    subgraph Layer3["Layer 3: Control Plane"]
+        Monitor[Monitor Metrics]
+        Decide[Scale Decision]
+        Actuate[Add/Remove Instances]
+        Monitor --> Decide --> Actuate
+    end


Should control plane link to Layer 1 for scaling? Rn it only links to Layer 2

jgchn · 2026-04-13T16:24:21Z

+    F -->|KV Transfer| G[Decode Pool]
+```
+
+**Admission control** determines whether requests enter the system. BLIS models saturation-based admit/reject decisions: when cluster load exceeds thresholds, incoming requests are rejected or queued rather than overwhelming instances. This prevents queue explosion during traffic spikes and avoids pile-on where burst arrivals flood the same "best" instance.


Suggested change

**Admission control** determines whether requests enter the system. BLIS models saturation-based admit/reject decisions: when cluster load exceeds thresholds, incoming requests are rejected or queued rather than overwhelming instances. This prevents queue explosion during traffic spikes and avoids pile-on where burst arrivals flood the same "best" instance.

**Admission control** determines whether requests enter the system. BLIS models saturation-based admit/reject decisions, the default behavior in llm-d: when cluster load exceeds thresholds, incoming requests are rejected or queued rather than overwhelming instances. This prevents queue explosion during traffic spikes and avoids pile-on where burst arrivals flood the same "best" instance.

jgchn · 2026-04-13T16:25:53Z

+
+Autoscaling dynamically adjusts instance count to match demand. In production, this happens through feedback loops where HPA scrapes, Kubernetes scheduling, VM provisioning, and model loading all add latency before a new replica serves traffic.
+
+**What BLIS captures.** BLIS models llm-d's WVA four-stage pipeline — Collect, Analyze, Optimize, Actuate, with pluggable interfaces. **Collector** observes per-replica metrics, **Analyzer** detects saturation and emits scaling signals, **Optimizer** decides which GPU types to add/remove respecting multi-model inventory constraints, and **Actuator** applies decisions with configurable delay.


Maybe link to the llm-d-workload-variant-autoscaler repp here

susiejojo changed the title ~~docs(blog): Layer 1 and Layer 2 content for physics of simulation article~~ docs(blog): Physics of simulation blog article Apr 9, 2026

susiejojo marked this pull request as draft April 9, 2026 18:47

susiejojo added 26 commits April 10, 2026 14:34

docs(blog): replace per-request equation with BLIS latency basis func…

72437c9

…tions

fix(blog): change category from 'Deep Dive' to 'Deep Dives' to match …

34a63f8

…allowed list

feat(docs): add MathJax support and fix mermaid text visibility

dd335be

docs(blog): add all authors to blog post

9b1c5ac

fix(docs): ensure mermaid text is black on light mode, white on dark …

501471a

…mode

fix(docs): use more specific CSS selectors for mermaid text visibility

ae78c9e

fix(blog): add explicit black text color to all mermaid diagram elements

458b8a2

fix(blog): use mermaid theme init to force black text in diagram

80d0448

fix(docs): make mermaid text adapt to both light and dark modes

327f0e3

docs(blog): tighten Layer 1 for clarity and readability

ee489ac

docs(blog): introduce discrete-event simulation early with faster-tha…

54df669

…n-real-time claim

fix(blog): use mermaid theme init with explicit black text for all modes

a419780

fix(blog): remove theme init forcing black text, use darker backgroun…

126376f

…ds for visibility in both modes

fix(docs): remove all mermaid text color overrides, let mermaid handl…

eff6966

…e theme colors

fix(blog): remove all mermaid style directives, let MkDocs Material h…

b4790fe

…andle theming

docs(blog): add light background colors to distinguish mermaid diagra…

f1df061

…m layers

docs(blog): use much lighter background colors for mermaid layers

3cce651

fix(docs): force black text on mermaid diagrams in both light and dar…

7852f77

…k mode

fix(blog): use classDef to force black text on all mermaid nodes, cha…

475357b

…nge section title

susiejojo and others added 21 commits April 10, 2026 14:34

fix(blog): remove all mermaid style/color overrides, use plain mermai…

9c35edc

…d like reference blog

docs(blog): add darker background colors to mermaid layers for visibi…

0781432

…lity in both modes

docs(blog): use much darker background colors for better visibility

ebd7eeb

docs(blog): refocus article on 'what it takes', add physics-beyond-ac…

3b14ad6

…curacy section, remove PD disagg example

fix(docs): add JavaScript to force black text in mermaid diagrams dur…

04afdcd

…ing dark mode

docs(blog): clarify LLM terminology in Layer 1 latency model

4e79118

Changed "model" to "LLM" in basis function formula and explanation for clarity. Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

docs(blog): simplify prefill/decode disaggregation description

7d1a58d

Removed parenthetical explanations for compute-bound and memory-bound to reduce verbosity. Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

chore: remove non-blog files from PR

adda9b2

chore: restore model_configs from main

77abe3f

docs(blog): minor style fixes

e72dc75

- "can't" → "cannot" for formal tone - "—" → " - " for dash consistency Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

docs(blog): remove redundant validation reference

4bbef83

Remove "validation against production systems (the topic of our next article) confirms BLIS accuracy" - already covered in conclusion. Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

susiejojo force-pushed the feat-blog2a branch from 315a58b to 4bbef83 Compare April 10, 2026 18:41

docs(blog): italicize key question for emphasis

9d05cf0

Italicize "What does it take to build a simulator accurate enough to guide these decisions?" to emphasize the central question. Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

sriumcp marked this pull request as ready for review April 13, 2026 16:17

Merge branch 'main' into feat-blog2a

c4c0853

sriumcp merged commit 1ea9e5f into main Apr 13, 2026
4 checks passed

jgchn reviewed Apr 13, 2026

View reviewed changes

This was referenced Apr 13, 2026

docs(blog): publish physics of simulation blog post #1030

Merged

docs(blog): address final review comments on physics blog post #1031

Merged

susiejojo deleted the feat-blog2a branch April 20, 2026 21:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs(blog): Physics of simulation blog article#997

docs(blog): Physics of simulation blog article#997
sriumcp merged 49 commits intomainfrom
feat-blog2a

susiejojo commented Apr 9, 2026 •

edited

Loading

Uh oh!

mtoslalibu commented Apr 10, 2026

Uh oh!

Uh oh!

jgchn Apr 13, 2026

Uh oh!

jgchn Apr 13, 2026

Uh oh!

jgchn Apr 13, 2026

Uh oh!

jgchn Apr 13, 2026

Uh oh!

jgchn Apr 13, 2026

Uh oh!

jgchn Apr 13, 2026

Uh oh!

susiejojo Apr 13, 2026

Uh oh!

jgchn Apr 13, 2026

Uh oh!

jgchn Apr 13, 2026

Uh oh!

jgchn Apr 13, 2026

Uh oh!

jgchn Apr 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants


		## Building Fidelity from First Principles

		[BLIS](https://github.com/inference-sim/inference-sim) (Blackbox Inference Simulator) models inference serving through discrete-event simulation, advancing from event to event rather than stepping through continuous time. This approach runs orders of magnitude faster than real-time, requires no GPUs, and evaluates hours of production traffic in seconds.


		BLIS uses discrete-event simulation to model all three layers. This full-stack fidelity enables capacity planning (instance count, GPU type, TP degree) and configuration search (routing weights, admission thresholds). Without modeling distributed system couplings, planners predict linear scaling where production saturates, miss SLO violations from routing pile-on, or deploy autoscalers that oscillate.

		By modeling production systems ([vLLM](https://github.com/vllm-project/vllm), [llm-d](https://llm-d.ai)) behavior, BLIS enables safe experimentation before deployment:

	By modeling production systems ([vLLM](https://github.com/vllm-project/vllm), [llm-d](https://llm-d.ai)) behavior, BLIS enables safe experimentation before deployment:
	By modeling the behavior of production systems at the server ([vLLM](https://github.com/vllm-project/vllm) and platform [llm-d](https://llm-d.ai)) layers, BLIS enables safe experimentation before deployment:


		## A Request's Journey: The Hidden Complexity

		A user hits enter, and fifty milliseconds later the first token appears. What happened in between? Three architectural layers working together: the inference engine (vLLM), the data plane (cluster orchestration), and the control plane (autoscaling), all of which high-fidelity simulation must model.


		A user hits enter, and fifty milliseconds later the first token appears. What happened in between? Three architectural layers working together: the inference engine (vLLM), the data plane (cluster orchestration), and the control plane (autoscaling), all of which high-fidelity simulation must model.

		```mermaid

	Admission control determines whether requests enter the system. BLIS models saturation-based admit/reject decisions: when cluster load exceeds thresholds, incoming requests are rejected or queued rather than overwhelming instances. This prevents queue explosion during traffic spikes and avoids pile-on where burst arrivals flood the same "best" instance.
	Admission control determines whether requests enter the system. BLIS models saturation-based admit/reject decisions, the default behavior in llm-d: when cluster load exceeds thresholds, incoming requests are rejected or queued rather than overwhelming instances. This prevents queue explosion during traffic spikes and avoids pile-on where burst arrivals flood the same "best" instance.


		Autoscaling dynamically adjusts instance count to match demand. In production, this happens through feedback loops where HPA scrapes, Kubernetes scheduling, VM provisioning, and model loading all add latency before a new replica serves traffic.

		What BLIS captures. BLIS models llm-d's WVA four-stage pipeline — Collect, Analyze, Optimize, Actuate, with pluggable interfaces. Collector observes per-replica metrics, Analyzer detects saturation and emits scaling signals, Optimizer decides which GPU types to add/remove respecting multi-model inventory constraints, and Actuator applies decisions with configurable delay.

Conversation

susiejojo commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What This Article Covers

Preview Instructions

Uh oh!

mtoslalibu commented Apr 10, 2026

Blog Post Review — Feedback

Intro

Admission Control (Layer 2)

"BLIS in Action" Example

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

susiejojo commented Apr 9, 2026 •

edited

Loading