
feat(blog): add LLM simulator showdown comparative analysis#944

Open
susiejojo wants to merge 27 commits into main from feat-blog-2

Conversation

Contributor

@susiejojo susiejojo commented Apr 3, 2026

Summary

Adds a comprehensive blog post comparing 5 LLM inference simulators across 38 real-world experiments.

What's Included

Blog Post: "The LLM Simulator Showdown: Which Tool Actually Delivers?"

  • Compares BLIS (Roofline + Evolved), Vidur, LLMServingSim, LLM-Optimizer, and AIConfigurator
  • 38 experiments across 6 models (Llama-3.1-8B, Qwen3-14B, CodeLlama-34B, Llama-2-70B, Mixtral-8x7B, Mixtral-8x22B)
  • 3 GPU types: H100, A100-80GB, L40S
  • 4 workload categories from ServeGen production traces (General-Purpose, Code Generation, Role-Playing, Reasoning)

Key Findings:

  • Coverage: BLIS 100% (38/38), LLM-Optimizer 94.7% (36/38), AIConfigurator 50% (19/38), Vidur 10.5% (4/38), LLMServingSim 2.6% (1/38)
  • Accuracy: BLIS-Evolved achieved 11.79% E2E MAPE and 22.81% TTFT MAPE across all experiments
  • Speed vs Accuracy: Pareto frontier analysis showing BLIS-Evolved balances accuracy and runtime
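
The MAPE figures above can be sanity-checked with a few lines; a minimal sketch of the metric itself (the `mape` helper and the sample latencies are illustrative, not taken from the benchmark):

```python
def mape(predicted, actual):
    """Mean absolute percentage error, in percent."""
    assert len(predicted) == len(actual) and actual
    return 100.0 * sum(abs(p - a) / abs(a)
                       for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical E2E latencies (seconds): simulator prediction vs. real measurement.
pred = [1.10, 2.30, 0.95]
real = [1.00, 2.50, 1.00]
print(round(mape(pred, real), 2))  # -> 7.67
```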

Figures:

  • Comparison charts for BLIS vs each competitor (4 figures)
  • Pareto frontier: speed vs accuracy trade-off
  • Runtime comparison table

Changes:

  • Add new blog post with MkDocs frontmatter
  • Add comparison figures (5 PNG files + 1 LaTeX table)
  • Add "Simulator Evaluation" and "Benchmarking" categories to mkdocs.yml
  • Fix Mert's avatar URL in .authors.yml

How to Test:

mkdocs serve

Closes #943

susiejojo and others added 11 commits April 3, 2026 15:38
- Compare 5 simulators: BLIS (Roofline + Evolved), Vidur, LLMServingSim,
  LLM-Optimizer, and AIConfigurator
- 38 experiments across 6 models (Llama, Qwen, Mixtral, CodeLlama)
- 3 GPU types (H100, A100-80GB, L40S)
- 4 workload categories from ServeGen traces
- Coverage analysis: BLIS 100%, others 2.6-94.7%
- Accuracy comparison: BLIS-Evolved 11.79% E2E MAPE vs competitors
- Include comparison figures and Pareto frontier analysis
- Add Simulator Evaluation and Benchmarking categories to mkdocs.yml
- Fix Mert's avatar URL in .authors.yml

Closes #943

Co-Authored-By: Claude <[email protected]>
…Showdown

- Rename file from llm-simulator-showdown.md to inference-simulator-showdown.md
- Update title from "The LLM Simulator Showdown" to "The Inference Simulator Showdown"
- Update opening line to "inference simulator" for broader applicability

Co-Authored-By: Claude <[email protected]>
- Convert all 5 figure captions to italicized academic format
- Expand captions with more descriptive technical details
- Maintain Figure 1-5 numbering and clear descriptions
- Format follows standard technical publication conventions

Co-Authored-By: Claude <[email protected]>
<a name="accuracy-and-coverage"></a>
## Accuracy and Coverage: Who Gets It Right?

Picture this: you are deploying Mixtral-8x7B on A100 nodes with TP=4. You have read the accuracy benchmarks, picked your simulator, fired it up, and it tells you it does not support MoE models! Or A100s. Or diverse TP configurations. Before you ever question the predictions, coverage gaps have already made the decision for you.
Collaborator


I think the hook can be more convincing. If the user has already read accuracy benchmarks (and can understand them), what's stopping them from just taking the best configuration and running it for real?


### Capacity Planning & Configuration Search

**Accuracy matters most.** A fast simulator with 50% error means wrong resource decisions—overprovision and waste budget, or underprovision and miss SLOs. BLIS-Evolved delivered 11.79% E2E error and 22.81% TTFT error across 38 experiments (Figure 1). Pure roofline models miss queueing delays and communication overhead—errors that compound when planning at scale.
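
To make the stakes concrete, a rough sketch of how a capacity estimate translates into replica counts (all numbers below are hypothetical, not from the benchmark):

```python
import math

TARGET_RPS = 1000      # hypothetical production load (requests/sec)
TRUE_CAPACITY = 100    # true requests/sec a single replica can sustain

def replicas_needed(predicted_capacity, target=TARGET_RPS):
    """Replicas you would provision given a simulator's capacity estimate."""
    return math.ceil(target / predicted_capacity)

print(replicas_needed(TRUE_CAPACITY))        # 10 replicas actually needed
print(replicas_needed(TRUE_CAPACITY * 1.5))  # 50% overestimate -> 7 replicas, SLO miss
print(replicas_needed(TRUE_CAPACITY * 0.5))  # 50% underestimate -> 20 replicas, ~2x cost
```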
Collaborator


Suggested change
**Accuracy matters most.** A fast simulator with 50% error means wrong resource decisions—overprovision and waste budget, or underprovision and miss SLOs. BLIS-Evolved delivered 11.79% E2E error and 22.81% TTFT error across 38 experiments (Figure 1). Pure roofline models miss queueing delays and communication overhead—errors that compound when planning at scale.
**Accuracy matters most.** A fast simulator with 50% error means wrong resource decisions—overprovision and waste budget, or underprovision and miss SLOs, which is extremely costly. BLIS-Evolved delivered 11.79% E2E error and 22.81% TTFT error across 38 experiments (Figure 1). Pure roofline models miss queueing delays and communication overhead—errors that compound when planning at scale.

**BLIS-Evolved** hits the sweet spot: high accuracy, extensive coverage, moderate speed.
**LLM-Optimizer** dominates speed at 0.1s for rapid exploration. **Vidur** provides scheduler-level fidelity for focused research.

Marketing claims are not validation. Run the simulator on *your* model, *your* hardware, *your* workload, then compare against real measurements. Test before you trust.
Collaborator


Very strong ending! Might want to link to a reproduction tutorial and detailed benchmark results folder.

| Simulator | Experiments Run | Coverage | Key Limitations |
| --- | --- | --- | --- |
| **LLM-Optimizer** | 36/38 | 94.7% | No L40S hardware profiles (H100/A100 only); MoE approximated as dense; limited vLLM argument support |
| **AIConfigurator** | 19/38 | 50% | Dense models & H100 only |
| **Vidur** | 4/38 | 10.5% | Requires pre-built model profiles; only CodeLlama-34B & Llama-2-70B |
| **LLMServingSim** | 1/38 | 2.6% | Only 1 model with profiled coefficients matching our test set (Mixtral-8x7B); supports only 2 models total on H100; prohibitive runtime limits broader testing |
Collaborator


Oh no... I would be a bit concerned seeing this. My main worry is that the 38 experiments may not fully capture what LLMServingSim can support and what BLIS cannot support.

Contributor Author

@susiejojo susiejojo Apr 3, 2026


LLMServingSim supports exactly 2 model × TP combinations on H100 and nothing on A100 or L40S. One of these is shown in the plot; the other overlaps with our current training set, so it would not be fair to evaluate BLIS against LLMServingSim on that datapoint. In future training-data collection rounds, we will try to exclude these two from the training data.

susiejojo added 16 commits April 3, 2026 16:26
…ors' clear wins

- Accuracy section: Call out BLIS-Evolved's TTFT weakness (22.81% vs 11.79% E2E)
- Algorithm discovery: State LLM-Optimizer wins decisively, BLIS too slow (8x)
- Capacity planning: Give Vidur its win for scheduler-level fidelity
- Bottom line: Reframe as use-case-specific wins, not universal BLIS victory
- Removes marketing tone while preserving data integrity
- Capacity planning: LLM-Optimizer for exploration, BLIS-Evolved for validation
- Algorithm discovery: LLM-Optimizer for single-instance, BLIS-Evolved for multi-instance
- Remove redundant 'For rapid exploration' bullet
- Polish limitations section wording and em dash spacing
- Merge stakes paragraph into opening for tighter narrative
- Add transition sentence about strengths/weaknesses
- Format test matrix as bulleted list for readability
- Fix punctuation in BLIS description (em dash to comma)
- Correct 'vLLM GPU memory utilization' to 'LLM GPU memory utilization'
sriumcp added a commit to sriumcp/inference-sim that referenced this pull request Apr 4, 2026
…rence-sim#944)

Sequential golden section over β₃→β₆→β₅→β₈→β₂ᵦ from iter27 warm start.
β₆ (per-request overhead) jumped +57% (2.805→4.417), the primary driver.
β₃ and β₆ were frozen in iter27's joint search and needed re-calibration.

Final loss: 34.5675% vs iter27 baseline of 34.6564% (-0.089 points).
15 experiments parallelized per evaluation (~5 min/phase, ~25 min total).
No code changes; coefficients are passed via CLI flags.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
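
For readers unfamiliar with the search this commit describes, a minimal sketch of one-dimensional golden-section minimization (the quadratic loss and bracket below are illustrative stand-ins, not the actual BLIS calibration objective):

```python
import math

def golden_section_min(f, lo, hi, tol=1e-6):
    """Minimize a unimodal function f on [lo, hi] via golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2  # 1/phi ~= 0.618
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c                # shrink toward the left; reuse c as new d
            c = b - inv_phi * (b - a)
        else:
            a, c = c, d                # shrink toward the right; reuse d as new c
            d = a + inv_phi * (b - a)
    return (a + b) / 2

# Illustrative loss with its minimum at 4.417 (the beta_6 value the commit reports).
loss = lambda beta: (beta - 4.417) ** 2
print(round(golden_section_min(loss, 0.0, 10.0), 3))  # -> 4.417
```

A coordinate-wise sweep like the one in the commit simply applies this 1-D search to each coefficient in turn, holding the others fixed.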


Development

Successfully merging this pull request may close these issues.

Blog Article: LLM Simulator Comparison and Evaluation Guide

2 participants