feat(blog): add LLM simulator showdown comparative analysis #944
Conversation
- Compare 5 simulators: BLIS (Roofline + Evolved), Vidur, LLMServingSim, LLM-Optimizer, and AIConfigurator
- 38 experiments across 6 models (Llama, Qwen, Mixtral, CodeLlama)
- 3 GPU types (H100, A100-80GB, L40S)
- 4 workload categories from ServeGen traces
- Coverage analysis: BLIS 100%, others 2.6-94.7%
- Accuracy comparison: BLIS-Evolved 11.79% E2E MAPE vs competitors
- Include comparison figures and Pareto frontier analysis
- Add Simulator Evaluation and Benchmarking categories to mkdocs.yml
- Fix Mert's avatar URL in .authors.yml

Closes #943
Co-Authored-By: Claude <[email protected]>
…Showdown
- Rename file from llm-simulator-showdown.md to inference-simulator-showdown.md
- Update title from "The LLM Simulator Showdown" to "The Inference Simulator Showdown"
- Update opening line to "inference simulator" for broader applicability

Co-Authored-By: Claude <[email protected]>
- Convert all 5 figure captions to italicized academic format
- Expand captions with more descriptive technical details
- Maintain Figure 1-5 numbering and clear descriptions
- Format follows standard technical publication conventions

Co-Authored-By: Claude <[email protected]>
… LLMServingSim runtime
<a name="accuracy-and-coverage"></a>
## Accuracy and Coverage: Who Gets It Right?

Picture this: you are deploying Mixtral-8x7B on A100 nodes with TP=4. You have read the accuracy benchmarks, picked your simulator, fired it up, and it tells you it does not support MoE models! Or A100s. Or diverse TP configurations. Before you ever question the predictions, coverage gaps have already made the decision for you.
I think the hook can be more convincing. If the user has already read accuracy benchmarks (and can understand them), what's stopping them from just taking the best configuration and running it for real?
### Capacity Planning & Configuration Search

**Accuracy matters most.** A fast simulator with 50% error means wrong resource decisions—overprovision and waste budget, or underprovision and miss SLOs. BLIS-Evolved delivered 11.79% E2E error and 22.81% TTFT error across 38 experiments (Figure 1). Pure roofline models miss queueing delays and communication overhead—errors that compound when planning at scale.
Suggested change: "…overprovision and waste budget, or underprovision and miss SLOs, which is extremely costly."
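The E2E and TTFT error figures quoted here are MAPE-style percentages. A minimal sketch of how such an error metric is computed, using hypothetical latency values rather than the benchmark's actual data:

```python
def mape(predicted, measured):
    """Mean absolute percentage error of simulator predictions vs. real measurements."""
    assert len(predicted) == len(measured) and measured
    return 100.0 * sum(abs(p - m) / m for p, m in zip(predicted, measured)) / len(predicted)

# Hypothetical E2E latencies in seconds (illustrative only)
pred = [1.20, 0.95, 2.10, 1.75]
real = [1.05, 1.00, 1.90, 2.00]
print(f"E2E MAPE: {mape(pred, real):.2f}%")  # E2E MAPE: 10.58%
```

The same formula applies per-metric (E2E, TTFT), which is why a simulator can score well on one and worse on the other.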
**BLIS-Evolved** hits the sweet spot: high accuracy, extensive coverage, moderate speed. **LLM-Optimizer** dominates speed at 0.1s for rapid exploration. **Vidur** provides scheduler-level fidelity for focused research.

Marketing claims are not validation. Run the simulator on *your* model, *your* hardware, *your* workload, then compare against real measurements. Test before you trust.
Very strong ending! Might want to link to a reproduction tutorial and detailed benchmark results folder.
| Simulator | Experiments | Coverage | Limitations |
|---|---|---|---|
| **LLM-Optimizer** | 36/38 | 94.7% | No L40S hardware profiles (H100/A100 only); MoE approximated as dense; limited vLLM argument support |
| **AIConfigurator** | 19/38 | 50% | Dense models & H100 only |
| **Vidur** | 4/38 | 10.5% | Requires pre-built model profiles; only CodeLlama-34B & Llama-2-70B |
| **LLMServingSim** | 1/38 | 2.6% | Only 1 model with profiled coefficients matching our test set (Mixtral-8x7B); supports only 2 models total on H100; prohibitive runtime limits broader testing |
Oh no... I would be a bit concerned seeing this, mainly because I doubt that the 38 experiments fully capture what LLMServingSim can support and what BLIS cannot support.
LLMServingSim only supports exactly 2 model x TP combinations on H100, and nothing on A100 or L40S. Of these, we show one in the plot. The other happens to overlap with our current training set, so it would not be fair to evaluate BLIS against LLMServingSim on that datapoint. In future training data collection rounds, we will try to keep these two out of the training data.
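The coverage column in the table above is just each simulator's supported share of the common 38-experiment grid. A trivial sketch reproducing those percentages from the supported-experiment counts:

```python
TOTAL_EXPERIMENTS = 38

def coverage_pct(supported: int, total: int = TOTAL_EXPERIMENTS) -> float:
    """Percentage of the shared experiment grid a simulator can actually run."""
    return 100.0 * supported / total

# Supported-experiment counts from the comparison table
for name, supported in [("LLM-Optimizer", 36), ("AIConfigurator", 19),
                        ("Vidur", 4), ("LLMServingSim", 1)]:
    print(f"{name}: {coverage_pct(supported):.1f}%")
```

This is also why the LLMServingSim row is so fragile: with a single overlapping experiment, one excluded datapoint swings its coverage from 2.6% to 0%.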
…ors' clear wins
- Accuracy section: Call out BLIS-Evolved's TTFT weakness (22.81% vs 11.79% E2E)
- Algorithm discovery: State LLM-Optimizer wins decisively, BLIS too slow (8x)
- Capacity planning: Give Vidur its win for scheduler-level fidelity
- Bottom line: Reframe as use-case-specific wins, not universal BLIS victory
- Remove marketing tone while preserving data integrity
… algorithm discovery section
- Capacity planning: LLM-Optimizer for exploration, BLIS-Evolved for validation
- Algorithm discovery: LLM-Optimizer for single-instance, BLIS-Evolved for multi-instance
- Remove redundant 'For rapid exploration' bullet
- Polish limitations section wording and em dash spacing
- Merge stakes paragraph into opening for tighter narrative
- Add transition sentence about strengths/weaknesses
- Format test matrix as bulleted list for readability
- Fix punctuation in BLIS description (em dash to comma)
- Correct 'vLLM GPU memory utilization' to 'LLM GPU memory utilization'
…atement, add single-instance caveat
…dd balanced recommendations
…, practical configs
…rence-sim#944) Sequential golden section over β₃→β₆→β₅→β₈→β₂ᵦ from iter27 warm start.
- β₆ (per-request overhead) jumped +57% (2.805→4.417), the primary driver.
- β₃ and β₆ were frozen in iter27's joint search and needed re-calibration.
- Final loss: 34.5675% vs iter27 baseline of 34.6564% (-0.089 points).
- 15 experiments parallelized per evaluation (~5 min/phase, ~25 min total).
- No code changes; coefficients are passed via CLI flags.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
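The calibration above runs a 1-D golden-section search for each β coefficient in sequence. A minimal sketch of the 1-D step; the quadratic loss is a hypothetical stand-in for the simulator's actual loss curve, and the bounds are illustrative:

```python
import math

INV_PHI = (math.sqrt(5) - 1) / 2  # ≈ 0.618, inverse golden ratio

def golden_section_min(f, lo, hi, tol=1e-5):
    """Minimize a unimodal 1-D function f on [lo, hi] by golden-section search."""
    a, b = lo, hi
    while b - a > tol:
        c = b - INV_PHI * (b - a)  # left interior probe
        d = a + INV_PHI * (b - a)  # right interior probe
        if f(c) < f(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2

# Stand-in loss with its minimum at the β₆ value reported above
best_beta6 = golden_section_min(lambda beta: (beta - 4.417) ** 2, 0.0, 10.0)
```

Sequential per-coefficient search like this treats each β as unimodal given the others fixed, which is why coefficients frozen in an earlier joint search (β₃, β₆ here) can need re-calibration later.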
Summary
Adds a comprehensive blog post comparing 5 LLM inference simulators across 38 real-world experiments.
What's Included
Blog Post: "The LLM Simulator Showdown: Which Tool Actually Delivers?"
Key Findings:
Figures:
Changes:
How to Test:
Closes #943