feat(blog): add LLM simulator showdown comparative analysis #944
Conversation
- Compare 5 simulators: BLIS (Roofline + Evolved), Vidur, LLMServingSim, LLM-Optimizer, and AIConfigurator
- 38 experiments across 6 models (Llama, Qwen, Mixtral, CodeLlama)
- 3 GPU types (H100, A100-80GB, L40S)
- 4 workload categories from ServeGen traces
- Coverage analysis: BLIS 100%, others 2.6-94.7%
- Accuracy comparison: BLIS-Evolved 11.79% E2E MAPE vs competitors
- Include comparison figures and Pareto frontier analysis
- Add Simulator Evaluation and Benchmarking categories to mkdocs.yml
- Fix Mert's avatar URL in .authors.yml

Closes #943
Co-Authored-By: Claude <[email protected]>
…Showdown
- Rename file from llm-simulator-showdown.md to inference-simulator-showdown.md
- Update title from "The LLM Simulator Showdown" to "The Inference Simulator Showdown"
- Update opening line to "inference simulator" for broader applicability

Co-Authored-By: Claude <[email protected]>
- Convert all 5 figure captions to italicized academic format
- Expand captions with more descriptive technical details
- Maintain Figure 1-5 numbering and clear descriptions
- Format follows standard technical publication conventions

Co-Authored-By: Claude <[email protected]>
… LLMServingSim runtime
<a name="accuracy-and-coverage"></a>
## Accuracy and Coverage: Who Gets It Right?

Picture this: you are deploying Mixtral-8x7B on A100 nodes with TP=4. You have read the accuracy benchmarks, picked your simulator, fired it up, and it tells you it does not support MoE models! Or A100s. Or diverse TP configurations. Before you ever question the predictions, coverage gaps have already made the decision for you.
I think the hook can be more convincing. If the user has already read accuracy benchmarks (and can understand them), what's stopping them from just taking the best configuration and running it for real?
### Capacity Planning & Configuration Search

**Accuracy matters most.** A fast simulator with 50% error means wrong resource decisions—overprovision and waste budget, or underprovision and miss SLOs. BLIS-Evolved delivered 11.79% E2E error and 22.81% TTFT error across 38 experiments (Figure 1). Pure roofline models miss queueing delays and communication overhead—errors that compound when planning at scale.
Suggested change: "…overprovision and waste budget, or underprovision and miss SLOs, which is extremely costly."
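The E2E and TTFT error figures quoted here are MAPE-style percentages. A minimal sketch of how such an error metric is computed, using hypothetical latency values rather than the benchmark's actual data:

```python
def mape(predicted, measured):
    """Mean absolute percentage error of simulator predictions vs. real measurements."""
    assert len(predicted) == len(measured) and measured
    return 100.0 * sum(abs(p - m) / m for p, m in zip(predicted, measured)) / len(predicted)

# Hypothetical E2E latencies in seconds (illustrative only)
pred = [1.20, 0.95, 2.10, 1.75]
real = [1.05, 1.00, 1.90, 2.00]
print(f"E2E MAPE: {mape(pred, real):.2f}%")  # E2E MAPE: 10.58%
```

The same formula applies per-metric (E2E, TTFT), which is why a simulator can score well on one and worse on the other.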
**BLIS-Evolved** hits the sweet spot: high accuracy, extensive coverage, moderate speed. **LLM-Optimizer** dominates speed at 0.1s for rapid exploration. **Vidur** provides scheduler-level fidelity for focused research.

Marketing claims are not validation. Run the simulator on *your* model, *your* hardware, *your* workload, then compare against real measurements. Test before you trust.
Very strong ending! Might want to link to a reproduction tutorial and detailed benchmark results folder.
| Simulator | Experiments | Coverage | Limitations |
|---|---|---|---|
| **LLM-Optimizer** | 36/38 | 94.7% | No L40S hardware profiles (H100/A100 only); MoE approximated as dense; limited vLLM argument support |
| **AIConfigurator** | 19/38 | 50% | Dense models & H100 only |
| **Vidur** | 4/38 | 10.5% | Requires pre-built model profiles; only CodeLlama-34B & Llama-2-70B |
| **LLMServingSim** | 1/38 | 2.6% | Only 1 model with profiled coefficients matching our test set (Mixtral-8x7B); supports only 2 models total on H100; prohibitive runtime limits broader testing |
Oh no... I would be a bit concerned seeing this, mainly because I doubt that the 38 experiments fully capture what LLMServingSim can support and what BLIS cannot support.
LLMServingSim only supports exactly 2 model x TP combinations on H100, and nothing on A100 or L40S. Of these, we show one in the plot. The other happens to overlap with our current training set, so it would not be fair to evaluate BLIS against LLMServingSim on that datapoint. In future training data collection rounds, we will try to keep these two out of the training data.
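The coverage column in the table above is just each simulator's supported share of the common 38-experiment grid. A trivial sketch reproducing those percentages from the supported-experiment counts:

```python
TOTAL_EXPERIMENTS = 38

def coverage_pct(supported: int, total: int = TOTAL_EXPERIMENTS) -> float:
    """Percentage of the shared experiment grid a simulator can actually run."""
    return 100.0 * supported / total

# Supported-experiment counts from the comparison table
for name, supported in [("LLM-Optimizer", 36), ("AIConfigurator", 19),
                        ("Vidur", 4), ("LLMServingSim", 1)]:
    print(f"{name}: {coverage_pct(supported):.1f}%")
```

This is also why the LLMServingSim row is so fragile: with a single overlapping experiment, one excluded datapoint swings its coverage from 2.6% to 0%.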
…ors' clear wins
- Accuracy section: Call out BLIS-Evolved's TTFT weakness (22.81% vs 11.79% E2E)
- Algorithm discovery: State LLM-Optimizer wins decisively, BLIS too slow (8x)
- Capacity planning: Give Vidur its win for scheduler-level fidelity
- Bottom line: Reframe as use-case-specific wins, not universal BLIS victory
- Remove marketing tone while preserving data integrity
… algorithm discovery section
- Capacity planning: LLM-Optimizer for exploration, BLIS-Evolved for validation
- Algorithm discovery: LLM-Optimizer for single-instance, BLIS-Evolved for multi-instance
- Remove redundant 'For rapid exploration' bullet
- Polish limitations section wording and em dash spacing
- Merge stakes paragraph into opening for tighter narrative
- Add transition sentence about strengths/weaknesses
- Format test matrix as bulleted list for readability
- Fix punctuation in BLIS description (em dash to comma)
- Correct 'vLLM GPU memory utilization' to 'LLM GPU memory utilization'
…atement, add single-instance caveat
…dd balanced recommendations
…, practical configs
…rence-sim#944) Sequential golden section over β₃→β₆→β₅→β₈→β₂ᵦ from iter27 warm start.
- β₆ (per-request overhead) jumped +57% (2.805→4.417), the primary driver.
- β₃ and β₆ were frozen in iter27's joint search and needed re-calibration.
- Final loss: 34.5675% vs iter27 baseline of 34.6564% (-0.089 points).
- 15 experiments parallelized per evaluation (~5 min/phase, ~25 min total).
- No code changes; coefficients are passed via CLI flags.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
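The calibration above runs a 1-D golden-section search for each β coefficient in sequence. A minimal sketch of the 1-D step; the quadratic loss is a hypothetical stand-in for the simulator's actual loss curve, and the bounds are illustrative:

```python
import math

INV_PHI = (math.sqrt(5) - 1) / 2  # ≈ 0.618, inverse golden ratio

def golden_section_min(f, lo, hi, tol=1e-5):
    """Minimize a unimodal 1-D function f on [lo, hi] by golden-section search."""
    a, b = lo, hi
    while b - a > tol:
        c = b - INV_PHI * (b - a)  # left interior probe
        d = a + INV_PHI * (b - a)  # right interior probe
        if f(c) < f(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2

# Stand-in loss with its minimum at the β₆ value reported above
best_beta6 = golden_section_min(lambda beta: (beta - 4.417) ** 2, 0.0, 10.0)
```

Sequential per-coefficient search like this treats each β as unimodal given the others fixed, which is why coefficients frozen in an earlier joint search (β₃, β₆ here) can need re-calibration later.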
Summary
Adds a comprehensive blog post comparing 5 LLM inference simulators across 38 real-world experiments.
What's Included
Blog Post: "The LLM Simulator Showdown: Which Tool Actually Delivers?"
Key Findings:
Figures:
Changes:
How to Test:
Closes #943