
docs: fix hardcoded metrics mismatches and tutorial narrative gaps#632

Merged
sriumcp merged 1 commit into main from fix-docs-550
Mar 12, 2026

Conversation

@susiejojo
Contributor

Summary

Fixes documentation metrics that drifted after the default latency model switched from blackbox (LLaMA 3.1 8B) to roofline (Qwen3-14B). All numbers in this PR were measured by running the exact commands shown in the docs.

  • Tutorial rewrite: rate=20 → rate=200 scenario with correct per-instance capacity (~17 req/s), real saturation progression (4 instances: 1,500ms → 8 instances: 60ms → 12 instances: 54ms TTFT p99), honest routing differentiation claims, fitness score explanation, --rate clarification
  • Guide docs: fix stale hardcoded numbers in workloads.md (~57 → ~17 req/s), cluster.md (queue growth math), kv-cache.md (prefill time 43ms → 97ms), routing.md (step time ~9ms → ~6ms)
  • Quickstart + README: HuggingFace fetch note, multi-JSON output warning, convert command descriptions, meaningful cluster example load, features list split into Core/Advanced, project structure collapsed behind <details>

Measured metrics (Qwen3-14B / H100 / TP=1, roofline mode)

| Metric | Old claim | Actual measured |
| --- | --- | --- |
| Single-instance throughput | ~2.5 req/s | ~17 req/s (saturated) |
| 2048-token prefill TTFT | ~43ms | ~97ms |
| ITL at low load | ~9ms or ~17.4ms | ~6ms |
| Cluster capacity per instance | ~57 req/s | ~17 req/s |

Test plan

  • go build ./... passes
  • go test ./... passes (all packages OK)
  • All example commands in tutorial produce results matching the documented narrative
  • Visual review of MkDocs rendering (admonitions, details tags)

Fixes #550

🤖 Generated with Claude Code

@susiejojo susiejojo force-pushed the fix-docs-550 branch 3 times, most recently from 5694270 to c32573b on March 12, 2026 16:19
@susiejojo susiejojo requested a review from sriumcp March 12, 2026 16:20
@susiejojo susiejojo force-pushed the fix-docs-550 branch 2 times, most recently from 4a46fe8 to e269ba7 on March 12, 2026 16:29
@sriumcp
Collaborator

sriumcp commented Mar 12, 2026

@claude review PR

@claude

claude bot commented Mar 12, 2026

Claude finished @sriumcp's task in 1m 37s.


PR Review: docs: fix hardcoded metrics mismatches and tutorial narrative gaps

  • Read all changed files and compare with origin/main
  • Review README.md changes
  • Review getting-started docs (installation, quickstart, tutorial)
  • Review guide docs (cluster, kv-cache, latency-models, routing, workloads)
  • Post review feedback

Overall Assessment

Good PR — the core mission (updating stale metrics after the blackbox→roofline default switch) is well-executed. The measured metrics table in the PR description is helpful, and the tutorial rewrite with the 200 req/s scenario tells a much more compelling story than the old 20 req/s version. The added admonitions (HF fetch note, multi-JSON output, --rate semantics) address real user pain points.

Issues Found

1. cluster.md: All example excess rates are positive — no sub-saturation example (Minor)

docs/guide/cluster.md:59-63 — The old version showed a meaningful contrast: 4 instances were overloaded (excess=67.6) while 8 instances were barely queuing (excess=5.1). The new version at rate=500 has all three cases heavily overloaded:

4 instances:  excess = 108 req/s   → rapid queue growth
8 instances:  excess = 45.5 req/s  → heavy queueing
12 instances: excess = 24.7 req/s  → moderate queueing

None of these examples actually show the "sub-saturation" regime the next line references. The old version had 8 instances at excess=5.1 which was genuinely near equilibrium. Consider either: (a) adding a 4th line showing ~30 instances where excess goes negative, or (b) using a lower rate (e.g., rate=200) where 12 instances crosses the threshold (200/12 - 17 ≈ -0.3, essentially balanced). The "~8x" improvement claim also isn't self-evident from the numbers since both 4 and 8 instances are deeply overloaded — the reader can't verify it from the excess rates alone.

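The suggested rate=200 arithmetic can be sanity-checked with a quick shell sketch. The ~17 req/s per-instance saturated capacity is the measured value from the PR description; the loop itself is illustrative, not a BLIS command:

```shell
# Excess rate per instance = (total arrival rate / N) - saturated capacity.
# rate=200 and cap=17 come from the review discussion; outputs are approximate.
rate=200
cap=17
for n in 4 8 12; do
  awk -v r="$rate" -v n="$n" -v c="$cap" \
    'BEGIN { printf "%2d instances: excess = %5.1f req/s\n", n, r/n - c }'
done
```

At 12 instances the excess dips slightly negative (200/12 − 17 ≈ −0.3 req/s), which is exactly the sub-saturation crossing this review comment asks the example to show.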

2. tutorial.md: Step 3 comment says "moderate headroom" for 8 instances, but body says it's overloaded (Minor)

docs/getting-started/tutorial.md:49 — The inline comment says # 8 instances (~136 req/s capacity → moderate headroom), but the explanation below (line 61) says TTFT p99 is ~60ms because the per-instance rate (25 req/s) exceeds single-request capacity. "Moderate headroom" suggests a comfortable margin; the text describes near-saturation behavior. Consider changing it to something like "still overloaded but manageable" or "near capacity with batching".


3. tutorial.md: Capacity math in Step 3 comments could confuse readers (Nit)

docs/getting-started/tutorial.md:42-55 — The inline comments calculate capacity as 17 * N (e.g., ~68 req/s, ~136 req/s, ~204 req/s), but this is the throughput at saturation where continuous batching is fully engaged. A reader who ran Step 1 might think each instance can handle 17 independent req/s regardless of batching. The tutorial body correctly explains this nuance (lines 60-62), but the comments could be clearer — e.g., # 4 instances (saturated throughput ~68 req/s, needs 200).

4. quickstart.md: jq --slurp pipeline is fragile (Nit)

docs/getting-started/quickstart.md:67 — The suggested command ./blis run ... 2>/dev/null | tail -n +2 | jq --slurp '.[-1]' assumes the first line of stdout is the metrics header. If the output format changes (e.g., header removed or extra lines), tail -n +2 silently produces wrong results. Consider using jq -s with a filter on instance_id == "cluster" instead: jq --slurp '.[] | select(.instance_id == "cluster")' which is format-resilient.
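To illustrate the difference, here is a minimal sketch using two fabricated JSON objects standing in for the simulator's per-instance and cluster-level output lines (the `instance_id == "cluster"` convention is taken from this review comment; the field values are made up):

```shell
# Select the cluster-level metrics object by field value rather than by
# line position, so extra header or log lines cannot silently skew the result.
printf '%s\n' \
  '{"instance_id":"0","ttft_p99_ms":60}' \
  '{"instance_id":"cluster","ttft_p99_ms":54}' |
  jq --slurp '.[] | select(.instance_id == "cluster")'
```

This keeps working if the simulator later adds or removes leading lines, whereas `tail -n +2` would silently select the wrong object.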

5. README.md: "PP, EP" dropped from vLLM config feature line (Nit)

README.md:24 — Changed from (TP, PP, EP, batch limits) to (TP, chunk size, batch limits), dropping pipeline parallelism (PP) and expert parallelism (EP). If BLIS genuinely doesn't model PP/EP, this is correct. But if it does (even partially), mentioning them in the features list is useful for discoverability.

What Looks Good

  • Tutorial narrative flow: The progression from "measure capacity under load" → "why batching matters" → "scale up" → "identify bottleneck" → "compare routing" is pedagogically sound. The new admonitions ("Why measure at high load?", "--rate is total arrival rate") preempt real FAQ questions.
  • Honest routing claims: The old tutorial implied routing always makes a big difference; the new version correctly states routing differentiation needs heterogeneous workloads (tutorial.md:107). This is more accurate and avoids setting false expectations.
  • Consistent HF_TOKEN messaging: The HF token note is now surfaced early in quickstart, installation, and README with consistent wording and a link to Llama-2 as the canonical gated model example.
  • README structure improvements: Collapsing the directory tree behind <details> and splitting features into Core/Advanced are good UX improvements for the landing page.
  • Guide metric fixes (kv-cache.md, routing.md, workloads.md, cluster.md): All single-line metric corrections look correct against the measured values in the PR description.

Summary

The PR achieves its stated goal of fixing stale metrics and improving the tutorial narrative. The issues above are all minor/nit — the only one worth addressing before merge is #1 (cluster.md excess rate examples) since it undermines the pedagogical point about sub-saturation. The rest are suggestions that could be addressed in follow-up.



- Tutorial: rewrite capacity planning scenario with rate=200 (was rate=20),
  correct per-instance throughput ~17 req/s (was ~2.5), add --rate
  clarification, explain fitness scores, honest routing differentiation claims
- workloads.md: fix stale ~57 req/s capacity → ~17 req/s for roofline mode
- cluster.md: fix queue growth math using 17 req/s (was 57.4 req/s)
- kv-cache.md: fix 2048-token prefill time ~97ms (was ~43ms)
- routing.md: fix stale ~9ms step time claim
- quickstart.md: add HF fetch note, multi-JSON output format warning
- README.md: add HF fetch note, convert command descriptions, meaningful
  cluster example load, split features into Core/Advanced, collapse project
  structure behind <details> tag

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@susiejojo
Contributor Author

Addressed all review comments:

  1. cluster.md excess rates — switched to rate=200 so the examples show a real transition: 4 instances (excess=33, overloaded) → 8 instances (excess=8, near saturation) → 12 instances (excess=-0.3, balanced/sub-saturation). The closing line now explains that TTFT converges to baseline when excess ≤ 0.

  2. tutorial.md "moderate headroom" mismatch — rewrote all three inline comments to show per-instance arrival rate vs saturated capacity (e.g., 50 req/s per instance vs ~17 saturated capacity → heavily overloaded), consistent with the explanation below.

  3. tutorial.md capacity math clarity — same fix as #2; comments now make it clear these are per-instance arrival rates compared against saturated throughput, not standalone capacity numbers.

  4. quickstart.md jq pipeline — replaced fragile tail -n +2 | jq --slurp '.[-1]' with format-resilient jq --slurp '.[] | select(.instance_id == "cluster")'.

  5. README.md PP/EP — no change needed. Confirmed BLIS has no --pp or --ep CLI flags (grepped cmd/), so the original (TP, PP, EP, batch limits) was aspirational. Current (TP, chunk size, batch limits) accurately reflects what BLIS models today.

@sriumcp
Collaborator

sriumcp commented Mar 12, 2026

@claude Are all the review comments addressed? See comment above: #632 (comment)

@claude

claude bot commented Mar 12, 2026

Claude finished @sriumcp's task in 45s.


Review comment verification

All 5 review comments have been addressed. Here's the verification:

  • Fix #1: cluster.md excess rates — Switched to rate=200. Now shows a real transition: 4 instances (excess=33, overloaded) → 8 instances (excess=8, near saturation) → 12 instances (excess=-0.3, sub-saturation). Line 65 adds: "At sub-saturation (excess ≤ 0): TTFT converges to the baseline." ✅
  • Fix #2: tutorial.md "moderate headroom" — All three inline comments rewritten to show per-instance arrival rate vs saturated capacity (e.g., 50 req/s per instance vs ~17 saturated capacity → heavily overloaded). No more "moderate headroom" mismatch. ✅
  • Fix #3: tutorial.md capacity math clarity — Same fix as #2. Comments now clearly compare arrival rate against saturated throughput, not standalone capacity. ✅
  • Fix #4: quickstart.md jq pipeline — Replaced tail -n +2 | jq --slurp '.[-1]' with format-resilient jq --slurp '.[] | select(.instance_id == "cluster")'. ✅
  • Fix #5: README.md PP/EP — No change needed. Confirmed (TP, chunk size, batch limits) accurately reflects what BLIS models (no --pp or --ep flags exist). ✅

LGTM — all review feedback has been addressed and the changes look correct.

@sriumcp sriumcp merged commit c9ca689 into main Mar 12, 2026
5 checks passed
@susiejojo susiejojo deleted the fix-docs-550 branch March 20, 2026 18:17


Development

Successfully merging this pull request may close these issues.

Docs: comprehensive first-time user review — hardcoded metrics mismatches, tutorial narrative gaps, and UX friction

2 participants