
docs: fix hardcoded metrics mismatches and tutorial narrative gaps#632

Merged
sriumcp merged 1 commit into main from fix-docs-550
Mar 12, 2026

Conversation

@susiejojo
Contributor

Summary

Fixes documentation metrics that drifted after the default latency model switched from blackbox (LLaMA 3.1 8B) to roofline (Qwen3-14B). All numbers in this PR were measured by running the exact commands shown in the docs.

  • Tutorial rewrite: rate=20 → rate=200 scenario with correct per-instance capacity (~17 req/s), real saturation progression (4 instances: 1,500ms → 8 instances: 60ms → 12 instances: 54ms TTFT p99), honest routing differentiation claims, fitness score explanation, --rate clarification
  • Guide docs: fix stale hardcoded numbers in workloads.md (~57 → ~17 req/s), cluster.md (queue growth math), kv-cache.md (prefill time 43ms → 97ms), routing.md (step time ~9ms → ~6ms)
  • Quickstart + README: HuggingFace fetch note, multi-JSON output warning, convert command descriptions, meaningful cluster example load, features list split into Core/Advanced, project structure collapsed behind <details>

Measured metrics (Qwen3-14B / H100 / TP=1, roofline mode)

| Metric | Old claim | Actual measured |
| --- | --- | --- |
| Single-instance throughput | ~2.5 req/s | ~17 req/s (saturated) |
| 2048-token prefill TTFT | ~43ms | ~97ms |
| ITL at low load | ~9ms or ~17.4ms | ~6ms |
| Cluster capacity per instance | ~57 req/s | ~17 req/s |

Test plan

  • go build ./... passes
  • go test ./... passes (all packages OK)
  • All example commands in tutorial produce results matching the documented narrative
  • Visual review of MkDocs rendering (admonitions, details tags)

Fixes #550

🤖 Generated with Claude Code

@susiejojo susiejojo force-pushed the fix-docs-550 branch 3 times, most recently from 5694270 to c32573b on March 12, 2026 16:19
@susiejojo susiejojo requested a review from sriumcp March 12, 2026 16:20
@susiejojo susiejojo force-pushed the fix-docs-550 branch 2 times, most recently from 4a46fe8 to e269ba7 on March 12, 2026 16:29
@sriumcp
Collaborator

sriumcp commented Mar 12, 2026

@claude review PR

@claude

claude bot commented Mar 12, 2026

Claude finished @sriumcp's task in 1m 37s.


PR Review: docs: fix hardcoded metrics mismatches and tutorial narrative gaps

  • Read all changed files and compare with origin/main
  • Review README.md changes
  • Review getting-started docs (installation, quickstart, tutorial)
  • Review guide docs (cluster, kv-cache, latency-models, routing, workloads)
  • Post review feedback

Overall Assessment

Good PR — the core mission (updating stale metrics after the blackbox→roofline default switch) is well-executed. The measured metrics table in the PR description is helpful, and the tutorial rewrite with the 200 req/s scenario tells a much more compelling story than the old 20 req/s version. The added admonitions (HF fetch note, multi-JSON output, --rate semantics) address real user pain points.

Issues Found

1. cluster.md: All example excess rates are positive — no sub-saturation example (Minor)

docs/guide/cluster.md:59-63 — The old version showed a meaningful contrast: 4 instances were overloaded (excess=67.6) while 8 instances were barely queuing (excess=5.1). The new version at rate=500 has all three cases heavily overloaded:

4 instances:  excess = 108 req/s   → rapid queue growth
8 instances:  excess = 45.5 req/s  → heavy queueing
12 instances: excess = 24.7 req/s  → moderate queueing

None of these examples actually show the "sub-saturation" regime the next line references. The old version had 8 instances at excess=5.1 which was genuinely near equilibrium. Consider either: (a) adding a 4th line showing ~30 instances where excess goes negative, or (b) using a lower rate (e.g., rate=200) where 12 instances crosses the threshold (200/12 - 17 ≈ -0.3, essentially balanced). The "~8x" improvement claim also isn't self-evident from the numbers since both 4 and 8 instances are deeply overloaded — the reader can't verify it from the excess rates alone.

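The suggested rate=200 arithmetic can be sanity-checked with a quick shell sketch. The ~17 req/s per-instance saturated capacity is the measured value from the PR description; the loop itself is illustrative, not a BLIS command:

```shell
# Excess rate per instance = (total arrival rate / N) - saturated capacity.
# rate=200 and cap=17 come from the review discussion; outputs are approximate.
rate=200
cap=17
for n in 4 8 12; do
  awk -v r="$rate" -v n="$n" -v c="$cap" \
    'BEGIN { printf "%2d instances: excess = %5.1f req/s\n", n, r/n - c }'
done
```

At 12 instances the excess dips slightly negative (200/12 − 17 ≈ −0.3 req/s), which is exactly the sub-saturation crossing this review comment asks the example to show.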

2. tutorial.md: Step 3 comment says "moderate headroom" for 8 instances, but body says it's overloaded (Minor)

docs/getting-started/tutorial.md:49 — The inline comment says # 8 instances (~136 req/s capacity → moderate headroom), but the explanation below (line 61) says TTFT p99 is ~60ms because the per-instance rate (25 req/s) exceeds single-request capacity. "Moderate headroom" suggests a comfortable margin; the text describes near-saturation behavior. Consider changing it to something like "still overloaded but manageable" or "near capacity with batching".


3. tutorial.md: Capacity math in Step 3 comments could confuse readers (Nit)

docs/getting-started/tutorial.md:42-55 — The inline comments calculate capacity as 17 * N (e.g., ~68 req/s, ~136 req/s, ~204 req/s), but this is the throughput at saturation where continuous batching is fully engaged. A reader who ran Step 1 might think each instance can handle 17 independent req/s regardless of batching. The tutorial body correctly explains this nuance (lines 60-62), but the comments could be clearer — e.g., # 4 instances (saturated throughput ~68 req/s, needs 200).

4. quickstart.md: jq --slurp pipeline is fragile (Nit)

docs/getting-started/quickstart.md:67 — The suggested command ./blis run ... 2>/dev/null | tail -n +2 | jq --slurp '.[-1]' assumes the first line of stdout is the metrics header. If the output format changes (e.g., header removed or extra lines), tail -n +2 silently produces wrong results. Consider using jq -s with a filter on instance_id == "cluster" instead: jq --slurp '.[] | select(.instance_id == "cluster")' which is format-resilient.
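To illustrate the difference, here is a minimal sketch using two fabricated JSON objects standing in for the simulator's per-instance and cluster-level output lines (the `instance_id == "cluster"` convention is taken from this review comment; the field values are made up):

```shell
# Select the cluster-level metrics object by field value rather than by
# line position, so extra header or log lines cannot silently skew the result.
printf '%s\n' \
  '{"instance_id":"0","ttft_p99_ms":60}' \
  '{"instance_id":"cluster","ttft_p99_ms":54}' |
  jq --slurp '.[] | select(.instance_id == "cluster")'
```

This keeps working if the simulator later adds or removes leading lines, whereas `tail -n +2` would silently select the wrong object.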

5. README.md: "PP, EP" dropped from vLLM config feature line (Nit)

README.md:24 — Changed from (TP, PP, EP, batch limits) to (TP, chunk size, batch limits), dropping pipeline parallelism (PP) and expert parallelism (EP). If BLIS genuinely doesn't model PP/EP, this is correct. But if it does (even partially), mentioning them in the features list is useful for discoverability.

What Looks Good

  • Tutorial narrative flow: The progression from "measure capacity under load" → "why batching matters" → "scale up" → "identify bottleneck" → "compare routing" is pedagogically sound. The new admonitions ("Why measure at high load?", "--rate is total arrival rate") preempt real FAQ questions.
  • Honest routing claims: The old tutorial implied routing always makes a big difference; the new version correctly states routing differentiation needs heterogeneous workloads (tutorial.md:107). This is more accurate and avoids setting false expectations.
  • Consistent HF_TOKEN messaging: The HF token note is now surfaced early in quickstart, installation, and README with consistent wording and a link to Llama-2 as the canonical gated model example.
  • README structure improvements: Collapsing the directory tree behind <details> and splitting features into Core/Advanced are good UX improvements for the landing page.
  • Guide metric fixes (kv-cache.md, routing.md, workloads.md, cluster.md): All single-line metric corrections look correct against the measured values in the PR description.

Summary

The PR achieves its stated goal of fixing stale metrics and improving the tutorial narrative. The issues above are all minor/nit — the only one worth addressing before merge is #1 (cluster.md excess rate examples) since it undermines the pedagogical point about sub-saturation. The rest are suggestions that could be addressed in follow-up.



- Tutorial: rewrite capacity planning scenario with rate=200 (was rate=20),
  correct per-instance throughput ~17 req/s (was ~2.5), add --rate
  clarification, explain fitness scores, honest routing differentiation claims
- workloads.md: fix stale ~57 req/s capacity → ~17 req/s for roofline mode
- cluster.md: fix queue growth math using 17 req/s (was 57.4 req/s)
- kv-cache.md: fix 2048-token prefill time ~97ms (was ~43ms)
- routing.md: fix stale ~9ms step time claim
- quickstart.md: add HF fetch note, multi-JSON output format warning
- README.md: add HF fetch note, convert command descriptions, meaningful
  cluster example load, split features into Core/Advanced, collapse project
  structure behind <details> tag

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@susiejojo
Contributor Author

Addressed all review comments:

  1. cluster.md excess rates — switched to rate=200 so the examples show a real transition: 4 instances (excess=33, overloaded) → 8 instances (excess=8, near saturation) → 12 instances (excess=-0.3, balanced/sub-saturation). The closing line now explains that TTFT converges to baseline when excess ≤ 0.

  2. tutorial.md "moderate headroom" mismatch — rewrote all three inline comments to show per-instance arrival rate vs saturated capacity (e.g., 50 req/s per instance vs ~17 saturated capacity → heavily overloaded), consistent with the explanation below.

  3. tutorial.md capacity math clarity — same fix as #2; comments now make it clear these are per-instance arrival rates compared against saturated throughput, not standalone capacity numbers.

  4. quickstart.md jq pipeline — replaced fragile tail -n +2 | jq --slurp '.[-1]' with format-resilient jq --slurp '.[] | select(.instance_id == "cluster")'.

  5. README.md PP/EP — no change needed. Confirmed BLIS has no --pp or --ep CLI flags (grepped cmd/), so the original (TP, PP, EP, batch limits) was aspirational. Current (TP, chunk size, batch limits) accurately reflects what BLIS models today.

@sriumcp
Collaborator

sriumcp commented Mar 12, 2026

@claude Are all the review comments addressed? See comment above: #632 (comment)

@claude

claude bot commented Mar 12, 2026

Claude finished @sriumcp's task in 45s.


Review comment verification

All 5 review comments have been addressed. Here's the verification:

  • Fix #1: cluster.md excess rates — Switched to rate=200. Now shows a real transition: 4 instances (excess=33, overloaded) → 8 instances (excess=8, near saturation) → 12 instances (excess=-0.3, sub-saturation). Line 65 adds: "At sub-saturation (excess ≤ 0): TTFT converges to the baseline." ✅
  • Fix #2: tutorial.md "moderate headroom" — All three inline comments rewritten to show per-instance arrival rate vs saturated capacity (e.g., 50 req/s per instance vs ~17 saturated capacity → heavily overloaded). No more "moderate headroom" mismatch. ✅
  • Fix #3: tutorial.md capacity math clarity — Same fix as #2. Comments now clearly compare arrival rate against saturated throughput, not standalone capacity. ✅
  • Fix #4: quickstart.md jq pipeline — Replaced tail -n +2 | jq --slurp '.[-1]' with format-resilient jq --slurp '.[] | select(.instance_id == "cluster")'. ✅
  • Fix #5: README.md PP/EP — No change needed. Confirmed (TP, chunk size, batch limits) accurately reflects what BLIS models (no --pp or --ep flags exist). ✅

LGTM — all review feedback has been addressed and the changes look correct.

@sriumcp sriumcp merged commit c9ca689 into main Mar 12, 2026
5 checks passed
@susiejojo susiejojo deleted the fix-docs-550 branch March 20, 2026 18:17


Development

Successfully merging this pull request may close these issues.

Docs: comprehensive first-time user review — hardcoded metrics mismatches, tutorial narrative gaps, and UX friction

2 participants