Compare 7 inference-serving simulators against ground-truth latency data collected from vLLM, producing MAPE/MPE error tables, CSV exports, and publication figures.
| Adapter | Type | Description |
|---|---|---|
| `blis-blackbox` | Subprocess (Go) | BLIS with trained alpha/beta regression coefficients per model |
| `blis-roofline` | Subprocess (Go) | BLIS with hardware roofline latency model |
| `blis-crossmodel` | Subprocess (Go) | BLIS with globally-fitted cross-model coefficients |
| `blis-trained-roofline` | Subprocess (Go) | BLIS with trained roofline coefficients |
| `vidur` | Subprocess (Python) | Discrete-event simulator with vLLM scheduler emulation |
| `llm-optimizer-estimate` | In-process (Python) | Analytical roofline estimator from llm-optimizer |
| `aiconfigurator-estimate` | In-process (Python) | Analytical estimator from AIConfigurator SDK |
```
sim-to-real-accuracy-validation/
├── experiment/                  # Core Python package
│   ├── data_model.py            # Dataclasses (Experiment, StageMetrics, SimulatorResult, etc.)
│   ├── ground_truth.py          # Discover and parse ground-truth experiment directories
│   ├── kv_cache_extractor.py    # Extract KV cache block counts from vllm.log / kv_events.jsonl
│   ├── trace_converter.py       # Convert per-request JSON → BLIS trace (header YAML + CSV)
│   ├── vidur_trace_converter.py # Convert per-request JSON → Vidur trace CSV
│   ├── metrics.py               # MAPE, MPE, absolute error computation
│   ├── report.py                # Formatted tables and CSV export
│   ├── run.py                   # Pipeline orchestrator (CLI entry point)
│   ├── figures.py               # Publication figures (independent CLI entry point)
│   └── adapters/
│       ├── base.py              # SimulatorAdapter ABC + shared BLIS logic
│       ├── blis_blackbox.py
│       ├── blis_roofline.py
│       ├── blis_crossmodel.py
│       ├── blis_trained_roofline.py
│       ├── vidur.py
│       ├── llm_optimizer_est.py
│       └── aiconfigurator_est.py
├── tests/                       # Unit + integration tests (pytest)
├── vllm_data/ground_truth/      # 16 ground-truth experiment directories (not tracked, see below)
├── inference-sim -> ../inference-sim   # Symlink to BLIS simulator repo
├── vidur -> ../vidur                   # Symlink to Vidur simulator repo
└── llm-optimizer -> ../llm-optimizer   # Symlink to LLM optimizer repo
```
- Python >= 3.10
- Go >= 1.21 (for building BLIS)
- Internet access (llm-optimizer downloads model configs from HuggingFace Hub)
- Local clones of: inference-sim, vidur, llm-optimizer
This repo expects inference-sim, llm-optimizer, and vidur to be symlinks to their respective local clones. The experiments in this repo were run against the following commits:
| Repo | Commit | Description |
|---|---|---|
| inference-sim | `b05154c` | hypothesis(H30-H32): BLIS replay vs real vLLM — three-way crossmodel validation |
| llm-optimizer | `bb82d22` | feat: add support for max workers |
| vidur | `8383d29` | [Bugfix]: Revert scheduler regression and introduce canary branch |
Clone at the pinned versions and create symlinks (assuming repos live under the same parent directory):
```bash
# Clone at pinned commits
git clone git@github.com:inference-sim/inference-sim.git ../inference-sim && git -C ../inference-sim checkout b05154c
git clone git@github.com:bentoml/llm-optimizer.git ../llm-optimizer && git -C ../llm-optimizer checkout bb82d22
git clone git@github.com:microsoft/vidur.git ../vidur && git -C ../vidur checkout 8383d29

# Symlink into this repo
ln -s ../inference-sim inference-sim
ln -s ../llm-optimizer llm-optimizer
ln -s ../vidur vidur
```

Ground-truth data collected from vLLM must be placed under `vllm_data/ground_truth/`. This directory is not tracked by git due to its size. Each experiment directory should follow the naming convention `YYYYMMDD-HHMMSS-*-tp<N>-<workload>` and contain the files described in the Ground-Truth Data section below.
```bash
mkdir -p vllm_data/ground_truth
# Copy or symlink your experiment directories here
```

```bash
cd inference-sim
go build -o blis main.go
cd ..
```

```bash
pip install numpy pyyaml pandas matplotlib  # experiment package deps (pandas/matplotlib for figures)
pip install -e vidur/                       # Vidur simulator
pip install -e llm-optimizer/               # LLM optimizer estimator
pip install aiconfigurator                  # AIConfigurator SDK
```

The `llm-optimizer-estimate` adapter uses `huggingface_hub` to download model `config.json` files. If models are gated:
```bash
export HUGGING_FACE_HUB_TOKEN=hf_...
```

```bash
python -m experiment.run --adapters blis-roofline vidur llm-optimizer-estimate aiconfigurator-estimate --data-dir vllm_data/ground_truth --blis-binary inference-sim/blis --vidur-dir vidur --output-dir results --no-dp-scaling
```

```bash
python -m experiment.run \
    --data-dir vllm_data/ground_truth \
    --blis-binary inference-sim/blis \
    --vidur-dir vidur \
    --output-dir results
```

```bash
python -m experiment.run \
    --data-dir vllm_data/ground_truth \
    --output-dir results \
    --adapters vidur llm-optimizer-estimate aiconfigurator-estimate
```

```bash
python -m experiment.run --adapters blis-roofline vidur
```

Figures are generated independently from the pipeline, reading the CSVs it produces:
```bash
# Basic — reads error_records.csv and runtime.csv from results/
python -m experiment.figures --results-dir results

# With metadata enrichment (adds hardware/config breakdowns)
python -m experiment.figures --results-dir results --metadata experiment_metadata.csv

# Custom output directory
python -m experiment.figures --results-dir results --output-dir results/figures
```

This produces 5 PDF figures and 1 LaTeX table under `results/figures/`.
| Flag | Default | Description |
|---|---|---|
| `--data-dir` | `vllm_data/ground_truth` | Directory containing ground-truth experiment folders |
| `--blis-binary` | `inference-sim/blis` | Path to compiled BLIS binary |
| `--vidur-dir` | `vidur` | Path to cloned Vidur repository |
| `--output-dir` | `results` | Where reports and CSV exports are saved |
| `--adapters` | all 7 | Space-separated list of adapters to run |
| `--no-dp-scaling` | (disabled) | Exclude experiments with data parallelism > 1 |
| Flag | Default | Description |
|---|---|---|
| `--results-dir` | `results` | Directory containing `error_records.csv` and `runtime.csv` |
| `--output-dir` | `results/figures` | Where figures are saved |
| `--metadata` | (none) | Path to `experiment_metadata.csv` for hardware/config enrichment |
Valid adapter names: `blis-blackbox`, `blis-roofline`, `blis-crossmodel`, `blis-trained-roofline`, `vidur`, `llm-optimizer-estimate`, `aiconfigurator-estimate`.
For running LLMServingSim (extremely slow adapter) on a Kubernetes cluster:
See Cluster Deployment Guide for complete instructions.
Quick summary:
- Create 75GB PVC on cluster
- Upload data and code to PVC
- Compile astra-sim once on PVC
- Launch evaluation job (10-30 hours runtime)
- Download results and merge with local simulator outputs
This allows fast simulators to run locally while slow LLMServingSim runs on cluster resources.
The orchestrator (`experiment.run`) executes this sequence:

- Discover — load `experiments.json` from `--data-dir` and resolve each entry to its directory
- Parse — load each experiment's configs, metrics, and KV cache data into `Experiment` dataclasses
- Run — for each (experiment, adapter) pair, check `adapter.can_run()`, then `adapter.run()` to produce a `SimulatorResult`
- Compare — compute MAPE, MPE, and absolute error across 9 latency metrics (e2e/ttft/itl × mean/p90/p99)
- Report — print formatted tables to stdout and save `error_records.csv`

Failures at any step are logged and skipped — the pipeline does not abort on individual errors.
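The run loop can be sketched minimally as follows. The `can_run()`/`run()`/`SimulatorResult` names mirror the interface described above; everything else (the toy adapter, the tuple return) is invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class SimulatorResult:
    simulator: str
    ok: bool

class EchoAdapter:
    """Toy adapter implementing the can_run()/run() interface."""
    name = "echo"

    def can_run(self, experiment: str) -> bool:
        return not experiment.startswith("skip")

    def run(self, experiment: str) -> SimulatorResult:
        return SimulatorResult(simulator=self.name, ok=True)

def run_pipeline(experiments, adapters):
    """Run every (experiment, adapter) pair; skip incompatible or failing pairs."""
    results, skipped = [], []
    for exp in experiments:
        for adapter in adapters:
            if not adapter.can_run(exp):
                skipped.append((adapter.name, exp))
                continue
            try:
                results.append(adapter.run(exp))
            except Exception:
                # Individual failures never abort the whole pipeline.
                skipped.append((adapter.name, exp))
    return results, skipped
```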
Not every adapter can run every experiment. The can_run() method filters incompatible pairs, and the pipeline skips them automatically. See docs/simulator-limitations.md for full details.
| Adapter | Key filters | Coverage (49 experiments) |
|---|---|---|
| `blis-blackbox` | Model must have coefficients in `inference-sim/defaults.yaml` | Varies |
| `blis-roofline` | Always runs | All 49 |
| `blis-crossmodel` | Always runs | All 49 |
| `blis-trained-roofline` | Model must have trained coefficients | Varies |
| `vidur` | 3 pre-profiled models, H100/A100 only, no FP8 | ~9 |
| `llm-optimizer-estimate` | H100/A100, shared_prefix workloads, no Llama-4-Scout | ~40 |
| `aiconfigurator-estimate` | H100 only, dense models, shared_prefix workloads | ~20 |
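A filter like the `aiconfigurator-estimate` row amounts to a simple `can_run()`-style predicate. A sketch, assuming dict-shaped experiment metadata; the field names here are illustrative, not the package's actual `Experiment` API:

```python
# Illustrative predicate for an adapter limited to H100, dense models,
# and shared_prefix workloads (field names are assumptions).
def can_run(exp: dict) -> bool:
    return (
        exp.get("hardware") == "H100"
        and not exp.get("is_moe", False)
        and exp.get("workload") == "shared_prefix"
    )
```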
- `error_records.csv` — one row per (simulator, experiment, stage, metric) with columns: `simulator`, `experiment_folder`, `model`, `workload`, `stage_index`, `metric_name`, `predicted`, `actual`, `mape`, `mpe`, `absolute_error`, plus metadata (`exp_id`, `hardware`, `dp`, `cpu_offload`, `gpu_mem_util`, `precision`, `config_tag`)
- `runtime.csv` — one row per (simulator, experiment) with wall-clock time and metadata
- Stdout tables — MAPE by simulator, MAPE by model, MAPE by workload, MPE by simulator (signed), runtime summary
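The `mape` and `mpe` columns are percentage errors. As a reference, the textbook per-pair definitions are sketched below (illustrative; not necessarily byte-for-byte what `experiment/metrics.py` computes):

```python
def mape(predicted: float, actual: float) -> float:
    """Absolute percentage error for one (predicted, actual) pair."""
    return abs(predicted - actual) / abs(actual) * 100.0

def mpe(predicted: float, actual: float) -> float:
    """Signed percentage error: positive means over-prediction."""
    return (predicted - actual) / abs(actual) * 100.0
```

The signed MPE distinguishes systematic over-prediction from under-prediction, which MAPE alone cannot.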
Generated separately via `python -m experiment.figures`:
- `fig1_model_sensitivity.pdf` — MAPE by model across simulators
- `fig2_hardware_portability.pdf` — MAPE by hardware platform
- `fig3_workload_sensitivity.pdf` — MAPE by workload type
- `fig4a_config_dense.pdf` / `fig4b_config_moe.pdf` — Config sensitivity (mbt, cpu-offload, gpu-mem, dp)
- `fig5_pareto.pdf` — Accuracy vs. runtime Pareto frontier
- `table1_runtime.tex` — LaTeX runtime comparison table
Experiments are discovered via `experiments.json` (a manifest file in `vllm_data/ground_truth/`). Each entry maps an experiment ID to its metadata (hardware, precision, dp, etc.). Directories are named `<id>-<slug>` and resolved by prefix matching.
Each experiment directory contains:
| File | Purpose |
|---|---|
| `exp-config.yaml` | Model name, TP degree, scheduler limits |
| `profile.yaml` | Load stages (rate, duration), data type config |
| `vllm.log` | GPU KV cache block count |
| `kv_events.jsonl` | CPU KV cache offloading events |
| `results/summary_lifecycle_metrics.json` | Aggregate latency and throughput |
| `results/stage_N_lifecycle_metrics.json` | Per-stage latency and throughput |
| `results/per_request_lifecycle_metrics.json` | Per-request timings (used for trace replay) |
The perf data directory is auto-detected: `results/` is preferred, with `inference-perf-data/` as a legacy fallback.
```bash
# Run all tests (unit + integration)
python -m pytest tests/ -v

# Run only unit tests (no ground-truth data needed)
python -m pytest tests/ -v --ignore=tests/test_integration.py

# Run integration tests (requires vllm_data/ground_truth/)
python -m pytest tests/test_integration.py -v
```

Integration tests are automatically skipped if `vllm_data/ground_truth/` is not present.