A discrete-event simulator for LLM inference serving systems. BLIS models multi-instance clusters with configurable admission control, request routing, KV-cache dynamics (including tiered GPU+CPU offloading), scheduling policies, and token generation — all driven by trained performance coefficients, analytical roofline estimates, or physics-informed cross-model prediction.
The simulator is CPU-only, deterministic, and designed for capacity planning, policy optimization research, and performance prediction across model/GPU/TP configurations without requiring real GPUs.
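The roofline idea behind the analytical mode is simple: a forward step costs at least its arithmetic divided by peak compute, and at least its memory traffic divided by peak bandwidth, so the estimate is the larger of the two ceilings. A minimal sketch with illustrative numbers (not BLIS's actual model or coefficients):

```go
package main

import "fmt"

// rooflineStepTime sketches the analytical roofline bound: a step is
// either compute-bound or bandwidth-bound, so its time is the max of
// the two ceilings. Inputs and peaks here are illustrative only.
func rooflineStepTime(flops, bytes, peakFLOPS, peakBW float64) float64 {
	computeTime := flops / peakFLOPS // seconds if compute-bound
	memoryTime := bytes / peakBW     // seconds if bandwidth-bound
	if computeTime > memoryTime {
		return computeTime
	}
	return memoryTime
}

func main() {
	// Decode steps at small batch sizes are typically bandwidth-bound:
	// streaming ~28 GB of weights at ~2 TB/s takes ~14 ms per step.
	t := rooflineStepTime(28e9, 28e9, 312e12, 2e12)
	fmt.Printf("~%.1f ms per step\n", t*1e3)
}
```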
- Discrete-event simulation for prefill, decode, and request scheduling
- KV-cache modeling (blocks, prefix caching, prefill chunking, tiered GPU+CPU offload)
- CPU-only inference cost model via analytical roofline estimation or learned α/β coefficients
- Four latency estimation modes: roofline (default, analytical), blackbox (data-driven), cross-model (physics-informed, MoE-aware), and trained-roofline (roofline × learned corrections)
- Multi-instance cluster simulation with shared-clock event loop and pluggable routing (round-robin, least-loaded, weighted-scoring)
- Multiple workload types: preset (`chatbot`, `contentgen`, `summarization`, `multidoc`), custom distributions, or trace replay
- Any HuggingFace model: dense (Llama-2, Qwen3, etc.) and MoE (Mixtral, etc.) — auto-fetches model config on first run
- vLLM deployment configuration (TP, chunk size, batch limits)
- Priority policies (`constant`, `slo-based`) and instance schedulers (`fcfs`, `priority-fcfs`, `sjf`)
- Admission control: always-admit or token-bucket rate limiting
- YAML policy configuration: define all policies in a single config file (`--policy-config`)
- ServeGen-informed workload generation: multi-client specs with Poisson/Gamma/Weibull/Constant arrivals (`--workload-spec`)
- Decision tracing and counterfactual analysis: record routing decisions and evaluate alternative choices
- Fitness evaluation: weighted multi-objective scoring with configurable metric weights
- Per-SLO-class metrics: breakdown by SLO class with Jain fairness index
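The Jain fairness index reported in the per-SLO-class breakdown is the standard statistic (Σxᵢ)² / (n·Σxᵢ²); a self-contained sketch of how it behaves:

```go
package main

import "fmt"

// jainIndex computes Jain's fairness index (sum x)^2 / (n * sum x^2)
// over per-class throughputs. It equals 1.0 when every class receives
// equal service and falls toward 1/n as one class starves the rest.
func jainIndex(xs []float64) float64 {
	var sum, sumSq float64
	for _, x := range xs {
		sum += x
		sumSq += x * x
	}
	if sumSq == 0 {
		return 0 // no throughput at all; avoid dividing by zero
	}
	n := float64(len(xs))
	return (sum * sum) / (n * sumSq)
}

func main() {
	fmt.Println(jainIndex([]float64{10, 10, 10, 10})) // perfectly fair: 1
	fmt.Println(jainIndex([]float64{40, 0, 0, 0}))    // one class starves: 0.25
}
```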
Requirements:
- Go ≥ 1.21
Build the binary:
```shell
git clone https://github.com/inference-sim/inference-sim.git
cd inference-sim
go build -o blis main.go
```
Note: On first run, BLIS auto-fetches the model's `config.json` from HuggingFace (~1 second for public models like Qwen3). Subsequent runs use the cached config in `model_configs/`. For offline use, pass `--latency-model blackbox` (uses pre-trained coefficients, no network needed).
Environment setup (optional):
Set `HF_TOKEN` to access gated models (e.g., Llama-2) and avoid HuggingFace rate limits:
```shell
export HF_TOKEN=your_token_here
```
See HuggingFace access tokens to create a token.
Run BLIS for qwen/qwen3-14b with default configs (auto-fetches model config from HuggingFace):
```shell
./blis run --model qwen/qwen3-14b
```
You should see JSON output on stdout with key fields:
| Field | Description |
|---|---|
| `ttft_mean_ms`, `ttft_p99_ms` | Time to First Token — how long until the first token is generated |
| `e2e_mean_ms`, `e2e_p99_ms` | End-to-End latency — total time from request arrival to final token |
| `itl_mean_ms`, `itl_p99_ms` | Inter-Token Latency — time between consecutive output tokens |
| `responses_per_sec` | Completed requests per second |
| `tokens_per_sec` | Output tokens generated per second |
| `completed_requests` | Number of requests that finished within the simulation window |
| `preemption_count` | Number of times a running request was evicted to make room for others (0 = healthy) |
Run with a ServeGen-informed workload spec:
```shell
./blis run --model qwen/qwen3-14b --workload-spec examples/servegen-language.yaml
```
Simulate a four-instance cluster with weighted-scoring routing:
```shell
./blis run --model qwen/qwen3-14b \
  --num-instances 4 --routing-policy weighted \
  --routing-scorers "prefix-affinity:3,queue-depth:2,kv-utilization:2" \
  --rate 100 --num-requests 500
```
Use pre-trained blackbox coefficients instead of the analytical roofline model:
```shell
./blis run --model qwen/qwen3-14b --latency-model blackbox
```
See the supported models catalog for models with pre-trained coefficients.
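To make the scorer weights concrete: a weighted-scoring router computes a weighted sum of per-instance scores and routes to the argmax. The `Instance` struct and `pickWeighted` below are illustrative stand-ins, not the simulator's actual interfaces:

```go
package main

import "fmt"

// Instance holds normalized scores in [0,1] for one replica, named
// after the scorers passed to --routing-scorers. These fields are a
// sketch; the real simulator defines its own scorer types.
type Instance struct {
	Name                                      string
	PrefixAffinity, QueueDepth, KVUtilization float64
}

// pickWeighted routes to the instance with the highest weighted sum,
// mirroring a spec like "prefix-affinity:3,queue-depth:2,kv-utilization:2".
func pickWeighted(instances []Instance, wPrefix, wQueue, wKV float64) Instance {
	best := instances[0]
	bestScore := -1.0
	for _, in := range instances {
		s := wPrefix*in.PrefixAffinity + wQueue*in.QueueDepth + wKV*in.KVUtilization
		if s > bestScore {
			bestScore = s
			best = in
		}
	}
	return best
}

func main() {
	pool := []Instance{
		{"i0", 0.9, 0.2, 0.3}, // warm prefix cache, but busy
		{"i1", 0.1, 0.9, 0.9}, // cold cache, mostly idle
	}
	// With weights 3/2/2, idle capacity outweighs cache affinity here.
	fmt.Println(pickWeighted(pool, 3, 2, 2).Name)
}
```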
Record timing from a real inference server into a TraceV2 file:
```shell
./blis observe --server-url http://localhost:8000 --model qwen/qwen3-14b \
  --workload-spec workload.yaml \
  --trace-header trace.yaml --trace-data trace.csv
```
For servers exposing `/v1/chat/completions` (most production vLLM/SGLang deployments), use `--api-format chat` and optionally account for network round-trip time:
```shell
./blis observe --server-url http://localhost:8000 --model qwen/qwen3-14b \
  --api-format chat --rtt-ms 2.5 \
  --workload-spec workload.yaml \
  --trace-header trace.yaml --trace-data trace.csv
```
See Workload Specifications for the workload spec YAML schema.
Replay a captured TraceV2 file through the discrete-event simulator:
```shell
./blis replay --trace-header t.yaml --trace-data d.csv --model qwen/qwen3-14b
```
To produce per-request results for calibration, add `--results-path`:
```shell
./blis replay --trace-header t.yaml --trace-data d.csv --model qwen/qwen3-14b \
  --results-path results.json
```
Compare real observed latencies against simulator predictions (using the per-request results from `blis replay --results-path`):
```shell
./blis calibrate --trace-header t.yaml --trace-data d.csv \
  --sim-results results.json --report calibration.json
```
```shell
# Generate a v2 workload spec YAML from a built-in preset
./blis convert preset --name chatbot --rate 10 --num-requests 100

# Import a ServeGen dataset directory (requires your own ServeGen data/)
./blis convert servegen --path data/

# Import an inference-perf workload spec
./blis convert inference-perf --spec spec.yaml
```
Merge workload spec YAMLs produced by `blis convert` or written by hand (see Workload Specifications):
```shell
./blis compose --from spec1.yaml --from spec2.yaml
```
For comprehensive usage guides, see the Documentation section below.
BLIS has a comprehensive documentation site built with MkDocs Material:
| Section | Description |
|---|---|
| Getting Started | Installation, quick start, capacity planning tutorial |
| User Guide | Routing policies, KV cache, roofline mode, workloads, cluster simulation, interpreting results |
| Concepts | Architecture, core engine, roofline estimation, glossary |
| Reference | CLI flag reference, supported models, workload spec YAML schema |
| Methodology | Strategy Evolution methodology, discovered principles |
| Contributing | Extension recipes, PR workflow, design process, standards |
For the authoritative file-level architecture documentation with interface names, method signatures, and module descriptions, see CLAUDE.md.
Click to expand full directory tree
inference-sim/
├── main.go # CLI entry point
├── cmd/ # CLI commands
│ ├── root.go # CLI commands and flags (--num-instances, --policy-config, --routing-scorers, --workload-spec, --latency-model, etc.)
│ ├── replay.go # `blis replay` command: replays TraceV2 file through DES
│ ├── calibrate.go # `blis calibrate` command: compares real vs simulated latencies
│ ├── observe.go # Real-mode HTTP client (RealClient with functional options); Recorder for TraceV2 output
│ ├── observe_cmd.go # `blis observe` command: flags, prefix string generation, dispatch orchestrator
│ ├── convert.go # `blis convert` subcommands (servegen, preset, inference-perf)
│ ├── compose.go # `blis compose` for merging v2 specs
│ ├── hfconfig.go # HuggingFace config resolution (--latency-model auto-fetch into model_configs/)
│ └── default_config.go # defaults.yaml loading (includes GetHFRepo for HF repo mapping)
├── sim/ # Core simulation engine
│ ├── config.go # Module-scoped sub-config types (R16)
│ ├── doc.go # Package reading guide
│ ├── simulator.go # Discrete-event simulation loop
│ ├── admission.go # Admission policy interface and templates
│ ├── routing.go # Routing policy interface and templates
│ ├── routing_scorers.go # ScorerConfig, stateless scorers, ParseScorerConfigs
│ ├── routing_prefix_scorer.go # Prefix-affinity scorer + observer
│ ├── prefix_cache_index.go # PrefixCacheIndex: per-instance LRU of block hashes
│ ├── priority.go # Priority policy interface and templates
│ ├── scheduler.go # Instance scheduler interface and templates
│ ├── latency_model.go # LatencyModel interface and registration
│ ├── router_state.go # RouterState bridge type for cluster-level policies
│ ├── bundle.go # PolicyBundle YAML configuration
│ ├── event.go # Event types (Arrival, Queued, Step, Scheduled, Preemption, RequestLeft)
│ ├── kv_store.go # KVStore interface and registration variables
│ ├── batch.go # Batch struct
│ ├── batch_formation.go # BatchFormation interface, VLLMBatchFormation
│ ├── queue.go # FIFO wait queue
│ ├── request.go # Request lifecycle
│ ├── metrics.go # TTFT, TPOT, E2E collection
│ ├── metrics_utils.go # MetricsOutput JSON struct, percentile calculations
│ ├── rng.go # PartitionedRNG for deterministic simulation
│ ├── model_hardware_config.go # ModelConfig, HardwareCalib structs
│ └── internal/ # Shared internal packages (hash, testutil, util)
├── sim/kv/ # KV cache implementations
│ ├── cache.go # KVCacheState (single-tier GPU)
│ ├── tiered.go # TieredKVCache (GPU+CPU)
│ └── register.go # NewKVStore factory + init()-based registration into sim/
├── sim/latency/ # Latency model implementations
│ ├── latency.go # RooflineLatencyModel, BlackboxLatencyModel, CrossModelLatencyModel, NewLatencyModel factory
│ ├── trained_roofline.go # TrainedRooflineLatencyModel: roofline basis functions × learned corrections
│ ├── crossmodel.go # CrossModelLatencyModel: physics-informed step time from architecture features (MoE-aware)
│ ├── roofline.go # Analytical FLOPs/bandwidth latency estimation
│ ├── config.go # HFConfig, GetHWConfig, GetModelConfig, ValidateRooflineConfig
│ ├── kv_capacity.go # KV cache block auto-calculation from model architecture + GPU memory
│ └── register.go # init()-based registration into sim/
├── sim/cluster/ # Multi-replica cluster simulation
│ ├── cluster.go # Shared-clock event loop, online routing
│ ├── instance.go # Per-instance simulator wrapper
│ ├── cluster_event.go # Cluster-level event types
│ ├── snapshot.go # Instance observability snapshots
│ ├── metrics.go # RawMetrics, FitnessResult, anomaly detection, per-SLO-class metrics
│ ├── counterfactual.go # Top-k candidate ranking and regret computation
│ ├── deployment.go # DeploymentConfig (embeds SimConfig + cluster fields)
│ └── evaluation.go # EvaluationResult wrapper (metrics + trace + summary)
├── sim/workload/ # ServeGen-informed workload generation
│ ├── spec.go # WorkloadSpec, ClientSpec, ArrivalSpec, DistSpec, YAML loading
│ ├── arrival.go # ArrivalSampler: Poisson, Gamma, Weibull, Constant
│ ├── distribution.go # LengthSampler: Gaussian, Exponential, ParetoLogNormal, EmpiricalPDF, Constant
│ ├── client.go # Rate normalization, prefix group management
│ ├── generator.go # GenerateRequests pipeline with client decomposition
│ ├── servegen.go # Native ServeGen data file loading
│ ├── tracev2.go # Trace v2 format (YAML header + CSV data)
│ ├── replay.go # Trace v2 → sim.Request with synthetic token IDs
│ ├── calibrate.go # CalibrationReport, MAPE, Pearson r
│ ├── multimodal.go # Multimodal token generation (text+image+audio+video)
│ ├── reasoning.go # Reasoning multi-turn with context accumulation
│ ├── session.go # SessionManager: closed-loop session tracking, follow-up round generation
│ ├── network.go # Client-perspective latency (RTT + bandwidth)
│ ├── inference_perf.go # inference-perf format loading and validation
│ ├── scenarios.go # Built-in presets (bursty, unfair, prefix-heavy, mixed-slo)
│ ├── cohort.go # CohortSpec expansion: diurnal, spike, drain patterns
│ ├── convert.go # Format converters: ConvertServeGen, ConvertPreset, ComposeSpecs
│ └── synthesis.go # Flag-to-spec synthesis: SynthesizeFromDistribution, SynthesizeFromPreset
├── sim/trace/ # Decision trace recording
│ ├── trace.go # TraceLevel, TraceConfig, SimulationTrace
│ ├── record.go # AdmissionRecord, RoutingRecord, CandidateScore
│ └── summary.go # TraceSummary, Summarize()
├── examples/ # Example configuration files
│ ├── policy-config.yaml
│ ├── weighted-routing.yaml
│ ├── routing-comparison.sh
│ ├── servegen-language.yaml
│ ├── prefix-affinity-demo.yaml
│ ├── multiturn-chat-demo.yaml
│ ├── epp-estimate-prefix.yaml
│ ├── epp-precise-prefix.yaml
│ ├── inference-perf-shared-prefix.yaml
│ ├── regression_workload_cache_warmup.yaml
│ ├── regression_workload_load_spikes.yaml
│ └── regression_workload_multiturn.yaml
├── model_configs/ # Auto-fetched HuggingFace config.json files (gitignored)
├── defaults.yaml # Pre-trained coefficients, model defaults
├── hardware_config.json # GPU hardware specifications
├── docs/ # Documentation (MkDocs Material site)
│ ├── getting-started/ # New user onboarding
│ ├── guide/ # Task-oriented user guides
│ ├── concepts/ # Architecture and design documentation
│ ├── reference/ # Configuration and model reference
│ ├── methodology/ # Research methodology
│ ├── contributing/ # Contributor documentation
│ └── plans/ # Active implementation plans
└── mkdocs.yml # MkDocs Material site configuration
Contributions are welcome! See CONTRIBUTING.md for the engineering standards, development workflow, and step-by-step guides for adding new components. For ongoing work and architectural decisions, see docs/plans/.
This project is licensed under the Apache License, Version 2.0. See LICENSE for details.