Skip to content

inference-sim/inference-sim

Repository files navigation

Blackbox Inference Simulator (BLIS)

A discrete-event simulator for LLM inference serving systems. BLIS models multi-instance clusters with configurable admission control, request routing, KV-cache dynamics (including tiered GPU+CPU offloading), scheduling policies, and token generation — all driven by trained performance coefficients, analytical roofline estimates, or physics-informed cross-model prediction.

The simulator is CPU-only, deterministic, and designed for capacity planning, policy optimization research, and performance prediction across model/GPU/TP configurations without requiring real GPUs.


Features

Core

  • Discrete-event simulation for prefill, decode, and request scheduling
  • KV-cache modeling (blocks, prefix caching, prefill chunking, tiered GPU+CPU offload)
  • CPU-only inference cost model via analytical roofline estimation or learned α/β coefficients
  • Four latency estimation modes: roofline (default, analytical), blackbox (data-driven), cross-model (physics-informed, MoE-aware), and trained-roofline (roofline × learned corrections)
  • Multi-instance cluster simulation with shared-clock event loop and pluggable routing (round-robin, least-loaded, weighted-scoring)
  • Multiple workload types: preset (chatbot, contentgen, summarization, multidoc), custom distributions, or trace replay

Advanced

  • Any HuggingFace model: dense (Llama-2, Qwen3, etc.) and MoE (Mixtral, etc.) — auto-fetches model config on first run
  • vLLM deployment configuration (TP, chunk size, batch limits)
  • Priority policies and instance schedulers: constant, slo-based; fcfs, priority-fcfs, sjf
  • Admission control: always-admit or token-bucket rate limiting
  • YAML policy configuration: define all policies in a single config file (--policy-config)
  • ServeGen-informed workload generation: multi-client specs with Poisson/Gamma/Weibull/Constant arrivals (--workload-spec)
  • Decision tracing and counterfactual analysis: record routing decisions and evaluate alternative choices
  • Fitness evaluation: weighted multi-objective scoring with configurable metric weights
  • Per-SLO-class metrics: breakdown by SLO class with Jain fairness index

Installation

Requirements:

  • Go ≥ 1.21

Build the binary:

git clone https://github.com/inference-sim/inference-sim.git
cd inference-sim
go build -o blis main.go

Note: On first run, BLIS auto-fetches the model's config.json from HuggingFace (~1 second for public models like Qwen3). Subsequent runs use the cached config in model_configs/. For offline use, pass --latency-model blackbox (uses pre-trained coefficients, no network needed).

Environment setup (optional):

Set HF_TOKEN to access gated models (e.g., Llama-2) and avoid HuggingFace rate limits.

export HF_TOKEN=your_token_here

See HuggingFace access tokens to create a token.


Quick Start

Run BLIS for qwen/qwen3-14b with default configs (auto-fetches model config from HuggingFace):

./blis run --model qwen/qwen3-14b

You should see JSON output on stdout with key fields:

Field Description
ttft_mean_ms, ttft_p99_ms Time to First Token — how long until the first token is generated
e2e_mean_ms, e2e_p99_ms End-to-End latency — total time from request arrival to final token
itl_mean_ms, itl_p99_ms Inter-Token Latency — time between consecutive output tokens
responses_per_sec Completed requests per second
tokens_per_sec Output tokens generated per second
completed_requests Number of requests that finished within the simulation window
preemption_count Number of times a running request was evicted to make room for others (0 = healthy)

Usage

Multi-client workload specification

./blis run --model qwen/qwen3-14b --workload-spec examples/servegen-language.yaml

Cluster simulation with weighted routing

./blis run --model qwen/qwen3-14b \
  --num-instances 4 --routing-policy weighted \
  --routing-scorers "prefix-affinity:3,queue-depth:2,kv-utilization:2" \
  --rate 100 --num-requests 500

Blackbox mode (explicit trained coefficients)

./blis run --model qwen/qwen3-14b --latency-model blackbox

See the supported models catalog for models with pre-trained coefficients.

Observe real server latency

Record timing from a real inference server into a TraceV2 file:

./blis observe --server-url http://localhost:8000 --model qwen/qwen3-14b \
  --workload-spec workload.yaml \
  --trace-header trace.yaml --trace-data trace.csv

For servers exposing /v1/chat/completions (most production vLLM/SGLang deployments), use --api-format chat and optionally account for network round-trip time:

./blis observe --server-url http://localhost:8000 --model qwen/qwen3-14b \
  --api-format chat --rtt-ms 2.5 \
  --workload-spec workload.yaml \
  --trace-header trace.yaml --trace-data trace.csv

See Workload Specifications for the workload spec YAML schema.

Replay traces through simulator

Replay a captured TraceV2 file through the discrete-event simulator:

./blis replay --trace-header t.yaml --trace-data d.csv --model qwen/qwen3-14b

To produce per-request results for calibration, add --results-path:

./blis replay --trace-header t.yaml --trace-data d.csv --model qwen/qwen3-14b \
  --results-path results.json

Calibrate simulator accuracy

Compare real observed latencies against simulator predictions (using the per-request results from blis replay --results-path):

./blis calibrate --trace-header t.yaml --trace-data d.csv \
  --sim-results results.json --report calibration.json

Convert workload formats

# Generate a v2 workload spec YAML from a built-in preset
./blis convert preset --name chatbot --rate 10 --num-requests 100

# Import a ServeGen dataset directory (requires your own ServeGen data/)
./blis convert servegen --path data/

# Import an inference-perf workload spec
./blis convert inference-perf --spec spec.yaml

Compose multiple workload specs

Merge workload spec YAMLs produced by blis convert or written by hand (see Workload Specifications):

./blis compose --from spec1.yaml --from spec2.yaml

For comprehensive usage guides, see the Documentation section below.


Documentation

BLIS has a comprehensive documentation site built with MkDocs Material:

Section Description
Getting Started Installation, quick start, capacity planning tutorial
User Guide Routing policies, KV cache, roofline mode, workloads, cluster simulation, interpreting results
Concepts Architecture, core engine, roofline estimation, glossary
Reference CLI flag reference, supported models, workload spec YAML schema
Methodology Strategy Evolution methodology, discovered principles
Contributing Extension recipes, PR workflow, design process, standards

Project Structure

For the authoritative file-level architecture documentation with interface names, method signatures, and module descriptions, see CLAUDE.md.

Click to expand full directory tree
inference-sim/
├── main.go                 # CLI entry point
├── cmd/                    # CLI commands
│   ├── root.go             # CLI commands and flags (--num-instances, --policy-config, --routing-scorers, --workload-spec, --latency-model, etc.)
│   ├── replay.go           # `blis replay` command: replays TraceV2 file through DES
│   ├── calibrate.go        # `blis calibrate` command: compares real vs simulated latencies
│   ├── observe.go          # Real-mode HTTP client (RealClient with functional options); Recorder for TraceV2 output
│   ├── observe_cmd.go      # `blis observe` command: flags, prefix string generation, dispatch orchestrator
│   ├── convert.go          # `blis convert` subcommands (servegen, preset, inference-perf)
│   ├── compose.go          # `blis compose` for merging v2 specs
│   ├── hfconfig.go         # HuggingFace config resolution (--latency-model auto-fetch into model_configs/)
│   └── default_config.go   # defaults.yaml loading (includes GetHFRepo for HF repo mapping)
├── sim/                    # Core simulation engine
│   ├── config.go           # Module-scoped sub-config types (R16)
│   ├── doc.go              # Package reading guide
│   ├── simulator.go        # Discrete-event simulation loop
│   ├── admission.go        # Admission policy interface and templates
│   ├── routing.go          # Routing policy interface and templates
│   ├── routing_scorers.go  # ScorerConfig, stateless scorers, ParseScorerConfigs
│   ├── routing_prefix_scorer.go # Prefix-affinity scorer + observer
│   ├── prefix_cache_index.go # PrefixCacheIndex: per-instance LRU of block hashes
│   ├── priority.go         # Priority policy interface and templates
│   ├── scheduler.go        # Instance scheduler interface and templates
│   ├── latency_model.go    # LatencyModel interface and registration
│   ├── router_state.go     # RouterState bridge type for cluster-level policies
│   ├── bundle.go           # PolicyBundle YAML configuration
│   ├── event.go            # Event types (Arrival, Queued, Step, Scheduled, Preemption, RequestLeft)
│   ├── kv_store.go         # KVStore interface and registration variables
│   ├── batch.go            # Batch struct
│   ├── batch_formation.go  # BatchFormation interface, VLLMBatchFormation
│   ├── queue.go            # FIFO wait queue
│   ├── request.go          # Request lifecycle
│   ├── metrics.go          # TTFT, TPOT, E2E collection
│   ├── metrics_utils.go    # MetricsOutput JSON struct, percentile calculations
│   ├── rng.go              # PartitionedRNG for deterministic simulation
│   ├── model_hardware_config.go  # ModelConfig, HardwareCalib structs
│   └── internal/           # Shared internal packages (hash, testutil, util)
├── sim/kv/                 # KV cache implementations
│   ├── cache.go            # KVCacheState (single-tier GPU)
│   ├── tiered.go           # TieredKVCache (GPU+CPU)
│   └── register.go         # NewKVStore factory + init()-based registration into sim/
├── sim/latency/            # Latency model implementations
│   ├── latency.go          # RooflineLatencyModel, BlackboxLatencyModel, CrossModelLatencyModel, NewLatencyModel factory
│   ├── trained_roofline.go # TrainedRooflineLatencyModel: roofline basis functions × learned corrections
│   ├── crossmodel.go       # CrossModelLatencyModel: physics-informed step time from architecture features (MoE-aware)
│   ├── roofline.go         # Analytical FLOPs/bandwidth latency estimation
│   ├── config.go           # HFConfig, GetHWConfig, GetModelConfig, ValidateRooflineConfig
│   ├── kv_capacity.go      # KV cache block auto-calculation from model architecture + GPU memory
│   └── register.go         # init()-based registration into sim/
├── sim/cluster/            # Multi-replica cluster simulation
│   ├── cluster.go          # Shared-clock event loop, online routing
│   ├── instance.go         # Per-instance simulator wrapper
│   ├── cluster_event.go    # Cluster-level event types
│   ├── snapshot.go         # Instance observability snapshots
│   ├── metrics.go          # RawMetrics, FitnessResult, anomaly detection, per-SLO-class metrics
│   ├── counterfactual.go   # Top-k candidate ranking and regret computation
│   ├── deployment.go       # DeploymentConfig (embeds SimConfig + cluster fields)
│   └── evaluation.go       # EvaluationResult wrapper (metrics + trace + summary)
├── sim/workload/           # ServeGen-informed workload generation
│   ├── spec.go             # WorkloadSpec, ClientSpec, ArrivalSpec, DistSpec, YAML loading
│   ├── arrival.go          # ArrivalSampler: Poisson, Gamma, Weibull, Constant
│   ├── distribution.go     # LengthSampler: Gaussian, Exponential, ParetoLogNormal, EmpiricalPDF, Constant
│   ├── client.go           # Rate normalization, prefix group management
│   ├── generator.go        # GenerateRequests pipeline with client decomposition
│   ├── servegen.go         # Native ServeGen data file loading
│   ├── tracev2.go          # Trace v2 format (YAML header + CSV data)
│   ├── replay.go           # Trace v2 → sim.Request with synthetic token IDs
│   ├── calibrate.go        # CalibrationReport, MAPE, Pearson r
│   ├── multimodal.go       # Multimodal token generation (text+image+audio+video)
│   ├── reasoning.go        # Reasoning multi-turn with context accumulation
│   ├── session.go          # SessionManager: closed-loop session tracking, follow-up round generation
│   ├── network.go          # Client-perspective latency (RTT + bandwidth)
│   ├── inference_perf.go   # inference-perf format loading and validation
│   ├── scenarios.go        # Built-in presets (bursty, unfair, prefix-heavy, mixed-slo)
│   ├── cohort.go           # CohortSpec expansion: diurnal, spike, drain patterns
│   ├── convert.go          # Format converters: ConvertServeGen, ConvertPreset, ComposeSpecs
│   └── synthesis.go        # Flag-to-spec synthesis: SynthesizeFromDistribution, SynthesizeFromPreset
├── sim/trace/              # Decision trace recording
│   ├── trace.go            # TraceLevel, TraceConfig, SimulationTrace
│   ├── record.go           # AdmissionRecord, RoutingRecord, CandidateScore
│   └── summary.go          # TraceSummary, Summarize()
├── examples/               # Example configuration files
│   ├── policy-config.yaml
│   ├── weighted-routing.yaml
│   ├── routing-comparison.sh
│   ├── servegen-language.yaml
│   ├── prefix-affinity-demo.yaml
│   ├── multiturn-chat-demo.yaml
│   ├── epp-estimate-prefix.yaml
│   ├── epp-precise-prefix.yaml
│   ├── inference-perf-shared-prefix.yaml
│   ├── regression_workload_cache_warmup.yaml
│   ├── regression_workload_load_spikes.yaml
│   └── regression_workload_multiturn.yaml
├── model_configs/          # Auto-fetched HuggingFace config.json files (gitignored)
├── defaults.yaml           # Pre-trained coefficients, model defaults
├── hardware_config.json    # GPU hardware specifications
├── docs/                   # Documentation (MkDocs Material site)
│   ├── getting-started/    # New user onboarding
│   ├── guide/              # Task-oriented user guides
│   ├── concepts/           # Architecture and design documentation
│   ├── reference/          # Configuration and model reference
│   ├── methodology/        # Research methodology
│   ├── contributing/       # Contributor documentation
│   └── plans/              # Active implementation plans
└── mkdocs.yml              # MkDocs Material site configuration

Contributing

Contributions are welcome! See CONTRIBUTING.md for the engineering standards, development workflow, and step-by-step guides for adding new components. For ongoing work and architectural decisions, see docs/plans/.


License

This project is licensed under the Apache License, Version 2.0. See LICENSE for details.