Blackbox Inference Simulator (BLIS)

A discrete-event simulator for LLM inference serving systems. BLIS models multi-instance clusters with configurable admission control, request routing, KV-cache dynamics (including tiered GPU+CPU offloading), scheduling policies, and token generation — all driven by trained performance coefficients, analytical roofline estimates, or physics-informed cross-model prediction.

The simulator is CPU-only, deterministic, and designed for capacity planning, policy optimization research, and performance prediction across model/GPU/TP configurations without requiring real GPUs.
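
To make "discrete-event" concrete, here is a minimal sketch of an event loop built on a time-ordered priority queue. It is not BLIS's implementation: the `Event` type, the `Run` function, and the rule that each arrival schedules a step 5 ms later are all hypothetical and chosen only for illustration.

```go
package main

import (
	"container/heap"
	"fmt"
)

// Event is a hypothetical timestamped event (not a BLIS type).
type Event struct {
	Time float64 // simulated time in milliseconds
	Name string
}

// eventQueue is a min-heap of events ordered by Time.
type eventQueue []Event

func (q eventQueue) Len() int           { return len(q) }
func (q eventQueue) Less(i, j int) bool { return q[i].Time < q[j].Time }
func (q eventQueue) Swap(i, j int)      { q[i], q[j] = q[j], q[i] }
func (q *eventQueue) Push(x any)        { *q = append(*q, x.(Event)) }
func (q *eventQueue) Pop() any {
	old := *q
	n := len(old)
	e := old[n-1]
	*q = old[:n-1]
	return e
}

// Run pops events in time order and returns them in the order they fired.
// Assumption for illustration: every "arrival" schedules a "step" 5 ms later.
func Run(initial []Event) []Event {
	q := &eventQueue{}
	for _, e := range initial {
		heap.Push(q, e)
	}
	var fired []Event
	for q.Len() > 0 {
		e := heap.Pop(q).(Event)
		fired = append(fired, e)
		if e.Name == "arrival" {
			heap.Push(q, Event{Time: e.Time + 5, Name: "step"})
		}
	}
	return fired
}

func main() {
	for _, e := range Run([]Event{{Time: 10, Name: "arrival"}, {Time: 0, Name: "arrival"}}) {
		fmt.Printf("t=%.0fms %s\n", e.Time, e.Name)
	}
}
```

The simulated clock jumps from one event to the next, so no wall-clock time or GPU work is involved. Events that share a timestamp would need an explicit tie-break rule, which this sketch does not define.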

Features

Core

Discrete-event simulation for prefill, decode, and request scheduling
KV-cache modeling (blocks, prefix caching, prefill chunking, tiered GPU+CPU offload)
CPU-only inference cost model via analytical roofline estimation or learned α/β coefficients
Four latency estimation modes: roofline (default, analytical), blackbox (data-driven), cross-model (physics-informed, MoE-aware), and trained-roofline (roofline × learned corrections)
Multi-instance cluster simulation with shared-clock event loop and pluggable routing (round-robin, least-loaded, weighted-scoring)
Multiple workload types: preset (chatbot, contentgen, summarization, multidoc), custom distributions, or trace replay

Advanced

Any HuggingFace model: dense (Llama-2, Qwen3, etc.) and MoE (Mixtral, etc.) — auto-fetches model config on first run
vLLM deployment configuration (TP, chunk size, batch limits)
Priority policies (constant, slo-based) and instance schedulers (fcfs, priority-fcfs, sjf)
Admission control: always-admit or token-bucket rate limiting
YAML policy configuration: define all policies in a single config file (--policy-config)
ServeGen-informed workload generation: multi-client specs with Poisson/Gamma/Weibull/Constant arrivals (--workload-spec)
Decision tracing and counterfactual analysis: record routing decisions and evaluate alternative choices
Fitness evaluation: weighted multi-objective scoring with configurable metric weights
Per-SLO-class metrics: breakdown by SLO class with Jain fairness index
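
For readers unfamiliar with the Jain fairness index, the sketch below computes the standard formula J = (Σx)² / (n · Σx²). Which per-class quantity BLIS feeds into the index is not stated here, so the inputs in `main` (one value per SLO class) are an assumption, and `JainIndex` is a hypothetical helper rather than a BLIS function.

```go
package main

import "fmt"

// JainIndex returns Jain's fairness index for non-negative allocations xs.
// The result is 1.0 when all values are equal and 1/n when a single value
// receives everything. Returning 1 for all-zero input is a convention
// chosen for this sketch.
func JainIndex(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	var sum, sumSq float64
	for _, x := range xs {
		sum += x
		sumSq += x * x
	}
	if sumSq == 0 {
		return 1
	}
	return sum * sum / (float64(len(xs)) * sumSq)
}

func main() {
	fmt.Println(JainIndex([]float64{1, 1, 1, 1})) // equal shares
	fmt.Println(JainIndex([]float64{1, 0, 0, 0})) // one class gets everything
}
```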

Installation

Requirements:

Go ≥ 1.21

Build the binary:

git clone https://github.com/inference-sim/inference-sim.git
cd inference-sim
go build -o blis main.go

Note: On first run, BLIS auto-fetches the model's config.json from HuggingFace (~1 second for public models like Qwen3). Subsequent runs use the cached config in model_configs/. For offline use, pass --latency-model blackbox (uses pre-trained coefficients, no network needed).

Environment setup (optional):

Set HF_TOKEN to access gated models (e.g., Llama-2) and avoid HuggingFace rate limits.

export HF_TOKEN=your_token_here

See HuggingFace access tokens to create a token.

Quick Start

Run BLIS for qwen/qwen3-14b with default configs (auto-fetches model config from HuggingFace):

./blis run --model qwen/qwen3-14b

You should see JSON output on stdout with key fields:

| Field | Description |
|---|---|
| `ttft_mean_ms`, `ttft_p99_ms` | Time to First Token: how long until the first token is generated |
| `e2e_mean_ms`, `e2e_p99_ms` | End-to-End latency: total time from request arrival to final token |
| `itl_mean_ms`, `itl_p99_ms` | Inter-Token Latency: time between consecutive output tokens |
| `responses_per_sec` | Completed requests per second |
| `tokens_per_sec` | Output tokens generated per second |
| `completed_requests` | Number of requests that finished within the simulation window |
| `preemption_count` | Number of times a running request was evicted to make room for others (0 = healthy) |
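
The `*_mean_ms` and `*_p99_ms` fields are summary statistics over per-request samples. The sketch below shows one common way to compute them, using the nearest-rank percentile definition. BLIS may use a different percentile method; `Mean` and `Percentile` are hypothetical helpers, and the sample values are made up.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Mean returns the arithmetic mean, or 0 for an empty slice.
func Mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	var sum float64
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// Percentile returns the p-th percentile (0 < p <= 100) using the
// nearest-rank method: the smallest sample such that at least p percent
// of samples are less than or equal to it. The input is not modified.
func Percentile(xs []float64, p float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sorted := append([]float64(nil), xs...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

func main() {
	ttftMs := []float64{40, 10, 30, 20} // made-up TTFT samples in ms
	fmt.Printf("mean=%.1f p50=%.1f p99=%.1f\n",
		Mean(ttftMs), Percentile(ttftMs, 50), Percentile(ttftMs, 99))
}
```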

Usage

Multi-client workload specification

./blis run --model qwen/qwen3-14b --workload-spec examples/servegen-language.yaml
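
Workload specs can request Poisson arrivals (see Features). As a sketch of what that means, the code below draws exponentially distributed inter-arrival gaps from a seeded generator, so the same seed always yields the same timestamps. `PoissonArrivals` is a hypothetical helper, not the BLIS `ArrivalSampler`, and the rate and seed are arbitrary.

```go
package main

import (
	"fmt"
	"math/rand"
)

// PoissonArrivals returns n arrival timestamps in seconds for a Poisson
// process with the given rate (requests per second). Inter-arrival gaps
// are exponential with mean 1/rate. A fixed seed makes the output
// reproducible across runs.
func PoissonArrivals(rate float64, n int, seed int64) []float64 {
	rng := rand.New(rand.NewSource(seed))
	times := make([]float64, 0, n)
	t := 0.0
	for i := 0; i < n; i++ {
		t += rng.ExpFloat64() / rate
		times = append(times, t)
	}
	return times
}

func main() {
	for _, t := range PoissonArrivals(10, 5, 42) {
		fmt.Printf("%.4f s\n", t)
	}
}
```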

Cluster simulation with weighted routing

./blis run --model qwen/qwen3-14b \
  --num-instances 4 --routing-policy weighted \
  --routing-scorers "prefix-affinity:3,queue-depth:2,kv-utilization:2" \
  --rate 100 --num-requests 500
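
The `name:weight` pairs passed to `--routing-scorers` suggest a weighted sum over per-instance scores. The sketch below shows one plausible reading: parse the spec, then route to the instance with the highest weighted total. How BLIS actually normalizes and combines scorer outputs is not described here, so `WeightedScorer`, `ParseWeightSpec`, `PickInstance`, and the score values are all hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// WeightedScorer pairs a scorer name with its weight (hypothetical type).
type WeightedScorer struct {
	Name   string
	Weight float64
}

// ParseWeightSpec parses "name:weight,name:weight" into an ordered slice.
func ParseWeightSpec(spec string) ([]WeightedScorer, error) {
	var out []WeightedScorer
	for _, part := range strings.Split(spec, ",") {
		name, raw, ok := strings.Cut(strings.TrimSpace(part), ":")
		if !ok {
			return nil, fmt.Errorf("bad scorer %q, want name:weight", part)
		}
		w, err := strconv.ParseFloat(raw, 64)
		if err != nil {
			return nil, fmt.Errorf("bad weight in %q: %w", part, err)
		}
		out = append(out, WeightedScorer{Name: name, Weight: w})
	}
	return out, nil
}

// PickInstance returns the index of the instance with the highest weighted
// sum, or -1 if there are no instances. Assumption: scores[i][name] is in
// [0, 1] and higher is better. Ties go to the lowest index.
func PickInstance(ws []WeightedScorer, scores []map[string]float64) int {
	best, bestTotal := -1, 0.0
	for i, s := range scores {
		total := 0.0
		for _, w := range ws {
			total += w.Weight * s[w.Name]
		}
		if best == -1 || total > bestTotal {
			best, bestTotal = i, total
		}
	}
	return best
}

func main() {
	ws, err := ParseWeightSpec("prefix-affinity:3,queue-depth:2,kv-utilization:2")
	if err != nil {
		panic(err)
	}
	scores := []map[string]float64{
		{"prefix-affinity": 1.0, "queue-depth": 0.2, "kv-utilization": 0.5},
		{"prefix-affinity": 0.0, "queue-depth": 0.9, "kv-utilization": 0.9},
	}
	fmt.Println("route to instance", PickInstance(ws, scores))
}
```

With these made-up scores, the prefix-cache hit on instance 0 outweighs the lighter load on instance 1, because prefix-affinity carries the largest weight.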

Blackbox mode (explicit trained coefficients)

./blis run --model qwen/qwen3-14b --latency-model blackbox

See the supported models catalog for models with pre-trained coefficients.

Observe real server latency

Record timing from a real inference server into a TraceV2 file:

./blis observe --server-url http://localhost:8000 --model qwen/qwen3-14b \
  --workload-spec workload.yaml \
  --trace-header trace.yaml --trace-data trace.csv

For servers exposing /v1/chat/completions (most production vLLM/SGLang deployments), use --api-format chat and optionally account for network round-trip time:

./blis observe --server-url http://localhost:8000 --model qwen/qwen3-14b \
  --api-format chat --rtt-ms 2.5 \
  --workload-spec workload.yaml \
  --trace-header trace.yaml --trace-data trace.csv

See Workload Specifications for the workload spec YAML schema.

Replay traces through simulator

Replay a captured TraceV2 file through the discrete-event simulator:

./blis replay --trace-header t.yaml --trace-data d.csv --model qwen/qwen3-14b

To produce per-request results for calibration, add --results-path:

./blis replay --trace-header t.yaml --trace-data d.csv --model qwen/qwen3-14b \
  --results-path results.json

Calibrate simulator accuracy

Compare real observed latencies against simulator predictions (using the per-request results from blis replay --results-path):

./blis calibrate --trace-header t.yaml --trace-data d.csv \
  --sim-results results.json --report calibration.json
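
The project structure lists MAPE and Pearson r as calibration statistics. The sketch below shows the standard definitions of both, applied to paired observed and simulated latencies. It is not the BLIS `CalibrationReport` code: `MAPE` and `PearsonR` are hypothetical helpers, the sample data is made up, and both functions assume equal-length inputs.

```go
package main

import (
	"fmt"
	"math"
)

// MAPE returns the mean absolute percentage error of predicted against
// observed, in percent. Pairs with a zero observed value are skipped.
// Assumes len(observed) == len(predicted).
func MAPE(observed, predicted []float64) float64 {
	var sum float64
	n := 0
	for i := range observed {
		if observed[i] == 0 {
			continue
		}
		sum += math.Abs((predicted[i] - observed[i]) / observed[i])
		n++
	}
	if n == 0 {
		return 0
	}
	return 100 * sum / float64(n)
}

// PearsonR returns the Pearson correlation coefficient of x and y, or 0
// if either series is empty or has zero variance.
// Assumes len(x) == len(y).
func PearsonR(x, y []float64) float64 {
	if len(x) == 0 {
		return 0
	}
	n := float64(len(x))
	var mx, my float64
	for i := range x {
		mx += x[i]
		my += y[i]
	}
	mx /= n
	my /= n
	var cov, vx, vy float64
	for i := range x {
		dx, dy := x[i]-mx, y[i]-my
		cov += dx * dy
		vx += dx * dx
		vy += dy * dy
	}
	if vx == 0 || vy == 0 {
		return 0
	}
	return cov / math.Sqrt(vx*vy)
}

func main() {
	observedMs := []float64{100, 200, 400}  // made-up real latencies
	simulatedMs := []float64{110, 180, 400} // made-up simulator predictions
	fmt.Printf("MAPE=%.2f%% r=%.4f\n",
		MAPE(observedMs, simulatedMs), PearsonR(observedMs, simulatedMs))
}
```

MAPE captures how far predictions are from observations on average, while Pearson r captures whether they move together; a simulator can score well on one and poorly on the other.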

Convert workload formats

# Generate a v2 workload spec YAML from a built-in preset
./blis convert preset --name chatbot --rate 10 --num-requests 100

# Import a ServeGen dataset directory (requires your own ServeGen data/)
./blis convert servegen --path data/

# Import an inference-perf workload spec
./blis convert inference-perf --spec spec.yaml

Compose multiple workload specs

Merge workload spec YAMLs produced by blis convert or written by hand (see Workload Specifications):

./blis compose --from spec1.yaml --from spec2.yaml

For comprehensive usage guides, see the Documentation section below.

Documentation

BLIS has a comprehensive documentation site built with MkDocs Material:

| Section | Description |
|---|---|
| Getting Started | Installation, quick start, capacity planning tutorial |
| User Guide | Routing policies, KV cache, roofline mode, workloads, cluster simulation, interpreting results |
| Concepts | Architecture, core engine, roofline estimation, glossary |
| Reference | CLI flag reference, supported models, workload spec YAML schema |
| Methodology | Strategy Evolution methodology, discovered principles |
| Contributing | Extension recipes, PR workflow, design process, standards |

Project Structure

For the authoritative file-level architecture documentation with interface names, method signatures, and module descriptions, see CLAUDE.md.

inference-sim/
├── main.go                 # CLI entry point
├── cmd/                    # CLI commands
│   ├── root.go             # CLI commands and flags (--num-instances, --policy-config, --routing-scorers, --workload-spec, --latency-model, etc.)
│   ├── replay.go           # `blis replay` command: replays TraceV2 file through DES
│   ├── calibrate.go        # `blis calibrate` command: compares real vs simulated latencies
│   ├── observe.go          # Real-mode HTTP client (RealClient with functional options); Recorder for TraceV2 output
│   ├── observe_cmd.go      # `blis observe` command: flags, prefix string generation, dispatch orchestrator
│   ├── convert.go          # `blis convert` subcommands (servegen, preset, inference-perf)
│   ├── compose.go          # `blis compose` for merging v2 specs
│   ├── hfconfig.go         # HuggingFace config resolution (--latency-model auto-fetch into model_configs/)
│   └── default_config.go   # defaults.yaml loading (includes GetHFRepo for HF repo mapping)
├── sim/                    # Core simulation engine
│   ├── config.go           # Module-scoped sub-config types (R16)
│   ├── doc.go              # Package reading guide
│   ├── simulator.go        # Discrete-event simulation loop
│   ├── admission.go        # Admission policy interface and templates
│   ├── routing.go          # Routing policy interface and templates
│   ├── routing_scorers.go  # ScorerConfig, stateless scorers, ParseScorerConfigs
│   ├── routing_prefix_scorer.go # Prefix-affinity scorer + observer
│   ├── prefix_cache_index.go # PrefixCacheIndex: per-instance LRU of block hashes
│   ├── priority.go         # Priority policy interface and templates
│   ├── scheduler.go        # Instance scheduler interface and templates
│   ├── latency_model.go    # LatencyModel interface and registration
│   ├── router_state.go     # RouterState bridge type for cluster-level policies
│   ├── bundle.go           # PolicyBundle YAML configuration
│   ├── event.go            # Event types (Arrival, Queued, Step, Scheduled, Preemption, RequestLeft)
│   ├── kv_store.go         # KVStore interface and registration variables
│   ├── batch.go            # Batch struct
│   ├── batch_formation.go  # BatchFormation interface, VLLMBatchFormation
│   ├── queue.go            # FIFO wait queue
│   ├── request.go          # Request lifecycle
│   ├── metrics.go          # TTFT, TPOT, E2E collection
│   ├── metrics_utils.go    # MetricsOutput JSON struct, percentile calculations
│   ├── rng.go              # PartitionedRNG for deterministic simulation
│   ├── model_hardware_config.go  # ModelConfig, HardwareCalib structs
│   └── internal/           # Shared internal packages (hash, testutil, util)
├── sim/kv/                 # KV cache implementations
│   ├── cache.go            # KVCacheState (single-tier GPU)
│   ├── tiered.go           # TieredKVCache (GPU+CPU)
│   └── register.go         # NewKVStore factory + init()-based registration into sim/
├── sim/latency/            # Latency model implementations
│   ├── latency.go          # RooflineLatencyModel, BlackboxLatencyModel, CrossModelLatencyModel, NewLatencyModel factory
│   ├── trained_roofline.go # TrainedRooflineLatencyModel: roofline basis functions × learned corrections
│   ├── crossmodel.go       # CrossModelLatencyModel: physics-informed step time from architecture features (MoE-aware)
│   ├── roofline.go         # Analytical FLOPs/bandwidth latency estimation
│   ├── config.go           # HFConfig, GetHWConfig, GetModelConfig, ValidateRooflineConfig
│   ├── kv_capacity.go      # KV cache block auto-calculation from model architecture + GPU memory
│   └── register.go         # init()-based registration into sim/
├── sim/cluster/            # Multi-replica cluster simulation
│   ├── cluster.go          # Shared-clock event loop, online routing
│   ├── instance.go         # Per-instance simulator wrapper
│   ├── cluster_event.go    # Cluster-level event types
│   ├── snapshot.go         # Instance observability snapshots
│   ├── metrics.go          # RawMetrics, FitnessResult, anomaly detection, per-SLO-class metrics
│   ├── counterfactual.go   # Top-k candidate ranking and regret computation
│   ├── deployment.go       # DeploymentConfig (embeds SimConfig + cluster fields)
│   └── evaluation.go       # EvaluationResult wrapper (metrics + trace + summary)
├── sim/workload/           # ServeGen-informed workload generation
│   ├── spec.go             # WorkloadSpec, ClientSpec, ArrivalSpec, DistSpec, YAML loading
│   ├── arrival.go          # ArrivalSampler: Poisson, Gamma, Weibull, Constant
│   ├── distribution.go     # LengthSampler: Gaussian, Exponential, ParetoLogNormal, EmpiricalPDF, Constant
│   ├── client.go           # Rate normalization, prefix group management
│   ├── generator.go        # GenerateRequests pipeline with client decomposition
│   ├── servegen.go         # Native ServeGen data file loading
│   ├── tracev2.go          # Trace v2 format (YAML header + CSV data)
│   ├── replay.go           # Trace v2 → sim.Request with synthetic token IDs
│   ├── calibrate.go        # CalibrationReport, MAPE, Pearson r
│   ├── multimodal.go       # Multimodal token generation (text+image+audio+video)
│   ├── reasoning.go        # Reasoning multi-turn with context accumulation
│   ├── session.go          # SessionManager: closed-loop session tracking, follow-up round generation
│   ├── network.go          # Client-perspective latency (RTT + bandwidth)
│   ├── inference_perf.go   # inference-perf format loading and validation
│   ├── scenarios.go        # Built-in presets (bursty, unfair, prefix-heavy, mixed-slo)
│   ├── cohort.go           # CohortSpec expansion: diurnal, spike, drain patterns
│   ├── convert.go          # Format converters: ConvertServeGen, ConvertPreset, ComposeSpecs
│   └── synthesis.go        # Flag-to-spec synthesis: SynthesizeFromDistribution, SynthesizeFromPreset
├── sim/trace/              # Decision trace recording
│   ├── trace.go            # TraceLevel, TraceConfig, SimulationTrace
│   ├── record.go           # AdmissionRecord, RoutingRecord, CandidateScore
│   └── summary.go          # TraceSummary, Summarize()
├── examples/               # Example configuration files
│   ├── policy-config.yaml
│   ├── weighted-routing.yaml
│   ├── routing-comparison.sh
│   ├── servegen-language.yaml
│   ├── prefix-affinity-demo.yaml
│   ├── multiturn-chat-demo.yaml
│   ├── epp-estimate-prefix.yaml
│   ├── epp-precise-prefix.yaml
│   ├── inference-perf-shared-prefix.yaml
│   ├── regression_workload_cache_warmup.yaml
│   ├── regression_workload_load_spikes.yaml
│   └── regression_workload_multiturn.yaml
├── model_configs/          # Auto-fetched HuggingFace config.json files (gitignored)
├── defaults.yaml           # Pre-trained coefficients, model defaults
├── hardware_config.json    # GPU hardware specifications
├── docs/                   # Documentation (MkDocs Material site)
│   ├── getting-started/    # New user onboarding
│   ├── guide/              # Task-oriented user guides
│   ├── concepts/           # Architecture and design documentation
│   ├── reference/          # Configuration and model reference
│   ├── methodology/        # Research methodology
│   ├── contributing/       # Contributor documentation
│   └── plans/              # Active implementation plans
└── mkdocs.yml              # MkDocs Material site configuration

Contributing

Contributions are welcome! See CONTRIBUTING.md for the engineering standards, development workflow, and step-by-step guides for adding new components. For ongoing work and architectural decisions, see docs/plans/.

License

This project is licensed under the Apache License, Version 2.0. See LICENSE for details.
