# Stream2LLM Artifact


Artifact for the MLSys 2026 artifact evaluation of the paper *Stream2LLM: Overlap Context Streaming and Prefill for Reduced Time-to-First-Token*.

This repository contains all scripts, data, and pre-built figures needed to reproduce every figure, table, and inline number in the paper.

## Quick Start

```bash
# Clone the mlsys_artifact branch with data submodule
git clone --recurse-submodules -b mlsys_artifact https://github.com/rajveerb/stream2llm.git
cd stream2llm

# Pull large files in data/ submodule (hosted on HuggingFace with Git LFS)
cd data && git lfs install && git lfs pull && cd ..

# Create and activate conda environment
conda create -n stream2llm python=3.10.9 -y
conda activate stream2llm

# Install pinned Python dependencies
pip install -r requirements.txt

# (Optional) HuggingFace login for tokenizer access (needed for workload stats only)
huggingface-cli login
```
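After the quick start, it can help to confirm that the submodule and LFS payloads actually landed before attempting any reproduction step. The sketch below only checks that a few paths from the repository layout exist; `check_path` is a hypothetical helper, not a script shipped in the repository.

```bash
#!/usr/bin/env bash
# Sanity-check sketch: report whether expected paths exist after cloning.
# check_path is a hypothetical helper, not part of the repository.
check_path() {
  if [ -e "$1" ]; then
    echo "ok: $1"
  else
    echo "MISSING: $1 (did the submodule init / LFS pull succeed?)"
  fi
}

# Paths taken from the repository structure and data organization sections.
for p in requirements.txt data/run_log data/perf_model scripts figures; do
  check_path "$p"
done
```

A `MISSING: data/run_log` line usually means the `data/` submodule was not initialized or `git lfs pull` was skipped.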

## Repository Structure

| Directory | Contents |
| --- | --- |
| `stream2llm/` | Modified vLLM engine with streaming input support |
| `experiments/` | Experiment driver scripts, configs, and SLURM job files |
| `data/` | Git submodule (HuggingFace dataset) with all large data (run logs, workload traces, perf models). Clone with `--recurse-submodules` or run `git submodule update --init` after cloning. |
| `scripts/` | Plotting and analysis scripts |
| `figures/` | Generated plots and figures from the paper |
| `figures/reference/` | Pre-built reference figures from the paper (for comparison) |
| `tables/` | Generated table data and analysis results |
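Since `figures/reference/` ships a pre-built copy of every paper figure, a small helper can report which figures have been regenerated locally and which are still missing. `compare_figures` below is a sketch, not a script in the repository; it assumes generated and reference figures share filenames.

```bash
#!/usr/bin/env bash
# Sketch: for each pre-built reference figure, report whether a locally
# generated counterpart with the same filename exists.
compare_figures() {
  local gen_dir=$1 ref_dir=$2 ref name
  for ref in "$ref_dir"/*; do
    [ -f "$ref" ] || continue
    name=$(basename "$ref")
    if [ -f "$gen_dir/$name" ]; then
      echo "generated: $gen_dir/$name (compare against $ref)"
    else
      echo "missing:   $gen_dir/$name"
    fi
  done
}
```

Typical usage after running the plotting scripts: `compare_figures figures figures/reference`.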

## Reproducing Paper Artifacts

### Summary Table

| Paper Artifact | Command |
| --- | --- |
| All artifacts | `bash reproduce_artifacts.sh` |
| Figure 4 (Perf model comparison) | `python scripts/utils/plotting/plot_recomp_vs_swap_clean.py --recomp_input data/perf_model/recomputation/H200_tp2_recomputation_latency.json --swap_input data/perf_model/swap/H200_tp2_swap_kernel_latency.json --recomp_input_2 data/perf_model/recomputation/A40_recomputation_latency.json --swap_input_2 data/perf_model/swap/A40_swap_kernel_latency.json --output_dir figures --output_prefix hardware_comparison --title_1 "H200 TP=2" --title_2 "A40"` |
| Figure 5 (TTFT CCDF, Crawler + ANNS stacked) | `python scripts/utils/plotting/plot_ttft_ccdf_stacked_2x4.py --crawler-log-dir data/run_log/crawler/H200_enhanced_schedulers_v1_full --anns-log-dir data/run_log/anns/H200_enhanced_schedulers_v1_full --output-dir figures` |
| Figure 6 (Crawler TTFT vs QPS) | `python scripts/crawler/plotter_utils/plot_ttft_qps_comparison.py --log-dir data/run_log/crawler/H200_enhanced_schedulers_v1_full --output-dir figures --output-prefix ttft_qps_comparison_crawler -p 95 --max-rate 4` |
| Figure 7 (ANNS TTFT vs QPS) | `python scripts/anns/plotter_utils/plot_ttft_qps_comparison.py --log-dir data/run_log/anns/H200_enhanced_schedulers_v1_full --output-dir figures --output-prefix ttft_qps_comparison_anns -p 95 --max-rate 2` |
| Figures 8–10 (Chunk arrival characterization) | `python scripts/utils/analysis/chunk_arrival_characterization.py --anns-dir data/anns/res --crawler-dir data/crawl/traces/simpleQA_ALL --output-dir figures --table-dir tables` |
| Figure 11 (Trace completion, Crawler + ANNS) | `python scripts/utils/plotting/plot_trace_completion_combined.py --crawler-log-dir data/run_log/crawler/H200_enhanced_schedulers_v1_full --anns-log-dir data/run_log/anns/H200_enhanced_schedulers_v1_full --output-dir figures` |
| Figure 12 (Tokens invalidated CCDF) | `python scripts/anns/plotter_utils/plot_tokens_invalidated_aggregated.py --log-dir data/run_log/anns/H200_enhanced_schedulers_v1_full --output-dir figures --table-output-dir tables --min-qps 0.25 --max-qps 2.0` |
| Table 2 — ANNS workload stats | `cd data/anns && python compute_workload_stats.py --corpus-prefix retrieved_corpus_content --query-map query_trace_map_5k.json --trace-dir res --max-queries 500 --tokenizer-model meta-llama/Llama-3.1-8B-Instruct` |
| Table 2 — Crawler workload stats | `cd data/crawl && python compute_workload_stats.py --input-dir traces/simpleQA_ALL --tokenizer-model meta-llama/Llama-3.1-8B-Instruct --cores $(nproc)` |
| Table 3 (Eviction ablation) | See *Detailed Reproduction Commands* below |
| Table 4 (Preemption stats) | See *Detailed Reproduction Commands* below |
| Inline evaluation numbers | See *Detailed Reproduction Commands* below |
| Scheduler sorting latency | See *Detailed Reproduction Commands* below |

### Detailed Reproduction Commands

#### Table 3 — Eviction Ablation

Ablation study table (scheduler + eviction strategy speedups). Run `compute_scheduler_improvements.py` against all six delay/ablation directories (three Crawler + three ANNS). Each invocation writes a `.txt` file to the output directory.

Crawler:

```bash
python scripts/utils/analysis/compute_scheduler_improvements.py --log-dir data/run_log/crawler/H200_enhanced_schedulers_v1_full_delay_10 --output-dir tables --dataset-name H200_crawler_cost_based --max-qps 4
python scripts/utils/analysis/compute_scheduler_improvements.py --log-dir data/run_log/crawler/H200_enhanced_schedulers_v1_full_delay_10_recomp_only --output-dir tables --dataset-name H200_crawler_recomp_only --max-qps 4
python scripts/utils/analysis/compute_scheduler_improvements.py --log-dir data/run_log/crawler/H200_enhanced_schedulers_v1_full_delay_10_swap_only --output-dir tables --dataset-name H200_crawler_swap_only --max-qps 4
```

ANNS:

```bash
python scripts/utils/analysis/compute_scheduler_improvements.py --log-dir data/run_log/anns/H200_enhanced_schedulers_v1_500q_delay_30 --output-dir tables --dataset-name H200_anns_cost_based --max-qps 2
python scripts/utils/analysis/compute_scheduler_improvements.py --log-dir data/run_log/anns/H200_enhanced_schedulers_v1_500q_delay_30_recomp_only --output-dir tables --dataset-name H200_anns_recomp_only --max-qps 2
python scripts/utils/analysis/compute_scheduler_improvements.py --log-dir data/run_log/anns/H200_enhanced_schedulers_v1_500q_delay_30_swap_only --output-dir tables --dataset-name H200_anns_swap_only --max-qps 2
```
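The six invocations differ only in the log-directory suffix and dataset name, so they can be generated from one loop. The sketch below copies its directory names and flags from the commands above; it echoes each command rather than running it, so pipe the output to `bash` (or drop the `printf` in favor of `eval`) to execute.

```bash
#!/usr/bin/env bash
# Sketch: build the six eviction-ablation commands in one loop.
# Echoes the commands; pipe to bash to actually run them.
CMDS=()
for spec in \
  "crawler:data/run_log/crawler/H200_enhanced_schedulers_v1_full_delay_10:4" \
  "anns:data/run_log/anns/H200_enhanced_schedulers_v1_500q_delay_30:2"
do
  IFS=: read -r wl base qps <<<"$spec"
  for variant in cost_based recomp_only swap_only; do
    suffix=""
    [ "$variant" = cost_based ] || suffix="_$variant"
    CMDS+=("python scripts/utils/analysis/compute_scheduler_improvements.py --log-dir ${base}${suffix} --output-dir tables --dataset-name H200_${wl}_${variant} --max-qps ${qps}")
  done
done
printf '%s\n' "${CMDS[@]}"
```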

#### Table 4 — Preemption Stats

Preemption statistics table. Run `analyze_preemptions.py` against the same six directories.

```bash
python scripts/utils/analysis/analyze_preemptions.py --log-dir data/run_log/crawler/H200_enhanced_schedulers_v1_full_delay_10
python scripts/utils/analysis/analyze_preemptions.py --log-dir data/run_log/crawler/H200_enhanced_schedulers_v1_full_delay_10_recomp_only
python scripts/utils/analysis/analyze_preemptions.py --log-dir data/run_log/crawler/H200_enhanced_schedulers_v1_full_delay_10_swap_only
python scripts/utils/analysis/analyze_preemptions.py --log-dir data/run_log/anns/H200_enhanced_schedulers_v1_500q_delay_30
python scripts/utils/analysis/analyze_preemptions.py --log-dir data/run_log/anns/H200_enhanced_schedulers_v1_500q_delay_30_recomp_only
python scripts/utils/analysis/analyze_preemptions.py --log-dir data/run_log/anns/H200_enhanced_schedulers_v1_500q_delay_30_swap_only
```

#### Inline evaluation numbers

All speedup ratios cited in the paper body (Sections 3.1–3.6). Run `compute_scheduler_improvements.py` against the standard (no-pressure) H200 and H100 run logs.

```bash
python scripts/utils/analysis/compute_scheduler_improvements.py --log-dir data/run_log/crawler/H200_enhanced_schedulers_v1_full --output-dir tables --dataset-name H200_crawler --max-qps 4
python scripts/utils/analysis/compute_scheduler_improvements.py --log-dir data/run_log/anns/H200_enhanced_schedulers_v1_full --output-dir tables --dataset-name H200_anns --max-qps 2
python scripts/utils/analysis/compute_scheduler_improvements.py --log-dir data/run_log/crawler/H100_enhanced_schedulers_v1_full --output-dir tables --dataset-name H100_crawler --max-qps 4
python scripts/utils/analysis/compute_scheduler_improvements.py --log-dir data/run_log/anns/H100_enhanced_schedulers_v1_full --output-dir tables --dataset-name H100_anns --max-qps 2
```

#### Scheduler sorting latency

Benchmarks the computational overhead of each scheduling policy's sorting and budget-allocation logic, using realistic request populations derived from the run-log data. Outputs a table with mean, p50, p95, and p99 latencies in microseconds.

```bash
python scripts/utils/analysis/benchmark_scheduler_latency.py --log-dir data/run_log/anns/H200_enhanced_schedulers_v1_full --output-dir tables --dataset-name anns
python scripts/utils/analysis/benchmark_scheduler_latency.py --log-dir data/run_log/crawler/H200_enhanced_schedulers_v1_full --output-dir tables --dataset-name crawler
```

## Re-running Experiments

### Building the Stream2LLM Engine

The `stream2llm/` directory contains the modified vLLM engine with streaming input support.

```bash
# Install Stream2LLM (requires CUDA-capable GPU)
cd stream2llm
wget https://files.pythonhosted.org/packages/c4/9d/64e107313a19327b049a2267871cceb9b0415f79ee5c00dc360099f929e8/vllm-0.8.1-cp38-abi3-manylinux1_x86_64.whl

# Environment variables (update with each vLLM release)
export VLLM_VERSION=0.8.1
export VLLM_PRECOMPILED_WHEEL_LOCATION=${PWD}/vllm-0.8.1-cp38-abi3-manylinux1_x86_64.whl

# Install in development mode
pip install -e .
cd ..
```
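Since the wheel filename and `VLLM_VERSION` must stay in sync across releases, a quick consistency guard before `pip install` can catch a stale download. This check is a sketch, not part of the repository; the values below mirror the build steps above.

```bash
#!/usr/bin/env bash
# Sketch: refuse to proceed if the downloaded wheel does not match
# VLLM_VERSION (both taken from the build steps above).
VLLM_VERSION=0.8.1
WHEEL=vllm-0.8.1-cp38-abi3-manylinux1_x86_64.whl

case "$WHEEL" in
  vllm-"$VLLM_VERSION"-*) echo "wheel matches VLLM_VERSION=$VLLM_VERSION" ;;
  *) echo "error: $WHEEL does not match VLLM_VERSION=$VLLM_VERSION" >&2; exit 1 ;;
esac
```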

**Hardware requirements:** NVIDIA GPU with compute capability >= 7.0 (e.g., A40, H100, H200). Tensor parallelism requires multiple GPUs.

See `experiments/README.md` for detailed instructions, with exact commands, for running all 10 experiment configurations, ablation studies, SLURM submission, and plot generation.

## Data Organization

The `data/` directory is a git submodule hosted on HuggingFace (`rbachkaniwala3/stream2llm-data`). It must be initialized before use: either clone with `--recurse-submodules` or run `git submodule update --init` after cloning. It contains:

- `run_log/crawler/` and `run_log/anns/`: Experiment run logs (`run_metrics.csv` + `config_*.yaml`) for 10 configurations across H200 and H100 hardware
- `anns/`: ANNS workload data (corpus content, query trace map, 4997 pipeline traces)
- `crawl/`: Crawler workload data (4322 trace CSVs)
- `perf_model/`: Performance model JSONs (7 recomputation + 11 swap)
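The file counts listed here can be spot-checked after the LFS pull. `count_files` below is a hypothetical helper, not a script in the repository; the expected counts are taken from the list above.

```bash
#!/usr/bin/env bash
# Sketch: report how many files under a directory match a glob,
# alongside the count this README says to expect.
count_files() {
  local found
  found=$(find "$1" -name "$2" -type f 2>/dev/null | wc -l)
  found=$((found))  # strip any padding from wc output
  echo "$1: found $found matching '$2' (expected $3)"
}

# Expected counts taken from the data organization list above.
count_files data/crawl/traces/simpleQA_ALL '*.csv' 4322
count_files data/anns/res '*' 4997
```

A large shortfall against the expected count usually means `git lfs pull` did not complete.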

## License

This project is licensed under the MIT License.
