This repository is the artifact for the MLSys 2026 artifact evaluation of the paper *Stream2LLM: Overlap Context Streaming and Prefill for Reduced Time-to-First-Token*. It contains all scripts, data, and pre-built figures needed to reproduce every figure, table, and inline number in the paper.
## Setup

```shell
# Clone the mlsys_artifact branch with the data submodule
git clone --recurse-submodules -b mlsys_artifact https://github.com/rajveerb/stream2llm.git
cd stream2llm

# Pull large files in the data/ submodule (hosted on HuggingFace with Git LFS)
cd data && git lfs install && git lfs pull && cd ..

# Create and activate the conda environment
conda create -n stream2llm python=3.10.9 -y
conda activate stream2llm

# Install pinned Python dependencies
pip install -r requirements.txt

# (Optional) HuggingFace login for tokenizer access (needed for workload stats only)
huggingface-cli login
```

## Repository Layout

| Directory | Contents |
|---|---|
| `stream2llm/` | Modified vLLM engine with streaming input support |
| `experiments/` | Experiment driver scripts, configs, and SLURM job files |
| `data/` | Git submodule (HuggingFace dataset) with all large data (run logs, workload traces, perf models). Clone with `--recurse-submodules` or run `git submodule update --init` after cloning. |
| `scripts/` | Plotting and analysis scripts |
| `figures/` | Generated plots and figures from the paper |
| `figures/reference/` | Pre-built reference figures from the paper (for comparison) |
| `tables/` | Generated table data and analysis results |
## Reproducing Figures and Tables

| Paper Artifact | Command |
|---|---|
| All artifacts | `bash reproduce_artifacts.sh` |
| Figure 4 (Perf model comparison) | `python scripts/utils/plotting/plot_recomp_vs_swap_clean.py --recomp_input data/perf_model/recomputation/H200_tp2_recomputation_latency.json --swap_input data/perf_model/swap/H200_tp2_swap_kernel_latency.json --recomp_input_2 data/perf_model/recomputation/A40_recomputation_latency.json --swap_input_2 data/perf_model/swap/A40_swap_kernel_latency.json --output_dir figures --output_prefix hardware_comparison --title_1 "H200 TP=2" --title_2 "A40"` |
| Figure 5 (TTFT CCDF, Crawler + ANNS stacked) | `python scripts/utils/plotting/plot_ttft_ccdf_stacked_2x4.py --crawler-log-dir data/run_log/crawler/H200_enhanced_schedulers_v1_full --anns-log-dir data/run_log/anns/H200_enhanced_schedulers_v1_full --output-dir figures` |
| Figure 6 (Crawler TTFT vs QPS) | `python scripts/crawler/plotter_utils/plot_ttft_qps_comparison.py --log-dir data/run_log/crawler/H200_enhanced_schedulers_v1_full --output-dir figures --output-prefix ttft_qps_comparison_crawler -p 95 --max-rate 4` |
| Figure 7 (ANNS TTFT vs QPS) | `python scripts/anns/plotter_utils/plot_ttft_qps_comparison.py --log-dir data/run_log/anns/H200_enhanced_schedulers_v1_full --output-dir figures --output-prefix ttft_qps_comparison_anns -p 95 --max-rate 2` |
| Figures 8–10 (Chunk arrival characterization) | `python scripts/utils/analysis/chunk_arrival_characterization.py --anns-dir data/anns/res --crawler-dir data/crawl/traces/simpleQA_ALL --output-dir figures --table-dir tables` |
| Figure 11 (Trace completion, Crawler + ANNS) | `python scripts/utils/plotting/plot_trace_completion_combined.py --crawler-log-dir data/run_log/crawler/H200_enhanced_schedulers_v1_full --anns-log-dir data/run_log/anns/H200_enhanced_schedulers_v1_full --output-dir figures` |
| Figure 12 (Tokens invalidated CCDF) | `python scripts/anns/plotter_utils/plot_tokens_invalidated_aggregated.py --log-dir data/run_log/anns/H200_enhanced_schedulers_v1_full --output-dir figures --table-output-dir tables --min-qps 0.25 --max-qps 2.0` |
| Table 2 (ANNS workload stats) | `cd data/anns && python compute_workload_stats.py --corpus-prefix retrieved_corpus_content --query-map query_trace_map_5k.json --trace-dir res --max-queries 500 --tokenizer-model meta-llama/Llama-3.1-8B-Instruct` |
| Table 2 (Crawler workload stats) | `cd data/crawl && python compute_workload_stats.py --input-dir traces/simpleQA_ALL --tokenizer-model meta-llama/Llama-3.1-8B-Instruct --cores $(nproc)` |
| Table 3 (Eviction ablation) | See Detailed Reproduction Commands |
| Table 4 (Preemption stats) | See Detailed Reproduction Commands |
| Inline evaluation numbers | See Detailed Reproduction Commands |
| Scheduler sorting latency | See Detailed Reproduction Commands |
## Detailed Reproduction Commands

### Table 3: Eviction ablation

Ablation study table (scheduler and eviction-strategy speedups). Run `compute_scheduler_improvements.py` against all six delay/ablation directories (three Crawler, three ANNS). Each invocation produces a `.txt` file in the output directory.

Crawler:

```shell
python scripts/utils/analysis/compute_scheduler_improvements.py --log-dir data/run_log/crawler/H200_enhanced_schedulers_v1_full_delay_10 --output-dir tables --dataset-name H200_crawler_cost_based --max-qps 4
python scripts/utils/analysis/compute_scheduler_improvements.py --log-dir data/run_log/crawler/H200_enhanced_schedulers_v1_full_delay_10_recomp_only --output-dir tables --dataset-name H200_crawler_recomp_only --max-qps 4
python scripts/utils/analysis/compute_scheduler_improvements.py --log-dir data/run_log/crawler/H200_enhanced_schedulers_v1_full_delay_10_swap_only --output-dir tables --dataset-name H200_crawler_swap_only --max-qps 4
```

ANNS:

```shell
python scripts/utils/analysis/compute_scheduler_improvements.py --log-dir data/run_log/anns/H200_enhanced_schedulers_v1_500q_delay_30 --output-dir tables --dataset-name H200_anns_cost_based --max-qps 2
python scripts/utils/analysis/compute_scheduler_improvements.py --log-dir data/run_log/anns/H200_enhanced_schedulers_v1_500q_delay_30_recomp_only --output-dir tables --dataset-name H200_anns_recomp_only --max-qps 2
python scripts/utils/analysis/compute_scheduler_improvements.py --log-dir data/run_log/anns/H200_enhanced_schedulers_v1_500q_delay_30_swap_only --output-dir tables --dataset-name H200_anns_swap_only --max-qps 2
```

### Table 4: Preemption statistics

Preemption statistics table. Run `analyze_preemptions.py` against the same six directories.
```shell
python scripts/utils/analysis/analyze_preemptions.py --log-dir data/run_log/crawler/H200_enhanced_schedulers_v1_full_delay_10
python scripts/utils/analysis/analyze_preemptions.py --log-dir data/run_log/crawler/H200_enhanced_schedulers_v1_full_delay_10_recomp_only
python scripts/utils/analysis/analyze_preemptions.py --log-dir data/run_log/crawler/H200_enhanced_schedulers_v1_full_delay_10_swap_only
python scripts/utils/analysis/analyze_preemptions.py --log-dir data/run_log/anns/H200_enhanced_schedulers_v1_500q_delay_30
python scripts/utils/analysis/analyze_preemptions.py --log-dir data/run_log/anns/H200_enhanced_schedulers_v1_500q_delay_30_recomp_only
python scripts/utils/analysis/analyze_preemptions.py --log-dir data/run_log/anns/H200_enhanced_schedulers_v1_500q_delay_30_swap_only
```

### Inline evaluation numbers

All speedup ratios cited in the paper body (Sections 3.1–3.6). Run `compute_scheduler_improvements.py` against the standard (no-pressure) H200 and H100 run logs.
```shell
python scripts/utils/analysis/compute_scheduler_improvements.py --log-dir data/run_log/crawler/H200_enhanced_schedulers_v1_full --output-dir tables --dataset-name H200_crawler --max-qps 4
python scripts/utils/analysis/compute_scheduler_improvements.py --log-dir data/run_log/anns/H200_enhanced_schedulers_v1_full --output-dir tables --dataset-name H200_anns --max-qps 2
python scripts/utils/analysis/compute_scheduler_improvements.py --log-dir data/run_log/crawler/H100_enhanced_schedulers_v1_full --output-dir tables --dataset-name H100_crawler --max-qps 4
python scripts/utils/analysis/compute_scheduler_improvements.py --log-dir data/run_log/anns/H100_enhanced_schedulers_v1_full --output-dir tables --dataset-name H100_anns --max-qps 2
```

### Scheduler sorting latency

Benchmarks the computational overhead of each scheduling policy's sorting and budget-allocation logic, using realistic request populations derived from run-log data. Outputs a table with mean, p50, p95, and p99 latencies in microseconds.
```shell
python scripts/utils/analysis/benchmark_scheduler_latency.py --log-dir data/run_log/anns/H200_enhanced_schedulers_v1_full --output-dir tables --dataset-name anns
python scripts/utils/analysis/benchmark_scheduler_latency.py --log-dir data/run_log/crawler/H200_enhanced_schedulers_v1_full --output-dir tables --dataset-name crawler
```

## Installing Stream2LLM

The `stream2llm/` directory contains the modified vLLM engine with streaming input support.
```shell
# Install Stream2LLM (requires a CUDA-capable GPU)
cd stream2llm
wget https://files.pythonhosted.org/packages/c4/9d/64e107313a19327b049a2267871cceb9b0415f79ee5c00dc360099f929e8/vllm-0.8.1-cp38-abi3-manylinux1_x86_64.whl

# Environment variables (update with each vLLM release)
export VLLM_VERSION=0.8.1
export VLLM_PRECOMPILED_WHEEL_LOCATION=${PWD}/vllm-0.8.1-cp38-abi3-manylinux1_x86_64.whl

# Install in development mode
pip install -e .
cd ..
```

Hardware requirements: NVIDIA GPU with compute capability >= 7.0 (e.g., A40, H100, H200). Tensor parallelism requires multiple GPUs.
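As a quick pre-install sanity check, you can query each GPU's compute capability through `nvidia-smi`. The helper below is a minimal sketch, not part of the artifact; it assumes a driver recent enough to expose the `compute_cap` query field, and the function names are hypothetical.

```python
import subprocess

def meets_capability(cap_str, minimum=(7, 0)):
    """Return True if a 'major.minor' compute-capability string meets the minimum."""
    major, minor = (int(x) for x in cap_str.strip().split("."))
    return (major, minor) >= minimum

def check_gpus(minimum=(7, 0)):
    """Query every visible GPU via nvidia-smi; returns None if no driver is found."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None  # nvidia-smi missing or failed: no usable NVIDIA driver
    return [meets_capability(line, minimum) for line in out.splitlines() if line.strip()]

if __name__ == "__main__":
    print(check_gpus())  # e.g. [True, True] on a 2x H200 node
```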
## Running Experiments

See `experiments/README.md` for detailed instructions, with exact commands, for running all 10 experiment configurations, ablation studies, SLURM submission, and plot generation.
## Data

The `data/` directory is a git submodule hosted on HuggingFace (`rbachkaniwala3/stream2llm-data`). It must be initialized before use: either clone with `--recurse-submodules` or run `git submodule update --init` after cloning. It contains:

- `run_log/crawler/` and `run_log/anns/`: experiment run logs (`run_metrics.csv` + `config_*.yaml`) for 10 configurations across H200 and H100 hardware
- `anns/`: ANNS workload data (corpus content, query trace map, 4997 pipeline traces)
- `crawl/`: Crawler workload data (4322 trace CSVs)
- `perf_model/`: performance model JSONs (7 recomputation + 11 swap)
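After initializing the submodule, a quick way to confirm the layout before launching any reproduction script is to check that the expected top-level subdirectories exist. This is a hypothetical helper, not shipped with the artifact; the `EXPECTED_DIRS` list mirrors the paths used by the commands above.

```python
from pathlib import Path

# Top-level entries under data/ that the reproduction commands reference
EXPECTED_DIRS = ["run_log/crawler", "run_log/anns", "anns", "crawl", "perf_model"]

def missing_data_dirs(root="data"):
    """Return the expected data/ subdirectories that are absent under `root`."""
    base = Path(root)
    return [d for d in EXPECTED_DIRS if not (base / d).is_dir()]

if __name__ == "__main__":
    missing = missing_data_dirs()
    if missing:
        print("Missing (try `git submodule update --init` and `git lfs pull`):", missing)
    else:
        print("data/ layout looks complete")
```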
## License

This project is licensed under the MIT License.