benchmarks

Benchmarking vLLM

This README guides you through running benchmark tests with the extensive datasets supported on vLLM. It’s a living document, updated as new features and datasets become available.

Dataset Overview

Dataset	Online	Offline	Data Path
ShareGPT	✅	✅	`wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json`
BurstGPT	✅	✅	`wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv`
Sonnet	✅	✅	Local file: `benchmarks/sonnet.txt`
Random	✅	✅	`synthetic`
HuggingFace	🟡	🟡	Specify your dataset path on HuggingFace
VisionArena	✅	✅	`lmarena-ai/vision-arena-bench-v0.1` (a HuggingFace dataset)

✅: supported

🚧: to be supported

🟡: Partial support. Currently, HuggingFaceDataset only supports dataset formats similar to lmms-lab/LLaVA-OneVision-Data and Aeala/ShareGPT_Vicuna_unfiltered. If you need support for other dataset formats, please consider contributing.

Note: VisionArena’s dataset-name should be set to hf

Example - Online Benchmark

First start serving your model

MODEL_NAME="NousResearch/Hermes-3-Llama-3.1-8B"
vllm serve ${MODEL_NAME} --disable-log-requests

Then run the benchmarking script

# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
MODEL_NAME="NousResearch/Hermes-3-Llama-3.1-8B"
NUM_PROMPTS=10
BACKEND="vllm"
DATASET_NAME="sharegpt"
DATASET_PATH="<your data path>/ShareGPT_V3_unfiltered_cleaned_split.json"
python3 vllm/benchmarks/benchmark_serving.py --backend ${BACKEND} --model ${MODEL_NAME} --endpoint /v1/completions --dataset-name ${DATASET_NAME} --dataset-path ${DATASET_PATH} --num-prompts ${NUM_PROMPTS}

If successful, you will see the following output

============ Serving Benchmark Result ============
Successful requests:                     10        
Benchmark duration (s):                  5.78      
Total input tokens:                      1369      
Total generated tokens:                  2212      
Request throughput (req/s):              1.73      
Output token throughput (tok/s):         382.89    
Total Token throughput (tok/s):          619.85    
---------------Time to First Token----------------
Mean TTFT (ms):                          71.54     
Median TTFT (ms):                        73.88     
P99 TTFT (ms):                           79.49     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.91      
Median TPOT (ms):                        7.96      
P99 TPOT (ms):                           8.03      
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.74      
Median ITL (ms):                         7.70      
P99 ITL (ms):                            8.39      
==================================================

VisionArena Benchmark for Vision Language Models

# need a model with vision capability here
vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests

MODEL_NAME="Qwen/Qwen2-VL-7B-Instruct"
NUM_PROMPTS=10
BACKEND="openai-chat"
DATASET_NAME="hf"
DATASET_PATH="lmarena-ai/vision-arena-bench-v0.1"
DATASET_SPLIT='train'

python3 vllm/benchmarks/benchmark_serving.py \
  --backend "${BACKEND}" \
  --model "${MODEL_NAME}" \
  --endpoint "/v1/chat/completions" \
  --dataset-name "${DATASET_NAME}" \
  --dataset-path "${DATASET_PATH}" \
  --hf-split "${DATASET_SPLIT}" \
  --num-prompts "${NUM_PROMPTS}"

HuggingFaceDataset Examples

Currently, HuggingFaceDataset only supports dataset formats similar to lmms-lab/LLaVA-OneVision-Data and Aeala/ShareGPT_Vicuna_unfiltered. If you need support for other dataset formats, please consider contributing.

# need a model with vision capability here
vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests

lmms-lab/LLaVA-OneVision-Data

MODEL_NAME="Qwen/Qwen2-VL-7B-Instruct"
NUM_PROMPTS=10
BACKEND="openai-chat"
DATASET_NAME="hf"
DATASET_PATH="lmms-lab/LLaVA-OneVision-Data"
DATASET_SPLIT='train'
DATASET_SUBSET='chart2text(cauldron)'
python3 vllm/benchmarks/benchmark_serving.py \
  --backend "${BACKEND}" \
  --model "${MODEL_NAME}" \
  --endpoint "/v1/chat/completions" \
  --dataset-name "${DATASET_NAME}" \
  --dataset-path "${DATASET_PATH}" \
  --hf-split "${DATASET_SPLIT}" \
  --num-prompts "${NUM_PROMPTS}" \
  --hf-subset "${DATASET_SUBSET}"

Aeala/ShareGPT_Vicuna_unfiltered

MODEL_NAME="Qwen/Qwen2-VL-7B-Instruct"
NUM_PROMPTS=10
BACKEND="openai-chat"
DATASET_NAME="hf"
DATASET_PATH="Aeala/ShareGPT_Vicuna_unfiltered"
DATASET_SPLIT='train'
python3 vllm/benchmarks/benchmark_serving.py \
  --backend "${BACKEND}" \
  --model "${MODEL_NAME}" \
  --endpoint "/v1/chat/completions" \
  --dataset-name "${DATASET_NAME}" \
  --dataset-path "${DATASET_PATH}" \
  --hf-split "${DATASET_SPLIT}" \
  --num-prompts "${NUM_PROMPTS}" \

Example - Offline Throughput Benchmark

MODEL_NAME="NousResearch/Hermes-3-Llama-3.1-8B"
NUM_PROMPTS=10
DATASET_NAME="sonnet"
DATASET_PATH="vllm/benchmarks/sonnet.txt"

python3 vllm/benchmarks/benchmark_throughput.py \
  --model "${MODEL_NAME}" \
  --dataset-name "${DATASET_NAME}" \
  --dataset-path "${DATASET_PATH}" \
  --num-prompts "${NUM_PROMPTS}"

If successful, you will see the following output

Throughput: 7.15 requests/s, 4656.00 total tokens/s, 1072.15 output tokens/s
Total num prompt tokens:  5014
Total num output tokens:  1500

VisionArena Benchmark for Vision Language Models

MODEL_NAME="Qwen/Qwen2-VL-7B-Instruct"
NUM_PROMPTS=10
DATASET_NAME="hf"
DATASET_PATH="lmarena-ai/vision-arena-bench-v0.1"
DATASET_SPLIT="train"

python3 vllm/benchmarks/benchmark_throughput.py \
  --model "${MODEL_NAME}" \
  --backend "vllm-chat" \
  --dataset-name "${DATASET_NAME}" \
  --dataset-path "${DATASET_PATH}" \
  --num-prompts "${NUM_PROMPTS}" \
  --hf-split "${DATASET_SPLIT}"

The num prompt tokens now includes image token counts

Throughput: 2.55 requests/s, 4036.92 total tokens/s, 326.90 output tokens/s
Total num prompt tokens:  14527
Total num output tokens:  1280

Benchmark with LoRA Adapters

# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
MODEL_NAME="meta-llama/Llama-2-7b-hf"
BACKEND="vllm"
DATASET_NAME="sharegpt"
DATASET_PATH="<your data path>/ShareGPT_V3_unfiltered_cleaned_split.json"
NUM_PROMPTS=10
MAX_LORAS=2
MAX_LORA_RANK=8
ENABLE_LORA="--enable-lora"
LORA_PATH="yard1/llama-2-7b-sql-lora-test"

python3 vllm/benchmarks/benchmark_throughput.py \
  --model "${MODEL_NAME}" \
  --backend "${BACKEND}" \
  --dataset_path "${DATASET_PATH}" \
  --dataset_name "${DATASET_NAME}" \
  --num-prompts "${NUM_PROMPTS}" \
  --max-loras "${MAX_LORAS}" \
  --max-lora-rank "${MAX_LORA_RANK}" \
  ${ENABLE_LORA} \
  --lora-path "${LORA_PATH}"

Name		Name	Last commit message	Last commit date
parent directory ..
cutlass_benchmarks		cutlass_benchmarks
disagg_benchmarks		disagg_benchmarks
fused_kernels		fused_kernels
kernels		kernels
overheads		overheads
structured_schemas		structured_schemas
README.md		README.md
backend_request_func.py		backend_request_func.py
benchmark_dataset.py		benchmark_dataset.py
benchmark_latency.py		benchmark_latency.py
benchmark_long_document_qa_throughput.py		benchmark_long_document_qa_throughput.py
benchmark_prefix_caching.py		benchmark_prefix_caching.py
benchmark_prioritization.py		benchmark_prioritization.py
benchmark_serving.py		benchmark_serving.py
benchmark_serving_structured_output.py		benchmark_serving_structured_output.py
benchmark_throughput.py		benchmark_throughput.py
benchmark_utils.py		benchmark_utils.py
launch_tgi_server.sh		launch_tgi_server.sh
run_structured_output_benchmark.sh		run_structured_output_benchmark.sh
sonnet.txt		sonnet.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

Benchmarking vLLM

Dataset Overview

Example - Online Benchmark

VisionArena Benchmark for Vision Language Models

HuggingFaceDataset Examples

Example - Offline Throughput Benchmark

VisionArena Benchmark for Vision Language Models

Benchmark with LoRA Adapters

FilesExpand file tree

benchmarks

Directory actions

More options

Directory actions

More options

Latest commit

History

benchmarks

Folders and files

parent directory

README.md

Benchmarking vLLM

Dataset Overview

Example - Online Benchmark

VisionArena Benchmark for Vision Language Models

HuggingFaceDataset Examples

Example - Offline Throughput Benchmark

VisionArena Benchmark for Vision Language Models

Benchmark with LoRA Adapters