Performance Benchmark
This document details the benchmark methodology for vllm-ascend, aimed at evaluating its performance under a variety of workloads. To stay aligned with vLLM, we use the benchmark scripts provided by the vLLM project.
Benchmark coverage: we measure offline end-to-end latency and throughput, as well as fixed-QPS online serving benchmarks. For more details, see the vllm-ascend benchmark scripts.
1. Run docker container
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:v0.9.0rc2
docker run --rm \
--name vllm-ascend \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash
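Inside the container, you can confirm that the NPU is visible before continuing; npu-smi is mounted in from the host by the command above:
# Verify the mounted Ascend NPU is visible inside the container
npu-smi info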
2. Install dependencies
cd /workspace/vllm-ascend
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install -r benchmarks/requirements-bench.txt
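As an optional sanity check, you can confirm that vLLM and the Ascend backend import cleanly. This is a minimal sketch, assuming the standard vllm and torch_npu packages shipped in the image:
# Optional sanity check: both commands should succeed inside the vllm-ascend image
python3 -c "import vllm; print(vllm.__version__)"
python3 -c "import torch, torch_npu; print(torch.npu.is_available())"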
3. (Optional) Prepare model weights
To speed up the benchmark run, we recommend downloading the model weights in advance:
modelscope download --model LLM-Research/Meta-Llama-3.1-8B-Instruct
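If you want the weights in an explicit local directory (so the path can be pasted into the test JSON files below), the modelscope CLI can download to a target directory; the --local_dir flag and the path below are assumptions, adjust them to your setup:
# Download into an explicit local directory (flag and path are examples)
modelscope download --model LLM-Research/Meta-Llama-3.1-8B-Instruct --local_dir /root/.cache/models/Meta-Llama-3.1-8B-Instruct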
You can also replace the model paths in the JSON test files with your local path, for example:
[
{
"test_name": "latency_llama8B_tp1",
"parameters": {
"model": "your local model path",
"tensor_parallel_size": 1,
"load_format": "dummy",
"num_iters_warmup": 5,
"num_iters": 15
}
}
]
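If several test files need updating, a small script can rewrite the model field in place. This is a minimal sketch; the file names under benchmarks/tests/ and the local model path are assumptions, so point them at the JSON files actually present in your checkout:
import json
from pathlib import Path

# Assumed locations; adjust to the JSON test files in your checkout.
TEST_FILES = [
    Path("benchmarks/tests/latency-tests.json"),
    Path("benchmarks/tests/throughput-tests.json"),
    Path("benchmarks/tests/serving-tests.json"),
]
LOCAL_MODEL_PATH = "/root/.cache/models/Meta-Llama-3.1-8B-Instruct"  # example path

for test_file in TEST_FILES:
    tests = json.loads(test_file.read_text())
    for test in tests:
        # Each entry carries its engine arguments under "parameters" (see the example above).
        params = test.get("parameters", {})
        if "model" in params:
            params["model"] = LOCAL_MODEL_PATH
    test_file.write_text(json.dumps(tests, indent=4))
    print(f"updated {test_file}")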
4. Run benchmark script
Run the benchmark script:
bash benchmarks/scripts/run-performance-benchmarks.sh
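The full run takes a while, so it can be handy to keep a copy of the console output alongside the JSON results (plain shell, nothing vllm-ascend specific):
# Keep a copy of the console output for later comparison
bash benchmarks/scripts/run-performance-benchmarks.sh 2>&1 | tee benchmark_run.log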
After about 10 minutes, you should see output similar to the following:
online serving:
qps 1:
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 212.77
Total input tokens: 42659
Total generated tokens: 43545
Request throughput (req/s): 0.94
Output token throughput (tok/s): 204.66
Total Token throughput (tok/s): 405.16
---------------Time to First Token----------------
Mean TTFT (ms): 104.14
Median TTFT (ms): 102.22
P99 TTFT (ms): 153.82
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 38.78
Median TPOT (ms): 38.70
P99 TPOT (ms): 48.03
---------------Inter-token Latency----------------
Mean ITL (ms): 38.46
Median ITL (ms): 36.96
P99 ITL (ms): 75.03
==================================================
qps 4:
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 72.55
Total input tokens: 42659
Total generated tokens: 43545
Request throughput (req/s): 2.76
Output token throughput (tok/s): 600.24
Total Token throughput (tok/s): 1188.27
---------------Time to First Token----------------
Mean TTFT (ms): 115.62
Median TTFT (ms): 109.39
P99 TTFT (ms): 169.03
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 51.48
Median TPOT (ms): 52.40
P99 TPOT (ms): 69.41
---------------Inter-token Latency----------------
Mean ITL (ms): 50.47
Median ITL (ms): 43.95
P99 ITL (ms): 130.29
==================================================
qps 16:
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 47.82
Total input tokens: 42659
Total generated tokens: 43545
Request throughput (req/s): 4.18
Output token throughput (tok/s): 910.62
Total Token throughput (tok/s): 1802.70
---------------Time to First Token----------------
Mean TTFT (ms): 128.50
Median TTFT (ms): 128.36
P99 TTFT (ms): 187.87
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 83.60
Median TPOT (ms): 77.85
P99 TPOT (ms): 165.90
---------------Inter-token Latency----------------
Mean ITL (ms): 65.72
Median ITL (ms): 54.84
P99 ITL (ms): 289.63
==================================================
qps inf:
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 41.26
Total input tokens: 42659
Total generated tokens: 43545
Request throughput (req/s): 4.85
Output token throughput (tok/s): 1055.44
Total Token throughput (tok/s): 2089.40
---------------Time to First Token----------------
Mean TTFT (ms): 3394.37
Median TTFT (ms): 3359.93
P99 TTFT (ms): 3540.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 66.28
Median TPOT (ms): 64.19
P99 TPOT (ms): 97.66
---------------Inter-token Latency----------------
Mean ITL (ms): 56.62
Median ITL (ms): 55.69
P99 ITL (ms): 82.90
==================================================
offline:
latency:
Avg latency: 4.944929537673791 seconds
10% percentile latency: 4.894104263186454 seconds
25% percentile latency: 4.909652255475521 seconds
50% percentile latency: 4.932477846741676 seconds
75% percentile latency: 4.9608619548380375 seconds
90% percentile latency: 5.035418218374252 seconds
99% percentile latency: 5.052476694583893 seconds
throughput:
Throughput: 4.64 requests/s, 2000.51 total tokens/s, 1010.54 output tokens/s
Total num prompt tokens: 42659
Total num output tokens: 43545
The result JSON files are generated under benchmarks/results. They contain the detailed benchmark results for further analysis; a small parsing sketch follows the file listing below.
.
|-- latency_llama8B_tp1.json
|-- serving_llama8B_tp1_qps_1.json
|-- serving_llama8B_tp1_qps_16.json
|-- serving_llama8B_tp1_qps_4.json
|-- serving_llama8B_tp1_qps_inf.json
`-- throughput_llama8B_tp1.json
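For further analysis you can load these JSON files programmatically. The sketch below assumes the serving results expose fields such as request_throughput, mean_ttft_ms, and mean_tpot_ms (as written by the upstream vLLM benchmark scripts); adjust the keys and the results directory to whatever your generated files actually contain:
import json
from pathlib import Path

RESULTS_DIR = Path("benchmarks/results")  # adjust if your results land elsewhere

for result_file in sorted(RESULTS_DIR.glob("serving_*.json")):
    data = json.loads(result_file.read_text())
    # Field names are assumptions based on the upstream vLLM serving benchmark output;
    # change them to match the keys actually present in your files.
    req_per_s = data.get("request_throughput", float("nan"))
    mean_ttft = data.get("mean_ttft_ms", float("nan"))
    mean_tpot = data.get("mean_tpot_ms", float("nan"))
    print(f"{result_file.name}: {req_per_s:.2f} req/s, "
          f"mean TTFT {mean_ttft:.1f} ms, mean TPOT {mean_tpot:.1f} ms")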