
vLLM benchmark on an AMD 9950X / 128GB DDR5 + RTX 4090

Since I'd just set up the AMD 9950X and the RTX 4090, I decided to see what kind of throughput vLLM can get out of this machine.

I ran the tests by remote access from Istanbul, Turkey, into our lab in Asakusabashi.

The hardware configuration is as follows:

- CPU: AMD Ryzen 9 9950X
- RAM: DDR5-5600 32GB x 4 (128GB)
- SSD: 2TB M.2 NVMe
- GPU: NVIDIA RTX 4090
- LLM: meta-llama/Meta-Llama-3-8B (no quantization)

So the results should basically come down to GPU speed, but I tried a few configurations.
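One note: benchmark_serving.py measures against an already-running server, so each run below assumes a vLLM OpenAI-compatible server was started first, along these lines (flags are illustrative, not the exact ones I used; check --help for your vLLM version):

```shell
# Launch the OpenAI-compatible vLLM server that benchmark_serving.py targets.
# The port is an illustrative default.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B \
    --port 8000
```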

First, run with vLLM's default parameters (num-prompts = 2):

python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Meta-Llama-3-8B --dataset-name sharegpt --dataset-path sharegpt.json --profile --num-prompts 2

============ Serving Benchmark Result ============
Successful requests:                     2         
Benchmark duration (s):                  5.26      
Total input tokens:                      24        
Total generated tokens:                  28        
Request throughput (req/s):              0.38      
Input token throughput (tok/s):          4.56      
Output token throughput (tok/s):         5.32      
---------------Time to First Token----------------
Mean TTFT (ms):                          25.08     
Median TTFT (ms):                        25.08     
P99 TTFT (ms):                           25.24     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          18.05     
Median TPOT (ms):                        18.05     
P99 TPOT (ms):                           18.06     
---------------Inter-token Latency----------------
Mean ITL (ms):                           18.02     
Median ITL (ms):                         18.03     
P99 ITL (ms):                            18.16     
==================================================

This is far too slow: 5.32 tok/s. With only two short requests, though, the GPU is nearly idle, so the number mostly reflects per-stream latency (at ~18 ms ITL a single stream tops out around 55 tok/s) rather than throughput. Too few prompts to measure accurately, so let's try 200.

python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Meta-Llama-3-8B --dataset-name sharegpt --dataset-path sharegpt.json --profile --num-prompts 200
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  24.79     
Total input tokens:                      3661      
Total generated tokens:                  13588     
Request throughput (req/s):              8.07      
Input token throughput (tok/s):          147.68    
Output token throughput (tok/s):         548.11    
---------------Time to First Token----------------
Mean TTFT (ms):                          552.77    
Median TTFT (ms):                        665.02    
P99 TTFT (ms):                           674.11    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          41.62     
Median TPOT (ms):                        40.16     
P99 TPOT (ms):                           79.77     
---------------Inter-token Latency----------------
Mean ITL (ms):                           32.93     
Median ITL (ms):                         32.08     
P99 ITL (ms):                            53.19     
==================================================

Processing 200 prompts is clearly faster than 2. An output rate of 548 tok/s isn't bad at all.
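As a quick sanity check, the reported output throughput is simply total generated tokens divided by the benchmark duration:

```python
def output_throughput(total_generated_tokens: int, duration_s: float) -> float:
    """Output token throughput as benchmark_serving.py reports it."""
    return total_generated_tokens / duration_s

# Figures from the 200-prompt run above: 13588 tokens in 24.79 s
print(round(output_throughput(13588, 24.79), 1))  # ~548.1 tok/s, matching the report
```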

python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Meta-Llama-3-8B --dataset-name sharegpt --dataset-path sharegpt.json --profile --num-prompts 2000
============ Serving Benchmark Result ============
Successful requests:                     14        
Benchmark duration (s):                  14.37     
Total input tokens:                      168       
Total generated tokens:                  768       
Request throughput (req/s):              0.97      
Input token throughput (tok/s):          11.69     
Output token throughput (tok/s):         53.45     
---------------Time to First Token----------------
Mean TTFT (ms):                          6000.30   
Median TTFT (ms):                        5998.19   
P99 TTFT (ms):                           6016.37   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.25     
Median TPOT (ms):                        19.23     
P99 TPOT (ms):                           20.32     
---------------Inter-token Latency----------------
Mean ITL (ms):                           18.92     
Median ITL (ms):                         18.78     
P99 ITL (ms):                            38.20     
==================================================

With 2000, things actually got worse: 53.45 tok/s. In fact only 14 of the 2000 requests are reported as successful, so this run largely failed rather than merely slowing down.
Let's try 500.

python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Meta-Llama-3-8B --dataset-name sharegpt --dataset-path sharegpt.json --profile --num-prompts 500
============ Serving Benchmark Result ============
Successful requests:                     499       
Benchmark duration (s):                  33.50     
Total input tokens:                      8948      
Total generated tokens:                  36459     
Request throughput (req/s):              14.90     
Input token throughput (tok/s):          267.13    
Output token throughput (tok/s):         1088.44   
---------------Time to First Token----------------
Mean TTFT (ms):                          2110.71   
Median TTFT (ms):                        914.80    
P99 TTFT (ms):                           5295.46   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          63.63     
Median TPOT (ms):                        69.98     
P99 TPOT (ms):                           83.27     
---------------Inter-token Latency----------------
Mean ITL (ms):                           48.79     
Median ITL (ms):                         45.24     
P99 ITL (ms):                            192.39    
==================================================

I wondered whether the results would vary from run to run, but at around 500 prompts it consistently reached about 1000 tok/s over several attempts. Is this the sweet spot?
Let's try 1000.

python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Meta-Llama-3-8B --dataset-name sharegpt --dataset-path sharegpt.json --profile --num-prompts 1000
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  26.39     
Total input tokens:                      12        
Total generated tokens:                  35        
Request throughput (req/s):              0.04      
Input token throughput (tok/s):          0.45      
Output token throughput (tok/s):         1.33      
---------------Time to First Token----------------
Mean TTFT (ms):                          20676.33  
Median TTFT (ms):                        20676.33  
P99 TTFT (ms):                           20676.33  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.47     
Median TPOT (ms):                        17.47     
P99 TPOT (ms):                           17.47     
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.47     
Median ITL (ms):                         17.46     
P99 ITL (ms):                            17.57     
==================================================

No good at all: only a single request succeeded. Maybe the GPU memory is overflowing.
So let's also try 750.

python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Meta-Llama-3-8B --dataset-name sharegpt --dataset-path sharegpt.json --profile --num-prompts 750
============ Serving Benchmark Result ============
Successful requests:                     249       
Benchmark duration (s):                  30.00     
Total input tokens:                      4509      
Total generated tokens:                  16892     
Request throughput (req/s):              8.30      
Input token throughput (tok/s):          150.32    
Output token throughput (tok/s):         563.15    
---------------Time to First Token----------------
Mean TTFT (ms):                          5276.42   
Median TTFT (ms):                        5492.24   
P99 TTFT (ms):                           5506.22   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          43.34     
Median TPOT (ms):                        44.11     
P99 TPOT (ms):                           82.41     
---------------Inter-token Latency----------------
Mean ITL (ms):                           35.13     
Median ITL (ms):                         39.73     
P99 ITL (ms):                            49.14     
==================================================

563 tok/s, worse than 500 (and only 249 of the 750 requests succeeded).
Let's try 600.

============ Serving Benchmark Result ============
Successful requests:                     399       
Benchmark duration (s):                  33.74     
Total input tokens:                      7434      
Total generated tokens:                  29589     
Request throughput (req/s):              11.82     
Input token throughput (tok/s):          220.31    
Output token throughput (tok/s):         876.88    
---------------Time to First Token----------------
Mean TTFT (ms):                          3267.80   
Median TTFT (ms):                        2824.51   
P99 TTFT (ms):                           5734.50   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          57.66     
Median TPOT (ms):                        61.29     
P99 TPOT (ms):                           77.98     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.81     
Median ITL (ms):                         40.96     
P99 ITL (ms):                            108.18    
==================================================

Slightly worse than 500.
Let's try 550.

============ Serving Benchmark Result ============
Successful requests:                     449       
Benchmark duration (s):                  33.41     
Total input tokens:                      8345      
Total generated tokens:                  33188     
Request throughput (req/s):              13.44     
Input token throughput (tok/s):          249.76    
Output token throughput (tok/s):         993.31    
---------------Time to First Token----------------
Mean TTFT (ms):                          2346.12   
Median TTFT (ms):                        1516.16   
P99 TTFT (ms):                           5284.84   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          62.04     
Median TPOT (ms):                        68.58     
P99 TPOT (ms):                           84.35     
---------------Inter-token Latency----------------
Mean ITL (ms):                           47.13     
Median ITL (ms):                         40.93     
P99 ITL (ms):                            209.50    
==================================================

Getting close to the 1000 tok/s mark. Now let's go the other way and drop below 500, starting with 450:

============ Serving Benchmark Result ============
Successful requests:                     450       
Benchmark duration (s):                  32.76     
Total input tokens:                      8357      
Total generated tokens:                  33212     
Request throughput (req/s):              13.73     
Input token throughput (tok/s):          255.07    
Output token throughput (tok/s):         1013.69   
---------------Time to First Token----------------
Mean TTFT (ms):                          1664.00   
Median TTFT (ms):                        914.51    
P99 TTFT (ms):                           4439.54   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          61.85     
Median TPOT (ms):                        65.42     
P99 TPOT (ms):                           81.02     
---------------Inter-token Latency----------------
Mean ITL (ms):                           47.14     
Median ITL (ms):                         40.97     
P99 ITL (ms):                            204.71    
==================================================

At 450 the performance is close to the 500-prompt run.
What about 400?

============ Serving Benchmark Result ============
Successful requests:                     400       
Benchmark duration (s):                  32.02     
Total input tokens:                      7446      
Total generated tokens:                  29700     
Request throughput (req/s):              12.49     
Input token throughput (tok/s):          232.54    
Output token throughput (tok/s):         927.53    
---------------Time to First Token----------------
Mean TTFT (ms):                          1406.67   
Median TTFT (ms):                        916.24    
P99 TTFT (ms):                           3879.88   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          62.00     
Median TPOT (ms):                        64.81     
P99 TPOT (ms):                           87.21     
---------------Inter-token Latency----------------
Mean ITL (ms):                           46.65     
Median ITL (ms):                         41.35     
P99 ITL (ms):                            211.05    
==================================================

It dropped further.
So with this system and this benchmark method, around 500 prompts is close to peak efficiency, at roughly 1088 tok/s.
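The manual sweep above could also be scripted. Here's a rough sketch: the regex is an assumption about the exact report format shown above, and the command mirrors the ones I ran.

```python
import re
import subprocess

# Matches the "Output token throughput" line in benchmark_serving.py's report.
# NOTE: this pattern is an assumption based on the report format shown above.
THROUGHPUT_RE = re.compile(r"Output token throughput \(tok/s\):\s+([\d.]+)")

def parse_output_throughput(report: str) -> float:
    """Extract the output token throughput (tok/s) from a benchmark report."""
    match = THROUGHPUT_RE.search(report)
    if match is None:
        raise ValueError("no 'Output token throughput' line found")
    return float(match.group(1))

def run_sweep(prompt_counts):
    """Run the serving benchmark once per prompt count; return {n: tok/s}."""
    results = {}
    for n in prompt_counts:
        proc = subprocess.run(
            ["python", "benchmarks/benchmark_serving.py",
             "--backend", "vllm",
             "--model", "meta-llama/Meta-Llama-3-8B",
             "--dataset-name", "sharegpt",
             "--dataset-path", "sharegpt.json",
             "--num-prompts", str(n)],
            capture_output=True, text=True, check=True,
        )
        results[n] = parse_output_throughput(proc.stdout)
    return results

# Example: run_sweep([400, 450, 500, 550, 600])
```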

This run didn't use the CPU at all, so I should also try an offload configuration that leans more on the CPU. But my flight is about to leave, so that's for next time.
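For reference, newer vLLM builds expose a CPU-offload knob on the server side; an untested sketch of what that experiment might look like (whether the flag exists depends on your vLLM version):

```shell
# Untested sketch: offload part of the model weights to CPU RAM.
# --cpu-offload-gb exists in newer vLLM builds; verify with --help first.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B \
    --cpu-offload-gb 32
```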