[Local LLM] Trying out "speculative sampling" in llama.cpp

September 5, 2023, 18:30

An experimental feature called "Speculative Sampling" was merged into llama.cpp and has been attracting attention.
Karpathy of OpenAI explains the technique in the post below.

Speculative execution for LLMs is an excellent inference-time optimization.

It hinges on the following unintuitive observation: forwarding an LLM on a single input token takes about as much time as forwarding an LLM on K input tokens in a batch (for larger K than you might… https://t.co/FiwTwqsfho
— Andrej Karpathy (@karpathy) August 31, 2023

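Reading this explanation as a layperson, it resembles a person writing text with the predictive input on a smartphone.
Running inference from scratch on a large LLM takes time, so a lightweight LLM first proposes candidate tokens. The main LLM adopts a proposed token as-is if it is acceptable (otherwise it discards the proposal and generates the token itself).
Note that "speculative" carries the connotation of "work that may turn out to be wasted", and in computing it seems customary to render it in Japanese as 「投機的」.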
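
The propose-then-verify idea can be sketched in a few lines of Python. This is only a toy under stated assumptions, not llama.cpp's implementation: toy_target and toy_draft are hypothetical stand-ins for the large and small models, decoding is greedy (in the spirit of "--top_k 1"), and the large model is called once per position for readability, whereas in practice the benefit comes from scoring all drafted positions in a single batched pass.

```python
from typing import Callable, List, Tuple

Token = str
# A greedy "model" here is just a function: context -> next token.
NextToken = Callable[[List[Token]], Token]


def speculative_greedy(
    target: NextToken,
    draft: NextToken,
    prompt: List[Token],
    n_new: int,
    n_draft: int,
) -> Tuple[List[Token], int, int]:
    """Generate n_new tokens; return (new_tokens, n_drafted, n_accepted)."""
    if n_draft < 1:
        raise ValueError("n_draft must be at least 1")
    out = list(prompt)
    limit = len(prompt) + n_new
    n_drafted = 0
    n_accepted = 0
    while len(out) < limit:
        # 1) The small model proposes a short run of tokens, one after another.
        proposal: List[Token] = []
        for _ in range(min(n_draft, limit - len(out))):
            proposal.append(draft(out + proposal))
        n_drafted += len(proposal)

        # 2) The large model checks the proposal position by position.
        for tok in proposal:
            expected = target(out)
            if tok == expected:
                out.append(tok)       # proposal accepted as-is
                n_accepted += 1
            else:
                out.append(expected)  # rejected: use the large model's token
                break                 # the rest of the draft is discarded
    return out[len(prompt):], n_drafted, n_accepted


# Hypothetical toy models (assumptions for illustration only).
SENTENCE = "the quick brown fox jumps over the lazy dog".split()


def toy_target(ctx: List[Token]) -> Token:
    """Stand-in for the large model: always continues the sentence correctly."""
    return SENTENCE[len(ctx) % len(SENTENCE)]


def toy_draft(ctx: List[Token]) -> Token:
    """Stand-in for the small model: wrong at every fourth position."""
    if len(ctx) % 4 == 3:
        return "???"
    return SENTENCE[len(ctx) % len(SENTENCE)]


if __name__ == "__main__":
    tokens, drafted, accepted = speculative_greedy(
        toy_target, toy_draft, prompt=["the"], n_new=8, n_draft=4
    )
    print(" ".join(tokens))
    print(f"accept = {100 * accepted / drafted:.3f}%")
```

Because every emitted token is either confirmed or replaced by the target model, the output is the same as the target model's own greedy output; only the number of large-model passes needed to produce it changes.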

Trying it with llama.cpp

The usage example given for speculative sampling uses "codellama-34b" as the main model and "codellama-7b (4-bit quantized)" as the draft model.

# standard F16 sampling (using "main" tool)
./bin/main \
-m ../models/codellama-34b/ggml-model-f16.gguf \
-p "// Quick-sort implementation in C (4 spaces indentation + detailed comments) and sample usage:\n\n#include" \
-e -ngl 1 -t 4 -n 256 -c 4096 -s 8 --top_k 1

# speculative F16 sampling with Q4_1 draft (using "speculative" tool)
# example 0
./bin/speculative \
-m ../models/codellama-34b/ggml-model-f16.gguf \
-md ../models/codellama-7b/ggml-model-q4_1.gguf \
-p "// Quick-sort implementation in C (4 spaces indentation + detailed comments) and sample usage:\n\n#include" \
-e -ngl 1 -t 4 -n 256 -c 4096 -s 8 --top_k 1 --draft 16

Full F16 precision 34B Code Llama at >20 t/s on M2 Ultra pic.twitter.com/7diki8zes4
— Georgi Gerganov (@ggerganov) August 31, 2023

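According to the developer, Gerganov, who tested on an Apple M2 Ultra, 34B Code Llama (F16) went from roughly 10 tokens/s to roughly 20 tokens/s with speculative sampling.
The technique is said to suit uses such as coding, where the output is tightly constrained, and to be a poor fit for free-form formats such as chat, where many different tokens could plausibly come next.
If the point is that the output is tightly constrained, might translation tasks be a good fit too...?
Acting on that layperson's hunch, this time I tried speculative sampling on English-to-Japanese translation.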

Code

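I combine OpenBuddy's Llama-2-70B and Llama-2-13B (both 3-bit quantized), which I used earlier for translation tasks.
The commands are the sample with only the prompt swapped out; the parameters are reused as-is. "--draft 16" sets the number of candidate tokens that the draft model (the small model) proposes to the main LLM.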

# inference with the 70B model alone
!./main \
-m ./models/openbuddy-llama2-70b-v10.1-q3_k.gguf \
-p "USER: 次の英文を日本語に訳してください。\nSpeculative execution for large language models (LLMs) is a technique that aims to accelerate the inference of generative LLMs by predicting multiple output tokens in parallel using smaller models, and then verifying them with the LLM. This reduces the latency and computational cost of the LLM inference, which normally relies on an auto-regressive decoding process that generates one token at a time.\n\nASSISTANT: 日本語訳は次の通りです。「" \
-e -ngl 1 -t 4 -n 256 -c 4096 -s 8 --top_k 1

# inference with speculative sampling: 70B model + 13B draft model
!./speculative \
-m ./models/openbuddy-llama2-70b-v10.1-q3_k.gguf \
-md ./models/openbuddy-llama2-13b-v11.1-q3_k.gguf \
-p "USER: 次の英文を日本語に訳してください。\nSpeculative execution for large language models (LLMs) is a technique that aims to accelerate the inference of generative LLMs by predicting multiple output tokens in parallel using smaller models, and then verifying them with the LLM. This reduces the latency and computational cost of the LLM inference, which normally relies on an auto-regressive decoding process that generates one token at a time.\n\nASSISTANT: 日本語訳は次の通りです。「" \
-e -ngl 1 -t 4 -n 256 -c 4096 -s 8 --top_k 1 --draft 16

Running the commands above gave the following output. The generated text was identical in both cases.

■ Prompt (asks for a Japanese translation of the English passage)
USER: 次の英文を日本語に訳してください。
Speculative execution for large language models (LLMs) is a technique that aims to accelerate the inference of generative LLMs by predicting multiple output tokens in parallel using smaller models, and then verifying them with the LLM. This reduces the latency and computational cost of the LLM inference, which normally relies on an auto-regressive decoding process that generates one token at a time.
ASSISTANT: 日本語訳は次の通りです。「

■ Output of the 70B model alone
大規模言語モデル（LLM）のための推測実行は、生成的なLLMの推論を高速化するために、小さなモデルを使用して複数の出力トークンを並列で予測し、それらをLLMで検証する技術です。これにより、通常は一度に1つのトークンを生成する自己回帰的デコードプロセスに頼っているLLMの推論の遅延と計算コストが低減されます。」

※1.38 tokens per second

■ Output of speculative sampling with the 70B model + 13B model
大規模言語モデル（LLM）のための推測実行は、生成的なLLMの推論を高速化するために、小さなモデルを使用して複数の出力トークンを並列で予測し、それらをLLMで検証する技術です。これにより、通常は一度に1つのトークンを生成する自己回帰的デコードプロセスに頼っているLLMの推論の遅延と計算コストが低減されます。」

※0.885 tokens per second
※accept = 63.333%

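In this example, generation speed went from 1.38 tokens/s to 0.89 tokens/s: adding speculative sampling actually slowed generation down.
"accept" is the share of the candidate tokens proposed by the 13B model that the 70B model actually adopted. Here the acceptance rate is 63.3%.
I also tried the translation task on other example sentences, and the speed was roughly 1 token/s each time. So speculative sampling brought no benefit this time.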

Thoughts

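Under the conditions I tried, accept was roughly 60-70%, which does not seem like a bad number in itself (?)
The official sample pairs a 34B model in F16 (70GB) with a 4-bit quantized 7B model (4GB), a difference in model size of more than 15x.
What I tried was a 3-bit quantized 70B model (33GB) with a 3-bit quantized 13B model (6.3GB), a difference of only about 5x.
Unless the size gap between the paired models is large enough, speculative sampling may be unlikely to pay off.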
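
The size-gap hypothesis can be made a little more concrete with a back-of-the-envelope cost model. The sketch below rests on assumptions that do not come from the experiments above: each drafted token is accepted independently with probability accept_prob, verifying a whole draft costs the same as one pass of the large model, and the cost of one draft-model step relative to one large-model pass (cost_ratio) is approximated by the ratio of the model file sizes. The reported accept figure is used as a rough stand-in for accept_prob, including for the official sample, whose acceptance rate is not given here. Treat it as an illustration of the trend, not a prediction of llama.cpp timings.

```python
def expected_speedup(accept_prob: float, n_draft: int, cost_ratio: float) -> float:
    """Estimated speedup over plain decoding under a simplified cost model.

    accept_prob: assumed probability that each drafted token is accepted
                 (treated as independent per token)
    n_draft:     number of tokens drafted per round
    cost_ratio:  cost of one draft-model step / cost of one large-model pass
    """
    if not 0.0 <= accept_prob < 1.0:
        raise ValueError("accept_prob must be in [0, 1)")
    if n_draft < 1 or cost_ratio < 0.0:
        raise ValueError("n_draft must be >= 1 and cost_ratio >= 0")
    # Expected tokens per round: the accepted prefix of the draft plus the
    # one token the large model contributes itself (geometric series).
    tokens_per_round = (1.0 - accept_prob ** (n_draft + 1)) / (1.0 - accept_prob)
    # Cost per round, in units of one large-model pass:
    # n_draft small-model steps plus one large-model verification pass.
    cost_per_round = n_draft * cost_ratio + 1.0
    return tokens_per_round / cost_per_round


if __name__ == "__main__":
    ACCEPT = 0.633  # assumption: reuse the reported accept rate for both setups
    N_DRAFT = 16    # "--draft 16"
    setups = {
        "34B F16 (70GB) + 7B Q4_1 (4GB)": 4.0 / 70.0,
        "70B Q3_K (33GB) + 13B Q3_K (6.3GB)": 6.3 / 33.0,
    }
    for name, ratio in setups.items():
        print(f"{name}: x{expected_speedup(ACCEPT, N_DRAFT, ratio):.2f}")
```

Under these assumptions the estimate falls as cost_ratio rises: with a relatively expensive draft model, the time spent drafting can outweigh the large-model passes saved.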
"Speculative Sampling" and "Speculative Decoding" seem to be hot techniques, so I'm glad to have at least a loose grasp of them for now.