Trying PowerInfer on Google Colab

December 28, 2023 17:19

I tried out "PowerInfer" on "Google Colab", so here is a summary.

[Note] Tested on an A100 with Google Colab Pro/Pro+.

1. PowerInfer

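"PowerInfer" is an LLM inference engine that can run LLMs at high speed even on a PC with a single consumer-grade GPU. It achieves fast inference by exploiting the high locality inherent in LLM inference, which is characterized by a power-law distribution in neuron activation.

According to the evaluation, on a single NVIDIA RTX 4090 GPU it achieved an average token generation rate of 13.20 tokens/s and a peak of 29.08 tokens/s across various LLMs (including OPT-175B). It runs up to 11.69x faster than llama.cpp while maintaining model accuracy.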

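2. Supported models

Models are stored in a special format called "PowerInfer GGUF", which is based on the GGUF format and contains both the LLM weights and the predictor weights.
The following four models are currently supported.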

・LLaMA(ReLU)-2-7B : PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF
・LLaMA(ReLU)-2-13B : PowerInfer/ReluLLaMA-13B-PowerInfer-GGUF
・Falcon(ReLU)-40B : PowerInfer/ReluFalcon-40B-PowerInfer-GGUF
・LLaMA(ReLU)-2-70B : PowerInfer/ReluLLaMA-70B-PowerInfer-GGUF

3. Running on Colab

The steps for running on Colab are as follows.

(1) Open a Colab notebook and select "GPU" under the menu "Edit → Notebook settings".
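Before going further, it can help to confirm which GPU the runtime actually received. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH of the Colab runtime; the helper names `parse_gpu_line` and `query_gpus` are hypothetical and not part of PowerInfer.

```python
import shutil
import subprocess


def parse_gpu_line(line):
    """Split one 'name, memory' CSV line from nvidia-smi into (name, MiB)."""
    name, memory = [part.strip() for part in line.rsplit(",", 1)]
    return name, int(memory.split()[0])


def query_gpus():
    """Return a list of (name, total MiB); empty when no usable GPU is found."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        # nvidia-smi exists but could not talk to a driver.
        return []
    return [parse_gpu_line(l) for l in result.stdout.splitlines() if l.strip()]


print(query_gpus() or "No GPU runtime detected")
```

If the list comes back empty, the notebook is probably still on a CPU runtime and the notebook settings should be checked again.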

(2) Install the packages.

# Install the packages
!git clone https://github.com/SJTU-IPADS/PowerInfer
%cd PowerInfer
!pip install -r requirements.txt

(3) Build.

# Build
!cmake -S . -B build -DLLAMA_CUBLAS=ON
!cmake --build build --config Release

(4) Download the model.
This time I use "ReluLLaMA-70B-PowerInfer-GGUF". The download took about 22 minutes.

# Download the model
!git clone https://huggingface.co/PowerInfer/ReluLLaMA-70B-PowerInfer-GGUF/

The Hugging Face repository provides both the "PowerInfer GGUF weights" and the "model activation statistics".

The folder structure of a PowerInfer model repository is as follows. Downloading or cloning the entire repository is recommended.

・*.powerinfer.gguf : unquantized PowerInfer model
・*.q4.powerinfer.gguf : INT4 quantized PowerInfer model
・activation : profiled activation statistics for fine-grained FFN offloading
・activation_x.pt : profiled activation statistics for layer x
・*.[q4].powerinfer.gguf.generated.gpuidx : GPU index generated at runtime for the corresponding model
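To make the naming scheme above concrete, here is a small sketch that labels file names according to those patterns. The `classify` helper and the example names other than `llama-70b-relu.q4.powerinfer.gguf` are assumptions for illustration only.

```python
from fnmatch import fnmatchcase

# Patterns follow the folder structure listed above. Order matters:
# "*.powerinfer.gguf" would also match the q4 file, so the more
# specific patterns are checked first.
PATTERNS = [
    ("*.powerinfer.gguf.generated.gpuidx", "generated GPU index"),
    ("*.q4.powerinfer.gguf", "INT4 quantized model"),
    ("*.powerinfer.gguf", "unquantized model"),
    ("activation_*.pt", "per-layer activation statistics"),
]


def classify(filename):
    """Return a human-readable label for one file name (hypothetical helper)."""
    for pattern, label in PATTERNS:
        if fnmatchcase(filename, pattern):
            return label
    return "other"


for name in [
    "llama-70b-relu.q4.powerinfer.gguf",                    # used in this article
    "llama-70b-relu.q4.powerinfer.gguf.generated.gpuidx",   # assumed example name
    "activation_0.pt",                                       # assumed example name
]:
    print(f"{name}: {classify(name)}")
```

Running something like this over `os.listdir()` of the cloned folder is a quick way to check that the clone fetched both the weights and the activation statistics.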

(5) Run inference.

# Run inference
!./build/bin/main \
    -m ./ReluLLaMA-70B-PowerInfer-GGUF/llama-70b-relu.q4.powerinfer.gguf \
    -n 128 \
    -t 8 \
    -p "Madoka Magika is"

llama_new_context_with_model: compute buffer total size = 388.57 MB
llama_new_context_with_model: VRAM scratch buffer: 387.00 MB
llama_new_context_with_model: total VRAM used: 10969.04 MB (model: 10582.03 MB, context: 387.00 MB)

system_info: n_threads = 8 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
sampling: 
	repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0


Madoka Magika is one of the most divisive anime I’ve ever seen. Some people love it, others hate it and I can definitely see why. In general, however, the consensus seems to be that it’s a masterpiece. The anime follows Madoka Kaname as she discovers the world of magical girls and is forced to fight alongside them against evil witches who prey on human souls.  
The show takes its sweet time to build up, but by episode 3 (yes, this is a spoiler-free review) you’ll be completely hooked. Madoka Magika
llama_print_timings:        load time =   20381.00 ms
llama_print_timings:      sample time =      69.10 ms /   128 runs   (    0.54 ms per token,  1852.28 tokens per second)
llama_print_timings: prompt eval time =     861.31 ms /     6 tokens (  143.55 ms per token,     6.97 tokens per second)
llama_print_timings:        eval time =   22519.14 ms /   127 runs   (  177.32 ms per token,     5.64 tokens per second)
llama_print_timings:       total time =   23498.65 ms
Log end

The 70B model ran at 5.64 tokens/s, with VRAM usage at 33.3GB.

The arguments are as follows.
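The tokens/s figure can be recomputed from the `llama_print_timings` lines in the log. The following is a rough sketch: the regular expression and the helper names are my own assumptions written against the log format shown in this article, not an API provided by PowerInfer.

```python
import re

# Matches lines such as:
#   llama_print_timings:        eval time =   22519.14 ms /   127 runs   (...)
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time =\s+(?P<ms>[\d.]+) ms"
    r"(?: /\s+(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timings(log):
    """Map each timing name to (milliseconds, count or None)."""
    rows = {}
    for m in TIMING_RE.finditer(log):
        count = int(m["count"]) if m["count"] else None
        rows[m["name"]] = (float(m["ms"]), count)
    return rows


def tokens_per_second(ms, count):
    """Convert a total time in milliseconds and a token count to tokens/s."""
    return count / (ms / 1000.0)


# Two lines copied from the log of this run.
LOG = """
llama_print_timings: prompt eval time =     861.31 ms /     6 tokens (  143.55 ms per token,     6.97 tokens per second)
llama_print_timings:        eval time =   22519.14 ms /   127 runs   (  177.32 ms per token,     5.64 tokens per second)
"""

timings = parse_timings(LOG)
print(round(tokens_per_second(*timings["eval"]), 2))  # 5.64
```

The "eval" row is the one that reflects generation speed; the "prompt eval" row covers only the 6 prompt tokens.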

・-m <FNAME>, --model <FNAME> : model path
・-n <N>, --n-predict <N> : number of tokens to predict (default: -1, -1 = infinity)
・-t <N>, --threads <N> : number of threads (default: 8)
・-p <PROMPT>, --prompt <PROMPT> : prompt (default: empty)
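When trying several prompts or token counts, assembling the command in Python avoids quoting mistakes. This is a minimal sketch that only uses the four flags listed above; `build_command` is a hypothetical helper, and the binary and model paths are the ones used in this article.

```python
import shlex


def build_command(model, prompt, n_predict=128, threads=8,
                  binary="./build/bin/main"):
    """Assemble the argument list for the PowerInfer main binary."""
    return [
        binary,
        "-m", model,            # model path
        "-n", str(n_predict),   # number of tokens to predict
        "-t", str(threads),     # number of threads
        "-p", prompt,           # prompt
    ]


cmd = build_command(
    "./ReluLLaMA-70B-PowerInfer-GGUF/llama-70b-relu.q4.powerinfer.gguf",
    "Madoka Magika is",
)
# shlex.join quotes the prompt so the line can be pasted after "!" in a cell.
print(shlex.join(cmd))
```

The list form can also be passed directly to `subprocess.run(cmd)`, in which case no shell quoting is involved at all.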

(6) Run inference with limited VRAM usage.
"--vram-budget 8" is added.

# Run inference
!./build/bin/main \
    -m ./ReluLLaMA-70B-PowerInfer-GGUF/llama-70b-relu.q4.powerinfer.gguf \
    -n 128 \
    -t 8 \
    -p "Madoka Magika is" \
    --vram-budget 8

llama_new_context_with_model: compute buffer total size = 208.57 MB
llama_new_context_with_model: VRAM scratch buffer: 207.00 MB
llama_new_context_with_model: total VRAM used: 8390.50 MB (model: 8183.50 MB, context: 207.00 MB)

system_info: n_threads = 8 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
sampling: 
	repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0


Madoka Magika is the first Puella Magi Madoka Magica movie and takes place right after the conclusion of the anime.  
The story follows Madoka Kaname who, along with her friend Sayaka Miki, are offered the chance to have any wish granted on the condition that they become magical girls and fight against evil witches in which Madoka accepts but Sayaka refuses. After a disastrous encounter, Madoka loses hope until she is saved by Kyubey, a cat-like being who gives her powers in exchange for becoming his familiar.  
The film starts off with Madoka Kaname
llama_print_timings:        load time =    5288.65 ms
llama_print_timings:      sample time =      74.43 ms /   128 runs   (    0.58 ms per token,  1719.64 tokens per second)
llama_print_timings: prompt eval time =    2872.58 ms /     6 tokens (  478.76 ms per token,     2.09 tokens per second)
llama_print_timings:        eval time =   78119.22 ms /   127 runs   (  615.11 ms per token,     1.63 tokens per second)
llama_print_timings:       total time =   81116.66 ms
Log end

VRAM was limited to 8.7GB, and the 70B model ran at 1.63 tokens/s.
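To put the two runs side by side, the per-token latency and the slowdown can be worked out from the "eval time" lines of both logs. This is a small sketch using only numbers that appear in this article; the dictionary layout and the `ms_per_token` helper are my own.

```python
# Figures copied from the two llama_print_timings "eval time" lines above.
RUNS = {
    "default": {"eval_ms": 22519.14, "tokens": 127},
    "--vram-budget 8": {"eval_ms": 78119.22, "tokens": 127},
}


def ms_per_token(run):
    """Average generation latency per token in milliseconds."""
    return run["eval_ms"] / run["tokens"]


# How many times slower generation became with the VRAM budget applied.
slowdown = ms_per_token(RUNS["--vram-budget 8"]) / ms_per_token(RUNS["default"])

for label, run in RUNS.items():
    print(f"{label}: {ms_per_token(run):.2f} ms/token")
print(f"slowdown with the budget: {slowdown:.2f}x")
```

In this single measurement, limiting VRAM made generation roughly three and a half times slower, so the budget is a trade-off between memory and speed.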
