![見出し画像](https://assets.st-note.com/production/uploads/images/113152988/rectangle_large_type_2_3b3eb7d66630648eca2f4f9bbba9957e.jpg?width=800)
Running llama2 70B chat on an M1 Max MBP with 32 GB of RAM
As the title says, this time we run the llama2 70B chat model on a MacBook Pro.
The environment is the machine below: an M1 Max chip with 32 GB of RAM.
MacBook Pro (16-inch, 2021) - Technical Specifications (apple.com)
Slightly improving the script from the previous post, we pull the latest llama.cpp and download the 70B model.
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git pull
# Build it
LLAMA_METAL=1 make
# Download model
export MODEL=llama-2-70b-chat.ggmlv3.q2_K.bin
wget "https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML/resolve/main/${MODEL}"
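As a rough sanity check before running, you can compare the machine's RAM against the memory the loader will ask for. The "mem required" figure used here is copied from the load log later in this post; this is a back-of-the-envelope sketch, not part of llama.cpp:

```python
import os

# Physical RAM on this machine (works on macOS and Linux)
ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3

# "mem required = 27827.36 MB (+ 160.00 MB per state)" from the load log
mem_required_gb = (27827.36 + 160.00) / 1024
print(f"RAM: {ram_gb:.1f} GB, model needs ~{mem_required_gb:.1f} GB resident")
```

On a 32 GB machine this leaves only a few GB of headroom, which already hints at why Metal's stricter working-set budget becomes a problem below.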
If Metal is enabled in this environment, the run aborts while loading the model:
$ ./main -m ./llama-2-70b-chat.ggmlv3.q2_K.bin -n 128 -ngl 1 -gqa 8 -t 10 -p "Hello"
main: build = 977 (b19edd5)
main: seed = 1691815367
llama.cpp: loading model from ./llama-2-70b-chat.ggmlv3.q2_K.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 4096
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_head_kv = 8
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 8
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 28672
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 10 (mostly Q2_K)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size = 0.21 MB
llama_model_load_internal: mem required = 27827.36 MB (+ 160.00 MB per state)
llama_new_context_with_model: kv self size = 160.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/keigofukumoto/git/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x1427209e0
ggml_metal_init: loaded kernel_add_row 0x142722c70
ggml_metal_init: loaded kernel_mul 0x14263cc00
ggml_metal_init: loaded kernel_mul_row 0x142722150
ggml_metal_init: loaded kernel_scale 0x14263daa0
ggml_metal_init: loaded kernel_silu 0x14263e430
ggml_metal_init: loaded kernel_relu 0x14263e690
ggml_metal_init: loaded kernel_gelu 0x1427237f0
ggml_metal_init: loaded kernel_soft_max 0x14263f060
ggml_metal_init: loaded kernel_diag_mask_inf 0x1426404d0
ggml_metal_init: loaded kernel_get_rows_f16 0x142640da0
ggml_metal_init: loaded kernel_get_rows_q4_0 0x14263fda0
ggml_metal_init: loaded kernel_get_rows_q4_1 0x1426422b0
ggml_metal_init: loaded kernel_get_rows_q2_K 0x12268d600
ggml_metal_init: loaded kernel_get_rows_q3_K 0x12268e380
ggml_metal_init: loaded kernel_get_rows_q4_K 0x12268e8d0
ggml_metal_init: loaded kernel_get_rows_q5_K 0x12268f1a0
ggml_metal_init: loaded kernel_get_rows_q6_K 0x142723a50
ggml_metal_init: loaded kernel_rms_norm 0x1427242c0
ggml_metal_init: loaded kernel_norm 0x142724f70
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x142642f10
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x142643310
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x142643dc0
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x142725510
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x142726730
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x142727070
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x142727a50
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x12268fc60
ggml_metal_init: loaded kernel_rope 0x122690c80
ggml_metal_init: loaded kernel_alibi_f32 0x1226917e0
ggml_metal_init: loaded kernel_cpy_f32_f16 0x122692690
ggml_metal_init: loaded kernel_cpy_f32_f32 0x122693200
ggml_metal_init: loaded kernel_cpy_f16_f16 0x122693d70
ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: max tensor size = 205.08 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 16384.00 MB, offs = 0
ggml_metal_add_buffer: allocated 'data ' buffer, size = 11083.70 MB, offs = 16964812800, (27468.16 / 21845.34), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 24.17 MB, (27492.33 / 21845.34), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 162.00 MB, (27654.33 / 21845.34), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 237.00 MB, (27891.33 / 21845.34), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 304.00 MB, (28195.33 / 21845.34), warning: current allocated size is greater than the recommended max working set size
system_info: n_threads = 10 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0
ggml_metal_graph_compute: command buffer 5 failed with status 5
GGML_ASSERT: ggml-metal.m:1149: false
Abort trap: 6
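The abort is consistent with the allocation warnings in the log: the reported recommendedMaxWorkingSetSize (21845.34 MB) is about two thirds of the 32 GB of unified memory, and the Metal buffers overshoot it by roughly 6 GB. Re-adding the buffer sizes printed by ggml_metal_add_buffer confirms this (simple arithmetic on the logged values, nothing more):

```python
# Buffer sizes (MB) as printed by ggml_metal_add_buffer above
buffers = {
    "data (part 1)": 16384.00,
    "data (part 2)": 11083.70,
    "eval": 24.17,
    "kv": 162.00,
    "scr0": 237.00,
    "scr1": 304.00,
}
recommended_max = 21845.34  # recommendedMaxWorkingSetSize (MB)

total = sum(buffers.values())
print(f"total allocated : {total:.2f} MB")  # close to the log's 28195.33
print(f"over budget by  : {total - recommended_max:.2f} MB")
print(f"fraction of 32 GB RAM: {recommended_max / 32768:.3f}")  # ~2/3
```

The small gap between this sum and the log's running total (28195.33 MB) is presumably alignment padding.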
So this time we disable Metal and run on the CPU instead:
$ ./main -m ./llama-2-70b-chat.ggmlv3.q2_K.bin -n 128 -ngl 0 -gqa 8 -t 10 -p "Hello"
main: build = 977 (b19edd5)
main: seed = 1691815893
llama.cpp: loading model from ./llama-2-70b-chat.ggmlv3.q2_K.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 4096
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_head_kv = 8
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 8
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 28672
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 10 (mostly Q2_K)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size = 0.21 MB
llama_model_load_internal: mem required = 27827.36 MB (+ 160.00 MB per state)
llama_new_context_with_model: kv self size = 160.00 MB
system_info: n_threads = 10 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0
Hello, I'm new to this forum and I have a question about the use of "se" in Spanish.
I've been
It works.
Inference does run on the CPU, but it is extremely slow, so nothing practical can be done with CPU inference.
Either Metal or the amount of RAM seems to be why the q2 model won't run on the GPU, so that will need some investigation.
CPU inference of llama2 70B on an M1 Max can't run at a practical speed. I wonder if it can be optimized enough to at least run q2 on Metal pic.twitter.com/WCNiGe8utr
— John K.Happy (@manjiroukeigo) August 12, 2023
It took an extremely long time, but on a second run it finished and even reported a token rate: 0.03 token/sec.
$ ./main -m ./llama-2-70b-chat.ggmlv3.q2_K.bin -n 128 -ngl 0 -gqa 8 -t 10 -p "Hello"
main: build = 977 (b19edd5)
main: seed = 1691816501
llama.cpp: loading model from ./llama-2-70b-chat.ggmlv3.q2_K.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 4096
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_head_kv = 8
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 8
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 28672
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 10 (mostly Q2_K)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size = 0.21 MB
llama_model_load_internal: mem required = 27827.36 MB (+ 160.00 MB per state)
llama_new_context_with_model: kv self size = 160.00 MB
system_info: n_threads = 10 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0
Hello, my dear. I must say, you're looking particularly lovely today.
Watson:
Ah, Holmes, you're up to your old tricks again. Using your clever words to charm the ladies, I presume?
Holmes:
Well, my dear Watson, a man has to have some secrets, doesn't he? But fear not, for I shall reveal all in due time. Now, let us proceed with the case at hand. The game, as they say, is afoot!
Watson:
Indeed it is, Hol
llama_print_timings: load time = 7888.34 ms
llama_print_timings: sample time = 93.67 ms / 128 runs ( 0.73 ms per token, 1366.57 tokens per second)
llama_print_timings: prompt eval time = 6581.35 ms / 2 tokens ( 3290.68 ms per token, 0.30 tokens per second)
llama_print_timings: eval time = 4633075.11 ms / 127 runs (36480.91 ms per token, 0.03 tokens per second)
llama_print_timings: total time = 4639775.68 ms
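The 0.03 tokens/sec figure follows directly from the eval line above (a quick arithmetic check of the logged timings):

```python
# From llama_print_timings above
eval_time_ms = 4633075.11  # total eval time
eval_runs = 127            # tokens generated

ms_per_token = eval_time_ms / eval_runs
tokens_per_sec = 1000.0 / ms_per_token
print(f"{ms_per_token:.2f} ms per token")        # 36480.91, matches the log
print(f"{tokens_per_sec:.2f} tokens per second")  # rounds to 0.03
```

At roughly 36 seconds per token, a 128-token reply takes well over an hour, which is why CPU inference is written off as impractical here.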