【無料】ローカルPCで動いて画像を理解する目を持ったOSSの生成AI 【LLaVA】(マルチモーダルLLM)

えんぞう

2024年4月13日 20:46

はじめに

Llama2をはじめ、ローカルPCで動くLLMにはまって遊んでいます。

llama2などを使っているうちに、

「ここまで精度のいいOSSがあるのであれば、きっとマルチモーダル対応のLLMもOSSであるのでは？」

と思って調べてみたら、見事にありました！

名前は「LLaVa」です。

マルチモーダルというのは、チャットのようなテキストだけでなく、画像や音声などもインプット情報として使って、回答が得られる機能のことです。

GPT-4VやClaude3もマルチモーダル対応しています。

大規模マルチモーダルモデル (LMM) は、最近、視覚的な指示の調整において有望な進歩を示しています。このノートでは、LLaVA の完全に接続されたビジョン言語クロスモーダルコネクタが驚くほど強力でデータ効率が高いことを示します。 LLaVA への簡単な変更、つまり MLP プロジェクションで CLIP-ViT-L-336px を使用し、シンプルな応答書式設定プロンプトを備えたアカデミックタスク指向の VQA データを追加することで、11 のシステム全体で最先端を達成する強力なベースラインを確立します。ベンチマーク。最後の 13B チェックポイントでは、公開されているわずか 120 万のデータを使用し、単一の 8-A100 ノードで完全なトレーニングを約 1 日で完了します。これにより、最先端の LMM 研究がより身近になることを願っています。コードとモデルは公開されます。

赤い線がLLaVAですが、ベンチマークのスコアも良さげです。

LLaVAの始め方

モデルのダウンロード

まずはモデルをダウンロードします。

以下２種類のファイルをダウンロードしましょう。

ggml-model-q5_k.gguf
mmproj-model-f16.gguf

# wget https://huggingface.co/mys/ggml_llava-v1.5-7b/resolve/main/ggml-model-q5_k.gguf
# wget https://huggingface.co/mys/ggml_llava-v1.5-7b/resolve/main/mmproj-model-f16.gguf

＜参考＞GGUFファイルとは？

llama.cppのセットアップ

もともとは、従来高スペックなGPUがないと動かないLlama2をMacで動かすモチベーションで作られたようです。

今ではLlama2に限らず、いろんなOSSのLLMに対応しています。

幸いLLaVAにも対応していましたので、これを使ってみましょう。

これで高いGPU買わなくても普通のPCで実行できるようになります。

Multimodal models:

LLaVA 1.5 models, LLaVA 1.6 models
BakLLaVA
Obsidian
ShareGPT4V
MobileVLM 1.7B/3B models
Yi-VL

llama.cppのセットアップ

llama.cppのセットアップは以下に記載しています。
今回は以下でセットアップした環境をそのまま使いました。

LLaVAの実行

llama.cppのセットアップが完了していれば、ダウンロードしたモデルを指定するだけで動きます。

とても簡単ですね！

それでは実行してみましょう。

まずは、以下の画像を使ってみました。

DALL-Eで生成した画像ですが、日本人ならなんとなく江戸時代くらいの商人たちが集まっていて何か品定めしてるような雰囲気は感じとれると思います。
でも、AIにとっては難しいかもしれませんね。

いきなりハードル高めですが、ちゃんとLLaVAは画像を理解してくれるのでしょうか？

実行

# ./llava-cli -m ./models/ggml-model-q5_k.gguf --mmproj ./models/mmproj-model-f16.gguf --image ./image.png  -p 'what is thie picture?'

llama.cppディレクトリ配下にある「llava-cil」というファイルを実行します。

-m：　ggml-model-q5_k.ggufのパスを指定
--mmproj：　mmproj-model-f16.ggufのパスを指定
--image：　image.pngを指定（読み込ませる画像ファイル）
-p：　プロンプトを指定（指示文章）

プロンプト

what is thie picture?

「この画像は何ですか？」と聞いてみました。

実行結果

llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    13.02 MiB
llama_new_context_with_model:        CPU compute buffer size =   160.00 MiB
llama_new_context_with_model: graph splits (measure): 1
encode_image_with_clip: image embedding created: 576 tokens

encode_image_with_clip: image encoded in  9719.26 ms by CLIP (   16.87 ms per image patch)

 The image depicts a colorful, black and white scene of many people in an ornate Asian-style room. They are sitting around dining tables, working on various tasks such as writing or using books, with some even playing musical instruments. There are several cups placed throughout the scene, possibly containing refreshments for the guests. The atmosphere appears to be lively and collaborative as they all engage in their activities together.

llama_print_timings:        load time =   19316.51 ms
llama_print_timings:      sample time =      54.77 ms /    90 runs   (    0.61 ms per token,  1643.09 tokens per second)
llama_print_timings: prompt eval time =  168978.43 ms /   621 tokens (  272.11 ms per token,     3.68 tokens per second)
llama_print_timings:        eval time =   29702.77 ms /    90 runs   (  330.03 ms per token,     3.03 tokens per second)
llama_print_timings:       total time =  209067.62 ms /   711 tokens

回答抜粋

The image depicts a colorful, black and white scene of many people in an ornate Asian-style room. They are sitting around dining tables, working on various tasks such as writing or using books, with some even playing musical instruments. There are several cups placed throughout the scene, possibly containing refreshments for the guests. The atmosphere appears to be lively and collaborative as they all engage in their activities together.

翻訳

この画像には、華やかなアジアンスタイルの部屋に大勢の人々が集まるカラフルな白黒の風景が描かれています。彼らはダイニングテーブルの周りに座り、本を書いたり本を読んだりといったさまざまな作業に取り組み、中には楽器を演奏する人もいます。シーン全体にいくつかのカップが配置されており、おそらくゲスト用の軽食が含まれています。彼ら全員が一緒に活動に取り組んでいるため、雰囲気は活気があり、協力的であるように見えます。

回答出ました！

「アジアンスタイルの部屋に大勢の人々が集まる」というのは正解ですね。
「日本」かどうかまではわからなかったようです。

あとは「ダイニングテーブル」と誤解しているので、喫茶店かそんな感じの場所と勘違いしてそうですね。

ただ、「一緒に活動に取り組んでいる」とか「雰囲気は活気がある」とかはうまく読み取っているのではないかなと思います。

負荷状況


top - 21:47:31 up  3:36,  0 users,  load average: 3.72, 1.67, 0.67
Tasks:   6 total,   2 running,   4 sleeping,   0 stopped,   0 zombie
%Cpu0  : 98.0 us,  1.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu1  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 98.7 us,  1.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 98.7 us,  1.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  15776.2 total,   1418.6 free,   4271.3 used,  10086.2 buff/cache
MiB Swap:   8000.0 total,   8000.0 free,      0.0 used.  11159.4 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  148 root      20   0 6735532   6.2g   4.5g R 396.7  40.3  10:50.27 llava-cli
    1 root      20   0    6040   2040   1604 S   0.0   0.0   0:00.03 bash

メモリも6GB程度の使用量でした。

最近のPCは8GBメモリくらいはあるのが多いので、ノートPCでも普通に動きます。

実行時間は200秒なので、3分ほどかかりましたが、それほど我慢できないという遅さでもなさそうです。

llama_print_timings:       total time =  209067.62 ms /   711 tokens

とにかく実行するのが無料で、手元にある普通のPCでも動くのはうれしい限りです。

24時間ずっと画像解析で使っても、電気代だけで済むので、株価のグラフをずっとウォッチさせて分析させるのも面白そうですね。

画像認識＋画像解説

LLaVAの実行方法がわかったところで、実際どれだけ役に立ちそうか試してみましょう。

画像１

実行

# ./llava-cli -m ./models/ggml-model-q5_k.gguf --mmproj ./models/mmproj-model-f16.gguf --image ./image.png  -p 'what is this picture?'

プロンプト

what is this picture?

回答結果


～　～

llama_kv_cache_init:        CPU KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    13.02 MiB
llama_new_context_with_model:        CPU compute buffer size =   160.00 MiB
llama_new_context_with_model: graph splits (measure): 1
encode_image_with_clip: image embedding created: 576 tokens

encode_image_with_clip: image encoded in  9738.50 ms by CLIP (   16.91 ms per image patch)

 The image depicts a computer screen displaying various maps and data. In the center, there is a large map of Japan with different colored areas showing the population density of each region. Alongside this main map, several other smaller maps are visible, including one that shows the population per hectare in another area of the country.

In addition to these maps, there's an image displaying data on the number of people living within a certain distance from a central point, possibly indicating the distribution of residents across the area. Overall, it is an informative and visually appealing display of statistical information related to Japan's population density.

llama_print_timings:        load time =   19249.41 ms
llama_print_timings:      sample time =      82.45 ms /   134 runs   (    0.62 ms per token,  1625.13 tokens per second)
llama_print_timings: prompt eval time =  168974.24 ms /   620 tokens (  272.54 ms per token,     3.67 tokens per second)
llama_print_timings:        eval time =   44309.30 ms /   134 runs   (  330.67 ms per token,     3.02 tokens per second)
llama_print_timings:       total time =  223680.84 ms /   754 tokens

回答抜粋

The image depicts a computer screen displaying various maps and data. In the center, there is a large map of Japan with different colored areas showing the population density of each region. Alongside this main map, several other smaller maps are visible, including one that shows the population per hectare in another area of the country. In addition to these maps, there's an image displaying data on the number of people living within a certain distance from a central point, possibly indicating the distribution of residents across the area. Overall, it is an informative and visually appealing display of statistical information related to Japan's population density.

翻訳

この画像は、さまざまな地図やデータを表示するコンピューター画面を示しています。中央には、各地域の人口密度を示す色分けされた大きな日本地図があります。このメインマップと並んで、国の別の地域のヘクタールあたりの人口を示すマップなど、他のいくつかの小さなマップが表示されます。これらの地図に加えて、中心点から一定距離内に住んでいる人の数のデータを表示する画像もあり、これはエリア全体の住民の分布を示している可能性があります。全体として、日本の人口密度に関する統計情報が有益かつ視覚的に魅力的に表示されています。

回答が出ました！

「人口密度」というキーワードは元画像にも入ってましたが、画像のまま読み込ませても理解しているようです。

「中央に人口密度を示す色分けされた日本地図」という点は正解ですね。

「中心点から一定距離内に住んでいる人の数」というのはよくわかりませんでした。何かと勘違いしてるようですね。

細かいところは間違えてますが、雰囲気は読み取れています。

画像２

次はグラフ画像を使ってみましょう。

日本の人口推移のグラフです。凡例は以下のとおりです。

実行

# ./llava-cli -m ./models/ggml-model-q5_k.gguf --mmproj ./models/mmproj-model-f16.gguf --image ./image.png -p 'What is this picture?'

プロンプト

What is this picture?

実行結果

llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    13.02 MiB
llama_new_context_with_model:        CPU compute buffer size =   160.00 MiB
llama_new_context_with_model: graph splits (measure): 1
encode_image_with_clip: image embedding created: 576 tokens

encode_image_with_clip: image encoded in  9725.78 ms by CLIP (   16.89 ms per image patch)

 The image is a graph showing the rise of three different variables over time. Two of the variables display a clear, steady upward trend, while the third variable displays an erratic trajectory with many fluctuations.

The graph is set in a white background, and each variable's values are represented by distinct colors. The vertical axis on the left side of the image displays the values for these variables, with a logarithmic scale visible as they move from the bottom to the top. Overall, the graph provides insights into how these variables have changed over time, highlighting any patterns or trends that might be relevant in understanding their behavior and usage.

llama_print_timings:        load time =   19241.20 ms
llama_print_timings:      sample time =      85.98 ms /   139 runs   (    0.62 ms per token,  1616.56 tokens per second)
llama_print_timings: prompt eval time =  168799.18 ms /   620 tokens (  272.26 ms per token,     3.67 tokens per second)
llama_print_timings:        eval time =   45857.08 ms /   139 runs   (  329.91 ms per token,     3.03 tokens per second)
llama_print_timings:       total time =  225049.27 ms /   759 tokens

回答抜粋

The image is a graph showing the rise of three different variables over time. Two of the variables display a clear, steady upward trend, while the third variable displays an erratic trajectory with many fluctuations. The graph is set in a white background, and each variable's values are represented by distinct colors. The vertical axis on the left side of the image displays the values for these variables, with a logarithmic scale visible as they move from the bottom to the top. Overall, the graph provides insights into how these variables have changed over time, highlighting any patterns or trends that might be relevant in understanding their behavior and usage.

翻訳

この画像は、時間の経過に伴う 3 つの異なる変数の上昇を示すグラフです。変数のうち 2 つは明確で安定した上昇傾向を示していますが、3 番目の変数は多くの変動がある不規則な軌道を示しています。グラフは白い背景で設定され、各変数の値は個別の色で表されます。画像の左側の縦軸にはこれらの変数の値が表示され、下から上に移動するにつれて対数スケールが表示されます。全体として、グラフはこれらの変数が時間の経過とともにどのように変化したかについての洞察を提供し、変数の動作や使用状況を理解する際に関連する可能性のあるパターンや傾向を強調表示します。

回答出ました！

３つの変数のグラフというのは正解です。

さすがに何も文字情報が無いので日本の人口推移のグラフまではわからなかったようです。

次は少しヒントを教えて、数値を読み取ってもらいましょう。

実行

# ./llava-cli -m ./models/ggml-model-q5_k.gguf --mmproj ./models/mmproj-model-f16.gguf --image ./image.png -p 'This image is a graph of Japan's population trends. What was the population in 2000?'

プロンプト

This image is a graph of Japan's population trends. What was the population in 2000?

プロンプト翻訳

この画像は日本の人口推移のグラフです。2000年時点の人口は何人ですか？

回答結果

llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    13.02 MiB
llama_new_context_with_model:        CPU compute buffer size =   160.00 MiB
llama_new_context_with_model: graph splits (measure): 1
encode_image_with_clip: image embedding created: 576 tokens

encode_image_with_clip: image encoded in  9742.26 ms by CLIP (   16.91 ms per image patch)

 The population in 2000 was about 127,583,465.

llama_print_timings:        load time =   19241.92 ms
llama_print_timings:      sample time =      12.31 ms /    24 runs   (    0.51 ms per token,  1950.11 tokens per second)
llama_print_timings: prompt eval time =  174479.82 ms /   637 tokens (  273.91 ms per token,     3.65 tokens per second)
llama_print_timings:        eval time =    7891.36 ms /    24 runs   (  328.81 ms per token,     3.04 tokens per second)
llama_print_timings:       total time =  192667.33 ms /   661 tokens

回答抜粋

The population in 2000 was about 127,583,465.

回答翻訳

2000 年の人口は約 1 億 2,758 万 3,465 人でした。

回答出ました！

濃い青い線が人口の総数なので、だいたい1億2000万を少し超えたところなので、合ってそうですね。

ただ、ちょっと正確すぎるような気がするので事前学習データを答えてるだけという疑念がありますが、違う方法を考えたほうがよいですね。

まとめ

オープンソースのマルチモーダルLLMのLLaVAを使ってみました。

GPT-4V、Claude3に比べると精度は少し残念ですが、これがオープンソースで、しかもGPU無しの普通のスペックのPCで動くこと自体が驚きです。

24時間動かし続けても電気代だけで済むので、ひたすら画像を読み込ませてテキスト化するのにちょうどよいかなと思っています。

LLaVA以外にもオープンソースのマルチモーダルLLMはありそうなので別のものも試してみようと思います。

この記事が気に入ったらサポートをしてみませんか？