How to run weblab-10b, the 10-billion-parameter LLM from the Matsuo Lab at the University of Tokyo

August 20, 2023, 07:15

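The Matsuo Lab at the University of Tokyo has released weblab-10b, a 10-billion-parameter, Japanese-centric model, under a non-commercial license. There seems to be some debate on Twitter over whether this fits the definition of open source, but in this post I want to check what level of LLM it actually is.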

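Using the page referenced above as a guide, with a few small modifications, I ran the model on Google Colab. I used a Colab A100, but the code below ran out of memory and failed with an error. It should probably work for anyone who has a GPU with more than 40 GB of memory.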

!pip install transformers sentencepiece accelerate einops
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("matsuo-lab/weblab-10b")
model = AutoModelForCausalLM.from_pretrained("matsuo-lab/weblab-10b")

if torch.cuda.is_available():
    model = model.to("cuda")

# Prompt: "I am a cat."
text = "吾輩は猫である。"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_new_tokens=100,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

output = tokenizer.decode(output_ids.tolist()[0])
print(output)
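
One plausible reason for the out-of-memory error, offered as a guess rather than something measured in this post: when no `torch_dtype` is passed, transformers has typically loaded weights as 32-bit floats, and about 10 billion parameters at 4 bytes each already comes to roughly 40 GB before any activations are counted. The sketch below only does that arithmetic. The parameter count is rounded, and `weight_gb` is a hypothetical helper, not part of any library.

```python
# Rough weight-memory estimate.
# Assumptions (not from the original post): about 10 billion parameters,
# and float32 weights when no dtype is specified at load time.
N_PARAMS = 10_000_000_000  # approximate; the exact count may differ

BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "4-bit": 0.5,
}


def weight_gb(n_params: int, bytes_per_param: float) -> float:
    """Return the size of the weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name:>8}: about {weight_gb(N_PARAMS, nbytes):.0f} GB for weights alone")
```

This prints about 40 GB for float32, 20 GB for float16, and 5 GB for 4-bit. If the guess is right, passing `torch_dtype=torch.float16` to `from_pretrained` might be enough to fit the model on a 40 GB A100 without quantization, but that is untested here.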

So, still on the Colab A100, I modified the code a little further to load the model with 4-bit quantization. The code is as follows.

!pip install transformers sentencepiece accelerate einops bitsandbytes
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("matsuo-lab/weblab-10b")
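model = AutoModelForCausalLM.from_pretrained(
    "matsuo-lab/weblab-10b",
    trust_remote_code=True,
    load_in_4bit=True,   # quantize the weights to 4 bits (requires bitsandbytes)
    device_map="auto",   # let accelerate place the weights on the available GPU
)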

# No explicit model.to("cuda") here: device_map="auto" already places the model.
#if torch.cuda.is_available():
#    model = model.to("cuda")

# Prompt: "I am a cat."
text = "吾輩は猫である。"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")


# Set a default value when pad_token_id is None
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = 0

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_new_tokens=100,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

output = tokenizer.decode(output_ids.tolist()[0])
print(output)

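Quantizing to 4 bits has probably cost some accuracy, but here is the result. The model continues the prompt with variations of a single sentence, roughly "I thought I'd like to live like the cat in 'I Am a Cat.'"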

吾輩は猫である。」の猫のような生き方をしたいなと思いました。
小説は、「吾輩は猫である。」の猫のような生き方をしたいなと思いました。
わたしは、小説を書くのは、「吾輩は猫である。」の猫のような生き方をしたいな

Now let's see what happens with other prompts. Below are pairs of a prompt and its output.

# Prompt: "The prime numbers from 1 to 10 are,"
text = "1から10までの素数は、"

1から10までの素数は、すべて2で割り切れる。
2で割り切れるのは、1、3、5、7、9、11、13、17、19、23、25、27、29、31、35、37、39、41、43、47、49、53、59、61、67、71、73、79、83、89、97、101、103、107、109、113、127、

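The continuation claims that these numbers are all divisible by 2 and then produces a long list. I asked for prime numbers, but the list looks more like odd numbers than primes.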
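
For reference, here is a small check of what the correct answer would be and which entries in the model's list are not prime. It is only a sketch: `is_prime` is a simple trial-division helper written for this post, and `model_list` was copied by hand from the output above.

```python
def is_prime(n: int) -> bool:
    """Trial division; fine for small numbers like these."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


# The correct answer to the prompt.
primes_1_to_10 = [n for n in range(1, 11) if is_prime(n)]
print(primes_1_to_10)  # [2, 3, 5, 7]

# Numbers listed in the model output above, copied by hand.
model_list = [1, 3, 5, 7, 9, 11, 13, 17, 19, 23, 25, 27, 29, 31, 35, 37, 39,
              41, 43, 47, 49, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101,
              103, 107, 109, 113, 127]

non_primes = [n for n in model_list if not is_prime(n)]
all_odd = all(n % 2 == 1 for n in model_list)
print(non_primes)  # [1, 9, 25, 27, 35, 39, 49]
print(all_odd)     # True
```

So every number in the list is odd, none is divisible by 2, the only even prime (2) is missing, and seven of the entries are not prime at all.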

text = "1+3="

1+3=4$

A:

You can do it with the Euclidean Algorithm.

You need to find a divisor of the given number which is not a perfect square.
The Euclidean algorithm will give you the dividend, the remainder, and the quotient.

The remainder is the dividend - the divisor, and the quotient is the dividend/divisor.

The divisor must be a perfect square.

If it is not a perfect square, you can use the Euclidean algorithm

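I wanted to gauge its arithmetic ability. It does produce 4, but then drifts into an unrelated Q&A-style passage about the Euclidean algorithm, so this is not quite what I was hoping for. Is the 4-bit quantization to blame after all?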

text = "I am a cat."

I am a cat. I am a cat. I am a cat. I am a cat. I am a cat. I am a cat. I am a cat. I am a cat. I am a cat. I am a cat. I am a cat. I am a cat. I am a cat. I am a cat. I am a cat. I am a cat. I am a cat. I am a cat. I am a cat. I am a cat. I am a cat.

Hmm, its English seems rather weak: it just repeats the prompt over and over.
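
The same kind of loop also shows up in the Japanese outputs above, so it may not be specific to English. As a hedged sketch that was not tried in this post, `generate` in transformers accepts `repetition_penalty` and `no_repeat_ngram_size`, which are commonly used to discourage this behavior. The values 1.1 and 3 below are illustrative assumptions, not settings tuned for weblab-10b.

```python
# Sampling settings from the post, plus two anti-repetition options.
# The values 1.1 and 3 are illustrative assumptions, not tuned for weblab-10b.
gen_kwargs = {
    "max_new_tokens": 100,
    "do_sample": True,
    "temperature": 0.6,
    "top_p": 0.9,
    "repetition_penalty": 1.1,   # values above 1.0 penalize already-generated tokens
    "no_repeat_ngram_size": 3,   # forbid repeating any 3-token sequence
}

# With `model` and `token_ids` prepared as in the code earlier in this post,
# the call would look like this (left commented out because it needs the model):
# output_ids = model.generate(token_ids.to(model.device), **gen_kwargs)
print(gen_kwargs)
```

Whether this actually improves the output of the 4-bit model would have to be checked by running it.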

# Prompt: "I have 1000 yen. I bought a beer for 300 yen. How much do I have left?"
text = "私は、1000円持っています。私は300円でビールを買いました。いくら残っていますか？"

私は、1000円持っています。私は300円でビールを買いました。いくら残っていますか？」
「1000円持っています。私は、1000円持っています。私は300円でビールを買いました。いくら残っていますか？」
「1000円持っています。私は、1000円持っています。私は300円でビールを買いました。いくら残っていますか？」
「1000円持っています。私は、1000

It probably could not make sense of the question: it repeats the prompt instead of answering (the expected answer is 700 yen).

# Prompt: "The current prime minister of Japan is,"
text = "現在の日本の首相は、"

現在の日本の首相は、「あの方」と呼ばれているので、その方が、何らかの行動を起こすことを、期待して待ちましょう。
「あの方」の行動の後には、必ず、何らかの「ご褒美」があるので、その「ご褒美」のために、「あの方」は、「あの方」の

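I'm curious who "that person" (「あの方」) is supposed to be. I asked about the current prime minister of Japan because I wanted to find out how recent the model's training data is.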

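Overall impressions: Google Colab limits me to a 40 GB GPU, and running the model as-is exceeded that limit. To get it running at all I used the 4-bit quantized version of the code, but that left me unsure how well the model really performs. Running it on a GPU with more than 40 GB of memory is probably the better approach.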

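Also, for an LLM to establish a presence, it needs to be used by many people, which is a matter of marketing as well as technology. For that reason, I think those who release LLMs will increasingly be expected to make sure their models can run within the 40 GB GPU limit of Google Colab, which is currently the most accessible environment.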
