Trying Out the 40-Billion-Parameter falcon-40b-instruct

Masayuki Abe

June 15, 2023, 07:54

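Last time, I tried running falcon-40b-instruct on Google Colab, but it used up all the available memory and failed.

This time, I am taking another shot at falcon-40b-instruct and its 40 billion parameters. The key point is 4-bit quantization, which reduces memory usage.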
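As a rough back-of-the-envelope check on why 4-bit quantization matters here, the sketch below estimates the memory needed for the weights alone at several precisions. It assumes exactly 40 billion parameters (the rounded figure used above) and ignores activations, quantization metadata, and any layers left unquantized, so real usage will be somewhat higher. The helper name `weight_memory_gib` is made up for this illustration.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 2**30


# Assumption: "40 billion parameters", rounded as in the article.
NUM_PARAMS = 40e9

for label, bits in [("float32", 32), ("float16/bfloat16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>17}: {weight_memory_gib(NUM_PARAMS, bits):6.1f} GiB")
```

Under these assumptions the weights take about 75 GiB at 16 bits but under 19 GiB at 4 bits, a fourfold reduction.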

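Below are the memory usage and other resource figures on Google Colab during the run.

The Google Colab environment I used this time is as follows.

I based my work on the code below.

Here is the code I used.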

!pip install git+https://github.com/huggingface/transformers
!pip install git+https://github.com/huggingface/accelerate

!pip install bitsandbytes

!pip install einops

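from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer
import torch

model_path="tiiuae/falcon-40b-instruct"

# Model configuration (not used further below; handy for inspecting the architecture)
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
# load_in_4bit=True quantizes the weights to 4 bits via bitsandbytes;
# device_map="auto" lets accelerate decide where to place the layers
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, load_in_4bit=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)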

input_text = "Tell a story about a magical fish in the UAE:"

# Tokenize the prompt and move it to the GPU
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

next_input = input_ids
max_length = 80  # Total length in tokens, prompt included. Too long a value can cause an out-of-memory (OOM) error!
current_length = input_ids.shape[1]

# Greedy decoding: append the most likely next token, one token at a time
while current_length < max_length:  # Stop once the length limit is reached
    with torch.no_grad():  # Inference only, so no gradients are needed
        output = model(next_input)  # Re-runs the whole sequence on every step
    next_token_logits = output.logits[:, -1, :]
    next_token_id = torch.argmax(next_token_logits, dim=-1).unsqueeze(0)
    print(tokenizer.decode(next_token_id[0].cpu().tolist(), skip_special_tokens=True), end='', flush=True)

    next_input = torch.cat([next_input, next_token_id.to("cuda")], dim=-1)

    current_length += 1

    # Stop once the model emits the end-of-sequence token
    if next_token_id[0].item() == tokenizer.eos_token_id:
        break

Once upon a time, there was a magical fish that lived in the waters of the UAE. This fish had the power to grant wishes to anyone who caught it. One day, a fisherman caught the fish and asked for a wish. The fish granted his wish and he became rich beyond his wildest dreams. The fisherman was so happy that

Execution result
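The generation loop above is plain greedy decoding: at each step it takes the highest-scoring next token, appends it, and stops at the end-of-sequence token or the length limit. The sketch below reproduces the same control flow in pure Python with a toy stand-in for the model, so it can be run without a GPU. `greedy_decode` and `toy_logits` are hypothetical names invented here, not part of transformers. As in the code above, `max_length` counts the prompt tokens too, which is why the story stops mid-sentence once the total reaches the limit.

```python
from typing import Callable, List


def greedy_decode(
    next_logits: Callable[[List[int]], List[float]],
    prompt_ids: List[int],
    max_length: int,
    eos_id: int,
) -> List[int]:
    """Greedy decoding: repeatedly append the highest-scoring next token."""
    ids = list(prompt_ids)
    while len(ids) < max_length:  # the limit includes the prompt tokens
        logits = next_logits(ids)  # scores for every token in the vocabulary
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        ids.append(next_id)
        if next_id == eos_id:  # stop at end-of-sequence
            break
    return ids


def toy_logits(ids: List[int]) -> List[float]:
    """Toy stand-in for a language model: always prefers (last token + 1) mod 5."""
    vocab_size = 5
    target = (ids[-1] + 1) % vocab_size
    return [1.0 if token == target else 0.0 for token in range(vocab_size)]


# Token 3 plays the role of the end-of-sequence token in this toy setup.
print(greedy_decode(toy_logits, prompt_ids=[0], max_length=10, eos_id=3))
# With a tighter limit, generation is cut off before EOS is reached.
print(greedy_decode(toy_logits, prompt_ids=[0], max_length=3, eos_id=3))
```

Because the loop feeds the entire sequence back into the model on every step, each new token costs more than the previous one; library routines such as `model.generate` in transformers normally avoid this by caching intermediate results.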

I tried asking in Japanese, but the reply came back in English. Let me ask another question.

input_text = "Can you tell me about Japan?"

Japan is a country located in East Asia. It is made up of four main islands and over 6,000 smaller ones. The capital city is Tokyo, which is also the largest city in Japan. The country has a rich history and culture, with a unique blend of traditional and modern elements. It is known for its beautiful natural scenery, including mountains,

Execution result

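The answers look largely correct. The model cannot handle Japanese, but it is good to see this level of accuracy in English. It was also a new discovery for me that quantization makes it possible to run a 40-billion-parameter model on Google Colab.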
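For a feel of what quantization trades away, here is a deliberately simplified sketch of 4-bit quantization: each value is mapped to one of 16 integer codes through a single scale factor and then mapped back. This is an illustrative absmax scheme written for this note, not the algorithm bitsandbytes actually uses (which works on blocks of weights with its own 4-bit data types). The function names and the sample weights are hypothetical.

```python
from typing import List, Tuple


def quantize_4bit(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to signed 4-bit integer codes in [-8, 7] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # largest magnitude maps to +/-7
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize_4bit(codes: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale for c in codes]


# Made-up sample weights, for illustration only.
weights = [0.42, -0.17, 0.03, -0.70, 0.26]
codes, scale = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print(f"max round-trip error: {max_error:.4f} (scale = {scale:.4f})")
```

In this toy scheme every restored value lies within half a scale step of the original, which hints at why a heavily quantized model can still give sensible answers.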
