Trying Llama-3 Swallow on WSL2

July 2, 2024 01:03

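Two 8B models of Llama-3 Swallow have been released. Llama-3 Swallow is described as "a model with enhanced Japanese performance, built by continued pre-training from Llama-3-8B and Llama-3-70B, the high-performance English models that Meta released on 2024/04/18."

I'll try both of them.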

The PC I'm using is Dospara's "GALLERIA UL9C-R49". Its specs are:
・CPU: Intel® Core™ i9-13900HX Processor
・Mem: 64 GB
・GPU: NVIDIA® GeForce RTX™ 4090 Laptop GPU(16GB)
・GPU: NVIDIA® GeForce RTX™ 4090 (24GB)
・OS: Ubuntu22.04 on WSL2 (Windows 11)

1. Preparation

python3 -m venv swallow
cd $_
source bin/activate

Install the packages.

pip install torch transformers accelerate
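Before moving on, it can be worth confirming that the three packages actually landed in the virtual environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for illustration and is not part of the original setup.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} without importing the packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    wanted = ["torch", "transformers", "accelerate"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Run it with the venv activated; any line reporting NOT INSTALLED means the `pip install` step needs another look.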

2. The code to load

Save the following as /path/to/query.py.

import sys
import argparse
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
from typing import List, Dict
import time

# argv
parser = argparse.ArgumentParser()
parser.add_argument("--model-path", type=str, default=None)
parser.add_argument("--tokenizer-path", type=str, default=None)
parser.add_argument("--no-chat", action='store_true')
parser.add_argument("--no-use-system-prompt", action='store_true')
parser.add_argument("--max-tokens", type=int, default=256)

args = parser.parse_args(sys.argv[1:])

model_id = args.model_path
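if model_id is None:
    # A bare `exit` does nothing; report the missing argument and stop.
    parser.error("--model-path is required")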

is_chat = not args.no_chat
use_system_prompt = not args.no_use_system_prompt
max_new_tokens = args.max_tokens

tokenizer_id = model_id
if args.tokenizer_path:
    tokenizer_id = args.tokenizer_path

# Prepare the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(
    tokenizer_id,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    #torch_dtype="auto",
    torch_dtype=torch.bfloat16,
    #torch_dtype=torch.float16,
    device_map="auto",
    #device_map="cuda",
    low_cpu_mem_usage=True,
    trust_remote_code=True
)
#if torch.cuda.is_available():
#    model = model.to("cuda")

streamer = TextStreamer(
    tokenizer,
    skip_prompt=True,
    skip_special_tokens=True
)

DEFAULT_SYSTEM_PROMPT = "あなたは誠実で優秀な日本人のアシスタントです。"


def q(
    user_query: str,
    history: List[Dict[str, str]]=None
) -> List[Dict[str, str]]:
    # generation params
    generation_params = {
        "do_sample": True,
        "temperature": 0.8,
        "top_p": 0.95,
        "top_k": 40,
        "max_new_tokens": max_new_tokens,
        "repetition_penalty": 1.1,
    }
    #
    start = time.process_time()
    # messages
    messages = ""
    if is_chat:
        messages = []
        if use_system_prompt:
            messages = [
                {"role": "system", "content": DEFAULT_SYSTEM_PROMPT},
            ]
        user_messages = [
            {"role": "user", "content": user_query}
        ]
    else:
        user_messages = user_query
    if history:
        user_messages = history + user_messages
    messages += user_messages
    # generation prompts
    if is_chat:
        prompt = tokenizer.apply_chat_template(
            conversation=messages,
            add_generation_prompt=True,
            tokenize=False
        )
    else:
        prompt = messages
    input_ids = tokenizer.encode(
        prompt,
        add_special_tokens=True,
        return_tensors="pt"
    )
    print("--- prompt")
    print(prompt)
    print("--- output")
    # Inference
    output_ids = model.generate(
        input_ids.to(model.device),
        streamer=streamer,
        **generation_params
    )
    output = tokenizer.decode(
        output_ids[0][input_ids.size(1) :],
        skip_special_tokens=True
    )
    if is_chat:
        user_messages.append(
            {"role": "assistant", "content": output}
        )
    else:
        user_messages += output
    end = time.process_time()
    ##
    input_tokens = len(input_ids[0])
    output_tokens = len(output_ids[0][input_ids.size(1) :])
    total_time = end - start
    tps = output_tokens / total_time
    print(f"prompt tokens = {input_tokens:.7g}")
    print(f"output tokens = {output_tokens:.7g} ({tps:f} [tps])")
    print(f"   total time = {total_time:f} [s]")
    return user_messages
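One thing to keep in mind when reading the tps figures: the script times generation with `time.process_time()`, which counts CPU time consumed by the process rather than elapsed wall-clock time. The two can differ in either direction, so the reported tokens per second may not match what a stopwatch would show. The sketch below illustrates the difference; `measure` is a hypothetical helper, and `time.sleep` merely stands in for time spent waiting on something other than the CPU.

```python
import time


def measure(fn):
    """Run fn() and return (wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


# Sleeping takes wall-clock time but almost no CPU time.
wall, cpu = measure(lambda: time.sleep(0.2))
print(f"wall = {wall:.3f} s, cpu = {cpu:.3f} s")
```

If wall-clock throughput is what you want, swapping `time.process_time()` for `time.perf_counter()` in `q()` would be the corresponding change.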

3. Trying it out

(1) Llama-3-Swallow-8B-v0.1

Run it.

CUDA_VISIBLE_DEVICES=0 python -i ~/scripts/query.py --model-path tokyotech-llm/Llama-3-Swallow-8B-v0.1 --no-chat

Let's ask it a question from the Python prompt.

>>> history = q("ドラえもんとはなにか")
--- prompt
ドラえもんとはなにか
--- output
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
？という問いに、わからないままに答えたらこうなる。
映画『STAND BY ME ドラえもん』の冒頭で、藤子・F・不二雄がインタビューを受けている場面がある。そこで彼は自作について、「子供向けのようでいて、大人の視点から描いた話なんですよ」と発言しているのだ。この言葉には様々な意味が込められているのだろうけれど、それを考えることは今回のテーマと関係ないので、ここではそれほど重要ではないだろう。
「子供向けのようでいて、大人の視点から描かれている」作品。それこそが、『ドラえもん』だと私は思う。
この作品は、そもそも大人が読むものだった（現在でもそういう側面はあり続けてる）。漫画の中身はどうあれ、その事実だけを見れば明らかなことだと思う。原作者である藤本弘が子供向け雑誌に連載し始めたのは1969年のことで、いまから半世紀以上前のことである。当時の日本は高度成長期の真っ只中であり、今よりもずっと貧しかったはず。その頃の子供
prompt tokens = 8
output tokens = 256 (25.025862 [tps])
   total time = 10.229418 [s]

tokyotech-llm/Llama-3-Swallow-8B-v0.1

So he was still alive in 2014...? (STAND BY ME Doraemon is a 2014 film; Fujiko F. Fujio died in 1996.)
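The attention-mask warning in the log appears because the script passes only the `input_ids` produced by `tokenizer.encode`. Calling the tokenizer directly returns an `attention_mask` as well, and both can be forwarded to `generate`. The following is a sketch under that assumption, not a tested replacement for `q()`; the function name `generate_with_mask` is hypothetical, and `model` and `tokenizer` are assumed to be loaded as in query.py.

```python
def generate_with_mask(model, tokenizer, prompt, **generation_params):
    """Generate from `prompt`, passing the attention mask explicitly."""
    # Calling the tokenizer (rather than tokenizer.encode) returns both
    # input_ids and attention_mask.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        # Assumption: reuse EOS as the padding id, as the log message does.
        pad_token_id=tokenizer.eos_token_id,
        **generation_params,
    )
    # Keep only the newly generated tokens.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

With a single unpadded prompt the generated text should not depend on the mask, so this mainly silences the warning; it matters more once prompts are batched and padded.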

(2) Llama-3-Swallow-8B-Instruct-v0.1

Run it.

CUDA_VISIBLE_DEVICES=0 python -i ~/scripts/query.py --model-path tokyotech-llm/Llama-3-Swallow-8B-Instruct-v0.1

Let's ask it a question from the Python prompt.

>>> history = q("ドラえもんとはなにか")
--- prompt
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

あなたは誠実で優秀な日本人のアシスタントです。<|eot_id|><|start_header_id|>user<|end_header_id|>

ドラえもんとはなにか<|eot_id|><|start_header_id|>assistant<|end_header_id|>


--- output
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Doraemon is a fictional character and the titular main protagonist of the Japanese manga and anime series of the same name. Created by Fujiko F. Fujio, Doraemon first appeared in the manga on September 3, 1969. The series has been very popular around the world, especially in Japan where it has been broadcast for over half a century.

Doraemon is a robotic cat from the future who comes to the past with his young friend Nobita Nobi to help him out of various predicaments. He is known for his four-dimensional pocket, which contains various gadgets and tools that he uses to solve problems and help those around him. Despite being a helpful and kind-hearted character, Doraemon often causes trouble due to his mischievous nature and lack of understanding about the modern world.

The Doraemon franchise includes many spin-off series, films, and media, making it one of the most successful and beloved franchises in Japan's history. It has also inspired numerous merchandise and adaptations worldwide, making Doraemon a cultural icon and an enduring symbol of childhood wonder and imagination.assistantSure! Here's a summary of Doraemon in English:

Doraemon is a fictional character and the main protagonist of the popular Japanese manga and
prompt tokens = 40
output tokens = 256 (23.946474 [tps])
   total time = 10.690509 [s]

tokyotech-llm/Llama-3-Swallow-8B-Instruct-v0.1

Hm? The answer came out in English. Oh dear.
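Apart from the language, the output above also runs past the end of the first answer ("...imagination.assistantSure!"), which suggests generation did not stop at `<|eot_id|>`, the token that closes each turn in the prompt shown earlier. The log reports `eos_token_id` 128001 only. A common remedy for Llama 3 chat models is to pass both the tokenizer's EOS id and the id of `<|eot_id|>` to `generate` as `eos_token_id`. The sketch below builds that list; the helper name `llama3_terminators` is hypothetical, and I have not verified that this affects the language of the answer.

```python
def llama3_terminators(tokenizer):
    """Return token ids at which generation should stop for a Llama 3 chat turn."""
    terminators = [tokenizer.eos_token_id]
    eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")
    # Skip the extra id if the tokenizer does not know the token
    # or if it is the same as EOS.
    if eot_id is not None and eot_id not in terminators:
        terminators.append(eot_id)
    return terminators


# Usage inside q() (assumes `model`, `input_ids`, `streamer` from query.py):
# output_ids = model.generate(
#     input_ids.to(model.device),
#     streamer=streamer,
#     eos_token_id=llama3_terminators(tokenizer),
#     **generation_params,
# )
```

Separately, the templated prompt already begins with `<|begin_of_text|>`, so encoding it with `add_special_tokens=True` may prepend a second one. Passing `add_special_tokens=False` in chat mode would avoid that.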

4. Summary

Since this is an 8B model, VRAM consumption is around 16.0 GB, so you will want an RTX 4090.
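The 16 GB figure is roughly what the weights alone occupy in bfloat16: number of parameters times 2 bytes. Here is a back-of-the-envelope sketch; the parameter count of about 8.03 billion is my assumption for the 8B model, not a number from this article, and the estimate ignores the KV cache, activations, and CUDA overhead.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Approximate memory for model weights in GB (1 GB = 10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: roughly 8.03e9 parameters; bfloat16 stores each in 2 bytes.
NUM_PARAMS = 8.03e9
print(f"bfloat16: {weight_memory_gb(NUM_PARAMS):.1f} GB")
print(f"float32:  {weight_memory_gb(NUM_PARAMS, 4):.1f} GB")
```

By the same arithmetic, loading in float32 would roughly double the requirement, which is one reason the script pins `torch_dtype=torch.bfloat16`.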

As for the English in the inference output, the release announcement says that "the instruct model has room for improvement," so that is probably the cause.

We have released Llama-3-Swallow!!
The instruct model has room for improvement, but the base model delivers top-level Japanese performance at the 70B size. 🎉

Blog: https://t.co/I31qbqHLa8

Official page: https://t.co/1JD7K4BeWo

Huggingface: https://t.co/BMiEWOEuQR pic.twitter.com/G1b20odEzy
— Kazuki Fujii (@okoge_kaz) July 1, 2024
