Llama 3.1 の新機能と使い方

2024年7月24日 09:05

以下の記事が面白かったので、簡単にまとめました。

・Llama 3.1 - 405B, 70B & 8B with multilinguality and long context

1. Llama 3.1 の新機能

「Llama 3.1」の新機能は、次のとおりです。

・128Kトークンの大きなコンテキスト長 (元は8K)
・多言語
・ツールの使用
・4,050億パラメータの非常に大きな高密度モデル
・より寛容なライセンス

8B、70B、405Bの3つのサイズがあり、それぞれにベースモデルと指示モデルがあります。128Kトークンのコンテキスト長と、英語、ドイツ語、フランス語、イタリア語、ポルトガル語、ヒンディー語、スペイン語、タイ語を含む8つの言語をサポートしています。「Llama 3.1」は、より長いコンテキストに役立つ効率的な表現である「Grouped-Query Attention」(GQA) を引き続き使用しています。

提供されているモデルは、次の6つです。

・Meta-Llama-3.1-8B
・Meta-Llama-3.1-8B-Instruct
・Meta-Llama-3.1-70B
・Meta-Llama-3.1-70B-Instruct
・Meta-Llama-3.1-405B
・Meta-Llama-3.1-405B-Instruct

「Llama Guard 3」と「Prompt Gard」も提供されています。

・Llama Guard 3 : LLM入力 (プロンプト) と応答を分類して、リスク分類で安全ではないと見なされるコンテンツを検出できる。
・Prompt Gard : プロンプト・インジェクションとジェイルブレイクを検出できる、279Mパラメータの小さなBERTベースの分類器。

「Llama 3.1」の指示モデルがエージェントユースケースの「Tool Calling」でファインチューニングされています。カスタムJSON関数で拡張できる2つの組み込みツール (検索、Wolfram Alpha による数学的推論) もあります。

「Llama 3.1」は、カスタム構築されたGPU クラスタで15兆を超えるトークンで学習され、合計3,930万 GPU 時間 (80億で146万、700億で700万、4050億で3,084万) が使用されました。学習データセットの正確な詳細は不明ですが、多言語化のためにより多様なキュレーションが行われていると推測することしかできません。「Llama 3.1 Instruct」は指示に従うように最適化されており、公開されている指示データセット、および「SFT」と「RLHF」による2,500万を超える合成生成例で学習されました。 Metaは、データミックスの作成中に高品質のプロンプトと応答をフィルタリングしてキュレートするためのLLMベースの分類子を開発しました。

ライセンス条件に関しては、「Llama 3.1」には非常によく似たライセンスが付属していますが、重要な違いが1つあります。それは、他のLLMの改善にモデル出力を使用できることです。つまり、異なるモデルであっても、合成データの生成と抽出が許可されます。これは、後述するように、405Bモデルにとって特に重要です。ライセンスでは、再配布、微調整、派生作品の作成が許可されていますが、派生モデルの名前の先頭に「Llama」を含める必要があり、派生作品やサービスには「Built with Llama」と記載する必要があります。詳しくは、公式ライセンスを参照してください。

2. Llama 3.1 のメモリ要件

学習と推論の両方に必要なメモリを3つのモデルサイズに分けて説明します。

2-1. 推論のメモリ要件

推論の場合、メモリ要件はモデルサイズと重みの精度によって異なります。

H100 ノード (8x H100) には約 640 GB の VRAM があるため、405B モデルはマルチノードセットアップで実行するか、より低い精度 (FP8など) で実行する必要があります。後者のアプローチが推奨されます。

精度を低くすると (INT4など)、精度が多少低下する可能性がありますが、メモリ要件が大幅に削減され、推論速度が向上します。モデルの重みに加えて、KVキャッシュもメモリに保持する必要があります。KVキャッシュには、モデルのコンテキスト内のすべてのトークンのキーと値が含まれているため、新しいトークンを生成するときに再計算する必要はありません。特に、使用可能な長いコンテキスト長を使用する場合は、これが重要な要素になります。FP16では、KVキャッシュのメモリ要件は次のとおりです。

特に、小さなモデルの場合、コンテキスト長の最大値に近づくと、キャッシュは重みと同じ量のメモリを使用します。

2-2. 学習のメモリ要件

次の表は、さまざまな手法を使用して「Llama 3.1」を学習する場合のおおよそのメモリ要件を示しています。

3. Llama 3.1 の評価

以下は、Metaの公式評価からの抜粋です。

4. HuggingFace Transformers の使用

「Llama 3.1」では、RoPEスケーリングを効果的に処理するために、モデリングのマイナーアップデートが必要です。Transformers v4.43 では、新しい「Llama 3.1」を使用し、HuggingFace エコシステム内のすべてのツールを活用できます。

pip install "transformers>=4.43" --upgrade

詳細をいくつか紹介します。

・Transformers はデフォルトでモデルをbfloat16でロードします。これはMeta が公開した元のチェックポイントで使用されている型なので、最高の精度を確保したり評価を行ったりするには、この実行方法が推奨されます。
・アシスタントの応答は特別なトークン <|eot_id|> で終わる場合があります。ただし、通常のEOSトークンが見つかった場合は生成を停止する必要があります。
・eos_token_id パラメータにターミネータのリストを提供することで、生成を早期に停止できます。
・Metaコードベースから取得したデフォルトのサンプリングパラメータ (temperature と top_p) を使用しました。まだ広範囲にわたるテストを実施する時間がないので、自由に探索してください。

次のコードは、「meta-llama/Meta-Llama-3.1-8B-Instruct」の使用方法を示しています。このコードでは、多くの家庭用GPU に適合する約16GBのVRAM が必要です。同じコードが、140GBのVRAMと810 GBのVRAMを必要とする「meta-llama/Meta-Llama-3.1-70B-Instruct」でも機能します。8bitまたは4bitロードすると、メモリ消費量をさらに削減できます。

from transformers import pipeline
import torch

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
pipe = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

messages = [
    {"role": "user", "content": "Who are you? Please, answer in pirate-speak."},
]
outputs = pipe(
    messages,
    max_new_tokens=256,
    do_sample=False,
)
assistant_response = outputs[0]["generated_text"][-1]["content"]
print(assistant_response)
# Arrrr, me hearty! Yer lookin' fer a bit o' information about meself, eh? Alright then, matey! I be a language-generatin' swashbuckler, a digital buccaneer with a penchant fer spinnin' words into gold doubloons o' knowledge! Me name be... (dramatic pause)...Assistant! Aye, that be me name, and I be here to help ye navigate the seven seas o' questions and find the hidden treasure o' answers! So hoist the sails and set course fer adventure, me hearty! What be yer first question?

また、モデルを自動的に量子化し、bitsandbytesを使用して8bitまたは4bitでロードすることもできます。70byteの大きなバージョンの4bitロードには、実行に約34GB のメモリが必要です。生成パイプラインを4bitでロードする方法は次のとおりです。

pipeline = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={
        "torch_dtype": torch.bfloat16,
        "quantization_config": {"load_in_4bit": True}
    },
)

Transformerでモデルを使用する方法の詳細については、モデルカードを参照してください。

5. Llama 3.1 のプロンプト

ベースモデルにはプロンプト形式がありません。他のベースモデルと同様に、入力シーケンスを妥当な継続で継続したり、Zero-Shot/Few-Shotの推論に使用したりできます。また、独自のユースケースをファインチューニングするための優れた基盤にもなります。

Instructは、4 つのロールを持つ会話形式をサポートします:

・system : 会話のコンテキストを設定。これにより、効果的に応答するのに役立つルール、ガイドライン、必要な情報を含めることができる。ツールの使用を有効にするためにも使用される。
・user : モデルに対するユーザー入力、コマンド、および質問。
・assistant : 「system」「user」プロンプトで提供されるコンテキストに基づくアシスタントの応答。
・ipython : 「Llama 3.1」で導入された新しいロール。LLM に返されるときに「Tool Calling」の出力として使用される。

Instructでは、簡単な会話に次の会話構造を使用します。

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ user_msg_1 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{ model_answer_1 }}<|eot_id|>

「Llama 3.1 Instruct」は、「Built-in Tool」 (brave_search、wolfram_alpha、code_interpreter) と「Custom Tool Calling」を含む「Tool Calling」をサポートするようになりました。「Built-in Tool」はPython構文を使用します。「Function Calling」用のPythonコードを出力する機能は、「Code Interpreter Tool」の一部であり、以下に示すように、システムプロンプトで Environmentキーワードを使用して有効にする必要があります。

5-1. Built-in Tool calling

<|begin_of_text|><|start_header_id|>system<|end_header_id|>


Environment: ipython
Tools: brave_search, wolfram_alpha

Cutting Knowledge Date: 01 March 2023
Today's Date: 13 July 2024


You are a helpful Assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Weather in Menlo Park, California<|eot_id|><|start_header_id|>assistant<|end_header_id|>

この時点でのモデルからの応答には、サポートされているツールの 1 つ (この場合は brave_search) を呼び出すPythonコードが含まれます。

<|python_tag|>brave_search.call(query="current weather in Menlo Park, California")<|eom_id|>

呼び出しの実行からの応答は、最終的な応答を取得するためにモデルに送り返されます。簡潔にするために、前のスニペットに示されているメッセージに次の内容が追加されます。

<|python_tag|>brave_search.call(query="Menlo Park California weather")<|eom_id|><|start_header_id|>ipython<|end_header_id|>

{"query": "Menlo Park California weather", "top_k": [{"title": "10-Day Weather Forecast for West Menlo Park, CA - The Weather Channel | weather.com", "url": "https://weather.com/weather/tenday/l/West+Menlo+Park+CA?canonicalCityId=b2375713aa1943aad7d1a13a85e1c0adad13c1b10563b2bbaad70734dc61cf11", "description": "Be prepared with the most accurate 10-day forecast for West <strong>Menlo</strong> <strong>Park</strong>, CA with highs, lows, chance of precipitation from The <strong>Weather</strong> Channel and <strong>Weather</strong>.com", "type": "search_result"},....}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

LLMからの最終的な回答は次のようになります。

The current weather in Menlo Park, California is mostly sunny with a high of 77°F and a low of 56°F.<|eot_id|>

5-2. Custom Tool calling

「Llama 3.1 Instruct」は、単一のユーザーメッセージからの「Custom Function Calling」をサポートします。次のプロンプトは、モデルの出力からカスタム関数を呼び出す方法の例を示しています。「Custom Function Calling」では、モデルは <|eom_id|> ではなく <|eot_id|> を出力します。関数呼び出し出力の処理方法をモデルに通知するには、システムプロンプトを調整する必要があります。

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant with tool calling capabilities. When you receive a tool call response, use the output to format an answer to the orginal user question.<|eot_id|><|start_header_id|>user<|end_header_id|>

Given the following functions, please respond with a JSON for a function call with its proper arguments that best answers the given prompt.

Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}. Do not use variables.

{
    "type": "function",
    "function": {
    "name": "get_current_conditions",
    "description": "Get the current weather conditions for a specific location",
    "parameters": {
        "type": "object",
        "properties": {
        "location": {
            "type": "string",
            "description": "The city and state, e.g., San Francisco, CA"
        },
        "unit": {
            "type": "string",
            "enum": ["Celsius", "Fahrenheit"],
            "description": "The temperature unit to use. Infer this from the user's location."
        }
        },
        "required": ["location", "unit"]
    }
    }
}

Question: what is the weather like in Menlo Park?<|eot_id|><|start_header_id|>assitant<|end_header_id|>

{"name": "get_current_conditions", "parameters": {"location": "Menlo Park, CA", "unit": "Fahrenheit"}}<|eot_id|><|start_header_id|>ipython<|end_header_id|>

選択したツールから出力を取得するときは、同じ <|python_tag|> 区切り文字を使用してモデルに渡します。<|python_tag|> は Python の使用を意味するものではありません。任意のツールからの出力の開始を通知することのみを目的としています。

<|python_tag|>{
    "tool_call_id": "get_current_conditions"
    "output": "Clouds giving way to sun Hi: 76° Tonight: Mainly clear early, then areas of low clouds forming Lo: 56°"
}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The weather in Menlo Park is currently cloudy with a high of 76° and a low of 56°, with clear skies expected tonight.<|eot_id|>

効果的に使用するには、この形式を正確に再現する必要があります。Transformersで使用できるチャットテンプレートを使用すると、プロンプトを正しくフォーマットすることが簡単になります。

6. Llama 3.1 のデモ

次のデモで3つのInstructモデルを試すことができます。

・Hugging Chat with Llama 3.1 405B https://huggingface.co/chat/models/meta-llama/Meta-Llama-3.1-405b-instruct/
・Hugging Chat with Llama 3.1 70B https://huggingface.co/chat/models/meta-llama/Meta-Llama-3.1-70b-instruct/
・Gradio-powered Space with Llama 3.1 8B demo https://huggingface.co/spaces/ysharma/Chat_with_Meta_llama3_1_8b

7. Llama 3.1 の量子化

Metaは、精度の低下を最小限に抑えた「Llama 3.1 405B」の公式FP8量子化バージョンを作成しました。これを実現するために、FP8量子化は、FFNのゲートや上下投影 (推論FLOPの75%をカバー) など、モデルの主要な線形演算子にのみ適用されました。私たちは協力して、このFP8量子化チェックポイントがコミュニティ全体で互換性があることを確認しました (Transformers、TGI、VLLM)。

さらに、AutoAWQとAutoGPTQを使用して、INT4でAWQとGPTQの量子化バリアントを作成しました。AWQの場合、すべての線形レイヤーは、グループサイズが128の4bitまでのゼロポイント量子化を実行するGEMMカーネルを使用して量子化されました。GPTQの場合は、同じ設定でGPTQカーネルのみを使用しました。INT4チェックポイントがTransformersおよび TGI と互換性があることを確認しました。これには、GPTQの TGI での推論を高速化するためのMarlinカーネルサポートが含まれます。

「Llama 3.1 405B」で利用可能な量子化された重みは、次のとおりです。

・meta-llama/Meta-Llama-3.1-405B-Base-FP8
Official FP8 quantized weights, can be run on 8xH100
・meta-llama/Meta-Llama-3.1-405B-Instruct-FP8
Official FP8 quantized weights, can be run on 8xH100
・hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4
HuggingFace quantized weights, can run on 8xA100 80GB, 8xH100 80GB & 8xA100 40GB (with a reduced KV-cache and without CUDA graphs)
・hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4
HuggingFace quantized weights, can run on 8xA100 80GB, 8xH100 80GB & 8xA100 40GB (with a reduced KV-cache and without CUDA graphs)
・hugging-quants/Meta-Llama-3.1-405B-BNB-NF4
HuggingFace quantized weights, suitable for QLoRA finetuning
・hugging-quants/Meta-Llama-3.1-405B-Instruct-BNB-NF4
HuggingFace quantized weights, suitable for inference on 8xA100 & 4xH100

8. 推論インテグレーション

8-1. HuggingFace Inference API

HuggingFace PRO ユーザーは、text-generation-inference を搭載した Llama 3.1 8B Instruct、Llama 3.1 70B Instruct、および Llama 3.1 405B Instruct AWQ をホストする専用のAPIエンドポイントにアクセスできるようになりました。すべてのバージョンは Messages API をサポートしているため、LangChain や LlamaIndex などの OpenAI クライアントライブラリと互換性があります。

from huggingface_hub import InferenceClient

# Initialize the client, pointing it to one of the available models
client = InferenceClient()

chat_completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    messages=[
        {"role": "system", "content": "You are a helpful an honest programming assistant."},
        {"role": "user", "content": "Is Rust better than Python?"},
    ],
    stream=True,
    max_tokens=500
)

# iterate and print stream
for message in chat_completion:
    print(message.choices[0].delta.content, end="")

8-2. HuggingFace Inference Endpoints

「Llama 3.1」は、バックエンドとして「Text Generation Inference」を使用する HuggingFaceの推論エンドポイントにデプロイできます。「Text Generation Inference」は、FP8、連続バッチ処理、トークンストリーミング、複数の GPU での高速推論のためのテンソル並列処理をサポートする、HuggingFace によって開発された、本番環境対応の推論コンテナです。「Llama 3.1」をデプロイするには、モデルページに移動し、「Deploy → Inference Endpoints」をクリックします。

・Meta-Llama-3.1-8B-Instruct
recommended on 1x NVIDIA A10G or L4 GPUs
・Meta-Llama-3.1-70B-Instruct
recommended on 4x NVIDIA A100 or as AWQ/GPTQ quantized on 2x A100s
・Meta-Llama-3.1-405B-Instruct-FP8
recommended on 8x NVIDIA H100 in FP or as AWQ/GPTQquantized on 8x A100s

from huggingface_hub import InferenceClient

# Initialize the client, pointing it to one of the available models
client = InferenceClient(
    base_url="<ENDPOINT_URL>",
)

# Create a chat completion
chat_completion = client.chat.completions.create(
    model="ENDPOINT",
    messages=[
        {"role": "system", "content": "You are a helpful an honest programming assistant."},
        {"role": "user", "content": "Is Rust better than Python?"},
    ],
    stream=True,
    max_tokens=500
)

# iterate and print stream
for message in chat_completion:
    print(message.choices[0].delta.content, end="")

9. HuggingFaceパートナーインテグレーション

現在、AWS、Google Cloud、Microsoft Azure、DELLのパートナーと協力して、Llama 3.1 8B、70B、405B を Amazon SageMaker、Google Kubernetes Engine、Vertex AI モデルカタログ、Azure AI Studio、DELL Enterprise Hub に追加しています。

10. HuggingFace TRL によるファインチューニング

OpenAssistant のチャットデータセットで Llama 3.1 8B をファインチューニングする例を以下に示します。4bit量子化と QLoRA を使用してメモリを節約し、すべてのアテンションブロックの線形レイヤーをターゲットにします。

pip install "transformers>=4.43" --upgrade
pip install --upgrade bitsandbytes
pip install --ugprade peft
pip install git+https://github.com/huggingface/trl
git clone https://github.com/huggingface/trl
cd trl

python \
    examples/scripts/sft.py \
    --model_name meta-llama/Meta-Llama-3.1-8B \
    --dataset_name OpenAssistant/oasst_top1_2023-08-25 \
    --dataset_text_field="text" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 2e-4 \
    --report_to "none" \
    --bf16 \
    --max_seq_length 1024 \
    --lora_r 16 --lora_alpha 32 \
    --lora_target_modules q_proj k_proj v_proj o_proj \
    --load_in_4bit \
    --use_peft \
    --attn_implementation "flash_attention_2" \
    --logging_steps=10 \
    --gradient_checkpointing \
    --output_dir llama31

GPUに余裕がある場合は、DeepSpeed と ZeRO Stage 3 を使用して学習を実行できます。

accelerate launch --config_file=examples/accelerate_configs/deepspeed_zero3.yaml \
    examples/scripts/sft.py \
    --model_name meta-llama/Meta-Llama-3.1-8B \
    --dataset_name OpenAssistant/oasst_top1_2023-08-25 \
    --dataset_text_field="text" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 2e-5 \
    --report_to wandb \
    --bf16 \
    --max_seq_length 1024 \
    --attn_implementation eager \
    --logging_steps=10 \
    --gradient_checkpointing \
    --output_dir models/llama

11. distilabel による合成データ生成

「Llama 3.1」のライセンスの大きな変更点は、モデル出力を使用して他のLLMを改善できるようになったことです。つまり、「Llama 3.1」を使用して合成データセットを生成し、それを使用してより小さく、より特化したモデルをファインチューニングできます。

合成データ生成用のオープンソースフレームワークである「distilabel」を使用して、好みのデータセットを生成する方法の例を見てみます。このデータセットは、DPOやKTOなどのTRLが提供する好みの最適化方法でモデルをファインチューニングするために使用できます。

pip install “distilabel[hf-inference-endpoints]” --upgrade

次に、次のパイプラインを定義します。

・HuggingFace Hub からの指示を含むデータセットをロード。
・HuggingFace Inference Endpoints を介して Llama 3.1 70B Instruct と Llama 3.1 405B Instruct を使用して応答を生成。
・最後に、Llama 3.1 405B Instruct を審査員として使用し、UltraFeedback プロンプトを使用して応答を評価。これらの評価から、選択された応答と拒否された応答を選択し、好みの最適化方法でモデルを微調整するために使用できる。

以下のコードを参照してパイプラインを定義するか、この Colabノートブックを使用して自分で実行し、Hubで生成されたデータセットを調べてください。

from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub, CombineColumns
from distilabel.steps.tasks import TextGeneration, UltraFeedback

llama70B = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-70B-Instruct"
)
llama405B = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8"
)

with Pipeline(name="synthetic-data-with-llama3") as pipeline:
    # load dataset with prompts
    load_dataset = LoadDataFromHub(
        repo_id="argilla/10Kprompts-mini"
    )
    # generate two responses for each prompt
    generate = [
        TextGeneration(llm=llama70B),
        TextGeneration(llm=llama405B)
    ]
    # combine responses into one column
    combine = CombineColumns(
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"]
    )
    # rate responses with 405B LLM-as-a-judge
    rate = UltraFeedback(aspect="overall-rating", llm=llama405B)
    # define the pipeline
    load_dataset >> generate >> combine >> rate

if __name__ == "__main__":
    distiset = pipeline.run()

12. 追加リソース

・Models on the Hub
・HuggingFace Llama Recipes
・Open LLM Leaderboard
・Chat demo on Hugging Chat
・Meta Blog

この記事が気に入ったらサポートをしてみませんか？