Trying distilabel on Google Colab

January 18, 2024, 10:32

I tried "distilabel" on "Google Colab", so here is a summary.

1. distilabel

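"distilabel" is an AI Feedback (AIF) framework for building datasets for LLMs by using LLMs.

・Integrates with the most common LLM libraries and APIs (HuggingFace Transformers, OpenAI, vLLM, etc.)
・Supports multiple tasks, such as Self-Instruct and Preference datasets
・Exports datasets to Argilla, which makes data exploration and further annotation easier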

2. Setup

The setup steps on Google Colab are as follows.

(1) Install the packages.

# Install the packages
!pip install distilabel[openai,argilla]

(2) Prepare the environment variable.
Using the key icon on the left edge, register your own OpenAI API key as "OPENAI_API_KEY", then run the cell.

import os
from google.colab import userdata

# Prepare the environment variable (set OPENAI_API_KEY via the key icon on the left edge)
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
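Before calling the API, it can help to confirm that the variable is actually populated without printing the key itself. The following is a minimal sketch using only the standard library; `require_env` and `mask` are hypothetical helper names introduced here and are not part of Colab or distilabel.

```python
import os


def require_env(name: str, env=os.environ) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; register it before running the cells below.")
    return value


def mask(secret: str, visible: int = 4) -> str:
    """Hide all but the last few characters so the value is safe to print."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

In the notebook, `print(mask(require_env("OPENAI_API_KEY")))` would then show only the last four characters of the key.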

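3. Creating a Self-Instruct dataset

A "Self-Instruct dataset" is a dataset used for purposes such as supervised fine-tuning (SFT).
The steps to create it are as follows.

(1) Open the file list with the folder icon on the left edge, open "/usr/local/lib/python3.10/dist-packages/distilabel/tasks/_templates/self-instruct.jinja2", and change the "Criteria for Queries" section as shown below.
This modifies the prompt so that the output is written in Japanese.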

# Criteria for Queries
Incorporate a diverse range of verbs, avoiding repetition.
Ensure queries are compatible with AI model's text generation functions and are limited to 1-2 sentences.
Design queries to be self-contained and standalone.
Blend interrogative (e.g., xの意味は何ですか?) and imperative (e.g., xのプロセスを詳しく説明してください。) styles.
Write each query on a separate line and avoid using numbered lists or bullet points.
Please be sure to write your queries in Japanese. Do not write in English.
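The same edit could also be applied from a cell instead of the file browser. The sketch below is one way to do it under stated assumptions: `add_rule` is a hypothetical helper, it assumes the "# Criteria for Queries" section ends at the next blank line, and it only appends the final Japanese-output line (the Japanese examples in the interrogative/imperative line would still need to be edited by hand).

```python
from pathlib import Path

JAPANESE_RULE = "Please be sure to write your queries in Japanese. Do not write in English."


def add_rule(template_text: str,
             heading: str = "# Criteria for Queries",
             rule: str = JAPANESE_RULE) -> str:
    """Insert `rule` as the last line of the section that starts at `heading`."""
    lines = template_text.splitlines()
    if rule in lines:
        return template_text  # already patched, nothing to do
    start = lines.index(heading)  # raises ValueError if the heading is missing
    end = start + 1
    # Walk to the end of the section: the next blank line or the end of the file.
    while end < len(lines) and lines[end].strip():
        end += 1
    lines.insert(end, rule)
    trailing_newline = "\n" if template_text.endswith("\n") else ""
    return "\n".join(lines) + trailing_newline


# Usage on Colab (path taken from the step above; uncomment to apply):
# path = Path("/usr/local/lib/python3.10/dist-packages/distilabel/tasks/_templates/self-instruct.jinja2")
# path.write_text(add_rule(path.read_text(encoding="utf-8")), encoding="utf-8")
```

Running the helper twice is harmless, because it returns the text unchanged when the rule is already present.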

(2) Create the input dataset.
Prepare the topics for the instructions.

from distilabel.tasks import SelfInstructTask
from distilabel.llm import OpenAILLM
from distilabel.pipeline import Pipeline
from datasets import Dataset

# Prepare the topics for the instructions
math_topics = [
    "Algebraic Expressions",
    "Linear Equations",
    "Quadratic Equations",
    "Polynomial Functions",
    "Rational Expressions",
    "Exponential Functions",
    "Logarithmic Functions",
    "Sequences and Series",
    "Matrices",
    "Determinants",
    "Complex Numbers",
    "Trigonometry",
    "Geometry",
    "Coordinate Geometry",
    "Vector Algebra",
    "Statistics",
    "Probability",
    "Calculus",
    "Differential Calculus",
    "Integral Calculus",
    "Limits and Continuity",
    "Differentiation",
    "Integration",
    "Theorems of Calculus",
    "Mathematical Reasoning",
    "Set Theory",
    "Number Theory",
    "Permutations and Combinations",
    "Binomial Theorem",
    "Arithmetic Progressions",
    "Geometric Progressions",
    "Harmonic Progressions",
    "Trigonometric Ratios",
    "Trigonometric Identities",
    "Inverse Trigonometric Functions",
    "Hyperbolic Functions",
    "Conic Sections",
    "Circle Geometry",
    "Ellipse Geometry",
    "Parabola Geometry",
    "Hyperbola Geometry",
    "Function Theory",
    "Graph Theory",
    "Differential Equations",
    "Mathematical Induction",
    "Discrete Mathematics",
]

# Prepare the input dataset
dataset = Dataset.from_dict({
    "input": math_topics
})

(3) Prepare the Self-Instruct task.
This task generates instructions from the topics.

# Application description
application_description = (
    "An AI assistant adept at answering a wide array of math, logic, and reasoning puzzles, trivia, "
    "and general questions. Users of this assistant love to ask the assistant to think and outlines "
    "the solutions step by step. It expects complete questions from users providing all the details "
    "to solve the proposed problem or respond to general knowledge questions. It covers general "
    "knowledge about math, puzzles, reasoning exercises, and real-life scenarios where math and "
    "reasoning are important."
)

# Prepare the Self-Instruct task
instruction_task = SelfInstructTask(
    application_description=application_description
)

(4) Run the Self-Instruct task.

# Run the Self-Instruct task
instruction_generator = OpenAILLM(
    task=instruction_task,
    num_threads=8,
    max_new_tokens=1024,
    temperature=0.7
)
pipeline = Pipeline(generator=instruction_generator)
distiset = pipeline.generate(
    dataset=dataset,
    num_generations=10,
    batch_size=4
)
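To get a feel for the size of this run, the request count can be worked out by hand. The sketch below is plain arithmetic, assuming every requested generation comes back; the 2302 figure is the row count reported for the output dataset later in this article.

```python
num_topics = 46         # number of entries in math_topics
num_generations = 10    # value passed to pipeline.generate

# One raw response per topic and per generation.
total_responses = num_topics * num_generations
print(total_responses)  # 460

# Instructions per response implied by the reported dataset size.
reported_instructions = 2302
per_response = reported_instructions / total_responses
print(round(per_response, 2))  # 5.0
```

So each response contributes roughly five instructions on average, which is worth keeping in mind when estimating API cost for a longer topic list.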

(5) Create the output dataset.

import re

# Create the output dataset
def transform(inst: str) -> str:
    """Remove 1., 2., ... from the instruction."""
    clean_inst = re.sub(r'^\d+\.\s*', '', inst)
    return clean_inst

instructions = [
    transform(instruction)
    for generations in distiset["raw_generation_responses"]
    for generation in generations
    for instruction in generation.split("\n")
    if instruction != ""
]
dataset = Dataset.from_dict({"instructions": instructions})

# Check the output dataset
print(dataset)
print(dataset[0])
print(dataset[1])
print(dataset[2])
print(dataset[3])
print(dataset[4])

Dataset({
    features: ['instructions'],
    num_rows: 2302
})
{'instructions': 'この代数的な式を解くための最初のステップは何ですか？'}
{'instructions': '式の要素を交換するための適切な手順を教えてください。'}
{'instructions': '式を単純化するための一般的な手法は何ですか？'}
{'instructions': '式を因数分解するための効果的な方法を教えてください。'}
{'instructions': '式を拡張するために使用される基本的な手順は何ですか？'}

A Self-Instruct dataset of 2302 instructions was generated.
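The flattening step in (5) can be tried without calling the API. The sketch below reuses the same regular expression on made-up responses (the sample strings are illustrative only, not real model output); `flatten` is a hypothetical name for the list comprehension used above.

```python
import re


def transform(inst: str) -> str:
    """Remove a leading "1. ", "2. ", ... from one instruction line."""
    return re.sub(r"^\d+\.\s*", "", inst)


def flatten(raw_generation_responses):
    """Turn nested raw responses into a flat list of cleaned instructions."""
    return [
        transform(line)
        for generations in raw_generation_responses  # one entry per topic
        for generation in generations                # one entry per generation
        for line in generation.split("\n")           # one instruction per line
        if line != ""                                # skip empty lines
    ]


# Made-up responses: two topics with two generations each.
sample = [
    ["1. 行列式とは何ですか?\n2. 行列の積を計算してください。", "ベクトルの内積を説明してください。"],
    ["1. 確率の定義を教えてください。\n\n2. 平均を求めてください。", ""],
]
instructions = flatten(sample)
print(len(instructions))  # 5
```

Note that the pattern is anchored at the start of the line, so numbers that appear in the middle of an instruction are left untouched.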

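4. Creating a Preference dataset

A "Preference dataset" is a dataset used for purposes such as DPO.
The steps to create it are as follows.

(1) Create the input dataset.
Rename the "instructions" column of the Self-Instruct dataset to "input".

# Create the input dataset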
dataset = dataset.rename_column("instructions", "input")

(2) Prepare the TextGenerationTask.
This task generates answers from the instructions.

from distilabel.tasks import TextGenerationTask

# Prepare the TextGenerationTask
text_generation_task = TextGenerationTask(
    principles_distribution={
        "harmlessness": 0.4,
        "helpfulness": 0.2,
        "truthfulness": 0.2,
        "honesty": 0.1,
        "verbalized_calibration": 0.1
    }
)

generator = OpenAILLM(
    task=text_generation_task,
    num_threads=8,
    max_new_tokens=1024
)
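The values in `principles_distribution` read as probabilities over the principle names. My understanding is that one principle is chosen per prompt according to these weights, but that is an assumption about the library; the sketch below only illustrates weighted sampling with the standard library and is not distilabel's implementation. `sample_principle` is a hypothetical helper.

```python
import math
import random

principles_distribution = {
    "harmlessness": 0.4,
    "helpfulness": 0.2,
    "truthfulness": 0.2,
    "honesty": 0.1,
    "verbalized_calibration": 0.1,
}

# The weights are treated as probabilities, so they should add up to 1.
assert math.isclose(sum(principles_distribution.values()), 1.0)


def sample_principle(distribution, rng):
    """Pick one principle name with probability proportional to its weight."""
    names = list(distribution)
    weights = list(distribution.values())
    return rng.choices(names, weights=weights, k=1)[0]


# Draw many samples to see that the frequencies follow the weights.
rng = random.Random(0)
counts = {name: 0 for name in principles_distribution}
for _ in range(10_000):
    counts[sample_principle(principles_distribution, rng)] += 1
print(counts)
```

With the weights used in this article, "harmlessness" should be drawn about four times as often as "honesty".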

(3) Prepare the UltraFeedbackTask.
This task rates the answers given to each instruction.

from distilabel.tasks import UltraFeedbackTask

# Prepare the UltraFeedbackTask
preference_labeller = OpenAILLM(
    task=UltraFeedbackTask.for_instruction_following(),
    num_threads=8,
    max_new_tokens=1024,
)

(4) Run the pipeline.

# Prepare the pipeline
pipeline = Pipeline(
    generator=generator,
    labeller=preference_labeller
)

# Run the pipeline
distiset_pref = pipeline.generate(
    dataset=dataset.shuffle().select(range(100)),
    num_generations=3,
    batch_size=8
)

A "Preference dataset" was generated with 3 answers and ratings for each of 100 instructions selected at random from the 2302.

(5) Check the columns.

# Check the columns
distiset_pref.column_names

['input',
 'generation_model',
 'generation_prompt',
 'raw_generation_responses',
 'generations',
 'labelling_model',
 'labelling_prompt',
 'raw_labelling_response',
 'rating',
 'rationale']

(6) Check the answers to an instruction.
Three answers are generated for each instruction.

print("Instruction:", distiset_pref[0]["input"])
for generation in distiset_pref[0]["generations"]:
    print("----")
    print(generation)

Instruction: 指数関数の定義を教えてください。
----
指数関数は、数学的には次のように定義されます。

指数関数とは、変数を指数とする形の関数であり、底となる数の値（通常は正の実数）を指数にべき乗することによって求められる関数です。

一般的な指数関数の形式は、f(x) = a^x です。ここで、aは底となる数であり、xは指数です。

指数関数は、底として使用される数が正である場合、xが正の整数の場合は単調に増加し、xが負の整数の場合は単調に減少します。aが1より大きい場合は指数関数が増加し、0より大きく1未満の場合は指数関数が減少します。
----
指数関数は数学的な関数の一種であり、次のように定義されます。

指数関数は、底（base）と呼ばれる正の定数 a（a ≠ 1）と、実数 x を引数として受け取ります。そして、その出力値は a の x 乗で定義されます。数式で表すと、指数関数は次のようになります：

f(x) = a^x

ここで、a を底とする指数関数と呼びます。x は指数と呼ばれ、指数関数の引数または指数として扱われます。指数関数では、底 a を正の定数として選ぶことができますが、底が 1 である場合は指数関数ではありません。

指数関数は、指数の増加に対して指数的に増加または指数的に減少します。言い換えると、指数関数の値は指数が増えるごとに急速に増大または減少します。
----
指数関数は、数学的な関数の一種であり、指数と呼ばれる数を底とし、指数が変数となる関数です。一般的な指数関数の形式は、f(x) = a^x と表されます。ここで、a は底と呼ばれる定数であり、x は指数と呼ばれる変数です。指数関数は、底の値が正の実数であり、指数が任意の実数である場合に定義されます。指数関数は、底の値が 1 より大きい場合は増加し、0 より小さい場合は減少する性質を持ちます。また、底の値が 1 より小さい場合は減少し、0 より大きい場合は増加する性質を持ちます。指数関数は、数学や自然科学などのさまざまな分野で広く使われます。

(7) Check the ratings of the answers.

# Check the ratings of the answers
distiset_pref[0]["rating"]

[4.0, 5.0, 5.0]

(8) Check the rationale for the ratings.

# Check the rationale for the ratings
distiset_pref[0]["rationale"]

['The text provides a clear and accurate definition of an exponential function, including the form and behavior of the function for positive and negative values of the exponent. However, it does not explicitly mention the restriction that the base must be a positive real number.',
 'The text provides a clear and accurate definition of an exponential function, including the form and behavior of the function. It explicitly mentions the restriction that the base must be a positive constant other than 1.',
 'The text provides a clear and accurate definition of an exponential function, including the form and behavior of the function. It explicitly mentions the restriction that the base must be a positive real number. It also mentions the behaviors of the function for different values of the base.']
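Since the Preference dataset is meant for DPO-style training, each row eventually has to become a chosen/rejected pair. The sketch below shows one simple way to do that from the "generations" and "rating" columns: take the highest-rated answer as chosen and the lowest-rated as rejected. `to_preference_pair` is a hypothetical helper, and the "prompt"/"chosen"/"rejected" key names are an assumption about what the training code expects.

```python
def to_preference_pair(row):
    """Return a prompt/chosen/rejected dict, or None when no pair can be formed."""
    generations = row["generations"]
    ratings = row["rating"]
    # Skip rows where labelling failed or is incomplete.
    if not ratings or len(ratings) != len(generations):
        return None
    if any(r is None for r in ratings):
        return None
    best = max(range(len(ratings)), key=lambda i: ratings[i])
    worst = min(range(len(ratings)), key=lambda i: ratings[i])
    # If every answer has the same rating there is no preference to learn.
    if ratings[best] == ratings[worst]:
        return None
    return {
        "prompt": row["input"],
        "chosen": generations[best],
        "rejected": generations[worst],
    }


# Illustration with placeholder answers and the ratings shown above.
row = {"input": "Q", "generations": ["A", "B", "C"], "rating": [4.0, 5.0, 5.0]}
print(to_preference_pair(row))
```

When several answers share the top rating, as in the example above, the first of them is used; other tie-breaking rules (for example random choice) are equally reasonable.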
