[Tokyo gubernatorial candidate #安野たかひろ #AIあんの] Category generation and category classification with GPT-4o + LangChain

July 6, 2024, 23:43

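Hello, this is Shimizu from the Team Anno engineering group. We are allowed to write blog posts as part of the election campaign until 23:59 on Saturday, July 6. Please stay with us to the very end!
This article covers category generation and category classification using GPT-4o and LangChain. It is one of the human-in-the-loop processes for making AIあんの smarter.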

For more about AIあんの, please see the following article.

For the overall picture of the human-in-the-loop process for making AIあんの smarter, please see the following article.

We want to classify the questions that could not be answered

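AIあんの answers your questions on YouTube Live, but there are some questions it cannot answer. It is useful to classify what kinds of questions went unanswered: people can review the categorized questions and, category by category, bring in experts to build out the FAQ and related material.
This increases the number of questions it can answer and helps everyone understand 安野たかひろ's policies better.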

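The difficulty is that we cannot predict or know in advance what kinds of questions users will send. Because of this, simply assigning each question to a category from a predefined list is not enough.
An expert who skimmed the spreadsheet proposed the following categories.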

However, it turned out that these did not cover everything, which motivated us to build a mechanism that reads all of the questions and then extracts suitable categories.

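We ended up with the following design.

1. Have GPT-4o generate categories
2. Narrow them down to a number that is convenient for humans to review
3. Assign categories to the questions again
It is a pipeline made up of these 3 steps.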

For how we collect the questions that AIあんの could not answer, please refer to the following article.

Category generation with GPT-4o

The questions that AIあんの could not answer are stored as data in the following format.

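We extract information from this data. For the overall flow and the code, we drew on TTTC (Talk to the City), which is maintained by members of the same team.
Please take a look at this article as well.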

Prompt

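When you ask an LLM (large language model) to do a task through LangChain, you prepare a prompt that describes the task in detail. The LLM takes that prompt and the data, carries out the task, and returns the result.
For this task, the prompt asks the model to categorize the questions submitted on YouTube Live. To keep the output manageable for humans, it asks for 7 main categories with 3 subcategories each.
Because we want to process the results mechanically, it also states explicitly that the results should be in JSON format.
The result is the lengthy prompt below.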

# system prompt
system= """
you are a category labeling assistant. 
we are a team of volunteers who are working on a political campaign.
the campaign is for the mayor of tokyo in 2024, japan.
the candidate is a 33 year old, male, former AI engineer, 
has ran 2 successful startups,
and has written an award winning book of Science Fiction.
we have a 93 page policy document, and most of the answers come from there.
however, we would like to update/enrich the policy document 
with what the public is asking.
we have a youtube live, and users have been asking questions.
where there is sufficient information in the policy document, 
the answer is provided.
when there is not, we give an answer as 
'その質問には答えられません。私はまだ学習中であるため、答えられないこともあります。
申し訳ありません。'.

we would like to know the broad category of the questions provided, 
so we can tune 
our answers to the incoming questions.
your task is to come up with around 7 categories and 3 subcategories 
for each category.

each answer should be in a json format with keys 
'category', 'subcategory', 'subcategories'.
subcategory should be 1 of the 3 subcategories for the category.
this should be a plain one line json, with no newlines.

there will be {num_questions} questions to categorize.
the final format should be a list of jsons, one for each question.

the result should be in JAPANESE.

the question are categorized as follows:
"""

The code that runs this prompt is as follows.

import json
import logging

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Assumption: the original post does not show how `llm` is created.
llm = ChatOpenAI(model="gpt-4o")


def get_categories(batch, retries=3):
    """Ask the LLM to categorize one batch of questions; retry on malformed output."""
    num_questions = len(batch)
    human = get_human_template(num_questions)

    prompt = ChatPromptTemplate.from_messages([
        ("system", system),
        ("human", human),
    ])
    chain = prompt | llm

    questions = format_batch_questions(batch)
    result = chain.invoke(
        {
            "num_questions": num_questions,
            # unpack a dict like {"question_0": "foo", "answer_0": "bar"}
            **questions,
        }
    )
    try:
        res = json.loads(result.content)
        if len(res) != num_questions:
            raise ValueError("Expected the same number of results as questions")
        return res
    except (json.decoder.JSONDecodeError, ValueError) as e:
        if retries > 0:
            logging.info(f"error: {e} {retries=} ...")
            return get_categories(batch, retries=retries-1)
        else:
            # out of retries: fall back to a placeholder for every question
            logging.error(f"Could not parse response: {result.content}")
            default = {
                "category": "unknown",
                "subcategory": "unknown",
                "subcategories": ["unknown", "unknown", "unknown"],
            }
            return [default] * num_questions

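One trick here is the retry: if the response does not come back as valid JSON, or the number of results does not match the number of questions, the request is sent again.
In practice, each question is assigned a category in the following form.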

question: 都道の渋滞対策はどのようなものを者大考えですか？
そうだよりもハード施策を教えてください

category: 都市計画とインフラ	
subcategory: 公共交通
subcategories: [{'subcategory': '公共交通', 'subcategory_count': 4},
 {'subcategory': 'インフラ', 'subcategory_count': 4}, 
{'subcategory': '住宅政策', 'subcategory_count': 4}]
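The retry idea can also be looked at in isolation. The following is a minimal sketch, not the code we ran: `parse_json_list` and `call_with_retry` are hypothetical helper names, the LLM is replaced by a stand-in that fails once, and the loop is iterative rather than recursive.

```python
import json
from typing import Callable


def parse_json_list(text: str, expected: int) -> list:
    """Parse text as a JSON list and check that it has `expected` items."""
    parsed = json.loads(text)
    if not isinstance(parsed, list) or len(parsed) != expected:
        raise ValueError(f"expected a list of {expected} items")
    return parsed


def call_with_retry(call: Callable[[], str], expected: int, retries: int = 3) -> list:
    """Call `call` up to retries + 1 times until its output parses cleanly."""
    for _ in range(retries + 1):
        try:
            return parse_json_list(call(), expected)
        except (json.JSONDecodeError, ValueError):
            continue
    # Same fallback idea as get_categories: one placeholder per question.
    return [{"category": "unknown", "subcategory": "unknown"} for _ in range(expected)]


# Hypothetical stand-in for the LLM: fails once, then returns valid JSON.
replies = iter(["not json", '[{"category": "a", "subcategory": "b"}]'])
result = call_with_retry(lambda: next(replies), expected=1)
print(result)
```

Checking the item count as well as the JSON syntax matters once several questions are sent in one request, because a reply can be valid JSON and still be missing an entry.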

Speeding up with parallel requests (multithreading)

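Sending the request above to the LLM took 2-3 seconds per response. To speed things up, we sent requests in parallel using the implementation below. Running 10 requests in parallel gave roughly a 10x speedup.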

import concurrent.futures


def parallel_process(batches: list, max_workers=10) -> list[list[dict]]:
    """Run get_categories on each batch in a thread pool, preserving batch order."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(get_categories, batch) for batch in batches]
        concurrent.futures.wait(futures)
    return [future.result() for future in futures]


def flatten(list_of_lists):
    """Flatten a list of lists into a single list."""
    return [item for sublist in list_of_lists for item in sublist]
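Threads are a good fit here because each worker spends almost all of its time waiting for an HTTP response. The sketch below shows the same pattern without any API calls: `fake_get_categories` is a hypothetical stand-in that sleeps for a short, made-up latency, and `executor.map` is used as a compact alternative to `submit` plus `wait`.

```python
import concurrent.futures
import time


def fake_get_categories(batch: list) -> list:
    """Hypothetical stand-in for get_categories: sleeps instead of calling an LLM."""
    time.sleep(0.05)  # simulated network latency
    return [{"category": f"category for {q}"} for q in batch]


def run_parallel(batches: list, max_workers: int = 10) -> list:
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the order of `batches`,
        # regardless of which request finishes first.
        return list(executor.map(fake_get_categories, batches))


batches = [[f"q{i}"] for i in range(10)]
start = time.perf_counter()
results = run_parallel(batches)
elapsed = time.perf_counter() - start
print(f"{len(results)} batches in {elapsed:.2f}s")
```

Keeping the results in input order is what later allows them to be joined back to the original questions by position.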

Speeding up with batching

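The first version of the code made one API request per question. Since each response takes 2-3 seconds, processing still took about 20 minutes even with the parallelization from the previous section.
We wondered whether the LLM could take 10 items in a single request and return 10 results, so we implemented batching. The code is as follows.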

def format_batch_questions(batch) -> dict:
    # Flatten the batch into numbered keys, starting from 0:
    # "question_0": "foo", "answer_0": "bar"
    # "question_1": "foo", "answer_1": "bar"
    # ...
    res = {}
    for i, d in enumerate(batch):
        res[f"question_{i}"] = d["question"]
        res[f"answer_{i}"] = d["answer"]
    return res


def get_human_template(num_questions: int):
    """Build a template with one line per question, like

    question_0: {question_0}, answer_0: {answer_0}
    question_1: {question_1}, answer_1: {answer_1}
    """
    human_template = "question_{i}: {question_{i}}, answer_{i}: {answer_{i}}"
    lines = []
    for i in range(num_questions):
        # "{i}" is replaced literally, leaving {question_0} etc. for LangChain to fill
        lines.append(human_template.replace("{i}", str(i)))
    return "\n".join(lines)


def get_batch(df, batch_size, start=0):
    return df.iloc[start:start+batch_size].to_dict(orient="records")
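To make the numbered placeholders concrete, here is a small sketch of the text the LLM ends up seeing for a batch of two. It is only an illustration: `build_human_message` is a hypothetical helper, the rows are made up, and plain `str.format` stands in for LangChain's template filling.

```python
def build_human_message(batch: list) -> str:
    """Render the per-batch human message that the template would produce."""
    # One line per question, with numbered placeholders: {question_0}, {answer_0}, ...
    template = "\n".join(
        f"question_{i}: {{question_{i}}}, answer_{i}: {{answer_{i}}}"
        for i in range(len(batch))
    )
    # Matching numbered values, built the same way as in format_batch_questions.
    values = {}
    for i, row in enumerate(batch):
        values[f"question_{i}"] = row["question"]
        values[f"answer_{i}"] = row["answer"]
    return template.format(**values)


# Hypothetical rows, not taken from the real data.
batch = [
    {"question": "Q-A", "answer": "A-A"},
    {"question": "Q-B", "answer": "A-B"},
]
message = build_human_message(batch)
print(message)
```

Numbering the lines gives the model an explicit count to match, which pairs naturally with the check that the number of results equals the number of questions.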

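Interestingly, batches of up to 15 items worked, but beyond that the results no longer came back with the correct count or as valid JSON. After a few attempts we settled on a batch size of 15, which made execution about 15 times faster and let the whole run finish in about 2 minutes.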

Final extraction code

Putting these together, the final code, sped up by both the number of workers and the batch size, looks like this.

import pandas as pd
from tqdm import tqdm


def extract(df, batch_size=15, max_workers=10):
    results = []
    for i in tqdm(range(0, len(df), max_workers * batch_size)):
        # get up to max_workers batches, skipping empty ones past the end of df
        batches = [get_batch(df, batch_size, start=i+j*batch_size) for j in range(max_workers)]
        batches = [batch for batch in batches if batch]
        batches_res = parallel_process(batches, max_workers=max_workers)
        # concat the results to original batches and save
        df_res = pd.concat(
            [
                pd.DataFrame(flatten(batches)),
                pd.DataFrame(flatten(batches_res)),
            ],
            axis=1,
        )
        results.append(df_res)
        # save intermediate results on every iteration so progress can be inspected
        all_df = pd.concat(results, ignore_index=True)
        all_df.to_csv("data/df_res_回答できなかった質問.tsv", sep="\t", index=False)
    return pd.concat(results, ignore_index=True)

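Because we wanted to check during the run that everything was working correctly, we designed it to save the intermediate results repeatedly along the way.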

Narrowing down the categories with GPT-4o

200 top-level categories

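Initially we thought the category generation above could be used directly for classification. To make that possible, the prompt asked for "7 top-level categories with 3 subcategories each". Looking at the actual results, however, each individual answer did contain a top-level category and 3 subcategories, but gathered together they amounted to 200 distinct top-level categories, as shown below.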

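Since the LLM sees only one batch of 15 items per generation call, we concluded that it is unavoidable for it to struggle with producing categories that are consistent across the whole data set.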
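Counting the distinct labels across all batches is what makes this visible. Below is a minimal sketch with `collections.Counter`; the three rows are hypothetical stand-ins for the per-question results, not our real output.

```python
from collections import Counter

# Hypothetical rows standing in for the per-question results of the generation step.
rows = [
    {"category": "都市計画とインフラ", "subcategory": "公共交通"},
    {"category": "都市計画", "subcategory": "公共交通"},
    {"category": "都市計画とインフラ", "subcategory": "住宅政策"},
]

category_counts = Counter(row["category"] for row in rows)
subcategory_counts = Counter(row["subcategory"] for row in rows)

# Near-duplicate labels such as "都市計画" and "都市計画とインフラ" are counted
# separately, which is how the number of distinct categories grows.
print(len(category_counts), "distinct categories")
print(category_counts.most_common())
```

Counts like these are also a natural input for a consolidation step, since they indicate which labels cover the most questions.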

Narrowing down the categories

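We therefore gave the LLM the 200 top-level categories and the many subcategories, and asked it to consolidate them, in the most meaningful way, into 7 top-level categories with 3 subcategories each. The prompt is as follows.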

system = """
you are a category labelling assistant. you need to make 7 categories and 21 subcategories.
these come from users's questions to the candidate of the mayor of tokyo for 2024. 

the candidate is a 33 year old, male, former AI engineer, has ran 2 successful startups,
and has written an award winning book of Science Fiction.
we have a 93 page policy document, and most of the answers come from there.
however, we would like to update/enrich the policy document with what the public is asking.
we have a youtube live, and users have been asking questions.

from these labels, make 7 categories and 21 subcategories, that sufficently map most of the interest.
the categories should be broad, and the subcategories should be specific.
1 category should have 3 subcategories.
the potential categories and subcategories data have counts associated with them.
the current candidates for categories are 
{current_categories}

they should be taken into account, but you don't need to follow them exactly.
however, 他候補者との関係 must be a subcategory>
as you can see, the top level cateogory should be a long word or sentence that 
takes into account many aspects of the subcategories.

the result should be a json file with the categories and subcategories.
just the json file, no headers or anything else. 
add the counts to the json file as well.
so the keys should be category, category_count, subcategory, subcategory_count.

the potential categories are {categories} and the subcategories are {subcategories}.
"""

As a result, we obtained the following categories.

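- **経済と財政政策** (Economy and fiscal policy) [経済政策 (economic policy), 財政政策 (fiscal policy), 税制改革 (tax reform)]
- **テクノロジーとイノベーション** (Technology and innovation) [AI技術 (AI technology), ブロックチェーン (blockchain), 自動運転技術 (autonomous driving technology)]
- **教育と子育て支援** (Education and childcare support) [高等教育 (higher education), 奨学金 (scholarships), 子育て支援 (childcare support)]
- **社会福祉と福祉政策** (Social welfare and welfare policy) [障害者支援 (support for people with disabilities), 高齢者支援 (support for older adults), 生活保護 (public assistance)]
- **環境とエネルギー政策** (Environment and energy policy) [再生可能エネルギー (renewable energy), エネルギー政策 (energy policy), 気候変動対策 (climate change measures)]
- **都市計画とインフラ** (Urban planning and infrastructure) [公共交通 (public transit), 交通インフラ (transport infrastructure), 住宅政策 (housing policy)]
- **選挙と政治活動** (Elections and political activity) [選挙戦略 (election strategy), 他候補者との関係 (relationship with other candidates), 選挙活動 (campaign activities)]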

To human eyes some of these looked like they overlapped, so we changed just a few of them and used the following as the final categories.

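- **経済と財政政策** (Economy and fiscal policy) [経済政策 (economic policy), 財政政策 (fiscal policy), 税制改革 (tax reform)]
- **テクノロジーとイノベーション** (Technology and innovation) [AI技術 (AI technology), ブロックチェーン (blockchain), 自動運転技術 (autonomous driving technology)]
- **教育と子育て支援** (Education and childcare support) [高等教育 (higher education), 奨学金 (scholarships), 子育て支援 (childcare support)]
- **社会福祉と福祉政策** (Social welfare and welfare policy) [障害者支援 (support for people with disabilities), 高齢者支援 (support for older adults), 生活保護 (public assistance)]
- **環境とエネルギー政策** (Environment and energy policy) [再生可能エネルギー (renewable energy), 原発 (nuclear power), 気候変動対策 (climate change measures)]
- **都市計画とインフラ** (Urban planning and infrastructure) [公共交通 (public transit), インフラ (infrastructure), 住宅政策 (housing policy)]
- **選挙と政治活動** (Elections and political activity) [選挙戦略 (election strategy), 他候補者との関係 (relationship with other candidates), 選挙活動 (campaign activities)]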

Category classification with GPT-4o

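Finally, treating the categories above as the ground truth, we asked the LLM to assign the matching category and subcategory to each question once more. The prompt is as follows.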

system = """
you are a category labelling assistant. you need to make 7 categories and 21 subcategories.
these come from users's questions to the candidate of the mayor of tokyo for 2024. 

the candidate is a 33 year old, male, former AI engineer, has ran 2 successful startups,
and has written an award winning book of Science Fiction.
we have a 93 page policy document, and most of the answers come from there.
however, we would like to update/enrich the policy document with what the public is asking.
we have a youtube live, and users have been asking questions.

each question should be labelled with a category and a subcategory.
the categories and subcategories are: {categories}
read all the categories and subcategories very carefully, and use your imagination to 
find the one that matches the best.
if there is really nothing that matches, you can use the category and subcategory as "other".
before labelling as "other", you must carefully read all the categories and subcategories.
and find a one that will with some context be a match.

each answer should be in a json format with keys 'category', 'subcategory', 'subcategories'.
subcategory should be 1 of the 3 subcategories for the category.
this should be a plain one line json, with no newlines.

there will be {batch_size} questions to categorize.
the final format should be a list of jsons, one for each question.

the result should be in JAPANESE.

this question is labeled as:
"""

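The rest of the code was reused from the category generation step. Interestingly, a batch size of 15 started producing errors here, so we ran the classification with a batch size of 5. Matching questions to existing categories may be a harder task for an LLM than generating categories.
We handed the results to the subject-matter experts who write the FAQ for each area. They thanked us, saying the work goes much more smoothly than it did without categories, and it felt well worth doing.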

We received words of thanks
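As a complement to the classification step, the returned labels can be checked against the final category list before they are handed over. This is only a sketch of one possible check, not something described in our pipeline: `TAXONOMY` copies the final categories from this article, and `is_valid_label` and the sample labels are hypothetical.

```python
# The final categories from this article, as a lookup table.
TAXONOMY = {
    "経済と財政政策": ["経済政策", "財政政策", "税制改革"],
    "テクノロジーとイノベーション": ["AI技術", "ブロックチェーン", "自動運転技術"],
    "教育と子育て支援": ["高等教育", "奨学金", "子育て支援"],
    "社会福祉と福祉政策": ["障害者支援", "高齢者支援", "生活保護"],
    "環境とエネルギー政策": ["再生可能エネルギー", "原発", "気候変動対策"],
    "都市計画とインフラ": ["公共交通", "インフラ", "住宅政策"],
    "選挙と政治活動": ["選挙戦略", "他候補者との関係", "選挙活動"],
}


def is_valid_label(label: dict) -> bool:
    """Return True if the label uses a known category/subcategory pair or "other"."""
    category = label.get("category")
    subcategory = label.get("subcategory")
    if category == "other" and subcategory == "other":
        return True
    return subcategory in TAXONOMY.get(category, [])


# Hypothetical labels, as a classification reply might contain them.
labels = [
    {"category": "都市計画とインフラ", "subcategory": "公共交通"},
    {"category": "都市計画とインフラ", "subcategory": "奨学金"},  # wrong parent
    {"category": "other", "subcategory": "other"},
]
print([is_valid_label(label) for label in labels])
```

Labels that fail such a check could be sent through the same retry path as malformed JSON, or flagged for a human to look at.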

Summary

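This time we classified the unanswerable questions in 3 steps: generation, narrowing, and classification. It was my first time using LangChain in earnest, so I arrived at this design by checking the results one step at a time. Someone more experienced could probably do it in a single step by generating and classifying categories recursively. If you get that working, I would love to hear about it.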

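The long campaign is in its final stretch, and tomorrow is election day! Please cast your vote for 安野たかひろ!
We are live streaming until 23:50 on Saturday, July 6, so we hope you will join us!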

#安野たかひろを都知事に

For the latest information, follow the official X (Twitter) accounts of the candidate and his office!
安野たかひろ office (@annotakahiro24)
安野たかひろ himself (@takahiroanno)
