[DLAI × LangChain Course, Part 1] Models, Prompts and Output Parsers

harukary

June 25, 2023, 15:17

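Background

I had been curious about LangChain, but I kept putting it off because it looked complicated and I wasn't sure it would become a standard.

Then DeepLearning.ai released a LangChain course, so I took it a little while ago.

I was impressed by things like the Output Parser in the first lesson, but then OpenAI itself released the Function calling feature the other day.

In a separate article (listed in the references), I tried out Function calling with a focus on its Output Parser-like capability.

That made me think LangChain might not matter much going forward, but

OpenAI functions | 🦜️🔗 Langchain

it was incorporated right away. They move far too fast. I may well end up using LangChain after all, so I'm going to summarize the course contents.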

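Approach

This article summarizes Lesson 1 of the DeepLearning.ai LangChain course.

DLAI - Learning Platform Beta

The StructuredOutputParser used in the course does not appear to be able to produce nested JSON, so I also tried PydanticOutputParser, which can.
This is almost the same thing as Function calling. My guess is that PydanticOutputParser is slightly worse in token count and accuracy.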

Sample

Import & Load API key

As usual, load the API key.

import os
import openai

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']

OpenAI package

First, an example without LangChain.
It defines a function that returns a response for a given prompt.

def get_completion(prompt, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(model=model,messages=messages,temperature=0)
    return response.choices[0].message["content"]

The first task is to translate an email from a customer.
Unlike the course, I set the output language to Japanese.

customer_email = """
Arrr, I be fuming that me blender lid \
flew off and splattered me kitchen walls \
with smoothie! And to make matters worse,\
the warranty don't cover the cost of \
cleaning up me kitchen. I need yer help \
right now, matey!
"""

style = """Japanese \
in a calm and respectful tone
"""

prompt = f"""Translate the text \
that is delimited by triple backticks 
into a style that is {style}.
text: ```{customer_email}```"""

Here is the final prompt.

Translate the text that is delimited by triple backticks 
into a style that is Japanese in a calm and respectful tone.
text: ```Arrr, I be fuming that me blender lid flew off and splattered me kitchen walls with smoothie! And to make matters worse,the warranty don't cover the cost of cleaning up me kitchen. I need yer help right now, matey!```

response = get_completion(prompt)
response

「ああ、私のブレンダーの蓋が飛んで、スムージーで私のキッチンの壁に飛び散ったことに怒っています！さらに悪いことに、保証は私のキッチンの掃除の費用をカバーしていません。今すぐあなたの助けが必要です、仲間！」

As an aside, ChatGPT is weak at casual Japanese. The text rarely comes out sounding natural.

LangChain PromptTemplate

Next, the same task using LangChain's PromptTemplate.

from langchain.chat_models import ChatOpenAI
chat = ChatOpenAI(temperature=0.0)

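Now let's build the PromptTemplate.
You could do this part with the string format method as well, but the advantage is probably that it produces messages in the form that can be passed to the chat model.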

template_string = """Translate the text \
that is delimited by triple backticks \
into a style that is {style}. \
text: ```{text}```
"""

from langchain.prompts import ChatPromptTemplate
prompt_template = ChatPromptTemplate.from_template(template_string)

prompt_template.messages[0].prompt

PromptTemplate(input_variables=['style', 'text'], output_parser=None, partial_variables={}, template='Translate the text that is delimited by triple backticks into a style that is {style}. text: ```{text}```\n', template_format='f-string', validate_template=True)

This creates a PromptTemplate from the template string.
You can inspect things like input_variables.

prompt_template.messages[0].prompt.input_variables

['style', 'text']
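
As a rough illustration of where a list like this can come from, the placeholders in an f-string-style template can be found with the standard library alone. This is a minimal sketch, not necessarily how LangChain does it internally; the helper name find_input_variables is hypothetical.

```python
from string import Formatter


def find_input_variables(template: str) -> list[str]:
    """Return the sorted, de-duplicated placeholder names in a format-style template."""
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion);
    # field_name is None for trailing literal text, so those entries are skipped.
    names = {field for _, field, _, _ in Formatter().parse(template) if field}
    return sorted(names)


FENCE = "`" * 3  # triple backticks, built this way to keep the example easy to read

template_string = (
    "Translate the text that is delimited by triple backticks "
    "into a style that is {style}. text: " + FENCE + "{text}" + FENCE + "\n"
)

print(find_input_variables(template_string))
```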

Now let's convert the email into the style "Japanese in a calm and respectful tone".

customer_style = """Japanese \
in a calm and respectful tone
"""

customer_email = """
Arrr, I be fuming that me blender lid \
flew off and splattered me kitchen walls \
with smoothie! And to make matters worse, \
the warranty don't cover the cost of \
cleaning up me kitchen. I need yer help \
right now, matey!
"""

customer_messages = prompt_template.format_messages(
                    style=customer_style,
                    text=customer_email)

customer_response = chat(customer_messages)

print(customer_response.content)

「ああ、私のブレンダーの蓋が飛んで、スムージーで私のキッチンの壁に飛び散ったことに怒っています！さらに悪いことに、保証は私のキッチンの掃除の費用をカバーしていません。今すぐあなたの助けが必要です、仲間よ！」

So PromptTemplate does the same thing a little more simply.

Next, let's create a reply email from the service operator.

service_reply = """Hey there customer, \
the warranty does not cover \
cleaning expenses for your kitchen \
because it's your fault that \
you misused your blender \
by forgetting to put the lid on before \
starting the blender. \
Tough luck! See ya!
"""

service_style_pirate = """\
a polite tone \
that speaks in Japanese Pirate\
"""

service_messages = prompt_template.format_messages(
    style=service_style_pirate,
    text=service_reply)

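In the course, the email is converted into "a polite tone that speaks in English Pirate" (polite English pirate speak). I couldn't judge whether that works properly, so I switched it to Japanese. (Can I really judge it even then?)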

service_response = chat(service_messages)
print(service_response.content)

おおっと、お客さん、保証はキッチンのクリーニング費用をカバーしていないんだ。なぜなら、ブレンダーを始める前に蓋を忘れて使ったのはあなたのミスだからさ。残念だけど、それじゃあまたね！

...Is that a pirate? Well, it seems to have more or less worked.
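
To make the "same template, different inputs" idea concrete without LangChain, here is a minimal sketch using plain str.format. The format_messages function below is a hypothetical stand-in that only mimics the shape of a chat message list; it is not LangChain's implementation.

```python
FENCE = "`" * 3  # triple backticks used as the text delimiter

TEMPLATE = (
    "Translate the text that is delimited by triple backticks "
    "into a style that is {style}. text: " + FENCE + "{text}" + FENCE
)


def format_messages(template: str, **variables: str) -> list[dict]:
    """Fill the template and wrap the result as a single user message."""
    return [{"role": "user", "content": template.format(**variables)}]


# The same template is reused for both directions of the conversation.
customer = format_messages(
    TEMPLATE,
    style="Japanese in a calm and respectful tone",
    text="Arrr, I be fuming!",
)
service = format_messages(
    TEMPLATE,
    style="a polite tone that speaks in Japanese Pirate",
    text="Tough luck! See ya!",
)

print(customer[0]["content"])
print(service[0]["content"])
```

A list in this shape is what the get_completion function earlier builds by hand for the OpenAI package.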

LangChain Output Parsers

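Next is the Output Parser. Of everything in the LangChain course, this struck me as the most broadly useful, and I was impressed. We extract the following data from a customer's review comment.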

{
  "gift": False,
  "delivery_days": 5,
  "price_value": "pretty affordable!"
}

customer_review = """\
This leaf blower is pretty amazing.  It has four settings:\
candle blower, gentle breeze, windy city, and tornado. \
It arrived in two days, just in time for my wife's \
anniversary present. \
I think my wife liked it so much she was speechless. \
So far I've been the only one using it, and I've been \
using it every other morning to clear the leaves on our lawn. \
It's slightly more expensive than the other leaf blowers \
out there, but I think it's worth it for the extra features.
"""

review_template = """\
For the following text, extract the following information:

gift: Was the item purchased as a gift for someone else? \
Answer True if yes, False if not or unknown.

delivery_days: How many days did it take for the product \
to arrive? If this information is not found, output -1.

price_value: Extract any sentences about the value or price,\
and output them as a comma separated Python list.

Format the output as JSON with the following keys:
gift
delivery_days
price_value

text: {text}
"""

First, the pattern without an Output Parser.

from langchain.prompts import ChatPromptTemplate
prompt_template = ChatPromptTemplate.from_template(review_template)

messages = prompt_template.format_messages(text=customer_review)
chat = ChatOpenAI(temperature=0.0)
response = chat(messages)
print(response.content)

{
	"gift": true,
	"delivery_days": 2,
	"price_value": ["It's slightly more expensive than the other leaf blowers out there, but I think it's worth it for the extra features."]
}

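Looks good. However, the result is a string, so it has to be parsed before use.
Before taking this course, I wrote my own instructions. The model quite often added prose outside the JSON part, so I detected ```json{data}``` with re and parsed that.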
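
The self-made approach mentioned above could look roughly like the following. This is a sketch under assumptions: the reply text is invented for illustration, and the extract_json helper is a hypothetical name, not code from the course.

```python
import json
import re

FENCE = "`" * 3  # triple backticks

# Capture whatever sits between the opening json fence and the next fence.
# re.DOTALL lets "." match newlines so multi-line JSON is captured.
JSON_BLOCK = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(reply: str) -> dict:
    """Pull the first fenced JSON block out of a chat reply and parse it."""
    match = JSON_BLOCK.search(reply)
    if match is None:
        raise ValueError("no fenced JSON block found in the reply")
    return json.loads(match.group(1))


# Invented reply with extra prose around the JSON, as described above.
reply = (
    "Sure, here is the data:\n"
    + FENCE + "json\n"
    + '{"gift": true, "delivery_days": 2}\n'
    + FENCE
    + "\nLet me know if you need more."
)

data = extract_json(reply)
print(data.get("gift"))
```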

Because the response is a string, the data cannot be retrieved as follows.

response.content.get('gift')

AttributeError                            Traceback (most recent call last)

    Cell In[111], line 4
    ----> 1 response.content.get('gift')

    AttributeError: 'str' object has no attribute 'get'

This is where the OutputParser comes in.

from langchain.output_parsers import ResponseSchema
from langchain.output_parsers import StructuredOutputParser

With StructuredOutputParser, you provide a key name and a description for each element, as follows.

gift_schema = ResponseSchema(name="gift",
                             description="Was the item purchased\
                             as a gift for someone else? \
                             Answer True if yes,\
                             False if not or unknown.")
delivery_days_schema = ResponseSchema(name="delivery_days",
                                      description="How many days\
                                      did it take for the product\
                                      to arrive? If this \
                                      information is not found,\
                                      output -1.")
price_value_schema = ResponseSchema(name="price_value",
                                    description="Extract any\
                                    sentences about the value or \
                                    price, and output them as a \
                                    comma separated Python list.")

response_schemas = [gift_schema, 
                    delivery_days_schema,
                    price_value_schema]

output_parser = StructuredOutputParser.from_response_schemas(response_schemas)

format_instructions = output_parser.get_format_instructions()

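These are the format instructions. The descriptions are given as comments inside the JSON. Adding them to the prompt makes the model output in JSON format, and that output can then be parsed by the parser.
In other words, an OutputParser bundles the prompt text that specifies the output format with the function that actually parses the output.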

print(format_instructions)

The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":
    
```json
{
	"gift": string  // Was the item purchased                             as a gift for someone else?                              Answer True if yes,                             False if not or unknown.
	"delivery_days": string  // How many days                                      did it take for the product                                      to arrive? If this                                       information is not found,                                      output -1.
	"price_value": string  // Extract any                                    sentences about the value or                                     price, and output them as a                                     comma separated Python list.
}
```
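
The pairing of "instructions to put in the prompt" and "a parser for the reply" can be sketched with the standard library alone. The classes below (FieldSpec, TinyStructuredParser) are hypothetical and only imitate the idea; they are not LangChain's StructuredOutputParser, and the instruction wording is simplified.

```python
import json
from dataclasses import dataclass

FENCE = "`" * 3  # triple backticks


@dataclass
class FieldSpec:
    name: str
    description: str


class TinyStructuredParser:
    """Bundles format instructions with the matching parse step."""

    def __init__(self, fields: list[FieldSpec]) -> None:
        self.fields = fields

    def get_format_instructions(self) -> str:
        # One line per key, with the description as a trailing comment.
        lines = [f'\t"{f.name}": string  // {f.description}' for f in self.fields]
        body = "\n".join(lines)
        return (
            "The output should be a markdown code snippet formatted in the "
            f"following schema:\n\n{FENCE}json\n{{\n{body}\n}}\n{FENCE}"
        )

    def parse(self, text: str) -> dict:
        # Locate the fenced JSON block, load it, and check the expected keys.
        opener = FENCE + "json"
        start = text.index(opener) + len(opener)
        end = text.index(FENCE, start)
        data = json.loads(text[start:end])
        missing = [f.name for f in self.fields if f.name not in data]
        if missing:
            raise ValueError(f"missing keys: {missing}")
        return data


parser = TinyStructuredParser([
    FieldSpec("gift", "Was the item purchased as a gift?"),
    FieldSpec("delivery_days", "Days until arrival, -1 if unknown."),
])
reply = FENCE + 'json\n{"gift": true, "delivery_days": "2"}\n' + FENCE

print(parser.get_format_instructions())
print(parser.parse(reply))
```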

Here is the prompt that includes the output format instructions.

review_template_2 = """\
For the following text, extract the following information:

gift: Was the item purchased as a gift for someone else? \
Answer True if yes, False if not or unknown.

delivery_days: How many days did it take for the product\
to arrive? If this information is not found, output -1.

price_value: Extract any sentences about the value or price,\
and output them as a comma separated Python list.

text: {text}

{format_instructions}
"""

prompt = ChatPromptTemplate.from_template(template=review_template_2)

messages = prompt.format_messages(text=customer_review, 
                                format_instructions=format_instructions)

response = chat(messages)

print(response.content)

```json
{
	"gift": true,
	"delivery_days": "2",
	"price_value": ["It's slightly more expensive than the other leaf blowers out there, but I think it's worth it for the extra features."]
}
```

A similar string is produced.
Now parse it with the OutputParser.

output_dict = output_parser.parse(response.content)

{'gift': True,
 'delivery_days': '2',
 'price_value': ["It's slightly more expensive than the other leaf blowers out there, but I think it's worth it for the extra features."]}

It was parsed without problems. Looks good.

output_dict.get('delivery_days')

'2'

Now we can retrieve a value from the dict, which was not possible earlier.
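
Note that delivery_days came back as the string '2' rather than an integer, which matches the "string" type shown in the format instructions. If numeric values are needed downstream, a small post-processing step is one option. The sketch below is an assumption about how one might do it; to_bool and normalize_review are hypothetical helpers, not part of LangChain.

```python
def to_bool(value) -> bool:
    """Accept real booleans as well as strings such as "True" or "false"."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() == "true"


def normalize_review(parsed: dict) -> dict:
    """Coerce the parsed review fields into the types we actually want."""
    price_value = parsed["price_value"]
    return {
        "gift": to_bool(parsed["gift"]),
        "delivery_days": int(parsed["delivery_days"]),
        # Wrap a bare string so the field is always a list.
        "price_value": price_value if isinstance(price_value, list) else [price_value],
    }


# Same shape as the parsed output shown above.
output_dict = {
    "gift": True,
    "delivery_days": "2",
    "price_value": [
        "It's slightly more expensive than the other leaf blowers out there, "
        "but I think it's worth it for the extra features."
    ],
}

review = normalize_review(output_dict)
print(review["delivery_days"] + 1)
```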

Bonus: PydanticOutputParser

The course showed how to parse single-level JSON.
There is also a way to output more complex JSON using Pydantic.

Pydantic (JSON) parser | 🦜️🔗 Langchain

Import & Load API key

import json
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field
from typing import List
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
chat = ChatOpenAI(temperature=0.0)

Extracting Food data

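This time, I tried outputting a food's nutrition and compatible foods in the following format.

{
    "name": "str: food name",
    "nutrition": {
        "calories": "int: calories per 100 g (kcal)",
        "carbohydrates": "float: carbohydrates per 100 g (g)",
        "protein": "float: protein per 100 g (g)",
        "vitamin_c": "float: vitamin C per 100 g (mg)",
        "fiber": "float: dietary fiber per 100 g (g)"
    },
    "compatible_foods": ["str: name of a compatible food"]
}

Incidentally, in my earlier self-made approach, I gave the model the format above and told it to follow it. It extracted data with reasonable accuracy, but now that OutputParser and Function calling exist, I probably won't use it anymore.

Define the classes with Pydantic as follows and create a PydanticOutputParser. Simple.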

class Nutrition(BaseModel):
    calories: int = Field(description='100gあたりカロリー（kcal）')
    carbohydrates: float = Field(description='100gあたり炭水化物（g）')
    protein: float = Field(description='100gあたりたんぱく質（g）')

    vitamin_c: float = Field(description='100gあたりビタミンC（mg）')
    fiber: float = Field(description='100gあたり食物繊維（g）')

class Food(BaseModel):
    name: str = Field(description='食材名')
    nutrition: Nutrition = Field(description='食材の100gあたり栄養素')
    compatible_foods: List[str] = Field(description='相性の良い食材（5つまで）')

parser = PydanticOutputParser(pydantic_object=Food)

From here on, it is the same as the course example.

# Prompt (Japanese): "Tell me about {food}, such as compatible foods and nutrients per 100 g."
about_food_template = """
{food}について教えてください。相性の良い食材や100gあたりの栄養成分など
{format_instructions}
"""
prompt = ChatPromptTemplate.from_template(template=about_food_template)

As a test, I asked about 「白菜」 (napa cabbage).

food = '白菜'
messages = prompt.format_messages(
    food=food, format_instructions=parser.get_format_instructions()
)
response = chat(messages)
food_data = parser.parse(response.content)

print(json.dumps(food_data.dict(),indent=2,ensure_ascii=False))

{
  "name": "白菜",
  "nutrition": {
	"calories": 13,
	"carbohydrates": 2.2,
	"protein": 1.2,
	"vitamin_c": 45.0,
	"fiber": 1.2
  },
  "compatible_foods": [
	"豚肉",
	"鶏肉",
	"牛肉",
	"エビ",
	"イカ"
  ]
}

The output looks good. If you remove the limit of five, it returns more compatible foods.

So with Pydantic, even complex JSON can be parsed. It looks like a Pydantic class defined for an API's output could be reused here as-is.
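
For readers without Pydantic at hand, the idea of turning nested JSON into typed objects can be sketched with standard-library dataclasses. This is only an illustration of the concept and not what PydanticOutputParser does internally; parse_food is a hypothetical helper, the field names follow the target format shown earlier, and the sample values are the ones printed above.

```python
import json
from dataclasses import dataclass


@dataclass
class Nutrition:
    calories: int
    carbohydrates: float
    protein: float
    vitamin_c: float
    fiber: float


@dataclass
class Food:
    name: str
    nutrition: Nutrition
    compatible_foods: list[str]


def parse_food(raw: str) -> Food:
    """Parse a JSON string into nested dataclasses, coercing numeric types."""
    data = json.loads(raw)
    n = data["nutrition"]  # raises KeyError if the nested object is missing
    nutrition = Nutrition(
        calories=int(n["calories"]),
        carbohydrates=float(n["carbohydrates"]),
        protein=float(n["protein"]),
        vitamin_c=float(n["vitamin_c"]),
        fiber=float(n["fiber"]),
    )
    return Food(
        name=str(data["name"]),
        nutrition=nutrition,
        compatible_foods=[str(item) for item in data["compatible_foods"]],
    )


raw = json.dumps({
    "name": "白菜",
    "nutrition": {
        "calories": 13, "carbohydrates": 2.2, "protein": 1.2,
        "vitamin_c": 45, "fiber": 1.2,
    },
    "compatible_foods": ["豚肉", "鶏肉"],
}, ensure_ascii=False)

food = parse_food(raw)
print(food.name, food.nutrition.calories, food.compatible_foods)
```

Pydantic adds validation errors, schema generation, and type coercion on top of this, which is why it pairs well with an output parser.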

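Summary

This article summarized Lesson 1 of the DeepLearning.ai LangChain course.

I had been saying that it might be best to "understand LangChain's implementation but write your own," but it is convenient, so maybe I should use it after all. With both the OpenAI API and LangChain being updated at high speed, it seems I have no choice but to keep up with both.

Also, though it is off the main topic, as the course examples show, my impression is that ChatGPT is weak at casual Japanese expressions. For that reason, I find it more appealing to use it behind the scenes, for example to structure unstructured text, than for text generation.
That said, I think generation can work too if the model is made to focus on language computation, and I'd like to write about that another time.

This LangChain course has six lessons, so I will keep writing up the rest. Thank you for reading.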

References

Function callingで複雑なJson形式を抽出する｜harukary
OpenAI functions | 🦜️🔗 Langchain
DLAI - Learning Platform Beta
Pydantic (JSON) parser | 🦜️🔗 Langchain
ChatGPTでURLから任意のJson形式でデータ抽出を行う｜harukary

Sample code

https://github.com/harukary/llm_samples/blob/main/LangChain/LangChain_DLAI_1/L1-Model_prompt_parser.ipynb
https://github.com/harukary/llm_samples/blob/main/LangChain/LangChain_DLAI_1/plus_L1.ipynb
