ChatGPTでEmbeddings-based searchを使ってみよう!
Embeddings-based searchとは、何かと疑問に持つ方もいるでしょう。適切な日本語が無いようですので、無難なところでベクトル型検索としておきます。
ベクトル型検索とは、情報をベクトルにして、角度や距離を利用して類似度を探しあてる仕組みのことです。高校生で習う三角関数のcos(θ)あたりです。今回は、このようなことは忘れても大丈夫です。
今回は、ChatGPTのOpenAIがこのベクトル型検索についての記事がありましたので、これをコードで実行してベクトル型検索がどういうものかみていきます。
問題の前提は、ChatGPTは2021年9月以降の情報は持っていないということです。最新のウェブページの情報や、新規公開の情報には対応していないということです。
この問題を解決するために、最新の情報をベクトル化して、ChatGPTが最新情報にアクセスできるようにしようというのが今回紹介する内容になります。
今回は、下記のサイトを参考にさせて頂きました。
今回のベクトル型検索の大枠は、以下です。
検索データの準備
収集
チャンク化
ベクトル化
保存
検索
質問
とりあえず実行してみよう
検索データの準備では、以下コードを実行します。Google Colabで動くように修正しています。Your-API-KeyにはOpenAIのAPIキーを記載下さい。
下記のコードは、ChatGPTは2021年9月以降の情報は持っていないので回答することが出来ないので、2022年のオリンピックの結果については回答できません。2022年のオリンピックの結果情報をベクトル化して読み込ませて回答させるという流れになります。
!pip install openai
!pip install tiktoken
# imports
import ast # for converting embeddings saved as strings back to arrays
import openai # for calling the OpenAI API
import pandas as pd # for storing text and embeddings data
import tiktoken # for counting tokens
from scipy import spatial # for calculating vector similarities for search
#OpenAPIKey
openai.api_key = "Your-API-Key"
# models
EMBEDDING_MODEL = "text-embedding-ada-002"
GPT_MODEL = "gpt-3.5-turbo"
# text copied and pasted from: https://en.wikipedia.org/wiki/Curling_at_the_2022_Winter_Olympics
# I didn't bother to format or clean the text, but GPT will still understand it
# the entire article is too long for gpt-3.5-turbo, so I only included the top few sections
wikipedia_article_on_curling = """Curling at the 2022 Winter Olympics
Article
Talk
Read
Edit
View history
From Wikipedia, the free encyclopedia
Curling
at the XXIV Olympic Winter Games
Curling pictogram.svg
Curling pictogram
Venue Beijing National Aquatics Centre
Dates 2–20 February 2022
No. of events 3 (1 men, 1 women, 1 mixed)
Competitors 114 from 14 nations
← 20182026 →
Men's curling
at the XXIV Olympic Winter Games
Medalists
1st place, gold medalist(s) Sweden
2nd place, silver medalist(s) Great Britain
3rd place, bronze medalist(s) Canada
Women's curling
at the XXIV Olympic Winter Games
Medalists
1st place, gold medalist(s) Great Britain
2nd place, silver medalist(s) Japan
3rd place, bronze medalist(s) Sweden
Mixed doubles's curling
at the XXIV Olympic Winter Games
Medalists
1st place, gold medalist(s) Italy
2nd place, silver medalist(s) Norway
3rd place, bronze medalist(s) Sweden
Curling at the
2022 Winter Olympics
Curling pictogram.svg
Qualification
Statistics
Tournament
Men
Women
Mixed doubles
vte
The curling competitions of the 2022 Winter Olympics were held at the Beijing National Aquatics Centre, one of the Olympic Green venues. Curling competitions were scheduled for every day of the games, from February 2 to February 20.[1] This was the eighth time that curling was part of the Olympic program.
In each of the men's, women's, and mixed doubles competitions, 10 nations competed. The mixed doubles competition was expanded for its second appearance in the Olympics.[2] A total of 120 quota spots (60 per sex) were distributed to the sport of curling, an increase of four from the 2018 Winter Olympics.[3] A total of 3 events were contested, one for men, one for women, and one mixed.[4]
Qualification
Main article: Curling at the 2022 Winter Olympics – Qualification
Qualification to the Men's and Women's curling tournaments at the Winter Olympics was determined through two methods (in addition to the host nation). Nations qualified teams by placing in the top six at the 2021 World Curling Championships. Teams could also qualify through Olympic qualification events which were held in 2021. Six nations qualified via World Championship qualification placement, while three nations qualified through qualification events. In men's and women's play, a host will be selected for the Olympic Qualification Event (OQE). They would be joined by the teams which competed at the 2021 World Championships but did not qualify for the Olympics, and two qualifiers from the Pre-Olympic Qualification Event (Pre-OQE). The Pre-OQE was open to all member associations.[5]
For the mixed doubles competition in 2022, the tournament field was expanded from eight competitor nations to ten.[2] The top seven ranked teams at the 2021 World Mixed Doubles Curling Championship qualified, along with two teams from the Olympic Qualification Event (OQE) – Mixed Doubles. This OQE was open to a nominated host and the fifteen nations with the highest qualification points not already qualified to the Olympics. As the host nation, China qualified teams automatically, thus making a total of ten teams per event in the curling tournaments.[6]
Summary
Nations Men Women Mixed doubles Athletes
Australia Yes 2
Canada Yes Yes Yes 12
China Yes Yes Yes 12
Czech Republic Yes 2
Denmark Yes Yes 10
Great Britain Yes Yes Yes 10
Italy Yes Yes 6
Japan Yes 5
Norway Yes Yes 6
ROC Yes Yes 10
South Korea Yes 5
Sweden Yes Yes Yes 11
Switzerland Yes Yes Yes 12
United States Yes Yes Yes 11
Total: 14 NOCs 10 10 10 114
Competition schedule
The Beijing National Aquatics Centre served as the venue of the curling competitions.
Curling competitions started two days before the Opening Ceremony and finished on the last day of the games, meaning the sport was the only one to have had a competition every day of the games. The following was the competition schedule for the curling competitions:
RR Round robin SF Semifinals B 3rd place play-off F Final
Date
Event
Wed 2 Thu 3 Fri 4 Sat 5 Sun 6 Mon 7 Tue 8 Wed 9 Thu 10 Fri 11 Sat 12 Sun 13 Mon 14 Tue 15 Wed 16 Thu 17 Fri 18 Sat 19 Sun 20
Men's tournament RR RR RR RR RR RR RR RR RR SF B F
Women's tournament RR RR RR RR RR RR RR RR SF B F
Mixed doubles RR RR RR RR RR RR SF B F
Medal summary
Medal table
Rank Nation Gold Silver Bronze Total
1 Great Britain 1 1 0 2
2 Sweden 1 0 2 3
3 Italy 1 0 0 1
4 Japan 0 1 0 1
Norway 0 1 0 1
6 Canada 0 0 1 1
Totals (6 entries) 3 3 3 9
Medalists
Event Gold Silver Bronze
Men
details Sweden
Niklas Edin
Oskar Eriksson
Rasmus Wranå
Christoffer Sundgren
Daniel Magnusson Great Britain
Bruce Mouat
Grant Hardie
Bobby Lammie
Hammy McMillan Jr.
Ross Whyte Canada
Brad Gushue
Mark Nichols
Brett Gallant
Geoff Walker
Marc Kennedy
Women
details Great Britain
Eve Muirhead
Vicky Wright
Jennifer Dodds
Hailey Duff
Mili Smith Japan
Satsuki Fujisawa
Chinami Yoshida
Yumi Suzuki
Yurika Yoshida
Kotomi Ishizaki Sweden
Anna Hasselborg
Sara McManus
Agnes Knochenhauer
Sofia Mabergs
Johanna Heldin
Mixed doubles
details Italy
Stefania Constantini
Amos Mosaner Norway
Kristin Skaslien
Magnus Nedregotten Sweden
Almida de Val
Oskar Eriksson
Teams
Men
Canada China Denmark Great Britain Italy
Skip: Brad Gushue
Third: Mark Nichols
Second: Brett Gallant
Lead: Geoff Walker
Alternate: Marc Kennedy
Skip: Ma Xiuyue
Third: Zou Qiang
Second: Wang Zhiyu
Lead: Xu Jingtao
Alternate: Jiang Dongxu
Skip: Mikkel Krause
Third: Mads Nørgård
Second: Henrik Holtermann
Lead: Kasper Wiksten
Alternate: Tobias Thune
Skip: Bruce Mouat
Third: Grant Hardie
Second: Bobby Lammie
Lead: Hammy McMillan Jr.
Alternate: Ross Whyte
Skip: Joël Retornaz
Third: Amos Mosaner
Second: Sebastiano Arman
Lead: Simone Gonin
Alternate: Mattia Giovanella
Norway ROC Sweden Switzerland United States
Skip: Steffen Walstad
Third: Torger Nergård
Second: Markus Høiberg
Lead: Magnus Vågberg
Alternate: Magnus Nedregotten
Skip: Sergey Glukhov
Third: Evgeny Klimov
Second: Dmitry Mironov
Lead: Anton Kalalb
Alternate: Daniil Goriachev
Skip: Niklas Edin
Third: Oskar Eriksson
Second: Rasmus Wranå
Lead: Christoffer Sundgren
Alternate: Daniel Magnusson
Fourth: Benoît Schwarz
Third: Sven Michel
Skip: Peter de Cruz
Lead: Valentin Tanner
Alternate: Pablo Lachat
Skip: John Shuster
Third: Chris Plys
Second: Matt Hamilton
Lead: John Landsteiner
Alternate: Colin Hufman
Women
Canada China Denmark Great Britain Japan
Skip: Jennifer Jones
Third: Kaitlyn Lawes
Second: Jocelyn Peterman
Lead: Dawn McEwen
Alternate: Lisa Weagle
Skip: Han Yu
Third: Wang Rui
Second: Dong Ziqi
Lead: Zhang Lijun
Alternate: Jiang Xindi
Skip: Madeleine Dupont
Third: Mathilde Halse
Second: Denise Dupont
Lead: My Larsen
Alternate: Jasmin Lander
Skip: Eve Muirhead
Third: Vicky Wright
Second: Jennifer Dodds
Lead: Hailey Duff
Alternate: Mili Smith
Skip: Satsuki Fujisawa
Third: Chinami Yoshida
Second: Yumi Suzuki
Lead: Yurika Yoshida
Alternate: Kotomi Ishizaki
ROC South Korea Sweden Switzerland United States
Skip: Alina Kovaleva
Third: Yulia Portunova
Second: Galina Arsenkina
Lead: Ekaterina Kuzmina
Alternate: Maria Komarova
Skip: Kim Eun-jung
Third: Kim Kyeong-ae
Second: Kim Cho-hi
Lead: Kim Seon-yeong
Alternate: Kim Yeong-mi
Skip: Anna Hasselborg
Third: Sara McManus
Second: Agnes Knochenhauer
Lead: Sofia Mabergs
Alternate: Johanna Heldin
Fourth: Alina Pätz
Skip: Silvana Tirinzoni
Second: Esther Neuenschwander
Lead: Melanie Barbezat
Alternate: Carole Howald
Skip: Tabitha Peterson
Third: Nina Roth
Second: Becca Hamilton
Lead: Tara Peterson
Alternate: Aileen Geving
Mixed doubles
Australia Canada China Czech Republic Great Britain
Female: Tahli Gill
Male: Dean Hewitt
Female: Rachel Homan
Male: John Morris
Female: Fan Suyuan
Male: Ling Zhi
Female: Zuzana Paulová
Male: Tomáš Paul
Female: Jennifer Dodds
Male: Bruce Mouat
Italy Norway Sweden Switzerland United States
Female: Stefania Constantini
Male: Amos Mosaner
Female: Kristin Skaslien
Male: Magnus Nedregotten
Female: Almida de Val
Male: Oskar Eriksson
Female: Jenny Perret
Male: Martin Rios
Female: Vicky Persinger
Male: Chris Plys
"""
query = f"""Use the below article on the 2022 Winter Olympics to answer the subsequent question. If the answer cannot be found, write "I don't know."
Article:
\"\"\"
{wikipedia_article_on_curling}
\"\"\"
Question: Which athletes won the gold medal in curling at the 2022 Winter Olympics?"""
response = openai.ChatCompletion.create(
messages=[
{'role': 'system', 'content': 'You answer questions about the 2022 Winter Olympics.'},
{'role': 'user', 'content': query},
],
model=GPT_MODEL,
temperature=0,
)
print(response['choices'][0]['message']['content'])
上記を実行してきますと、下記のようにベクトル化して読み込ませた情報に対して回答します。
下記以降は、一般的な情報をベクトル化させる流れとなりますので、情報のベクトル化の流れと関連度の表示に関して備忘録として残したいため記載しています。
検索データ準備
# download pre-chunked text and pre-computed embeddings
# this file is ~200 MB, so may take a minute depending on your connection speed
embeddings_path = "https://cdn.openai.com/API/examples/data/winter_olympics_2022.csv"
df = pd.read_csv(embeddings_path)
# convert embeddings from CSV str type back to list type
df['embedding'] = df['embedding'].apply(ast.literal_eval)
# the dataframe has two columns: "text" and "embedding"
df
下記の実行結果が表示されます。
検索
検索フェーズでは、関数を準備します。この関数は、与えられた質問に対して、関連度が高い順に並べる関数となります。
# search function
def strings_ranked_by_relatedness(
query: str,
df: pd.DataFrame,
relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
top_n: int = 100
) -> tuple[list[str], list[float]]:
"""Returns a list of strings and relatednesses, sorted from most related to least."""
query_embedding_response = openai.Embedding.create(
model=EMBEDDING_MODEL,
input=query,
)
query_embedding = query_embedding_response["data"][0]["embedding"]
strings_and_relatednesses = [
(row["text"], relatedness_fn(query_embedding, row["embedding"]))
for i, row in df.iterrows()
]
strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
strings, relatednesses = zip(*strings_and_relatednesses)
return strings[:top_n], relatednesses[:top_n]
以下コードは、curling gold medalの関連度を高い順に5つ並べるコードとなります。
# examples
strings, relatednesses = strings_ranked_by_relatedness("curling gold medal", df, top_n=5)
for string, relatedness in zip(strings, relatednesses):
print(f"{relatedness=:.3f}")
display(string)
下記のように、relatedness=0.879、0.872と関連度が高い順に表示されます。
質問
ここでは、指定された質問に対して、GPTを利用して回答を生成する関数となります。
def num_tokens(text: str, model: str = GPT_MODEL) -> int:
"""Return the number of tokens in a string."""
encoding = tiktoken.encoding_for_model(model)
return len(encoding.encode(text))
def query_message(
query: str,
df: pd.DataFrame,
model: str,
token_budget: int
) -> str:
"""Return a message for GPT, with relevant source texts pulled from a dataframe."""
strings, relatednesses = strings_ranked_by_relatedness(query, df)
introduction = 'Use the below articles on the 2022 Winter Olympics to answer the subsequent question. If the answer cannot be found in the articles, write "I could not find an answer."'
question = f"\n\nQuestion: {query}"
message = introduction
for string in strings:
next_article = f'\n\nWikipedia article section:\n"""\n{string}\n"""'
if (
num_tokens(message + next_article + question, model=model)
> token_budget
):
break
else:
message += next_article
return message + question
def ask(
query: str,
df: pd.DataFrame = df,
model: str = GPT_MODEL,
token_budget: int = 4096 - 500,
print_message: bool = False,
) -> str:
"""Answers a query using GPT and a dataframe of relevant texts and embeddings."""
message = query_message(query, df, model=model, token_budget=token_budget)
if print_message:
print(message)
messages = [
{"role": "system", "content": "You answer questions about the 2022 Winter Olympics."},
{"role": "user", "content": message},
]
response = openai.ChatCompletion.create(
model=model,
messages=messages,
temperature=0
)
response_message = response["choices"][0]["message"]["content"]
return response_message
次の質問をしてみます。
ask('Which athletes won the gold medal in curling at the 2022 Winter Olympics?')