LlamaIndex CohereAI Embeddings

May 3, 2024, 22:22

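CohereAI Embeddings in LlamaIndex

The other day I was trying out code that uses Cohere and LlamaIndex to query external data.

While looking into that, I also tried LlamaIndex's CohereAI Embeddings, so these are my notes.

I followed the code on the page above, and asked Cohere Coral and ChatGPT (the free version of each) whenever I got stuck.
Much of the text below draws on Cohere Coral's answers in particular.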

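What are embeddings?

Embeddings are a technique widely used in natural language processing (NLP) and machine learning. The term refers to converting text, images, audio, or other data into numerical representations in a high-dimensional space.

An embedding is a way of representing data, usually as a vector or tensor. These representations capture the important features and patterns of the original data, so that algorithms can process it more efficiently and effectively.

One of the main benefits of embeddings is dimensionality reduction. An embedding can represent data in far fewer dimensions than the original data, which lowers computational cost and often makes algorithms more efficient.

Embeddings are commonly used for similarity search, clustering, classification, and other NLP tasks. Word embeddings, for example, capture the semantic relationships between words and make it possible to compute how similar two words are. Likewise, image embeddings capture the content and visual features of an image, which helps with image classification and image search.

Embeddings are a powerful tool for processing and analyzing data more efficiently, and they underpin many NLP and machine learning applications.

(Answer from Cohere Coral)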
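
To make the "similarity between words" part concrete, here is a minimal sketch of cosine similarity, one common way to compare two embedding vectors. The three-dimensional vectors and their names are made up for illustration; real embeddings such as the ones later in this post have 1024 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (not produced by any real model)
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.15]
banana = [0.1, 0.2, 0.95]

sim_king_queen = cosine_similarity(king, queen)
sim_king_banana = cosine_similarity(king, banana)

# Vectors pointing in similar directions score closer to 1.0
print(f"king vs queen:  {sim_king_queen:.3f}")
print(f"king vs banana: {sim_king_banana:.3f}")
```

The Similarity values printed for the retrieved nodes later in this post are the same kind of score: higher means the two vectors are closer.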

To get ready to run the code, install the packages and set the Cohere API key.

# Install the packages
!pip install llama-index-llms-cohere
!pip install llama-index-embeddings-cohere
!pip install llama-index
!pip install llama-index-postprocessor-cohere-rerank

# Read the Cohere trial API key registered in Google Colab secrets
from google.colab import userdata
COHERE_API_KEY = userdata.get('COHERE_API_KEY')

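Embedding types

Cohere Embed supports float (floating point), int8 (8-bit integer), binary, and ubinary (I could not work out what this one is; the name suggests unsigned binary) as the embedding_type.
Choose one according to your requirements for precision, memory efficiency, or intended use.
I used float and int8.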
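
To get a feel for what the smaller types trade away, here is a rough sketch of how a float vector could be turned into int8 values and into packed bits. This is only an illustration under my own assumptions: the helper names are hypothetical, the scaling rule is a simple one I picked, and it is not Cohere's actual quantization. The signed/unsigned byte convention at the end is likewise one common convention, not something I confirmed for binary and ubinary.

```python
def to_int8(vector):
    """Scale floats so the largest magnitude maps to 127, then round."""
    scale = 127 / max(abs(v) for v in vector)
    return [round(v * scale) for v in vector]

def to_bits(vector):
    """Keep only the sign of each dimension: 1 if positive, else 0."""
    return [1 if v > 0 else 0 for v in vector]

def pack_bits(bits):
    """Pack each group of 8 bits into one unsigned byte (0..255).

    Assumes len(bits) is a multiple of 8.
    """
    packed = []
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        packed.append(byte)
    return packed

floats = [0.5, -0.25, 0.1, -1.0, 0.75, 0.0, -0.6, 0.3]

int8_values = to_int8(floats)                       # one byte per dimension
bits = to_bits(floats)                              # one bit per dimension
unsigned_bytes = pack_bits(bits)                    # 8 dimensions per byte
signed_bytes = [b - 128 for b in unsigned_bytes]    # same bits, shifted to -128..127

print(int8_values)
print(bits, unsigned_bytes, signed_bytes)
```

The point is only the size: eight float dimensions become eight int8 bytes, or a single byte once reduced to bits.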

embedding_type="float"

The default embedding_type is float.

# Embed text with Cohere Embeddings
from llama_index.embeddings.cohere import CohereEmbedding

# input_type="search_query"
embed_model = CohereEmbedding(
    cohere_api_key=COHERE_API_KEY,
    model_name="embed-english-v3.0",
    # Tells the embedding model that the text being sent is a search query
    input_type="search_query",
)

# Get the embedding for the text "Hello CohereAI!"; this returns the embedding vector
embeddings = embed_model.get_text_embedding("Hello CohereAI!")

# Use len() to check the number of dimensions of the embedding vector
print(len(embeddings))
# The first 5 elements of the embedding vector
print(embeddings[:5])

1024
[-0.03074646, -0.0029201508, -0.058044434, -0.015457153, -0.02331543]

embedding_type="int8"

Here is the result when embedding_type is int8.

# Check with the int8 embedding_type
embed_model = CohereEmbedding(
    cohere_api_key=COHERE_API_KEY,
    model_name="embed-english-v3.0",
    input_type="search_query",
    embedding_type="int8",
)

embeddings = embed_model.get_text_embedding("Hello CohereAI!")

print(len(embeddings))
print(embeddings[:5])

1024
[-53, -29, -90, -15, -25]

The embed-english-v3.0 given in model_name="embed-english-v3.0" is a Cohere Embeddings model.
It is trained to generate embeddings for English text.
For multilingual use there are models such as embed-multilingual-v3.0.

Type of data sent to Embeddings

input_type can be set to search_document or search_query.
First, use search_document to embed the data to be searched when building the index.
Then use search_query to embed the query you run against that data, and get the answer.

input_type="search_document"

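Apply this to the text data (documents) that will be stored in the vector database (index).
It indicates that the text sent to Cohere Embeddings for embedding is a document or other content to be searched.
Queries embedded with input_type="search_query", described next, are run against the documents or content specified here.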

input_type="search_query"

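Use this for the text (search query) used to find the most relevant documents in the vector database.
It indicates that the text sent to Cohere Embeddings is a search query.
The embedding model encodes the intent of the query as a vector, and the vector database returns the stored documents whose vectors are closest to it.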
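
The two-step flow (embed the documents first, then embed the query and look for the nearest documents) can be sketched without any API calls. Everything here is a toy: the vectors are hand-made stand-ins for what search_document and search_query embeddings would be, and retrieve is a hypothetical helper, not a LlamaIndex function.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Step 1: the "index" maps each document to its (made-up) document vector
document_vectors = {
    "Essay about startups": [0.9, 0.1, 0.0],
    "Recipe for curry": [0.0, 0.2, 0.9],
    "History of the web": [0.6, 0.7, 0.1],
}

def retrieve(query_vector, index, top_k=2):
    """Score every document against the query and return the top_k best."""
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for text, vector in index.items()
    ]
    scored.sort(reverse=True)
    return scored[:top_k]

# Step 2: the (made-up) query vector is compared against the stored vectors
query_vector = [0.8, 0.3, 0.0]
results = retrieve(query_vector, document_vectors)

for score, text in results:
    print(f"{score:.3f}  {text}")
```

A real vector store does the same thing at scale, usually with an approximate nearest-neighbour search instead of scoring every document.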

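Importing packages for indexing and querying

Let's try it with the int8 embedding_type.

# Import the packages
# Import the LlamaIndex index and reader classes and the Cohere LLM client
# Import the notebook utility function that displays retrieved nodes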
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.cohere import Cohere
from llama_index.core.response.notebook_utils import display_source_node

# Added to the reference site's code so that the answer can use Rerank
from llama_index.postprocessor.cohere_rerank import CohereRerank

Downloading the data

Download the sample data used in the CohereAI Embeddings example.

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

# Use the SimpleDirectoryReader class to read the files in ./data/paul_graham/ and get a list of documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

With int8 embedding_type

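With this setting, the embedding is represented as 8-bit integers (int8).
Each dimension is a signed 8-bit integer, taking an integer value from -128 to 127 (the sample output above includes negative values such as -53).
An 8-bit integer embedding is memory efficient and has lower storage requirements.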
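
As a quick check on the storage claim, the sketch below uses Python's standard array module to compare the raw size of a 1024-dimension vector stored as 4-byte floats and as signed 8-bit integers. It assumes 32-bit floats for the float case; how a given vector store actually lays out its data may differ.

```python
from array import array

DIMENSIONS = 1024  # same length as the embeddings printed earlier

# Type code 'f' is a 4-byte C float, 'b' is a signed 1-byte integer
float_vector = array("f", [0.0] * DIMENSIONS)
int8_vector = array("b", [0] * DIMENSIONS)

float_bytes = float_vector.itemsize * len(float_vector)
int8_bytes = int8_vector.itemsize * len(int8_vector)
print(f"float: {float_bytes} bytes, int8: {int8_bytes} bytes")

# A signed 8-bit slot accepts -128..127 and rejects anything outside that range
array("b", [-128, 127])
try:
    array("b", [128])
    rejected_out_of_range = False
except OverflowError:
    rejected_out_of_range = True
print("128 rejected:", rejected_out_of_range)
```

So per vector, int8 needs a quarter of the space of float, before any overhead added by the index itself.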

Build index with input_type = 'search_document'

Create the index used to run search queries.
The index holds the downloaded documents and a service context.
The service context specifies the LLM and the embedding model.
With a service context, several services and models can be passed around as a single object.

# Build index with input_type = 'search_document'
# Initialize the Cohere language model (not used below; the service context creates its own LLM)
llm = Cohere(model="command-nightly", api_key=COHERE_API_KEY)
# Initialize the embedding model with the CohereEmbedding class
embed_model = CohereEmbedding(
    cohere_api_key=COHERE_API_KEY,
    model_name="embed-english-v3.0",
    input_type="search_document",
    embedding_type="int8",
)

# Prepare the service context that bundles the models to use
# Getting a summarized answer to a question needs both an LLM and the embedding vectors
from llama_index.core import ServiceContext
service_context = ServiceContext.from_defaults(
    llm=Cohere(api_key=COHERE_API_KEY, model="command"),
    embed_model=embed_model
    )
# Set the documents and the service context on the index
index = VectorStoreIndex.from_documents(
    documents=documents, service_context=service_context
)

Build retriever with input_type = 'search_query'

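Under "Build retriever with input_type = 'search_query'", set up the query to run against the index prepared above.

# Build retriever with input_type = 'search_query'
# Embedding model for the query side; similarity between the search query and the documents is computed from these embeddings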
embed_model = CohereEmbedding(
    cohere_api_key=COHERE_API_KEY,
    model_name="embed-english-v3.0",
    input_type="search_query",
    embedding_type="int8",
)

# as_retriever() converts the index into a retriever for search queries
search_query_retriever = index.as_retriever()

# Run the search query with the retriever
search_query_retrieved_nodes = search_query_retriever.retrieve(
    "What happened in the summer of 1995?"
)

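Print the passages relevant to the search query as they are.

# source_length=2000 sets the maximum number of characters of source text to display
# Prints each node's unique ID, which can be used to identify and reference it, and the similarity score between the node and the search query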
for n in search_query_retrieved_nodes:
    display_source_node(n, source_length=2000)

Output:

Node ID: dfafd3a8-c57b-48b2-9575-83cb7865a8a3
Similarity: 0.30553992833446125
Text: Then I'll be able to work on whatever I want.

Meanwhile I'd been hearing more and more about this new thing called the World Wide Web. Robert Morris showed it to me when I visited him in Cambridge, where he was now in grad school at Harvard. It seemed to me that the web would be a big deal. I'd seen what graphical user interfaces had done for the popularity of microcomputers. It seemed like the web would do the same for the internet.
・・・

Node ID: 3e379be1-d389-4b87-8d5b-935008a9f1f2
Similarity: 0.2973328063324062
Text: But it didn't feel very valuable to me; I had no idea how to value a business, but I was all too keenly aware of the near-death experiences we seemed to have every few months. Nor had I changed my grad student lifestyle significantly since we started. So when Yahoo bought us it felt like going from rags to riches. Since we were going to California, I bought a car, a yellow 1998 VW GTI. I remember thinking that its leather seats alone were by far the most luxurious thing I owned.
・・・

With float embedding_type

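Now run the same process with the embedding represented as floating-point numbers (float).
Each dimension can hold a value with fractional precision.
A floating-point embedding provides higher precision and a smoother, continuous space.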

Build index with input_type = 'search_document'

llm = Cohere(model="command-nightly", api_key=COHERE_API_KEY)
embed_model = CohereEmbedding(
    cohere_api_key=COHERE_API_KEY,
    model_name="embed-english-v3.0",
    input_type="search_document",
    embedding_type="float",
)

index = VectorStoreIndex.from_documents(
    documents=documents, embed_model=embed_model
)

Build retriever with input_type = 'search_query'

embed_model = CohereEmbedding(
    cohere_api_key=COHERE_API_KEY,
    model_name="embed-english-v3.0",
    input_type="search_query",
    embedding_type="float",
)

search_query_retriever = index.as_retriever()

search_query_retrieved_nodes = search_query_retriever.retrieve(
    "What happened in the summer of 1995?"
)

for n in search_query_retrieved_nodes:
    display_source_node(n, source_length=2000)

Node ID: 306088b5-6073-4730-9063-31a3168e2fed
Similarity: 0.304231493396034
Text: Then I'll be able to work on whatever I want.

Meanwhile I'd been hearing more and more about this new thing called the World Wide Web. Robert Morris showed it to me when I visited him in Cambridge, where he was now in grad school at Harvard. It seemed to me that the web would be a big deal. I'd seen what graphical user interfaces had done for the popularity of microcomputers. It seemed like the web would do the same for the internet.
・・・

Node ID: dc52b068-5a96-4cd4-8528-8d9cbb62606f
Similarity: 0.29574399929426565
Text: But it didn't feel very valuable to me; I had no idea how to value a business, but I was all too keenly aware of the near-death experiences we seemed to have every few months. Nor had I changed my grad student lifestyle significantly since we started. So when Yahoo bought us it felt like going from rags to riches. Since we were going to California, I bought a car, a yellow 1998 VW GTI. I remember thinking that its leather seats alone were by far the most luxurious thing I owned.
・・・

Generating an answer

Above, the passages relevant to the query were retrieved as they are.
This time, generate an answer.

# Create a CohereRerank instance to rerank the search results
cohere_rerank = CohereRerank(api_key=COHERE_API_KEY)

# Create the query engine with reranking of the retrieved results enabled
query_engine = index.as_query_engine(node_postprocessors=[cohere_rerank])

# Generate the response
# Get the response to the query "...?"
response = query_engine.query("What happened in the summer of 1995?")

Generated result:

print(response)

In the summer of 1995, the narrator began trying to write software to build online stores. At first, this was intended to be normal desktop software, but they quickly realized that running the software on the server and letting users control it by clicking on links would be more efficient. They decided to try making a version of their store builder that could be controlled through a browser. On August 12th, they had a version that worked. Despite a horrible UI, it proved that a whole store could be built through a browser without the need for any client software or typing anything into the command line on the server.
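
To show where reranking sits in this flow, here is a toy two-stage sketch: candidates come back from the retriever, a second scoring step re-orders them, and only the best ones go on to the LLM. The keyword_overlap function is a hypothetical stand-in I made up so the example runs offline; CohereRerank calls a hosted rerank model instead, and the candidate texts are shortened paraphrases, not real retriever output.

```python
def keyword_overlap(query, text):
    """Hypothetical stand-in for a rerank model: fraction of query words found in text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)

def rerank(query, candidates, top_n=2):
    """Re-order first-stage candidates by the second-stage score and keep top_n."""
    ranked = sorted(
        candidates,
        key=lambda text: keyword_overlap(query, text),
        reverse=True,
    )
    return ranked[:top_n]

# Pretend these came back from the retriever, in similarity order
candidates = [
    "I visited Cambridge and heard about the web",
    "Yahoo bought us and I bought a car",
    "in the summer of 1995 we started writing software",
]
query = "what happened in the summer of 1995"

reranked = rerank(query, candidates)
for text in reranked:
    print(text)
```

The retriever's order and the reranker's order can differ, which is the whole reason for adding the second stage.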

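Uploading my own data

Here I use a text about the history of AI that I used in another post.
To get a summarized answer, follow the same steps as above.

1) Install the packages

# Install the packages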
!pip install llama-index-llms-cohere
!pip install llama-index-embeddings-cohere
!pip install llama-index
!pip install --upgrade llama_index
!pip install llama-index-postprocessor-cohere-rerank

2) Authenticate with the Cohere API key

# Read the Cohere trial API key registered in Google Colab secrets
from google.colab import userdata
COHERE_API_KEY = userdata.get('COHERE_API_KEY')

3) Import the libraries

from llama_index.embeddings.cohere import CohereEmbedding

# Import the LlamaIndex index and reader classes and the Cohere LLM client
# Import the notebook utility function that displays retrieved nodes
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.cohere import Cohere
from llama_index.core.response.notebook_utils import display_source_node

# Added to the reference site's code so that the answer can use Rerank
from llama_index.postprocessor.cohere_rerank import CohereRerank

4) Upload the data

# Folder for uploading local data
# Upload the text data into this folder
!mkdir -p 'data/mydata/'
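
Before moving on to the loading step, it can help to confirm that the upload actually landed in the folder. The sketch below is a small check using only the standard library; list_text_files is a hypothetical helper of mine, and it assumes the uploaded data is a .txt file.

```python
from pathlib import Path

def list_text_files(folder):
    """Return the sorted names of .txt files in folder ([] if the folder is missing)."""
    path = Path(folder)
    if not path.is_dir():
        return []
    return sorted(p.name for p in path.glob("*.txt"))

found = list_text_files("data/mydata")
if found:
    print("Ready to load:", found)
else:
    print("No .txt files in data/mydata yet; upload one before loading.")
```

If the list is empty, there is nothing for the next step to read, so upload the file first.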

5) Get the list of documents

# Get the list of documents
documents = SimpleDirectoryReader("./data/mydata/").load_data()

6) Prepare the service context and index

# Build index with input_type = 'search_document' & embedding_type="float"
llm = Cohere(model="command-nightly", api_key=COHERE_API_KEY)
embed_model = CohereEmbedding(
    cohere_api_key=COHERE_API_KEY,
    model_name="embed-english-v3.0",
    input_type="search_document",
    embedding_type="float",
)

## Prepare a service context that includes the LLM
from llama_index.core import ServiceContext
service_context = ServiceContext.from_defaults(
    llm=Cohere(api_key=COHERE_API_KEY, model="command"),
    embed_model=embed_model
    )

# Include the service context in the index
index = VectorStoreIndex.from_documents(
    documents=documents, service_context=service_context
    )

7) Rerank and query

## Get a summarized answer
## Uses the service context that includes the embedding model and the LLM
# Create a CohereRerank instance to rerank the search results
cohere_rerank = CohereRerank(api_key=COHERE_API_KEY)

# Create the query engine
query_engine = index.as_query_engine(node_postprocessors=[cohere_rerank])

# Generate the response
# Get the response to the query "...?"
response = query_engine.query("What are the names of the LLMs provided by Meta, Microsoft, and Google that represent the rise of open source LLMs ?")

8) Check the generated answer

print(response)

The LLMs provided by Meta, Microsoft, and Google that represent the rise of open source LLMs are:

1. Meta's 'Llama'
2. Microsoft's 'Phi'
3. Google's 'Gemma'

Open source LLMs (large language models) have revolutionized the field of generative AI, leading to new avenues for exploration and development. The appearance of these models has led to a shift in the landscape of AI, with increased accessibility and creative potential for users. The underlying factors that led to the rise of open-source LLMs include technological advancements in AI, the widespread adoption of cloud computing, and the growth of open-source communities. These models have had a significant impact on various fields and domains, and they have also paved the way for further advancements in AI research and development.

Changing the model

In the code under "6) Prepare the service context and index", I changed the model from command to the higher-end command-r and command-r-plus.

Answer generated by command-r

Meta provides "Llama", Microsoft offers "Phi", and Google introduces "Gemma" as their respective open-source LLMs. These models signify the breakthrough of open-source Large Language Models, paving the way for a new era in the realm of generative AI.

Answer generated by command-r-plus

Llama (Meta), Phi (Microsoft), and Gemma (Google) are the large language models that represent the rise of open-source LLMs.
