OpenAI Study Diary (#15): Reading Through LangChain's Official Multimodal RAG Sample Code

mizutory

June 23, 2024, 17:49

LangChain's official samples include a good example of multimodal RAG, so this time I read through it.

Code

The author of the sample has also posted a YouTube video walking through the code, where you can hear how they arrived at the approach used in this sample.

YouTube

Installation

Note that two things need to be installed on the host OS before running this sample code.

poppler

tesseract

I run Jupyter in a Docker container, so I had to add the following line to my Dockerfile, which installs three packages.

# Install Poppler utilities and Tesseract OCR
RUN apt-get update && apt-get install -y poppler-utils tesseract-ocr libtesseract-dev
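
To confirm from inside the container that both tools are actually available, a quick check along these lines can help. This is a minimal sketch and not part of the sample: the helper name find_missing_tools is made up here, and it assumes that the Poppler package puts the pdftoppm binary on the PATH and that Tesseract puts tesseract there.

```python
import shutil


def find_missing_tools(tools=("pdftoppm", "tesseract")):
    """Return the names of command-line tools that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("poppler and tesseract look installed")
```

Running this in a notebook cell before the sample makes a missing system package show up right away instead of partway through PDF parsing.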

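Beyond that, the sample itself installs the following two modules with pip. (These are already in the sample, so you don't need to add anything yourself, but they come up often when building RAG with LangChain, so I'm leaving the links below as a memo.)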

What I Did to Get the Sample Running

The sample does not run as-is in a few places, and at first it took me a while to figure out why.

Inside the extract_pdf_elements function, the call to partition_pdf() has extract_images_in_pdf set to False in the sample.
With that setting no images are extracted, so I had to change it to True.

import os

from unstructured.partition.pdf import partition_pdf


# Extract elements from PDF
def extract_pdf_elements(path, fname):
    """
    Extract images, tables, and chunk text from a PDF file.
    path: File path, which is used to dump images (.jpg)
    fname: File name
    """
    full_path = os.path.join(path, fname)
    return partition_pdf(
        filename=full_path,
        extract_images_in_pdf=True,  # False in the sample; True is needed to extract images
        infer_table_structure=True,
        chunking_strategy="by_title",
        max_characters=4000,
        new_after_n_chars=3800,
        combine_text_under_n_chars=2000,
        image_output_dir_path=path,
    )

The generate_img_summaries function, which creates the image summaries, is given the top-level directory containing the PDF file:

# Image summaries
img_base64_list, image_summaries = generate_img_summaries(fpath)

In practice, however, extract_pdf_elements places the extracted image files under a directory named "figures", so passing 'figures' as shown below made it work.

# Image summaries
img_dir = 'figures'
img_base64_list, image_summaries = generate_img_summaries(img_dir)
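
Why the directory matters becomes clearer when you consider what an image-summary function has to do first: walk a directory, pick up the .jpg files, and base64-encode each one. The sketch below covers only that collection step, under that assumption. collect_jpg_base64 is a hypothetical name, not the sample's function, and the call to the model that writes the summaries is left out.

```python
import base64
import os


def collect_jpg_base64(img_dir):
    """Base64-encode every .jpg directly under img_dir (hypothetical helper)."""
    encoded = {}
    for name in sorted(os.listdir(img_dir)):
        # Only files directly in img_dir are considered; subdirectories are not walked.
        if name.lower().endswith(".jpg"):
            with open(os.path.join(img_dir, name), "rb") as f:
                encoded[name] = base64.b64encode(f.read()).decode("utf-8")
    return encoded
```

A function of this shape returns nothing when the directory it is given has no .jpg files directly under it, which would be consistent with what I saw when the PDF directory was passed instead of 'figures'.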

The Core Idea of This Sample

The most important part of this sample is create_multi_vector_retriever.


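import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_core.documents import Document


def create_multi_vector_retriever(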
    vectorstore, text_summaries, texts, table_summaries, tables, image_summaries, images
):
    """
    Create retriever that indexes summaries, but returns raw images or texts
    """

    # Initialize the storage layer
    store = InMemoryStore()
    id_key = "doc_id"

    # Create the multi-vector retriever
    retriever = MultiVectorRetriever(
        vectorstore=vectorstore,
        docstore=store,
        id_key=id_key,
    )

    # Helper function to add documents to the vectorstore and docstore
    def add_documents(retriever, doc_summaries, doc_contents):
        doc_ids = [str(uuid.uuid4()) for _ in doc_contents]
        summary_docs = [
            Document(page_content=s, metadata={id_key: doc_ids[i]})
            for i, s in enumerate(doc_summaries)
        ]
        retriever.vectorstore.add_documents(summary_docs)
        retriever.docstore.mset(list(zip(doc_ids, doc_contents)))

    # Add texts, tables, and images
    # Check that text_summaries is not empty before adding
    if text_summaries:
        add_documents(retriever, text_summaries, texts)
    # Check that table_summaries is not empty before adding
    if table_summaries:
        add_documents(retriever, table_summaries, tables)
    # Check that image_summaries is not empty before adding
    if image_summaries:
        add_documents(retriever, image_summaries, images)

    return retriever

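Here, the summaries are registered in the MultiVectorRetriever's vectorstore and the original content in its docstore. There are three kinds of summary/original pairs: text, tables, and images.
This part is important enough that once you understand it, you can fairly say you understand 90% of the sample code. When I read it, I thought, "I see!"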
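
To make the idea concrete without LangChain, here is a toy sketch of the same pattern: search runs over the summaries, and a shared ID leads back to the original content. Everything in it is an illustrative assumption. ToyMultiVectorRetriever is a made-up class, and plain word overlap stands in for the embedding similarity that a real vectorstore would compute.

```python
import uuid


class ToyMultiVectorRetriever:
    """Toy stand-in: search over summaries, return the linked originals."""

    def __init__(self):
        self.summary_index = []  # (summary_text, doc_id) pairs; plays the vectorstore role
        self.docstore = {}       # doc_id -> original content; plays the docstore role

    def add_documents(self, summaries, originals):
        for summary, original in zip(summaries, originals):
            doc_id = str(uuid.uuid4())
            self.summary_index.append((summary, doc_id))
            self.docstore[doc_id] = original

    def invoke(self, query, k=1):
        query_words = set(query.lower().split())

        def overlap(entry):
            # Word overlap is a crude substitute for embedding similarity.
            return len(query_words & set(entry[0].lower().split()))

        ranked = sorted(self.summary_index, key=overlap, reverse=True)
        # The caller gets the original content, never the summary itself.
        return [self.docstore[doc_id] for _, doc_id in ranked[:k]]


retriever = ToyMultiVectorRetriever()
retriever.add_documents(
    ["bar chart of revenue growth", "table of employee headcount"],
    ["<base64 of the chart image>", "<raw table HTML>"],
)
result = retriever.invoke("which chart shows revenue growth")
```

The query matches the chart's summary, but what comes back is the stored original, here a placeholder for the base64 image. That is why the final model in the sample can be handed raw images and tables even though only text summaries were embedded.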

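There is a simpler official sample that shows how MultiVectorRetriever behaves, and the blog post below explains it, so please refer to that. If the code above did not make sense on first reading, going through the simpler official sample first should make it readable.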

Digging Deeper into the Results

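At the end of the sample, a chain is built and invoke is called on it. Calling invoke on the retriever instead of the chain and looking at that result first made the core behavior much easier to see.
The sample's author also provides a code section called Check retrieval, shown below.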

# Check retrieval
query = "Give me company names that are interesting investments based on EV / NTM and NTM rev growth. Consider EV / NTM multiples vs historical?"
docs = retriever_multi_vector_img.invoke(query, limit=6)

# We get 4 docs
len(docs)

This only shows how many docs were retrieved. If you borrow the split_image_text_types function from the same sample and print the contents of docs, you can see clearly what the retriever actually pulled into docs.

split_image_text_types(docs)

Doing this displays the image content for docs that are images and the text content for docs that are text, which deepens your understanding.
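
For reference, here is one way such a split could work. This is a minimal sketch under my own assumptions, not the sample's implementation: is_base64_image and split_by_type are hypothetical names, the retrieved docs are assumed to be plain strings, and only JPEG and PNG headers are recognized.

```python
import base64
import binascii

# Leading bytes ("magic numbers") of JPEG and PNG files.
IMAGE_SIGNATURES = (b"\xff\xd8\xff", b"\x89PNG\r\n\x1a\n")


def is_base64_image(value):
    """Return True if value decodes as base64 and starts with a known image header."""
    try:
        raw = base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        # Ordinary text (spaces, punctuation, non-ASCII) is not valid base64.
        return False
    return raw.startswith(IMAGE_SIGNATURES)


def split_by_type(docs):
    """Separate retrieved strings into base64 images and plain texts."""
    result = {"images": [], "texts": []}
    for doc in docs:
        key = "images" if is_base64_image(doc) else "texts"
        result[key].append(doc)
    return result
```

Printing the lengths of the two lists is a quick way to see how many of the retrieved docs were images versus text or tables.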

This sample uses a finance PDF, so to be honest I had no idea how well the answers actually matched the questions 💦
I'd like to try again with a PDF of somewhat simpler material.

To be continued.

If you enjoyed this article, a like or a follow would be a great encouragement!

For questions about this blog, or to ask our company (Goldrush Computing) about development work involving the OpenAI API, LLMs, or LangChain, please use the contact below.

mizutori@goldrushcomputing.com