Table TransformerとGPT-4Vを用いたPDF内の表の解析

2024年3月12日 11:36

こんにちは。QunaSysの小澤です。
先日弊社山口のRAG紹介の記事が公開されました。

RAGは非常に有用なツールですが、PDFの論文などを扱う際には、表データを正しく読み取れない場合があります。
表の構造を適切に処理することは難しく、いくつかの改善策が提案されています。

例えば、RAGを構築するのに使われるライブラリであるLlamaIndexのドキュメントに以下のような情報があります。

このドキュメントでは表を含むデータを扱う方法として、PDFを一旦すべて画像データに変換し、画像として表の形式を保持したままGPT-4Vでデータを解析することを提案しています。

ただ、PDF1ページ分の画像をそのままGPT-4Vに解析させても精度はあまり良くないようで、後述するTable Transformerを使って表部分の画像のみ抽出してから解析を行うことで、より良い結果が得られたのことでした。

本記事では、この方法を用いてPDF内の表の解析を試してみます。

手順としては
1. PDFの全ページを画像化
2. LlamaIndexのMultiModalVectorStoreIndexを使って、手順1で作成した画像をインデックス化
3. ユーザーの質問文に基づき、MultiModalVectorStoreIndexから類似度の高い画像を取得
4 手順3で取得した画像から、表部分の画像のみを抽出
5 表部分の画像と質問文をGPT-4Vに送り回答を生成
となります。

今回表の解析に使用する論文はこちらです。

この論文には大量の化学データでファインチューニングしたモデル(LlaSMol)が、特定の化学タスクで、GPT-4より性能が高かったと書かれています。

RAGを使って、この論文内でのGPT-4にLlaSMolの比較結果を生成してみましょう。

この論文自体も興味深い内容なので、機会がありましたら別途検証結果もブログに載せたいと思っています。

今回の実行はGoogle Colabで行います。
それでは始めていきましょう。

まず、必要なライブラリをインストールします。

%pip install llama-index qdrant_client pyMuPDF tools frontend git+https://github.com/openai/CLIP.git easyocr
%pip install llama-index-llms-openai
%pip install llama-index-multi-modal-llms-openai
%pip install llama-index-vector-stores-qdrant
%pip install llama-index-embeddings-clip
%pip install llama-index-embeddings-azure-openai
%pip install llama-index-llms-azure-openai

続けてライブラリのimportをします。

import matplotlib.pyplot as plt
import matplotlib.patches as patches
from matplotlib.patches import Patch
import io
from PIL import Image, ImageDraw
import numpy as np
import csv
import pandas as pd

from torchvision import transforms

from transformers import AutoModelForObjectDetection
import torch
import openai
import os
import fitz

import qdrant_client
from llama_index.core import SimpleDirectoryReader
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.core.schema import ImageDocument

from llama_index.core.response.notebook_utils import display_source_node
from llama_index.core.schema import ImageNode

from llama_index.core.indices.multi_modal.retriever import (
    MultiModalVectorIndexRetriever,
)

論文のpdfをダウンロードしておきます。

!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2402.09391.pdf" -O "llasmol.pdf"

次に、論文をページごとに読み込んで画像に変換してフォルダに保存します。

pdf_file = "llasmol.pdf"

# Split the base name and extension
output_directory_path, _ = os.path.splitext(pdf_file)

if not os.path.exists(output_directory_path):
    os.makedirs(output_directory_path)

# Open the PDF file
pdf_document = fitz.open(pdf_file)

# Iterate through each page and convert to an image
for page_number in range(pdf_document.page_count):
    # Get the page
    page = pdf_document[page_number]

    # Convert the page to an image
    pix = page.get_pixmap()

    # Create a Pillow Image object from the pixmap
    image = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)

    # Save the image
    image.save(f"./{output_directory_path}/page_{page_number + 1}.png")

# Close the PDF file
pdf_document.close()

テキストを数値のベクトルに変換するために、embeddingを作成します。
今回は画像だけを使用するのでテキストのembeddingは不要ですが、設定しないとエラーになるので作っておきます。

from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.core import Settings

embed_model = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name="<DEPLOYMENT_NAME>",
    api_key="<API_KET>",
    azure_endpoint="<AZURE_ENDPOINT>",
    api_version="2024-02-15-preview",
)


Settings.embed_model = embed_model

画像のベクトル化にはOpenAIのCLIPが内部的に使用されます。

CLIPは、指定したテキストの意味的な類似度に基づいて、関連する画像を取得することができます。
「猫が含まれる画像」を取得するときなどに使われますが、画像に含まれるテキスト情報も認識できる能力を持っているようです。

試しに、CRSと書かれた画像を渡して、CRS,CSR,SRC,RSCのどれが類似度が高いかを解析させたところ、正しくCRSが一番スコアが高くなりました。

ベクトル化したデータはベクトルDBに格納します。
今回はRust製のベクトルデータベースのQdrantを使用します。

client = qdrant_client.QdrantClient(path="qdrant_index")

text_store = QdrantVectorStore(
    client=client, collection_name="text_collection"
)
image_store = QdrantVectorStore(
    client=client, collection_name="image_collection"
)
storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)

画像とテキストをベクトル化して格納するMultiModalVectorStoreIndexを作成します。

SimpleDirectoryReaderはディレクトリにあるデータをまとめて取り出してくれるもので、テキストデータや画像データもまとめて読み込んでくれます。

MultiModalVectorStoreIndex.from_documentsで、テキストはテキストのembeddingを、画像は画像のembeddingを行い、ベクトルDBに格納されます。

documents_images = SimpleDirectoryReader("./llasmol/").load_data()

index = MultiModalVectorStoreIndex.from_documents(
    documents_images,
    storage_context=storage_context,
)

retriever_engine = index.as_retriever(image_similarity_top_k=2)

質問文を投げてみます。

query = "Compare LlaSMol with GPT-4."
retrieval_results = retriever_engine.text_to_image_retrieve(query)

確認のため、取得した画像を表示してみます。

def plot_images(image_paths):
    images_shown = 0
    plt.figure(figsize=(16, 9))
    for img_path in image_paths:
        if os.path.isfile(img_path):
            image = Image.open(img_path)

            plt.subplot(2, 3, images_shown + 1)
            plt.imshow(image)
            plt.xticks([])
            plt.yticks([])

            images_shown += 1
            if images_shown >= 9:
                break

retrieved_images = []
for res_node in retrieval_results:
    if isinstance(res_node.node, ImageNode):
        retrieved_images.append(res_node.node.metadata["file_path"])
    else:
        display_source_node(res_node, source_length=200)

plot_images(retrieved_images)

GPT-4とLlaSMolのスコアの比較をしているページが取れているようです

続いて、画像から表の部分だけを抜き出します。
Microsoftが公開しているTable Transformerというモデルを使用します。
Table TransformerはPDFや画像から表部分だけを検出する深層学習モデルです

以下は、画像から表を抜き出すのに必要なコード群です。

class MaxResize(object):
    def __init__(self, max_size=800):
        self.max_size = max_size

    def __call__(self, image):
        width, height = image.size
        current_max_size = max(width, height)
        scale = self.max_size / current_max_size
        resized_image = image.resize(
            (int(round(scale * width)), int(round(scale * height)))
        )

        return resized_image


detection_transform = transforms.Compose(
    [
        MaxResize(800),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ]
)

structure_transform = transforms.Compose(
    [
        MaxResize(1000),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ]
)

# load table detection model
# processor = TableTransformerImageProcessor(max_size=800)
model = AutoModelForObjectDetection.from_pretrained(
    "microsoft/table-transformer-detection", revision="no_timm"
).to(device)

# load table structure recognition model
# structure_processor = TableTransformerImageProcessor(max_size=1000)
structure_model = AutoModelForObjectDetection.from_pretrained(
    "microsoft/table-transformer-structure-recognition-v1.1-all"
).to(device)


# for output bounding box post-processing
def box_cxcywh_to_xyxy(x):
    x_c, y_c, w, h = x.unbind(-1)
    b = [(x_c - 0.5 * w), (y_c - 0.5 * h), (x_c + 0.5 * w), (y_c + 0.5 * h)]
    return torch.stack(b, dim=1)


def rescale_bboxes(out_bbox, size):
    width, height = size
    boxes = box_cxcywh_to_xyxy(out_bbox)
    boxes = boxes * torch.tensor(
        [width, height, width, height], dtype=torch.float32
    )
    return boxes


def outputs_to_objects(outputs, img_size, id2label):
    m = outputs.logits.softmax(-1).max(-1)
    pred_labels = list(m.indices.detach().cpu().numpy())[0]
    pred_scores = list(m.values.detach().cpu().numpy())[0]
    pred_bboxes = outputs["pred_boxes"].detach().cpu()[0]
    pred_bboxes = [
        elem.tolist() for elem in rescale_bboxes(pred_bboxes, img_size)
    ]

    objects = []
    for label, score, bbox in zip(pred_labels, pred_scores, pred_bboxes):
        class_label = id2label[int(label)]
        if not class_label == "no object":
            objects.append(
                {
                    "label": class_label,
                    "score": float(score),
                    "bbox": [float(elem) for elem in bbox],
                }
            )

    return objects


def detect_and_crop_save_table(
    file_path, cropped_table_directory="./table_images/"
):
    image = Image.open(file_path)

    filename, _ = os.path.splitext(file_path.split("/")[-1])

    if not os.path.exists(cropped_table_directory):
        os.makedirs(cropped_table_directory)

    # prepare image for the model
    # pixel_values = processor(image, return_tensors="pt").pixel_values
    pixel_values = detection_transform(image).unsqueeze(0).to(device)

    # forward pass
    with torch.no_grad():
        outputs = model(pixel_values)

    # postprocess to get detected tables
    id2label = model.config.id2label
    id2label[len(model.config.id2label)] = "no object"
    detected_tables = outputs_to_objects(outputs, image.size, id2label)

    print(f"number of tables detected {len(detected_tables)}")

    for idx in range(len(detected_tables)):
        #   # crop detected table out of image
        cropped_table = image.crop(detected_tables[idx]["bbox"])
        cropped_table.save(f"./{cropped_table_directory}/{filename}_{idx}.png")

画像から表を抜き出して、一つのフォルダに書き出します。
ここではdata_folderフォルダに出力されます。

for file_path in retrieved_images:
    detect_and_crop_save_table(file_path)

出力された画像を見てみましょう。

こちらも正常に表だけ切り抜かれているようです。

ここまでかなり長くなってしまいました。
いよいよ最後の手順になります。
取得した表の画像と質問文をGPT-4Vに投げて、回答を生成します。

from llama_index.multi_modal_llms.azure_openai import AzureOpenAIMultiModal

image_documents = SimpleDirectoryReader("./table_images/").load_data()

azure_openai_mm_llm = AzureOpenAIMultiModal(
    engine="<engine-name>",
    api_key="<API_KEY>",
    azure_endpoint="<AZURE_ENDPOINT>",
    api_version="2024-02-15-preview",
    model="gpt-4-vision-preview",
    max_new_tokens=300,
)

response = azure_openai_mm_llm.complete(
    prompt="Compare LlaSMol with GPT-4 briefly.",
    image_documents=image_documents,
)

回答は下記のようになりました。

Llasmol and GPT-4 are both language models, 
but they have been trained and fine-tuned for different purposes. 

Llasmol appears to be specifically tuned for molecular science tasks, 
as indicated by its performance on various benchmarks related to 
molecular properties and drug discovery, 
such as ESOL, Lipophilicity, PP-BBBP, PP-ClinTox, PP-HIV, and PP-SIDER.

GPT-4, on the other hand, is a more general language model 
that can perform a wide range of tasks without specific fine-tuning. 

It shows its versatility by being able to handle one-shot and few-shot learning 
scenarios, where it is given a small number of examples to adapt to a task.

In the tables provided, Llasmol generally outperforms GPT-4 
in tasks related to molecular science, 
which suggests that its fine-tuning has made it more specialized for 
this domain. GPT-4, while not as specialized, 
still shows strong performance across different tasks, 
highlighting its adaptability as a general-purpose language model.

(日本語訳)
LlasmolとGPT-4はどちらも言語モデルですが、異なる目的のために学習され、微調整されています。
Llasmolは、ESOL、Lipophilicity、PP-BBBP、PP-ClinTox、PP-HIV、PP-SIDERなど、
分子特性と創薬に関連するさまざまなベンチマークでの性能からわかるように、
分子科学タスク向けに特別にチューニングされているようです。

一方、GPT-4はより一般的な言語モデルであり、特別な微調整を行わなくても幅広いタスクを実行することができます。
GPT-4は、タスクに適応するために少数の例を与えるワンショット学習シナリオや少数ショット学習シナリオを扱うことができ、
その多用途性を示しています。

提供された表では、Llasmolは分子科学に関連するタスクでGPT-4を一般的に上回っています。
これは、Llasmolのファインチューニングがこのドメインにより特化していることを示唆しています。
GPT-4は、それほど特化されてはいないものの、さまざまなタスクで強力な性能を示し、
汎用言語モデルとしての適応性を強調しています。

LlaSMolが分子化学に関連するタスクでGPT-4を上回っているのが認識できているようです。

続いて、表の中の数値を正しく認識できているか聞いてみます。
MC(Molecule Captioning)のMETEOR(機械翻訳等の品質を評価するための指標の一つ)の数値がどうだったか聞いてみましょう。

response = azure_openai_mm_llm.complete(
    prompt="Compare LlaSMol with GPT-4 with MC METEOR scores briefly.",
    image_documents=image_documents,
)

回答は下記になりました。

The tables provided show the performance of various language models,
including llasmol and GPT-4, on different tasks and datasets. 

The MC METEOR scores are specifically shown in the third table.

In the MC (Multiple Choice) task, llasmol models generally outperform GPT-4. 
The llasmol models have METEOR scores ranging from 0.394 to 0.452, 
while GPT-4 has a lower score of 0.193. 

This suggests that llasmol models are better at understanding 
and generating language in a way that aligns with human judgment, 
as measured by the METEOR metric.

The llasmol models show a significant improvement when instruction-tuned, 
with the llasmol (instruction-tuned) achieving a METEOR score of 0.452, 
which is the highest among all the models listed. 

This indicates that instruction tuning has a positive impact 
on the model's performance in multiple-choice tasks.

Overall, llasmol models, especially when instruction-tuned, 
demonstrate superior performance in the MC task compared to GPT-4, 
according to the METEOR scores presented.

(日本語訳)
提供された表は、llasmolとGPT-4を含む様々な言語モデルの、
異なるタスクとデータセットでの性能を示しています。MC METEORスコアは特に3番目の表に示されています。

MC（複数選択肢）タスクでは、llasmolモデルはGPT-4を概ね上回っています。
llasmolモデルのMETEORスコアは0.394から0.452で、GPT-4のスコアは0.193と低い。
これはllasmolモデルがMETEOR指標で測定される人間の判断に沿った形で言語を理解し
生成することに優れていることを示唆しています。

llasmolモデルはインストラクションチューニングされた場合に大きな改善を示し、
llasmol (インストラクションチューニング)はMETEORスコア0.452を達成し、
これはリストアップされたすべてのモデルの中で最高です。
これは、インストラクション・チューニングが多肢選択課題におけるモデルのパフォーマンスにプラスの影響を
与えることを示しています。

全体として、llasmolモデルは、特にインストラクション・チューニングされた場合、
GPT-4と比較してMCタスクで優れたパフォーマンスを示します。

数値も正しく取れているようです。
MC(Molecule Captioning)をMultiple Choiceと認識していますが、これは表には略語しかないので仕方のないところかと思います。

結論

PDFに含まれる表をRAGで扱う手法を見てきました。
結果はそれなりに良好でしたが、何度か同じ質問を試すと、正しい回答が得られないこともありました。
このあたりはGPT-4Vの精度によるものが大きそうです。

また、今回は画像だけでRAGを構築したので、PDFのテキストも組み合わせればさらなる改善が見込めるかもしれません。

一方で、ChatGPTやClaude等の画像認識ができるモデルの性能があがれば、表の抽出などの前処理を行わなくても、PDFのページ画像全体から表の情報が正確に解析できるようになる可能性があります。

このあたりの技術の進化にあわせた最適なRAGの構成は、適宜検討の必要があるかと思いました。

CRSチームでは今後も化学の研究活動のサポートに向けて調査・開発を進めて行きます。もしお困りのことがありましたらお気軽に下記までお問い合わせください！