Building a VectorDB That Loads a Large Number of Files with LangChain (7)

May 25, 2024, 22:42

Introduction

Last time, I put questions to a generative AI using three VectorDBs (Chroma, Qdrant, and FAISS), but the results were disappointing.

https://qiita.com/ogi_kimura/items/dacebc6d548af229d257

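So this time, instead of registering the file contents in the VectorDB as they are, I want to see what happens when the contents are filtered to some extent before being stored.

About the XML file format

Once again I use files from the Japan Patent Office (JPO) as sample input data. The data set also contains image files, CSV files, and so on, but as in the previous articles I build the VectorDB only from the XML files, which contain the claim text.
Because only the necessary parts of each XML file are extracted and stored in the VectorDB, I first need to understand the format of the JPO's XML files.

Let's take a look at one of the XML files.

Namespaces

A sample of the XML files used this time is shown below.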

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="../../../../../XSL/JPRegisteredPatentPublication.xsl"?>
<jppat:RegisteredPatentPublication xmlns:jpcom="http://www.jpo.go.jp/standards/XMLSchema/ST96/JPCommon"
                                   xmlns:jppat="http://www.jpo.go.jp/standards/XMLSchema/ST96/JPPatent"
                                   xmlns:com="http://www.wipo.int/standards/XMLSchema/ST96/Common"
                                   xmlns:pat="http://www.wipo.int/standards/XMLSchema/ST96/Patent"
                                   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                                   xsi:schemaLocation="http://www.jpo.go.jp/standards/XMLSchema/ST96/JPPatent ../../../../../XSD/JPRegisteredPatentPublication_V1_0.xsd"
                                   com:languageCode="ja"
                                   com:st96Version="V3_1"
                                   com:ipoVersion="JP_V1_0">
   <com:IPOfficeCode>JP</com:IPOfficeCode>
   <jppat:RegisteredPatentPublicationBibliographicData com:languageCode="ja">
      <com:IPOfficeCode>JP</com:IPOfficeCode>
      <jppat:PatentPublicationIdentification>
         <com:IPOfficeCode>JP</com:IPOfficeCode>
         <pat:PublicationNumber>7354391</pat:PublicationNumber>
         <com:PublicationDate>2023-10-02</com:PublicationDate>
      </jppat:PatentPublicationIdentification>
      <pat:PlainLanguageDesignationText>特許公報(B2)</pat:PlainLanguageDesignationText>
      <com:RegistrationDate>2023-09-22</com:RegistrationDate>
      <jppat:ApplicationIdentification>
         <com:ApplicationNumber>
            <com:ApplicationNumberText>2022166159</com:ApplicationNumberText>
         </com:ApplicationNumber>
         <pat:FilingDate>2022-10-17</pat:FilingDate>
      </jppat:ApplicationIdentification>
      <pat:InventionTitle>半導体装置</pat:InventionTitle>
      <jppat:RegisteredPatentPublicationPartyBag>
         <jppat:ApplicantsRegisteredPractitionersBag>
            <jppat:ApplicantRegisteredPractitionerBag com:sequenceNumber="1">
               <jppat:Applicant com:sequenceNumber="1">
                  <com:PartyIdentifier>000153878</com:PartyIdentifier>
                  <jpcom:Contact>
                     <com:Name>
                        <com:EntityName>株式会社半導体エネルギー研究所</com:EntityName>
・・・・・

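Immediately after the opening tag of the root element, there is an element written as `com:IPOfficeCode`.
The `com` on the left is a namespace prefix. Because the root element `jppat:RegisteredPatentPublication` declares `xmlns:com="http://www.wipo.int/standards/XMLSchema/ST96/Common"`, the source code has to replace `com` with `http://www.wipo.int/standards/XMLSchema/ST96/Common`. In other words, the tag is handled as `{http://www.wipo.int/standards/XMLSchema/ST96/Common}IPOfficeCode`.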
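To make this substitution concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The XML string is a cut-down stand-in for the JPO file written for this illustration (an assumption, not the real file); only the two namespace URIs are taken from the sample above.

```python
import xml.etree.ElementTree as ET

COM_NS = "http://www.wipo.int/standards/XMLSchema/ST96/Common"

# Hypothetical, shortened stand-in for the JPO file: root element plus one child.
SAMPLE = """<jppat:RegisteredPatentPublication
    xmlns:jppat="http://www.jpo.go.jp/standards/XMLSchema/ST96/JPPatent"
    xmlns:com="http://www.wipo.int/standards/XMLSchema/ST96/Common">
  <com:IPOfficeCode>JP</com:IPOfficeCode>
</jppat:RegisteredPatentPublication>"""

root = ET.fromstring(SAMPLE)
child = root[0]

# ElementTree reports the tag with the prefix expanded to "{namespace-uri}local-name".
print(child.tag)
print(child.text)

# The same element can also be looked up by prefix if a prefix map is supplied.
found = root.find("com:IPOfficeCode", {"com": COM_NS})
print(found is child)
```

This is why the tag names compared in the rest of the article are written in the long `{...}TagName` form rather than as `com:TagName`.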

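Selecting the main tags

There are many kinds of tags, but I guessed that their meanings are roughly as follows.

com:EntityName: name of the person or organization that made the filing
com:P: paragraph text
pat:PublicationNumber: publication number
com:PublicationDate: publication date
com:RegistrationDate: registration date
pat:ClaimText: claim text
・・・

I treat these as the main tags and store only the matching tag information in the VectorDB.
To check whether the tag obtained from each element is one of the main tags, I created a list. Having it inside the source code is a bit ugly, though.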

# Namespace-qualified tag names to extract
name_spaces_tag_names = [
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}PublicationNumber",  # pat: prefix in the sample XML
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}PublicationDate",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}RegistrationDate",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}ApplicationNumberText",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}PartyIdentifier",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}EntityName",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}PostalAddressText",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}PatentCitationText",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}PersonFullName",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}P",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}FigureReference",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}PlainLanguageDesignationText",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}FilingDate",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}InventionTitle",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}MainClassification",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}FurtherClassification",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}PatentClassificationText",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}SearchFieldText",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}ClaimText",
]
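Since the list above repeats the same two URIs many times, one way to make it less awkward is to generate the qualified names from a prefix map. The following is only a sketch: `NAMESPACES`, `qualify`, and `wanted` are names made up for this illustration, and only a few of the tags are shown.

```python
# Prefix -> namespace URI, copied from the xmlns declarations in the sample file.
NAMESPACES = {
    "com": "http://www.wipo.int/standards/XMLSchema/ST96/Common",
    "pat": "http://www.wipo.int/standards/XMLSchema/ST96/Patent",
}


def qualify(prefixed_name: str) -> str:
    """Turn 'com:P' into '{namespace-uri}P', the form ElementTree uses in el.tag."""
    prefix, local_name = prefixed_name.split(":", 1)
    return "{" + NAMESPACES[prefix] + "}" + local_name


# Only a few entries are shown; the remaining tags would be written the same way.
wanted = ["com:EntityName", "com:P", "pat:InventionTitle", "pat:ClaimText"]
name_spaces_tag_names = [qualify(name) for name in wanted]

for name in name_spaces_tag_names:
    print(name)
```

Written this way, the short names stay close to what appears in the XML file, which makes a mismatch between prefix and namespace easier to spot.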

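Getting XML elements and the nested structure

Next, I considered how to store the result of XML parsing in the VectorDB.
XML tags are nested and have parent-child relationships, which are hard to express directly in a database table, so I ended up flattening the nested structure.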


import xml.etree.ElementTree as ET

def set_element(level, trees, el):
    # Record one element as a flat dict (level is passed in but not stored)
    trees.append({"tag": el.tag, "attrib": el.attrib, "content_page": el.text})

def set_child(level, trees, el):
    # Depth-first walk: record this element, then recurse into its children
    set_element(level, trees, el)
    for child in el:
        set_child(level + 1, trees, child)

def parse_and_get_element(input_file):
    tmp_elements = []
    new_elements = []
    tree = ET.parse(input_file)
    root = tree.getroot()
    set_child(1, tmp_elements, root)
    # Keep only the main tags, grouped in the order of name_spaces_tag_names
    for name_space_tag_name in name_spaces_tag_names:
        for tmp_element in tmp_elements:
            if tmp_element["tag"] == name_space_tag_name:
                new_elements.append(tmp_element)
    return new_elements

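The program above calls `set_child` recursively and appends the elements of the tree structure to `tmp_elements`. Finally, only the elements whose `tag` matches an entry in the `name_spaces_tag_names` list are appended to `new_elements`, and that list is returned (`return new_elements`).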
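The flattening is easier to see on a small input. The sketch below is an alternative formulation, not the code used in this article: it relies on `Element.iter()`, which walks the tree depth-first in document order, instead of the hand-written recursion. The XML string, `WANTED_TAGS`, and `flatten_and_filter` are assumptions made up for the illustration. One difference to be aware of is that the result comes out in document order, whereas the program above groups the elements in the order of the tag list.

```python
import xml.etree.ElementTree as ET

PAT = "http://www.wipo.int/standards/XMLSchema/ST96/Patent"
COM = "http://www.wipo.int/standards/XMLSchema/ST96/Common"

# Hypothetical, heavily shortened stand-in for a JPO publication file.
SAMPLE = f"""<root xmlns:pat="{PAT}" xmlns:com="{COM}">
  <bib>
    <pat:InventionTitle>半導体装置</pat:InventionTitle>
    <party><com:EntityName>株式会社半導体エネルギー研究所</com:EntityName></party>
  </bib>
</root>"""

# A set gives a direct membership test instead of a nested loop over two lists.
WANTED_TAGS = {f"{{{PAT}}}InventionTitle", f"{{{COM}}}EntityName"}


def flatten_and_filter(root, wanted_tags):
    """Walk the tree in document order and keep only the wanted elements."""
    return [
        {"tag": el.tag, "attrib": dict(el.attrib), "content_page": el.text}
        for el in root.iter()
        if el.tag in wanted_tags
    ]


elements = flatten_and_filter(ET.fromstring(SAMPLE), WANTED_TAGS)
for element in elements:
    # Print only the local name (the part after "}") next to the text.
    print(element["tag"].split("}")[1], "->", element["content_page"])
```

The wrapper elements `bib` and `party` disappear from the result, which is exactly the loss of parent-child information that the flat layout accepts.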

Storing metadata in the VectorDB

Of the three VectorDBs, I chose Chroma this time.

https://qiita.com/ogi_kimura/items/d1d263ece0e23c7d7576

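The article above also briefly compares the table structures of the three VectorDBs. I chose Chroma because its table contents are the easiest to analyze.

The source code of the program that creates the VectorDB is shown below.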

import glob
import os
import xml.etree.ElementTree as ET
from dotenv import load_dotenv
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

load_dotenv()

docs = []

# Namespace-qualified tag names to extract
name_spaces_tag_names = [
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}PublicationNumber",  # pat: prefix in the sample XML
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}PublicationDate",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}RegistrationDate",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}ApplicationNumberText",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}PartyIdentifier",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}EntityName",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}PostalAddressText",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}PatentCitationText",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}PersonFullName",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}P",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Common}FigureReference",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}PlainLanguageDesignationText",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}FilingDate",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}InventionTitle",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}MainClassification",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}FurtherClassification",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}PatentClassificationText",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}SearchFieldText",
    "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}ClaimText",
]

def set_element(level, trees, el):
    trees.append({"tag" : el.tag, "attrib" : el.attrib, "content_page" :el.text})

def set_child(level, trees, el):
    set_element(level, trees, el)
    for child in el:
        set_child(level+1, trees, child)


def parse_and_get_element(input_file):
    tmp_elements = []
    new_elements = []
    tree = ET.parse(input_file)
    root = tree.getroot()
    set_child(1, tmp_elements, root)
    for name_space_tag_name in name_spaces_tag_names:
        for tmp_element in tmp_elements:
            if tmp_element["tag"] == name_space_tag_name:
                new_elements.append(tmp_element)
    return new_elements

title = ""
entryName = ""
patentCitationText = ""

# Create the splitter once, before the loop, so it is also defined for the batch step at the end
text_splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0)

files = glob.glob(os.path.join("C:\\Users\\ogiki\\JPB_2023185", "**/*.*"), recursive=True)
for file in files:
    base, ext = os.path.splitext(file)
    if ext == '.xml':
        # --- topic name ---
        topic_name = os.path.splitext(os.path.basename(file))[0]
        # --- file name ---
        print(file)

        new_elements = parse_and_get_element(file)
        for new_element in new_elements:
            text = new_element["content_page"]
            tag = new_element["tag"]
            # el.text is None when an element has no leading text; skip it to avoid a TypeError in the splitter
            if not text or not text.strip():
                continue
            title = text if tag == "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}InventionTitle" else ""
            entryName = text if tag == "{http://www.wipo.int/standards/XMLSchema/ST96/Common}EntityName" else ""
            patentCitationText = text if tag == "{http://www.wipo.int/standards/XMLSchema/ST96/Common}PatentCitationText" else ""

            documents = text_splitter.create_documents(texts=[text], metadatas=[{
                "name": topic_name, 
                "source": file, 
                "tag": tag, 
                "title": title,
                "entry_name": entryName, 
                "patent_citation_text" : patentCitationText}]
            )
            docs.extend(documents)


embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
db = Chroma(persist_directory="C:\\Users\\ogiki\\vectorDB\\local_chroma", embedding_function=embeddings)

# Process 500 documents at a time because of the token limit
intv = 500
# Stepping by intv avoids an empty final batch when len(docs) is a multiple of intv
for start in range(0, len(docs), intv):
    splitted_documents = text_splitter.split_documents(docs[start : start + intv])
    db.add_documents(splitted_documents)

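In the excerpt shown below, the program does the following:
・When `tag` == `{http://www.wipo.int/standards/XMLSchema/ST96/Patent}InventionTitle` → the text is set in the metadata field `title`
・When `tag` == `{http://www.wipo.int/standards/XMLSchema/ST96/Common}EntityName` → the text is set in the metadata field `entry_name`
・When `tag` == `{http://www.wipo.int/standards/XMLSchema/ST96/Common}PatentCitationText` → the text is set in the metadata field `patent_citation_text`
For any other tag, these three fields are left as empty strings.
I added these metadata attributes because I expect that narrowing down by attributes such as `title` or `entry_name` before asking the generative AI will produce more accurate answers.
Incidentally, I previously used text-embedding-ada-002 as the embedding model, but it turned out to be very expensive, so I switched to text-embedding-3-small.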

        for new_element in new_elements:
            text = new_element["content_page"]
            tag = new_element["tag"]
            title = text if tag == "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}InventionTitle" else ""
            entryName = text if tag == "{http://www.wipo.int/standards/XMLSchema/ST96/Common}EntityName" else ""
            patentCitationText = text if tag == "{http://www.wipo.int/standards/XMLSchema/ST96/Common}PatentCitationText" else ""

            documents = text_splitter.create_documents(texts=[text], metadatas=[{
                "name": topic_name, 
                "source": file, 
                "tag": tag, 
                "title": title,
                "entry_name": entryName, 
                "patent_citation_text" : patentCitationText}]
            )
            docs.extend(documents)
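The three conditional expressions in the excerpt can also be read as a lookup from tag to metadata key. The sketch below shows that reading with a small helper. `TAG_TO_METADATA_KEY` and `build_metadata` are hypothetical names introduced for this illustration, and the helper is meant to produce the same dictionary as the excerpt, not to replace it.

```python
COM = "{http://www.wipo.int/standards/XMLSchema/ST96/Common}"
PAT = "{http://www.wipo.int/standards/XMLSchema/ST96/Patent}"

# Tag -> metadata key that should receive the element text.
TAG_TO_METADATA_KEY = {
    PAT + "InventionTitle": "title",
    COM + "EntityName": "entry_name",
    COM + "PatentCitationText": "patent_citation_text",
}


def build_metadata(topic_name, file, tag, text):
    """Build the metadata dict for one element, as in the excerpt above."""
    metadata = {
        "name": topic_name,
        "source": file,
        "tag": tag,
        "title": "",
        "entry_name": "",
        "patent_citation_text": "",
    }
    key = TAG_TO_METADATA_KEY.get(tag)
    if key is not None:
        # Only the field that corresponds to this tag receives the text.
        metadata[key] = text
    return metadata


title_meta = build_metadata("sample", "sample.xml", PAT + "InventionTitle", "半導体装置")
paragraph_meta = build_metadata("sample", "sample.xml", COM + "P", "本文")
print(title_meta["title"], repr(title_meta["entry_name"]))
print(repr(paragraph_meta["title"]))
```

Note that with this scheme a `com:P` paragraph does not carry the title of its own patent. Its `title` field is empty, so a search filter on `title` would match only the title element itself.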

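Checking the database after execution

I ran the program and checked the database with "DB Browser for SQLite".

In the `key` column, you can see that `title`, `tag`, and other keys have been added in addition to `name` and `source`. I expect that applying a `filter` on the search side before calling the generative AI will improve the accuracy of the answers.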
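The same check can be done from Python with the standard `sqlite3` module. This sketch rests on assumptions: that the metadata lives in a table named `embedding_metadata` with a `key` column, and that the database file is `chroma.sqlite3` under the persist directory. The internal table layout depends on the Chroma version and is not a public API, so treat the names as guesses to verify in DB Browser. To keep the example runnable without a real Chroma directory, it builds an in-memory stand-in table with made-up rows.

```python
import sqlite3


def list_metadata_keys(conn):
    """Return (key, row count) pairs for every distinct metadata key, sorted by key."""
    rows = conn.execute(
        "SELECT key, COUNT(*) FROM embedding_metadata GROUP BY key ORDER BY key"
    )
    return rows.fetchall()


# Stand-in database (assumed layout, made-up rows) so the sketch runs on its own.
# For the real data, one would connect to the chroma.sqlite3 file instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE embedding_metadata (id INTEGER, key TEXT, string_value TEXT)")
conn.executemany(
    "INSERT INTO embedding_metadata VALUES (?, ?, ?)",
    [
        (1, "name", "sample"),
        (1, "tag", "InventionTitle"),
        (1, "title", "半導体装置"),
        (2, "name", "sample"),
        (2, "tag", "P"),
        (2, "title", ""),
    ],
)

keys = list_metadata_keys(conn)
for key, count in keys:
    print(key, count)
conn.close()
```

If the added keys such as `tag` and `title` show up in this listing for the real file, the metadata was stored as intended.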

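Summary

The metadata for the VectorDB (SQLite) is now set up, with `tag`, `title`, and other keys added.
Next time, I plan to apply `chainlit`, narrow down the VectorDB data, and revise the program so that the generative AI gives more accurate answers.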
