
Local-LLM+RAG, 1/n

RAG (Retrieval-Augmented Generation) needs no introduction as the standard way to have an LLM answer from specific source material, but the number of related techniques has become hard to keep track of, so I'm writing them down as notes.

This time it's LlamaIndex: Recursive Retriever + Node References, which retrieves over child chunks, and the Sentence Window approach, which also hands the model the text immediately before and after each hit.

RAG retrieves vectorized text by vector similarity, so when chunks are large, retrieval tends to lose precision, and when they are small, the retrieved text can lack the information needed for generation. Both techniques compensate for this trade-off.


0. Environment

OS: Windows
CPU: Intel(R) Core i9-13900KF
RAM: 128GB
GPU: RTX 4090

1. Plain RAG

Extract text from the PDF data.
The source is one of the NLP-related arXiv papers I collected for another project, "CIS2: A Simplified Commonsense Inference Evaluation for Story Prose".

import glob
from pdfminer.high_level import extract_text
from llama_index import Document

path = r"C:\Users\__\Documents\data\arxiv\NLP" # directory where the PDFs are stored
files = glob.glob(f"{path}/*.pdf")

# keep only the first 10,000 characters; note that stripping "\n" this way
# can run words together in the extracted text
text = extract_text(files[0]).replace("\n","")[:10000]
docs = [Document(text=text)]

Split the text into chunks with LlamaIndex.

from llama_index.node_parser import SimpleNodeParser
from llama_index.schema import IndexNode

node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
base_nodes = node_parser.get_nodes_from_documents(docs)

BAAI/bge-small-en-v1.5 serves as the embedding model, and HuggingFaceH4/zephyr-7b-beta as the LLM.

from llama_index.embeddings import resolve_embed_model

embed_model = resolve_embed_model("local:BAAI/bge-small-en-v1.5")

import torch
from llama_index.llms import HuggingFaceLLM
from llama_index.prompts import PromptTemplate

llm = HuggingFaceLLM(
    model_name="HuggingFaceH4/zephyr-7b-beta",
    tokenizer_name="HuggingFaceH4/zephyr-7b-beta",
    query_wrapper_prompt=PromptTemplate("<|system|>\n</s>\n<|user|>\n{query_str}</s>\n<|assistant|>\n"),
    context_window=2048,
    max_new_tokens=512,
    model_kwargs={"torch_dtype": torch.bfloat16},
    generate_kwargs={"temperature": 0.1, "do_sample":True,},
    device_map="auto",
)
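
As an optional sanity check (the prompt here is just an illustrative one, not part of the recipe), you can confirm the model loads and generates before wiring it into the pipeline:

# quick smoke test of the local LLM; .complete() returns a CompletionResponse
print(llm.complete("What is retrieval-augmented generation?").text)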

from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(
    llm=llm, embed_model=embed_model,
)

Create the index and retriever.

from llama_index import VectorStoreIndex

base_index = VectorStoreIndex(base_nodes, service_context=service_context)
base_retriever = base_index.as_retriever(similarity_top_k=3)

Check the text the retriever fetches.

from llama_index.response.notebook_utils import display_source_node

retrievals = base_retriever.retrieve(
    "What are the key concepts described in this paper?"
)
for retrieval in retrievals:
    display_source_node(retrieval, source_length=500)

Node ID: 337c4c08-3fef-4781-9b95-dc7c4e359907
Similarity: 0.6075321430143236
Text: We designed several diagnostic tasks which selectively omit sentences of the input and investigate which sentences contribute the most to paraphrasing/generation. We replicate their results, then finetune T5 models (Raffel et al., 2020) on each of our diagnostic tasks, to show the significant conflation of language understanding and generation in the original GLUCOSE T5 model. Second, we propose CIS2 (Contextual Commonsense Inference in Sentence Selection), a simplified task for more fairly evaluatin...

Node ID: d2987a03-6b98-4796-acbb-4427a0622b34
Similarity: 0.593029658839695
Text: COMET (Bosselut et al., 2019) is a transformer language model designed on top of ATOMIC relations, showing language models can encode and generalize commonsense information. However, Wang et al. (2021) show that language models struggle to perform generalizable commonsense inference across three popular CKG datasets: ConceptNet (Speer et al., 2017), TupleKB (Dalvi Mishra et al., 2017), and ATOMIC (Sap et al., 2019). They found that LMs trained on several CKGs have limited ability to transfer knowle...

Node ID: 09a351e1-51a1-4c36-a0c9-b3af1ae2e152
Similarity: 0.5904037347711903
Text: # Description Relation Text Parameter Text 1 Event that causes or enables X 2 Emotion/basic human drive that mo- >Causes/Enables> >Motivates> Story tivates X 3 Location state that enables X 4 Possession state that enables X 5 Other attributes enabling X >Enables> >Enables> >Enables> 6 Event that X causes or enables 7 An emotion that is caused by X 8 A change in location that X results in >Results in> >Results in> 9 A change of possession that X results >Causes/Enables> >Causes> in 10 Other changes in property that X...

Now for an actual query.

from llama_index.query_engine import RetrieverQueryEngine

question = "What are the key concepts described in this paper?"

query_engine_base = RetrieverQueryEngine.from_args(
    base_retriever, service_context=service_context
)
response = query_engine_base.query(question)
print(str(response))

The paper discusses the following key concepts:
1. Commonsense inference: The ability to use prior knowledge based on real-world experiences to infer what has happened or will happen.
2. Commonsense knowledge graphs (CKGs): Structured representations of everyday events and their relationships, which encode commonsense information.
3. Language models: Neural network models that can generate human-like text based on input prompts.
4. Commonsense reasoning: The use of prior knowledge to make inferences and decisions in various tasks, such as story generation and commonsense inference.
5. Diagnostic tasks: Specific tasks designed to selectively omit sentences from an input and investigate which sentences contribute the most to paraphrasing or generation.
6. Contextual commonsense inference: The ability to make commonsense inferences in the context of a story or text, which involves understanding the relationships between events and entities.
7. Storytelling: The process of creating a narrative or story, which involves generating coherent and engaging text.
8. Story generation: The use of AI systems to automatically generate stories, which involves understanding the structure and coherence of stories.
9. Story cloze test: A multiple-choice task for evaluating the quality of machine-generated story endings based on commonsense reasoning.
10. Neurosymbolic systems: AI systems that combine neural networks and symbolic reasoning to perform tasks, such as story generation and commonsense reasoning.
11. VerbNet: A lexical database of verb classes and their semantic properties, which can be used to ground neural story generation in commonsense reasoning.
12. Contextual commonsense inference in sentence selection (CIS2): A simplified task for evaluating commonsense inference in storytelling, which abstracts away the natural language generation component and focuses on understanding the relationships between events and entities in a story.

2. Setting up parent and child chunks

For the parent chunks created above (size 512), add child chunks of sizes 128 and 256.

sub_chunk_sizes = [128, 256]
sub_node_parsers = [
    SimpleNodeParser.from_defaults(chunk_size=c) for c in sub_chunk_sizes
]

all_nodes = []
for base_node in base_nodes:
    for n in sub_node_parsers:
        sub_nodes = n.get_nodes_from_documents([base_node])
        sub_inodes = [
            IndexNode.from_text_node(sn, base_node.node_id) for sn in sub_nodes
        ]
        all_nodes.extend(sub_inodes)

    original_node = IndexNode.from_text_node(base_node, base_node.node_id)
    all_nodes.append(original_node)
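
Each parent node now sits in all_nodes alongside its 128- and 256-token child chunks, with every child pointing back to the parent's node_id. A quick optional size check:

# every parent contributes two sets of child chunks plus itself
print(f"{len(base_nodes)} parent nodes -> {len(all_nodes)} index entries")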

Set up the index and retriever.

from llama_index.retrievers import RecursiveRetriever

# map every node id (parent and child) back to its node, so that
# hits on child chunks can be resolved to their parent nodes
all_nodes_dict = {n.node_id: n for n in all_nodes}

vector_index_chunk = VectorStoreIndex(
    all_nodes, service_context=service_context
)
vector_retriever_chunk = vector_index_chunk.as_retriever(similarity_top_k=3)
retriever_chunk = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever_chunk},
    node_dict=all_nodes_dict,
    verbose=True,
)
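
Before building the query engine, you can optionally check what the recursive retriever returns, in the same way as before; with verbose=True it also logs each time a hit on a child chunk is resolved to its parent node:

# child-chunk hits are followed back to their parent (512-token) nodes
retrievals = retriever_chunk.retrieve(question)
for retrieval in retrievals:
    display_source_node(retrieval, source_length=500)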

Query.

query_engine_chunk = RetrieverQueryEngine.from_args(
    retriever_chunk, service_context=service_context
)
response = query_engine_chunk.query(question)

print(str(response))

The paper discusses Contextual Commonsense Inference (CCI), which is the problem of inferring causal relations between events in a text, such as a story. The paper highlights that prior work in this area has conflated CCI with language generation, which has resulted in models that struggle with CCI in isolation. To address this, the paper reframes CCI as a classification problem and introduces a new system called CIS2, which forces the model to focus on CCI directly by providing it the original text of the story to use for understanding while having it generate only the bare minimum: indices to sentences. The paper compares the performance of CIS2 against the GLUCOSE dataset and finds that models trained on CIS2 indices achieve a higher CCI accuracy than those trained for generating full phrases. The paper also mentions the TellMeWhy dataset, which consists of questions on why characters perform their actions and their corresponding answers. The paper notes that evaluating CCI with word-matching based metrics such as BLEU is problematic and can lead to models that struggle with CCI in isolation.


3. Setting up the Sentence Window

Create the window node parser.

from llama_index.node_parser import SentenceWindowNodeParser

node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original",
)
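
As an optional check (the index 10 below is an arbitrary example), you can inspect a parsed node to see how the parser stores the embedded sentence and its surrounding window in metadata:

# "original" holds the single sentence that gets embedded;
# "window" holds it plus the 3 sentences on either side
preview_nodes = node_parser.get_nodes_from_documents(docs)
print(preview_nodes[10].metadata["original"])
print(preview_nodes[10].metadata["window"])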

Set up the index and query engine. The MetadataReplacementPostProcessor swaps each retrieved sentence for the surrounding window stored under the "window" metadata key before the context is passed to the LLM.

from llama_index.indices.postprocessor import MetadataReplacementPostProcessor

sentence_nodes = node_parser.get_nodes_from_documents(docs)
sentence_index = VectorStoreIndex(sentence_nodes, service_context=service_context)
query_engine = sentence_index.as_query_engine(
    similarity_top_k=3,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)

Query.

response = query_engine.query(question)
print(response)

The key concepts described in this paper are:
1. Commonsense inference: The ability of AI systems to use prior knowledge based on real-world experiences to infer what has happened or will happen.
2. Commonsense knowledge graphs: A formalization of the commonsense inference task as a knowledge triple, where the subject, object, and relation are nodes connected by edges.
3. Contextual commonsense inference (CCI): A task that involves predicting the object of a relation given the subject and relation in a commonsense knowledge graph.
4. GLUCOSE dataset: A dataset used for the CCI task, which includes story sentences and their corresponding causal relations.
5. Language generation: The process of generating new text based on existing text.
6. Language understanding: The ability of AI systems to comprehend and interpret human language.
7. Diagnostic tasks: Tasks designed to selectively omit sentences from an input and investigate which sentences contribute the most to paraphrasing or generation.
8. CIS2 (Contextual Commonsense Inference in Sentence Selection): A simplified task proposed in the paper to more fairly evaluate common-sense inference in storytelling, which abstracts away the natural language generation component almost entirely.
9. Heuristic: A rule of thumb or a practical method that is used to solve a problem or make a decision.
10. Causal relations: The relationships between events, emotions, locations, possessions, and other attributes that enable or result in other events or changes.
11. Storytelling: The process of telling a story, which involves sequencing events, emotions, locations, possessions, and other attributes to create a coherent and engaging narrative.
12. Plot coherence: The degree to which a story has a logical and coherent structure, with clear causal relationships between events and characters.
13. Neurosymbolic system: A type of AI system that combines neural networks and symbolic reasoning to perform tasks such as story generation and commonsense reasoning.
14. VerbNet: A lexical database of English verbs and their semantic properties, which can be used to ground neural story generation in commonsense reasoning.
15. T5 models: A type of transformer-based language model developed.
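
To see which window actually replaced a matched sentence, you can optionally inspect the source nodes of the response (the metadata keys match the ones configured on the parser above):

# the matched sentence and the window that was substituted in its place
print(response.source_nodes[0].node.metadata["original"])
print(response.source_nodes[0].node.metadata["window"])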


This time I had Zephyr answer with the key concepts of a paper.

The child-chunk setup produced a dense summary, while the Sentence Window setup produced something closer to keyword extraction with a short gloss of each term than to key concepts.

The output format could be tuned through the prompt, but even as-is this seems quite usable as a preprocessing step for surveying a collection of patents or papers.

4. References

https://docs.llamaindex.ai/en/stable/examples/retrievers/recursive_retriever_nodes.html

https://docs.llamaindex.ai/en/latest/examples/node_postprocessor/MetadataReplacementDemo.html
