I asked our president (an AI) to create a Japanese multi-turn dataset. First up: 10K

January 7, 2024, 11:27

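Qarasu14B turned out to be quite capable, so I thought I could finally use it to build the commercially usable Japanese multi-turn dataset I had long wanted. (On closer inspection, though, Qarasu uses ShareGPT data, so this looks like a no-go under OpenAI's terms. Too bad.) So, right at the start of the new year, I asked our company president (the AI supercomputer 継之助) to generate a Japanese multi-turn dataset from Wikipedia data. It took about three days, but I have collected the first 10,000 (10K) conversations.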

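Since many people asked for it, I also prepared two versions: one that includes the Wikipedia article text, and one that can be fed straight into Axolotl (a.k.a. the "wooper looper").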

Version with the Wikipedia content included

Version ready to feed into Axolotl
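As a rough illustration of the difference between the two versions, here is a minimal sketch of how one record in the 生徒/先生 (student/teacher) layout could be turned into ShareGPT-style turns, which is one of the layouts Axolotl can read. The `from`/`value` keys, the `human`/`gpt` role names, the helper name `to_sharegpt`, and the sample record are all assumptions made for this example; the published Axolotl-ready file may be laid out differently.

```python
import json


def to_sharegpt(record):
    """Convert one {"conversations": [{"生徒": ..., "先生": ...}, ...]} record
    into ShareGPT-style turns (assumed target layout, not confirmed by the post)."""
    turns = []
    for pair in record["conversations"]:
        # Skip malformed pairs that lack either side of the exchange.
        if "生徒" not in pair or "先生" not in pair:
            continue
        turns.append({"from": "human", "value": pair["生徒"]})
        turns.append({"from": "gpt", "value": pair["先生"]})
    return {"conversations": turns}


# Made-up sample record, for illustration only.
example = {
    "title": "富士山",
    "conversations": [
        {"生徒": "富士山って何ですか?", "先生": "日本で一番高い山です。"},
        {"生徒": "標高はどのくらいですか?", "先生": "3776メートルです。"},
    ],
}

converted = to_sharegpt(example)
print(json.dumps(converted, ensure_ascii=False))
```

Each student/teacher pair becomes two consecutive turns, so a record with N pairs yields 2N turns. The `title` and `body` fields are dropped here on the assumption that the training format only needs the dialogue.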

Source code

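I was saved this time by how capable Qarasu14B is, but even so it only rarely produced JSON that was usable as-is, so I retry as needed: in the script, output that fails to parse as JSON is discarded and the loop simply moves on to another article. For anyone who wants to reproduce the experiment, I am publishing the source code as well. Wikipedia articles are picked at random on every run, so whoever runs it should probably get new conversation data.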

import json
import random
import string
import sys
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the Qarasu 14B chat model in bfloat16, spread across the available devices.
model = AutoModelForCausalLM.from_pretrained("lightblue/qarasu-14B-chat-plus-unleashed",trust_remote_code=True,torch_dtype=torch.bfloat16,device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("lightblue/qarasu-14B-chat-plus-unleashed",trust_remote_code=True)

# Note: this template is not referenced anywhere else in this script.
prompt_template="USER:%s\nASSISTANT: \n```conversation=[{'生徒':'%sについて教えてください','先生':'「%s」についてだね。簡単に言うと"


pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)



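def gpt(utterance,theme):
    """Ask the model for a teacher/student conversation about `theme`.

    The start of the JSON answer is prefilled, so the model only has to continue it.
    """
    messages = [{"role": "system", "content": "あなたは役に立つAIです。ユーザの質問、依頼を正確に便利に答えてください。正解がわからない場合に「正解がわかりません」と答えてください。全てJSON形式で返します。"}]
    messages.append({"role": "user", "content": utterance})

    prompt = tokenizer.apply_chat_template(conversation=messages, add_generation_prompt=True, tokenize=False)
    #print("prompt",prompt)
    # Prefill the first question and the opening quote of the first answer.
    prompt+='\n{"conversations":[{"生徒":"%sって何ですか?","先生":"'%(theme)
    # do_sample=False means greedy decoding, so a temperature setting has no effect and is omitted.
    res = pipe(prompt, max_new_tokens=1000, do_sample=False, return_full_text=False)

    return res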

def generate_random_string(length):
    letters = string.ascii_letters
    result_str = ''.join(random.choice(letters) for i in range(length))
    return result_str

import datasets
import re

# Shuffle the Japanese Wikipedia dump so every run visits articles in a different order.
data=datasets.load_dataset("izumi-lab/wikipedia-ja-20230720",split="train").shuffle()
cnt=0

with open("qarasu_log.txt","w",encoding="utf-8") as f:
    for row in data:
        # Skip stub entries and articles that are too short to discuss.
        if row["title"]==row["text"]:
            continue
        if len(row["text"])<500:
            continue
        res=None  # keeps the error handler below safe if gpt() itself fails
        try:
            res=gpt(f"{row['title']}について書かれた以下の文章を読んで先生と生徒で会話する会話文を作りなさい。\n\n▪️{row['title']}\n\n"
                    f"{row['text'][:4096]}\n\n"
                    '上記の文章について日本語での質問文と返答文のセットを作り、```"conversations":[{"生徒":"<質問1>","先生":"<回答1>"},{"生徒":"<質問2>",'
                    '"先生":"<回答2>"},{"生徒":"<質問3>","先生":"<回答3>"},{"生徒":"<質問4>","先生":"<回答4>"}]```のように4つ以上の質問と答えを考え、'
                    "それをJSON形式で返しなさい。ダブルクォーテーションは適切にエスケープしなさい\n",row['title'])

            # Drop code fences, then re-attach the prefilled prefix to the first generated line.
            res=res[0]["generated_text"].replace("```json","").replace("```","")
            res='{"conversations":[{"生徒":"%sって何ですか?","先生":"'%(row["title"])+res.split("\n")[0]

            # Reject output that just echoes the placeholders from the instructions.
            if '{"生徒": "<質問' in res or "<回答" in res:
                continue
            record = json.loads(res)  # named `record` so the dataset variable `data` is not overwritten

            #print(record, file=sys.stderr)
            if len(record['conversations'])<1:
                continue
            record['title']=row['title']
            record['body']=row['text']
            print(json.dumps(record, ensure_ascii=False))
            f.write(json.dumps(record, ensure_ascii=False)+"\n")
            cnt+=1  # count only conversations that were actually saved
        except Exception as e:
            # Malformed output is common: log it and move on to the next article.
            print('### An error occurred: %s'%res, file=sys.stderr)
            print(e, file=sys.stderr)

        if cnt>=100000: # the goal for now is 100K conversations
            break
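The step where the prefilled prefix is glued back onto the model's continuation and then checked is the part most likely to need tweaking, so here is a minimal standalone sketch of it that can be tested without loading the model. The helper name `parse_generation` and the constant `PREFIX` are hypothetical names introduced for this example, and the extra check that every turn has both a 生徒 and a 先生 key is my addition, slightly stricter than the script above.

```python
import json

# Same prefix that is prefilled into the prompt before generation.
PREFIX = '{"conversations":[{"生徒":"%sって何ですか?","先生":"'


def parse_generation(title, generated):
    """Rebuild the JSON from the prefilled prefix plus the model's continuation.

    Returns the parsed dict, or None when the output is unusable.
    """
    # Strip Markdown code fences and keep only the first generated line.
    cleaned = generated.replace("```json", "").replace("```", "")
    candidate = PREFIX % title + cleaned.split("\n")[0]

    # Reject output that merely echoes the <質問N>/<回答N> placeholders.
    if "<質問" in candidate or "<回答" in candidate:
        return None

    try:
        record = json.loads(candidate)
    except json.JSONDecodeError:
        return None

    convs = record.get("conversations") if isinstance(record, dict) else None
    if not isinstance(convs, list) or not convs:
        return None
    # Assumption: a usable turn has both a student question and a teacher answer.
    if not all(isinstance(c, dict) and "生徒" in c and "先生" in c for c in convs):
        return None
    return record


# Made-up continuations, for illustration only.
ok = parse_generation("富士山", '日本で一番高い山です。"},{"生徒":"標高は?","先生":"3776メートルです。"}]}\n余分な行')
bad = parse_generation("富士山", "途中で切れた出力")
print(ok is not None, bad is None)
```

Returning `None` instead of raising keeps the calling loop simple: it can skip the article and move on, which matches the discard-and-continue behavior of the script.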