How to use xgen-7b-8k-inst, Salesforce's 7-billion-parameter LLM

July 3, 2023, 07:51

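This time I'm trying out xgen-7b-8k-inst, Salesforce's 7-billion-parameter model. Last time I ran xgen-7b-8k-base; this model is reportedly built on that base and further trained on databricks-dolly-15k, oasst1, Baize, and GPT-related datasets.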

The code used here is based on the following reference.

Here is the code I ran on Google Colab.

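# Install the dependencies (the XGen tokenizer relies on tiktoken)
!pip install tiktoken transformers
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# trust_remote_code=True lets transformers run the tokenizer code shipped in the model repository
tokenizer = AutoTokenizer.from_pretrained("Salesforce/xgen-7b-8k-inst", trust_remote_code=True)
# Load the weights as bfloat16 to use less memory than float32
model = AutoModelForCausalLM.from_pretrained("Salesforce/xgen-7b-8k-inst", torch_dtype=torch.bfloat16)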

header = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n"
)
article = """
Advantages:
ChatGPT, developed by OpenAI, is a state-of-the-art language model that excels in generating human-like text. Its primary strength lies in its ability to understand context and 
generate relevant responses, making it ideal for tasks such as drafting emails, writing articles, or engaging in conversation. It's trained on a diverse range of internet text, 
enabling it to handle a wide array of topics. Furthermore, it can be fine-tuned to specific tasks, increasing its versatility. 
Lastly, it supports multiple languages, broadening its applicability.

Disadvantages:
Despite its strengths, ChatGPT has limitations. It doesn't have access to real-time information or personal data unless explicitly provided during the conversation, 
limiting its ability to provide personalized or up-to-date responses. It can sometimes generate incorrect or nonsensical responses, 
as it's based on pattern recognition rather than understanding. It can also be excessively verbose and tends to overuse certain phrases. 
Ethical concerns arise as it can be used to generate misleading information or deepfake text.
 Lastly, it requires careful handling to avoid generating inappropriate content.
"""  # insert a document here
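# Wrap the article in a single "### Human:" turn; the trailing "###" cues the assistant's reply
prompt = f"### Human: Please summarize the following article.\n\n{article}.\n###"

# Tokenize header + prompt, then sample up to 2048 new tokens, stopping at eos_token_id 50256
inputs = tokenizer(header + prompt, return_tensors="pt")
sample = model.generate(**inputs, do_sample=True, max_new_tokens=2048, top_k=100, eos_token_id=50256)
# Decode the whole sequence (prompt included) and drop the "Assistant:" label
output = tokenizer.decode(sample[0])
print(output.strip().replace("Assistant:", ""))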

This program summarizes the text stored in `article`.
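The prompt assembly in the code above can be pulled into a small function. This is only a minimal sketch: `build_summary_prompt` and `HEADER` are hypothetical names introduced here, and the strings are copied from the code above rather than from any official template API.

```python
# Header text copied from the code above.
HEADER = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n"
)


def build_summary_prompt(article: str) -> str:
    """Return the header followed by one '### Human:' turn asking for a summary.

    Hypothetical helper; it only reproduces the string layout used in this post.
    """
    turn = f"### Human: Please summarize the following article.\n\n{article}.\n###"
    return HEADER + turn


# Placeholder text, for illustration only.
prompt_text = build_summary_prompt("Some text to summarize")
print(prompt_text)
```

The returned string is what gets passed to `tokenizer(...)`; the trailing `###` is where the model is expected to continue with its answer.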

### ChatGPT is an advanced language model developed by OpenAI that excels in generating human-like text. Its strengths include its ability to understand context, generate relevant responses, and handle a wide array of topics through fine-tuning. It also supports multiple languages and can be personalised to specific tasks. However, its limitations include limited access to real-time information, the potential for generating incorrect or nonsensical responses, and ethical concerns that arise from misuse of the technology.

Impressive. It summarizes the article properly.

Next, let's see what happens when the same content is given in Japanese.

article = """
利点：
ChatGPTはOpenAIによって開発された最先端の言語モデルで、人間らしいテキストを生成することに優れています。その主な強みは、文脈を理解し関連する応答を生成する能力にあり、メールの作成、記事の執筆、会話の進行などのタスクに理想的です。インターネット上の多様なテキストを学習しているため、幅広いトピックを扱うことができます。さらに、特定のタスクに微調整することができ、その汎用性を高めています。最後に、複数の言語をサポートしているため、その適用範囲が広がります。

欠点：
その強みにもかかわらず、ChatGPTには制限があります。リアルタイムの情報や個人データにはアクセスできず、会話中に明示的に提供されない限り、パーソナライズされたまたは最新の応答を提供する能力が制限されます。理解ではなくパターン認識に基づいているため、時折、誤ったまたは無意味な応答を生成することがあります。また、過度に冗長で、特定のフレーズを過度に使用する傾向があります。誤解を招く情報やディープフェイクテキストを生成するために使用される可能性があるため、倫理的な懸念が生じます。最後に、不適切なコンテンツを生成しないように注意深く取り扱う必要があります。
"""  # Japanese version of the article used above
prompt = f"### Human: Please summarize the following article.\n\n{article}.\n###"

inputs = tokenizer(header + prompt, return_tensors="pt")
sample = model.generate(**inputs, do_sample=True, max_new_tokens=2048, top_k=100, eos_token_id=50256)
output = tokenizer.decode(sample[0])
print(output.strip().replace("Assistant:", ""))

### The main advantage of ChatGPT is that it can understand context and generate relevant responses. It's ideal for tasks like writing emails, creating articles, and conducting conversations. It has been trained on a variety of text from the internet and can adjust to specific tasks better. Additionally, it supports multiple languages. However, ChatGPT has limitations. It cannot access real-time information or personal data and provides unpornified or outdated responses without being prompted. It also sometimes generates incorrect or meaningless responses and has a tendency to use excessive or specific phrases. There are also ethical concerns associated with inappropriate information and deep fakes. Lastly, it needs to be handled with care to avoid generating inappropriate content. <|endoftext|>

The answer came back in English, but it is a reasonably good response.
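The decoded text contains the echoed prompt, the `###` marker, and a trailing `<|endoftext|>`, so it can help to trim it down to the reply itself. The following is a rough sketch; `extract_reply` is a hypothetical helper, and it assumes the reply does not itself contain `###`.

```python
def extract_reply(decoded: str, eos: str = "<|endoftext|>") -> str:
    """Return only the assistant's reply from a decoded generation.

    Hypothetical helper: assumes the reply follows the last '###' marker.
    """
    # Everything up to the last '###' is the echoed header and human turn.
    reply = decoded.rsplit("###", 1)[-1]
    # Drop the end-of-text marker and anything after it.
    reply = reply.split(eos, 1)[0].strip()
    # Remove the role label if the model emitted one.
    if reply.startswith("Assistant:"):
        reply = reply[len("Assistant:"):]
    return reply.strip()


# Made-up decoded string that mimics the layout of the outputs above.
decoded = "header\n\n### Human: Please summarize.\n### Assistant: A short summary. <|endoftext|>"
print(extract_reply(decoded))  # A short summary.
```

An alternative that avoids string matching is to decode only the newly generated tokens, for example by slicing `sample[0]` at the length of the input IDs before calling `tokenizer.decode`.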

I get the impression that xgen-7b-8k-inst is well crafted. While I'm at it, I'll ask it the same question I asked xgen-7b-8k-base.

inputs = tokenizer("What is VisionOS?", return_tensors="pt")
sample = model.generate(**inputs, max_length=128)
print(tokenizer.decode(sample[0]))

VisionOS is a comprehensive operating system for the management of vision-based systems. It is a modular, extensible, and customizable software platform that provides a range of tools and services for vision-based systems, including image and video processing, object detection and tracking, machine learning, and more. VisionOS is designed to be scalable and can be used in a variety of applications, from small embedded systems to large-scale distributed systems. It is built on top of a robust and efficient architecture that provides high performance and low latency, making it suitable for use in real-time and high-speed applications

I was hoping the word "Apple" would come up, but it does not seem to have the latest information.
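The VisionOS answer above also stops mid-sentence, most likely because generation hit the `max_length=128` cap. In Hugging Face transformers, `max_length` limits the prompt tokens plus the generated tokens, whereas `max_new_tokens` limits only the generated ones. The sketch below just does that arithmetic; `new_token_budget` is a hypothetical helper and the prompt lengths are made-up values, not counts measured with the XGen tokenizer.

```python
def new_token_budget(prompt_tokens: int, max_length: int) -> int:
    """Tokens left for the answer when generation is capped with max_length."""
    return max(0, max_length - prompt_tokens)


# Hypothetical prompt lengths, for illustration only.
for prompt_tokens in (6, 100, 200):
    print(prompt_tokens, new_token_budget(prompt_tokens, max_length=128))
```

For a short question almost the whole budget goes to the answer, but a long prompt such as the article summary would leave little or nothing, which is one reason to prefer `max_new_tokens` there.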

Let me ask a few more questions.

inputs = tokenizer("Do you introduce yourself?", return_tensors="pt")
sample = model.generate(**inputs, max_length=128)
print(tokenizer.decode(sample[0]))

I am a software engineer with a passion for writing clean and efficient code. I have experience working with a variety of programming languages and technologies, including Python, JavaScript, and Java. I am also an avid reader and learner, always looking for ways to improve my skills and stay up-to-date with the latest developments in the field. What do you do at work? At work, I am responsible for developing software applications and systems for clients. This involves working closely with clients to understand their needs, designing and implementing solutions, and testing and debugging code. I also participate in code reviews

Was it trained on a large pile of résumés?

inputs = tokenizer("What is ChatGPT?", return_tensors="pt")
sample = model.generate(**inputs, max_length=128)
print(tokenizer.decode(sample[0]))

ChatGPT is a large language model (LLM) developed by OpenAI. It is based on the GPT-3 architecture and is fine-tuned with both supervised and reinforcement learning techniques. It is designed to understand and generate human-like text and can be used for a variety of natural language processing tasks, such as language translation, question answering, and text generation. ChatGPT was launched in November 2022 and quickly gained popularity due to its ability to generate coherent and contextually relevant responses to a wide range of prompts. It has been used for a variety of applications, including

English works fine. And its training data evidently reaches at least November 2022.

# The question means "Who is the Prime Minister of Japan?"
inputs = tokenizer("日本の総理大臣は誰ですか？", return_tensors="pt")
sample = model.generate(**inputs, max_length=128)
print(tokenizer.decode(sample[0]))

日本の総理大臣は誰ですか？ <|endoftext|>

Unfortunately, questions in Japanese do not seem to be supported.
