130億パラメータのOpenLLaMAのウェブアプリの作成方法

2023年6月21日 00:06

130億パラメータのOpenLLaMAが出てきましたので試してみました。

Google Colabで実行したときのシステム使用率は以下になっております。

今回は、Google Colabで下記コードを実行しました。回答が短いと思う場合は、max_new_tokens=32という数字を大きくしてください。128や512などに。


!pip install transformers accelerate sentencepiece gradio

import torch
from transformers import LlamaTokenizer, LlamaForCausalLM
import gradio as gr

# model_path = 'openlm-research/open_llama_3b'
# model_path = 'openlm-research/open_llama_7b'
model_path = 'openlm-research/open_llama_13b'

tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map='auto',
)

def answer_question(question):
    prompt = f'Q: {question}\nA:'
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    generation_output = model.generate(
        input_ids=input_ids, max_new_tokens=32
    )
    return tokenizer.decode(generation_output[0])

iface = gr.Interface(fn=answer_question, inputs="text", outputs="text")
iface.launch(share=True)