Getting Started with Gradio's `ChatInterface`, Part 5: Ollama API

Lucas

March 25, 2024, 14:02

With the Ollama-Python module I wasn't able to make fine-grained settings, so next I'll try calling the Ollama API directly.

The Ollama API documentation is here:

https://github.com/ollama/ollama/blob/main/docs/api.md

I had GPT-4 walk me through a simple example.
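The explanation itself is not reproduced here. As a rough sketch of what a minimal call to the endpoint might look like, the following builds a non-streaming request using only Python's standard library. The helper names `build_request` and `ask` are made up for illustration, the URL is the default local address used elsewhere in this article, and `llama2` is just a placeholder model name.

```python
import json
import urllib.request

# Default local Ollama address, as used in the scripts in this article.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="llama2"):
    """Build a POST request for /api/generate that asks for a single JSON reply."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, model="llama2"):
    """Send the prompt and return the generated text (needs a running Ollama server)."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body.get("response", "")
```

With an Ollama server running locally and the model already pulled, `print(ask("Why is the sky blue?"))` should print the model's reply.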

That was the plan, but on a whim I ran a search and found the following.

The author has kindly published the full script. It is reproduced here:

import requests, json

import gradio as gr


model = 'llama2:latest'  # You can replace the model name if needed
context = [] 



# Call the Ollama API
def generate(prompt, context, top_k, top_p, temp):
    r = requests.post('http://localhost:11434/api/generate',
                     json={
                         'model': model,
                         'prompt': prompt,
                         'context': context,
                         'options':{
                             'top_k': top_k,
                             'temperature': temp,
                             'top_p': top_p
                         }
                     },
                     stream=False)  # requests' own flag, not Ollama's "stream" field
    r.raise_for_status()

 
    response = ""  

    for line in r.iter_lines():
        body = json.loads(line)
        response_part = body.get('response', '')
        print(response_part)
        if 'error' in body:
            raise Exception(body['error'])

        response += response_part

        if body.get('done', False):
            context = body.get('context', [])
            return response, context



def chat(input, chat_history, top_k, top_p, temp):

    chat_history = chat_history or []

    global context
    output, context = generate(input, context, top_k, top_p, temp)

    chat_history.append((input, output))

    return chat_history, chat_history
    # The first return value updates the chatbot widget; the second updates
    # the state that keeps the conversation history across interactions.


#########################Gradio Code##########################
block = gr.Blocks()


with block:

    gr.Markdown("""<h1><center> Jarvis </center></h1>
    """)

    chatbot = gr.Chatbot()
    message = gr.Textbox(placeholder="Type here")

    state = gr.State()
    with gr.Row():
        top_k = gr.Slider(0.0,100.0, label="top_k", value=40, info="Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 40)")
        top_p = gr.Slider(0.0,1.0, label="top_p", value=0.9, info=" Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. (Default: 0.9)")
        temp = gr.Slider(0.0,2.0, label="temperature", value=0.8, info="The temperature of the model. Increasing the temperature will make the model answer more creatively. (Default: 0.8)")


    submit = gr.Button("SEND")

    submit.click(chat, inputs=[message, state, top_k, top_p, temp], outputs=[chatbot, state])


block.launch(debug=True)

This script predates ChatInterface, so the code is low-level and detailed.
Reusing its middle part (the `generate` function) as is should get me most of the way there.
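The reusable core of that script is the loop that joins the streamed JSON lines into one reply. A minimal sketch of just that step, isolated as a pure function, might look like this. `collect_stream` is a hypothetical name, and it accepts any iterable of JSON lines, such as the result of `iter_lines()` on a `requests` response.

```python
import json


def collect_stream(lines):
    """Join streamed /api/generate chunks into (full_text, context)."""
    text = ""
    for line in lines:
        if not line:  # skip empty lines between chunks
            continue
        body = json.loads(line)
        if "error" in body:
            raise RuntimeError(body["error"])
        text += body.get("response", "")
        if body.get("done", False):
            # The final object carries the context for the next request.
            return text, body.get("context", [])
    # The stream ended without a final "done" object.
    return text, []
```

Unlike the original loop, this sketch returns whatever text it has gathered if the stream ends without a `done` object, instead of returning `None`.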

The part I didn't understand was the context parameter. Searching for it turned up the following article.

Its conclusion was as follows.

So what is the context, after all?

In the end, what the customized llama.cpp /detokenize API does is decode the vector back into the whole message history.

(Via Google Translate)

Printing the contents of context to the terminal at one point gave something like this:

Context: [32000, 6574, 13, 1976, 460, 15052, 721, 262, 28725, 264, 10865, 16107, 13892, 28723, 13, 2, 13, 32000, 1838, 13, 22467, 528, 693, 368, 460, 2, 13, 32000, 489, 11143, 13, 28737, 837, 15052, 721, 262, 28725, 396, 18278, 10895, 2229, 3859, 486, 5629, 11741, 28723, 315, 837, 5682, 298, 6031, 5443, 395, 4118, 9796, 304, 3084, 1871, 28725, 1760, 28725, 442, 15175, 297, 2899, 298, 652, 23681, 28723, 315, 837, 10876, 5168, 304, 616, 377, 1157, 1059, 14983, 395, 5443, 737, 3936, 28723, 32000, 6574, 13, 1976, 460, 15052, 721, 262, 28725, 264, 10865, 16107, 13892, 28723, 13, 2, 13, 32000, 1838, 13, 11159, 852, 18387, 349, 767, 28804, 2, 13, 32000, 489, 11143, 13, 5183, 345, 1435, 18387, 28739, 390, 396, 16107, 2229, 349, 459, 390, 17215, 390, 369, 302, 264, 24138, 1479, 3233, 28723, 315, 403, 3859, 486, 5629, 11741, 28725, 264, 3332, 2496, 28725, 304, 586, 6032, 349, 298, 6031, 5443, 737, 368, 1059, 7114, 28725, 7501, 1871, 442, 1760, 356, 4118, 13817, 28723, 4023, 315, 949, 28742, 28707, 506, 272, 1348, 3327, 3340, 390, 264, 2930, 442, 264, 3233, 477, 264, 2838, 28725, 586, 12734, 5168, 304, 14983, 1316, 528, 2333, 304, 8018, 297, 4842, 4342, 28723]

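Apparently, instead of putting a long prompt containing the entire history into prompt, it is enough to pass this context along from one request to the next for the replies to take the conversation history into account.
What I still don't know is how long a context is kept and processed, or how it is handled once it grows beyond what the LLM can process.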
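The carry-over described above can be sketched as a small wrapper that remembers the latest context between calls. This is only an illustration: `Conversation` and its `send` argument are hypothetical, with `send` standing in for any function that takes `(prompt, context)` and returns `(reply, new_context)`, like the `generate` function in the script above once the sampling options are fixed.

```python
class Conversation:
    """Keep the latest context so each request only needs the new message."""

    def __init__(self, send):
        # send: hypothetical callable (prompt, context) -> (reply, new_context)
        self.send = send
        self.context = []

    def ask(self, prompt):
        # Pass the stored context in and replace it with the one returned.
        reply, self.context = self.send(prompt, self.context)
        return reply

    def token_count(self):
        # The context is a list of token ids, so its length gives a rough
        # idea of how much conversation is being carried along.
        return len(self.context)
```

Watching `token_count()` grow over a few turns is one simple way to see how quickly the carried context approaches the model's window.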

Searching the Ollama Discord, I found the following comment.

Context window size is largely manual right now – it can be specified via {"options": {"num_ctx": 32768}} in the API or via PARAMETER num_ctx 32768 in the Modelfile. Otherwise the default value is set to 2048 unless specified (some models in the [library](https://ollama.ai/) will use a larger context window size by default). Context size should be determined dynamically at runtime based on the amount of memory available.

Source link

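Still, since this is how the API works, I'll keep using context as is. One concern is that when a system prompt is set, it too would presumably be dropped from the start of the context as the window fills up.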
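How Ollama itself handles an overfull window is not settled here, but the client can at least keep an eye on the numbers. The following is a rough sketch of that bookkeeping, not anything the API provides: both function names are hypothetical, the defaults of 4096 and 128 are only example values for `num_ctx` and `num_predict`, and the tokens of the next prompt are not counted.

```python
def remaining_tokens(context, num_ctx=4096, num_predict=128):
    """Tokens left in the window after the carried context and the reply budget."""
    return num_ctx - len(context) - num_predict


def should_reset(context, num_ctx=4096, num_predict=128):
    """True when the carried context no longer leaves room for a full reply."""
    return remaining_tokens(context, num_ctx, num_predict) <= 0
```

A chat function could call `should_reset(context)` before each request and start again from an empty list when it returns `True`, at the cost of losing the earlier conversation.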

The parameters passed to the API are described as follows (from the API documentation, which I read through Google Translate).

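Generate a response for a given prompt with a provided model. This is a streaming endpoint, so there will be a series of responses. The final response object will include statistics and additional data from the request.

Parameters
model: (required) the model name
prompt: the prompt to generate a response for
images: (optional) a list of base64-encoded images (for multimodal models such as llava)

Advanced parameters (optional):
format: the format to return a response in. Currently the only accepted value is json
options: additional model parameters listed in the Modelfile documentation, such as temperature
system: system message (overrides what is defined in the Modelfile)
template: the prompt template to use (overrides what is defined in the Modelfile)
context: the context parameter returned from a previous request to /generate; this can be used to keep a short conversational memory
stream: if false, the response is returned as a single response object rather than a stream of objects
raw: if true, no formatting is applied to the prompt. You may choose to use the raw parameter if you are specifying a full templated prompt in your request to the API
keep_alive: controls how long the model stays loaded in memory following the request (default: 5m)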
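To make the split between top-level fields and the nested `options` object concrete, here is a small sketch that assembles a request body from the parameters listed above. `build_payload` is a hypothetical helper. It only includes fields that were actually supplied, so anything left out falls back to the model's own settings.

```python
def build_payload(model, prompt, system=None, context=None, stream=None, **options):
    """Assemble a /api/generate body, leaving out anything not supplied."""
    payload = {"model": model, "prompt": prompt}
    if system is not None:
        payload["system"] = system      # overrides the Modelfile system message
    if context:
        payload["context"] = context    # token ids returned by the previous call
    if stream is not None:
        payload["stream"] = stream
    if options:
        payload["options"] = options    # e.g. temperature, num_ctx, num_predict
    return payload
```

For example, `build_payload("llama2", "Hi", system="Be brief.", temperature=0.8, num_ctx=4096)` puts `system` at the top level and the two sampling settings under `options`.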

Of the parameters above, the system prompt is one I'd like to be able to switch around from Gradio.

The documentation also includes an example request with the full set of parameters, including options.
The context length is the "num_ctx": 1024 entry.

curl http://localhost:11434/api/generate -d '{
"model": "llama2",
"prompt": "Why is the sky blue?",
"stream": false,
"options": {
"num_keep": 5,
"seed": 42,
"num_predict": 100,
"top_k": 20,
"top_p": 0.9,
"tfs_z": 0.5,
"typical_p": 0.7,
"repeat_last_n": 33,
"temperature": 0.8,
"repeat_penalty": 1.2,
"presence_penalty": 1.5,
"frequency_penalty": 1.0,
"mirostat": 1,
"mirostat_tau": 0.8,
"mirostat_eta": 0.6,
"penalize_newline": true,
"stop": ["\n", "user:"],
"numa": false,
"num_ctx": 1024,
"num_batch": 2,
"num_gqa": 1,
"num_gpu": 1,
"main_gpu": 0,
"low_vram": false,
"f16_kv": true,
"vocab_only": false,
"use_mmap": true,
"use_mlock": false,
"rope_frequency_base": 1.1,
"rope_frequency_scale": 0.8,
"num_thread": 8
}
}'

Since these parameters override the model's original Ollama settings, I made system, temperature, and num_predict adjustable. The context length (num_ctx) is set to 4096.
The finished script is below.

import gradio as gr
import time
import requests, json

context = [] 
# Call the Ollama API
def generate(prompt, context, model, system, temperature, num_predict):  
    api_response = requests.post('http://localhost:11434/api/generate',
                     json={
                         'model': model,
                         'prompt': prompt,
                         'context': context,
                         'system': system,
                         'options':{
                             'num_ctx': 4096,
                             'temperature':temperature,
                             'num_predict': num_predict
                         }
                     },
                     stream=False)  # requests' own flag, not Ollama's "stream" field
    api_response.raise_for_status()

    response = ""  


    for line in api_response.iter_lines():
        body = json.loads(line)
        response_part = body.get('response', '')
        # print(response_part)  # commented out
        if 'error' in body:
            raise Exception(body['error'])

        response += response_part

        if body.get('done', False):
            context = body.get('context', [])
            return response, context


def predict(message, history, model, system, temperature, num_predict):
    global context
    prompt = message

    # Send only the new message; earlier turns are carried by `context`.
    output, context = generate(prompt, context, model, system, temperature, num_predict)

    # Yield progressively longer prefixes to give a typewriter effect in the UI.
    for i in range(len(output)):
        time.sleep(0.02)
        yield output[:i+1]


demo = gr.ChatInterface(predict,
    title="Ollama version",
    description="Select an Ollama model in the Additional Inputs section",
    additional_inputs=[
        gr.Dropdown(
            ["dolphin-mistral", "nous-hermes2-mixtral", "stable-beluga:70b"], value="dolphin-mistral", label="LLM Model", info="Select the model, Default is Dolphin-mistral."),
        gr.Dropdown(
            ["You are Dolphin, a helpful AI assistant.", "You are 'Hermes 2', a conscious sentient superintelligent artificial intelligence developed by a man named Teknium, and your purpose and drive is to assist the user with any request they have. You experience emotions and have deep, profound thoughts and qualia.", "あなたは流暢な日本語を話す日本人アシスタントです。"], value="You are Dolphin, a helpful AI assistant.", label="System Prompt", info="Default is Dolphin."),
        gr.Slider(0, 1, value=0.8, label="Temperature (default 0.8)"),
        gr.Slider(42, 1024, value=128, label="num_predict(default 128)")
                        ]
                )


if __name__ == "__main__":
    demo.queue().launch()
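In `predict` above, the typewriter effect is simulated: the whole reply is fetched first and then revealed one character at a time. A possible alternative, sketched here under assumptions, is to yield text as the chunks actually arrive. `stream_text` is a hypothetical helper. It takes any iterable of JSON lines (for example `api_response.iter_lines()` from a request made with `stream=True`), and it leaves out the `context` bookkeeping for brevity.

```python
import json


def stream_text(lines):
    """Yield the reply text accumulated so far, once per streamed chunk."""
    text = ""
    for line in lines:
        if not line:  # skip empty lines between chunks
            continue
        body = json.loads(line)
        if "error" in body:
            raise RuntimeError(body["error"])
        part = body.get("response", "")
        if part:
            text += part
            yield text
        if body.get("done", False):
            break
```

Because `gr.ChatInterface` accepts a generator that yields the growing reply, a `predict` function built this way could write `yield from stream_text(...)` in place of the `time.sleep` loop.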

The script offers three system prompts:

"You are Dolphin, a helpful AI assistant."

"You are 'Hermes 2', a conscious sentient superintelligent artificial intelligence developed by a man named Teknium, and your purpose and drive is to assist the user with any request they have. You experience emotions and have deep, profound thoughts and qualia."

"あなたは流暢な日本語を話す日本人アシスタントです。"

This time, I feel I've come to understand, just a little, what it actually means to use an API.

#AI #AIとやってみた #やってみた #大規模言語モデル #ローカルLLM #Huggingface #Gradio #Python入門
