Automatically generating narration audio for a video with the gpt-4-vision and TTS APIs

November 9, 2023, 07:52

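Introduction

Following an article from OpenAI, I tried generating narration audio for a video using the gpt-4-vision and TTS (Text-to-Speech) APIs.
GPT-4 cannot take video directly as input, but with its vision capability and 128K context window it can describe still frames taken from across the whole video in a single request. I then used the TTS API to turn the generated description into narration audio.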

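Data

I used a driving video from Berkeley DeepDrive as input.
The car drives along a road beside an airport terminal, and blue and yellow signs indicating the domestic and international terminals are visible in the background.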

Implementation

First, extract the frames from the video and encode them in base64.

import base64
import os

import cv2
import requests
from openai import OpenAI

# Initialize the OpenAI client (the API key is read from the environment)
client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

# Function to encode the image
def encode_image_to_base64(frame):
    _, buffer = cv2.imencode(".jpg", frame)
    return base64.b64encode(buffer).decode('utf-8')

# Read video and encode frames to base64
video = cv2.VideoCapture("283.mp4")
base64Frames = []
while video.isOpened():
    success, frame = video.read()
    if not success:
        break
    base64_image = encode_image_to_base64(frame)
    base64Frames.append(base64_image)
video.release()
print(len(base64Frames), "frames read.")

# Output: 300 frames read.
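Each frame is later sent as a `data:image/jpeg;base64,...` URL, and base64 text is roughly one third larger than the bytes it encodes, so the request size grows quickly with the number of frames. As a rough sketch using only the standard library, the helpers below build such a URL from raw JPEG bytes and estimate the encoded length. The names `to_data_url` and `base64_length` are hypothetical and not part of the original code.

```python
import base64
import math


def to_data_url(jpeg_bytes: bytes) -> str:
    """Wrap raw JPEG bytes in a base64 data URL (hypothetical helper)."""
    encoded = base64.b64encode(jpeg_bytes).decode("utf-8")
    return f"data:image/jpeg;base64,{encoded}"


def base64_length(num_bytes: int) -> int:
    """Length of the padded base64 text produced for num_bytes of input."""
    return 4 * math.ceil(num_bytes / 3)


if __name__ == "__main__":
    # The first three bytes of a typical JPEG file, used as a tiny sample
    print(to_data_url(b"\xff\xd8\xff"))
    # Estimated base64 size of a 300,000-byte frame (an illustrative figure)
    print(base64_length(300_000))
```

Multiplying `base64_length` by the number of frames to be sent gives a quick estimate of how large the request body will be before any API call is made.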

Once the frames are extracted, build the prompt and send the request to GPT (only a subset of the frames is sent; here, every 10th frame).

# Prepare the messages payload
PROMPT_MESSAGES = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "These are frames of a video. Create a very short voiceover script. Only include the narration."},
            # Attach every 10th frame as a base64-encoded JPEG data URL
            *[
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{x}"}}
                for x in base64Frames[0::10]
            ],
        ],
    },
]

# API call parameters
params = {
    "model": "gpt-4-vision-preview",
    "messages": PROMPT_MESSAGES,
    "max_tokens": 500,
}

# Make the API call
result = client.chat.completions.create(**params)
result.choices[0].message.content

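Result
A description of the video was generated. The text not only describes the concrete scene in the frames accurately, it also uses metaphor to add an emotional touch.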

"As we glide along the bustling airport roadway, the dance of arriving and departing vehicles unfolds. Signs overhead point wayfarers to their destinations, the blue and yellow guidance akin to airport constellations. To the right, the international terminal beckons travelers from far-off lands, while Terminal 1 awaits the familiar faces of domestic flyers. In this nexus of hellos and goodbyes, each journey is both an ending and a beginning."
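The request above uses a fixed stride of 10 (`base64Frames[0::10]`), which yields 30 of the 300 frames. For videos of other lengths, one option is to derive the stride from a cap on the number of frames. The sketch below assumes such a cap is wanted; `pick_stride` and `sample_frames` are hypothetical helper names, and the cap of 30 is only an illustrative value.

```python
import math


def pick_stride(total_frames: int, max_frames: int) -> int:
    """Smallest stride that keeps the sampled frame count <= max_frames."""
    if max_frames < 1:
        raise ValueError("max_frames must be at least 1")
    return max(1, math.ceil(total_frames / max_frames))


def sample_frames(frames: list, max_frames: int) -> list:
    """Return every stride-th frame, starting from the first one."""
    stride = pick_stride(len(frames), max_frames)
    return frames[::stride]


if __name__ == "__main__":
    # Stand-in for base64Frames: 300 placeholder items
    frames = [f"frame-{i}" for i in range(300)]
    sampled = sample_frames(frames, max_frames=30)
    print(pick_stride(len(frames), 30), len(sampled))
```

With 300 frames and a cap of 30 this reproduces the stride of 10 used in the request, so `sample_frames(base64Frames, 30)` could stand in for `base64Frames[0::10]`.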

Next, synthesize speech from the description with TTS.

import os

import requests
from IPython.display import Audio

response = requests.post(
    "https://api.openai.com/v1/audio/speech",
    headers={
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    },
    json={
        "model": "tts-1",
        "input": result.choices[0].message.content,
        "voice": "onyx",
    },
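)
# Fail early on an HTTP error instead of trying to play an error body
response.raise_for_status()

# Collect the response body into a single bytes object
audio = b""
for chunk in response.iter_content(chunk_size=1024 * 1024):
    audio += chunk

# Play the audio inline in the notebook
Audio(audio)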
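The `Audio` widget only plays the clip inside the notebook. To keep the narration as a file, the bytes can be written to disk chunk by chunk. This is a minimal sketch with a hypothetical helper name, `save_chunks`. The file name `narration.mp3` in the usage note is also an assumption, chosen because no `response_format` is set in the request and MP3 is the API's default output format.

```python
from pathlib import Path
from typing import Iterable


def save_chunks(chunks: Iterable[bytes], path: str) -> int:
    """Write byte chunks to path in order and return the bytes written."""
    written = 0
    with Path(path).open("wb") as f:
        for chunk in chunks:
            written += f.write(chunk)
    return written
```

With the code above, `save_chunks([audio], "narration.mp3")` would store the clip that was just played. Passing `response.iter_content(chunk_size=1024 * 1024)` as `chunks` writes the file without first concatenating the chunks in memory.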

Result
[Audio: generated narration]
