【AI技術を活用したビデオ編集術】英語解説を日本語で読む【2023年11月17日｜@All About AI】

2023年11月18日 09:42

この動画では、自動生成された字幕の作成プロセスが紹介されています。まず、MP4ビデオクリップからフレームが抽出され、GPT-4 Vision APIを使用してそれらのフレームを要約し、説明を生成します。次に、OpenAI TTS APIを使ってこれらの説明を音声に変換し、MP3ファイルに保存され、最終的にMP4ファイルと結合されて新しい音声オーバーが作成されます。また、AI搭載のビデオエディタであるCapCutの使用法も解説されており、これを使って簡単に高品質なビデオを作成できます。動画では、Pythonスクリプトを使った字幕生成の詳細も説明され、実際にいくつかのビデオクリップでこのスクリプトを実行し、その効果を検証しています。
公開日：2023年11月17日
※動画を再生してから読むのがオススメです。

And here we go.

さあ、行くぞ。

Grim sneaking up.

忍び寄るグリム。

Tension's thick.

緊張が走る。

Spot one, bam, headshot.

一人発見、バム、ヘッドショット。

Sees another.

もう一人発見。

Relentless.

容赦ない。

And he, oh no, he's down.

やばい、倒れた。

Fazeclan clinches it.

ファゼクランがクリンチ。

Ecstasy on their faces.

恍惚の表情。

They erupt in victory.

勝利の歓声。

Crowd's going wild.

観客は熱狂。

What an electric moment.

なんともエキサイティングな瞬間だ。

The commentary you just heard on that Counter Strike clip was autonomously generated by just inputting a video file.

さっきのカウンターストライクの解説は、ビデオファイルを入力するだけで自律的に生成されたんだ。

Let me show you how.

その方法をお見せしよう。

Let's start by looking at the flow chart of the system.

まず、システムのフローチャートを見てみよう。

Here, you can see in the top left corner, we put in an MP4 video clip.

左上にあるように、MP4ビデオクリップを入力します。

And from that video clip, we extract a set number of frames.

そのビデオクリップから、設定したフレーム数を抽出します。

This could be 30, 60, that's kind of up to you.

これは30でも60でも構いません。

These frames will get converted into the base 64 encoding.

これらのフレームはベース64エンコーディングに変換されます。

So with this encoding, we can kind of take each frame and create some sort of like a summary of all the frames we have.

このエンコーディングによって、各フレームを取り出し、すべてのフレームの要約のようなものを作成することができます。

And that summary can be used with the gp4 vision API to create a full description of all the frames we collected.

このサマリーはgp4 vision APIを使って、収集したすべてのフレームの完全な説明を作成することができる。

And this description will be passed onto the OpenAI TTS API.

この説明はOpenAI TTS APIに渡されます。

So we kind of adjust this description to be a voiceover with just some prompt engineering.

私たちはこの説明を、いくつかのプロンプトエンジニアリングでナレーションになるように調整します。

And this description from the TTS API will be saved as an MP3 file.

TTS APIからのこの説明はMP3ファイルとして保存されます。

Right?

そうでしょう？

And then we can merge the MP4 file with the MP3 file.

そしてMP4ファイルとMP3ファイルをマージします。

And we will end up with the final video with the new kind of voiceover that is created with the gp4 vision API.

そして最終的に、gp4 vision APIで作成された新しい種類のナレーションが入ったビデオが完成します。

But before I show you how this works, let's take a quick look at today's sponsor, capcut online.

しかし、この仕組みをお見せする前に、今日のスポンサーであるcapcut onlineをちょっと見てみましょう。

As a content creator, I'm always on the lookout for tools that make my workflow seem more smooth and efficient.

コンテンツクリエイターとして、私はワークフローをよりスムーズで効率的にしてくれるツールを常に探しています。

That's why I'm so excited to share capcut online's video editor.

だからこそ、capcut onlineのビデオエディターを紹介できることをとても嬉しく思っています。

Innovative features.

革新的な機能

Let's talk about capcut online's AI-powered magic tools.

capcut onlineのAI搭載マジックツールについてお話ししましょう。

This suite feature is a game changer for anyone who wants to create high-quality videos without the hassle.

このスイート機能は、手間をかけずにハイクオリティなビデオを作成したい人にとって、画期的なものです。

Whether you're a seasoned creator or just starting out, these tools simplify the editing process so you can focus more on your creativity.

ベテランのクリエイターでも、これから始めるクリエイターでも、これらのツールは編集プロセスを簡素化するので、クリエイティビティにより集中することができます。

A standard feature is capcut online script to video.

標準機能として、capcut online script to videoがあります。

Just provide a prompt and watch as capcut online writes a ready-to-use script and generates a copyright-compliant video ready for you to tweak and perfectly capture your voice and style.

プロンプトを入力するだけで、capcut onlineがすぐに使えるスクリプトを作成し、著作権に準拠したビデオを生成します。

And with capcut online's long video to shorts, you can effortlessly transform your longer content into engaging short videos, perfect for platforms like TikTok.

また、capcut onlineの長尺動画から短尺動画への変換機能を使えば、長尺コンテンツをTikTokのようなプラットフォームに最適な魅力的な短尺動画に簡単に変換することができます。

This feature is ideal for repurposing content and reaching a wider audience across different social media channels.

この機能は、コンテンツを再利用し、異なるソーシャルメディアチャンネルでより多くの視聴者にリーチするのに理想的です。

All these incredible features and more are just one click away.

これらの素晴らしい機能はすべて、ワンクリックでご利用いただけます。

So simply click the link in the description below and start using capcut online today and explore its array of AI-powered tools.

以下の説明のリンクをクリックし、今日からオンラインでcapcutを使い始め、AIを搭載したツールの数々を探求してみてください。

So creating professional-looking videos for YouTube, TikTok, or any other platform has never been easier.

YouTube、TikTok、またはその他のプラットフォーム用にプロ並みのビデオを作成するのが、かつてないほど簡単になりました。

But now let's get back to the video.

さて、動画に戻りましょう。

So the first function I wanted to show you is the extract frames function.

最初に紹介するのは、フレームを抽出する機能です。

Here, you can kind of set our frame interval.

ここでフレーム間隔を設定します。

You can see I have set it to 60.

私は60に設定しています。

Now, this means that we take one frame each second.

つまり、1秒ごとに1フレームを取り出します。

So if we have a 10-second video, we will end up with 10 frames.

つまり、10秒のビデオなら10フレームになります。

And here, you can see we have a get video duration function.

そしてここに、動画の継続時間を取得する関数があります。

This is just because we want to know how long the video is so we can adjust this in the prompt.

これは動画の長さを知りたいので、プロンプトで調整するためです。

I'm going to show you that.

これからそれをお見せします。

Here is where we encode our image to base 64 encoding.

ここで画像を64進エンコードする。

That's pretty straightforward.

これはとても簡単だ。

And here we have the get frame description.

そして、フレームの説明を取得します。

So here is kind of where we introduce the gpt-4 Vision API.

ここでgpt-4 Vision APIを紹介します。

I put the max tokens to 2,000 and I added some temperature here so we can kind of adjust it.

トークンの最大値を2,000に設定し、温度を追加しました。

I've been running with zero now, but you can set this to one and get a more creative or different output.

今はゼロにしていますが、これを1に設定すれば、よりクリエイティブで異なる出力を得ることができます。

I also put in like a server error attempt here, so it's kind of like a retry function.

サーバーエラーの試行もここに入れて、リトライ機能のようなものです。

Here is the create voiceover function.

これがナレーション作成機能です。

We use the TTS model one from OpenAI and we picked a voice named Echo.

OpenAIのTTSモデルを使い、Echoという声を選びました。

Yeah, we have some voices to pick from, not many.

そうですね、選べる声はいくつかありますが、多くはありません。

Here we use MovieP to kind of merge the audio and video like we talked about.

ここではMoviePを使って音声とビデオを統合します。

So the MP3 and the MP4.

MP3とMP4ですね。

And here you can kind of see we set our video path.

そしてここで、ビデオパスを設定します。

So here is our MP4 file.

これがMP4ファイルです。

I was too lazy to create a UI for this.

UIを作るのが面倒くさかったんだ。

Yeah, that's pretty bad, but yeah, we're just going to do it like this now.

でも、今はこうしておこう。

And then, yeah, let's just run it.

そして実行しよう。

So, I wanted to take a big look at the prompt here because we have done some kind of prompt engineering, you can call it that.

というのも、プロンプト・エンジニアリングとでも呼べるようなことをやったからだ。

You can see we have a word count.

単語数が表示されています。

So, this is how many words I want to be said in like the whole video.

これは、ビデオ全体の中で何語言って欲しいかということです。

So, let's say you say two words per second or let's say 2.5 words per second, maybe for the commentary on some esports.

例えば、1秒間に2ワード、あるいは2.5ワードとします。

So, we have the video duration times two.

つまり、動画の長さを2倍します。

Let's say the video is 10 seconds.

ビデオが10秒だとします。

We have 25 words.

文字数は25文字です。

So, we're going to print the word count.

単語数を表示します。

And here you can see we have this video is only video duration seconds long.

この動画は秒数しかありません。

So, it could be 10, 10 seconds long.

つまり、10秒、10秒の長さです。

So, make sure the voiceover must be less than word count.

ですから、ナレーションは単語数以下でなければなりません。

So, let's say 25 words, and that is kind of our initial prompt.

つまり、25ワードとします。これが最初のプロンプトのようなものです。

And then we have, uh, like the personalized prompt for each video.

そして、各ビデオにパーソナライズされたプロンプトを用意します。

So, act as an OpenAI tutorial guide in a conversational style.

会話形式で、OpenAIのチュートリアルガイドのように振る舞います。

Explain step by step of what is happening in the frames suitable for a voiceover.

ナレーションに適したフレームで、何が起きているのかを段階的に説明します。

And then we just add our, yeah, this kind of instruction prompt here at the end.

そして、最後にこのような指示プロンプトを追加します。

So, that is kind of our final prompt.

これが最終的なプロンプトです。

So, we just feed that into the VIS API model, right?

これをVISのAPIモデルに入力します。

And we get an MP3 file out and we get a video, and we kind of merge them together.

そしてMP3ファイルを取り出し、ビデオを取り出し、それらをマージします。

Uh, we can also take the original clip and just add the MP3 file if we want that.

元のクリップにMP3ファイルを追加することもできます。

So, that's a bit detail on the Python script, but, uh, yeah, let's run it on some different video clips and see what kind of cool stuff we can get.

Pythonスクリプトの詳細が少し出てきましたが、では、いろいろなビデオクリップで動かしてみて、どんなクールなものができるか見てみましょう。

So, I thought we can start up with this esports clip.

まずはesportsの映像から。

This is kind of League of Legends.

これはリーグ・オブ・レジェンドのようなものです。

So, you can see there's no sound here, it's just the image, right?

音はなく、映像だけです。

And it's some kind of player getting like a Penta kill in this game.

このゲームでペンタをキルしたプレイヤーの映像です。

So, you can kind of see all what's happening here.

だから、ここで何が起こっているのか、すべてわかると思う。

So, what I thought is we just going to take this file here.

それで、このファイルを使おうと思ったんだ。

This is called lol.MP4, so we're just going to set LOL.

これはlol.MP4というファイルなので、lolにします。

And we want to Output file to be, let's say, League of Legends 2.MP4.

出力ファイルは、例えばLeague of Legends 2.MP4とします。

And let's change up the prompt here.

そして、プロンプトを変更しましょう。

So, let's say, Act as an engaging League of Legends commentator in a short conversational style.

リーグ・オブ・レジェンドの魅力的なコメンテーターとして、短い会話形式で演じてください。

Explain what is happening in the frames suitable for a voiceover.

ナレーションに適したフレームで何が起こっているかを説明してください。

Okay, that is pretty much it.

オーケー、これだけです。

Now, let's run it and see what kind of results we can get.

では、これを実行して、どのような結果が得られるか見てみましょう。

So, you can see on the right here, now we have no extracted frames.

この右側を見てください、今は抽出されたフレームがありません。

But when we run this now, you can kind of see at the moment we get all the frames we need for this, right?

しかし、これを実行すると、必要なフレームがすべて抽出されるのがわかりますね？

And we see the word count is 35.

ワード数は35。

It's 40, 14 seconds.

40、14秒です。

We extracted 15 frames.

15フレームを抽出しました。

And now let's see what kind of, yeah, voiceover we can get there.

では、どんなナレーションができるか見てみましょう。

So, here you can kind of see the description.

説明文をご覧ください。

Does this fit?

これで合うかな？

I don't know.

どうかな。

Let's find out.

やってみましょう。

Okay, so let's take a look here.

では、ここを見てみましょう。

Oh, it's chaos in the base, a dazzling play.

ああ、ベースはカオスだ。

Double kill, they're relentless, chasing down for more.

ダブルキル、容赦なく追い詰める。

Triple kill, can they secure it?

トリプル・キル。

Quadra kill, one more for the glory.

クアドラキル、栄光のためにもう1つ。

Pentakill, unbelievable.

ペンタキル、信じられない。

Okay, that was not too bad, but you can kind of see it missed a bit on the timing here.

まあ、悪くはなかったが、ちょっとタイミングを逃したのがわかるだろう。

It kind of said Pentakill, like three seconds before it actually happens.

実際に起こる3秒前にペンタキルと表示されたんだ。

So, I noticed that sometimes happen.

だから、そういうこともあるんだ。

I don't know, I tried to like amp up the frame capture, that seems to help a bit.

フレームキャプチャーのレベルを上げてみたら、少しはマシになった。

But, uh, you can also kind of adjust the video duration, how many words you want.

でも、動画の長さや単語数を調整することもできるんだ。

So, there is need for some adjustments, but it worked pretty well, right?

だから、多少の調整は必要だけど、かなりうまくいったよ。

Uh, I think we just going to move on to the next clip.

次のクリップに進みましょう。

Okay, so the next clip I have is I just went to like OpenAI playground.

次のクリップは、OpenAIのプレイグラウンドに行ったところです。

I just clicked around, tried to create an assistant here.

クリックしまくって、アシスタントを作ろうとしたんだ。

So, I gave it like a name, I gave it some instructions, I kind of picked a model, right?

名前をつけて、指示を出して、モデルを選んで。

And then I tried to kind of run the model inside this playground window here, and that is basically the screen capture I did.

そして、このプレイグラウンド・ウィンドウの中でモデルを動かしてみました。

Now let's try to add some voice over to this.

では、これにボイスオーバーを加えてみましょう。

So we just picked act as an OpenAI tutorial guide in a conversational style, explain step by step what is happening in the frames.

OpenAIのチュートリアルガイドとして、会話のようなスタイルで、フレームの中で何が起こっているのかをステップごとに説明します。

Yeah, uh, we pick a video duration.

動画の長さを決めます。

Let's do two, maybe let's stick with 2.5.

2.5秒にしましょう。

I think that was pretty good.

なかなか良かったと思います。

Uh, yeah, I think we just going to stay on 60 frames.

このまま60フレームでいこう。

And yeah, let's run it so we can see 28 seconds, 70 words, 29 frames.

そして、28秒、70ワード、29フレームを見ることができるように実行しましょう。

Okay, okay, so let's check it out.

オーケー、オーケー、ではチェックしてみましょう。

So we go OpenAI 2 down here, welcome to the OpenAI playground.

OpenAIのプレイグラウンドへようこそ。

Here we're creating a new assistant.

ここでは新しいアシスタントを作成します。

We start by naming our project, typing in tutorial for clarity.

まず、プロジェクトに名前を付け、わかりやすくするためにtutorialと入力します。

Next, we provide instructions describing our assistant task to help with the tutorial.

次に、チュートリアルに役立つアシスタントのタスクを説明する指示を与えます。

We then select the model, choosing the latest GPT-4 known for its advanced capabilities.

次にモデルを選択し、高度な機能で知られる最新のGPT-4を選択します。

With the model set, we're ready to interact.

モデルがセットされたので、対話の準備ができた。

We type say hi, a simple prompt to test our assistant.

アシスタントをテストするための簡単なプロンプト、say hiを入力します。

And there it is.

そうです。

So it said, said hi.

そう、"say hi "だ。

It's it say it, but that's fine.

と言うのです。

Our assistant responds with high.

アシスタントはハイと返事をします。

No, it responds with hello.

いや、"こんにちは "と答える。

But perfect timing, I guess, confirming it's ready to assist.

しかし、完璧なタイミングだ。

We've successfully set up a basic OpenAI assistant in the playground.

OpenAIの基本的なアシスタントをプレイグラウンドにセットアップすることに成功しました。

Now the possibilities for interaction are endless.

今やインタラクションの可能性は無限大だ。

Let's explore what this AI can do.

このAIができることを探求してみよう。

Wow, that was great.

わあ、すごい。

It got the details wrong.

詳細が間違ってた。

Uh, it said the latest GPT-4 model.

GPT-4の最新モデルって書いてあったんだけど。

I guess that was okay, but uh, it missed some details like the name, the input, and the response.

でも、名前とか、入力とか、反応とか、そういう細かいところは間違ってた。

But that's fine.

でも大丈夫。

I think that was very good.

とても良かったと思う。

Pretty cool, right?

クールでしょ？

Uh, let's move on to kind of a different clip and let's try to change up our final or like our prompt a bit.

では、別のクリップに移って、最終的なプロンプトを少し変えてみましょう。

Now let's try, I found this clip from Netflix series.

Netflixシリーズからこんなクリップを見つけました。

So this is kind of like a nature clip.

これは自然の映像です。

So we have this Siberian tiger here walking just on the snow.

シベリアトラが雪の上を歩いています。

And I think we ended up with like a crow here at the end.

そして最後にはカラスが登場します。

So the prompt I picked here was like act as a nature documentary commentator in the style of David Attenborough in a conversational style.

そこで私が選んだプロンプトは、デヴィッド・アッテンボローのような自然ドキュメンタリーのコメンテーターとして、会話形式で演じるというものでした。

Uh, explain what is happening in the frames to before our voice over.

ボイスオーバーの前に、フレームの中で起こっていることを説明してください。

Okay, so I haven't tested this before.

オーケー、これはまだテストしたことがないんだ。

So let's see what happens.

どうなるか見てみましょう。

Okay, so let's run it.

では、実行してみましょう。

17 seconds.

17秒です。

Perfect.

完璧です。

Here comes the frames.

フレームが出てきました。

Yes, so 177 frames extracted.

はい、177フレームが抽出されました。

The word count is around 43.

単語数は43前後です。

So we don't want any more words than 43.

だから、43語以上はいらない。

That's good.

それでいい。

Okay, so here you can see the description.

では、説明文をご覧ください。

Yeah, that seems pretty accurate to be honest.

正直なところ、かなり正確だと思います。

Kind of in the David Attenborough style.

デービッド・アッテンボロー風だ。

Uh, let's have a listen.

では、聴いてみましょう。

Okay, so here it is.

よし、これだ。

Tiger 2.

タイガー2

In the hushed glow of dawn, a magnificent tiger pads softly through the snow-covered forest.

夜明けの静寂の中、雄大なトラが雪に覆われた森の中をそっと歩く。

Its stripes a stark contrast to the serene white.

その縞模様は静謐な白とは対照的だ。

A symbol of wild endurance, it moves with purpose and power.

野生の耐久力の象徴であるトラは、目的と力を持って動いている。

A solitary figure in the vast wilderness.

広大な荒野の中の孤独な姿。

Yeah, that was good, right?

ああ、よかっただろ？

It didn't mention the crow though, but I think that was pretty good.

カラスのことは書いてなかったけど、なかなか良かったと思うよ。

It kind of reminded me of like a David Attenborough documentary, right?

デヴィッド・アッテンボローのドキュメンタリーを思い出したよ。

So I really like that.

だからすごく好きなんだ。

Let me add some kind of some music and stuff on top of this and see how good it really can be.

この上に音楽とかを追加して、どれだけ良くなるか見てみよう。

Okay, so I went ahead and added some background music.

では、BGMを追加してみました。

Let's have a listen now.

それでは聴いてみましょう。

I think this was much better.

こっちの方がずっといいと思う。

In the hushed glow of dawn, a magnificent tiger pads softly through the snow-covered forest.

夜明けの静まり返った光の中、雪に覆われた森の中を雄大な虎がそっと駆けていく。

Its stripes a stark contrast to the green white.

その縞模様は緑の白とは対照的だ。

A symbol of wild endurance, it moves with purpose and power.

野生の耐久力の象徴であるトラは、目的と力を持って動いている。

A solitary figure in the vast wilderness.

広大な荒野の中の孤独な姿。

Yeah, I'm pretty happy with that.

ああ、かなり満足だ。

Pretty cool, right?

カッコイイだろ？

Okay, so that was quite a cool project, if you ask me.

さて、私に言わせれば、なかなかクールなプロジェクトだった。

If you're interested in a code, you can find a link to my YouTube membership down in the description.

もしコードに興味があるなら、説明文の下の方に僕のYouTube会員へのリンクがある。

If you become a member, you can get access to the GitHub.

会員になればGitHubにアクセスできる。

So I'm going to upload this to the GitHub.

だから、これをGitHubにアップロードするよ。

And also, don't forget to check out CapCut online.

それから、CapCutのオンライン・チェックもお忘れなく。

Thank you for tuning in.

ご視聴ありがとうございました。

Have a great day, and I'll see you in the next one.

それではまた次回。

この記事が気に入ったらサポートをしてみませんか？