【Zeroscope: テキストからビデオ生成】英語解説を日本語で読む【2023年6月29日｜@Matt Wolfe】

2023年6月29日 23:18

テキストから動画を生成する新しいツールが登場し、その結果が注目されています。PanoHeadというプロジェクトでは、単一の画像から立体的な頭部を作成することができます。また、MotionGPTは、テキストから人間の動きを生成したり、動きをテキストで説明したり、次の動きを予測したりできます。ビデオでは、テキストから動画を生成するツールに焦点を当て、以前の選択肢であるRunwayMLやModelScopeと比較しながら、無料で利用できるZeroscopeという新しいツールを紹介します。Zeroscopeで作成された印象的な動画生成の例も示されます。
公開日：2023年6月29日
※動画を再生してから読むのがオススメです。

A brand new tool was just made available for text to video generation, and the results have been absolutely wild so far.

テキストをビデオに変換する全く新しいツールが利用可能になったばかりだが、その結果は今のところ、実に素晴らしいものだ。

I think we have a brand new standard in what text to video can actually do.

私たちは、テキストからビデオへの変換で実際にできることの、まったく新しい標準を手に入れたと思う。

In this video, I'm going to break down some of the coolest generations that I've come across and show you how you can use a tool like this yourself, totally for free.

このビデオでは、私が出会った中でも特にクールな生成例をいくつか紹介し、このようなツールを完全に無料で自分で使う方法をお見せします。

This video is going to be quicker than most of my videos, but I do want to show you a couple of other things before I get into this new text-to-video model, so let's dig in.

このビデオは、私のビデオの大半よりも早く終わるだろうが、この新しいテキスト・トゥ・ビデオ・モデルに入る前に、他にもいくつかお見せしたいことがある。それでは始めよう。

So here's some new research called PanoHead, which essentially allows you to make a 3D head based on just a single image.

これはPanoHeadと呼ばれる新しい研究で、1枚の画像から3Dの頭部を作ることができます。

You can see these are 3D generated heads right here, where it's rotating around the head.

これが3Dで生成された頭部で、頭の周りを回転しているのがわかります。

But if we scroll down, you can see in this example there's a picture of The Rock, and what it turns it into is this sort of 3D, you know, rotatable version of The Rock.

しかし、下にスクロールしてみると、この例ではザ・ロックの写真があり、それをこのような3D、つまり回転可能なザ・ロックに変えることができます。

Now, his head shape is kind of not perfect because it's sort of guessing.

さて、彼の頭の形は完璧ではないです。推測によるものなので。

And what we're seeing here with this little animation is essentially, again, you might have heard the term GAN before.

そして、この小さなアニメーションで私たちが見ているのは、基本的には、GANという言葉を聞いたことがあるかもしれませんが、その仕組みです。

It stands for generative adversarial network, and basically what it's doing is it's looking at this initial picture, it's making a picture and saying, "Does this picture look like this one?"

これは敵対的生成ネットワーク（Generative Adversarial Network）の略で、基本的にやっていることは、この最初の画像を見て、画像を作り、「この画像はあの画像に似ているか？」と問いかけることです。

No?

違う？

Okay, how about this one?

じゃあ、これはどう？

No?

違うか？

How about this one?

これはどうですか？

And then it keeps going back and forth until it gets as close as it possibly can to the original picture.

そして、元の画像にできるだけ近づくまで、これを何度も繰り返します。

Now, that's a total oversimplification, but once it figures out a close enough image, then it tries to guess essentially all angles of that image.

さて、これは完全に単純化しすぎだが、いったん十分に近い画像を見つけ出すと、その画像の基本的にすべての角度を推測しようとする。
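The back-and-forth loop described above can be sketched as a toy optimization. This is a heavily simplified illustration under stated assumptions, not PanoHead's actual code: the "generator" is a hypothetical fixed linear map, and the names `generate`, `loss`, and `invert` are made up for the example. The loop keeps nudging a latent vector until the generated output is as close as it can get to the target picture.

```python
# Toy sketch: adjust a latent vector z until generate(z) matches a target.
# Hypothetical stand-in for a real generator network; not PanoHead's code.

W = [[1.0, 0.5],
     [0.0, 2.0],
     [1.5, -1.0]]  # fixed "generator" weights: 2 latent values -> 3 "pixels"


def generate(z):
    """Produce a tiny 3-pixel 'image' from a 2-value latent vector."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in W]


def loss(image, target):
    """Squared distance between the generated image and the target."""
    return sum((a - b) ** 2 for a, b in zip(image, target))


def invert(target, steps=500, lr=0.05):
    """Gradient descent on z: 'Does this look like the target? No? Try again.'"""
    z = [0.0, 0.0]
    for _ in range(steps):
        image = generate(z)
        residual = [a - b for a, b in zip(image, target)]
        # Gradient of the squared error with respect to each latent value.
        grad = [2 * sum(W[i][j] * residual[i] for i in range(3)) for j in range(2)]
        z = [zj - lr * gj for zj, gj in zip(z, grad)]
    return z


target = generate([0.8, -0.3])  # pretend this is the photo we want to match
z_found = invert(target)
print(loss(generate(z_found), target))  # very close to zero
```

A real system would use a trained neural generator and a perceptual loss. The toy only shows the "compare, adjust, repeat" shape of the process.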

Here's an example that I came across on Twitter from @hackmans, where they're trying to generate the image on the left.

これは@hackmansのツイッターで見つけた例で、彼らは左の画像を生成しようとしている。

Once the image on the left is finally generated and looks close enough, then it generates the 3D version of it.

左の画像が最終的に生成され、十分に近く見えたら、その3Dバージョンを生成します。

Here's another one.

もうひとつはこちら。

It's generating that image on the left, and you end up with this 3D image here.

左側の画像を生成し、ここで3Dの画像が得られます。

And here's another one.

もうひとつはこちらです。

I think this one's interesting because you could tell that her hair is sort of supposed to be in a bun, but then when it guesses what the hair is supposed to look like behind it, it doesn't quite look like that bun.

これは興味深いものだと思います。彼女の髪がお団子にまとめてあるらしいことはわかりますが、後ろ側の髪がどう見えるはずかを推測させると、あまりお団子のようには見えません。

And then again, here's another example.

そして、もう1つの例です。

Now, this PanoHead is open source research that is available on GitHub.

このPanoHeadはオープンソースの研究で、GitHubで公開されています。

You can actually find it here at this URL.

このURLで見つけることができます。

I'll make sure it's linked below in the description, of course.

もちろん、下の説明欄にリンクを貼っておきます。

However, if you do want to use it, it does say here under the requirements, one to eight high-end NVIDIA GPUs.

しかし、もしこれを使いたいのであれば、要件のところに1～8個のハイエンドNVIDIA GPUと書いてあります。

So, if you have an RTX 3090 or something similar at home, you probably could use this on your home computer.

ということは、もし自宅にRTX 3090かそれに近いものがあれば、おそらく自宅のコンピューターでこれを使えるだろう。

But I have a feeling it might take a while.

しかし、時間がかかりそうな予感がする。

I did a search on both Hugging Face and replicate.com and have not found a publicly available cloud version of this yet.

Hugging Faceとreplicate.comの両方で検索してみたが、これの公開クラウド版はまだ見つからなかった。

But I imagine it's only a matter of time before we'll be able to use it in one of those cloud environments.

ただし、そうしたクラウド環境のどれかで使えるようになるのは時間の問題だと思います。

Now, this other research that I recently came across that I wanted to show off real quick is called MotionGPT: Human Motion as a Foreign Language.

さて、最近見つけた他の研究で、手短に紹介したいものがあります。それは「MotionGPT: Human Motion as a Foreign Language」と呼ばれるものです。

You can see it essentially generates text-to-motion.

これは基本的に、テキストからモーション（動き）を生成するものです。

Can you show me a person who is practicing Karate kicks?

空手のキックを練習している人を見せてくれませんか？

And it generates this text-to-motion of somebody generating karate kicks.

すると、空手のキックをしている人のモーションがテキストから生成されます。

It's also capable of motion-to-text, explaining the motion demonstrated in the little video that we're seeing over on the right here, in English.

また、モーションからテキストへの変換も可能で、この右側に表示されている小さなビデオの動きを英語で説明してくれます。

So you see the character walking around.

キャラクターが歩き回っているのがわかります。

And then the response from the computer: "A person walks in a semi-circular pattern, tiptoeing."

そして、コンピューターからの応答は「人が半円を描くように、つま先立ちで歩いている」です。

It could also predict the next movements from what it sees.

また、見たものから次の動きを予測することもできます。

And you can see it's trying to predict the next movement down in the bottom left here.

この左下で、次の動きを予測しようとしているのがわかります。

We can see some more examples here.

他にもいくつかの例があります。

A person is walking forwards but stumbles and steps back, then carries on forward.

ある人が前方に歩いているのですが、つまずいて後ずさりし、そのまま前方に進みます。

You can see they're walking forward, they kind of stumble a little bit, then continue.

彼らが前に進み、少しよろけてから続けるのが見えます。

A person moves their hands back and forth as if using a broom.

人はほうきを使っているかのように手を前後に動かします。

So, this is pretty interesting research that's coming out.

それで、これは非常に興味深い研究です。

Unfortunately, it doesn't seem like we have access to it yet.

残念ながら、私たちはまだこの研究にアクセスできていないようだ。

If we look at their GitHub repository, it explains what it does and you can read a little bit more about how it works.

GitHubのレポジトリを見れば、それが何をするものなのかが説明されているし、どのように機能するのかについても少し読むことができる。

But there's not much information about it just yet, so we don't actually know when we'll be able to use it ourselves.

しかし、まだあまり情報はありませんので、実際にいつ使用できるようになるかはわかりません。

Now, let's talk about text to video.

さて、次はテキストからビデオへの変換についてです。

Previously, if you wanted to do text to video, you'd have to use something like RunwayML.

以前は、テキストからビデオを生成する場合、RunwayMLなどを使う必要があった。

Personally, I am a huge fan of RunwayML.

個人的には、私はRunwayMLの大ファンだ。

It does generate some pretty good generations, although my fish seems to have tails on both sides of it.

私の魚は両側に尾があるように見えるが、かなり良い動画を生成してくれる。

And generating videos does cost credits, and if you're generating a decent number of videos, you will eat through credits very quickly.

そして、ビデオを生成するにはクレジットが必要で、もしあなたがまともな量のビデオを生成しているなら、クレジットをあっという間に使い果たしてしまうだろう。

The costs do add up fairly quick.

コストはかなり速く積み上がります。

Your other option, other than Gen 2, has been something like ModelScope, which you can use for free on Hugging Face.

Gen 2以外の選択肢としては、Hugging Faceで無料で使えるModelScopeのようなものがあります。

And you can see here's one that I just quickly generated that says "a monkey riding roller skates," and we get something that looks like this.

これは私がさっと生成してみたもので、プロンプトは「ローラースケートに乗った猿」、結果はこのようになります。

But if you remember from past videos where we've talked about ModelScope, they all seem to have this Shutterstock watermark across pretty much every video because clearly it was trained on Shutterstock data.

しかし、ModelScopeについて話した過去のビデオを覚えているならば、ほぼすべてのビデオにShutterstockの透かしが入っていたことを思い出すでしょう。明らかにShutterstockのデータで訓練されたからです。

But now, recently, we've been given access to Zeroscope, and this one is actually available for free over on Hugging Face.

しかし最近、私たちはZeroscopeにアクセスできるようになりました。これは実際にHugging Faceで無料で入手できます。

You can find it at this URL.

このURLで見つけることができる。

Now, it still makes fairly short generations, but as you can see from this one, it doesn't have the watermark, and it actually feels slightly more coherent than what we were seeing from ModelScope.

さて、それでもかなり短い動画しか作れないが、この例を見てわかるように、透かしは入っていないし、実際、ModelScopeで見ていたものよりも若干まとまりがあるように感じられる。

I will warn you, though: if you do want to generate something with the free version of Zeroscope over on Hugging Face, the generation time can be fairly long.

ただし、Hugging FaceにあるZeroscopeの無料版で何かを生成したい場合は、生成時間がかなり長くなることがあることをお断りしておく。

And if you're using it at peak hours, it may just not work at all and tell you that it's too busy.

また、ピーク時に使用すると、まったく機能せず、「混雑しています」と言われることもある。

"A monkey on roller skates." Submit. "Something went wrong. The application is too busy. Keep trying."

「ローラースケートを履いた猿」と入力して送信すると、「問題が発生しました。アプリケーションが混み合っています。再試行してください」と表示されます。

So apparently, there's too many people using it right now, and I can't generate another video.

というわけで、どうやら今使っている人が多すぎて、別の動画を生成することができないようだ。
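When a shared app answers "too busy, keep trying," the usual pattern is to retry with a growing pause between attempts. The sketch below is a generic illustration, not Zeroscope's or Hugging Face's API: `SpaceBusyError` and the `generate` callable are hypothetical placeholders for whatever actually submits the prompt.

```python
import time


class SpaceBusyError(Exception):
    """Hypothetical error raised when the hosted app reports it is too busy."""


def generate_with_retry(generate, prompt, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call generate(prompt), waiting longer after each 'too busy' failure."""
    for attempt in range(max_attempts):
        try:
            return generate(prompt)
        except SpaceBusyError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

Passing `sleep` as a parameter makes it possible to try the function without actually waiting, for example by recording the requested delays in a list.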

However, you can duplicate the space if you'd like to.

しかし、そのスペースを複製することは可能だ。

And the recommended hardware is an NVIDIA A10G, which will cost you about three dollars and fifteen cents per hour.

そして、推奨されるハードウェアはNVIDIA A10Gで、1時間あたり約3ドル15セントかかる。

However, using this method, it only takes about one minute to generate a video, less than a minute, maybe 55 seconds or so to generate a video.

しかし、この方法を使えば、ビデオを生成するのにかかる時間は約1分、1分もかからない、55秒かそこらだ。

For three dollars and fifteen cents per hour, you could theoretically generate anywhere from 50 to 60 videos if you are really fast at prompting.

時間あたり3ドルと15セントで、プロンプトを出すのが本当に速ければ、理論的には50本から60本のビデオを作ることができる。
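The arithmetic behind that estimate can be checked with a quick sketch. The hourly rate and the roughly 55-second generation time come from the video; the 10 seconds of prompt-typing overhead per video is an assumption added for illustration.

```python
HOURLY_RATE_USD = 3.15   # quoted A10G price per hour
SECONDS_PER_VIDEO = 55   # rough generation time mentioned in the video


def videos_per_hour(seconds_per_video, overhead_seconds=0):
    """Whole videos that fit in one hour, with optional per-video overhead."""
    return 3600 // (seconds_per_video + overhead_seconds)


def cost_per_video(hourly_rate, videos):
    """Average cost of one video when the hour is fully used."""
    return hourly_rate / videos


best_case = videos_per_hour(SECONDS_PER_VIDEO)        # no time lost between prompts
with_typing = videos_per_hour(SECONDS_PER_VIDEO, 10)  # assume ~10 s to type each prompt
print(best_case, round(cost_per_video(HOURLY_RATE_USD, best_case), 3))      # 65 0.048
print(with_typing, round(cost_per_video(HOURLY_RATE_USD, with_typing), 3))  # 55 0.057
```

Under these assumptions a video costs roughly five to six cents, which lines up with the 50 to 60 videos per hour mentioned above.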

I duplicated the space here myself, so we'll play with this in a second.

ここではスペースを自分で複製していますので、これをすぐに試してみましょう。

But before we do, I want to show you some of the cooler generations that I've come across so far, just to show you what this is capable of.

しかし、それを行う前に、これまでに出会った中で最もすごい生成物のいくつかを紹介したいと思います。これが何ができるのかをお見せします。

So this video is from Pharma psychotic here, and it's this really cool generation of this, like, robot cat that has lasers or guns or something.

この動画はPharma psychoticのもので、レーザーか銃か何かを持ったロボット猫のような、とてもクールな生成例です。

And I just love how this one came out.

この仕上がりが本当に気に入っています。

It's so cool.

とてもクールだ。

I mean, you're not getting something like this out of ModelScope.

ModelScopeからこんなものは出てこないよ。

Here's another one I came across from Spencer Sterling.

スペンサー・スターリングから見つけた別の作品です。

This, like, this weird underwater creature sort of scenario.

これは、この奇妙な水中生物のようなシナリオです。

But the colors and the sort of definition, the quality to it, just seems to be so much better than what we were getting out of ModelScope.

しかし、色合いや精細さ、品質は、ModelScopeで得られていたものよりもはるかに優れているように思えます。

And quite honestly, although I do love Runway and I love all their suite of tools and I love what you can get out of Gen 2, I think what we're getting out of Zeroscope right now is actually a little bit better.

正直なところ、僕はRunwayが大好きだし、彼らのツール群も大好きだし、Gen 2から得られるものも大好きなんだけど、今僕らがZeroscopeから得ているものは、実際にはもう少し優れていると思うんだ。

Oh, this is probably not the best example because this is a bunch of creepy-looking sea monsters or something.

ああ、これはおそらく最良の例ではありません。これは何か怖いような海の怪物のようなものです。

Here's another one that I really enjoyed that I came across from Vania.

もうひとつ、Vaniaさんの投稿で見つけて、とても気に入ったものがあります。

But this is like this celebration with fireworks and everybody cheering.

でも、これは花火とみんなの歓声があるお祝いのようなもの。

And then we've got these sort of psychedelic visuals that just blow me away.

そして、こういったサイケデリックなビジュアルもあって、本当に圧倒されます。

I love the colors and I love the definition of these videos.

私はこの色彩が大好きだし、この映像の鮮明さが大好きなんだ。

Here's one I came across from Lyle.

こちらはLyleさんから見つけたものです。

I can actually play the music in this one because the music was generated with MusicGen.

この映像の音楽はMusicGenで生成されたものなので、実際に音楽を流すことができます。

But this one has that sort of painterly style.

でも、これは絵画的なスタイルだね。

They almost look like they can be like Vincent van Gogh paintings that came to life.

まるでヴィンセント・ファン・ゴッホの絵に命が吹き込まれたかのようだ。

And this was generated with Zeroscope.

これはZeroscopeで生成したものだ。

Here's one that I came across from rupe renisto.

こちらはrupe renistoさんの投稿で見つけたものです。

I'm sorry if I mispronounced your name.

お名前を間違えていたらごめんなさい。

And I think it's supposed to be Jerry Seinfeld.

これはジェリー・サインフェルドのつもりなんだと思います。

It's so, is it a good thought?

それは、いい考えですか？

Isn't it strange how socks go into the washing machine as a pair and come out single?

靴下がペアで洗濯機に入って、シングルで出てくるのって不思議じゃない？

Men and women often seem like they're from different planets.

男と女って違う惑星から来たみたいに見えることがよくある。

I just think it's hilarious.

私はそれが滑稽だと思う。

Obviously, these aren't fooling anybody into thinking this is an actual real video of people.

当然ながら、これを見て本物の人間を撮った映像だとだまされる人はいません。

And the images sort of blur together inside of the video.

映像の中では、映像がブラーとなって混ざり合っているんだ。

But there's something just fun and interesting about watching these videos knowing that AI generated these videos.

しかし、AIがこれらの動画を生成したと知りながら、これらの動画を見るのは何か楽しくて面白い。

And then getting this, you know, weird kind of borderline creepy result as you watch it.

そして、見ているうちに、奇妙な、境界線上の不気味な結果を得る。

Here's another one that I came across from three deal over on Reddit, inside of the AI video subreddit.

RedditのAIビデオのサブレディットで、three dealさんの投稿から見つけた別のものを紹介しよう。

You can see we've got different characters walking.

さまざまなキャラクターが歩いているのがわかるだろう。

You've got the Knight, you've got the soldiers, you've got a robot, got a different style robot.

騎士がいて、兵士がいて、ロボットがいて、別のスタイルのロボットもいます。

It's just really cool.

本当にクールだよ。

If you're wondering how they got these videos that are longer than three seconds, they're just generating a bunch of different videos and pushing them together.

どうやって3秒以上の動画を作っているのかと思ったら、いろいろな動画を生成して、それを組み合わせているだけなんだ。
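The video does not say which tool these creators used to join their clips. One common way to do it, shown here as an assumption rather than their actual workflow, is ffmpeg's concat demuxer. The sketch only builds the list file contents and the command; the clip file names are hypothetical.

```python
import shlex


def concat_list(clip_paths):
    """Build the text for an ffmpeg concat-demuxer list file, one clip per line."""
    return "".join(f"file '{path}'\n" for path in clip_paths)


def concat_command(list_file, output_path):
    """ffmpeg command that joins the listed clips without re-encoding."""
    return ["ffmpeg", "-f", "concat", "-safe", "0", "-i", list_file, "-c", "copy", output_path]


clips = ["clip_01.mp4", "clip_02.mp4", "clip_03.mp4"]  # hypothetical file names
print(concat_list(clips), end="")
print(shlex.join(concat_command("clips.txt", "combined.mp4")))
```

Stream copy (`-c copy`) only works cleanly when every clip shares the same codec and resolution, which is typically the case for clips produced by the same model with the same settings. Paths containing single quotes would need escaping in the list file.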

This one, they may have even just generated one video of, you know, a monkey walking or a person walking and then run it through something like Gen 1 to change what the image looks like.

この動画は、猿が歩いている動画や人が歩いている動画を1つだけ生成して、それをGen 1のようなものに通して、映像の見え方を変えているのかもしれない。

I'm not totally sure how they achieved this effect.

どのようにしてこの効果を実現したのか、はっきりとはわかりません。

And here's Zeroscope.

これがZeroscopeです。

Here's what it looks like when you use it inside of Hugging Face.

Hugging Faceの中で使うとこんな感じです。

I did duplicate the space so that I could generate whatever I want and do it fairly quickly.

私はスペースを複製したので、自分が望むものを生成し、それをかなり速く行うことができました。

So I generated a monkey on roller skates.

そこで、ローラースケートを履いた猿を生成してみた。

Here's this one's version of a monkey on roller skates.

これがローラースケートを履いた猿のバージョンです。

A little bit more cartoony.

もう少し漫画っぽいです。

It didn't try to go for that super realism.

超リアルさを追求したわけではない。

Let's try to recreate some of the other ones we saw earlier.

前に見た他のもののいくつかを再現してみましょう。

Colorful underwater sea life.

カラフルな水中生物。

You can see it's estimating about 53 seconds to generate this video.

このビデオを生成するのに約53秒かかると見積もられているのがわかるだろう。

And here's what we got out of that.

そしてこれがその結果だ。

If you remember my earlier Gen 2 generation, that one looked a little bit less like a fish than this one, if I'm being honest.

先ほどのGen 2での生成を覚えているなら、正直に言って、あちらはこれほど魚らしくは見えなかった。

Let's do a swimming octopus in a vibrant blue ocean.

鮮やかな青い海の中を泳ぐタコをやってみよう。

This one's gonna take about 51 seconds.

これは約51秒かかります。

And here's what we get out of that one.

それによって得られるものはこちらです。

Not bad.

悪くない。

I mean, you definitely know what it is.

つまり、何なのかはちゃんとわかる。

Elon Musk wrestling with Mark Zuckerberg.

イーロン・マスクがマーク・ザッカーバーグと格闘。

And here's what that looks like.

それがどのように見えるかをここに示します。

I don't know what's going on.

何が起こっているのかわからない。

Now I actually took a whole bunch of generations of Elon Musk and Mark Zuckerberg fighting and this was the result.

実際に、イーロン・マスクとマーク・ザッカーバーグが戦う動画をたくさん生成してみたところ、こんな結果になりました。

And that's exactly how I imagine it going down too.

私が想像している展開も、まさにこんな感じです。

So that's called Zeroscope.

これがZeroscopeと呼ばれるものです。

Again, you can use it 100% for free on Hugging Face if you are patient.

繰り返しになりますが、気長に待てるなら、Hugging Faceで100%無料で使えます。

I haven't personally found a way to install it on your own computer and run it locally, although that's not to say there isn't a way.

個人的には、自分のコンピューターにインストールしてローカルで実行する方法は見つかっていませんが、方法がないというわけではありません。

I just personally haven't found it yet.

ただ、個人的にはまだ見つけていない。

So right now your best option is either using it for free on Hugging Face but waiting, or duplicating the space, where you can generate about a video a minute.

だから、今のところベストな選択肢は、Hugging Faceで無料で使って待つか、スペースを複製して1分に約1本のペースで動画を生成するかのどちらかだ。

And there you have it.

それで、以上です。

There's a new text-to-video AI tool, Zeroscope, that's available for anybody to use right now.

テキストからビデオを生成する新しいAIツール、Zeroscopeは、今すぐ誰でも使うことができます。

You can create some fun videos.

楽しいビデオを作ることができます。

I just showed you a teeny tiny handful of what I've found on Twitter, but a lot of these videos are kind of going viral right now.

私がTwitterで見つけたほんの一握りの動画をお見せしましたが、これらの動画の多くは今、バイラルになっています。

So if you look, you'll probably find a lot more.

探せばもっとたくさん見つかるだろう。

They're real fun to create.

作るのは本当に楽しい。

They're real easy to create.

それらは作成が非常に簡単です。

Whatever you can imagine, you can generate a funky looking video of.

想像できるものなら何でも、ファンキーなビデオを作ることができる。

So hopefully you enjoyed this video.

このビデオを楽しんでいただけたなら幸いです。

If you like nerding out about this stuff, check out futuretools.io where I curate all the latest tools and news that I come across.

こういった話題を深掘りするのが好きな方は、私が出会った最新のツールやニュースをまとめているfuturetools.ioをチェックしてみてください。

And also, if you haven't already, join the free newsletter.

また、まだの方は無料のニュースレターに参加してください。

I send it out every Friday.

毎週金曜日に配信しています。

It's the TL;DR of everything you missed, both tools and news of AI for the week.

その週のツールやAIのニュースなど、見逃したものをすべてまとめたものだ。

You can find it all over at futuretools.io.

すべてfuturetools.ioでご覧いただけます。

And if you haven't already, maybe consider giving this video a thumbs up, subscribing, hitting the bell, and all of that stuff, because that will help me with the algorithm, and it'll also make sure you see more videos like this in your news feed.

もしまだであれば、このビデオに高評価や購読、ベルの設定などを考慮してみてください。それはアルゴリズムの改善に役立ち、またあなたがこのようなビデオをもっとニュースフィードで見ることができるようになります。

Thank you so much for tuning in.

ご視聴ありがとうございました。

I really, really appreciate you.

本当に本当にありがとう。

I'll see you guys in the next video.

また次のビデオでお会いしましょう。

Bye.

さようなら。
