【Googleの新技術Lumiere：テキストからビデオへの次世代変換】英語解説を日本語で読む【2024年1月25日｜@TheAIGRID】

2024年1月28日 22:47

Google Researchが発表したLumiereという最先端のテキストからビデオへの変換技術は、これまでに見た中で最も進化したモデルです。ユーザーの評価でも、既存のImagenやPika Labs、ZeroScope、RunwayのGen-2などのモデルよりも優れていることが示されています。Lumiereは、空間と時間の両方を効率的に扱うSpace-Time U-Netアーキテクチャを使用し、テキストから画像への拡散モデルを基にビデオデータの複雑さを処理しています。
公開日：2024年1月25日
※動画を再生してから読むのがオススメです。

So, Google Research recently released a stunning paper, and they show off a very, very state-of-the-art text to video generator.

最近、Google Researchが驚くべき論文を公開し、最先端のテキストからビデオを生成するモデルを披露しています。

By far, this is likely going to be the very best text to video generator you've seen.

これはおそらく、これまで見た中で最高のテキストからビデオを生成するモデルになるでしょう。

I want you guys to take a look at the video demo that they've shown us because it's fascinating.

彼らが私たちに披露したビデオデモを見ていただきたいです。それは魅力的ですから。

After that, I'll dive into why this is state-of-the-art and just how good this really is.

その後、なぜこれが最新技術であり、実際にどれほど優れているのかについて詳しく説明します。

Now, one of the most shocking things from Lumiere (and I'm not sure if that's exactly how you pronounce it), but one of the very, very most shocking things that we did see was, of course, the consistency in the videos and how certain things are rendered.

さて、Lumiereで最も衝撃的だったことの1つは（この発音で合っているのかはわかりませんが）、もちろんビデオの一貫性と、特定のものがどのように描写されているかです。

There are a bunch more examples that they didn't actually showcase in this small video, so I will be showing you those from the actual web page.

この短いビデオでは紹介されなかった例がほかにもたくさんあるので、実際のウェブページからそれらをお見せします。

But it is far better than anything we've seen before, and some studies that they did actually do confirm this.

これまでに見たものよりもはるかに優れており、彼らが実際に行ったいくつかの研究でも確認されています。

For example, in their user study, what they found was that our method was preferred by users in both text to video and image to video generation.

たとえば、ユーザースタディでは、テキストからビデオや画像からビデオを生成する際に、私たちの手法がユーザーによって好まれたことがわかりました。

One of the benchmarks that they did (I'm not sure what the quality score was), but you can see that theirs, which is, of course, the Lumiere, actually performed a lot better than Imagen, a lot better than Pika Labs, a lot better than ZeroScope, and Gen-2, which is Runway.

彼らが行ったベンチマークの1つ（品質スコアはわかりませんが）、Lumiereの方がImagenよりも、Pika Labsよりも、ZeroScopeよりも、そしてGen-2（Runway）よりもはるかに優れていることがわかります。

So, Gen-2, if you don't know, is Runway's video model that this is being compared against, and Runway actually recently did launch a bunch of stuff.

Gen-2は、ご存じない方のために言うと、今回比較対象になっているRunwayのビデオモデルで、Runwayは最近さまざまなものを発表しました。

But if we look at text alignment as well, we can see that across all different video models, this is the winner.

また、テキストとの整合性（text alignment）についても見てみると、さまざまなビデオモデル全体でこれが最も優れていることがわかります。

And then, of course, on image to video or video quality, you can see that against Pika, it wins a lot of the time, and also against Stable Diffusion video (I'm pretty sure that's what that is). Then we can see for image to video, you can see it wins against Pika Labs and wins against Gen-2.

そして、画像からビデオやビデオの品質についても、Pikaに対しては多くの場合勝っていますし、Stable Diffusion video（おそらくそうだと思います）に対しても勝っています。また、画像からビデオについても、Pika LabsやGen-2に勝っています。

I'm not sure if this is Stable Diffusion video too, but if you haven't seen that, it's actually something that is really good too.

これもStable Diffusion videoかどうかはわかりませんが、それも非常に優れたものです。

Overall, we do know that right now, this is actually the gold standard in a text to video, which is a very good benchmark because many people have been discussing how 2024 is likely to be the year for text to video.

全体的に、現時点では、これがテキストからビデオのゴールドスタンダードであり、2024年はおそらくテキストからビデオの年になると多くの人々が議論しています。

Now, what I do want to talk about before I dive into some of the more crazy examples of their stuff was, of course, the new architecture.

では、彼らの素晴らしいもののいくつかのクレイジーな例に入る前に、新しいアーキテクチャについて話したいと思います。

Because what exactly is making this so good?

では、具体的に何がこれをこれほど優れたものにしているのでしょうか。

Because, as you know, it looks fascinating in terms of everything that we can do.

なぜなら、私たちができることのすべてにおいて魅力的に見えるからです。

And when I show you some of the more examples, you're going to see exactly why this is even better than you thought.

そして、もっと具体的な例を見せると、思っていた以上に優れていることがわかります。

Essentially, the first thing that they do is they utilize the Space-Time U-Net architecture. Unlike traditional video generation models that create key frames and then fill in the gaps, Lumiere generates the entire temporal duration of the video in one go, and this is achieved through a unique Space-Time U-Net architecture which efficiently handles both spatial and temporal aspects of the video data.

基本的に、彼らが行っている最初のことは、Space-Time U-Netアーキテクチャを利用することです。キーフレームを作成してからギャップを埋める従来のビデオ生成モデルとは異なり、Lumiereはビデオの全時間を一度に生成します。これは、ビデオデータの空間的および時間的な側面の両方を効率的に処理するユニークなSpace-Time U-Netアーキテクチャによって実現されています。
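
To make the "key frames, then fill in the gaps" versus "whole clip in one go" contrast concrete, here is a toy sketch. It is not Lumiere's code: `true_motion`, `keyframes_then_fill`, and `whole_clip_at_once` are hypothetical names, the "video" is just one number per frame (the position of an oscillating object), and the gap filling is plain linear interpolation. It only shows how sparse key frames can miss fast motion entirely.

```python
import math

def true_motion(t, period=8):
    """Toy ground truth: position of an oscillating object at frame t."""
    return math.sin(2 * math.pi * t / period)

def keyframes_then_fill(num_frames, stride):
    """Keep every `stride`-th frame, then fill the gaps by linear interpolation."""
    last = num_frames - 1
    frames = []
    for t in range(num_frames):
        left = (t // stride) * stride       # nearest key frame at or before t
        right = min(left + stride, last)    # next key frame (or the final frame)
        if right == left:                   # t is the final frame and a key frame
            frames.append(true_motion(left))
            continue
        w = (t - left) / (right - left)
        frames.append((1 - w) * true_motion(left) + w * true_motion(right))
    return frames

def whole_clip_at_once(num_frames):
    """Produce every frame directly, with no sparse key frame stage."""
    return [true_motion(t) for t in range(num_frames)]

sparse = keyframes_then_fill(num_frames=17, stride=8)
dense = whole_clip_at_once(num_frames=17)
print(max(abs(x) for x in sparse))  # close to 0: the oscillation is lost
print(max(abs(x) for x in dense))   # close to 1: the oscillation is kept
```

Because the key frames here land exactly once per oscillation, the interpolated clip barely moves, while the clip produced all at once keeps the full motion.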

Now, what they also do is they have temporal downsampling and upsampling, and Lumiere incorporates both spatial and temporal downsampling and upsampling in its architecture.

さらに、彼らは時間的なダウンサンプリングとアップサンプリングも行っており、Lumiereはそのアーキテクチャに空間的および時間的なダウンサンプリングとアップサンプリングの両方を組み込んでいます。

Now, this approach allows the model to process and generate full frame rate videos much more effectively, leading to more coherent and realistic motion in the generated content.

このアプローチにより、モデルはフルフレームレートのビデオをより効果的に処理し生成することができ、生成されたコンテンツの一貫性と現実味のある動きが向上します。
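
As a rough picture of what spatial and temporal downsampling and upsampling mean, here is a minimal sketch on a tiny video stored as nested lists `[frame][row][column]`. The function names and the choice of average pooling and nearest-neighbour repetition are assumptions for illustration only; the real model uses learned layers.

```python
def shape(video):
    """Return (frames, height, width) of a nested-list video."""
    return (len(video), len(video[0]), len(video[0][0]))

def downsample_time(video, factor=2):
    """Average every `factor` consecutive frames into one frame."""
    out = []
    for t in range(0, len(video) - factor + 1, factor):
        group = video[t:t + factor]
        h, w = len(group[0]), len(group[0][0])
        out.append([[sum(f[y][x] for f in group) / factor for x in range(w)]
                    for y in range(h)])
    return out

def downsample_space(video, factor=2):
    """Average-pool each frame over factor x factor pixel blocks."""
    out = []
    for frame in video:
        h, w = len(frame), len(frame[0])
        out.append([[sum(frame[y + dy][x + dx]
                         for dy in range(factor) for dx in range(factor)) / factor ** 2
                     for x in range(0, w - factor + 1, factor)]
                    for y in range(0, h - factor + 1, factor)])
    return out

def upsample_time(video, factor=2):
    """Nearest-neighbour in time: repeat each frame `factor` times."""
    return [[row[:] for row in frame] for frame in video for _ in range(factor)]

def upsample_space(video, factor=2):
    """Nearest-neighbour in space: repeat each pixel along both axes."""
    return [[[v for v in row for _ in range(factor)]
             for row in frame for _ in range(factor)]
            for frame in video]

# 4 frames of 4x4 pixels; every pixel in frame t has value t.
video = [[[float(t)] * 4 for _ in range(4)] for t in range(4)]
small = downsample_space(downsample_time(video))   # compact space-time grid
restored = upsample_space(upsample_time(small))    # back to full resolution
print(shape(video), shape(small), shape(restored))
```

The point of the compact middle representation is that most of the work can happen on far fewer values before the clip is brought back to full frame rate and resolution.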

Now, of course, what they also did was they leverage pre-trained Text-to-Image Diffusion Models, and the research is built upon existing Text-to-Image Diffusion Models, adapting them for video generation.

さらに、彼らは事前学習されたText-to-Image Diffusion Modelsを活用しており、この研究は既存のText-to-Image Diffusion Modelsをビデオ生成に適応させることで構築されています。

And this allows the model to benefit from the strong generative capabilities of these pre-trained models while extending them to handle complexities of video data.

これにより、モデルはこれらの事前学習モデルの強力な生成能力を活用しながら、ビデオデータの複雑さに対応することができます。
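
One common way to extend an image model to video, sketched here under stated assumptions, is to run the existing image layer on each frame independently and then add a new layer that mixes information across time. `pretrained_image_layer` and `temporal_layer` are hypothetical stand-ins (a fixed scaling and a three-frame average), not anything taken from the paper.

```python
def pretrained_image_layer(frame):
    """Hypothetical stand-in for a frozen text-to-image layer: sees ONE frame."""
    return [[2.0 * v for v in row] for row in frame]

def temporal_layer(video):
    """Hypothetical new layer: mixes each pixel with its neighbours in time."""
    num_frames = len(video)
    out = []
    for t in range(num_frames):
        prev_f = video[max(t - 1, 0)]               # clamp at the first frame
        next_f = video[min(t + 1, num_frames - 1)]  # clamp at the last frame
        out.append([[(p + c + n) / 3 for p, c, n in zip(pr, cr, nr)]
                    for pr, cr, nr in zip(prev_f, video[t], next_f)])
    return out

def inflated_block(video):
    """Apply the image layer frame by frame, then mix across frames."""
    per_frame = [pretrained_image_layer(frame) for frame in video]
    return temporal_layer(per_frame)

# Three 1x1 frames with values 0, 3, 6.
clip = [[[0.0]], [[3.0]], [[6.0]]]
print(inflated_block(clip))  # [[[2.0]], [[6.0]], [[10.0]]]
```

The image layer never needs to know it is looking at a video, which is why its pre-trained behaviour can be kept while only the temporal part is new.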

Now, one of the significant challenges in video generation is, of course, maintaining global temporal consistency, and Lumiere's architecture and training approach are specifically designed to address this, ensuring that the generated videos exhibit coherent and realistic motion throughout their duration.

ビデオ生成における重要な課題の1つは、グローバルな時間的一貫性を維持することですが、Lumiereのアーキテクチャとトレーニングアプローチは、これに対応するように特別に設計されており、生成されたビデオが全体の時間にわたって一貫性があり現実味のある動きを示すようにしています。

Now, this is Lumiere's GitHub page, and this is by far one of the very best things I've ever seen because I want to show you guys some of these examples to just show you how advanced this really is.

これがLumiereのGitHubページであり、これは私がこれまで見た中で最も素晴らしいものの1つです。この例をいくつか紹介して、どれだけ進化しているかをお見せしたいと思います。

So, one of the clips I want you to pay attention to is, of course, and I'm going to zoom in here, is of course this Lamborghini because this actually shows us how crazy this technology is.

では、注目していただきたいクリップの1つは、もちろん、ここで拡大表示しているランボルギーニです。これは実際にこの技術がどれほどすごいかを示しています。

So, we can see that the Lamborghini is driving, driving, driving, and then as it rotates, we can actually see that the Lamborghini wheel is not only moving, but also we can see the other angles of that Lamborghini too.

ランボルギーニが走り続け、そして回転していくと、ホイールが動いているだけでなく、ランボルギーニの別の角度まで見えることがわかります。

So, I would say that if we can compare it to some of the other video models, one of the things that we do struggle with is, of course, the motion and, of course, the rotation, but seemingly they've managed to solve this by using this new architecture, and we can see that things like the Lamborghini and rotations, which is a real struggle for video, isn't going to be a problem.

ですので、他のビデオモデルと比較すると、私たちが苦労していることの1つは、もちろん、動きと回転ですが、この新しいアーキテクチャを使用することでこれを解決できたようです。ランボルギーニや回転など、ビデオにとって本当に難しいものは問題にならないことがわかります。

Now, another one of my favorite examples was, of course, beer being poured into a glass.

次に、私のお気に入りの例の一つは、もちろんビールをグラスに注ぐことです。

So, if we take a look at this, this is absolutely incredible because we can see that the glass is just being filled up, and it looks so good and realistic.

これを見てみると、本当に信じられないことがわかります。グラスがただ満たされていく様子が見え、とても良くてリアルです。

I mean, we have the foam, we have the beer actually just moving up, we also do have the bubbles, and we have things just looking really realistic, like if someone was to say this is just a low FPS video of me pouring liquid into a glass, I would honestly believe them.

つまり、泡があり、ビールが実際にせり上がっていき、気泡もあって、すべてが本当にリアルに見えます。もし誰かが「これは私がグラスに液体を注いでいるところを撮った低FPSのビデオです」と言ったら、正直信じてしまうでしょう。

And even if you don't think that it is realistic, I think we can all agree that this is very, very good for text to video.

そして、たとえリアルだとは思わないとしても、テキストからビデオへの変換としては非常に優れているという点には、誰もが同意できると思います。

And if you just hover over it, you can see the input.

それをホバーするだけで、入力が表示されます。

Now, some of these as well, there are just really, really good showcases of how good it is at rotations because I've seen some of the other video models, and this is something that we've only recently, like literally yesterday, I saw a preview, and only recently we've managed to solve that a little bit.

これらの中にも、回転が非常にうまくいっている素晴らしいショーケースがあります。他のビデオモデルのいくつかを見たことがありますが、これは私たちが最近、つい昨日、プレビューを見たもので、最近になって少し解決できたものです。

So, I mean, if we take a look at the bottom left, we can see that the sushi is rotating, and it looks to me like this, it doesn't look as AI generated as many other videos.

したがって、左下を見ると、寿司が回転しているのがわかりますが、他の多くのビデオと比べてAI生成されたものとは思えません。

The only one issue that AI generated videos do suffer from is, of course, low resolution and low frames per second, but I mean, I think that that is going to be solved very, very soon.

AI生成されたビデオが苦しんでいる唯一の問題は、解像度が低く、フレームレートが低いことですが、非常に近い将来に解決されると思います。

And with what we have here as well, like, I mean, if we look at The Confident Teddy Bear Surfer Rides Waves in the Tropics, I mean, if we look at how the water ripples every single time the surfboard actually makes impact with the water, I think we can say that it does look very, very realistic.

そして、ここにあるものもそうです。例えば、「自信に満ちたテディベアのサーファーが熱帯で波に乗る」を見てみると、サーフボードが水面に当たるたびに水が波打つ様子は、非常にリアルに見えると言えるでしょう。

And then, of course, we have the chocolate muffin video clip.

そして、もちろん、チョコレートマフィンのビデオクリップもあります。

Now, this one right here as well looks super, super temporally consistent.

さて、これも非常に、非常に時間的に一貫して見えます。

I mean, just the way that it rotates just looks like nothing we've ever seen before.

まあ、ただ回転する方法だけでも、これまでに見たことのないようなものに見えます。

And of course, this wolf one, silhouetted against a twilight sky, also looks very, very accurate and very, very good.

そして、もちろん、このシルエットのオオカミが黄昏の空に映える様子も非常に正確で非常に良いです。

So, I mean, these demos of the text-to-video, I would say, are just absolutely outstanding.

したがって、これらのテキストからビデオへのデモは、まさに素晴らしいものです。

This one right here, the fireworks that we're looking at, is definitely something that I've seen done by other models before, but it does go to show how good it is.

ここで見ている花火は、他のモデルでも見たことがあるものですが、それがどれだけ優れているかを示しています。

And this one right here, camera moving through dry grass at an Autumn morning, also does show just how good it is.

そして、ここにあるものは、秋の朝に乾いた草を通り抜けるカメラの映像も同様です。

Now, with regards to walking and legs and stuff like that, there is still a bit of a small issue there, and there are some other things that I do want to discuss about this entire project because this entire project is, I'm pretty sure, a collaboration of some other AI projects that Google has done before, and I can't wait to see if Google manages to finally release this.

歩行や足などに関しては、まだ少し問題がありますし、このプロジェクト全体についても他にいくつか話したいことがあります。このプロジェクト全体は、おそらくGoogleが以前に行った他のAIプロジェクトの共同作業であると思われます。Googleがこれを最終的にリリースするのを楽しみにしています。

So, one of the other models, so some of the other ones that are my favorites, of course, the chocolate syrup pouring on vanilla ice cream, that looks really well, and then this clip of the skywalking doesn't look too bad.

他のモデルの中でも、私のお気に入りのものは、もちろんバニラアイスクリームにチョコレートシロップを注ぐものです。それは本当に良く見えますし、スカイウォーキングのクリップも悪くありません。

And I think that when we take a look at certain videos that are very subtle in nature, so for example, blooming cherry tree in the garden, that looks pretty subtle, and then of course, the Aurora Borealis, that one looks pretty subtle too.

そして、非常に微妙な性質のある特定のビデオを見るとき、例えば庭の咲き誇る桜やオーロラのようなものは、非常に微妙に見えます。

So, a lot of these videos, I think personally, are just absolutely the best.

したがって、これらのビデオの多くは、個人的には本当に最高だと思います。

And of course, we do need to take a look at stylized generation because this is something that is really, really important for generating certain styles of videos, but Google's Lumiere does it really, really well.

そして、もちろん、スタイル化された生成も見ておかなければなりません。これは特定のスタイルのビデオを生成するために非常に重要なことですが、GoogleのLumiereは非常にうまくやっています。

So, another thing that I did also see was because I stay up to date with pretty much all of Google's AI research, is that I do note that this stylized generation right here is definitely taking the research from another Google paper that was called StyleDrop, and I'll show you guys that in a moment.

また、私が見たもう一つのことは、私はほぼすべてのGoogleのAI研究を最新の状態に保っているため、このスタイル化された生成は、実際には別のGoogleの論文である「StyleDrop」というものからの研究を取り入れていることです。それをすぐにお見せします。

But I think it just goes to show that when Google combines all of their stuff, and it does go to show that they're probably building some very comprehensive video system in the future, that whenever they do tend to release it, it's going to be absolutely incredible because if we look at this is just one reference image.

しかし、Googleが自社の成果をすべて組み合わせていること、そしておそらく将来に向けて非常に包括的なビデオシステムを構築していることを考えると、それがリリースされるときには本当に信じられないほど素晴らしいものになるでしょう。なにしろ、これはたった1枚の参照画像なのですから。

And then, we can see that all of these kinds of videos that we do get, this is going to be very, very useful for people who are trying to create certain styles, for certain things.

そして、私たちが得ることができるすべてのビデオを見ると、特定のスタイルの作成を試みている人々にとって非常に役立つものになるでしょう。

And of course, we can see that this is like, some kind of 3D animation kind of style.

そして、もちろん、これはある種の3Dアニメーションのようなスタイルです。

And then, of course, the videos from that actually look very, very good too.

そして、実際にそのビデオも非常に非常に良く見えます。

So, this is what I'm talking about when I say StyleDrop.

これが私が「StyleDrop」と言っているものです。

So, I'm going to show you guys that page now.

では、このページをお見せします。

Google previously actually did release this research paper, and this was actually sometime last year.

Googleは以前にこの研究論文を公開しましたが、それは実際には昨年のことです。

But you can see that this was essentially based off similar stuff.

しかし、このページを見ると、これが基本的に同様のものに基づいていることがわかります。

Now, I'm not sure how much they've changed the architecture, but you can see that it's a text-to-image maker, and essentially what it does when it generates the images is it uses the reference image as a style.

今、アーキテクチャをどれだけ変更したかはわかりませんが、テキストからイメージを生成するもので、イメージを生成する際に参照画像をスタイルとして使用します。

And you can see just how good that stuff does look.

そして、それがどれだけ良く見えるかをご覧いただけます。

I mean, if we take a look at the Vincent Van Gogh style, and then, of course, we do take a look at the other images, I mean, I mean, they just look absolutely incredible.

つまり、ヴィンセント・ファン・ゴッホのスタイルを見て、それから他の画像も見てみると、本当に信じられないくらい素晴らしいです。

And of course, we do have the same exact one here in the StyleDrop paper as videos.

そしてもちろん、StyleDropの論文にあったものとまったく同じものが、ここではビデオになっています。

And I think this is really, really important because if Google manages, well, it looks like they've managed to combine everything from their previous research like MAGVIT and VideoPoet all into one unique thing.

そして、これは本当に、本当に重要だと思います。Googleは、MAGVITやVideoPoetなど、以前の研究のすべてをひとつのユニークなものに組み合わせることに成功したように見えるからです。

I think this is going to be super, super effective because people are wondering, and one of the questions has been, "Why no code? Why no model? No model, just a show once again."

これは非常に効果的になると思います。人々は疑問に思っていて、質問の一つは「なぜコードがないのか、なぜモデルがないのか、またしても見せるだけでモデルはなしか」というものです。

Okay, are you going to release this though and press it, but no open source weights?

では、それをリリースして、プレスするつもりですか？ただし、オープンソースの重みはありませんか？

I think that the reason Google has chosen to not release this model and to not release the weights of this model or the code is because I'm pretty sure that they are going to be building on this to release it into perhaps Gemini or a later version of another Google system.

Googleがこのモデルやそのコード、重みを公開しない理由は、おそらくこれをGeminiや別のGoogleシステムの後のバージョンにリリースするためにこれを構築するつもりだからだと思います。

Now, I could be completely wrong.

今、私は完全に間違っているかもしれません。

Google has been known in the past to just build things and just sit on them, but I think with the nature of how competitive things are and the fact that this is state-of-the-art and the fact that there aren't any other models that can do this in terms of models that seem to be competing in this area, this is an area that Google could easily dominate.

過去、Googleは単に何かを構築してそれを放置することが知られていましたが、競争がどれだけ激しいか、そしてこれが最先端であり、この領域で競合するような他のモデルが存在しない事実を考慮すると、Googleは簡単にこの分野を支配できる可能性があると思います。

And since Google did lose before to ChatGPT in terms of the AI race, I'm sure that Google would try and stay ahead.

Googleは以前、AIレースでChatGPTに負けたことがありますので、Googleは先を行こうとするでしょう。

Now, seemingly like since they've got the lead, so I don't know, they may do that, they may not.

今、彼らはリードを持っているようですので、それをするかどうかはわかりません。

Google has previously just sat on things before, but I do think that maybe they might just polish the model and then release it.

Googleは以前、単に何かを放置したことがありましたが、このモデルを磨いてからリリースする可能性もあると思います。

I think it would be really cool if they did that, and I really do hope they do do that because it would make other things even more competitive.

もしそれをやってくれたら本当に素晴らしいと思いますし、本当にそうしてほしいと思います。他のこともさらに競争力を高めるでしょうから。

The key things here, as well, was the video stylization, and I don't think you understand just how good this is.

また、ここでの重要なポイントは、ビデオのスタイル化ですが、これがどれだけ素晴らしいか、あなたは理解していないと思います。

Like the made of flowers one right here is just absolutely incredible.

ここにある花でできたもの、本当に信じられないくらい素晴らしいです。

I mean, look at that.

見てください。

That just looks, I mean, that looks like CGI honestly.

これ、正直、CGIのように見えます。

Like if I saw that, I would be like, Wow, that's some really cool CGI.

もし私がそれを見たら、わあ、それは本当にクールなCGIだと思うでしょう。

Other styles aren't as aesthetic or aren't as good, but for some reason, the Lego one as well, for example, if we do take a look at this Lego car, that one doesn't look AI generated in the sense that like it was just from AI.

他のスタイルは美的でなかったり、あまり良くなかったりすることがありますが、何故か例えばレゴのスタイルは、このレゴの車を見ると、それがただのAIから生成されたものではないように見えます。

It actually looks like a Lego car.

実際にレゴの車のように見えます。

And then, of course, the one for flowers.

そしてもちろん、花のためのものもあります。

I'm not sure why.

なぜかはわかりません。

I think it's because the way how AI generates these images, they're kind of like fine. And I think with flowers, they just look very fine and detailed and intricate.

AIがこれらの画像を生成する方法のせいだと思います。それらは細かい感じがします。そして、花に関しては、非常に細かくて詳細で入り組んで見えると思います。

So that's why it doesn't look that bad, but that one does look really cool.

だからそれがそんなに悪く見えないのですが、それは本当にクールに見えます。

So yeah, I think, I think what we've seen here, in terms of the video stylization, shows us just how good of a model this is.

だから、そうですね、私たちがここで見たものは、ビデオのスタイル化に関して、このモデルがどれだけ優れているかを示しています。

Now with the cinemagraphs, I do think that this is also another fascinating piece of the paper because this is where the model is able to animate the content of an image within a specific user provided region.

そして、シネマグラフについては、これも論文のもう一つの魅力的な部分だと思います。この部分では、モデルが特定のユーザーが指定した領域内の画像の内容をアニメーション化できるのです。
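
The end result of a cinemagraph can be pictured as a simple per-pixel rule: inside the user-provided region, take the pixel from the animated frames; everywhere else, keep the pixel from the still image. The sketch below shows only that output-side rule on tiny nested lists. `make_cinemagraph` is a hypothetical helper, and this is not a description of how the model itself is conditioned on the mask.

```python
def make_cinemagraph(still, animated_frames, mask):
    """Keep `still` outside the mask and use the animated pixels inside it.

    still:           one image as rows of pixel values
    animated_frames: list of images with the same size as `still`
    mask:            same size as `still`; truthy where motion is allowed
    """
    result = []
    for frame in animated_frames:
        result.append([[a if m else s for s, a, m in zip(srow, arow, mrow)]
                       for srow, arow, mrow in zip(still, frame, mask)])
    return result

still = [[1, 1],
         [1, 1]]
animated = [[[5, 6], [7, 8]],
            [[9, 10], [11, 12]]]
mask = [[0, 1],   # only the top-right pixel is allowed to move
        [0, 0]]

for frame in make_cinemagraph(still, animated, mask):
    print(frame)
```

Only the masked pixel changes from frame to frame, which is exactly the "fire moves, everything else stays frozen" look of the demos.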

And I do think that this is really effective.

そして、これは本当に効果的だと思います。

But what was fascinating was that a couple of days ago, Runway actually did release their ability to do this.

しかし、面白かったのは、数日前にRunwayが実際にこれを実現したことです。

So if you haven't seen it before, I'm going to show it to you guys now.

もし以前に見たことがなければ、今から皆さんに見せます。

But essentially, Runway has this brush where you can select specific parts of an image, and then essentially, you can adjust the movement of these brushes.

しかし、基本的には、Runwayにはこのブラシがあり、画像の特定の部分を選択し、それらのブラシの動きを調整することができます。

And then once you do that, you can essentially animate a specific character.

そして、それを行った後、特定のキャラクターをアニメーション化することができます。

Now, I know this isn't a Runway video, but it's just going to show that this is a new feature that is being rolled out to video models across different companies.

今、これはRunwayのビデオではありませんが、これは異なる会社のビデオモデルに展開されている新機能であることを示しています。

So I think that in the future, what we're also going to have is since video models sometimes aren't always the best at animating certain things, I think we're going to have a lot more customization.

ですので、将来的には、ビデオモデルが特定のものをアニメーション化するのに最適ではない場合もあるため、より多くのカスタマイズが可能になると思います。

And that's what we're seeing here with Lumiere, because of course, the fire looks really, really good.

そして、もちろん、Lumiereでは火が非常に非常にリアルに見えるんです。

The butterfly here also looks really cool.

ここにいる蝶も本当にクールに見えます。

The water here looks like it's moving realistically, and this smoke train also does look very, very effective.

ここにある水はリアルに動いているように見えますし、この煙を上げる列車も非常に効果的に見えます。

There weren't that many demos of this, but it was enough to show us that it was really good.

それほど多くのデモはありませんでしたが、それは本当に良いことを示していました。

Now, video inpainting was something that we did look at. I think it was either VideoPoet or MAGVIT that showed us this, but at the time, it honestly wasn't as good as this is.

ビデオのインペインティングについては、以前にも見たことがあります。VideoPoetかMAGVITのどちらかが見せてくれたと思いますが、その時点では正直、これほど優れてはいませんでした。

I mean, it was decent, but this is different, like just completely different level.

まあまあ良かったのですが、これは違います。まったく違うレベルです。

Like, I mean, imagine having just half of a video, and then, being able to just say fill in the rest.

半分の動画があると想像してみてください。そして、残りを埋めることができるということですね。

So basically, if you don't know what this is, this is basically just generative fill for video.

基本的に、これが何かわからない方のために言うと、これはビデオ版のジェネレーティブ塗りつぶし（generative fill）です。

And I think that having this is just pretty, pretty crazy because being able to just say, Okay, fill it in or just with the text prompt, I mean, just look at the way that the chocolate falls on this one.

そして、これがあることは本当に、本当にすごいと思うんだよ、埋め合わせてみて、テキストのプロンプトだけで、まあ、このチョコレートが落ちる様子を見てみて。

It's definitely really, really, really effective at doing that.

それを本当に、本当に効果的にこなしています。

So I think this one is definitely going to have some wide-scale uses.

ですので、これは間違いなく幅広く使われると思います。

And of course, this one is probably going to have the most because you can change different things.

そしてもちろん、これはおそらく最も使われるだろうね、なぜなら異なることを変えることができるからだよ。

So you can literally just say, "wearing a red scarf," "wearing a purple tie," "sitting on a stool," "wearing boots," "wearing a bathrobe."

ですので、文字通り「赤いスカーフを巻いている」「紫のネクタイを締めている」「スツールに座っている」「ブーツを履いている」「バスローブを着ている」と言うだけでいいのです。

I think a lot of this stuff is most certainly fascinating.

私はこのようなことの多くが確かに魅力的だと思うよ。

And another thing that we also didn't take a look at was, of course, the image to video.

そして、もう一つ見ていなかったことは、もちろん、画像からビデオへの変換だよ。

And with image to video, I think this is really good as well because some of the models don't always generate the best images.

そして画像からビデオへの変換も、これはとても良いと思うよ、なぜならモデルの中にはいつも最高の画像を生成しないものもあるからだよ。

And if you want to be able to generate certain images yourself, you're going to want to be able to animate those specifically.

そして、特定の画像を自分で生成できるようにしたいなら、それらを特にアニメーションさせることができるようにしたいんだよ。

So I think that this, as well, the image to video section of the model is rather, rather effective.

だから、これもまた、モデルの画像からビデオへのセクションはかなり効果的だと思うよ。

And I always find it very funny and hilarious that, for some reason, all of these video models decide to use a teddy bear running in New York as some kind of benchmark.

そして、なぜか、これらのビデオモデルは、ニューヨークで走るテディベアをベンチマークとして使うことに決めたことが、いつもとても面白くてユーモラスだと思うんだよ。

But definitely, this one does look better over previous iterations.

でも確かに、これは以前のバージョンよりも見た目が良くなっていると思うよ。

And I do think that, for some reason, the text to video model is better than the image to video model, just simply based on how things are done.

そして、なぜか、テキストからビデオへのモデルの方が、画像からビデオへのモデルよりも良いと思うんだ、ただ単にやり方の違いに基づいて。

But for example, things like ocean waves, the way that the giraffe is eating grass.

でも例えば、海の波とか、キリンが草を食べている様子とか。

I know that they definitely did train this on a huge amount of data because if you've ever seen giraffes eating grass, they do eat it exactly like that.

私は彼らが確かに大量のデータでこれをトレーニングしたと知っている、なぜならキリンが草を食べる様子を見たことがあるなら、彼らはまさにそのようにそれを食べるからだよ。

It's not a weird AI-generated mouth.

それは奇妙なAI生成の口ではないんだ。

Also, if you do look at waves, waves look exactly like that.

また、波を見ると、波はまさにそのように見えるんだ。

Fire moves exactly like that too.

火もまさにそのように動くんだ。

So there is like a real big level of understanding, like a huge level of understanding, for what's being done here.

だから、ここでやっていることには、本当に大きな理解のレベルがあるんだよ、本当に大きな理解のレベルが。

And I mean, even if we look at a happy elephant like this one right here, a happy elephant wearing a birthday hat under the sea.

そして、例えばここにあるこの幸せな象、海の中で誕生日の帽子をかぶった幸せな象を見てみてください。

And then, when you hover over it, you can see the original image.

そして、それをホバーすると、元の画像が見えるんだ。

So, this is what the original image looks like.

だから、これが元の画像の見た目だよ。

And then, this is what the text video thing is.

そして、これがテキストビデオのものだよ。

And we can see that, like, it's kicking up the water as it's moving underwater, which is, I don't know, it's kind of weird.

そして、私たちは見ることができる、まるで水を蹴り上げながら水中を移動しているような、ちょっと変な感じだよ。

But, it also does look, pretty cool if you ask me.

でも、私に言わせれば、それはかなりクールに見えるんだよ。

And then, this is that notable image of soldiers raising the United States flag on a windy day.

そして、これが風の強い日にアメリカ国旗を掲げる兵士の有名な画像だよ。

Then, we can see that it is moving.

それから、それが動いているのが見えるんだ。

So, I think overall, and of course, we got this very famous painting.

だから、全体的には、もちろん、このとても有名な絵もあるよ。

And of course, even more waves.

もちろん、さらに多くの波もあります。

But, I think in certain scenarios, for example, with liquids, it seems to work pretty well.

ただし、特定のシナリオでは、例えば液体の場合、かなりうまく機能するようです。

With water, it seems to work pretty well.

水では、かなりうまく機能するようです。

And I think fireworks and, for some reason, rotating objects do now work really well.

そして、花火や、何故か回転するオブジェクトは本当にうまく機能すると思います。

But, I think the main question that is going to come away from this is, is Google going to release this?

ただ、この結果から生じる主な疑問は、Googleはこれを公開するのかということです。

Are they going to build it into a bigger project?

彼らはこれをより大きなプロジェクトに組み込むのでしょうか。

Or are they waiting for something to be more published?

それとも、何かがさらに公開されるのを待っているのでしょうか。

I mean, currently, it is state-of-the-art.

つまり、現時点では最先端の技術です。

So, I guess we're going to have to wait from Google themselves.

ですから、Google自身からの発表を待つしかないと思います。

But, I do note that one thing that is a bit different from larger companies is the fact that there is a difference between getting AI research done and then, of course, just having it out there and just releasing it versus actually having a product that people are going to use.

ただし、大企業では少し事情が違う点として、AIの研究を行ってそれを単に公開・リリースすることと、実際に人々が使う製品を持つこととの間には違いがある、ということには注意しています。

Because it's all well and good being able to do something which is fascinating, astounding, and it's really good.

魅力的で驚くべき、本当に優れたことができるのは大いに結構です。

But of course, translating that into a product that people can then use and is actually effective is another issue.

しかし、もちろん、それを人々が実際に使えて効果的な製品に落とし込むことは別の問題です。

So, I don't know if they're going to do that soon.

ですので、彼らがそれをすぐに行うかどうかはわかりません。

But, I will be looking out for that because I do want to be able to use this and test it to see just how well it does against certain prompts, against certain things like Runway, Pika Labs, and of course, Stable Diffusion video.

しかし、それには注目していきます。これを実際に使って、特定のプロンプトに対して、またRunway、Pika Labs、そしてもちろんStable Diffusion videoなどと比べて、どれだけうまく機能するかをテストしたいからです。

So, what do you think about this?

では、あなたはこれについてどう思いますか？

Let me know what your favorite feature is going to be.

どの機能がお気に入りになるか教えてください。

My favorite feature that I'm thinking of is, of course, just the text-to-video because I'm just going to use that once it does come out, if it does ever come out.

私が考えているお気に入りの機能は、もちろん、テキストからビデオへの変換です。もしも公開されるなら、それを使うつもりです。

But, other than that, I think this is an exciting project.

ただし、それ以外については、私はこのプロジェクトがエキサイティングだと思います。

I think there's a lot more things to be done in this space.

この領域でやるべきことはまだまだたくさんあると思います。

And if things are continuing to move at this pace, I really do wonder where we will be at the end of the year.

そして、このペースで進んでいるなら、年末にはどこにいるのか本当に興味があります。
