

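Lumiere, a state-of-the-art text-to-video model announced by Google Research, is the most advanced model of its kind seen so far. In user evaluations, it was also rated above existing models such as Imagen, Pika Labs, ZeroScope, and Runway's Gen-2. Lumiere uses a Space-Time U-Net architecture that efficiently handles both space and time, and it builds on a text-to-image diffusion model to deal with the complexity of video data.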

So, Google Research recently released a stunning paper, and they show off a state-of-the-art text-to-video generator.


This is likely going to be by far the best text-to-video generator you've seen.


I want you guys to take a look at the video demo that they've shown us because it's fascinating.


After that, I'll dive into why this is state-of-the-art and just how good this really is.


Now, one of the most shocking things about Lumiere (and I'm not sure if that's exactly how you pronounce it) is, of course, the consistency in the videos and how certain things are rendered.


There are a bunch more examples that they didn't actually showcase in this small video, so I'll show you those from the actual web page.


But it is far better than anything we've seen before, and the studies they ran back this up.


For example, in their user study, they found that their method was preferred by users in both text-to-video and image-to-video generation.


In one of the benchmarks they ran (I'm not sure what the quality score was), you can see that theirs, which is of course Lumiere, performed a lot better than Imagen, Pika Labs, ZeroScope, and Gen-2, which is Runway's model.


So, Gen-2, if you don't know, is Runway's video model, and Runway actually did recently launch a bunch of stuff.


But if we look at text alignment as well, we can see that across all different video models, this is the winner.


And then, of course, on video quality, you can see that it wins a lot of the time against Pika and against what I'm pretty sure is Stable Video Diffusion. Then, for image to video, you can see it wins against Pika Labs and against Gen-2.


I'm not sure if this is Stable Video Diffusion too, but if you haven't seen that, it's actually really good as well.


Overall, we do know that right now, this is the gold standard in text-to-video, which is a very good benchmark because many people have been discussing how 2024 is likely to be the year for text-to-video.


Now, what I do want to talk about before I dive into some of their crazier examples is, of course, the new architecture.


Because what exactly is making this so good?


Because, as you know, it looks fascinating in terms of everything that we can do.


And when I show you some more of the examples, you're going to see exactly why this is even better than you thought.


Essentially, the first thing they do is use a Space-Time U-Net architecture. Unlike traditional video generation models, which create key frames and then fill in the gaps, Lumiere generates the entire temporal duration of the video in one go. This is achieved through that Space-Time U-Net, which efficiently handles both the spatial and temporal aspects of the video data.
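
To make that contrast a bit more concrete, here is a toy sketch in Python. It is not Lumiere's code: it assumes a made-up one-dimensional "motion" signal (one position value per frame) standing in for a video, and it uses plain linear interpolation as a stand-in for the learned in-between step that real keyframe-based models use. The names `true_motion`, `keyframes_then_interpolate`, and `max_error` are hypothetical. The point is only that motion happening between keyframes can be lost when frames are filled in afterwards.

```python
import math

def true_motion(num_frames, period=8):
    """Position of an object that oscillates once every `period` frames."""
    return [math.sin(2 * math.pi * t / period) for t in range(num_frames)]

def keyframes_then_interpolate(signal, stride):
    """Keep every `stride`-th frame, then linearly fill in the frames between."""
    out = []
    last = len(signal) - 1
    for t in range(len(signal)):
        lo = (t // stride) * stride      # keyframe at or before frame t
        hi = min(lo + stride, last)      # next keyframe (clamped to the end)
        if hi == lo:
            out.append(signal[lo])
            continue
        w = (t - lo) / (hi - lo)         # how far frame t is between the two
        out.append((1 - w) * signal[lo] + w * signal[hi])
    return out

def max_error(a, b):
    """Largest per-frame difference between two signals."""
    return max(abs(x - y) for x, y in zip(a, b))

frames = true_motion(17)                               # two full oscillations
sparse = keyframes_then_interpolate(frames, stride=8)  # keyframes at 0, 8, 16
dense = keyframes_then_interpolate(frames, stride=1)   # every frame is kept

print("keyframes + fill-in error:", max_error(frames, sparse))
print("all frames at once error: ", max_error(frames, dense))
```

In this toy setup the keyframes all land where the object is at rest, so the filled-in clip shows almost no motion at all, while keeping every frame reproduces the motion exactly.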


Now, they also use downsampling and upsampling: Lumiere incorporates both spatial and temporal downsampling and upsampling in its architecture.


Now, this approach allows the model to process and generate full frame rate videos much more effectively, leading to more coherent and realistic motion in the generated content.
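
As a rough illustration of what temporal downsampling and upsampling mean, here is a minimal Python sketch. It is an assumption-laden stand-in, not the model's actual layers: in a real network these steps are learned, whereas here downsampling is a plain average over neighbouring frames and upsampling simply repeats frames. The function names, the factor of 2, and the tiny three-pixel "video" are all hypothetical.

```python
def temporal_downsample(frames, factor=2):
    """Average each group of `factor` consecutive frames into one frame."""
    out = []
    for start in range(0, len(frames) - factor + 1, factor):
        group = frames[start:start + factor]
        # Average pixel by pixel across the frames in the group.
        out.append([sum(px) / factor for px in zip(*group)])
    return out

def temporal_upsample(frames, factor=2):
    """Repeat each frame `factor` times (nearest-neighbour in time)."""
    return [list(frame) for frame in frames for _ in range(factor)]

# A tiny "video": 4 frames, each flattened to 3 pixel values.
video = [
    [0.0, 0.0, 0.0],
    [2.0, 2.0, 2.0],
    [4.0, 4.0, 4.0],
    [6.0, 6.0, 6.0],
]

coarse = temporal_downsample(video)    # fewer frames to process
restored = temporal_upsample(coarse)   # back to the original frame count

print(len(video), "->", len(coarse), "->", len(restored))
```

The idea being sketched is that the heavy processing can happen on the shorter, coarser clip before the result is brought back up to the full frame rate.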


Now, of course, they also leverage pre-trained text-to-image diffusion models: the research is built on existing text-to-image diffusion models, adapting them for video generation.


And this allows the model to benefit from the strong generative capabilities of these pre-trained models while extending them to handle the complexities of video data.
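
One common way to adapt an image model like this, sketched below in Python, is to keep the pre-trained image layers frozen and train only newly added temporal layers. Nothing said here confirms that Lumiere is trained exactly this way, so treat the freezing as my assumption. `Layer`, `inflate`, and the layer names are hypothetical and only show the bookkeeping, not a real network.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    kind: str        # "spatial" (pre-trained) or "temporal" (newly added)
    trainable: bool

def inflate(pretrained_layer_names):
    """Follow each pre-trained spatial layer with a new temporal layer."""
    layers = []
    for name in pretrained_layer_names:
        # Assumption: the pre-trained image weights stay frozen.
        layers.append(Layer(name, "spatial", trainable=False))
        # Assumption: only the added temporal layer is trained on video.
        layers.append(Layer(name + "_temporal", "temporal", trainable=True))
    return layers

video_model = inflate(["down_block", "mid_block", "up_block"])
trainable_names = [layer.name for layer in video_model if layer.trainable]

print(trainable_names)
```

Under this assumption the image model's generative ability is left untouched, and the new layers only have to learn how things move over time.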


Now, one of the significant challenges in video generation is, of course, maintaining global temporal consistency, and Lumiere's architecture and training approach are specifically designed to address this, ensuring that the generated videos exhibit coherent and realistic motion throughout their duration.


Now, this is Lumiere's GitHub page, and this is by far one of the very best things I've ever seen, so I want to show you guys some of these examples to show you how advanced this really is.


So, one of the clips I want you to pay attention to, and I'm going to zoom in here, is this Lamborghini, because it shows us how crazy this technology is.


So, we can see that the Lamborghini is driving, driving, driving, and then as it rotates, we can actually see that the Lamborghini wheel is not only moving, but also we can see the other angles of that Lamborghini too.


So, I would say that if we compare it to some of the other video models, one of the things they do struggle with is, of course, the motion and the rotation. But seemingly they've managed to solve this with the new architecture, and we can see that things like the Lamborghini and rotations, which are a real struggle for video models, aren't going to be a problem.


Now, another one of my favorite examples was, of course, beer being poured into a glass.


So, if we take a look at this, this is absolutely incredible because we can see that the glass is just being filled up, and it looks so good and realistic.


I mean, we have the foam, we have the beer actually moving up, we have the bubbles, and everything just looks really realistic. If someone were to say this is just a low-FPS video of them pouring liquid into a glass, I would honestly believe them.


And even if you don't think that it is realistic, I think we can all agree that this is very, very good for text to video.


And if you just hover over it, you can see the input.


Now, some of these are also just really, really good showcases of how good it is at rotations. I've seen some of the other video models, and rotation is something we've only recently managed to solve a little bit; literally yesterday, I saw a preview.


So, I mean, if we take a look at the bottom left, we can see that the sushi is rotating, and to me it doesn't look as AI-generated as many other videos.


The one issue that AI-generated videos do suffer from is, of course, low resolution and low frames per second, but I think that is going to be solved very, very soon.


And with what we have here as well, if we look at "the confident teddy bear surfer rides waves in the tropics," and at how the water ripples every single time the surfboard makes impact with the water, I think we can say that it does look very, very realistic.


And then, of course, we have the chocolate muffin video clip.


Now, this one right here as well looks super, super temporally consistent.


I mean, just the way that it rotates just looks like nothing we've ever seen before.


And of course, this one of a wolf silhouetted against a twilight sky also looks very, very accurate and very, very good.


So, I mean, these text-to-video demos, I would say, are just absolutely outstanding.


This one right here, the fireworks that we're looking at, is definitely something that I've seen done by other models before, but it does go to show how good it is.


And this one right here, camera moving through dry grass on an autumn morning, also does show just how good it is.


Now, with regards to walking and legs and stuff like that, there is still a bit of an issue there. There are some other things I do want to discuss about this entire project too, because I'm pretty sure it's a collaboration of some other AI projects that Google has done before, and I can't wait to see if Google manages to finally release this.


So, some of the other ones that are my favorites are, of course, the chocolate syrup pouring on vanilla ice cream, which looks really good, and then this clip of the skywalking doesn't look too bad.


And I think that when we take a look at certain videos that are very subtle in nature, so for example, blooming cherry tree in the garden, that looks pretty subtle, and then of course, the Aurora Borealis, that one looks pretty subtle too.


So, a lot of these videos, I think personally, are just absolutely the best.


And of course, we do need to take a look at stylized generation, because this is something that is really, really important for generating certain styles of videos, and Google's Lumiere does it really, really well.


So, another thing I noticed, because I stay up to date with pretty much all of Google's AI research, is that this stylized generation right here is definitely taking the research from another Google paper called StyleDrop, and I'll show you guys that in a moment.


But I think it just goes to show that Google is combining all of their research, and that they're probably building some very comprehensive video system for the future. Whenever they do release it, it's going to be absolutely incredible, because if we look at this, it's just one reference image.


And then, we can see all of these kinds of videos that we get from it. This is going to be very, very useful for people who are trying to create videos in a certain style.


And of course, we can see that this is some kind of 3D animation style.


And then, of course, the videos from that actually look very, very good too.


So, this is what I'm talking about when I say StyleDrop.


So, I'm going to show you guys that page now.


Google actually released this research paper previously; this was sometime last year.


But you can see that this was essentially based off similar stuff.


Now, I'm not sure how much they've changed the architecture, but you can see that it's a text-to-image model, and essentially what it does when it generates the images is use the reference image as a style.


And you can see just how good that stuff does look.


I mean, if we take a look at the Vincent van Gogh style, and then, of course, at the other images, they just look absolutely incredible.


And of course, we do have the same exact ones from the StyleDrop paper here as videos.


And I think this is really, really important because, well, it looks like Google has managed to combine everything from their previous research, like MAGVIT and VideoPoet, all into one unique thing.


I think this is going to be super, super effective, but people are wondering, and one of the questions has been, "Why no code? Why no model? No model, just to show, once again."


Okay, are you going to release this though and press it, but no open source weights?


I think the reason Google has chosen not to release this model, or its weights or code, is that I'm pretty sure they are going to be building on this to release it into perhaps Gemini or a later version of another Google system.


Now, I could be completely wrong.


Google has been known in the past to just build things and sit on them, but given how competitive things are, the fact that this is state-of-the-art, and the fact that no other models competing in this area seem able to do this, I think this is an area that Google could easily dominate.


And since Google did lose before to ChatGPT in terms of the AI race, I'm sure that Google would try and stay ahead.


Now, seemingly they've got the lead, so, I don't know, they may do that, they may not.


Google has just sat on things before, but I do think that maybe they might just polish the model and then release it.


I think it would be really cool if they did that, and I really do hope they do, because it would make things even more competitive.


One of the key things here, as well, was the video stylization, and I don't think you understand just how good this is.


Like the "made of flowers" one right here is just absolutely incredible.


I mean, look at that.


That just looks, I mean, that looks like CGI honestly.


Like, if I saw that, I would be like, "Wow, that's some really cool CGI."


Other styles aren't as aesthetic or aren't as good, but for some reason the Lego one works as well. For example, if we take a look at this Lego car, it doesn't look AI-generated at all.


It actually looks like a Lego car.


And then, of course, the one for flowers.


I'm not sure why.


I think it's because of the way AI generates these images: the details are kind of fine. And with flowers, they just look very fine and detailed and intricate.


So that's why it doesn't look that bad, but that one does look really cool.


So yeah, I think what we've seen here in terms of the video stylization shows us just how good a model this is.


Now, with the cinemagraphs, I do think that this is another fascinating piece of the paper, because this is where the model is able to animate the content of an image within a specific user-provided region.


And I do think that this is really effective.
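
To show what "animate only a user-provided region" means in practice, here is a small Python sketch. How Lumiere itself uses the region isn't described here, so this is just a simple compositing stand-in under my own assumptions: generated pixels are kept where a binary mask is 1, and the original still image is kept everywhere else. The names `composite_frame` and `make_cinemagraph` and the 2x2 "images" are hypothetical.

```python
def composite_frame(still, generated, mask):
    """Take pixels from `generated` where mask is 1, from `still` elsewhere."""
    return [
        [g if m else s for s, g, m in zip(s_row, g_row, m_row)]
        for s_row, g_row, m_row in zip(still, generated, mask)
    ]

def make_cinemagraph(still, generated_frames, mask):
    """Apply the same mask to every generated frame of the clip."""
    return [composite_frame(still, frame, mask) for frame in generated_frames]

# A 2x2 still image, a mask selecting only the top-right pixel,
# and two made-up "generated" frames.
still = [[10, 10],
         [10, 10]]
mask = [[0, 1],
        [0, 0]]
generated_frames = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
]

clip = make_cinemagraph(still, generated_frames, mask)
for frame in clip:
    print(frame)
```

Only the masked pixel changes from frame to frame, which is the basic cinemagraph effect: a still picture with one moving region.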


But what was fascinating was that a couple of days ago, Runway actually released a feature that does this.


So if you haven't seen it before, I'm going to show it to you guys now.


But essentially, Runway has this brush where you can select specific parts of an image, and then essentially, you can adjust the movement of these brushes.


And then once you do that, you can essentially animate a specific character.


Now, I know this isn't a Runway video, but it just goes to show that this is a new feature that is being rolled out to video models across different companies.


So I think that in the future, since video models aren't always the best at animating certain things, we're going to have a lot more customization.


And that's what we're seeing here with Lumiere, because of course, the fire looks really, really good.


The butterfly here also looks really cool.


The water here looks like it's moving realistically, and this smoke train also does look very, very effective.


There weren't that many demos of this, but it was enough to show us that it was really good.


Now, video inpainting is something that we did look at before. I think it was either VideoPoet or MAGVIT that showed us this, but at the time, it honestly wasn't as good as this is.


I mean, it was decent, but this is just a completely different level.


Like, I mean, imagine having just half of a video and then being able to just say, "Fill in the rest."


So basically, if you don't know what this is, this is basically just generative fill for video.


And I think that having this is just pretty crazy, because you can just say, "Okay, fill it in," or do it with a text prompt. I mean, just look at the way that the chocolate falls on this one.


It's definitely really, really, really effective at doing that.


So I think this one is definitely going to have some wide-scale uses.


And of course, this one is probably going to have the most because you can change different things.


So you can literally just say, "wearing a red scarf," "wearing a purple tie," "sitting on a stool," "wearing boots," "wearing a bathrobe."


I think a lot of this stuff is most certainly fascinating.


And another thing that we haven't taken a look at yet is, of course, the image to video.


And with image to video, I think this is really good as well because some of the models don't always generate the best images.


And if you want to be able to generate certain images yourself, you're going to want to be able to animate those specifically.


So I think that the image-to-video section of the model is rather effective as well.


And I always find it very funny and hilarious that, for some reason, all of these video models decide to use a teddy bear running in New York as some kind of benchmark.


But definitely, this one does look better than previous iterations.


And I do think that, for some reason, the text to video model is better than the image to video model, just simply based on how things are done.


But look at, for example, things like the ocean waves, or the way that the giraffe is eating grass.


I'd guess they trained this on a huge amount of data, because if you've ever seen giraffes eating grass, they do eat it exactly like that.


It's not a weird AI-generated mouth.


Also, if you do look at waves, waves look exactly like that.


Fire moves exactly like that too.


So there is a really big level of understanding, like a huge level of understanding, of what's being done here.


And I mean, we can even look at this one right here: a happy elephant wearing a birthday hat under the sea.


And then, when you hover over it, you can see the original image.


So, this is what the original image looks like.


And then, this is what the generated video looks like.


And we can see that, like, it's kicking up the water as it's moving underwater, which is, I don't know, it's kind of weird.


But it also does look pretty cool, if you ask me.


And then, this is that notable image of soldiers raising the United States flag on a windy day.


Then, we can see that it is moving.


And of course, we've got this very famous painting.


And of course, even more waves.


But, I think in certain scenarios, for example, with liquids, it seems to work pretty well.


With water, it seems to work pretty well.


And I think fireworks and, for some reason, rotating objects do now work really well.


But, I think the main question that is going to come away from this is, is Google going to release this?


Are they going to build it into a bigger project?


Or are they waiting for more to be published?


I mean, currently, it is state-of-the-art.


So, I guess we're going to have to wait to hear from Google themselves.


But I do note that one thing that's a bit different for larger companies is that there is a difference between getting AI research done and just putting it out there, versus actually having a product that people are going to use.


Because it's all well and good being able to do something which is fascinating, astounding, and it's really good.


But of course, translating that into a product that people can then use and is actually effective is another issue.


So, I don't know if they're going to do that soon.


But I will be looking out for that, because I do want to be able to use this and test it to see just how well it does on certain prompts and against things like Runway, Pika Labs, and of course, Stable Video Diffusion.


So, what do you think about this?


Let me know what your favorite feature is going to be.


My favorite feature is, of course, just the text-to-video, because I'm going to use that once it comes out, if it ever does come out.


But, other than that, I think this is an exciting project.


I think there are a lot more things to be done in this space.


And if things are continuing to move at this pace, I really do wonder where we will be at the end of the year.

