


China just went ahead and released its own text-to-video tool, and it is pretty incredible.


I'm going to show you guys a quick sample of some of the clips, and then we'll dive into the good stuff.


What you just saw was, of course, the very, very impressive Kling AI.

This is the Kling AI video generation tool, launched by Kuaishou, a major Chinese technology company that was founded in 2011 and is headquartered in Beijing.

I genuinely have to say that on some of these demos, I would argue it actually surpasses Sora in terms of consistency and the quality of the clips.


Trust me when I say that you need to watch this video until the end, because that's when you'll truly understand how great this system is at generating high-quality clips with a very decent amount of consistency across the scene, with characters that remain stable and consistent.


For example, clips like this are just a remarkable display of true understanding of exactly what's going on.


It shows us that AI is coming to a point where other nations are truly starting to catch up and even slowly surpass some of the state-of-the-art models in certain areas like text-to-video.


Let's dive into exactly what makes this entire system so effective, how it actually works, and how the team managed to crack this in such a short time frame.


There are six different things that they talk about on their webpage, and I'm going to show you guys exactly what they are.


One of the things that they talk about is 3D spatial temporal attention.


You can see right here that the prompt we have is "a man riding a horse in the Gobi Desert with a beautiful sunset behind him, movie-quality scene."


Essentially, this is where they've adopted a 3D spatial-temporal attention mechanism, which can better model complex spatial-temporal motion and generate video content with larger movements while conforming to the laws of motion.


This is far from their best clip.


In fact, this is probably the worst clip that you're going to see in the entire video.


But essentially what they talk about here is consistency: when you're generating clips that have a lot of different moving parts and a lot of motion in them, it's very difficult to keep everything consistent.
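To make the idea of attention across both space and time a bit more concrete, here is a rough sketch in plain Python. This is not Kling's actual architecture, which the website doesn't spell out. It's a toy version that assumes the video is split into a grid of patch features, and it uses the raw features as queries, keys, and values with no learned projections. The names flatten_video and joint_attention are made up for this illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def flatten_video(video):
    """Turn a [T][H][W] grid of feature vectors into one flat token list.

    Each token keeps its (t, y, x) position so results can be mapped back.
    """
    tokens, positions = [], []
    for t, frame in enumerate(video):
        for y, row in enumerate(frame):
            for x, feature in enumerate(row):
                tokens.append(feature)
                positions.append((t, y, x))
    return tokens, positions

def joint_attention(tokens):
    """Toy self-attention in which every token attends to every other token,
    across space AND time at once (assumption: no learned projections)."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        # Scaled dot-product score against every patch in every frame.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        attn = softmax(scores)
        # Weighted average of all token features.
        mixed = [sum(a * value[d] for a, value in zip(attn, tokens)) for d in range(dim)]
        outputs.append(mixed)
        weights.append(attn)
    return outputs, weights

# Hypothetical toy video: 2 frames, each 1x2 patches, 2-dimensional features.
toy_video = [
    [[[1.0, 0.0], [0.0, 1.0]]],
    [[[1.0, 0.0], [0.0, 1.0]]],
]
tokens, positions = flatten_video(toy_video)
outputs, weights = joint_attention(tokens)
print(len(tokens), "tokens; each attends over", len(weights[0]), "positions")
```

The point of the sketch is that a patch in the first frame can attend directly to the matching patch in a later frame, which is one plausible reason this kind of attention helps keep things like the horse's legs and the dust trails consistent over time.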


With this clip, we can see that things remain consistent.


We have the person riding, and we can see their body doing what it should be.


This is how riders move when they're on a horse.


We also have the dust trails, and of course, we have the legs of the horse moving in sync with the entirety of this clip, as well as the background that is moving correctly.


This is something that is remarkably impressive.


They also demonstrate another example here where an astronaut runs on the lunar surface.


The low angle shot shows the vast background of the moon, and the movements are smooth and light.


So what we can see here is a clip of an astronaut running across the moon, and it's a very, very decent one.


I don't think this is their highly scaled model.


I'm guessing that this was just where they wanted to showcase what you can do when you have the camera angle panning from below all the way to above.


So this was an example where they're trying to show character consistency across different camera angles.


While yes, the quality isn't incredible, I still think it shows what this system is able to do, because if you take a look at, for example, the shadows that most people wouldn't even think to look at, they do look remarkably accurate.


Now let's take a look at another example of their 3D spatial temporal attention mechanism in action.


This is where we have the most interesting thing, and this is by far one of the most interesting demos that you'll probably see in the entire video.


There are a lot more that are remarkably impressive, and I was truly shocked by this.


I know "shocked" is a meme on this channel and in the AI community, but it was genuinely pretty surprising that they managed to catch up to, or at least be on the level of, Sora in such a short time frame.


This is where they say that, thanks to the efficient training infrastructure, extreme inference optimization, and scalable infrastructure, the Kling large model can generate videos up to two minutes long at 30 frames per second.


That's the info that they have on their website, and I think this is arguably more impressive than the OpenAI Sora video, because what we are seeing here is a two-minute-long video that is remarkably consistent in its background animation, or I guess you could say whatever the background footage is.


I mean, what we are seeing here is truly, truly impressive.


I would argue that this is much longer than some of the Sora demonstrations because Sora demonstrations, as far as I know, were limited to around one minute.


Now they might be working on Sora 2, but if we take a look at what we're seeing here, this is truly a remarkable level of temporal consistency.

Because what we have to truly think about here is that the AI system needs to understand exactly what's going on over a longer period of time.


You have to understand that the longer the context is, the harder it is for these AI systems to stay consistent.
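To put those numbers in perspective, here is a quick back-of-the-envelope calculation of how many frames the system has to keep consistent. The two-minute length and the 30 frames per second come from the website. The one-minute comparison uses the rough Sora limit mentioned above, and running it at the same 30 frames per second is purely an assumption so the two are comparable.

```python
def total_frames(duration_seconds, frames_per_second):
    """Number of individual frames in a clip of the given length."""
    return duration_seconds * frames_per_second

# Figures quoted from the Kling website: up to two minutes at 30 fps.
kling_frames = total_frames(2 * 60, 30)

# Assumption for comparison only: a one-minute clip at the same frame rate.
one_minute_frames = total_frames(60, 30)

print("Two-minute clip:", kling_frames, "frames")
print("One-minute clip:", one_minute_frames, "frames")
print("Ratio:", kling_frames / one_minute_frames)
```

So a two-minute clip means thousands of frames that all have to agree with each other, which is why the length matters so much here.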


We can see that this consistency is maintained across the entirety of this two-minute-long clip.


There was another example, but I didn't choose to include it because it's not as good as this one in terms of actually explaining what's going on.


I think this example right here works well, and you can even see that there are literal train lines as the train is going across. Maybe the background doesn't entirely make sense, because one place looked like Rome and then another looked like the Arctic, so maybe it's missing some of the small details, but I think video generation up to two minutes long with this level of consistency is still impressive.


And usually with the kind of AI systems that we're working with, the longer the systems generate for, the more errors you start to see, because things just get lost in translation, I guess you could say.


As the information is processed through the AI system, a lot of it does get lost, which is why a lot of the early AI video systems that we used to see only produced two to three seconds of video.


And now you can see we've got things that are up to two minutes long, and there doesn't seem to be any real glitchiness or any real loss of quality regarding what's going on here.


This is something that I think is remarkably impressive because it shows that this system is able to stay consistent: it's able to look at what the scenery is like and generate consistent footage for whatever scene may be next.


And all of the motion that's going on here, I think this is genuinely really, really remarkable and impressive.


Now, one of the most impressive things that we did see with other AI systems was their ability to simulate the physical world properties.


This was something that was talked about in the Sora paper, where it was hailed as a new capability that was kind of emergent, because it was something that we didn't really expect.


But of course, as these AI systems are trying to predict the next frame or make the videos all in one go, which is usually the architecture that we know they use, they have to understand how the physical world works in order to create a video clip that actually looks realistic.


And whatever kind of world model they may have internally, this shows us that they're able to simulate the physical properties of the real world and generate videos that conform to the laws of physics.


So here we can see that the prompt is: "Carefully pour the milk into the cup. The milk flows steadily and the cup is gradually filled with milky white."


So that's the actual extracted prompt from the website, and what we can see here is remarkable consistency in such a short clip.


Now, there's another clip that I do want to show you guys that is remarkably impressive, because I would say if there is one video clip that you take away from this video, it's going to be this one.


So take a look at this clip right here: a Chinese man sitting at a table eating noodles with chopsticks.


And I would have to argue that if I personally saw this clip at, say, 480p on a forum or something, I wouldn't for the life of me think that it was AI generated at all.


But we know that this actually is AI generated, and it looks remarkably impressive because at the start, the man doesn't actually have sauce around his lips. But as he slurps up the noodles, you can see that there is all of this mess around the lips, and that's because of the orange sauce at the bottom of the, I think, noodles here.


So I think it's rather impressive that such a subtle detail is captured by the AI system, which is truly, in my opinion, remarkable because it shows that all of these small details are captured, and the system isn't really missing out on any of the finer details that we expect from traditional video footage.


So this was one of the clips that I think truly showed people that, hey, this is a system that is really, really up there in terms of its ability to generate clips that are impressive.


And I think you'd only notice if you were actually focusing on the hands, because the hands don't look as realistic. I mean, you can see just a little bit of inconsistency, just a little bit, but enough to let you know that what you're watching is AI generated.


But I think this is, of course, something that is just remarkable in itself, especially the way the noodles move and the fact that the guy's emotions look very, very realistic.


There was also this example, where the prompt is a chef chopping onions in the kitchen, preparing for a dish.


And I would argue that yes, this isn't as good as the previous one, and it isn't as long as the previous one, but it still is a demonstration of simulating the physical world's properties.


And the reason why they've likely included this one is because what you are doing in this video clip right here is that you are basically changing the physical nature of that onion.


So essentially the reason that this is so impressive is that the system has to truly understand what is going to happen to an onion when it is cut by this blade.


And you can see that as it is cut, more of the onion is processed, and then the pieces are split apart by the knife, which is pretty impressive because this shows a decent level of understanding by this AI system.


I would say that it is very, very hard to get this kind of consistency with whatever AI system you are using.


This AI system was truly impressive because there are also other examples of it being able to generate high-quality things and just do a whole bunch of other useful things that we may not have even thought about.


So, one of the things that they spoke about was, of course, the strong concept combination ability.


So, this AI system is remarkably good at combining different concepts together.


So, this is a white cat driving a car through a busy downtown street with tall buildings and pedestrians in the background.


The reason that they've done this example is because this footage doesn't exist.


So, a cat driving a car downtown through a busy city street, of course, footage like this hasn't been recorded before.


It doesn't really exist on any person's hard drive or in any of those large databases where they just house millions of royalty-free stock videos.


I'm guessing that what we have here is a situation where they're demonstrating this AI system's ability to generate new and interesting videos that haven't existed before, and combine existing videos with other new concepts to create new pieces of material.


Which is, of course, very, very fascinating because it shows us that this is a system that doesn't fail when it tries to mimic exactly what is going on with the real world.


We can see the background is, of course, very, very good in its consistency and we can see that even the subtle movements of the cat as it looks around and drives the car, those seem quite realistic, if I say so myself.


Now, once again, you can see that they've demonstrated this ability here, where we have a macro lens shot of a volcano erupting in a coffee cup, a scene that, of course, you wouldn't ever see unless you somehow managed to have a volcano erupting in your coffee cup.


But what we have here is a demonstration of exactly how great this system is.


So, we've got a situation on our hands where it's not just good at replicating some of the footage that we've seen before. It manages to show us how the liquid from the volcano actually transfers into this coffee-style liquid and melts along the cup edge here.


And one of my personal favorites from this entire strong concept combination section was this LEGO character visiting an art gallery.


I thought that the reason this was so good was because this video clip actually captured the nuances of how LEGO characters walk.


If you've ever seen a LEGO movie, you'll know that the characters in the movie actually walk exactly like this, so it's remarkably surprising that they were able to capture exactly how this LEGO character walks.


And of course, you can see even on the right there, as a little Easter egg, there is also a LEGO character there too.


It's very interesting because what's fascinating, as well, was that this character on the right was in focus.


As the LEGO character keeps walking forward and forward, it then shifts to being out of focus.


Like I said before, if you're someone that doesn't really understand how videos work because you've never worked in media before, you might miss some of the subtle details as well as some of the subtle mistakes.


But I think you can grow to appreciate them more, especially if you've had that background, which is why when I look at some of these clips, they truly do make me pretty impressed.


I think that this one here was really, really cool because it showed the ability to capture specific details across many, many different clips.


One of the things that I really did like, and I have to say that personally this is my favorite feature of this video system, is movie-quality image generation.


One of the biggest gripes that we've had, and that I've personally had, with video AI systems is the fact that they just don't have good quality.


Whilst yes, temporal consistency is something that we do look for in these video clips, the problem is right now that the quality is just not there.


But you can see right here with the prompt that we have, this is a very high quality clip that looks remarkably accurate of what we've described.


Now, I want to show you guys this clip instead because this is the clip right here that showcases just how good the quality is in terms of what you're getting here.


And if I'm being completely honest with you, the quality here might not look as good as it can be, because I've of course downloaded this clip, then I've uploaded it, then I've recorded my screen, then I've processed the video again, and then it's been uploaded to YouTube, where of course it's been compressed again.


So trust me when I say, when you see this raw video, it actually looks remarkably impressive in terms of its quality.


This is something that I'm not just stating for the video, but it does look really, really high quality, higher quality than anything I've seen.


Now, of course, with post-processing in any video upscaling software that you want to use, I think in the future this is not going to be a difficult problem to solve at all.


But I do think that having a system that can natively output high-quality footage is going to be something that is a game-changer for industries.


And of course, you can see right here that this is a prompt of a chimney under the sunset, and this is where you can start to see that the high-quality nature of this AI system isn't just for show.


It's something that is truly, truly, truly impressive.


So I think that when we take a look at all of these factors combined, and the fact that this system apparently is in alpha testing, as in some people are actively able to use it, it shows us that China is rapidly advancing with its video models and all the other models it is currently building.


Now, another feature that they actually spoke about was the varied aspect ratio.


So they spoke about how Kling adopts a variable resolution training strategy, which allows it to output a variety of different video aspect ratios for the same content during the inference process, meeting the needs for video materials in richer scenarios.


So that's what the website said, but essentially we have it here in a 1080 by 1080 scene, and then on the left here we have it in a 920 by 1080 scene, which is basically just, of course, the portrait edition.


And then, of course, we have this square edition.


There was a landscape edition, but I didn't include it because I'm sure you guys can completely understand the picture of whatever this AI system is trying to do.
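Just to show how those sizes relate to one another, here is a small sketch that reduces a width and height to an aspect ratio and labels the orientation. The 1080 by 1080 and 920 by 1080 sizes are the ones mentioned above. The 1920 by 1080 landscape size is only an assumed example, since the landscape clip isn't shown here, and describe_aspect is a made-up helper name.

```python
from math import gcd

def describe_aspect(width, height):
    """Reduce width x height to its simplest ratio and label the orientation."""
    divisor = gcd(width, height)
    ratio = f"{width // divisor}:{height // divisor}"
    if width == height:
        orientation = "square"
    elif width < height:
        orientation = "portrait"
    else:
        orientation = "landscape"
    return ratio, orientation

# The first two sizes are from the video; the third is a hypothetical landscape size.
for width, height in [(1080, 1080), (920, 1080), (1920, 1080)]:
    ratio, orientation = describe_aspect(width, height)
    print(f"{width} by {height}: {ratio} ({orientation})")
```

The idea is that the same content gets produced at each of these shapes, rather than one shape being cropped out of another, at least going by what the website describes.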


But I genuinely have to say that this clip is something that I personally think is probably one of the most realistic clips.


And of course, when we take a look at some of these, for example, this bird right here is very, very high quality, and this road right here shows the kind of consistency that I just wouldn't have expected. This one right here shows real-world physics being demonstrated, there's a very, very consistent fish underwater, and of course, one of my favorites was the panda playing the guitar.


Now, there was one clip that I actually did forget to add, but I'm going to show it to you all now because the consistency of that clip is remarkable.


And I'm going to show you guys why, although it was partly demonstrated a little bit before.


So this is the clip I wanted to show you guys before the video ended.


So this was a clip where we have a little boy eating a burger, but take a look at what happens because there was also this clip from Sora, and I would argue that it was remarkably impressive for this very reason.


So he takes a bite of the burger, and then you can see, literally as he's taken the bite, that there is quite a lot of mess around his mouth, which I think is remarkably accurate for how kids eat.


It's managing to simulate the fact that there might be certain particles, well, not actually particles, let's just call them crumbs, left on his face.


I just thought that this is an eerily realistic generation for such an AI system.


Overall, I think that what this is going to do for the dynamics of the AI marketplace is show us that China can compete quickly and efficiently, not only matching the state of what the United States is doing in terms of AI development, but in some instances even managing to surpass it.


Which means that now that China is of course focusing a lot of its efforts on these kinds of systems, and we've seen a variety of different advancements across many different domains, I genuinely wouldn't be surprised if in a couple of months we get a bunch of different Chinese AI tools that are far superior to what the United States has.


And it may create an even worse terminal race condition where other nations are fighting to develop the very best AI systems, which could lead to detrimental outcomes.


Now I know that yes, this is literally just a text-to-video AI video, but of course it shows us that this kind of technology was something we heralded as completely impossible just 18 months ago.


And now we have a system that some people would say is just remarkably realistic.


So I would ask, overall, what does this do for your timelines in terms of where you think AI is going to go? Because I think that if it weren't for Sora or Google's recent Veo, we would be even more shocked by this kind of demo.


But for me personally, this makes me believe that the kind of capable systems we're going to be getting in the future, even if they come not from the United States but from another country, are definitely going to be absolutely incredible.


Because if another country releases this tool and a lot of people are using it, then man, the fight for customers and of course the marketplace is going to be very, very interesting to watch.


With that being said, let me know what your favorite demo was.


Was it the man eating the noodles with chopsticks?


Was it the high-quality blue rose petals in HD?


Was it the chimney under the sunset that looked remarkably interesting?


Or was it the very long video generation up to two minutes long, which showed remarkable consistency across many different areas and demonstrated a very good ability to generate the physical world?


Or was it, of course, the cat driving around the city with tall buildings and pedestrians in the background?


I'd love to know what you guys thought about this.


Do you think this is actually a major AI update, or do you think this is not something worth your time?


Otherwise, I'll see you guys in the next video.

