In today's video I'm going to be showcasing an amazing new project called TANGO, which is a text-to-audio generative model that uses a large language model called Flan-T5 as its text encoder.


Now, Flan-T5 has been fine-tuned on instruction and chain-of-thought-based tasks, and that has significantly improved its zero-shot as well as few-shot performance on many natural language processing tasks.


Now this is quite a remarkable application, as you're able to generate audio from nothing more than a text description using that encoder.


Now throughout today's video I'm going to be showing you guys a little bit more about this project and a little bit more analysis of what it's trying to accomplish.


I'm also going to take a little bit of time to install it locally on a desktop, as well as showing you a lot of different examples of what it can do.


And I'll also give you guys a link as to how you can play around with it on the actual web front.


And I'll also show you guys how to install it which I said before.


So as we talked about, this is a new text-to-audio generative application.


Now, the project uses Flan-T5, which is an instruction-tuned LLM, as its text encoder.


And because Flan-T5 has been specifically fine-tuned on instructions, it's well suited to processing the text you give it as input.


Now the TANGO model also involves training a U-Net-based latent diffusion model (LDM) for audio generation.


Now this is something that they've developed and I'll definitely be covering it over in this video.


Now, despite training the LDM on a dataset that is significantly smaller than those used by the other state-of-the-art models, I also think that TANGO was able to perform comparably across both objective and subjective metrics.


Now, this is something that I'll be showing throughout today's video in comparison to other text-to-audio (TTA) models, as well as getting a little more in-depth on what TANGO offers: its model, its training and inference code, and its pre-trained checkpoints, so that you're able to get the best output.


Now you might be wondering why am I showcasing such an application when there's so many different TTAs out there.


Well, basically, one of the main reasons is that when I show you the examples of how well it produces sound effects conditioned on text, you will understand how great this application is going to be.


The project has also been trained on four A6000 GPUs, and basically it relies on the instruction-fine-tuned Flan-T5 model for its text encoding.


Now the idea is that this lets it get by with less training data and still produce good output.


Now how does this actually work?


Now let's take a look at the flow chart over here.


Now basically TANGO's project consists of 3 main components and you can see this over here.


Now it's illustrated over here in this figure.


Now the first component is the textual-prompt encoder, and this is where it receives the text: it takes the input description of the desired audio and encodes it.


Now, the second component is the latent diffusion model, and this uses the encoder's textual representation to generate a latent representation of the audio you described.


Now this latent, or audio prior, is built from standard Gaussian noise through reverse diffusion.
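
To make the reverse diffusion idea a bit more concrete, here is a toy sketch in plain Python. It is not TANGO's code: the `target` list is an assumption that stands in for what a trained, text-conditioned denoiser would steer toward, and the function name and step schedule are made up for illustration. All it shows is the loop shape, which starts from Gaussian noise and removes a bit of predicted noise at every step.

```python
import random


def reverse_diffusion(target, steps=50, seed=0):
    """Toy reverse diffusion: start from Gaussian noise and denoise step by step.

    `target` is a hypothetical stand-in for the clean latent; a real model
    predicts the noise with a neural network conditioned on the text.
    """
    rng = random.Random(seed)
    # Start from standard Gaussian noise, one value per latent dimension.
    latent = [rng.gauss(0.0, 1.0) for _ in target]
    for step in range(steps):
        remaining = steps - step
        # Toy "predicted noise": the gap between the current latent and the target.
        predicted_noise = [x - t for x, t in zip(latent, target)]
        # Remove a fraction of that noise; the final step removes what is left.
        latent = [x - n / remaining for x, n in zip(latent, predicted_noise)]
    return latent
```

Calling `reverse_diffusion([0.5, -1.0, 2.0])` ends up at the target values, up to floating-point rounding, no matter which noise it started from.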


Now the third component is the mel-spectrogram/audio VAE, and this is what we can see over here.


Now what's taking place here is that the latent audio representation is decoded into a mel-spectrogram, and that is fed to a vocoder, which gives you the final generated audio.
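
The three components described above can be sketched as three functions chained together. This is only a toy under stated assumptions: every function name here is hypothetical, the "encoder" is a hash instead of Flan-T5, the "diffusion" step is a fixed rescaling, and the "decoder" emits short sine bursts instead of a mel-spectrogram and vocoder. The point is just the data flow from text, to text representation, to latent, to audio samples.

```python
import hashlib
import math


def encode_prompt(prompt, dim=4):
    """Stand-in text encoder: hash the prompt into `dim` values in [0, 1)."""
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    return [digest[i] / 256 for i in range(dim)]


def generate_latent(text_embedding):
    """Stand-in for the latent diffusion step: rescale each value to [-1, 1)."""
    return [2 * value - 1 for value in text_embedding]


def decode_audio(latent, samples_per_value=8):
    """Stand-in decoder: turn each latent value into a short sine burst."""
    audio = []
    for value in latent:
        freq = 2 + value  # arbitrary mapping from latent value to frequency
        for n in range(samples_per_value):
            audio.append(math.sin(2 * math.pi * freq * n / samples_per_value))
    return audio


def text_to_audio(prompt):
    """Chain the three stages: encode the text, build a latent, decode to audio."""
    return decode_audio(generate_latent(encode_prompt(prompt)))
```

For example, `text_to_audio("a man is speaking in a huge room")` returns a list of 32 samples between -1 and 1, and the same prompt always gives the same list.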


Now let's actually listen to some examples to get a better understanding of what it does as a text-to-audio application.


Now if you were to give it the prompt "a man is speaking in a huge room", this is the generated response you get. Have a listen.


Now, from this sample, you can hear that the output really does reflect what the descriptive text says.


And you're also able to get something like let's just compare it to a small room, for example.


You can hear that there is less of an echo, so it sounds like a smaller room, and this is quite amazing, guys, because it can handle a lot of different settings. For example, it can do a studio.


In my opinion it sounds more refined.


Now you can even add something like this, a racing car is passing by and it disappears.


Now that is quite accurate. The next one is "describe the sound of a battlefield". Okay, let me turn this down because I don't know how loud it's going to be.


Now, I don't know about you guys, but this could be a huge breakthrough for sound effects, especially when different services copyright and monopolize certain sounds. You could generate your own with something like this instead.


Now obviously there are limitations to it, so before you actually go and do that, make sure you stay tuned for what we're going to talk about.


Now, these are some of the examples of descriptions that you can see, and there's a lot of different things that you can actually take a look at on their website.


And I'll leave a link down in the description below so that this way you can actually get a better representation as to what they're trying to accomplish with their application.


Now, there's also a different model called AudioLDM. It's not built off of TANGO; rather, TANGO builds on AudioLDM, and you can hear a huge difference in how much TANGO has improved on it.


Now let's take an actual example, maybe "a wooden table tapping sound while water is pouring". You give it the description, and this is how AudioLDM would output it.

Now this is how TANGO would sound.


I don't know about you guys but I definitely found it better with TANGO and obviously you can hear that the sound is very muffled or it has a very low quality feel to it.


This is because the clips are being recorded differently here, and they're not being played back from the original output files.


So just keep that in mind.


If you generate the sounds yourself, they will be more refined and sound way better.


Now let's take another example of maybe an elephant noise.


Now I don't know what AudioLDM was trying to do there, but let's see what TANGO was able to do.


Now that definitely sounds like an elephant so TANGO did a better job and obviously you can see that it's not the best sound so keep that in mind.


Obviously, it's a work in progress, so you're not going to get the best generated responses right now as it's still a demo, and they're continuously going to improve on their actual app so that it can get the best responses.


Now let's maybe try something that has a bigger description so that you can get a better idea as to what type of sounds that it can actually generate.


Now that is quite remarkable. Even AudioLDM is able to do such an amazing job.


Now let's see what TANGO is actually able to do.


Now, that is quite amazing, guys, because this is not an actual real recording.


It's actually being made using a text to audio description, which is insane, guys.


And I really find this stuff to be quite remarkable as it's amazing to see the progression of different things like this, guys.


Now you might be wondering what are some of the limitations.


Now one of the limitations is that it has been trained on a relatively small dataset, and that is AudioCaps.


This is the actual name of the dataset, and this means that TANGO may not be able to generate good audio samples for concepts that it has not seen during training.


And that includes things like singing as well as monologues, as the dataset it was trained on doesn't cover those at this current moment.


But they're obviously going to keep working on adding bigger datasets so they can expand the range of audio it can generate.


Now, additionally, I also think that TANGO may not always be able to finely control its audio generation through textual control prompts.


You can see it in these examples, where the prompts have only subtle differences and the generated sounds don't really reflect those differences.


So, this is one thing that I also feel is a problem, and these are the two limitations that I currently see.


But obviously, in terms of its actual use cases, you can scroll down on GitHub and read the acknowledgments as well as how you're allowed to use it.


Please make sure that you take a look at this so that you can get a better understanding before you actually use it.


And now I'm going to take a little bit to go into how you can actually install it locally on your desktop.


So first things first you got to make sure you have Git installed.


This is so that you're able to clone the repository onto your desktop.


Secondly, you want to have Python installed, because that is what the project runs on and what you'll use to install its packages.


And lastly, you can get Visual Studio Code.

This one is optional; it's just the code editor that I'll be using to run the commands and edit the files in the package.


You can also just use the command prompt on Windows or a terminal on Linux or whatever system you're on, but I personally use Visual Studio Code as it's much easier and more appealing to work with.
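
If you want to confirm those prerequisites before going further, a small check like this is one way to do it. It's a minimal sketch using only Python's standard library; the helper name `find_tools` is made up, and the command names `git`, `python` and `code` are assumptions about how the tools are installed on your machine (on some systems Python is `python3`, for example).

```python
import shutil


def find_tools(names):
    """Map each command name to its full path on PATH, or None if it is missing."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # "git" and "python" are the required tools from the walkthrough;
    # "code" is the optional Visual Studio Code launcher.
    for tool, path in find_tools(["git", "python", "code"]).items():
        print(f"{tool}: {path or 'not found'}")
```

Anything printed as "not found" needs to be installed, or added to your PATH, before the steps that follow will work.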

So first things first you got to make sure you clone the repository.


You can do this by copying this link over here or you can do it by clicking on this link over here and copying this repository.


So what you want to do now is open up command prompt.


Once you have done that, type git clone, paste the link, and then press Enter.

Now, once it's done cloning all the files, what you can do is go into the actual TANGO folder.


And you do that by typing cd tango.

And once you have done that and you're in the folder, you can start installing the packages that the repository needs.


And you can do that by copying and pasting this command over here and pressing Enter.


Now what it will do is it will take a couple seconds.


I think I got an error because I don't have the right installation of PyTorch, so make sure you install it by putting this in, and then once you've done that you need to install the diffusers package.
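
An error like that one is easier to catch up front. Here is a small sketch, standard library only, that reports which packages Python can't find. The helper name `missing_packages` is made up for illustration, and `torch` and `diffusers` are the import names I'm assuming for PyTorch and the diffusers package.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that Python cannot find for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["torch", "diffusers"])
    print("Missing:", ", ".join(missing) if missing else "nothing")
```

If `torch` shows up as missing, install PyTorch first and then rerun the install step.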


So once the requirements have installed, the next step is to go into the diffusers folder that comes with the repository.


And what you can do is copy and paste this, cd diffusers, into the command prompt so that you're in the diffusers folder.


And once you're in there, you can install the diffusers package by copying and pasting this command and pressing Enter.
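
The install steps above can be sketched as a short Python script. Treat this as a sketch under assumptions, not the project's official instructions: the video only says "this link" and "this command", so the repository URL is left as a parameter you fill in yourself, and the `requirements.txt` file and the editable `pip install -e .` inside the diffusers folder are my assumptions about what those commands are. The function names are hypothetical too.

```python
import subprocess


def install_commands(repo_url, folder="tango"):
    """Build the steps as (working_directory, argv) pairs; nothing is run here."""
    return [
        # Clone the repository into `folder`.
        (".", ["git", "clone", repo_url, folder]),
        # Assumption: dependencies are listed in a requirements.txt file.
        (folder, ["python", "-m", "pip", "install", "-r", "requirements.txt"]),
        # Assumption: the bundled diffusers folder is installed in editable mode.
        (f"{folder}/diffusers", ["python", "-m", "pip", "install", "-e", "."]),
    ]


def run_all(commands):
    """Run each step in its working directory and stop at the first failure."""
    for cwd, argv in commands:
        subprocess.run(argv, cwd=cwd, check=True)
```

To actually run it, call `run_all(install_commands(url))` with the clone link from the project's GitHub page. Because of `check=True`, a failed step, like the PyTorch error mentioned earlier, stops the script instead of carrying on.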


Now, obviously, I have a little problem here because I do not have the right installation for the actual files.


So I'm not going to be going forward with that, but basically, once you have reached that, you can start working with the different things.


And you can obviously train it as well as work with different datasets, and you can even play around with the interface to make it easier to use and to get a better generated response.


Now this is just how you can actually play around and install it locally on your desktop.


Now I'm going to be showing you a little bit more of the actual experiment results.


From what we can see here, this table summarizes the results for TANGO alongside the different models, their datasets, and their parameters.


Now, the TANGO model actually performed comparably to the current state-of-the-art text-to-audio models, and despite being trained on a much smaller dataset, it has been able to outperform a lot of them, and this is something that we can see at the bottom over here.


Next to the parameters, you can see the metrics, and each of those measures a different aspect of how well a TTA model performs.


Now, the TANGO project also released its model, its training and inference code, and its pre-trained checkpoints for the research community to use and build upon.


So, this is something that's quite great and will basically promote the further research and development of the field of TTA applications.


Now let's get into the actual part of where we can actually use this on the web front.


Now this is something that I'll leave in the description below.


I'll also leave the link to the actual research paper in the description below as well as the repo and the different links that you will need to actually install it locally on your desktop.


Now, with this Hugging Face interface for the application, you're going to be able to generate different types of audio from a text description.

And this is something that you can do for free completely without an API key.


Now there's different examples over here.


Now for example, if I pick "two gunshots followed by the birds flying away while chirping", you can click that, click submit, and you're going to get a generated response.


It's going to take a little bit longer, but this is how you're going to be able to do it on the web front.


And the wait is obviously going to happen, as there are a lot of people using this.


So you got to keep that in mind.


But it's easy as that, guys.


And you can also increase the steps as well as the guidance scale, and you can tweak the parameters to get different types of responses for what you're trying to do with your prompt.
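
For a rough idea of what the guidance scale does, here is a small sketch of classifier-free guidance, which is the usual meaning of that setting in diffusion models. I'm assuming the demo's slider works this way; I haven't checked how TANGO applies it, and the function name is made up. The idea is that at each step the model makes one noise prediction without the text and one with it, and the scale controls how hard the result is pushed toward the text-conditioned one.

```python
def apply_guidance(noise_uncond, noise_cond, guidance_scale):
    """Classifier-free guidance: move the prediction toward the text-conditioned one.

    A scale of 0 ignores the text, 1 uses the conditioned prediction as-is,
    and larger values push further in the direction the text suggests.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]


# With a scale of 3, the first value is pushed three times as far toward the
# text-conditioned prediction; the second is unchanged since both already agree.
print(apply_guidance([0.0, 1.0], [1.0, 1.0], 3.0))  # prints [3.0, 1.0]
```

The steps setting is the other knob: more steps means more denoising passes, which is part of why a generation takes longer.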


Now, it won't take too long, but as you can see, it's a little bit slower.


And if you have a beefy GPU, I would highly recommend that you run it locally. I don't have one at the current moment, so I won't be able to do that.


But in this case, I'll just show you on the web front.


And this is how you can actually do it.


And basically, that's it for this actual application, guys.


I hope you found TANGO, which is a text-to-audio application, interesting, and I hope you got some value out of this.


And there's going to be a lot of different releases as well as use cases for this.


So I highly recommend that you keep a tab on this as it's going to be something that they're going to continuously develop and evolve over the coming weeks and years.


