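[TANGO] Reading the English walkthrough in Japanese [May 3, 2023 | @WorldofAI]

May 3, 2023, 20:53

A walkthrough of TANGO, a text-to-audio (TTA) generation technique that uses a latent diffusion model (LDM) to turn text into audio. It can generate realistic audio from text, including human speech, animal calls, natural and artificial sounds, and sound effects.
Published: May 3, 2023
Note: it is best to play the video first and then read along.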

Hey what is up guys welcome back to another youtube video at the WorldofAI.

こんにちは、WorldofAIのYouTubeビデオにお帰りなさい。

In today's video I'm going to be showcasing an amazing new project called TANGO, which is a text-to-audio generative model that uses a large language model called Flan-T5 as its text encoder.

今日のビデオでは、TANGOという素晴らしい新しいプロジェクトを紹介します。これは、大規模言語モデルのFlan-T5をテキストエンコーダーとして使用する、テキストからオーディオへの生成モデルです。

Now, Flan-T5 has been fine-tuned for instruction and chain-of-thought-based tasks, and it basically has significantly improved zero-shot as well as few-shot performance on many natural language processing tasks.

Flan-T5は、命令（instruction）や思考の連鎖（chain of thought）に基づくタスク向けにファインチューニングされており、多くの自然言語処理タスクでゼロショットおよび少数ショット（few-shot）の性能が大幅に向上しています。

Now this is quite a remarkable application, as you're able to generate such amazing audio from nothing more than a text description using its encoder.

これは非常に注目すべきアプリケーションです。このエンコーダーを使うことで、テキストによる説明だけから素晴らしい音声を生成できるのですから。

Now throughout today's video I'm going to be showcasing you guys a little bit more about this project and a little bit more about the analysis of what it's trying to accomplish.

今日のビデオでは、このプロジェクトについてもう少し詳しく紹介し、このプロジェクトが達成しようとしていることについて、もう少し詳しく分析します。

I'm also going to take a little bit of time to install it locally on your desktop as well as showing you a lot of different examples as what they're trying to do.

また、デスクトップにインストールし、何をしようとしているのか、さまざまな例をお見せします。

And I'll also give you guys a link as to how you can play around with it on the actual web front.

そして、実際のウェブフロントでの操作方法を紹介するリンクも用意しました。

And I'll also show you guys how to install it which I said before.

そして、先ほども言ったインストール方法も紹介します。

Now with that thought guys if you guys haven't subscribed please do so there's a lot of content and a lot of value that you will definitely benefit from.

もしまだ購読していないなら、ぜひ購読してください！たくさんのコンテンツがあり、たくさんの価値があるので、間違いなく恩恵を受けることができます。

I always try my best to post daily, and I always try to strive to give you guys the best content.

私はいつも毎日投稿するよう最善を尽くし、皆さんに最高のコンテンツを提供するよう努力しています。

So I highly recommend that you check out my videos as there's a lot of stuff that you will definitely benefit from.

だから、私のビデオをチェックすることを強くお勧めします。あなた方が間違いなく利益を得ることができるものがたくさんあります。

So please subscribe, comment and like and with that thought let's get right into the video.

では、早速ビデオに入りましょう。

So as we talked about this is a new text to audio generative actual application.

さて、今回ご紹介するのは、テキストからオーディオへの新しいジェネレーティブ・アプリケーションです。

Now, the project uses Flan-T5, which is another type of LLM, and it uses a text encoder that is incorporated within the LLM.

このプロジェクトでは、LLMの一種であるFlan-T5を使用しており、LLMの中に組み込まれたテキスト・エンコーダを使用しています。

And it has been specifically fine-tuned for instructions to process the input of text data.

そして、テキストデータの入力を処理するために、命令（instruction）に特化してファインチューニングされています。

Now the TANGO model also involves training a U-net based diffusion model for audio generation.

また、TANGOのモデルでは、音声生成のためにU-netベースの拡散モデルをトレーニングします。

Now this is something that they've developed and I'll definitely be covering it over in this video.

これは、彼らが開発したもので、このビデオでぜひ取り上げたいと思います。

Now, despite training the LDM on a dataset that is significantly smaller than those used by the other state-of-the-art models, I also think that TANGO was able to perform comparably on both objective and subjective metrics.

さて、LDMのトレーニングに使われたデータセットは、他の最先端モデルが使用するものよりかなり小さいにもかかわらず、TANGOは、客観的指標と主観的指標の両方で同等のパフォーマンスを発揮することができたと私は思います。

Now, this is something that I'll be showing throughout today's video in comparison to other TTA models. I'll also go a little more in depth on what TANGO is trying to do with its model, training and inference code and its pre-trained checkpoints, so that you're able to get the best output.

今日のビデオでは、他のTTAモデルと比較しながらこの点をお見せします。また、TANGOがモデル、トレーニングコード、推論コード、そして事前学習済みチェックポイントによって何を実現しようとしているのか、最高のアウトプットを得られるよう、もう少し深く分析します。

Now you might be wondering why am I showcasing such an application when there's so many different TTAs out there.

さて、世の中には様々なTTAがあるのに、なぜ私がこのようなアプリケーションを紹介するのかと思われるかもしれません。

Well, basically, one of the main reasons is because when I show you the examples as to how amazing it produces conditional sound effects, you will understand how great this actual application is going to be.

基本的には、条件付きサウンドエフェクトの生成の素晴らしさを例として紹介することで、このアプリケーションの素晴らしさを理解していただけると思うからです。

The actual project has also been trained on four A6000 GPUs, and basically it relies on the fine-tuned Flan-T5 model.

また、このプロジェクトは4基のA6000 GPUで学習されており、基本的にはファインチューニング済みのFlan-T5モデルを利用しています。

Now this is going to make it so much more optimized with less data so that it can produce the best output.

これにより、より少ないデータで最適化され、最高のアウトプットが得られるようになっています。

Now how does this actually work?

では、実際にどのように機能するのでしょうか？

Now let's take a look at the flow chart over here.

フローチャートを見てみましょう。

Now basically TANGO's project consists of 3 main components and you can see this over here.

TANGOのプロジェクトは、基本的に3つの主要なコンポーネントで構成されており、これをご覧ください。

Now it's illustrated over here in this figure.

この図に示されている通りです。

Now the first component is the textual prompt encoders and this is where it receives the data of a text form and it takes input descriptions of a desired audio and it basically encodes it.

最初のコンポーネントは、テキストプロンプトエンコーダで、ここでテキストフォームのデータを受け取り、希望する音声の入力説明を受け取り、基本的にそれをエンコードします。

Now, the second component is the latent diffusion model, and this uses the encoder's textual representation to basically generate a latent representation of the desired audio of the input that you gave.

2つ目のコンポーネントは潜在拡散モデルで、エンコーダーのテキスト表現を使って、入力された希望する音声の潜在的な表現を基本的に生成します。

Now this latent is built starting from standard Gaussian noise, through reverse diffusion.

この潜在表現は、標準的なガウスノイズを出発点として、逆拡散によって構築されます。

Now the third component is the mel-spectrogram/audio VAE, and this is what we can see over here.

3つ目のコンポーネントは、メルスペクトログラム／オーディオVAEで、ここに示されているものです。

Now what takes place here is that a mel-spectrogram is constructed from the latent audio representation, and basically it is fed to the output stage so that you get the generated audio.

ここでは、潜在的な音声表現からメルスペクトログラムが構築され、それが出力段に渡されて、生成された音声が得られます。
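
To make the three components a bit more concrete, here is a minimal toy sketch of the flow in plain Python. It is not TANGO's code: every function name is hypothetical, the "encoder" is a character hash instead of Flan-T5, and the "denoiser" simply pulls the latent toward the text vector where a real model would use a trained U-Net. It only shows the order of the stages: encode the prompt, run reverse diffusion from Gaussian noise, then decode the latent.

```python
import math
import random


def encode_prompt(prompt, dim=8):
    """Toy stand-in for the text encoder: map a prompt to a unit-length vector."""
    vec = [0.0] * dim
    for i, ch in enumerate(prompt.lower()):
        vec[i % dim] += (ord(ch) % 32) / 32.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def reverse_diffusion(text_vec, steps=50, seed=0):
    """Toy reverse diffusion: start from Gaussian noise and refine it step by step.

    ASSUMPTION: the update rule below is a placeholder for a learned denoiser.
    """
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in text_vec]
    for _ in range(steps):
        latent = [z + 0.1 * (t - z) for z, t in zip(latent, text_vec)]
    return latent


def decode_latent(latent, frames=4):
    """Toy stand-in for the decoder: expand the latent into a small 2-D grid."""
    return [[abs(z) * (f + 1) for z in latent] for f in range(frames)]


def text_to_audio_sketch(prompt):
    text_vec = encode_prompt(prompt)      # 1) textual-prompt encoder
    latent = reverse_diffusion(text_vec)  # 2) latent diffusion model
    return decode_latent(latent)          # 3) mel-spectrogram decoder


spec = text_to_audio_sketch("A man is speaking in a huge room")
print(len(spec), len(spec[0]))  # 4 8
```

In the real system the grid produced by the last stage would be a mel-spectrogram that still has to be turned into a waveform; the toy stops before that step.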

Now let's actually take some examples into place as to get a better understanding of what it's trying to do as a text to audio application.

では、実際にいくつかの例を挙げて、テキストからオーディオへのアプリケーションとして何をしようとしているのか、よりよく理解することにしましょう。

Now if you were to give this actual prompt of a man is speaking in a huge room you're able to get this generative response using its encoders, listen through.

巨大な部屋の中で男性が話しているという実際のプロンプトを与えると、エンコーダーを使ってこのような生成的な反応を得ることができます。

Now, from this representation, you can see that the actual encoder represents what the actual descriptive text is.

この表現から、実際のエンコーダーが実際の説明的なテキストを表していることがわかります。

And you're also able to get something like let's just compare it to a small room, for example.

また、例えば、小さな部屋と比較してみましょう。

You can see that there is less of an echo and it represents a smaller room and this is quite amazing guys because it's able to do a lot of different things like for example it's able to use a studio.

エコーが少なく、小さな部屋を表現していることがわかります。これは非常に素晴らしいことで、例えばスタジオを使うなど、さまざまなことができるようになります。

In my opinion it sounds more refined.

私の意見では、より洗練されたサウンドになりました。

Now you can even add something like this, a racing car is passing by and it disappears.

レーシングカーが通り過ぎた後、消えてしまうというようなことも可能です。

Now that is quite accurate. Next: describe the sound of a battlefield. Okay, let me turn this down because I don't know how loud it's going to be.

これはかなり正確ですね。次は「戦場の音を描写して」です。どれくらいの音量になるかわからないので、音を小さくしておきます。

Now, I don't know about you guys, but this could be a huge breakthrough for different sounds as well as like when there's different copyright services that try to copyright and monopolize on different sounds, you could use certain things like this.

さて、皆さんはどうかわかりませんが、これはさまざまな音にとって大きなブレークスルーになるかもしれません。さまざまな著作権サービスが、さまざまな音に著作権を与えて独占しようとするとき、このようなものを使うことができるのです。

Now obviously, before you actually go ahead and do that, there are also limitations to it, so make sure you stay tuned for what we're going to talk about.

しかし、実際にそうする前に知っておくべき制限もありますので、これからお話しする内容をぜひ最後までご覧ください。

Now, these are some of the examples of descriptions that you can see, and there's a lot of different things that you can actually take a look at on their website.

さて、これらはあなたが見ることができる記述の例の一部であり、あなたが実際に彼らのウェブサイトを見てみることができる多くの異なるものがあります。

And I'll leave a link down in the description below so that this way you can actually get a better representation as to what they're trying to accomplish with their application.

下の説明文にリンクを貼っておきますので、この方法で実際に彼らがアプリケーションで何を達成しようとしているのか、よりよく表現することができます。

Now, there's also another model called AudioLDM. It is not built off of TANGO; rather, TANGO is built off of AudioLDM, and you can hear a huge difference between them in how much TANGO has improved.

さて、AudioLDMという別のモデルもあります。これはTANGOをもとに作られたのではなく、TANGOのほうがAudioLDMをもとに作られています。両者を比べると、TANGOがどれほど改善されたか、大きな違いがわかります。

Now let's take an actual example, say a wooden table tapping sound while water is pouring. You give it that description, and this is how AudioLDM would output it.

では、実際の例を挙げてみましょう。例えば、水が注がれる中で木のテーブルをたたく音です。この説明を与えると、AudioLDMはこのように出力します。

Now this is how TANGO would sound.

これがTANGOの音です。

I don't know about you guys, but I definitely found it better with TANGO. Obviously you can hear that the sound is very muffled, or that it has a very low-quality feel to it.

皆さんはどう思われるか分かりませんが、私はTANGOの方が断然良いと思いました。もちろん、音がかなりこもっていて、品質が低く感じられるのはお分かりでしょう。

This is because these are recorded differently, and they're not outputted properly through the right actual files.

これは、録音方法が異なるためで、正しい実際のファイルを通して正しく出力されないからです。

So just keep that in mind.

ですから、その点に注意してください。

And this is obviously if you are to generate sounds, it would be more refined, and it would sound way better.

そして、これは明らかに音を生成する場合、より洗練され、より良い音になるはずです。

Now let's take another example of maybe an elephant noise.

では、別の例として象の鳴き声を考えてみましょう。

Now I don't know what AudioLDM was trying to do there, but let's see what TANGO was able to do.

AudioLDMが何をしようとしたのかわかりませんが、TANGOが何をできたのか見てみましょう。

Now that definitely sounds like an elephant, so TANGO did a better job. Obviously you can hear that it's not the best sound, so keep that in mind.

これは確かに象のような音なので、TANGOの方が良い仕事をしました。ただし、最高の音というわけではないので、その点は心に留めておいてください。

Obviously, it's a work in progress, so you're not going to get the best generated responses right now as it's still a demo, and they're continuously going to improve on their actual app so that it can get the best responses.

もちろん、まだデモなので、今すぐ最高の反応が得られるわけではありませんが、最高の反応が得られるように、実際のアプリを継続的に改良していきます。

Now let's maybe try something that has a bigger description so that you can get a better idea as to what type of sounds that it can actually generate.

では、実際にどのような音を生成できるのか、より良いアイデアを得るために、より大きな描写を持つものを試してみましょう。

Now that is quite remarkable; even AudioLDM is able to do such an amazing job.

これはかなり注目に値します。AudioLDMでもこれだけすごいことができるんですね。

Now let's see what I believe TANGO is actually able to do.

では、TANGOが実際にどんなことができるのか見てみましょう。

Now, that is quite amazing, guys, because this is not an actual real recording.

これは実際の録音ではないので、とても驚きです。

It's actually being made using a text to audio description, which is insane, guys.

これはテキストの説明から音声を生成して作られているのですが、信じられないほどすごいことです。

And I really find this stuff to be quite remarkable as it's amazing to see the progression of different things like this, guys.

このように、さまざまなものが進化していくのを見るのは、本当に驚くべきことだと思います。

Now you might be wondering what are some of the limitations.

では、どのような制限があるのかと思われるかもしれません。

Now one of the limitations is that it has been trained on a relatively small dataset, and that is AudioCaps.

限界の1つは、AudioCapsという比較的小さなデータセットでトレーニングされていることです。

This is the actual name of their dataset, and this means that TANGO may not be able to generate good audio samples for concepts that it has not seen during training.

これはデータセットの実際の名前です。つまり、TANGOは、トレーニング中に見たことのない概念については、良いオーディオサンプルを生成できないかもしれません。

These are things like singing as well as monologues, as it has not been trained on that kind of data at this current moment.

これは歌やモノローグなどで、現時点ではそうした種類のデータでトレーニングされていないためです。

But they're obviously going to be continuously working on adding bigger datasets so they can expand their actual growth of different audio generation.

しかし、より大きなデータセットを追加することで、さまざまな音声生成の実際の成長を拡大できるよう、継続的に取り組んでいくつもりです。

Now, additionally, I also think that TANGO may not be able to finely control its audio generation through textual control prompts.

さらに、TANGOは、テキストによる制御プロンプトで音声生成を細かく制御できない場合があるとも考えています。

You can see this in these examples, where prompts with only subtle differences produce very similar results, so you're not able to get the finely differentiated sounds you asked for.

これらの例に見られるように、微妙にしか違わないプロンプトからは非常によく似た結果が生成されるため、細かく区別された音を得ることができません。

So, this is one thing that I also feel is a problem, and these are the two limitations that I currently see.

そのため、この点は私も問題だと感じていますし、これが現在私が見ている2つの限界です。

But obviously, in terms of its actual use cases, you can scroll down on the GitHub page and read the acknowledgments as well as how you can use it.

しかし、実際の使用例については、GitHubのページを下にスクロールすれば、謝辞や使用方法を読むことができます。

Please make sure that you take a look at this so that you can get a better understanding before you actually use it.

実際に使う前に理解を深めるためにも、必ず見ておいてください。

And now I'm going to take a little bit to go into how you can actually install it locally on your desktop.

そして、実際にデスクトップにローカルにインストールする方法について、少し説明します。

So first things first you got to make sure you have Git installed.

まず最初に、Gitがインストールされていることを確認する必要があります。

This is so that you're able to clone the repository onto your desktop.

これは、リポジトリをデスクトップにクローンできるようにするためです。

Secondly, you want to have Python installed, because the project is Python code and you need Python to install its packages and run it.

次に、Pythonをインストールしてください。このプロジェクトはPythonのコードなので、パッケージのインストールと実行にPythonが必要です。

And lastly, you'll need Visual Studio Code.

そして最後に、Visual Studio Codeが必要です。

This is optional as this is another code editor that you'll be using to edit as well as unpack certain things of your actual package.

これはオプションで、実際のパッケージの編集や解凍に使用するコードエディターです。

You can also just use the command prompt or terminal on Windows, Linux, or another system, but I personally use Visual Studio Code as it's much easier and more appealing to actually work with.

WindowsやLinuxなどのコマンドプロンプトやターミナルをそのまま使うこともできますが、個人的にはVisual Studio Codeの方がずっと簡単で、作業しやすいと感じています。
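
As a quick way to confirm the prerequisites before going further, a small Python check like the following can report which tools are on your PATH. This is only a sketch: the function name is made up for illustration, `code` is the launcher that Visual Studio Code can optionally add to PATH, and on some systems the Python command is `python3` rather than `python`.

```python
import shutil
import sys


def check_prerequisites(tools=("git", "python", "code")):
    """Return {tool: full path, or None if the tool is not on PATH}."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in check_prerequisites().items():
        print(f"{tool}: {path or 'not found'}")
    # The interpreter running this script is a working Python by definition.
    print("running under Python", sys.version.split()[0])
```

A missing `code` entry is harmless, since the editor is optional; a missing `git` means the clone step will fail.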

So first things first you got to make sure you clone the repository.

まず最初に、リポジトリのクローンを作成する必要があります。

You can do this by copying this link over here or you can do it by clicking on this link over here and copying this repository.

このリンクをコピーしてもいいですし、このリンクをクリックしてこのリポジトリをコピーしてもいいです。

So what you want to do now is open up command prompt.

次に、コマンドプロンプトを開いてください。

Once you have done that, type git clone, paste the link, and then press Enter.

それができたら、git cloneと入力してリンクを貼り付け、Enterキーを押します。

Now, once it's done downloading all the files, what you can do is go into the actual TANGO folder.

さて、すべてのファイルのダウンロードが完了したら、実際のTANGOのフォルダに入ります。

And that is by typing cd tango.

それにはcd tangoと入力します。

And once you have done that and you're in the folder, you can start installing what the repository needs onto your desktop.

そして、そのフォルダに入ったら、リポジトリに必要なものをデスクトップにインストールすることができます。

And you can do that by copying this command over here, pasting it, and pressing Enter.

そのためには、ここにあるコマンドをコピーして貼り付け、Enterキーを押します。

Now what it will do is it will take a couple seconds.

これには数秒かかります。

I think I got an error because I don't have the right installation of PyTorch, so make sure you install PyTorch by entering this command. Once you're able to do that, you need to install diffusers.

PyTorchが正しくインストールされていないためにエラーが出たようです。このコマンドを入力してPyTorchを必ずインストールしてください。それができたら、次にdiffusersをインストールする必要があります。
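
An error like this one can often be caught before running anything by checking which packages are importable. The sketch below does that with the standard library only. The helper name is hypothetical, and the default package list is an assumption: `torch` and `diffusers` are mentioned in the video, while `transformers` is my guess at what loading Flan-T5 needs.

```python
import importlib.util


def missing_packages(names=("torch", "diffusers", "transformers")):
    """Return the packages from `names` that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("install these first:", ", ".join(missing))
else:
    print("all required packages found")
```

This only tells you whether a package is present, not whether the installed version or its GPU support matches what the project expects.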

So once the right packages are installed, you can move on to the diffusers folder.

つまり、正しいパッケージをインストールしたら、diffusersフォルダに進みます。

What you can do is copy and paste this cd diffusers command so that you're in the diffusers folder.

このcd diffusersコマンドをコピー＆ペーストして、diffusersフォルダに入ります。

And once you're in there, you can install the diffusers package by copying and pasting this install command and running it inside the diffusers folder.

この中に入ったら、インストール用のコマンドをコピー＆ペーストして、diffusersフォルダ内でdiffusersパッケージをインストールします。
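
The install steps described above can be written down as a small dry-run script, which makes the order easier to follow. Treat it as a sketch under assumptions: the repository URL is a placeholder (copy the real clone link from the GitHub page), and the two `pip` commands are my reading of what "installing the requirements" and "installing the diffusers package" mean here, since the video never reads them out. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_install_plan(repo_url, workdir="."):
    """Return the install steps as (working_directory, command) pairs."""
    root = Path(workdir) / "tango"
    return [
        # 1) clone the repository into a folder named "tango"
        (Path(workdir), ["git", "clone", repo_url, "tango"]),
        # 2) ASSUMPTION: the repo ships a requirements.txt
        (root, ["pip", "install", "-r", "requirements.txt"]),
        # 3) ASSUMPTION: the bundled diffusers folder is installed in editable mode
        (root / "diffusers", ["pip", "install", "-e", "."]),
    ]


def run_plan(plan, dry_run=True):
    """Print each step; only execute it when dry_run is False."""
    for cwd, cmd in plan:
        print(f"[{cwd}] {' '.join(cmd)}")
        if not dry_run:
            subprocess.run(cmd, cwd=cwd, check=True)


# Placeholder URL: replace OWNER with the real account before running for real.
run_plan(build_install_plan("https://github.com/OWNER/tango.git"))
```

Keeping `dry_run=True` only prints the commands, which is a safe way to check the order before letting the script touch your machine.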

Now, obviously, I have a little problem here because I do not have the right installation for the actual files.

さて、ここで少し問題があります。実際のファイルを正しくインストールできていないのです。

So I'm not going to be going forward with that, but basically, once you have reached that, you can start working with the different things.

しかし、基本的には、ここまでくれば、さまざまなものを使って仕事を始めることができます。

And you can obviously train it as well as work with different datasets so you can obviously even play around with the interface by making it so it's easier to use and get a better generative response.

そして、訓練もできますし、さまざまなデータセットで作業することもできます。

Now this is just how you can actually play around and install it locally on your desktop.

これは、実際に遊んでみて、デスクトップにローカルにインストールする方法です。

Now I'm going to be showing you a little bit more of the actual experiment results.

では、実際の実験結果をもう少しお見せしましょう。

From what we can see here, these are the results of the TANGO project, summarized across different models, different datasets, and their parameters.

ここに見えているのは、TANGOプロジェクトの結果を、異なるモデル、異なるデータセット、およびパラメータごとにまとめたものです。

Now, the TANGO model actually performed comparably to the current state-of-the-art text-to-audio models, and despite being trained on a much smaller dataset, it has been able to outperform a lot of them, and this is something that we can see at the bottom over here.

TANGOのモデルは、テキストから音声への生成における現在の最先端モデルに匹敵する性能を発揮し、はるかに小さなデータセットで学習したにもかかわらず、その多くを上回ることができました。これは下の方で確認できます。

You can see the parameters here, as well as the metrics that basically measure different aspects of the different TTA models.

ここではパラメータのほか、さまざまなTTAモデルのさまざまな側面を測定するメトリクスを確認できます。

Now, the TANGO project also released its model, training and inference code as well as its pre-trained checkpoints for the research community to use and build upon.

TANGOプロジェクトは、モデル、トレーニングコード、推論コード、および事前学習済みのチェックポイントも公開し、研究コミュニティが利用・発展させられるようにしました。

So, this is something that's quite great and will basically promote the further research and development of the field of TTA applications.

これは非常に素晴らしいことで、TTAの応用分野のさらなる研究開発を促進することになるでしょう。

Now let's get into the actual part of where we can actually use this on the web front.

では、実際にウェブ上でどのように使えるのか、その部分に触れていきましょう。

Now this is something that I'll leave in the description below.

これは、下の説明文に残しておきます。

I'll also leave the link to the actual research paper in the description below as well as the repo and the different links that you will need to actually install it locally on your desktop.

また、実際の研究論文へのリンクと、実際にデスクトップにインストールするために必要なレポやさまざまなリンクも、以下の説明文に残しておきます。

Now, with this Hugging Face interface of the actual application, you're going to be able to use as well as generate different types of audios using a text to audio application.

さて、このHugging Faceのインターフェイスを使った実際のアプリケーションでは、テキストから音声に変換するアプリケーションを使って、さまざまな種類の音声を生成することができるようになります。

And this is something that you can do for free completely without an API key.

しかも、APIキーがなくても、完全に無料でできることです。

Now there's different examples over here.

ここで、さまざまな例を挙げてみましょう。

Now for example, if I were to pick "two gunshots followed by the birds flying away while chirping", you can click that, click submit, and you're going to be able to get a generated response.

例えば、「2発の銃声の後に鳥がさえずりながら飛び去っていく」という例を選んでクリックし、送信をクリックすると、生成された応答を得ることができます。

It's going to take a little bit longer, but this is how you're going to be able to do it on the web front.

少し時間がかかりますが、これがウェブ上でできるようになる方法です。

And the wait is obviously going to happen, as there are a lot of people using this.

多くの人がこれを使っているので、待ち時間が生じるのは当然です。

So you got to keep that in mind.

だから、そのことを心に留めておいてください。

But it's as easy as that, guys.

でも、それくらい簡単なんです、皆さん。

And you can also increase the steps as well as the guidance scale, and you can tweak the parameters to get different types of responses for what you're trying to do with your prompt.

また、ステップ数やガイダンススケールを増やしたり、パラメータを調整したりして、プロンプトで実現したいことに対してさまざまな種類の応答を得ることができます。
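
To give a feel for what the guidance scale does, here is a tiny sketch of classifier-free guidance, the usual way diffusion models mix a text-conditioned prediction with an unconditioned one. I am assuming the slider in the demo corresponds to this scale; the function name and the two-element "noise predictions" are made up for illustration.

```python
def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: move from the unconditioned prediction
    toward (and, for scales above 1, past) the text-conditioned one."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0]  # prediction without the prompt
eps_cond = [1.0, 3.0]    # prediction with the prompt

print(guided_noise(eps_uncond, eps_cond, 1.0))  # [1.0, 3.0]
print(guided_noise(eps_uncond, eps_cond, 3.0))  # [3.0, 7.0]
```

A scale of 1 just reproduces the text-conditioned prediction, while larger values push the result harder in the direction the prompt suggests. The number of steps is separate: it controls how many denoising iterations are run, which is why raising it makes generation slower.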

Now, it won't take too long, but as you can see, it's a little bit slower.

さて、それほど時間はかかりませんが、ご覧のとおり、少し遅くなっています。

And if you have a beefy GPU, I would highly recommend that you run it locally. I do not have one at the current moment, so I won't be able to do that.

高性能なGPUをお持ちの方は、ぜひローカルで実行されることをお勧めします。私は今のところ持っていないので、それはできません。
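
For anyone who does run it locally, a script might look roughly like the sketch below. Everything inside `generate_to_wav` is an assumption based on my recollection of the project's README rather than anything shown in the video: the `Tango` class, its `generate` method with a `steps` argument, the `declare-lab/tango` checkpoint name, and the 16 kHz sample rate should all be checked against the repository before use. The imports are deferred so the file still loads on a machine without the packages.

```python
import re


def prompt_to_filename(prompt, ext=".wav"):
    """Turn a free-text prompt into a safe file name."""
    stem = re.sub(r"[^a-z0-9]+", "_", prompt.lower()).strip("_")
    return (stem or "audio") + ext


def generate_to_wav(prompt, steps=100, model_name="declare-lab/tango"):
    """Generate audio for `prompt` and save it next to the script.

    ASSUMPTION: the class, method, checkpoint name, and sample rate below
    are taken from memory of the README and may not match the current code.
    """
    import soundfile as sf   # third-party, needed only when actually generating
    from tango import Tango  # provided by the cloned repository

    model = Tango(model_name)
    audio = model.generate(prompt, steps=steps)
    path = prompt_to_filename(prompt)
    sf.write(path, audio, samplerate=16000)
    return path


print(prompt_to_filename("A man is speaking in a huge room"))
# a_man_is_speaking_in_a_huge_room.wav
```

Calling `generate_to_wav("A racing car is passing by and disappearing")` would then download the checkpoint on first use, which is where the GPU starts to matter.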

But in this case, I'll just show you on the web front.

しかし今回は、ウェブ上でお見せします。

And this is how you can actually do it.

このように、実際に行うことができます。

And basically, that's it for this actual application, guys.

と、基本的に、この実際のアプリケーションはこれで終わりです、みんな。

I hope you found this TANGO application, which is a text-to-audio application, interesting, and I hope you got some value out of this.

テキストからオーディオを生成するアプリケーションであるTANGOを興味深く感じていただけたなら、そしてここから何らかの価値を得ていただけたなら幸いです。

And there's going to be a lot of different releases as well as use cases for this.

そして、これの使用例だけでなく、さまざまなリリースがあるはずです。

So I highly recommend that you keep a tab on this as it's going to be something that they're going to continuously develop and evolve over the coming weeks and years.

今後数週間、数年にわたり、継続的に開発され、進化していくものですので、ぜひご注目ください。

So thank you so much for watching guys, I hope you found this video quite informative and with that thought I'll see you guys next time, have an amazing day and I'll catch you later.

ご視聴ありがとうございました。このビデオが有益なものであったなら幸いです。それではまた次回お会いしましょう。素晴らしい一日をお過ごしください。

Bye guys.

それではまた。
