[Soundstorm: Hyper-Realistic Voiceovers] Reading the English Commentary in Japanese [June 26, 2023 | @The AITELLIGENCE]

June 26, 2023, 23:11

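Google has announced a new product called Soundstorm. The software can generate hyper-realistic speech: it can clone any voice from just a three-second recording and generate 30 seconds of audio in 0.5 seconds. The technology is impressive, but it raises concerns about potential risks and misuse, such as deepfakes and impersonation. Google acknowledges these risks but has not taken sufficient measures to address them, and government intervention is hoped for. The paper also discusses dialogue synthesis and prompted/unprompted generation. Soundstorm is promising but still needs fine-tuning. The software has many potential applications, including enhancing Google Assistant. However, the impact of this technology and its potential for misuse need to be considered.
Published: June 26, 2023
Note: we recommend playing the video before reading.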

Google has once again presented something in a recent paper that has started to attract the attention of many observers, although many are focusing on the potential risks of this new product.

グーグルは最近の論文で、この新製品の潜在的なリスクに焦点を当てながらも、多くのオブザーバーの注目を集め始めたものを再び発表した。

We have something really amazing coming from Google with this new product, Soundstorm, and this video will focus on highlighting the key details of this product. "...so it can even be used to generate dialogues."

Googleからは、この新製品「Soundstorm」による本当に素晴らしいものがあり、このビデオではこの製品のキーの詳細を強調するために焦点を当てます。

Oh, interesting!

おお、興味深い！

Yeah, yeah, like this one was generated by Soundstorm.

ええ、ええ、これはSoundstormで生成されたようなものです。

Wait, what?

待って、何？

And I would really like everyone to pay attention to this, as everyone is likely to be impacted when this is released, one way or another.

これがリリースされると、誰もがどちらかの方法で影響を受ける可能性があるので、本当に皆さんに注目してほしいです。

And what this is, is the most realistic voiceover that you might have come across in a while.

そして、これが何なのかというと、ここしばらくで出会った中で最もリアルなナレーションなのだ。

And as is the trend now, Google has pointed out some pretty scary stuff that this technology can be used for, but we still don't see any action that they're taking to mitigate these risks, and this is really troubling.

そして、最近の傾向として、グーグルはこの技術が使用される可能性のあるかなり恐ろしいことを指摘しているが、これらのリスクを軽減するための対策はまだ見られない。

However, we really hope the government steps in soon, as Sam Altman already suggested to Congress, before things get out of hand.

しかし、サム・アルトマンがすでに議会に提案したように、事態が手に負えなくなる前に、政府がすぐに介入してくれることを切に願う。

Soundstorm, as already pointed out, provides a hyper-realistic voiceover.

サウンドストームは、すでに指摘されているように、超リアルなナレーションを提供している。

I mean, you can almost hear the breath intakes that come with real human speech, and it has managed to achieve the fluidity of human speech.

つまり、本物の人間のスピーチから生まれる息継ぎが聞こえてきそうなほどで、人間のスピーチから生まれる流暢さを実現できている。

The kind of robotic performance you get from software like Siri and regular Google voiceovers is eliminated here, and that's not the most interesting part.

Siriや通常のGoogleのナレーションから得られるようなロボット的なパフォーマンスはここでは排除されている。

In this video, I'll be playing some of the demo voiceovers from the paper released by Google, and the crazy thing is that Soundstorm can clone any voice from just a three-second recording.

このビデオでは、グーグルが発表した論文に掲載されたデモ音声のいくつかを再生しますが、クレイジーなのは、サウンドストームがたった3秒の録音からどんな声でもクローンできるということです。

That's just really insane.

これは本当に正気の沙汰ではありません。

And as amazing as you might think that is, we're approaching a time when it will be really difficult to tell what's real and what's not because of these hyper-realistic, AI-aided deepfakes.

これはすごいことだと思うかもしれませんが、AIによって超リアルな偽物が作られることで、何が本物で何が偽物なのかを見分けるのが本当に難しくなる時代が近づいているのです。

And we have some instances that we'll look at later in this video that will show how technologies like this have already been used to do some really bad stuff.

このビデオの後半で、このような技術がすでに本当に悪いことに使われていることを示すいくつかの事例を紹介します。

And as you can see here from this abstract of this paper on your screen, Google gave an overview of the mechanics behind the functionality of this model and also the efficiency level.

スクリーンに表示されているこの論文の抄録からわかるように、グーグルはこのモデルの機能の背後にある仕組みと効率性の概要を説明しています。

And the speed at which this thing works is just insane, as you can see in this section of the abstract: "Compared to the autoregressive generation approach of AudioLM, our model produces audio of the same quality and with higher consistency in voice and acoustic conditions, while being two orders of magnitude faster."

そして、このものの動作する速さは、単純に狂気じみています。抽象化のこのセクションで見るように、音声LMの自己回帰生成アプローチと比較して、私たちのモデルは同じ品質の音声とより高い一貫性を持ちながら、2桁高速です。

Soundstorm generates 30 seconds of audio in 0.5 seconds on a TPU V4.

Soundstorm は、TPU V4 上で 30 秒の音声を 0.5 秒で生成します。

We demonstrate the ability of our model to scale audio generation to longer sequences by synthesizing high-quality, natural dialogue segments, given a transcript annotated with speaker turns and a short prompt with the speakers' voices.

我々は、発話者の交代と発話者の声による短いプロンプトが注釈されたトランスクリプトが与えられた場合、高品質の自然な対話セグメントを合成することで、より長いシーケンスに音声生成を拡張する我々のモデルの能力を実証します。
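To make the contrast with autoregressive generation concrete, here is a minimal toy sketch of parallel decoding. An autoregressive decoder needs one model call per token, while a parallel decoder fills in many masked positions per call. The transcript only says Soundstorm is a parallel decoder, so the confidence-based cosine schedule, the `toy_model` stand-in, and all sizes below are illustrative assumptions, not Google's implementation.

```python
import math
import random

MASK = -1  # placeholder for a token that has not been decoded yet


def toy_model(tokens, vocab_size, rng):
    """Hypothetical stand-in for a neural network.

    Returns one (token, confidence) guess per position. A real model would
    condition on the tokens decoded so far; this one just guesses randomly.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def parallel_decode(length, steps, vocab_size=1024, seed=0):
    """Fill `length` masked positions using at most `steps` model calls."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    passes = 0
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        guesses = toy_model(tokens, vocab_size, rng)
        passes += 1
        # Cosine schedule: how many positions should stay masked after this step.
        keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(masked) - keep_masked)
        # Commit the most confident guesses; the rest stay masked for later steps.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:n_commit]:
            tokens[i] = guesses[i][0]
    return tokens, passes


tokens, passes = parallel_decode(length=1500, steps=8)
print(f"decoded {len(tokens)} tokens in {passes} model calls")
```

In this toy setup, a one-token-at-a-time decoder would need 1500 model calls for the same sequence, which is the kind of gap that makes a large speed-up plausible.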

Being able to generate 30 seconds of audio in 0.5 seconds is pretty impressive, considering the quality this software gives out.

30秒の音声を0.5秒で生成できることは、このソフトウェアが出す品質を考慮すると、かなり印象的である。
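As a quick check on what those numbers mean, the speed can be expressed as a real-time factor: seconds of audio produced per second of compute. The sketch below uses only the two figures quoted from the abstract; the function name is made up for illustration, and the ten-minute estimate assumes generation time scales linearly with audio length, which the source does not state.

```python
def real_time_factor(audio_seconds: float, compute_seconds: float) -> float:
    """Seconds of audio generated per second of compute."""
    return audio_seconds / compute_seconds


# Figures quoted from the abstract: 30 s of audio in 0.5 s on a TPU v4.
rtf = real_time_factor(30.0, 0.5)
print(f"{rtf:.0f}x faster than real time")

# Assumption: cost scales linearly with length (not claimed by the source).
ten_minutes = 600.0
print(f"estimated compute for 10 minutes of audio: {ten_minutes / rtf:.0f} s")
```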

And this means that this AI could carry on a normal human conversation and no one would suspect anything.

そしてこれは、このAIが通常の人間との対話を実行しても、誰も何も疑わないことを意味する。

And in case you don't appreciate this enough, what we have here is a demo clip from Google.

万が一、あなたがこのことを十分に理解していない場合に備えて、ここにあるのはグーグルのデモクリップである。

Now listen.

聞いてください。

Well, it's a parallel decoder for efficient audio generation, so it can even be used to generate dialogues.

これは効率的な音声生成のための並列デコーダーで、対話の生成にも使えるんだ。

Oh, interesting!

おお、面白い！

Yeah, yeah, like this one was generated by Soundstorm.

そうそう、これはサウンドストームが生成したんだ

Wait, what?

待って、何？

Without being told, I would definitely mistake that for a real human voice.

言われなければ、間違いなく本物の人間の声と間違えてしまう。

The tone and inflections of natural human speech that are present here are just amazing.

この中に存在する人間の自然な話し方のトーンや抑揚は、まさに驚きだ。

And there will be tons of applications for this when it is finally rolled out, and definitely, we'll be seeing some huge changes to the already existing voiceovers.

そして、これが最終的にロールアウトされた暁には、膨大な数のアプリケーションが出てくるだろうし、間違いなく、すでにあるナレーションに大きな変化が見られるだろう。

For a better understanding of how this model runs, I'll show you this video clip that Google shared in 2022 about AudioLM.

このモデルがどのように動作するかをよりよく理解するために、グーグルが2022年に音声LMについて共有したビデオクリップをお見せしよう。

Ay, lording answered wolf, we know not how to call you Lord or lady.

ああ、狼よ、我々はあなたのことをロード、レディと呼ぶ方法を知らない。

We have lived too long in the forest.

私たちは森の中で長く暮らしすぎた。

And as you can see here, there's a gray line in between the recordings.

ここでわかるように、録音と録音の間にグレーの線が入っている。

The first parts are basically prompts given to the AI to work with, and whatever you hear after the gray line is basically what the AI generates on its own without any prior training or anything like that.

最初の部分は基本的にAIに与えられたプロンプトであり、灰色の線の後に聞こえるのは、事前のトレーニングなしにAI自体が生成したものです。

Just listen to this, and I promise this is just amazing.

これを聞いてみてください。約束しますが、これは本当に驚くべきものです。

Our first impressions of people are in nine cases out of ten mere spectacle reflections of the actuality of things, but they are impressions of something different.

私たちが人々の最初の印象を持つ場合、その9割は実際の事物の観察結果の単なる映像に過ぎませんが、それは何か違ったものの印象です。

This is the technology being advanced in this new paper, and the paper has three very interesting parts that give us a general overview of what to expect when the software finally rolls out.

この技術は、この新しい論文で私たちが進めているものであり、この論文には3つの非常に興味深い部分があり、最終的にソフトウェアがロールアウトしたときに私たちが期待することの全体像を示している。

And this includes dialogue synthesis, prompted and unprompted generation, and baselines.

その中には、ダイアログの合成、プロンプトと非プロンプトの生成、そしてベースラインが含まれています。

Let's take a look at the first demo we have here under dialogue synthesis.

まず、ここで最初のデモを見てみましょう。対話合成の下にある最初のデモを見てみましょう。

And from here, you'll just understand why this is very impressive.

ここからは、なぜこれが非常に印象的なのかを理解していただけるでしょう。

Just like the version that I showed you earlier from the 2022 demo, we have two sections here.

先ほどお見せした2022年のデモと同じように、ここには2つのセクションがあります。

As you can see, the voice prompt and the synthesized dialogue.

ご覧の通り、ボイスプロンプトと合成されたダイアログです。

What you hear is basically Soundstorm being able to create a whole dialogue with just a three-second voice prompt.

音声プロンプトは、3秒間のボイスプロンプトだけで、ダイアログ全体を作成することができます。

Now listen to this voice prompt.

このボイスプロンプトを聞いてください。

And now we've got these two synthesized dialogues. Where did you go last summer?

そして、私たちはこれら2つの合成された対話を持っています。去年の夏、どこに行ったのですか？

I went to Greece, it was amazing. Where did you go last summer?

ギリシャに行きました、素晴らしかったです。去年の夏はどこに行きましたか？

I went to Greece, it was amazing. Oh, that's great!

ギリシャに行ったよ、素晴らしかった！

I've always wanted to go to Greece. What was your favorite part?

ずっとギリシャに行きたかったんだ。どこが一番好きだった？

Uh, it's hard to choose just one favorite part, but yeah, I really loved the food.

ええと、好きなところをひとつだけ選ぶのは難しいんだけど、そうだね、食べ物が本当に好きだった。

The seafood was especially delicious.

シーフードが特においしかった。

Yeah, and the beaches were incredible.

そうですね、そしてビーチは信じられないほど素晴らしかったです。

We spent a lot of time swimming, sunbathing, and exploring the islands.

泳いだり、日光浴をしたり、島を探検したりして、たくさんの時間を過ごしたわ。

It's just crazy how the AI was able to retain the tone and other details in the voices of the two actors through the rest of the generated part.

AIが2人の俳優の声のトーンやその他のディテールを、生成された残りの部分を通して保持できたのは、まさにクレイジーだ。

You can barely notice any difference.

違いはほとんどわからない。

And as you see here in the introduction of that section, it says right here that the following texts and speakers have not been seen during training.

そして、このセクションのイントロダクションにあるように、以下のテキストとスピーカーはトレーニング中に見られなかったと書かれている。

So everything you hear in the synthesized part was generated on the fly by the AI.

つまり、合成された部分から聞こえてくるのは、AIがその場で生成したものだけなのです。

We're going to be seeing some impressive updates to Google Assistant with this, and there's no limit to what this can achieve when coupled with the large language models that we have available now.

これによってGoogleアシスタントには印象的なアップデートが行われ、私たちが現在利用可能な大規模な言語モデルと組み合わせると、これが達成できることには限りがありません。

Listening to the other examples, you might notice some of the usual robotic-sounding patterns creeping in, but only for a split second.

他の事例を聞いていると、通常のロボット的な発音パターンが干渉していることに気づくかもしれないが、それはほんの数秒のことだ。

Something really funny happened to me this morning.

今朝、私には本当に面白いことが起こりました。

Oh, wow, what?

おお、すごい、何？

Something really funny happened to me this morning.

今朝、本当に面白いことが起きたんだ。

Oh, wow, what?

おお、すごい、何？

Well, I woke up as usual, went downstairs to have breakfast, started eating.

まあ、いつものように起きて、朝食を食べに下に降りて、食べ始めたんだ。

Then, 10 minutes later, I realized it was the middle of the night.

それから10分後、気がついたら真夜中でした。

No way, that's so funny!

まさか、面白いね！

And the first synthesis in this second example didn't sound as good as the second one, and I'd like you to listen and observe this.

そして、この最初の合成音と2つ目の例は、2つ目の例ほど良く聞こえなかった。これを聞いて観察してほしい。

I hope you'll agree with me that the second one is better than the first.

2つ目の方が1つ目より選べるということに同意してほしい。

And I think Google will have these little issues sorted out soon, as we expect that the model will still undergo some fine-tuning over the coming months.

そして私はGoogleがこれらの小さな問題をすぐに解決すると思います。まだ数ヶ月間、モデルは微調整される予定ですので。

Now let's move over to the second part of the paper, which covers unprompted and prompted generation.

そして、この論文の2番目の部分、つまり、プロンプトなしとプロンプトありの生成の部分に移ります。

When you go through the paper itself, there's this section that talks about speech intelligibility, and that's where we get a good description of what is meant by the prompted and unprompted versions.

ペーパー自体を読むと、話される明瞭性について言及しているセクションがあり、それが提示されたバージョンと提示されなかったバージョンの意味をよく説明しています。

And it says right here: "We perform these experiments both in the unprompted setup, where the methods can randomly sample speakers, and in the prompted setup, where the methods should respect the speaker identity provided in the form of ground-truth SoundStream tokens corresponding to the first three seconds."

プロンプトなしセットアップでは、メソッドは話者をランダムにサンプリングすることができ、プロンプトありセットアップでは、メソッドは、最初の3秒に対応するグランドトゥルースSoundstormトークンの形で提供される話者のアイデンティティを尊重する必要があります。

We use a Conformer Transducer-L ASR model for transcription.

転写には、コンフォーマトランスデューサL ASRモデルを使用します。

So basically, in the unprompted version, the AI is free to pick the speaker's voice at random, whereas the prompted version is expected to keep the voice of the speaker heard in the three-second prompt.

つまり、基本的に、プロンプトのないバージョンでは、AIは話者の声という点で、元の音声に変更を加えることができますが、プロンプトのあるバージョンは、プロンプトをそのまま完全に反映することが期待されます。
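One way to picture the difference between the two setups is as a per-frame mask: in the unprompted case every token frame is generated, while in the prompted case the frames covering the first three seconds are fixed from the recording and only the rest is generated. This is a minimal sketch under stated assumptions: the 50 frames-per-second token rate and the function name are hypothetical, chosen only for illustration.

```python
FRAME_RATE_HZ = 50  # hypothetical token frame rate, not taken from the source


def frames_to_generate(total_seconds, prompt_seconds=0.0, frame_rate_hz=FRAME_RATE_HZ):
    """Return one flag per frame: True = model generates it, False = fixed by the prompt."""
    total = int(round(total_seconds * frame_rate_hz))
    prompt = int(round(prompt_seconds * frame_rate_hz))
    if prompt > total:
        raise ValueError("prompt is longer than the requested audio")
    return [i >= prompt for i in range(total)]


unprompted = frames_to_generate(30.0)                   # model picks any speaker
prompted = frames_to_generate(30.0, prompt_seconds=3.0)  # first 3 s pin the speaker

print(sum(unprompted), "frames generated in the unprompted setup")
print(sum(prompted), "frames generated in the prompted setup")
```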

And there are many ways I think this can be used to cause very serious damage, but we'll talk about that maybe in another video.

これは、非常に深刻な被害を引き起こすために利用できる方法がたくさんあると思いますが、それについてはまた別のビデオでお話ししましょう。

Now, when you listen to the category under the unprompted version, we have the AI mimicking different voices, and I'd like you to listen to this.

さて、提示されなかったバージョンのカテゴリを聞くと、AIが異なる声を真似しているのが分かります。これを聞いてみてください。

Mr. Metacross the Elder, having not spoken one word thus far himself, introduced the newcomer to me with a side glance at his sons, which had something like defiance in it, a glance which, as I was sorry to notice, was returned with defiance on their side by the two young men.

メタクロス氏はこれまで一言も話していなかったが、彼の息子たちを横目で見ながら新入りを私に紹介した。その横目には挑戦的なものがあり、私が残念に思ったように、その横目は2人の若者によって彼らの側からも挑戦的に返された。

And you can hear that the AI switches to the original voice in the prompted part.

そして、プロンプトの部分でAIが元の声に切り替わるのを聞くことができる。

I'll leave the link in the description in case you'd like to try these samples yourself.

これらのサンプルを自分で試してみたい人のために、説明文にリンクを残しておこう。

And in the third section, baselines, we basically see Soundstorm samples compared against other AI tools that are meant to carry out similar tasks.

そして3つ目のセクション、ベースラインでは、基本的にSoundstormのサンプルが、同様のタスクを実行するための他のAIツールと比較されています。

It says right here: "When generating audio in the prompted case, Soundstorm generations have higher acoustic consistency and preserve the speaker's voice from the prompt better than AudioLM."

プロンプトのケースで音声を生成する場合、Soundstormの世代はオーディオLMよりも音響的な一貫性が高く、プロンプトから話者の声をよりよく保持します。

Compared to RVQ level-wise greedy decoding with the same model, Soundstorm produces audio with higher quality.

同じモデルのRVQレベル別貪欲デコードと比較して、Soundstormはより高品質な音声を生成する。
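For readers unfamiliar with the term, RVQ stands for residual vector quantization: each level quantizes whatever error the previous levels left behind, so the first levels carry the coarse signal and later ones add detail. The sketch below shows the idea on a single number with tiny made-up codebooks; real audio codecs quantize vectors with learned codebooks, and none of these values come from the paper.

```python
def rvq_encode(x, codebooks):
    """Quantize x level by level; each level encodes the residual left by the previous ones."""
    residual = x
    indices = []
    for codebook in codebooks:
        # Pick the codebook entry closest to the current residual.
        idx = min(range(len(codebook)), key=lambda i: abs(residual - codebook[i]))
        indices.append(idx)
        residual -= codebook[idx]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct the value by summing the chosen entry from each level."""
    return sum(codebook[i] for codebook, i in zip(codebooks, indices))


# Hypothetical scalar codebooks, coarse to fine.
codebooks = [
    [-1.0, 0.0, 1.0],
    [-0.3, 0.0, 0.3],
    [-0.1, 0.0, 0.1],
]

x = 0.72
indices = rvq_encode(x, codebooks)
for levels in range(1, len(codebooks) + 1):
    approx = rvq_decode(indices[:levels], codebooks[:levels])
    print(f"{levels} level(s): approx={approx:.2f}, error={abs(x - approx):.2f}")
```

Level-wise greedy decoding, as named in the quote, would pick the single most likely token at each such level in turn rather than sampling.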

And I want you to listen carefully and observe the differences we have here.

注意して聞いて、ここでの違いに注目してほしいです。

He must descend with his heart full of charity and severity at the same time.

彼は慈愛と同時に厳しさを胸に降臨しなければならない。

He must descend with his heart full of charity and severity at the same time.

彼は慈愛と同時に厳しさに満ちた心で降りなければならない。

He must descend with his heart full of charity and severity at the same time.

彼は同時に、慈愛と厳しさに満ちた心で下らねばならない。

He must descend with his heart full of charity and severity at the same time.

彼は、慈愛と厳しさを同時に胸一杯に抱いて下らねばならない。

And true to the description I just showed you, Soundstorm did perform better than the others, especially the greedy decoding baseline, which still has a lot of those echoing robotic sounds.

そして、今お見せした説明の通り、サウンドストームは他のものより良いパフォーマンスを見せた。特に貪欲なサウンドは、まだあの響くようなロボットサウンドが多い。

Overall, it's pretty solid stuff that Google has here, and we look forward to when this will be fully rolled out for use.

全体的に、グーグルがここに持っているものはかなり堅実なものであり、我々はこれが完全に使用できるようになる時を楽しみにしている。

But as much as we look forward to working with this software, there's a million ways this could go wrong from every angle, and Google did acknowledge this in the broader impact section.

しかし、私たちがこのソフトウェアを使うことを楽しみにしているのと同じくらい、あらゆる角度から見て、これがうまくいかない可能性はいくらでもある。

Look here where it says, However, a more thorough analysis of any training data and its limitations would be an area of future work in line with responsible AI principles.

しかし、あらゆるトレーニングデータとその限界についてのより徹底的な分析は、責任あるAIの原則に沿った今後の課題であろう。

In turn, the ability to mimic a voice can have numerous malicious applications, including bypassing biometric identification and for the purpose of impersonation.

一方、声を真似する能力には、生体認証をバイパスしたり、なりすましの目的で悪用される可能性があります。

And I assure you, the risks are well above anything you'll see in this paper.

そして、そのリスクはこの論文に書かれているものよりもはるかに高いことを私は保証する。

And I'm quite eager to see how all these advancements will influence the upcoming elections because, like it or not, the role that new developments in AI play extends across most aspects of our lives.

そして私は、これらの進歩が今度の選挙にどのような影響を与えるのか、非常に期待している。というのも、好むと好まざるとにかかわらず、AIの新たな発展が果たす役割は、私たちの生活のほとんどの側面に大きく広がっているからだ。

And the possibility of having these deep fakes can change everything.

そして、このような深いフェイクを持つ可能性は、すべてを変える可能性がある。

Although Google states in the same section that these AI-generated samples can be detected by dedicated classifiers, I can only see this working out in more formal scenarios like verifying evidence in courts and other similar situations.

グーグルは同じセクションで、これらのAIが生成したバージョンは専用の分類機によって追跡できると述べているが、これがうまくいくのは、法廷での証拠の検証やその他同様の状況といった、より正式なシナリオにおいてのみだと私は考えている。

But we've had cases of these voice clone tools being used to defraud people by cloning the voice of a loved one or friend, and these are situations where you're likely not to have the luxury of running tests to verify identity.

しかし、これらの声のクローンツールが愛する人や友人の声をクローン化して人々を詐欺るという事例がありました。これらの状況では、身元を検証するためのテストを実行する余裕がない可能性があります。

There's no doubt we'll be having more cases, and I'm eager to see how these issues will be tackled.

今後、このような事例が増えることは間違いなく、このような問題にどのように取り組んでいくのか見てみたいと思っています。

What are your thoughts on the impact of these new products?

これらの新製品の影響について、あなたはどうお考えですか？

Do let us know in the comments.

コメントでお聞かせください。

Subscribe to our channel, and we'll see you in the next video.

私たちのチャンネルを購読して、次のビデオでお会いしましょう。

Bye now.

それではまた。
