【Realtime Voice Changer】英語解説を日本語で読む【2023年5月25日｜@Nerdy Rodent】

2023年5月27日 07:45

AIの声変換機能を備えた使いやすい音声変換アプリの解説です。このアプリは、別々のツールを必要とせず、声変換プロセスを簡略化します。
公開日：2023年5月25日
※動画を再生してから読むのがオススメです。

Hello and welcome to more Nerdy Rodent geekery.

こんにちは、そしてようこそ、さらなるNerdy Rodentのギークな世界へ。

Voice-to-voice technology, in case you're not aware of what this is, basically allows you to change one voice into another voice, a bit like having an AI voice changer.

ボイス・トゥ・ボイス技術とは、ご存知ない方のために説明すると、基本的にはある声を別の声に変えられる技術で、AIボイスチェンジャーのようなものです。

And the best part?

そして、一番の魅力は？

Everything you need is now all in one app.

必要なものはすべて1つのアプリに集約されています。

Plus, it's really quick to train as well.

しかも、トレーニングも短時間で済みます。

So here it is, the retrieval-based voice conversion web user interface.

これが、検索型音声変換のウェブユーザーインターフェイスです。

Hardly a mouthful at all.

ちっとも長ったらしくない名前ですね。

You may be aware that AI singing voice conversion can be a bit of a task, as there are multiple stages involved before you can create your masterpiece video with John Cena dancing while listening to Abraham Lincoln singing the very latest K-Pop song.

AIによる歌唱声変換がやや手間がかかることはご存知かもしれません。これは、John Cenaがダンスしながら、Abraham Lincolnが最新のK-Pop曲を歌うというあなたの傑作ビデオを作成する前に、複数のステージを経る必要があるからです。

First, you need to collect a bunch of voice samples, process them, train a model, separate the vocals from the music track you're changing (if you don't already have them separately), run your new AI model on those vocals, and finally mix them back in with the music.

まず、多くの音声サンプルを収集し、それらを処理し、モデルを訓練し、変更する音楽トラックからボーカルを分離します（すでに別々に持っていない場合）、そのボーカルに新しいAIモデルを実行し、最終的にそれらを音楽と再度混ぜます。

Thankfully, that can now all be done via this one web interface.

それが、このウェブインターフェイスですべてできるようになったのはありがたいことです。

And what is the quality like?

そのクオリティはどうなのでしょうか？

Well, let's have a listen.

では、実際に聴いてみましょう。

I've used an example song from Pixabay.

Pixabayの楽曲を例にしてみました。

There it is, meaning that in less than 30 minutes of training time, I can be the one singing instead.

これなら、30分もかからずに、私が代わりに歌うことができます。

So let's take a quick listen to a clip of the original, so we know what I'm going to convert from.

それでは、私が変換する元の音源のクリップを短く聞いてみましょう。

And then, now, with that voice changed by this AI to sound like me instead.

そして今、その声はこのAIによって私のように聞こえるように変えられました。

Want to do this yourself?

自分でもやってみたいですか？

Then stick with me here, and I'll show you exactly how.

では、その方法をお見せしますから、お付き合いください。

As with anything Python, installation is an absolute breeze.

Python関連のものは何でもそうですが、インストールは至って簡単です。

And the best part is that it works on a range of operating systems, even Microsoft Windows.

そして何より、Microsoft Windowsを含む様々なOS上で動作するのが魅力です。

Here's a little table with some of the requirements if you use Microsoft Windows.

Microsoft Windowsを使用している場合の必要条件を表にまとめてみました。

Sorry if you are using that.

Microsoft Windowsをお使いの方は、申し訳ありません。

I do hope things get better.

改善されることを願っています。

What you could do is download and install 7-Zip, download the rvc-beta 7-Zip file from the Hugging Face page, unzip it, and then use go-web.bat.

7-Zipをダウンロードしてインストールし、Hugging Faceのページからrvc-betaの7-Zipファイルをダウンロードして解凍し、go-web.batを使用することができます。

A normal install can also be done just like they have here, though you may want to download the 7-Zip archive anyway as that has all the models in it.

通常のインストールもここで紹介されているように行えますが、7-Zipのアーカイブをダウンロードしておくと、すべてのモデルが含まれているので便利かもしれません。

Personally, I did the normal install using an Anaconda virtual Python 3.10 environment, as I like simple app management.

個人的には、シンプルなアプリ管理が好きなので、Anacondaの仮想Python 3.10環境を使って、普通にインストールしました。

There is also a Google Colab available if you prefer to use Google Colab.

また、Google Colabを使いたい場合は、Google Colabも用意されています。

So, with whatever installation method you chose, you should now have your web interface up and running.

さて、どのようなインストール方法を選んだとしても、これでウェブインターフェースを立ち上げることができるはずです。

Let's dive into this fascinating world of voice-to-voice technology and see what amazing things we can create.

この魅力的なVoice-to-Voiceテクノロジーの世界に飛び込んで、どんな素晴らしいものが作れるか見てみましょう。

If you already have a model, you can do model inference straight away.

すでにモデルを持っている人は、すぐにモデル推論を行うことができます。

Or, like me, you can begin with training one.

あるいは、私のようにトレーニングから始めることもできます。

If you don't, there is the training tab.

もし持っていないのであれば、トレーニング・タブがあります。

However, before we delve into the training process, let's just quickly go over these five tabs.

しかし、トレーニングのプロセスを掘り下げる前に、この5つのタブについて簡単に説明しましょう。

So first of all, you've got model inference.

まず最初に、モデル推論がありますね。

You've got separation of accompaniment and vocal, train, checkpoint processing so you can mix checkpoints together, there's export ONNX (which I've never used), and also an FAQ as well.

伴奏とボーカルの分離、トレーニング、チェックポイントを混ぜ合わせられるチェックポイント処理、ONNXのエクスポート（これは使ったことがありません）、そしてFAQもあります。

To begin with, as mentioned, we're going to start with the training tab, as this is where you will create your very first voice model.

まず始めに、前述の通り、最初の音声モデルを作成する場所として、トレーニングタブから始めます。

Step one, for the experiment name, simply enter the name you want to give your project.

ステップ1、実験名には、プロジェクトに付けたい名前を入力します。

So, you could do, for example, nerdy because that's me.

例えば「nerdy」とすることができます。私のことですから。

As for the sample rate, I personally prefer always using 40K, and I always leave this on true as well, as that seems to be the best model architecture.

サンプルレートは、個人的には常に40Kを使うのが好きで、モデル・アーキテクチャとして最適なので、これも常にtrueにしています。

You can select either version 1 or version 2.

バージョン1とバージョン2のどちらかを選択することができます。

Personally, I prefer version two.

個人的にはバージョン2の方が好きです。

The number of threads, I think, is probably picked automatically.

スレッドの数は、おそらく自動的に選ばれると思います。

Congratulations!

おめでとうございます！

You have now completed step one.

これでステップ1は完了です。

The next step is step two.

次はステップ2です。

The first thing it asks for here is the path to the training directory.

ここで最初に聞かれるのは、トレーニングディレクトリへのパスです。

If you're not familiar with terms like files and directories on your computer, this part can be quite confusing.

コンピュータのファイルやディレクトリなどの用語に慣れていないと、この部分はかなり混乱する可能性があります。

You could think of directories as computer boxes where you organize your things, files in this case, and I've put them into a training directory.

ディレクトリは、物（この場合はファイル）を整理しておくコンピュータ上の箱のようなものだと考えてください。私はファイルをトレーニングディレクトリに入れました。

So, there is my path: training/nerd.

つまり、私のパスは「training/nerd」です。

If we have a quick look at that directory, as you can see, it's absolutely full of audio files.

このディレクトリをざっと見てみると、ご覧のように、オーディオファイルでいっぱいです。

If your name is different, you may wish to use something else, but it's entirely up to you.

もしあなたの名前が違うのであれば、他のものにした方がいいかもしれませんが、それはあなた次第です。

Even though I'd already split my samples up into around 250 segments, you don't actually need to worry too much about that because this program will automatically handle long audio and split it accordingly.

私はすでにサンプルを約250のセグメントに分割していますが、このプログラムは自動的に長いオーディオを処理し、それに応じて分割するので、実際にはあまり気にする必要はありません。

Generally speaking, between 10 and 50 minutes total audio is required.

一般的には、10分から50分程度の音声が必要です。

Any vocals are fine: singing, talking, whatever.

歌でも話し声でも、どんなボーカルでもかまいません。

Just make sure that you don't have any music in the background.

ただ、バックグラウンドに音楽がないことは確認してください。

It should be all one person, vocals only.

一人で、ヴォーカルだけです。
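The 10 to 50 minute guideline above can be checked before training. The following is a minimal sketch, not part of the web interface: it assumes the samples are uncompressed PCM WAV files (the only kind Python's standard `wave` module reads), and the function names are hypothetical.

```python
import wave
from pathlib import Path


def total_wav_minutes(directory: str) -> float:
    """Sum the duration, in minutes, of every .wav file directly inside `directory`."""
    total_seconds = 0.0
    for path in sorted(Path(directory).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            # Duration of one file = number of frames / frames per second.
            total_seconds += wav.getnframes() / wav.getframerate()
    return total_seconds / 60.0


def within_suggested_range(minutes: float, low: float = 10.0, high: float = 50.0) -> bool:
    """True if the total falls inside the 10 to 50 minute range mentioned in the video."""
    return low <= minutes <= high


# Example call (the path is the one used in the video; adjust it to your own):
# print(total_wav_minutes("training/nerd"))
```

Files in other formats (MP3, FLAC and so on) would need a different reader, so treat the result as a rough check only.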

Okay, so now you've put in the directory with all your samples in, you can just click Process Data.

さて、すべてのサンプルが入ったディレクトリを入れたら、Process Dataをクリックしてください。

That will take a few seconds and process all of the samples for you.

それは数秒で全てのサンプルを処理します。

Now, you're ready to move on to step two B.

これで、ステップ2bに進む準備ができました。

If you have multiple graphics cards, then you can put them in here, but I've only got a single GPU, so I just leave that as is.

複数のグラフィックカードを持っている場合は、ここに入力することができますが、私は1つのGPUしか持っていないので、そのままにしています。

The defaults are absolutely fine.

デフォルトで全く問題ありません。

Next, you have pitch extraction, which has three options.

次に、ピッチ抽出ですが、これは3つのオプションがあります。

Personally, I always go with Harvest.

個人的には、いつもHarvestを使っています。

PM is fast but low quality.

PMは速いけど品質が低い。

Dio is a bit slower but better quality, and Harvest is the slowest but the best quality.

Dioは少し遅いですが品質は良く、Harvestは最も遅いですが最高品質です。

So, with Harvest selected there, I just click Feature Extraction.

そこでHarvestを選択した状態で、Feature Extractionをクリックします。

That will take a few seconds and finish that task.

これも数秒で完了します。

Step three.

ステップ3。

Well, here for the most part, you can just go ahead and click that One-Click Training button.

さて、ここではほとんどの場合、「One-Click Training」ボタンをクリックするだけです。

Come back in about 10 minutes, and you'll have a model.

約10分後に戻ってくれば、モデルが完成しています。

However, if you are like me and you do like to change things a little bit, you've got some options there for how often you want to save the full model, the total number of epochs, the GPU batch size, and some options for saving.

しかし、私のように少しずつ物事を変えたいと思っているなら、フルモデルの保存頻度、エポックの総数、GPUバッチサイズ、そして保存のためのいくつかのオプションについて、選択肢があります。

Personally, the way I like to set this up for a version 2 model is to set the save frequency to 10.

個人的には、バージョン2モデルの場合、保存頻度を10に設定するのが好きです。

For total training epochs, I do 200, which is about the maximum you'll ever need.

トレーニングエポックの総数は200にしています。これは必要になる最大値に近い値です。

I have a very large GPU: I've got 24 gig of VRAM.

私は非常に大きなGPUを使っていて、VRAMは24ギガあります。

So I set the batch size up to 40, as that's the maximum my GPU will handle.

そのため、バッチサイズは、私のGPUが処理できる最大値である40に設定しています。
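To get a feel for what these numbers mean together, here is a rough bookkeeping sketch. It assumes the 10 is the save frequency in epochs, the 200 is the total epoch count, and the training set is the roughly 250 segments mentioned earlier. The tool re-splits audio itself and may batch differently, so this is an illustration of the arithmetic, not a description of how the program counts.

```python
import math


def training_plan(num_segments: int, batch_size: int, total_epochs: int, save_every: int) -> dict:
    """Rough bookkeeping for a training run (hypothetical helper, not part of the tool)."""
    # One epoch visits every segment once; a partial final batch still counts as a step here.
    steps_per_epoch = math.ceil(num_segments / batch_size)
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * total_epochs,
        "checkpoint_saves": total_epochs // save_every,
    }


plan = training_plan(num_segments=250, batch_size=40, total_epochs=200, save_every=10)
print(plan)
```

A smaller GPU would simply use a smaller batch size, which raises the number of steps per epoch without changing the number of epochs.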

I like to click Yes to only save the latest checkpoint.

最新のチェックポイントだけを保存するオプションは、Yesにするのが好きです。

I keep "cache all" on No, and I say Yes to saving small finished models.

「すべてキャッシュする」はNoのままにして、完成した小さなモデルを保存するオプションはYesにします。

So, with your model training via that One-Click Training, I would suggest also going and having a look over at the frequently asked questions tab.

ワンクリック・トレーニングでモデルをトレーニングする際には、「よくある質問」タブにも目を通しておくことをお勧めします。

There's quite a lot of information here.

ここにはたくさんの情報があります。

Particularly useful are question nine and question 10: "How many total epochs are optimal?" and "How much training set duration is needed?"

特に役に立つのは、質問9「合計エポック数はいくつが最適ですか？」と質問10「トレーニングセットの長さはどれくらい必要ですか？」です。

Now that you've got your very first voice model, it's time to do that AI voice-to-voice thing.

さて、最初の音声モデルを手に入れたら、いよいよAIによる音声変換を行うことになります。

If you already have the voice that you want to convert, you can skip straight to model inference.

すでに変換したい音声がある場合は、そのままモデル推論に進むことができます。

However, if you want to do something like change the singer of a song that you don't have the Vocal Stems for, like I did here, then you'll first need to separate those vocals out from the background music.

しかし、私がここでやったように、ボーカルステムを持っていない曲の歌手を変えたいと思っているなら、まず、そのボーカルを背景音楽から分離する必要があります。

And this is where the separation tab comes in handy.

そこで、分離タブが役に立ちます。

Once again, those files and directories come into play as you'll need to know where you've saved your music files.

また、音楽ファイルをどこに保存したかを知るために、ファイルとディレクトリが登場します。

The first box is for when you want to convert multiple files from a given directory. As I tend to do just one at a time, I delete that and then use the box underneath instead.

最初のボックスは、あるディレクトリから複数のファイルを変換したい場合に使います。私は1つずつ変換することが多いので、それを消して、代わりにその下のボックスを使います。

Model selection has two options, like it says at the top.

モデル選択には、一番上に書いてあるように、2つのオプションがあります。

There's hp2 for input without harmony, or, if the input has harmony and the extracted vocals do not need harmony, use hp5.

ハーモニーのない入力にはhp2を、ハーモニーがあり、抽出したボーカルにハーモニーが不要な場合はhp5を使用します。

Basically, if you're unsure, use both.

基本的には、よくわからない場合は、両方使ってください。

Have a listen to the output and see which is best for you.

出力を聴いてみて、どちらが自分に合っているかを判断してください。

In my case, I'm going to use hp2 here.

私の場合、ここではhp2を使うことにします。

By default, the output goes into the opt directory, so feel free to change that if you like.

デフォルトでは、出力はoptディレクトリに入るので、好きなように変更してください。

When you're ready, push the huge orange convert button, and you'll have split the vocals from the music.

準備ができたら、オレンジ色の大きな変換ボタンを押せば、音楽からボーカルを分離することができます。

Let's have a quick listen to that.

早速、聴いてみましょう。

Of course, there's a few seconds of silence.

もちろん、数秒の沈黙があります。

There we go, anyway.

とにかく、これでいいのです。

That's done quite well.

なかなかうまくできていますね。

We've got the vocals there without the music, even if there is a little bit of an echo or something there in the voice, alright?

多少エコーがかかっていても、音楽なしでボーカルを聴くことができますね。

So now we're ready to go with inference.

これで推論を行う準備が整いました。

The page does look huge, but really it's two things in one.

このページは巨大に見えますが、実は2つのものが1つになっているのです。

The top half there is for single voice conversion, and there, you've got a batch as well.

上半分は単一の音声変換用で、そこにはバッチもあります。

So I'll just be going through the one.

ここでは、そのうちの1つだけを説明します。

The batch is essentially the same, but you're doing loads at a time.

バッチは基本的に同じですが、一度に大量のデータを処理することになります。

Again, everything is pretty straightforward here.

ここでも、すべてが非常に簡単です。

Push that huge refresh button, and then you should see your options appear in this little pull-down here.

大きな更新ボタンを押すと、この小さなプルダウンにオプションが表示されるはずです。

My list is absolutely huge, as all the girls would agree, but you'll probably only have one option in there the first time, so pick that.

私のリストは、すべての女の子が同意するように、非常に巨大ですが、おそらく初回は1つのオプションしかないでしょうから、それを選んでください。

I'm going to pick that one because that's my trained voice.

私の訓練された声なので、これを選びます。

Next, you have to select a pitch, just like it says above.

次に、上に書いてあるように、音程を選択する必要があります。

For low to high conversion, use plus 12.

低音から高音への変換は、プラス12で。

If it's about the same, use zero.

ほぼ同じならゼロを使う。

And for high to low voice conversion, use minus 12.

そして、高い声から低い声への変換は、マイナス12を使います。

The source voice in this case is quite high.

この場合の音源の声はかなり高いです。

My voice is a bit lower, so I'm going to use minus 12.

私の声はもう少し低いので、マイナス12を使うことにします。
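The plus 12 and minus 12 values make sense if the pitch setting is measured in semitones, where 12 semitones is one octave. That unit is an assumption here, since the video only gives the three values. Under that assumption, a small sketch of the frequency change (function names are hypothetical):

```python
def transpose_ratio(semitones: float) -> float:
    """Frequency multiplier for a shift of `semitones` in 12-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)


def transpose_hz(frequency_hz: float, semitones: float) -> float:
    """Frequency of a note after shifting it by `semitones`."""
    return frequency_hz * transpose_ratio(semitones)


print(transpose_ratio(12))     # one octave up doubles the frequency
print(transpose_ratio(-12))    # one octave down halves it
print(transpose_hz(440.0, -12))
```

Values between 0 and plus or minus 12 would give smaller shifts, which may suit voices that are only slightly apart.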

Once again, those files and directories come into play here.

ここでも、ファイルやディレクトリが活躍します。

So put the path to your vocals in.

ですから、ボーカルのパスを入れてください。

If you did that default voice separation, then you'll have the two files in your opt directory.

デフォルトのボイスセパレーションを行った場合、optディレクトリに2つのファイルがあるはずです。

You want the one which starts with vocal.

ボーカルで始まる方のファイルが必要です。
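Picking out the file that starts with "vocal" can also be done with a few lines of Python. This is a minimal sketch assuming the default `opt` output directory; the function name and the example file names in the comment are hypothetical.

```python
from pathlib import Path


def find_vocal_files(output_dir: str = "opt", prefix: str = "vocal") -> list[Path]:
    """Return the files in `output_dir` whose names start with `prefix`, sorted by name."""
    return sorted(
        p for p in Path(output_dir).iterdir()
        if p.is_file() and p.name.startswith(prefix)
    )


# Example: with "vocal_song.wav" and "instrument_song.wav" in opt/,
# only the first would be returned.
# for path in find_vocal_files():
#     print(path)
```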

So there, in my opt directory, I have the long name of that WAV file, the one that starts with vocal.

そこで、optディレクトリにある、vocalで始まる長い名前のWAVファイルを指定しています。

For pitch extraction, again, PM is fast and Harvest is best.

ピッチ抽出については、繰り返しになりますが、PMは速く、Harvestが最高品質です。

So I like to select Harvest.

だから、私はHarvestを選びたい。

Everything else I leave at the default, apart from this path to index, which should have a pull-down menu.

それ以外のものはデフォルトのままにしていますが、このインデックスへのパスだけは別で、プルダウンメニューがあるはずです。

There is the one that I want to use because it matches that inference voice.

その中に、推論音声と一致するものがあるので、それを使いたい。

Okay, so now you can go ahead and click that very tiny convert button, and in just a few seconds, you should have your output.

さて、それでは、この小さな変換ボタンをクリックすると、数秒後には出力されるでしょう。

And there it is.

そして、それがこれです。

Yeah, there we go.

そう、これです。

That's pretty cool.

とてもクールですね。

That's pretty cool.

とてもクールだ。

That's me.

これが私です。

Now you can right-click that, save audio as.

右クリックして、名前を付けてオーディオを保存することができます。

I'm going to put it in my opt directory as well.

私はそれを私のoptディレクトリに置くつもりです。

I'm using Audacity.

私はAudacityを使用しています。

Here, I've got the instrumental, so I just drag that other voice in, and then I can go to File, Export as whatever I want, and it will mix those two tracks together.

ここにインストゥルメンタルがあるので、もう一方の声をドラッグして入れ、ファイルのエクスポートで好きな形式に書き出せば、この2つのトラックがミックスされます。
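Conceptually, mixing two tracks means adding them sample by sample. The sketch below illustrates that general idea only; it does not describe what Audacity does internally. It assumes both tracks are mono, 16-bit, and at the same sample rate, and the function name is hypothetical.

```python
from array import array


def mix_int16(vocals: array, instrumental: array) -> array:
    """Sum two mono 16-bit tracks sample by sample, clipping to the int16 range."""
    length = max(len(vocals), len(instrumental))
    mixed = array("h")
    for i in range(length):
        # Treat the shorter track as silence once it runs out.
        v = vocals[i] if i < len(vocals) else 0
        m = instrumental[i] if i < len(instrumental) else 0
        # Clip so loud passages do not overflow the 16-bit range.
        mixed.append(max(-32768, min(32767, v + m)))
    return mixed


demo = mix_int16(array("h", [1000, 30000, -30000]), array("h", [500, 10000]))
print(list(demo))
```

Hard clipping like this distorts loud passages, which is one reason to lower the track levels before exporting a mix.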

Thank you, on your bones.

ありがとう、骨の髄まで。

Plus, if you thought that was cool, then you may also like this Nerdy Rodent video.

さらに、もしこれがクールだと思ったなら、このNerdy Rodentの動画も気に入るかもしれません。
