How to Use mozilla/TTS

January 26, 2021, 19:33

This article was written with reference to the following article.

1. mozilla/TTS

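"mozilla/TTS" is a library for advanced text-to-speech generation. It is built on the latest research and designed to achieve the best trade-off among ease of training, speed, and quality. "TTS" ships with pretrained models and tools for measuring dataset quality, and it is already used in products and research projects in more than 20 languages.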

Resources are available at the following locations.

・Tutorials and examples
・Models

2. Features

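・High-performance deep learning models for Text2Speech tasks.
・Text2Spec models (Tacotron, Tacotron2, Glow-TTS, SpeedySpeech).
・A speaker encoder that computes speaker embeddings efficiently.
・Vocoder models (MelGAN, Multiband-MelGAN, GAN-TTS, ParallelWaveGAN, WaveGrad, WaveRNN).
・Fast and efficient model training.
・Detailed training logs on the console and in Tensorboard.
・Support for multi-speaker TTS.
・Efficient multi-GPU training.
・Ability to convert PyTorch models to Tensorflow 2.0 and TFLite for inference.
・Models released in PyTorch, Tensorflow, and TFLite.
・Tools for curating Text2Speech datasets (dataset_analysis).
・A demo server for model testing.
・Notebooks for extensive model benchmarking.
・Modular code.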

3. Models

◎ Text-to-Spectrogram

・Tacotron
・Tacotron2
・Glow-TTS
・Speedy-Speech

◎ Attention methods

・Guided Attention
・Forward Backward Decoding
・Graves Attention
・Double Decoder Consistency

◎ Speaker encoder

・GE2E
・Angular Loss

◎ Vocoders

・MelGAN
・MultiBandMelGAN
・ParallelWaveGAN
・GAN-TTS discriminators
・WaveRNN
・WaveGrad

4. Folder structure

|- notebooks/       (Jupyter notebooks for model evaluation, parameter selection, and data analysis)
|- utils/           (common utilities)
|- TTS
    |- bin/             (executables)
        |- train*.py                  (train a model)
        |- distribute.py              (train a TTS model on multiple GPUs)
        |- compute_statistics.py      (compute dataset statistics for normalization)
        |- convert*.py                (convert Torch models to TF models)
    |- tts/             (Text2Speech models)
        |- layers/          (model layer definitions)
        |- models/          (model definitions)
        |- tf/              (TF2 utilities and model implementations)
        |- utils/           (model-specific utilities)
    |- speaker_encoder/ (speaker encoder models)
        |- (same)
    |- vocoder/         (vocoder models)
        |- (same)

5. Sample model output

Below is the output of a Tacotron model trained on the LJSpeech dataset for 16K iterations with a batch size of 32.

"Recent research at Harvard has shown meditating for as little as 8 weeks can actually increase the grey matter in the parts of the brain responsible for emotional regulation and learning."

6. Datasets and data loading

"TTS" provides a generic data loader that is easy to use with custom datasets. You need to write a simple function that formats your dataset; see the sample "datasets/preprocess.py". After that, set the dataset fields in "config.json".

・LJ Speech
・Nancy
・TWEB
・M-AI-Labs
・LibriTTS
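
As a rough sketch of what such a formatting function might look like, the example below parses a pipe-separated metadata file into a list of [text, wav path, speaker name] items. The function name `my_dataset`, the `wavs/` subfolder, and the `id|text` column layout are assumptions made for illustration; check "datasets/preprocess.py" for the exact format your version of the library expects.

```python
import os


def my_dataset(root_path, meta_file, speaker_name="my_speaker"):
    """Hypothetical formatter: turn `id|text` lines into [text, wav_path, speaker] items."""
    items = []
    with open(os.path.join(root_path, meta_file), "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            cols = line.split("|")
            # Assumed layout: audio clips live in <root_path>/wavs/<id>.wav
            wav_file = os.path.join(root_path, "wavs", cols[0] + ".wav")
            text = cols[1]
            items.append([text, wav_file, speaker_name])
    return items
```

Calling `my_dataset("/path/to/data", "metadata.csv")` (a hypothetical path) would return one item per non-empty line of the metadata file.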

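7. Running speech synthesis

This section uses a Tacotron2 model and a MultiBand-MelGAN model trained on the LJSpeech dataset.

Tacotron2 was trained with Double Decoder Consistency (DDC) for 130K steps (3 days) on a single GPU. MultiBand-MelGAN was trained for 1.45M steps on real spectrograms. The performance of both models can be improved with more training.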

◎ Downloading the models

# Download the TTS model
!gdown --id 1dntzjWFg7ufWaTaFy80nRz-Tu02xWZos -O tts_model.pth.tar
!gdown --id 18CQ6G6tBEOfvCHlPqP8EBI4xWbrr9dBc -O config.json

# Download the vocoder model
!gdown --id 1Ty5DZdOc0F7OTGj9oJThYbL5iVu_2G0K -O vocoder_model.pth.tar
!gdown --id 1Rd0R_nRCrbjEdpOwq6XwZAktvugiBvmu -O config_vocoder.json
!gdown --id 11oY3Tv0kQtxK_JPgxrfesa99maVXHNxU -O scale_stats.npy
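
Before loading the models, it can help to confirm that all five files actually arrived, since a failed download may leave a file missing or empty. The following is a minimal sketch using only the standard library; the helper name `missing_files` is hypothetical, and the file names are the ones passed to `-O` above.

```python
from pathlib import Path

# File names taken from the download commands above.
EXPECTED_FILES = [
    "tts_model.pth.tar",
    "config.json",
    "vocoder_model.pth.tar",
    "config_vocoder.json",
    "scale_stats.npy",
]


def missing_files(directory=".", names=EXPECTED_FILES):
    """Return the expected files that are absent or empty in `directory`."""
    base = Path(directory)
    return [n for n in names
            if not (base / n).is_file() or (base / n).stat().st_size == 0]


print("Missing:", missing_files())
```

An empty list means every expected file is present and non-empty in the current directory.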

◎ Installing the libraries

# Install espeak
!sudo apt-get install espeak

# Install TTS
!git clone https://github.com/mozilla/TTS
%cd TTS
!git checkout b1935c97
!pip install -r requirements.txt
!python setup.py install
%cd ..

◎ Defining the TTS function

# Define the TTS function.
# Note: `speaker_id` and `vocoder_model` are globals defined in the model-loading
# code below, which (together with its imports) must be run before calling tts().
def tts(model, text, CONFIG, use_cuda, ap, use_gl, figures=True):
    t_1 = time.time()

    # Text -> mel spectrogram
    waveform, alignment, mel_spec, mel_postnet_spec, stop_tokens, inputs = synthesis(
        model, text, CONFIG, use_cuda, ap, speaker_id, style_wav=None,
        truncated=False, enable_eos_bos_chars=CONFIG.enable_eos_bos_chars)

    # Mel spectrogram -> audio (run the vocoder unless use_gl is set)
    if not use_gl:
        waveform = vocoder_model.inference(torch.FloatTensor(mel_postnet_spec.T).unsqueeze(0))
        waveform = waveform.flatten()
    if use_cuda:
        waveform = waveform.cpu()

    # Output: print timing statistics and play the audio in the notebook
    waveform = waveform.numpy()
    rtf = (time.time() - t_1) / (len(waveform) / ap.sample_rate)
    tps = (time.time() - t_1) / len(waveform)
    print(waveform.shape)
    print(" > Run-time: {}".format(time.time() - t_1))
    print(" > Real-time factor: {}".format(rtf))
    print(" > Time per step: {}".format(tps))
    IPython.display.display(IPython.display.Audio(waveform, rate=CONFIG.audio['sample_rate']))
    return alignment, mel_postnet_spec, stop_tokens, waveform
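
The two timing figures printed by the function are simple ratios: the real-time factor is processing time divided by the duration of the generated audio (a value below 1 means synthesis is faster than playback), and the time per step is processing time divided by the number of output samples. The sketch below restates them as standalone functions; the function names and the numbers in the example are hypothetical.

```python
def real_time_factor(elapsed_seconds, num_samples, sample_rate):
    """Processing time divided by the duration of the generated audio."""
    audio_seconds = num_samples / sample_rate
    return elapsed_seconds / audio_seconds


def time_per_step(elapsed_seconds, num_samples):
    """Average processing time spent per output sample."""
    return elapsed_seconds / num_samples


# Hypothetical numbers: 2 s of compute for 4 s of audio at 22,050 Hz.
rtf = real_time_factor(2.0, 4 * 22050, 22050)
print(rtf)  # 0.5
```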

◎ Loading the models

import os
import torch
import time
import IPython

from TTS.utils.generic_utils import setup_model
from TTS.utils.io import load_config
from TTS.utils.text.symbols import symbols, phonemes
from TTS.utils.audio import AudioProcessor
from TTS.utils.synthesis import synthesis

from TTS.vocoder.utils.generic_utils import setup_generator

# Runtime settings
use_cuda = False

# Model paths
TTS_MODEL = "tts_model.pth.tar"
TTS_CONFIG = "config.json"
VOCODER_MODEL = "vocoder_model.pth.tar"
VOCODER_CONFIG = "config_vocoder.json"

# Load the configs
TTS_CONFIG = load_config(TTS_CONFIG)
VOCODER_CONFIG = load_config(VOCODER_CONFIG)

# Load the audio processor
ap = AudioProcessor(**TTS_CONFIG.audio) 

# Load the TTS model
speaker_id = None
speakers = []
num_chars = len(phonemes) if TTS_CONFIG.use_phonemes else len(symbols)
model = setup_model(num_chars, len(speakers), TTS_CONFIG)
cp =  torch.load(TTS_MODEL, map_location=torch.device('cpu'))
model.load_state_dict(cp['model'])
if use_cuda:
    model.cuda()
model.eval()
if 'r' in cp:
    model.decoder.set_r(cp['r'])

# Load the vocoder model
vocoder_model = setup_generator(VOCODER_CONFIG)
vocoder_model.load_state_dict(torch.load(VOCODER_MODEL, map_location="cpu")["model"])
vocoder_model.remove_weight_norm()
vocoder_model.inference_padding = 0
ap_vocoder = AudioProcessor(**VOCODER_CONFIG['audio'])    
if use_cuda:
    vocoder_model.cuda()
vocoder_model.eval()

◎ Running inference

sentence =  "Bill got in the habit of asking himself “Is that thought true?” and if he wasn’t absolutely certain it was, he just let it go."
align, spec, stop_tokens, wav = tts(model, sentence, TTS_CONFIG, use_cuda, ap, use_gl=False, figures=True)
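
To keep the result outside the notebook, the returned waveform can be written to a WAV file. The following is a minimal standard-library sketch, assuming the waveform is a sequence of floats in the range [-1.0, 1.0]; the helper `save_wav` is written for this illustration and is not part of the code above.

```python
import struct
import wave


def save_wav(samples, sample_rate, path):
    """Write floats in [-1.0, 1.0] as a mono 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))   # clip to the valid range
        frames += struct.pack("<h", int(s * 32767))
    with wave.open(path, "wb") as f:
        f.setnchannels(1)        # mono
        f.setsampwidth(2)        # 2 bytes = 16 bits per sample
        f.setframerate(sample_rate)
        f.writeframes(bytes(frames))
```

With the variables from the inference call, this could be used as `save_wav(wav, TTS_CONFIG.audio['sample_rate'], "output.wav")`, where "output.wav" is a hypothetical file name.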

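8. Training and fine-tuning

For how to train a model on the LJSpeech dataset, the following Colab notebook is a useful reference.

First, split "metadata.csv" into training data "metadata_train.csv" and validation data "metadata_val.csv". For text-to-speech, validation performance can be misleading, because the loss value neither directly measures audio quality as perceived by the human ear nor measures the performance of the attention module. Running the model and listening to the results is therefore the best way to judge it.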

shuf metadata.csv > metadata_shuf.csv
head -n 12000 metadata_shuf.csv > metadata_train.csv
tail -n 1100 metadata_shuf.csv > metadata_val.csv
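
The same split can be expressed in Python, which also makes the arithmetic explicit: the LJSpeech metadata has 13,100 lines, so the first 12,000 and the last 1,100 shuffled lines do not overlap. This is a sketch only; the function name `split_metadata` and the fixed seed are assumptions added for reproducibility, not part of the shell commands above.

```python
import random


def split_metadata(lines, n_train, n_val, seed=0):
    """Shuffle lines, then take the first n_train for training and the last n_val for validation."""
    if n_train + n_val > len(lines):
        raise ValueError("train and validation sets would overlap")
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[len(shuffled) - n_val:]
```

For a dataset of a different size, adjust `n_train` and `n_val` so that their sum does not exceed the number of lines; otherwise `head` and `tail` (or this function) would place some clips in both sets.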

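To train a new model, you need to define your own "config.json" that specifies the model details, training configuration, and so on (see the sample for details). Then call the corresponding training script.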
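
As an illustration of pointing a configuration at the split files, the sketch below sets a dataset entry on a config held as a Python dictionary. The field names (`datasets`, `name`, `path`, `meta_file_train`, `meta_file_val`) are assumed to match the repository's sample config and should be checked against your copy; the helper `set_dataset` and the dataset path are hypothetical.

```python
import json


def set_dataset(config, name, path, meta_file_train, meta_file_val):
    """Return a copy of `config` whose `datasets` list holds a single entry."""
    updated = dict(config)
    updated["datasets"] = [{
        "name": name,                        # assumed to select the formatter function
        "path": path,                        # dataset root folder
        "meta_file_train": meta_file_train,
        "meta_file_val": meta_file_val,
    }]
    return updated


config = set_dataset({"batch_size": 32}, "ljspeech", "/path/to/LJSpeech-1.1/",
                     "metadata_train.csv", "metadata_val.csv")
print(json.dumps(config, indent=2))
```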

The command to train a "Tacotron" or "Tacotron2" model on the LJSpeech dataset is as follows.

python TTS/bin/train_tacotron.py --config_path TTS/tts/configs/config.json

To fine-tune a model, use "--restore_path".

python TTS/bin/train_tacotron.py --config_path TTS/tts/configs/config.json --restore_path /path/to/your/model.pth.tar

To continue a previous training run, use "--continue_path".

python TTS/bin/train_tacotron.py --continue_path /path/to/your/run_folder/

For multi-GPU training, use "distribute.py". It runs the provided training script in a multi-GPU setting.

CUDA_VISIBLE_DEVICES="0,1,4" python TTS/bin/distribute.py --script train_tacotron.py --config_path TTS/tts/configs/config.json
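
`CUDA_VISIBLE_DEVICES` restricts which physical GPUs the process can see, so the command above exposes three devices (0, 1, and 4). The sketch below parses the variable to show that mapping; the helper `visible_gpu_ids` is hypothetical and assumes the variable holds comma-separated integer ids.

```python
import os


def visible_gpu_ids(env=None):
    """Parse CUDA_VISIBLE_DEVICES into a list of physical GPU ids."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES", "")
    return [int(x) for x in value.split(",") if x.strip()]


ids = visible_gpu_ids({"CUDA_VISIBLE_DEVICES": "0,1,4"})
print(len(ids), ids)  # 3 [0, 1, 4]
```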

Each run creates a new output folder containing the "config.json" that was used, the model checkpoints, and the TensorBoard logs.

If an error or exception is caught and there is no checkpoint yet under the output folder, the whole folder is deleted.
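
The cleanup rule can be pictured with the following sketch, which deletes a run folder only when it contains no checkpoint file. This is an illustration of the described behavior, not the library's code; the function name `remove_if_no_checkpoint` and the `*.pth.tar` pattern (taken from the checkpoint file names used in this article) are assumptions.

```python
import glob
import os
import shutil


def remove_if_no_checkpoint(run_folder, pattern="*.pth.tar"):
    """Delete `run_folder` when it holds no checkpoint; return True if it was deleted."""
    checkpoints = glob.glob(os.path.join(run_folder, pattern))
    if checkpoints:
        return False          # keep runs that already produced a checkpoint
    shutil.rmtree(run_folder, ignore_errors=True)
    return True
```

The effect is that runs that crash early do not leave empty folders behind, while any run that saved at least one checkpoint is preserved.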

You can also monitor training in TensorBoard by passing the experiment folder to TensorBoard's "--logdir" argument.

9. Related
