Trying Japanese speech synthesis with NVIDIA/tacotron2 (1) - Getting started

March 1, 2021 08:48

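1. Introduction

Last week, the "Tsukuyomi-chan Corpus" (つくよみちゃんコーパス) was released.

So I tried Japanese speech synthesis with "NVIDIA/tacotron2".

That said, training on the "Tsukuyomi-chan Corpus" right away seemed likely to fail, so this time I followed Shirowani's (シロワニ) guide and started with speech synthesis on the "Japanese Single Speaker Speech Dataset".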

2. Preparing the dataset

This time I use the "Japanese Single Speaker Speech Dataset".

・transcript.txt - list of wav file names and the spoken lines
・meian - folder that holds the wav files
・meian_XXXX.wav - wav file
:

2-1. transcript.txt

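The "transcript.txt" of the "Japanese Single Speaker Speech Dataset" has the following format. Each line holds four "|"-separated fields: wav path, Japanese text, romaji, and duration in seconds.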

meian/meian_0000.wav|この前探った時は、途中に瘢痕の隆起があったので、ついそこが行きどまりだとばかり思って、ああ云ったんですが、|kono mae sagut ta toki wa 、 tochu- ni hankon no ryu-ki ga at ta node 、 tsui soko ga yukidomari da to bakari omot te 、 a- yut ta n desu ga 、|8.77
meian/meian_0001.wav|今日疎通を好くするために、そいつをがりがり掻き落して見ると、まだ奥があるんです」|kyo- sotsu- wo yoku suru tame ni 、 soitsu wo garigari kaki otoshi te miru to 、 mada oku ga aru n desu|7.48
    :

Convert this into the format used by "NVIDIA/tacotron2".

meian/meian_0000.wav|konomaesaguttatokiwa,tochu-nihankonnoryu-kigaattanode,tsuisokogayukidomaridatobakariomotte,a-yuttandesuga,
meian/meian_0001.wav|kyo-sotsu-woyokusurutameni,soitsuwogarigarikakiotoshitemiruto,madaokugaarundesu

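The preprocessing is as follows.

(1) Convert the romaji field of the original data as follows.

・「、」 → 「,」
・「。」 → 「.」
・「'」 → 「n」
・「―」 → <removed>
・「？」 → <removed>
・「！」 → <removed>
・<half-width space> → <removed>

(2) Remove entries whose Japanese text starts with a kanji numeral or with 「章おわり。」 ("end of chapter"), and entries whose romaji is shorter than 10 characters.

・Entries starting with a kanji numeral or 「章おわり。」 : these clips often contain announcement-style phrases such as "recorded for ◎◎", so they are removed.
・Entries shorter than 10 characters : these are often broken, so they are removed.

The conversion script is as follows.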

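import os

# Convert transcript.txt into the "wav path|text" format for NVIDIA/tacotron2
in_path = 'archive/transcript.txt'
out_path = 'filelists/transcript.txt'
output = []
# Prefixes that mark entries to drop: kanji numerals and 「章おわり。」
nums = ['一', '二', '三', '四', '五', '六', '七', '八', '九', '十', '百', '章おわり。']
with open(in_path, encoding='utf-8') as f:
    for line in f:
        # Fields: wav path | Japanese text | romaji | duration
        strs = line.split('|')

        # Skip entries whose Japanese text starts with one of the prefixes
        if strs[1].startswith(tuple(nums)):
            print(strs[0])
            continue
        # Skip entries whose romaji is shorter than 10 characters
        if len(strs[2]) < 10:
            continue

        # Apply the replacements listed in (1)
        strs[2] = strs[2].replace('、', ',')
        strs[2] = strs[2].replace('。', '.')
        strs[2] = strs[2].replace("'", 'n')
        strs[2] = strs[2].replace('―', '')
        strs[2] = strs[2].replace('？', '')
        strs[2] = strs[2].replace('！', '')
        strs[2] = strs[2].replace(' ', '')
        output.append(strs[0] + '|' + strs[2] + '\n')

# Create the output folder if needed, then write the converted list
os.makedirs(os.path.dirname(out_path), exist_ok=True)
with open(out_path, 'w', encoding='utf-8') as f:
    f.writelines(output)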
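
The replacement rules in (1) can also be collected into one small function, which makes it easy to try them on a single line before converting the whole file. This is only a minimal sketch: `normalize_romaji` and `REPLACEMENTS` are names made up for illustration and are not part of NVIDIA/tacotron2. The sample is the romaji field of meian_0001.wav shown above.

```python
# Hypothetical helper (not part of NVIDIA/tacotron2): applies the replacement
# rules listed in (1) to the romaji field of one transcript line.
REPLACEMENTS = [
    ('、', ','),
    ('。', '.'),
    ("'", 'n'),
    ('―', ''),
    ('？', ''),
    ('！', ''),
    (' ', ''),
]


def normalize_romaji(text):
    """Apply every (old, new) replacement in order and return the result."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text


sample = ('kyo- sotsu- wo yoku suru tame ni 、 soitsu wo garigari kaki '
          'otoshi te miru to 、 mada oku ga aru n desu')
print(normalize_romaji(sample))
```

The printed line should match the converted entry for meian_0001.wav shown earlier.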

2-2. wav

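The wav files of the "Japanese Single Speaker Speech Dataset" use 32-bit samples. Convert them to 16-bit for "NVIDIA/tacotron2". The script below also loads each file as 22050 Hz mono.

The conversion script is as follows.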

$ pip install librosa==0.8.0
$ pip install pysoundfile==0.9.0.post1

import os
import librosa
import soundfile as sf

# Convert the sample format to 16-bit PCM
in_path = 'archive/meian/'
out_path = 'filelists/meian/'
os.makedirs(out_path, exist_ok=True)  # sf.write does not create the folder
for filename in os.listdir(in_path):
    print(out_path + filename)
    # Load as 22050 Hz mono
    y, sr = librosa.load(in_path + filename, sr=22050, mono=True)
    # Write as 16-bit PCM
    sf.write(out_path + filename, y, sr, subtype='PCM_16')
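
To confirm that a converted file really is 16-bit, 22050 Hz, and mono, the header can be read back with Python's standard `wave` module. This is a small sketch, not part of the original workflow; `describe_wav` is a hypothetical helper name. Note that `wave` only reads integer PCM files, so it is meant for the converted output, not for the 32-bit originals.

```python
import wave


def describe_wav(path):
    """Return (channels, bits per sample, sample rate) read from a PCM wav header."""
    with wave.open(path, 'rb') as w:
        return w.getnchannels(), w.getsampwidth() * 8, w.getframerate()


# Example call (the path is illustrative):
# print(describe_wav('filelists/meian/meian_0000.wav'))
```

For a correctly converted file, the call should return one channel, 16 bits, and a rate of 22050.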

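3. Training

This time I train on "Google Colab" using "NVIDIA/tacotron2".

(1) In the "Google Colab" menu, open "Edit → Notebook settings" and select "GPU".

(2) Create a working folder.
I want the data to persist, so I create the working folder on Google Drive.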

# Create the working folder
from google.colab import drive
drive.mount('/content/drive')
!mkdir -p /content/drive/'My Drive'/work/
%cd /content/drive/'My Drive'/work/

(3) Switch to "TensorFlow 1.x".

%tensorflow_version 1.x

(4) Install "PyTorch 1.6".

# Install PyTorch 1.6
!pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

(5) Clone "NVIDIA/tacotron2".

# Clone NVIDIA/tacotron2 and fetch its submodules
!git clone https://github.com/NVIDIA/tacotron2.git
%cd tacotron2
!git submodule init
!git submodule update

(6) Install "Apex".

# Install Apex
!git clone https://github.com/NVIDIA/apex
%cd apex
!pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
%cd ..

(7) Install the dependencies.

!pip install -r requirements.txt

I got an error, so I also installed the following.

!pip install Unidecode==1.0.22

(8) Place the dataset.

・work
　・tacotron2
　　・filelists
　　　・transcript.txt ←★ place here
　　・meian ←★ place here
　　　・meian_XXXX.wav
　　　　　:

(9) Split the dataset into training data and validation data.

# Split into training data and validation data
!head -n 6500 filelists/transcript.txt > filelists/transcript_train.txt
!tail -n 85 filelists/transcript.txt > filelists/transcript_val.txt
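
The head and tail commands above only produce disjoint sets when the file has at least 6585 lines (6500 + 85); with fewer lines the two files overlap. As an alternative, here is a sketch that derives the split from the actual line count, holding out the last lines for validation. `split_filelist` and `write_split` are hypothetical helper names, and the default of 85 validation lines simply mirrors the command above.

```python
def split_filelist(lines, n_val=85):
    """Return (train, val), where val is the last n_val lines and train is the rest."""
    if not 0 < n_val < len(lines):
        raise ValueError('n_val must be between 1 and len(lines) - 1')
    return lines[:-n_val], lines[-n_val:]


def write_split(in_path, train_path, val_path, n_val=85):
    """Read a filelist and write the non-overlapping train and validation files."""
    with open(in_path, encoding='utf-8') as f:
        lines = f.readlines()
    train, val = split_filelist(lines, n_val)
    with open(train_path, 'w', encoding='utf-8') as f:
        f.writelines(train)
    with open(val_path, 'w', encoding='utf-8') as f:
        f.writelines(val)


# Example call (paths as used in this article):
# write_split('filelists/transcript.txt',
#             'filelists/transcript_train.txt',
#             'filelists/transcript_val.txt')
```

Unlike the head command, this uses every line outside the validation set for training, so the training file is not capped at 6500 lines.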

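(10) Download the "Tacotron2 model" and the "WaveGlow model" from the official site and place them in the tacotron2 folder.

・Tacotron2 model : a model that converts (English) text into a mel spectrogram.
・WaveGlow model : a model that converts a mel spectrogram into audio.

This time, the English "Tacotron2 model" is used as the starting point for transfer learning, and the "WaveGlow model" is used as-is.

(11) Edit "hparams.py".
"hparams.py" is the script that defines the hyperparameters. Change the following entries. Since this is a practice run, I set 100 epochs.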

    :
epochs=100,
    :
training_files='filelists/transcript_train.txt',
validation_files='filelists/transcript_val.txt',
text_cleaners=['basic_cleaners'],
    :
batch_size=32,
    :

・epochs : number of epochs
・training_files : path to the training file list
・validation_files : path to the validation file list
・text_cleaners : cleaners used for text preprocessing
・batch_size : batch size
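
The checkpoints are numbered by iteration, not by epoch, so it helps to know roughly how the two relate. The arithmetic below is a rough sketch under two assumptions: the training list has the 6500 lines produced in (9), and the training loader drops the final incomplete batch (check train.py in your checkout to confirm).

```python
n_train = 6500      # lines in filelists/transcript_train.txt, from step (9)
batch_size = 32     # value set in hparams.py
epochs = 100        # value set in hparams.py

# Assumption: the last incomplete batch of each epoch is dropped.
iters_per_epoch = n_train // batch_size
total_iters = iters_per_epoch * epochs

print(iters_per_epoch)  # 203
print(total_iters)      # 20300
```

Under these assumptions, a checkpoint saved at iteration 4000 falls in epoch 19 (counting from 0), about a fifth of the way through the run.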

(12) Run the training.

!python train.py --output_directory=outdir --log_directory=logdir -c tacotron2_statedict.pt --warm_start

WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
 * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
 * https://github.com/tensorflow/addons
 * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

FP16 Run: False
Dynamic Loss Scaling: True
Distributed Run: False
cuDNN Enabled: True
cuDNN Benchmark: False
Warm starting model from checkpoint 'tacotron2_statedict.pt'
Epoch: 0
Train loss 0 4.530630 Grad Norm 106.327721 4.36s/it
Validation loss 0: 74.075213  
Saving model and optimizer state at iteration 0 to outdir/checkpoint_0
Train loss 1 1.618356 Grad Norm 14.370405 3.57s/it
Train loss 2 3.936485 Grad Norm 68.261269 3.69s/it
Train loss 3 2.413089 Grad Norm 33.520912 3.78s/it
Train loss 4 1.628413 Grad Norm 13.512403 3.56s/it
    :

The training results are written to "outdir".
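
If the console output is saved to a text file, the per-iteration loss can be pulled out of the "Train loss" lines for a quick look at the trend. This is a minimal sketch that assumes the line layout shown in the log above; `parse_train_lines` is a hypothetical helper, not something the repository provides.

```python
import re

# Matches lines such as:
# "Train loss 1 1.618356 Grad Norm 14.370405 3.57s/it"
TRAIN_RE = re.compile(r'^Train loss (\d+) ([\d.]+) Grad Norm ([\d.]+) ([\d.]+)s/it')


def parse_train_lines(log_text):
    """Return a list of (iteration, train loss, grad norm) tuples."""
    rows = []
    for line in log_text.splitlines():
        m = TRAIN_RE.match(line.strip())
        if m:
            rows.append((int(m.group(1)), float(m.group(2)), float(m.group(3))))
    return rows


log = """Epoch: 0
Train loss 0 4.530630 Grad Norm 106.327721 4.36s/it
Validation loss 0: 74.075213
Train loss 1 1.618356 Grad Norm 14.370405 3.57s/it
"""
for iteration, loss, grad_norm in parse_train_lines(log):
    print(iteration, loss, grad_norm)
```

Lines that do not start with "Train loss", such as the validation and checkpoint messages, are skipped.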

4. Inference

For the inference procedure, "tacotron2/inference.ipynb" is a useful reference.

(1) Import the packages.

import matplotlib
%matplotlib inline
import matplotlib.pylab as plt

import IPython.display as ipd

import sys
sys.path.append('waveglow/')
import numpy as np
import torch

from hparams import create_hparams
from model import Tacotron2
from layers import TacotronSTFT, STFT
from audio_processing import griffin_lim
from train import load_model
from text import text_to_sequence
from denoiser import Denoiser

(2) Prepare plot_data().

def plot_data(data, figsize=(16, 4)):
    fig, axes = plt.subplots(1, len(data), figsize=figsize)
    for i in range(len(data)):
        axes[i].imshow(data[i], aspect='auto', origin='lower',
                       interpolation='none')

(3) Prepare the hyperparameters.

hparams = create_hparams()
hparams.sampling_rate = 22050

(4) Load the Japanese Tacotron2 model.
I could not wait, so I tried a model trained for 4000 steps.

checkpoint_path = "outdir/checkpoint_4000"
model = load_model(hparams)
model.load_state_dict(torch.load(checkpoint_path)['state_dict'])
_ = model.cuda().eval().half()

(5) Load the WaveGlow model.

waveglow_path = 'waveglow_256channels_universal_v5.pt'
waveglow = torch.load(waveglow_path)['model']
waveglow.cuda().eval().half()
for k in waveglow.convinv:
    k.float()
denoiser = Denoiser(waveglow)

(6) Convert "konnichiwa" (hello) into a sequence of symbol IDs.

text = "konnichiwa"
sequence = np.array(text_to_sequence(text, ['basic_cleaners']))[None, :]
sequence = torch.autograd.Variable(
    torch.from_numpy(sequence)).cuda().long()
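
What this step produces is a list of integer IDs, one per character of the romaji text, which the model then consumes. The toy sketch below shows the idea only. Its symbol table and the name `toy_text_to_sequence` are made up for illustration; the real table lives in the repository's text module and differs, so these IDs will not match the output of `text_to_sequence`.

```python
# Toy symbol table for illustration only. The real table in the repository's
# text module is different, so these IDs will not match text_to_sequence.
SYMBOLS = ['_'] + list(' ,.-') + list('abcdefghijklmnopqrstuvwxyz')
SYMBOL_TO_ID = {s: i for i, s in enumerate(SYMBOLS)}


def toy_text_to_sequence(text):
    """Lowercase the text and map each known character to its integer ID."""
    return [SYMBOL_TO_ID[ch] for ch in text.lower() if ch in SYMBOL_TO_ID]


print(toy_text_to_sequence('konnichiwa'))
```

Characters that are not in the table are simply dropped here, which is one reason the transcript was reduced to plain romaji and a few punctuation marks during preprocessing.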

(7) Run inference and check the mel spectrograms and the alignment in plots.

mel_outputs, mel_outputs_postnet, _, alignments = model.inference(sequence)
plot_data((mel_outputs.float().data.cpu().numpy()[0],
           mel_outputs_postnet.float().data.cpu().numpy()[0],
           alignments.float().data.cpu().numpy()[0].T))

(8) Convert to audio.

with torch.no_grad():
    audio = waveglow.infer(mel_outputs_postnet, sigma=0.666)
ipd.Audio(audio[0].data.cpu().numpy(), rate=hparams.sampling_rate)

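Partly thanks to transfer learning from the English model, the intonation sounds like a non-native speaker, but I could tell it was saying "konnichiwa".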

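[Bonus] train.py arguments

-o,--output_directory : folder where checkpoints are saved
-l,--log_directory : folder where TensorBoard logs are saved
-c,--checkpoint_path : path to a checkpoint
--warm_start : load only the model weights, ignoring the specified layers
--n_gpus : number of GPUs
--rank : rank of the current GPU
--group_name : distributed group name
--hparams : comma-separated name=value pairs

◎ Transfer learning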

!python train.py --output_directory=outdir --log_directory=logdir -c tacotron2_statedict.pt --warm_start

◎ Resuming training from a checkpoint

!python train.py --output_directory=outdir --log_directory=logdir -c outdir/checkpoint_4000
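
When resuming, the newest checkpoint has to be picked by its iteration number, because sorting the names alphabetically would place checkpoint_9000 after checkpoint_10000. A small sketch for that, assuming the "checkpoint_<iteration>" naming seen in the training log; `latest_checkpoint` is a hypothetical helper.

```python
import os
import re


def latest_checkpoint(names):
    """Return the checkpoint_<iteration> name with the largest iteration, or None."""
    best = None
    for name in names:
        m = re.fullmatch(r'checkpoint_(\d+)', name)
        if m and (best is None or int(m.group(1)) > best[0]):
            best = (int(m.group(1)), name)
    return best[1] if best else None


# Example call (folder name as used in this article):
# print(latest_checkpoint(os.listdir('outdir')))
```

The returned name can then be joined with the output folder and passed to the -c argument.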
