UnityからStyle-Bert-VITS2のAPIを呼ぶときに、BudouXのUnity版であるUniBudouXを使ってテキストを自動的に100文字以下に分割して音声合成する

2024年2月10日 00:34

趣味でAITuberを作っています。UnityでVRMを表示し、発話の生成はローカルLLMで。そして音声合成はローカルで動かしているStyle-Bert-VITS2をつくよみちゃんコーパスで学習させたものを使わせてもらっています。以下が最初のテスト配信です。合成音声コンテンツの本場はニコニコだろうということでニコ生でやりました。今後もニコ生メインでやってみたい。

UnityからStyle-Bert-VITS2を呼び出すときは、Style-Bert-VITS2のAPIサーバー機能を起動して、HTTPリクエストを投げる形で音声を生成します。

ただ、ちょっと悩ましいのが、Style-Bert-VITS2のAPIサーバーは100文字までしか受け付けないことです。

7BサイズのローカルLLMだと、プロンプトが常に思った通りに効くわけではないため、100文字以内と指示しても、実際の出力は100文字を超えてしまうことがあります。APIサーバーに渡すテキストが100文字を超えるとエラーになるので、100文字以内に切り詰める必要があります。

ただ、単純に文字数で切り詰めると、発話が途中で切れてしまってよろしくない。ちゃんと、意味の切れ目で分けたいです。改行や「、」「。」「？」「！」などで切るのも一つの方法ですが、ルイズコピペみたいな文章は、できるだけ一息で発話してこそ面白いので、できれば記号ではなく意味で切りたい。

そこで、BudouXを使ってみることにしました。元々は適切な改行位置で改行してくれるライブラリですが、今回は文字列のいい感じの切れ目を見つける用途として使います。

UnityからBudouXを使う、UniBudouXというライブラリがあります。これを使えば、C#だけでBudouXの力を借りることができます。

実際には、以下の様なシンプルな形になります。与えられたテキストを、100文字以下になるように分割してList<string>で返しています。

using System;
using System.Collections.Generic;
using UniBudouX;

namespace MyScripts
{
    public class BudouXTextSegmentation
    {
        public static List<string> BuouXParser(string str)
        {
            List<string> chunks = Parser.Parse(str);
            return chunks;
        }
        
        public static List<string> SegmentTextInto100Char(List<string> phrases)
        {
            List<string> segmentedTexts = new List<string>();
            string currentText = "";

            foreach (var phrase in phrases)
            {
                if ((currentText.Length + phrase.Length) <= 100)
                {
                    // 現在のテキストと新しいフレーズを結合
                    currentText += phrase;
                }
                else
                {
                    // 100文字を超える場合、現在のテキストをリストに追加し、新たなテキストを開始
                    segmentedTexts.Add(currentText);
                    currentText = phrase;
                }
            }

            // 最後のテキストをリストに追加
            if (!string.IsNullOrEmpty(currentText))
            {
                segmentedTexts.Add(currentText);
            }

            return segmentedTexts;
        }

    }
}

実際に使う場合、このList<string>を渡すと、List<AudioClip>を返すGenerateVoiceAsync関数にしてみました。

GenerateVoiceAsyncの中身はStyle-Bert-VITS2のAPIサーバーにHTTPリクエストを渡しています。単なる興味で、UnityWebRequestではなく、CysharpさんのYetAnotherHttpHandlerを使っています。Waveファイルのデコードに使っているWavDecocerは、もちねこさんが公開されているsimple-audio-codec-unityを使っています。

using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Threading;
using Cysharp.Threading.Tasks;
using Cysharp.Net.Http;
using UnityEngine;
using Mochineko.SimpleAudioCodec;

namespace MyScripts
{
    public static class BertVits2Client
    {

        public static async UniTask<List<AudioClip>> GetVoiceAsync(string text, BertVits2RequestParams requestParams,
            CancellationToken cancellationToken)
        {
            List<string> segmentedTexts = BudouXTextSegmentation.BuouXParser(text);
            List<string> segmentedTexts100Char = BudouXTextSegmentation.SegmentTextInto100Char(segmentedTexts);
            List<AudioClip> audioClips = new List<AudioClip>();
            foreach (var segmentedText in segmentedTexts100Char)
            {
                AudioClip audioClip = await GenerateVoiceAsync(segmentedText, requestParams, cancellationToken);
                audioClips.Add(audioClip);
            }
            return audioClips;
        }

        public static async UniTask<AudioClip> GenerateVoiceAsync(string text, BertVits2RequestParams requestParams,
            CancellationToken cancellationToken)
        {
        using YetAnotherHttpHandler handler = new YetAnotherHttpHandler();
            HttpClient client = new HttpClient(handler);
            
            // textの改行とタブと絵文字を削除する
            text = text.Replace("\n", "");
            text = text.Replace("\t", "");
            
            // textが100文字を超えていたら、99文字目以降を削除する（念のため）
            if (text.Length > 100)
            {
                text = text.Substring(0, 99);
            }
            
            // textが空文字だったら、「えっと」を話す
            if (text == "")
            {
                text = $"えっと、";
            }
            
            // Request parameters
            var queryParameters = $"?text={Uri.EscapeDataString(text)}&model_id={requestParams.ModelId}&speaker_id={requestParams.SpeakerId}&sdp_ratio={requestParams.SdpRatio}&noise={requestParams.Noise}&noisew={requestParams.Noisew}&length={requestParams.Length}&language={Uri.EscapeDataString(requestParams.Language)}&auto_split={requestParams.AutoSplit}&split_interval={requestParams.SplitInterval}&assist_text_weight={requestParams.AssistTextWeight}&style={Uri.EscapeDataString(requestParams.Style)}&style_weight={requestParams.StyleWeight}";
            var uri = new Uri("http://127.0.0.1:5000/voice" + queryParameters);

            // Setup request
            var request = new HttpRequestMessage(HttpMethod.Get, uri);
            request.Headers.Accept.Clear();
            request.Headers.Accept.Add(new System.Net.Http.Headers.MediaTypeWithQualityHeaderValue("audio/wav"));

            // Send request
            HttpResponseMessage response = await client.SendAsync(request, cancellationToken);
            // responseがAudio/wavでなかったらスキップ
            if (response.Content.Headers.ContentType.MediaType != "audio/wav")
            {
                Debug.LogError("Response is not audio/wav");
                return null;
            }
            
            byte[] responseBody = await response.Content.ReadAsByteArrayAsync();
            string responseText = await response.Content.ReadAsStringAsync();
            Debug.Log($"Response received: {responseText}");
            string tmpPath = Application.dataPath + "/tmp.wav";
            await System.IO.File.WriteAllBytesAsync(tmpPath, responseBody, cancellationToken);
            AudioClip audioClip = null;
            try
            {
                using (Stream stream = File.OpenRead(tmpPath))
                {
                    try
                    {
                        audioClip = await WaveDecoder.DecodeByBlockAsync(stream, tmpPath, cancellationToken);
                        Debug.Log($"Succeeded to decode wave:{audioClip.samples}");
                    }
                    catch (Exception decodeException)
                    {
                        Debug.LogException(decodeException);
                    }
                }
            }
            catch (Exception openException)
            {
                Debug.LogException(openException);
            }
            return audioClip;
        }
        
    }
    public class BertVits2RequestParams
    {
        public int ModelId { get; set; }
        public int SpeakerId { get; set; }
        public double SdpRatio { get; set; }
        public double Noise { get; set; }
        public double Noisew { get; set; }
        public double Length { get; set; }
        public string Language { get; set; }
        public bool AutoSplit { get; set; }
        public double SplitInterval { get; set; }
        public double AssistTextWeight { get; set; }
        public string Style { get; set; }
        public int StyleWeight { get; set; }
        
        public BertVits2RequestParams()
        {
            // Set default values
            ModelId = 4;  // Replace 0 with your default value for ModelId
            SpeakerId = 0;  // Replace 0 with your default value for SpeakerId
            SdpRatio = 0.2;  // Replace 0.2 with your default value for SdpRatio
            Noise = 0.6;  // Replace 0.6 with your default value for Noise
            Noisew = 0.8;  // Replace 0.8 with your default value for Noisew
            Length = 1.0;  // Replace 1.0 with your default value for Length
            Language = "JP";  // Replace with your default value for Language
            AutoSplit = false;  // Replace true with your default value for AutoSplit
            SplitInterval = 0.5;  // Replace 0.5 with your default value for SplitInterval
            AssistTextWeight = 1.0;  // Replace 1.0 with your default value for AssistTextWeight
            Style = "Neutral";  // Replace with your default value for Style
            StyleWeight = 5;  // Replace 5 with your default value for StyleWeight
        }
    }
}

あとは、返ってきたList<AudioClip>をforeachでAudioSourceにくっつけて、AudioSource.Playなどで再生すれば音声が出ます。ここは人それぞれ用途によってやり方があると思います。

自分用なので、わりとスパゲッティな感じですがご勘弁を🍝

Style-Bert-VITS2はStyleやAssist Textで音の表情を与えることもでき、表現力が豊かなので、鍛えれば鍛えるほど魅力的になりそうです。

この記事が気に入ったらサポートをしてみませんか？