Running text_to_speech_demo (macOS)

Introduction

Let's try text_to_speech_demo, one of the demos included in Open Model Zoo.

Environment

This time we run it on macOS. (The steps are equivalent on other operating systems.)

MacBook Pro (13-inch, 2018, Four Thunderbolt 3 Ports)
2.7 GHz quad-core Intel Core i7, 16 GB memory
macOS Big Sur 11.1
Python 3.7.7
openvino 2021.2.185

Checking the models

Open models.lst to see which models the demo uses. Four models are required. If you do not have them yet, get them with the Model Downloader.

# This file can be used with the --list option of the model downloader.
forward-tacotron-duration-prediction
forward-tacotron-regression
wavernn-rnn
wavernn-upsampler
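
Before running the demo, it can help to confirm that all four models are actually on disk. The following is a minimal sketch, not part of the demo itself: the helper names read_model_names and expected_ir_path are hypothetical, and the directory layout (./models/<name>/FP16/<name>.xml) is an assumption taken from the command used later in this article.

```python
from pathlib import Path

# Content of models.lst as shown above (embedded here so the sketch is self-contained).
MODELS_LST = """\
# This file can be used with the --list option of the model downloader.
forward-tacotron-duration-prediction
forward-tacotron-regression
wavernn-rnn
wavernn-upsampler
"""


def read_model_names(list_text):
    """Return model names from models.lst text, skipping blank lines and '#' comments."""
    names = []
    for line in list_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.append(line)
    return names


def expected_ir_path(models_dir, name, precision="FP16"):
    """Build the .xml path for a model (layout is an assumption, see the lead-in)."""
    return Path(models_dir) / name / precision / (name + ".xml")


if __name__ == "__main__":
    # Report which model files are present under ./models.
    for name in read_model_names(MODELS_LST):
        path = expected_ir_path("./models", name)
        status = "found" if path.exists() else "missing"
        print(status.ljust(8), path)
```

Any model reported as missing still needs to be downloaded.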

Checking the help

% python3 text_to_speech_demo.py -h
usage: text_to_speech_demo.py [-h] -m_duration MODEL_DURATION -m_forward
                              MODEL_FORWARD -m_upsample MODEL_UPSAMPLE -m_rnn
                              MODEL_RNN -i INPUT [-o OUT]
                              [--upsampler_width UPSAMPLER_WIDTH] [-d DEVICE]

Options:
  -h, --help            Show this help message and exit.
  -m_duration MODEL_DURATION, --model_duration MODEL_DURATION
                        Required. Path to ForwardTacotron`s duration
                        prediction part (*.xml format).
  -m_forward MODEL_FORWARD, --model_forward MODEL_FORWARD
                        Required. Path to ForwardTacotron`s mel-spectrogram
                        regression part (*.xml format).
  -m_upsample MODEL_UPSAMPLE, --model_upsample MODEL_UPSAMPLE
                        Required. Path to WaveRNN`s part for mel-spectrogram
                        upsampling by time axis (*.xml format).
  -m_rnn MODEL_RNN, --model_rnn MODEL_RNN
                        Required. Path to WaveRNN`s part for waveform
                        autoregression (*.xml format).
  -i INPUT, --input INPUT
                        Text file with text.
  -o OUT, --out OUT     Required. Path to an output .wav file
  --upsampler_width UPSAMPLER_WIDTH
                        Width for reshaping of the model_upsample. If -1 then
                        no reshape. Do not use with FP16 model.
  -d DEVICE, --device DEVICE
                        Optional. Specify the target device to infer on; CPU,
                        GPU, FPGA, HDDL, MYRIAD or HETERO is acceptable. The
                        sample will look for a suitable plugin for device
                        specified. Default value is CPU

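Running the demo

Given a text file, the demo writes the synthesized speech to a .wav file. First, let's have it speak a quote from Vincent van Gogh. The quote goes into the input text file (test.txt here), which is passed with -i.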

Your life would be very empty if you had nothing to regret.
% python3 text_to_speech_demo.py -m_duration ./models/forward-tacotron-duration-prediction/FP16/forward-tacotron-duration-prediction.xml -m_forward ./models/forward-tacotron-regression/FP16/forward-tacotron-regression.xml -m_upsample ./models/wavernn-upsampler/FP16/wavernn-upsampler.xml -m_rnn ./models/wavernn-rnn/FP16/wavernn-rnn.xml -i test.txt -o test.wav
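
The command above is long and easy to mistype, so here is a small sketch that assembles it from the four model names. The function build_command and the FLAG_TO_MODEL table are hypothetical helpers for illustration; the flags themselves come from the help output above, and the path layout is the one used in the command above.

```python
import shlex

# Demo flag -> model name, following the help text and models.lst.
FLAG_TO_MODEL = {
    "-m_duration": "forward-tacotron-duration-prediction",
    "-m_forward": "forward-tacotron-regression",
    "-m_upsample": "wavernn-upsampler",
    "-m_rnn": "wavernn-rnn",
}


def build_command(input_txt, output_wav, models_dir="./models", precision="FP16"):
    """Return the demo command as an argument list."""
    argv = ["python3", "text_to_speech_demo.py"]
    for flag, name in FLAG_TO_MODEL.items():
        argv += [flag, "{0}/{1}/{2}/{1}.xml".format(models_dir, name, precision)]
    argv += ["-i", input_txt, "-o", output_wav]
    return argv


if __name__ == "__main__":
    # Print a shell-ready command line (quoted per argument, works on Python 3.7).
    print(" ".join(shlex.quote(arg) for arg in build_command("test.txt", "test.wav")))
```

The returned list can also be passed directly to subprocess.run if you prefer to launch the demo from Python.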

OK, it spoke. Next, as a multi-line example, let's have it speak a quote from Steve Jobs. The parameters are the same, so the command is omitted.

Your time is limited, so don't waste it living someone else's life. Don't be trapped by dogma — which is living with the results of other people's thinking. Don't let the noise of others' opinions drown out your own inner voice. And most important, have the courage to follow your heart and intuition. They somehow already know what you truly want to become. Everything else is secondary.

It speaks smoothly.
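
Besides listening, you can inspect the output file from Python. This is a minimal sketch using the standard wave module; it assumes the demo wrote an uncompressed PCM WAV file (wave cannot read other encodings), and wav_info is a hypothetical helper name.

```python
import os
import wave


def wav_info(path):
    """Return basic properties of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),
            "sample_rate": rate,
            "duration_sec": frames / rate,
        }


if __name__ == "__main__":
    # test.wav is the output name used in this article.
    if os.path.exists("test.wav"):
        print(wav_info("test.wav"))
    else:
        print("test.wav not found; run the demo first")
```

A duration of zero or an error from wave.open would suggest that the synthesis did not complete.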