Voicing-Aware Parallel WaveGAN for High-Quality Speech Synthesis

Ryuichi Yamamoto, Eunwoo Song, Min-Jae Hwang

Jul 30, 2021

Submitted to IEEE Signal Processing Letters

Authors
Abstract
TTS samples
Acknowledgements

Authors

Ryuichi Yamamoto (LINE Corp.)
Min-Jae Hwang (Search Solutions Inc.)
Eunwoo Song (NAVER Corp.)

Abstract

This letter proposes a voicing-aware Parallel WaveGAN (VA-PWG) vocoder for a neural text-to-speech (TTS) system. To generate a high-quality speech waveform, it is important to reflect the distinct characteristics of voiced and unvoiced speech signals well. However, it is difficult for the conventional PWG model to accurately represent this condition, since the single unified architectures of the generator and discriminator are insufficient to capture those characteristics. In the proposed method, both the generator and discriminator are divided into their subnetworks to individually model the voicing state-dependent characteristics of a speech signal. In particular, a VA-generator consisting of two sub-WaveNets generates the harmonic and noise components of a speech signal by inputting pitch-dependent sine wave and Gaussian noise sources, respectively. Likewise, a VA-discriminator consisting of two sub-discriminators learns the distinct characteristics of harmonic and noise components by feeding the voiced and unvoiced waveforms, respectively. Subjective evaluation results verified the effectiveness of the proposed VA-PWG vocoder by achieving a 4.25 mean opinion score from a speaker-independent training scenario that was 11% higher than that of a conventional PWG vocoder.
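As a rough illustration of the two input sources and the voiced/unvoiced split mentioned in the abstract, the sketch below builds a pitch-dependent sine wave and a Gaussian noise signal from a frame-level F0 contour, then masks a waveform into voiced and unvoiced parts. It is not the authors' implementation. The function names (`make_sources`, `split_by_voicing`), the sampling rate, the hop size, the convention that F0 = 0 marks an unvoiced frame, the zero-valued sine in unvoiced regions, and the masking scheme are all assumptions made for this example.

```python
import math
import random


def make_sources(f0_frames, sample_rate=24000, hop_size=120, noise_std=1.0, seed=0):
    """Build sample-level sine and noise sources from a frame-level F0 contour.

    Assumptions (not from the paper): F0 is given in Hz per frame, 0.0 marks
    an unvoiced frame, and the sine source is set to zero where unvoiced.
    Returns (sine, noise, voiced), each with len(f0_frames) * hop_size samples.
    """
    rng = random.Random(seed)
    sine, noise, voiced = [], [], []
    phase = 0.0
    for f0 in f0_frames:
        is_voiced = f0 > 0.0
        for _ in range(hop_size):
            if is_voiced:
                # Accumulate phase so the sine stays continuous when F0 changes.
                phase = (phase + 2.0 * math.pi * f0 / sample_rate) % (2.0 * math.pi)
                sine.append(math.sin(phase))
            else:
                sine.append(0.0)
            noise.append(rng.gauss(0.0, noise_std))
            voiced.append(1.0 if is_voiced else 0.0)
    return sine, noise, voiced


def split_by_voicing(waveform, voiced):
    """Mask a waveform into voiced-only and unvoiced-only signals.

    This masking is one assumed way to prepare separate inputs for two
    sub-discriminators; the paper may segment the waveform differently.
    """
    voiced_part = [x * v for x, v in zip(waveform, voiced)]
    unvoiced_part = [x * (1.0 - v) for x, v in zip(waveform, voiced)]
    return voiced_part, unvoiced_part


# Toy contour: two unvoiced frames, two voiced frames, one unvoiced frame.
f0 = [0.0, 0.0, 120.0, 125.0, 0.0]
sine, noise, voiced = make_sources(f0)
voiced_part, unvoiced_part = split_by_voicing(noise, voiced)
```

In this sketch the sine source would feed the harmonic sub-network and the noise source the noise sub-network, while the two masked signals stand in for the inputs of the two sub-discriminators.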

TTS samples

M1 (male)

Sample 1: “鹿児島県で最大震度三を観測しています。”

Recording	WaveNet	PWG

VA-PWG-G	VA-PWG-D	VA-PWG-GD (proposed)

Sample 2: “葉加瀬太郎の情熱大陸です。”

Recording	WaveNet	PWG

VA-PWG-G	VA-PWG-D	VA-PWG-GD (proposed)

Sample 3: “それでうちの部は半分に減らされる。”

Recording	WaveNet	PWG

VA-PWG-G	VA-PWG-D	VA-PWG-GD (proposed)

M2 (male)

Sample 1: “ヨメの、レオンティーンさんですね。”

Recording	WaveNet	PWG

VA-PWG-G	VA-PWG-D	VA-PWG-GD (proposed)

Sample 2: “御予約は、二泊三日ですね。”

Recording	WaveNet	PWG

VA-PWG-G	VA-PWG-D	VA-PWG-GD (proposed)

Sample 3: “わたさちの、ローリーさんですね。”

Recording	WaveNet	PWG

VA-PWG-G	VA-PWG-D	VA-PWG-GD (proposed)

F1 (female)

Sample 1: “かわいそうに、助けてやらなくてはと、家に連れて帰りましたとさ。”

Recording	WaveNet	PWG

VA-PWG-G	VA-PWG-D	VA-PWG-GD (proposed)

Sample 2: “失礼のないよう、笑顔で挨拶して。”

Recording	WaveNet	PWG

VA-PWG-G	VA-PWG-D	VA-PWG-GD (proposed)

Sample 3: “照れていたので、ちょっと意外な気がしましたー。”

Recording	WaveNet	PWG

VA-PWG-G	VA-PWG-D	VA-PWG-GD (proposed)

F2 (female)

Sample 1: “そして目に留まったのは、お気に入りの居酒屋の前にあるゴミの山。”

Recording	WaveNet	PWG

VA-PWG-G	VA-PWG-D	VA-PWG-GD (proposed)

Sample 2: “今ひとつ、時間が足りず。”

Recording	WaveNet	PWG

VA-PWG-G	VA-PWG-D	VA-PWG-GD (proposed)

Sample 3: “実はこの道の先に、高い山があってね。”

Recording	WaveNet	PWG

VA-PWG-G	VA-PWG-D	VA-PWG-GD (proposed)

Acknowledgements

Work performed with nVoice, Clova Voice, Naver Corp.

Ryuichi Yamamoto

Engineer/Researcher

I am an engineer/researcher passionate about speech synthesis. I love to write code and enjoy open-source collaboration on GitHub. Please feel free to reach out on Twitter and GitHub.