Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators

Preprint: arXiv:2010.14151 (accepted to ICASSP 2021)

Authors

  • Ryuichi Yamamoto (LINE Corp.)
  • Eunwoo Song (NAVER Corp.)
  • Min-Jae Hwang (Search Solutions Inc.)
  • Jae-Min Kim (NAVER Corp.)

Abstract

This paper proposes voicing-aware conditional discriminators for Parallel WaveGAN-based waveform synthesis systems. In this framework, we adopt a projection-based conditioning method that can significantly improve the discriminator’s performance. Furthermore, the conventional discriminator is separated into two waveform discriminators for modeling voiced and unvoiced speech. As each discriminator learns the distinctive characteristics of the harmonic and noise components, respectively, the adversarial training process becomes more efficient, allowing the generator to produce more realistic speech waveforms. Subjective test results demonstrate the superiority of the proposed method over the conventional Parallel WaveGAN and WaveNet systems. In particular, our speaker-independently trained model within a FastSpeech 2 based text-to-speech framework achieves mean opinion scores of 4.20, 4.18, 4.21, and 4.31 for the four Japanese speakers, respectively.
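
To make the idea concrete, below is a minimal sketch (not the authors’ released code) of how a voicing-aware adversarial loss for the generator could be assembled from two discriminators: $D^{\mathrm{v}}$ scores voiced regions and $D^{\mathrm{uv}}$ scores unvoiced regions, selected by a sample-level V/UV mask. The least-squares GAN objective, the mask handling, and the function signatures are assumptions for illustration; the projection-based conditioning on auxiliary features is omitted for brevity.

import torch

def voicing_aware_adv_loss(d_v, d_uv, fake_wav, vuv_mask):
    """Generator-side adversarial loss with voicing-aware discriminators (sketch).

    d_v, d_uv : discriminators returning per-sample scores of shape (B, 1, T)
    fake_wav  : generated waveform, shape (B, 1, T)
    vuv_mask  : binary mask, 1 for voiced samples and 0 for unvoiced, shape (B, 1, T)
    """
    score_v = d_v(fake_wav)    # evaluated only on voiced segments
    score_uv = d_uv(fake_wav)  # evaluated only on unvoiced segments
    # Least-squares GAN loss, averaged over the samples each discriminator covers.
    n_v = vuv_mask.sum().clamp(min=1.0)
    n_uv = (1.0 - vuv_mask).sum().clamp(min=1.0)
    loss_v = ((score_v - 1.0) ** 2 * vuv_mask).sum() / n_v
    loss_uv = ((score_uv - 1.0) ** 2 * (1.0 - vuv_mask)).sum() / n_uv
    return loss_v + loss_uv

if __name__ == "__main__":
    # Smoke test with a dummy discriminator and random inputs.
    wav = torch.randn(2, 1, 16000)
    mask = (torch.rand(2, 1, 16000) > 0.5).float()
    dummy = torch.nn.Conv1d(1, 1, 3, padding=1)
    print(voicing_aware_adv_loss(dummy, dummy, wav, mask))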

Systems for comparison

System | Voiced segments | Unvoiced segments | Discriminator conditioning
--- | --- | --- | ---
S1-WaveNet [1] | - | - | -
S2-PWG [2] | - | - | -
S3-PWG-cGAN-D | - | - | Yes
S4-PWG-V/UV-D | $D^{\mathrm{v}}$ | $D^{\mathrm{v}}$ | Yes
S5-PWG-V/UV-D | $D^{\mathrm{uv}}$ | $D^{\mathrm{v}}$ | Yes
S6-PWG-V/UV-D | $D^{\mathrm{uv}}$ | $D^{\mathrm{uv}}$ | Yes
S7-PWG-V/UV-D (proposed) | $D^{\mathrm{v}}$ | $D^{\mathrm{uv}}$ | Yes
Recordings | - | - | -
  • $D^{\mathrm{v}}$: 1-D dilated CNN discriminator with a receptive field of 127 samples.
  • $D^{\mathrm{uv}}$: 1-D CNN discriminator with a receptive field of 13 samples.

PWG is short for Parallel WaveGAN. Systems S2-PWG and S3-PWG-cGAN-D used $D^{\mathrm{v}}$ as the primary discriminator. Note that all the Parallel WaveGAN systems used the same generator architecture and training configuration; they differed only in their discriminator settings.
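
As a rough illustration of how the two discriminators could reach the receptive-field sizes listed above, here is a sketch in PyTorch. Only the receptive-field targets (127 samples for $D^{\mathrm{v}}$, 13 samples for $D^{\mathrm{uv}}$) come from this page; the layer count, channel width, kernel size, and activations are assumptions, and the projection-based conditioning branch is omitted.

import torch.nn as nn

def make_discriminator(dilations, channels=64, kernel_size=3):
    """Stack of 1-D convolutions producing a per-sample real/fake score map."""
    layers, in_ch = [], 1
    for d in dilations:
        layers += [
            nn.Conv1d(in_ch, channels, kernel_size,
                      padding=(kernel_size - 1) // 2 * d, dilation=d),
            nn.LeakyReLU(0.2),
        ]
        in_ch = channels
    layers.append(nn.Conv1d(in_ch, 1, 1))  # 1x1 conv: per-sample score
    return nn.Sequential(*layers)

def receptive_field(dilations, kernel_size=3):
    # Each layer with kernel size k and dilation d widens the field by (k - 1) * d.
    return 1 + (kernel_size - 1) * sum(dilations)

# D^v: dilated convolutions (dilations 1..32) give a 127-sample receptive field.
d_v = make_discriminator([1, 2, 4, 8, 16, 32])
# D^uv: six non-dilated convolutions give a 13-sample receptive field.
d_uv = make_discriminator([1, 1, 1, 1, 1, 1])

assert receptive_field([1, 2, 4, 8, 16, 32]) == 127
assert receptive_field([1, 1, 1, 1, 1, 1]) == 13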

Analysis/synthesis

F1 (female)

Sample 1

Audio samples: Recording, S1-WaveNet, S2-PWG, S7-PWG-V/UV-D (ours), S3-PWG-cGAN-D, S4-PWG-V/UV-D, S5-PWG-V/UV-D, S6-PWG-V/UV-D

F2 (female)

Sample 1

Audio samples: Recording, S1-WaveNet, S2-PWG, S7-PWG-V/UV-D (ours), S3-PWG-cGAN-D, S4-PWG-V/UV-D, S5-PWG-V/UV-D, S6-PWG-V/UV-D

M1 (male)

Sample 1

Audio samples: Recording, S1-WaveNet, S2-PWG, S7-PWG-V/UV-D (ours), S3-PWG-cGAN-D, S4-PWG-V/UV-D, S5-PWG-V/UV-D, S6-PWG-V/UV-D

M2 (male)

Sample 1

Audio samples: Recording, S1-WaveNet, S2-PWG, S7-PWG-V/UV-D (ours), S3-PWG-cGAN-D, S4-PWG-V/UV-D, S5-PWG-V/UV-D, S6-PWG-V/UV-D

Text-to-speech

FastSpeech 2 [3] based acoustic models were used for the text-to-speech experiments.
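
For context, the text-to-speech samples below come from a two-stage pipeline: the FastSpeech 2 acoustic model predicts mel-spectrograms from text, and one of the vocoders above converts them to a waveform. The sketch below shows that handoff; `text_to_ids`, `fastspeech2`, and `vocoder` are hypothetical callables standing in for the actual models, not an API from this work.

import torch

@torch.no_grad()
def synthesize(text, text_to_ids, fastspeech2, vocoder):
    """Hypothetical two-stage TTS: text -> mel-spectrogram -> waveform."""
    phoneme_ids = text_to_ids(text)   # (1, L) token ids
    mel = fastspeech2(phoneme_ids)    # (1, n_mels, n_frames) acoustic features
    waveform = vocoder(mel)           # (1, 1, T) audio samples
    return waveform.squeeze()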

F1 (female)

Sample 1

Audio samples: Recording, FastSpeech 2 + WaveNet, FastSpeech 2 + PWG, FastSpeech 2 + PWG-V/UV-D (ours), FastSpeech 2 + PWG-cGAN-D

Sample 2

Audio samples: Recording, FastSpeech 2 + WaveNet, FastSpeech 2 + PWG, FastSpeech 2 + PWG-V/UV-D (ours), FastSpeech 2 + PWG-cGAN-D

Sample 3

Audio samples: Recording, FastSpeech 2 + WaveNet, FastSpeech 2 + PWG, FastSpeech 2 + PWG-V/UV-D (ours), FastSpeech 2 + PWG-cGAN-D

F2 (female)

Sample 1

Audio samples: Recording, FastSpeech 2 + WaveNet, FastSpeech 2 + PWG, FastSpeech 2 + PWG-V/UV-D (ours), FastSpeech 2 + PWG-cGAN-D

Sample 2

Audio samples: Recording, FastSpeech 2 + WaveNet, FastSpeech 2 + PWG, FastSpeech 2 + PWG-V/UV-D (ours), FastSpeech 2 + PWG-cGAN-D

Sample 3

Audio samples: Recording, FastSpeech 2 + WaveNet, FastSpeech 2 + PWG, FastSpeech 2 + PWG-V/UV-D (ours), FastSpeech 2 + PWG-cGAN-D

M1 (male)

Sample 1

Audio samples: Recording, FastSpeech 2 + WaveNet, FastSpeech 2 + PWG, FastSpeech 2 + PWG-V/UV-D (ours), FastSpeech 2 + PWG-cGAN-D

Sample 2

Audio samples: Recording, FastSpeech 2 + WaveNet, FastSpeech 2 + PWG, FastSpeech 2 + PWG-V/UV-D (ours), FastSpeech 2 + PWG-cGAN-D

Sample 3

Audio samples: Recording, FastSpeech 2 + WaveNet, FastSpeech 2 + PWG, FastSpeech 2 + PWG-V/UV-D (ours), FastSpeech 2 + PWG-cGAN-D

M2 (male)

Sample 1

Audio samples: Recording, FastSpeech 2 + WaveNet, FastSpeech 2 + PWG, FastSpeech 2 + PWG-V/UV-D (ours), FastSpeech 2 + PWG-cGAN-D

Sample 2

Audio samples: Recording, FastSpeech 2 + WaveNet, FastSpeech 2 + PWG, FastSpeech 2 + PWG-V/UV-D (ours), FastSpeech 2 + PWG-cGAN-D

Sample 3

Audio samples: Recording, FastSpeech 2 + WaveNet, FastSpeech 2 + PWG, FastSpeech 2 + PWG-V/UV-D (ours), FastSpeech 2 + PWG-cGAN-D

Bonus: Analysis/synthesis (English)

Samples from the CMU ARCTIC database are provided below. The models were trained on all six speakers (clb, slt, bdl, rms, jmk, and ksp) in a speaker-independent manner and were configured similarly to the experiments above.

clb (female)

Sample 1

Audio samples: Recording, PWG, PWG-cGAN-D, PWG-V/UV-D (ours)

Sample 2

Audio samples: Recording, PWG, PWG-cGAN-D, PWG-V/UV-D (ours)

Sample 3

Audio samples: Recording, PWG, PWG-cGAN-D, PWG-V/UV-D (ours)

slt (female)

Sample 1

Audio samples: Recording, PWG, PWG-cGAN-D, PWG-V/UV-D (ours)

Sample 2

Audio samples: Recording, PWG, PWG-cGAN-D, PWG-V/UV-D (ours)

Sample 3

Audio samples: Recording, PWG, PWG-cGAN-D, PWG-V/UV-D (ours)

bdl (male)

Sample 1

Audio samples: Recording, PWG, PWG-cGAN-D, PWG-V/UV-D (ours)

Sample 2

Audio samples: Recording, PWG, PWG-cGAN-D, PWG-V/UV-D (ours)

Sample 3

Audio samples: Recording, PWG, PWG-cGAN-D, PWG-V/UV-D (ours)

rms (male)

Sample 1

Audio samples: Recording, PWG, PWG-cGAN-D, PWG-V/UV-D (ours)

Sample 2

Audio samples: Recording, PWG, PWG-cGAN-D, PWG-V/UV-D (ours)

Sample 3

Audio samples: Recording, PWG, PWG-cGAN-D, PWG-V/UV-D (ours)

jmk (male)

Sample 1

Audio samples: Recording, PWG, PWG-cGAN-D, PWG-V/UV-D (ours)

Sample 2

Audio samples: Recording, PWG, PWG-cGAN-D, PWG-V/UV-D (ours)

Sample 3

Audio samples: Recording, PWG, PWG-cGAN-D, PWG-V/UV-D (ours)

ksp (male)

Sample 1

Audio samples: Recording, PWG, PWG-cGAN-D, PWG-V/UV-D (ours)

Sample 2

Audio samples: Recording, PWG, PWG-cGAN-D, PWG-V/UV-D (ours)

Sample 3

Audio samples: Recording, PWG, PWG-cGAN-D, PWG-V/UV-D (ours)

References

  • [1] W. Ping, K. Peng, and J. Chen, “ClariNet: Parallel wave generation in end-to-end text-to-speech,” in Proc. ICLR, 2019. arXiv.
  • [2] R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in Proc. ICASSP, 2020, pp. 6199–6203. arXiv.
  • [3] Y. Ren, C. Hu, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “FastSpeech 2: Fast and high-quality end-to-end text-to-speech,” arXiv preprint arXiv:2006.04558, 2020. arXiv.

Acknowledgements

Work performed with nVoice, Clova Voice, Naver Corp.

Citation

@inproceedings{yamamoto2020parallel,
  title={Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators},
  author={Ryuichi Yamamoto and Eunwoo Song and Min-Jae Hwang and Jae-Min Kim},
  booktitle={Proc. of ICASSP (in press)},
  year={2021},
}
Ryuichi Yamamoto
Engineer/Researcher

I am an engineer/researcher passionate about speech synthesis. I love to write code and enjoy open-source collaboration on GitHub. Please feel free to reach out on Twitter and GitHub.
