2023/02/27: Added error cases to address reviewer’s comments. See Error cases.
2022/11/27: Added samples of diffusion-based acoustic models. See Bonus samples.
2022/10/18: Created the demo page.
Authors
Ryuichi Yamamoto (LINE Corp., Nagoya University)
Reo Yoneyama (Nagoya University)
Tomoki Toda (Nagoya University)
Abstract
This paper describes the design of NNSVS, an open-source software for neural network-based singing voice synthesis research. NNSVS is inspired by Sinsy, an open-source pioneer in singing voice synthesis research, and provides many additional features such as multi-stream models, autoregressive fundamental frequency models, and neural vocoders. Furthermore, NNSVS provides extensive documentation and numerous scripts to build complete singing voice synthesis systems. Experimental results demonstrate that our best system significantly outperforms our reproduction of Sinsy and other baseline systems. The toolkit is available at https://github.com/nnsvs/nnsvs.
Systems
The following table summarizes the systems used in our experiments. All the models were trained on Namine Ritsu’s database [1].
| System | Acoustic Features | Multi-stream Architecture | Autoregressive streams | Vocoder |
|---|---|---|---|---|
| Sinsy [2] | MGC, LF0, VUV, BAP | No | - | hn-uSFGAN [5] |
| Sinsy (w/ pitch correction) | MGC, LF0, VUV, BAP | No | - | hn-uSFGAN |
| Sinsy (w/ vibrato modeling) | MGC, LF0, VUV, BAP | No | - | hn-uSFGAN |
| Muskits RNN [3] | MEL | No | - | HiFi-GAN [6] |
| DiffSinger [4] | MEL, LF0, VUV | Yes | - | hn-HiFi-GAN |
| NNSVS-Mel v1 | MEL, LF0, VUV | Yes | - | hn-uSFGAN |
| NNSVS-Mel v2 | MEL, LF0, VUV | Yes | LF0 | hn-uSFGAN |
| NNSVS-Mel v3 | MEL, LF0, VUV | Yes | MEL, LF0 | hn-uSFGAN |
| NNSVS-WORLD v0 [1] | MGC, LF0, VUV, BAP | No | - | WORLD |
| NNSVS-WORLD v1 | MGC, LF0, VUV, BAP | Yes | - | hn-uSFGAN |
| NNSVS-WORLD v2 | MGC, LF0, VUV, BAP | Yes | LF0 | hn-uSFGAN |
| NNSVS-WORLD v3 | MGC, LF0, VUV, BAP | Yes | MGC, LF0 | hn-uSFGAN |
| NNSVS-WORLD v4 | MGC, LF0, VUV, BAP | Yes | MGC, LF0, BAP | hn-uSFGAN |
| hn-HiFi-GAN (A/S) | MEL, LF0, VUV | - | - | hn-HiFi-GAN |
| hn-uSFGAN-Mel (A/S) | MEL, LF0, VUV | - | - | hn-uSFGAN |
| hn-uSFGAN-WORLD (A/S) | MGC, LF0, VUV, BAP | - | - | hn-uSFGAN |
Notes on baselines
The Muskits and DiffSinger baseline systems were trained with their official code (Muskits and DiffSinger).
hn-HiFi-GAN is the vocoder used by DiffSinger. Details of its implementation can be found here.
Sinsy systems are based on NNSVS’s implementation.
NNSVS-WORLD v0 uses a model trained with an earlier version of NNSVS (as of Nov. 2021) [1]. See here for details.
Notes on NNSVS systems
The code for reproducing our experiments is available here.
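As a rough illustration of what the multi-stream architecture operates on: the acoustic features listed in the table above are a concatenation of per-stream features (e.g., MGC, LF0, VUV, BAP), and each stream can be sliced out of the feature matrix and modeled separately. The stream dimensions below are hypothetical placeholders for illustration, not NNSVS's actual configuration:

```python
import numpy as np

# Hypothetical stream dimensions for illustration only; real values
# depend on the sampling rate and analysis settings.
STREAM_DIMS = {"mgc": 60, "lf0": 1, "vuv": 1, "bap": 5}

def split_streams(features, stream_dims):
    """Slice a concatenated acoustic feature matrix (frames x dims)
    into per-stream matrices, in the given order."""
    streams, start = {}, 0
    for name, dim in stream_dims.items():
        streams[name] = features[:, start:start + dim]
        start += dim
    if start != features.shape[1]:
        raise ValueError("stream dimensions do not sum to feature size")
    return streams

features = np.zeros((100, 67))  # 100 frames, 60 + 1 + 1 + 5 = 67 dims
streams = split_streams(features, STREAM_DIMS)
print({name: s.shape for name, s in streams.items()})
# {'mgc': (100, 60), 'lf0': (100, 1), 'vuv': (100, 1), 'bap': (100, 5)}
```

The "Autoregressive streams" column then indicates which of these slices are predicted with autoregressive feedback.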
Samples
The following samples are vocal-only. Mixed demos can be found here.
SVS
Samples generated from musical scores.
Sample 1: 1st color
Recording
Sinsy
Sinsy (with pitch correction)
Sinsy (with vibrato modeling)
Muskits RNN
DiffSinger
NNSVS-Mel v1
NNSVS-Mel v2
NNSVS-Mel v3
NNSVS-WORLD v0
NNSVS-WORLD v1
NNSVS-WORLD v2
NNSVS-WORLD v3
NNSVS-WORLD v4
Sample 2: ARROW
Recording
Sinsy
Sinsy (with pitch correction)
Sinsy (with vibrato modeling)
Muskits RNN
DiffSinger
NNSVS-Mel v1
NNSVS-Mel v2
NNSVS-Mel v3
NNSVS-WORLD v0
NNSVS-WORLD v1
NNSVS-WORLD v2
NNSVS-WORLD v3
NNSVS-WORLD v4
Sample 3: BC
Recording
Sinsy
Sinsy (with pitch correction)
Sinsy (with vibrato modeling)
Muskits RNN
DiffSinger
NNSVS-Mel v1
NNSVS-Mel v2
NNSVS-Mel v3
NNSVS-WORLD v0
NNSVS-WORLD v1
NNSVS-WORLD v2
NNSVS-WORLD v3
NNSVS-WORLD v4
Sample 4: Close to you
Recording
Sinsy
Sinsy (with pitch correction)
Sinsy (with vibrato modeling)
Muskits RNN
DiffSinger
NNSVS-Mel v1
NNSVS-Mel v2
NNSVS-Mel v3
NNSVS-WORLD v0
NNSVS-WORLD v1
NNSVS-WORLD v2
NNSVS-WORLD v3
NNSVS-WORLD v4
Sample 5: ERROR
Recording
Sinsy
Sinsy (with pitch correction)
Sinsy (with vibrato modeling)
Muskits RNN
DiffSinger
NNSVS-Mel v1
NNSVS-Mel v2
NNSVS-Mel v3
NNSVS-WORLD v0
NNSVS-WORLD v1
NNSVS-WORLD v2
NNSVS-WORLD v3
NNSVS-WORLD v4
A/S
Samples generated from extracted features (i.e., analysis-by-synthesis).
Sample 1: 1st color
Recording
hn-HiFi-GAN
hn-uSFGAN-Mel
hn-uSFGAN-WORLD
Sample 2: ARROW
Recording
hn-HiFi-GAN
hn-uSFGAN-Mel
hn-uSFGAN-WORLD
Sample 3: BC
Recording
hn-HiFi-GAN
hn-uSFGAN-Mel
hn-uSFGAN-WORLD
Sample 4: Close to you
Recording
hn-HiFi-GAN
hn-uSFGAN-Mel
hn-uSFGAN-WORLD
Sample 5: ERROR
Recording
hn-HiFi-GAN
hn-uSFGAN-Mel
hn-uSFGAN-WORLD
Mixed demo
System: NNSVS-WORLD v4
Sample 1
ERROR (from test data)
Sample 2
ARROW (from test data)
Sample 3
WAVE (from training data)
Bonus samples
We integrated a diffusion-based acoustic model [4] to improve the naturalness of the synthetic voice.
The following table summarizes the systems for bonus samples.
| System | Acoustic Features | Autoregressive streams | Diffusion streams | Vocoder |
|---|---|---|---|---|
| NNSVS-WORLD v4* | MGC, LF0, VUV, BAP | LF0, MGC, BAP | - | SiFi-GAN [7] |
| NNSVS-Mel v5 | MEL, LF0, VUV | LF0 | MEL | SiFi-GAN |
| NNSVS-WORLD v5 | MGC, LF0, VUV, BAP | LF0 | MGC, BAP | SiFi-GAN |
NNSVS-WORLD v4* is the best model (as of 2022/09) reported in our paper, with the following two changes:
- the sampling rate was changed from 24 kHz to 48 kHz
- the vocoder was changed from hn-uSFGAN to SiFi-GAN [7]
NNSVS-Mel v5 and NNSVS-WORLD v5 are systems using diffusion-based multi-stream acoustic models.
Note that we used a 48 kHz sampling rate for these additional experiments.
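For readers unfamiliar with the LF0/VUV decomposition used throughout the tables: a raw F0 contour (with zeros at unvoiced frames) is typically split into a continuous log-F0 stream, with unvoiced gaps interpolated, plus a binary voiced/unvoiced flag. A minimal numpy sketch on synthetic data (the actual NNSVS feature extraction uses WORLD and differs in detail):

```python
import numpy as np

def f0_to_lf0_vuv(f0):
    """Split a raw F0 contour (0 = unvoiced) into LF0 and VUV streams.

    LF0 is log-F0 with unvoiced gaps linearly interpolated so the
    stream is continuous; VUV is a binary voiced/unvoiced flag.
    """
    f0 = np.asarray(f0, dtype=np.float64)
    vuv = (f0 > 0).astype(np.float64)
    voiced = f0 > 0
    lf0 = np.zeros_like(f0)
    lf0[voiced] = np.log(f0[voiced])
    # Interpolate over unvoiced regions using the voiced frames as anchors.
    idx = np.nonzero(voiced)[0]
    lf0 = np.interp(np.arange(len(f0)), idx, lf0[idx])
    return lf0, vuv

f0 = [0.0, 0.0, 220.0, 225.0, 0.0, 230.0, 0.0]
lf0, vuv = f0_to_lf0_vuv(f0)
print(vuv)                       # [0. 0. 1. 1. 0. 1. 0.]
print(np.round(np.exp(lf0), 1))  # unvoiced gaps filled by interpolation
```

At synthesis time the VUV flag decides where the interpolated LF0 values are actually used as voiced pitch.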
As shown in the figure (blue: discontinuous F0; green: unstable vibrato), the pitch contour generated by DiffSinger is sometimes unstable.
Acknowledgments
This work was partly supported by JST CREST Grant Number JPMJCR19A3.
Appendix
Note pitch distribution
We selected test songs to cover a wide range of note pitches. Their distributions are shown in the following figure.
Pitch is presented as MIDI note number, where A4 (69) corresponds to 440 Hz.
The lowest and highest notes of the test songs were D#4 (155.6 Hz) and A5 (880 Hz), whereas those of the training data were D#3 (146.8 Hz) and B5 (987.8 Hz).
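The note-number-to-frequency mapping used above follows the standard equal-temperament MIDI convention, f = 440 · 2^((n − 69)/12); a one-line helper for reference (standard MIDI math, not NNSVS-specific):

```python
def midi_to_hz(note):
    """Convert a MIDI note number to frequency in Hz (A4 = 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)

print(midi_to_hz(69))  # 440.0 (A4)
print(midi_to_hz(81))  # 880.0 (A5)
```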
References
[2] Y. Hono, K. Hashimoto, K. Oura, et al., “Sinsy: A deep neural network-based singing voice synthesis system,” IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 29, pp. 2803–2815, 2021.
[3] J. Shi, S. Guo, T. Qian, et al., “Muskits: An end-to-end music processing toolkit for singing voice synthesis,” in Proc. Interspeech, 2022, pp. 4277–4281.
[4] J. Liu, C. Li, Y. Ren, et al., “DiffSinger: Singing voice synthesis via shallow diffusion mechanism,” in Proc. AAAI, vol. 36, no. 10, pp. 11020–11028, 2022.
[5] R. Yoneyama, Y.-C. Wu, and T. Toda, “Unified Source-Filter GAN with harmonic-plus-noise source excitation generation,” in Proc. Interspeech, 2022, pp. 848–852.
[6] J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Proc. NeurIPS, vol. 33, pp. 17022–17033, 2020.
[7] R. Yoneyama, Y.-C. Wu, and T. Toda, “Source-Filter HiFi-GAN: Fast and pitch controllable high-fidelity neural vocoder,” arXiv preprint arXiv:2210.15533, 2022.
[8] Y. Wang, X. Wang, P. Zhu, et al., “Opencpop: A high-quality open source Chinese popular song corpus for singing voice synthesis,” arXiv preprint arXiv:2201.07429, 2022.
I am an engineer/researcher passionate about speech synthesis. I love writing code and enjoy open-source collaboration on GitHub. Please feel free to reach out on Twitter or GitHub.