An open source implementation of WaveNet vocoder
This page provides audio samples for the open source implementation of the WaveNet (WN) vocoder.
Text-to-speech samples can be found in the last section.
WN conditioned on mel-spectrogram (16-bit linear PCM, 22.05kHz)
WN conditioned on mel-spectrogram (8-bit mu-law, 16kHz)
WN conditioned on mel-spectrogram and speaker-embedding (16-bit linear PCM, 16kHz)
Tacotron2: WN-based text-to-speech (New!)
WN conditioned on mel-spectrogram (16-bit linear PCM, 22.05kHz)
Samples from a model trained for over 400k steps.
Left: generated, Right: ground truth
Data: LJSpeech (12522 for training, 578 for testing)
Input type: 16-bit linear PCM
Sampling frequency: 22.05kHz
Local conditioning: 80-dim mel-spectrogram
Hop size: 256
Global conditioning: N/A
Total layers: 24
Num cycles: 4
Residual / Gate / Skip-out channels: 512 / 512 / 256
Receptive field (samples / ms): 505 / 22.9
Number of mixtures: 10
Number of upsampling layers: 4
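The receptive field above follows directly from the dilation pattern: 24 layers in 4 cycles means the dilations 1, 2, 4, 8, 16, 32 repeat four times, and a convolution kernel size of 3 (an assumption, but the only value consistent with the 505-sample figure in the table) gives 505 samples, or about 22.9 ms at 22.05kHz. A quick sketch of the arithmetic:

```python
def receptive_field(total_layers, num_cycles, kernel_size=3):
    """Receptive field, in samples, of a stack of dilated causal convolutions."""
    layers_per_cycle = total_layers // num_cycles
    dilations = [2 ** (i % layers_per_cycle) for i in range(total_layers)]
    return (kernel_size - 1) * sum(dilations) + 1

samples = receptive_field(24, 4)      # 505 samples
ms = samples / 22050 * 1000           # ~22.9 ms at 22.05kHz
print(samples, round(ms, 1))
```

The same formula reproduces the 1021-sample / 63.8 ms receptive field of the 16-layer, 2-cycle model in the next section.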
WN conditioned on mel-spectrogram (8-bit mu-law, 16kHz)
Samples from a model trained for 100k steps (~22 hours)
Left: generated, Right: (mu-law encoded) ground truth
Data: CMU ARCTIC (clb) (1183 for training, 50 for testing)
Input type: 8-bit mu-law encoded one-hot vector
Sampling frequency: 16kHz
Local conditioning: 80-dim mel-spectrogram
Hop size: 256
Global conditioning: N/A
Total layers: 16
Num cycles: 2
Residual / Gate / Skip-out channels: 512 / 512 / 256
Receptive field (samples / ms): 1021 / 63.8
Number of upsampling layers: N/A
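The 8-bit mu-law input above means each 16kHz sample is companded and quantized to one of 256 classes before being one-hot encoded for the WaveNet input layer. A minimal sketch of mu-law quantization with mu = 255 (standard companding; not necessarily the repository's exact implementation):

```python
import math

def mulaw_encode(x, mu=255):
    """Compress a sample in [-1, 1] and quantize it to an integer in [0, mu]."""
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((y + 1) / 2 * mu + 0.5)  # map [-1, 1] onto {0, ..., 255}

def onehot(c, mu=255):
    """One-hot vector of length mu + 1, as fed to the WaveNet input layer."""
    v = [0.0] * (mu + 1)
    v[c] = 1.0
    return v

print(mulaw_encode(-1.0), mulaw_encode(0.0), mulaw_encode(1.0))
```

Mu-law companding spends more quantization levels near zero amplitude, which is why an 8-bit categorical output remains intelligible for speech.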
WN conditioned on mel-spectrogram and speaker-embedding (16-bit linear PCM, 16kHz)
Samples from a model trained for over 1000k steps
Left: generated, Right: ground truth
Speakers: awb, bdl, clb, jmk, ksp, rms, slt (four sample pairs each)
Data: CMU ARCTIC (7580 for training, 350 for testing)
Input type: 16-bit linear PCM
Local conditioning: 80-dim mel-spectrogram
Hop size: 256
Global conditioning: 16-dim speaker embedding
Total layers: 24
Num cycles: 4
Residual / Gate / Skip-out channels: 512 / 512 / 256
Receptive field (samples / ms): 505 / 31.6
Number of mixtures: 10
Number of upsampling layers: 4
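Global conditioning here means a 16-dim embedding is looked up once per utterance from the speaker ID and broadcast across every time step, then projected and added into each residual layer's pre-activations. A rough NumPy sketch of the lookup-and-broadcast step (the projection matrix and its scale are hypothetical placeholders, not values from the trained model):

```python
import numpy as np

num_speakers, embed_dim, T = 7, 16, 320   # 7 ARCTIC speakers, 16-dim embedding
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(num_speakers, embed_dim))

speaker_id = 2                            # e.g. the index assigned to "clb"
g = embedding_table[speaker_id]           # (16,)
g_bct = np.broadcast_to(g[:, None], (embed_dim, T))  # same vector at every step

# Inside each residual layer the conditioning is typically projected to the
# layer's channel count and added to the filter/gate pre-activations:
residual_channels = 512
W_g = rng.normal(size=(residual_channels, embed_dim)) * 0.01  # hypothetical
cond = W_g @ g_bct                        # (512, T), added before tanh/sigmoid
print(cond.shape)
```

Because the embedding is constant over time, a single 16-dim vector is enough to switch the vocoder between the seven ARCTIC voices.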
Tacotron2: WN-based text-to-speech
Synthesized samples for the following input texts:
Scientists at the CERN laboratory say they have discovered a new particle.
There’s a way to measure the acute emotional intelligence that has never gone out of style.
President Trump met with other leaders at the Group of 20 conference.
The Senate’s bill to repeal and replace the Affordable Care Act is now imperiled.
Generative adversarial network or variational auto-encoder.
Basilar membrane and otolaryngology are not auto-correlations.
He has read the whole thing.
He reads books.
Don’t desert me here in the desert!
He thought it was time to present the present.
Thisss isrealy awhsome.
Punctuation sensitivity, is working.
Punctuation sensitivity is working.
The buses aren’t the problem, they actually provide a solution.
The buses aren’t the PROBLEM, they actually provide a SOLUTION.
The quick brown fox jumps over the lazy dog.
Does the quick brown fox jump over the lazy dog?
Peter Piper picked a peck of pickled peppers. How many pickled peppers did Peter Piper pick?
She sells sea-shells on the sea-shore. The shells she sells are sea-shells I’m sure.
The blue lagoon is a nineteen eighty American romance adventure film.
On-line demo
A demonstration notebook that can be run on Google Colab is available at Tacotron2: WaveNet-based text-to-speech demo.
References
Aaron van den Oord, Sander Dieleman, Heiga Zen, et al, “WaveNet: A Generative Model for Raw Audio”, arXiv:1609.03499, Sep 2016.
Aaron van den Oord, Yazhe Li, Igor Babuschkin, et al, “Parallel WaveNet: Fast High-Fidelity Speech Synthesis”, arXiv:1711.10433, Nov 2017.
Akira Tamamori, et al, “Speaker-dependent WaveNet vocoder”, Proc. Interspeech, 2017.
Jonathan Shen, Ruoming Pang, Ron J. Weiss, et al, “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions”, arXiv:1712.05884, Dec 2017.
Wei Ping, Kainan Peng, Andrew Gibiansky, et al, “Deep Voice 3: 2000-Speaker Neural Text-to-Speech”, arXiv:1710.07654, Oct. 2017.