An open source implementation of WaveNet vocoder


This page provides audio samples for the open source implementation of the WaveNet (WN) vocoder. Text-to-speech samples are found at the last section.

WN conditioned on mel-spectrogram (16-bit linear PCM, 22.5kHz)

key value
Data LJSpeech (12522 for training, 578 for testing)
Input type 16-bit linear PCM
Sampling frequency 22.5kHz
Local conditioning 80-dim mel-spectrogram
Hop size 256
Global conditioning N/A
Total layers 24
Num cycles 4
Residual / Gate / Skip-out channels 512 / 512 / 256
Receptive field (samples / ms) 505 / 22.9
Numer of mixtures 10
Number of upsampling layers 4

WN conditioned on mel-spectrogram (8-bit mu-law, 16kHz)

key value
Data CMU ARCTIC (clb) (1183 for training, 50 for testing)
Input type 8-bit mu-law encoded one-hot vector
Sampling frequency 16kHz
Local conditioning 80-dim mel-spectrogram
Hop size 256
Global conditioning N/A
Total layers 16
Num cycles 2
Residual / Gate / Skip-out channels 512 / 512 / 256
Receptive field (samples / ms) 1021 / 63.8
Number of upsampling layers N/A

WN conditioned on mel-spectrogram and speaker-embedding (16-bit linear PCM, 16kHz)

awb

bdl

clb

jmk

ksp

rms

slt

key value
Data CMU ARCTIC (7580 for training, 350 for testing)
Input type 16-bit linear PCM
Local conditioning 80-dim mel-spectrogram
Hop size 256
Global conditioning 16-dim speaker embedding 1
Total layers 24
Num cycles 4
Residual / Gate / Skip-out channels 512 / 512 / 256
Receptive field (samples / ms) 505 / 22.9
Numer of mixtures 10
Number of upsampling layers 4

Tacotron2: WN-based text-to-speech

Scientists at the CERN laboratory say they have discovered a new particle.

There’s a way to measure the acute emotional intelligence that has never gone out of style.

President Trump met with other leaders at the Group of 20 conference.

The Senate’s bill to repeal and replace the Affordable Care Act is now imperiled.

Generative adversarial network or variational auto-encoder.

Basilar membrane and otolaryngology are not auto-correlations.

He has read the whole thing.

He reads books.

Don’t desert me here in the desert!

He thought it was time to present the present.

Thisss isrealy awhsome.

Punctuation sensitivity, is working.

Punctuation sensitivity is working.

The buses aren’t the problem, they actually provide a solution.

The buses aren’t the PROBLEM, they actually provide a SOLUTION.

The quick brown fox jumps over the lazy dog.

Does the quick brown fox jump over the lazy dog?

Peter Piper picked a peck of pickled peppers. How many pickled peppers did Peter Piper pick?

She sells sea-shells on the sea-shore. The shells she sells are sea-shells I’m sure.

The blue lagoon is a nineteen eighty American romance adventure film.

On-line demo

A demonstration notebook supposed to be run on Google colab can be found at Tacotron2: WaveNet-basd text-to-speech demo.

References


  1. Note that mel-spectrogram used in local conditioning is dependent on speaker characteristics, so we cannot simply change the speaker identity of the generated audio samples using the model. It should work without speaker embedding, but it might have helped training speed. [return]