An open source implementation of WaveNet vocoder
This page provides audio samples for the open source implementation of the WaveNet (WN) vocoder.
Text-to-speech samples can be found in the last section.
WN conditioned on mel-spectrogram (16-bit linear PCM, 22.05kHz)
WN conditioned on mel-spectrogram (8-bit mu-law, 16kHz)
WN conditioned on mel-spectrogram and speaker-embedding (16-bit linear PCM, 16kHz)
Tacotron2: WN-based text-to-speech (New!)
WN conditioned on mel-spectrogram (16-bit linear PCM, 22.05kHz)
Samples from a model trained for over 400k steps.
Left: generated, Right: ground truth
Data: LJSpeech (12522 for training, 578 for testing)
Input type: 16-bit linear PCM
Sampling frequency: 22.05kHz
Local conditioning: 80-dim mel-spectrogram
Hop size: 256
Global conditioning: N/A
Total layers: 24
Num cycles: 4
Residual / Gate / Skip-out channels: 512 / 512 / 256
Receptive field (samples / ms): 505 / 22.9
Number of mixtures: 10
Number of upsampling layers: 4
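The receptive field above follows directly from the dilation pattern: 24 layers in 4 cycles means the dilations 1, 2, 4, 8, 16, 32 repeat four times, and a convolution kernel size of 3 (an assumption, but the only value consistent with the 505-sample figure in the table) gives 505 samples, or about 22.9 ms at 22.05kHz. A quick sketch of the arithmetic:

```python
def receptive_field(total_layers, num_cycles, kernel_size=3):
    """Receptive field, in samples, of a stack of dilated causal convolutions."""
    layers_per_cycle = total_layers // num_cycles
    dilations = [2 ** (i % layers_per_cycle) for i in range(total_layers)]
    return (kernel_size - 1) * sum(dilations) + 1

samples = receptive_field(24, 4)      # 505 samples
ms = samples / 22050 * 1000           # ~22.9 ms at 22.05kHz
print(samples, round(ms, 1))
```

The same formula reproduces the 1021-sample / 63.8 ms receptive field of the 16-layer, 2-cycle model in the next section.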
WN conditioned on mel-spectrogram (8-bit mu-law, 16kHz)
Samples from a model trained for 100k steps (~22 hours)
Left: generated, Right: (mu-law encoded) ground truth
Data: CMU ARCTIC (clb) (1183 for training, 50 for testing)
Input type: 8-bit mu-law encoded one-hot vector
Sampling frequency: 16kHz
Local conditioning: 80-dim mel-spectrogram
Hop size: 256
Global conditioning: N/A
Total layers: 16
Num cycles: 2
Residual / Gate / Skip-out channels: 512 / 512 / 256
Receptive field (samples / ms): 1021 / 63.8
Number of upsampling layers: N/A
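The 8-bit mu-law input above means each 16kHz sample is companded and quantized to one of 256 classes before being one-hot encoded for the WaveNet input layer. A minimal sketch of mu-law quantization with mu = 255 (standard companding; not necessarily the repository's exact implementation):

```python
import math

def mulaw_encode(x, mu=255):
    """Compress a sample in [-1, 1] and quantize it to an integer in [0, mu]."""
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((y + 1) / 2 * mu + 0.5)  # map [-1, 1] onto {0, ..., 255}

def onehot(c, mu=255):
    """One-hot vector of length mu + 1, as fed to the WaveNet input layer."""
    v = [0.0] * (mu + 1)
    v[c] = 1.0
    return v

print(mulaw_encode(-1.0), mulaw_encode(0.0), mulaw_encode(1.0))
```

Mu-law companding spends more quantization levels near zero amplitude, which is why an 8-bit categorical output remains intelligible for speech.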
WN conditioned on mel-spectrogram and speaker-embedding (16-bit linear PCM, 16kHz)
Samples from a model trained for over 1000k steps
Left: generated, Right: ground truth
Speakers: awb, bdl, clb, jmk, ksp, rms, slt (four sample pairs each)
Data: CMU ARCTIC (7580 for training, 350 for testing)
Input type: 16-bit linear PCM
Local conditioning: 80-dim mel-spectrogram
Hop size: 256
Global conditioning: 16-dim speaker embedding
Total layers: 24
Num cycles: 4
Residual / Gate / Skip-out channels: 512 / 512 / 256
Receptive field (samples / ms): 505 / 31.6
Number of mixtures: 10
Number of upsampling layers: 4
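Global conditioning here means a 16-dim embedding is looked up once per utterance from the speaker ID and broadcast across every time step, then projected and added into each residual layer's pre-activations. A rough NumPy sketch of the lookup-and-broadcast step (the projection matrix and its scale are hypothetical placeholders, not values from the trained model):

```python
import numpy as np

num_speakers, embed_dim, T = 7, 16, 320   # 7 ARCTIC speakers, 16-dim embedding
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(num_speakers, embed_dim))

speaker_id = 2                            # e.g. the index assigned to "clb"
g = embedding_table[speaker_id]           # (16,)
g_bct = np.broadcast_to(g[:, None], (embed_dim, T))  # same vector at every step

# Inside each residual layer the conditioning is typically projected to the
# layer's channel count and added to the filter/gate pre-activations:
residual_channels = 512
W_g = rng.normal(size=(residual_channels, embed_dim)) * 0.01  # hypothetical
cond = W_g @ g_bct                        # (512, T), added before tanh/sigmoid
print(cond.shape)
```

Because the embedding is constant over time, a single 16-dim vector is enough to switch the vocoder between the seven ARCTIC voices.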
Tacotron2: WN-based text-to-speech
Synthesized samples for the following input texts:
Scientists at the CERN laboratory say they have discovered a new particle.
There’s a way to measure the acute emotional intelligence that has never gone out of style.
President Trump met with other leaders at the Group of 20 conference.
The Senate’s bill to repeal and replace the Affordable Care Act is now imperiled.
Generative adversarial network or variational auto-encoder.
Basilar membrane and otolaryngology are not auto-correlations.
He has read the whole thing.
He reads books.
Don’t desert me here in the desert!
He thought it was time to present the present.
Thisss isrealy awhsome.
Punctuation sensitivity, is working.
Punctuation sensitivity is working.
The buses aren’t the problem, they actually provide a solution.
The buses aren’t the PROBLEM, they actually provide a SOLUTION.
The quick brown fox jumps over the lazy dog.
Does the quick brown fox jump over the lazy dog?
Peter Piper picked a peck of pickled peppers. How many pickled peppers did Peter Piper pick?
She sells sea-shells on the sea-shore. The shells she sells are sea-shells I’m sure.
The blue lagoon is a nineteen eighty American romance adventure film.
On-line demo
A demonstration notebook that can be run on Google Colab is available at Tacotron2: WaveNet-based text-to-speech demo.
References
Aaron van den Oord, Sander Dieleman, Heiga Zen, et al, “WaveNet: A Generative Model for Raw Audio”, arXiv:1609.03499, Sep 2016.
Aaron van den Oord, Yazhe Li, Igor Babuschkin, et al, “Parallel WaveNet: Fast High-Fidelity Speech Synthesis”, arXiv:1711.10433, Nov 2017.
Akira Tamamori, et al, “Speaker-dependent WaveNet vocoder”, Proc. Interspeech, 2017.
Jonathan Shen, Ruoming Pang, Ron J. Weiss, et al, “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions”, arXiv:1712.05884, Dec 2017.
Wei Ping, Kainan Peng, Andrew Gibiansky, et al, “Deep Voice 3: 2000-Speaker Neural Text-to-Speech”, arXiv:1710.07654, Oct. 2017.