An open source implementation of Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning

Github: https://github.com/r9y9/deepvoice3_pytorch

This page provides audio samples for the open source implementation of Deep Voice 3. Samples from single speaker and multi-speaker models follow.

Single speaker

Samples from a model trained for 210k steps (~12 hours)¹ on the LJSpeech dataset.
Pretrained model: link
Git commit: 4357976

Scientists at the CERN laboratory say they have discovered a new particle.

(74 chars, 13 words)

There’s a way to measure the acute emotional intelligence that has never gone out of style.

(91 chars, 18 words)

President Trump met with other leaders at the Group of 20 conference.

(69 chars, 13 words)

The Senate’s bill to repeal and replace the Affordable Care Act is now imperiled.

(81 chars, 16 words)

Generative adversarial network or variational auto-encoder.

(59 chars, 7 words)

The buses aren’t the problem, they actually provide a solution.

(63 chars, 13 words)

Multi-speaker

Samples from a model trained for 600k steps (~22 hours) on the VCTK dataset (108 speakers)
Pretrained model: link
Git commit: 0421749

Same text with 12 different speakers

Some have accepted this as a miracle without any physical explanation

(69 chars, 11 words)

225, 23, F, English, Southern, England (ID, AGE, GENDER, ACCENTS, REGION)

226, 22, M, English, Surrey

227, 38, M, English, Cumbria

228, 22, F, English, Southern England

229, 23, F, English, Southern England

230, 22, F, English, Stockton-on-tees

231, 23, F, English, Southern England

232, 23, M, English, Southern England

233, 23, F, English, Staffordshire

234, 22, F, Scottish, West Dumfries

236, 23, F, English, Manchester

237, 22, M, Scottish, Fife

Five unknown texts with two (male/female) speakers with attention plot

Male (292, 23, M, NorthernIrish, Belfast)
Female (288, 22, F, Irish, Dublin)

Scientists at the CERN laboratory say they have discovered a new particle.

(74 chars, 13 words)

There’s a way to measure the acute emotional intelligence that has never gone out of style.

(91 chars, 18 words)

President Trump met with other leaders at the Group of 20 conference.

(69 chars, 13 words)

The Senate’s bill to repeal and replace the Affordable Care Act is now imperiled.

(81 chars, 16 words)

Generative adversarial network or variational auto-encoder.

(59 chars, 7 words)

The buses aren’t the problem, they actually provide a solution.

(63 chars, 13 words)

Single speaker (arXiv:1710.08969 [cs.SD])

This is not the result of DeepVoice3, but there’s a very similar approach I also implemented.

Samples from a model trained for 585k steps on the LJSpeech dataset.
Pretrained model: link
Git commit: ba59dc7

Scientists at the CERN laboratory say they have discovered a new particle.

(74 chars, 13 words)

There’s a way to measure the acute emotional intelligence that has never gone out of style.

(91 chars, 18 words)

President Trump met with other leaders at the Group of 20 conference.

(69 chars, 13 words)

The Senate’s bill to repeal and replace the Affordable Care Act is now imperiled.

(81 chars, 16 words)

Generative adversarial network or variational auto-encoder.

(59 chars, 7 words)

The buses aren’t the problem, they actually provide a solution.

(63 chars, 13 words)

References

I’m afraid I don’t remember correctly, I may have trained a bit more. ^[return]