An open source implementation of Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning
This page provides audio samples for the open source implementation of Deep Voice 3. Samples from single-speaker and multi-speaker models follow.
Single speaker
Scientists at the CERN laboratory say they have discovered a new particle.
(74 chars, 13 words)
There’s a way to measure the acute emotional intelligence that has never gone out of style.
(91 chars, 18 words)
President Trump met with other leaders at the Group of 20 conference.
(69 chars, 13 words)
The Senate’s bill to repeal and replace the Affordable Care Act is now imperiled.
(81 chars, 16 words)
Generative adversarial network or variational auto-encoder.
(59 chars, 7 words)
The buses aren’t the problem, they actually provide a solution.
(63 chars, 13 words)
Multi-speaker
- Samples from a model trained for 600k steps (~22 hours) on the VCTK dataset (108 speakers)
 
- Pretrained model: link
 
- Git commit: 0421749
 
Same text with 12 different speakers
Some have accepted this as a miracle without any physical explanation
(69 chars, 11 words)
(ID, AGE, GENDER, ACCENTS, REGION)
225, 23, F, English, Southern England
226, 22, M, English, Surrey
227, 38, M, English, Cumbria
228, 22, F, English, Southern England
229, 23, F, English, Southern England
230, 22, F, English, Stockton-on-tees
231, 23, F, English, Southern England
232, 23, M, English, Southern England
233, 23, F, English, Staffordshire
234, 22, F, Scottish, West Dumfries
236, 23, F, English, Manchester
237, 22, M, Scottish, Fife
Five unknown texts read by two speakers (male and female), with attention plots
- Male (292, 23, M, NorthernIrish, Belfast)
 
- Female (288, 22, F, Irish, Dublin)
 
Scientists at the CERN laboratory say they have discovered a new particle.
(74 chars, 13 words)
There’s a way to measure the acute emotional intelligence that has never gone out of style.
(91 chars, 18 words)
President Trump met with other leaders at the Group of 20 conference.
(69 chars, 13 words)
The Senate’s bill to repeal and replace the Affordable Care Act is now imperiled.
(81 chars, 16 words)
Generative adversarial network or variational auto-encoder.
(59 chars, 7 words)
The buses aren’t the problem, they actually provide a solution.
(63 chars, 13 words)
The following samples are not from Deep Voice 3; they come from a very similar approach that I also implemented.
Scientists at the CERN laboratory say they have discovered a new particle.
(74 chars, 13 words)
There’s a way to measure the acute emotional intelligence that has never gone out of style.
(91 chars, 18 words)
President Trump met with other leaders at the Group of 20 conference.
(69 chars, 13 words)
The Senate’s bill to repeal and replace the Affordable Care Act is now imperiled.
(81 chars, 16 words)
Generative adversarial network or variational auto-encoder.
(59 chars, 7 words)
The buses aren’t the problem, they actually provide a solution.
(63 chars, 13 words)
References
- Wei Ping, Kainan Peng, Andrew Gibiansky, et al., “Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning”, arXiv:1710.07654, Oct. 2017.
 
- Hideyuki Tachibana, Katsuya Uenoyama, Shunsuke Aihara, “Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention”, arXiv:1710.08969, Oct. 2017.