ttslearn.contrib
Modules for advanced implementations.
TTS
class ttslearn.contrib.tacotron2_pwg.Tacotron2PWGTTS(model_dir=None, device='cpu')
Fast Tacotron 2 based text-to-speech with Parallel WaveGAN
The WaveNet vocoder in Tacotron 2 is replaced with Parallel WaveGAN for fast, real-time inference. Both single-speaker and multi-speaker Tacotron models are supported.
Parameters
model_dir (str) – model directory; if None, a default pretrained model is used
device (str) – device for inference (e.g., "cpu" or "cuda")
Examples
Single-speaker TTS
>>> from ttslearn.contrib import Tacotron2PWGTTS
>>> engine = Tacotron2PWGTTS()
>>> wav, sr = engine.tts("発展的な音声合成です!")
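The returned waveform is a NumPy array sampled at sr Hz. Continuing the example above, a minimal sketch of writing it to a WAV file (assuming scipy is installed; the dtype check is an assumption that guards against the engine already returning 16-bit samples):
>>> import numpy as np
>>> from scipy.io import wavfile
>>> if wav.dtype != np.int16:  # assumption: float waveform in [-1.0, 1.0]
...     wav = (np.clip(wav, -1.0, 1.0) * 32767).astype(np.int16)
>>> wavfile.write("out.wav", sr, wav)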
Multi-speaker TTS
>>> from ttslearn.contrib import Tacotron2PWGTTS
>>> from ttslearn.pretrained import retrieve_pretrained_model
>>> model_dir = retrieve_pretrained_model("multspk_tacotron2_pwg_jvs24k")
>>> engine = Tacotron2PWGTTS(model_dir)
>>> wav, sr = engine.tts("じぇーぶいえすコーパス10番目の話者です。", spk_id=10)
Note
This class supports not only Parallel WaveGAN but also any model supported by kan-bayashi/ParallelWaveGAN. For example, HiFi-GAN or MelGAN can be used without any changes.
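Since the vocoder is loaded from the model directory, swapping in a different vocoder only means pointing model_dir at a directory trained with it. A minimal sketch (the directory path below is hypothetical):
>>> from ttslearn.contrib import Tacotron2PWGTTS
>>> # Hypothetical: a model directory whose vocoder checkpoint is HiFi-GAN
>>> engine = Tacotron2PWGTTS(model_dir="exp/my_tacotron2_hifigan")
>>> wav, sr = engine.tts("こんにちは。")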
Multi-speaker Tacotron 2
class ttslearn.contrib.multispk_tacotron2.MultiSpkTacotron2(num_vocab=51, embed_dim=512, encoder_hidden_dim=512, encoder_conv_layers=3, encoder_conv_channels=512, encoder_conv_kernel_size=5, encoder_dropout=0.5, attention_hidden_dim=128, attention_conv_channels=32, attention_conv_kernel_size=31, decoder_out_dim=80, decoder_layers=2, decoder_hidden_dim=1024, decoder_prenet_layers=2, decoder_prenet_hidden_dim=256, decoder_prenet_dropout=0.5, decoder_zoneout=0.1, postnet_layers=5, postnet_channels=512, postnet_kernel_size=5, postnet_dropout=0.5, reduction_factor=1, n_spks=100, spk_emb_dim=64)
Multi-speaker Tacotron 2
This implementation does not include the WaveNet vocoder of Tacotron 2.
Parameters
num_vocab (int) – size of the vocabulary
embed_dim (int) – dimension of the embedding
encoder_hidden_dim (int) – dimension of encoder hidden units
encoder_conv_layers (int) – number of encoder convolution layers
encoder_conv_channels (int) – number of encoder convolution channels
encoder_conv_kernel_size (int) – kernel size of encoder convolutions
encoder_dropout (float) – dropout rate of encoder convolutions
attention_hidden_dim (int) – dimension of attention hidden units
attention_conv_channels (int) – number of attention convolution channels
attention_conv_kernel_size (int) – kernel size of attention convolutions
decoder_out_dim (int) – dimension of decoder outputs (e.g., number of mel bins)
decoder_layers (int) – number of decoder layers
decoder_hidden_dim (int) – dimension of decoder hidden units
decoder_prenet_layers (int) – number of prenet layers
decoder_prenet_hidden_dim (int) – dimension of prenet hidden units
decoder_prenet_dropout (float) – dropout rate of the prenet
decoder_zoneout (float) – zoneout rate of the decoder
postnet_layers (int) – number of postnet layers
postnet_channels (int) – number of postnet channels
postnet_kernel_size (int) – kernel size of postnet convolutions
postnet_dropout (float) – dropout rate of the postnet
reduction_factor (int) – reduction factor (number of frames generated per decoder step)
n_spks (int) – number of speakers
spk_emb_dim (int) – dimension of the speaker embedding
forward(seq, in_lens, decoder_targets, spk_ids)
Forward step
Parameters
seq (torch.Tensor) – input sequence
in_lens (torch.Tensor) – input sequence lengths
decoder_targets (torch.Tensor) – target acoustic feature sequence
spk_ids (torch.Tensor) – speaker IDs
Returns
tuple of decoder outputs, postnet-refined outputs, stop token predictions, and attention weights
Return type
tuple
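A minimal sketch of instantiating the model and running a forward pass on dummy tensors; all shapes and values below are illustrative, not taken from the library:
>>> import torch
>>> from ttslearn.contrib.multispk_tacotron2 import MultiSpkTacotron2
>>> model = MultiSpkTacotron2()  # default hyperparameters
>>> seq = torch.randint(1, 51, (2, 30))        # (batch, time): token IDs
>>> in_lens = torch.tensor([30, 25])           # valid lengths, sorted descending
>>> decoder_targets = torch.randn(2, 120, 80)  # (batch, frames, decoder_out_dim)
>>> spk_ids = torch.tensor([0, 5])             # speaker indices in [0, n_spks)
>>> outs, outs_fine, logits, att_ws = model(seq, in_lens, decoder_targets, spk_ids)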
Utility for multi-speaker training
class ttslearn.contrib.multispk_util.Dataset(in_paths, out_paths, spk_paths)
Dataset for numpy files
Parameters
in_paths (list) – list of paths to input numpy files
out_paths (list) – list of paths to output numpy files
spk_paths (list) – list of paths to speaker-ID numpy files
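A minimal sketch of constructing the dataset; the directory layout is hypothetical, and the assumption that indexing yields an (input, output, speaker-ID) triple per utterance is illustrative:
>>> from pathlib import Path
>>> from ttslearn.contrib.multispk_util import Dataset
>>> in_paths = sorted(Path("dump/train/in_tacotron").glob("*.npy"))  # hypothetical paths
>>> out_paths = sorted(Path("dump/train/out_tacotron").glob("*.npy"))
>>> spk_paths = sorted(Path("dump/train/spk").glob("*.npy"))
>>> dataset = Dataset(in_paths, out_paths, spk_paths)
>>> in_feats, out_feats, spk_id = dataset[0]  # assumption: one utterance per item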
DataLoader
| collate_fn_ms_tacotron | Collate function for multi-speaker Tacotron. |
| get_data_loaders | Get data loaders for training and validation. |
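A minimal sketch of wrapping the dataset built above in a standard PyTorch DataLoader; the collate_fn_ms_tacotron name and its reduction_factor keyword are assumptions based on the summary above:
>>> from functools import partial
>>> from torch.utils.data import DataLoader
>>> from ttslearn.contrib.multispk_util import collate_fn_ms_tacotron
>>> collate_fn = partial(collate_fn_ms_tacotron, reduction_factor=1)  # assumed signature
>>> data_loader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)
>>> batch = next(iter(data_loader))  # one padded mini-batch; exact tuple layout depends on the collate function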