ttslearn.tacotron¶
A module for speech synthesis based on Tacotron 2.
TTS¶
The TTS functionality is accessible from ttslearn.tacotron.*
class ttslearn.tacotron.tts.Tacotron2TTS(model_dir=None, device='cpu')
Tacotron 2 based text-to-speech
- Parameters
model_dir – directory of the model to load (a default pretrained model is used when None)
device (str) – device used for inference (default: 'cpu')
Examples
>>> from ttslearn.tacotron import Tacotron2TTS
>>> engine = Tacotron2TTS()
>>> wav, sr = engine.tts("一貫学習にチャレンジしましょう!")
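The tts() call returns the synthesized waveform together with its sampling rate. As a minimal follow-up sketch, the result can be written to a WAV file with scipy (scipy is not part of this module and is an assumption here; the array must be in a dtype accepted by scipy.io.wavfile.write, e.g. int16):
>>> from scipy.io import wavfile
>>> wavfile.write("output.wav", sr, wav)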
Text processing frontend¶
Open JTalk (Japanese)¶
The Open JTalk based frontend provides utilities to:
- Extract a phoneme + prosody symbol sequence from input full-context labels
- Convert phoneme + prosody symbols to a sequence of numbers
- Convert a sequence of numbers back to phoneme + prosody symbols
- Get the vocabulary size
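A minimal sketch of how these utilities fit together. The module path ttslearn.tacotron.frontend.openjtalk and the function names pp_symbols and text_to_sequence are assumptions, since the summary above does not show them; pyopenjtalk is used to obtain full-context labels:
>>> import pyopenjtalk
>>> from ttslearn.tacotron.frontend.openjtalk import pp_symbols, text_to_sequence
>>> labels = pyopenjtalk.extract_fullcontext("こんにちは")  # full-context labels
>>> symbols = pp_symbols(labels)     # phoneme + prosody symbols
>>> seq = text_to_sequence(symbols)  # symbols -> sequence of numbers for the encoder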
Text (English)¶
The English text frontend provides utilities to:
- Convert text to a sequence of numbers
- Convert a sequence of numbers back to text
- Get the vocabulary size
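A round-trip sketch of the English frontend. The module path ttslearn.tacotron.frontend.text and the function names text_to_sequence / sequence_to_text are assumptions, since the summary above does not show them:
>>> from ttslearn.tacotron.frontend.text import text_to_sequence, sequence_to_text
>>> seq = text_to_sequence("hello world")  # characters -> sequence of numbers
>>> text = sequence_to_text(seq)           # numbers -> characters (round trip)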
Encoder¶
class ttslearn.tacotron.encoder.Encoder(num_vocab=51, embed_dim=512, hidden_dim=512, conv_layers=3, conv_channels=512, conv_kernel_size=5, dropout=0.5)
Encoder of Tacotron 2
- Parameters
num_vocab (int) – size of the vocabulary
embed_dim (int) – dimension of embeddings
hidden_dim (int) – dimension of hidden units
conv_layers (int) – number of convolutional layers
conv_channels (int) – number of convolutional channels
conv_kernel_size (int) – size of convolutional kernel
dropout (float) – dropout rate
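A minimal usage sketch with dummy inputs. The forward signature is not documented on this page; the call below assumes it takes a padded batch of symbol IDs together with the per-utterance lengths and returns encoder outputs of shape (batch, max_len, hidden_dim):
>>> import torch
>>> from ttslearn.tacotron.encoder import Encoder
>>> encoder = Encoder(num_vocab=51, embed_dim=512, hidden_dim=512)
>>> seqs = torch.randint(1, 51, (2, 30))   # (batch, max_len) symbol IDs
>>> in_lens = torch.tensor([30, 24])       # lengths before padding, longest first
>>> encoder_outs = encoder(seqs, in_lens)  # assumed shape: (batch, max_len, hidden_dim)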
Attention¶
class ttslearn.tacotron.attention.BahdanauAttention(encoder_dim=512, decoder_dim=1024, hidden_dim=128)
Bahdanau-style attention
This is the attention mechanism used in the original Tacotron.
- Parameters
encoder_dim (int) – dimension of encoder outputs
decoder_dim (int) – dimension of decoder hidden state
hidden_dim (int) – dimension of attention hidden layer
class ttslearn.tacotron.attention.LocationSensitiveAttention(encoder_dim=512, decoder_dim=1024, hidden_dim=128, conv_channels=32, conv_kernel_size=31)
Location-sensitive attention
This is the attention mechanism used in Tacotron 2.
- Parameters
encoder_dim (int) – dimension of encoder outputs
decoder_dim (int) – dimension of decoder hidden state
hidden_dim (int) – dimension of attention hidden layer
conv_channels (int) – number of channels of the location convolution
conv_kernel_size (int) – kernel size of the location convolution
forward(encoder_outs, src_lens, decoder_state, att_prev, mask=None)
Forward step
- Parameters
encoder_outs (torch.FloatTensor) – encoder outputs
src_lens (list) – lengths of the input sequences in the batch
decoder_state (torch.FloatTensor) – decoder hidden state
att_prev (torch.FloatTensor) – previous attention weight
mask (torch.FloatTensor) – mask for padding
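A sketch of a single attention step with dummy tensors, following the forward signature above. The return values are not documented on this page; the call below assumes a context vector and attention weights are returned:
>>> import torch
>>> from ttslearn.tacotron.attention import LocationSensitiveAttention
>>> attn = LocationSensitiveAttention(encoder_dim=512, decoder_dim=1024, hidden_dim=128)
>>> encoder_outs = torch.randn(2, 30, 512)  # (batch, max_len, encoder_dim)
>>> src_lens = [30, 24]                     # length of each input in the batch
>>> decoder_state = torch.randn(2, 1024)    # current decoder hidden state
>>> att_prev = torch.ones(2, 30) / 30       # e.g. uniform previous attention weights
>>> context, att_w = attn(encoder_outs, src_lens, decoder_state, att_prev)  # assumed return values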
Decoder¶
class ttslearn.tacotron.decoder.Prenet(in_dim, layers=2, hidden_dim=256, dropout=0.5)
Pre-Net of Tacotron/Tacotron 2.
- Parameters
in_dim (int) – dimension of input
layers (int) – number of pre-net layers
hidden_dim (int) – dimension of hidden layer
dropout (float) – dropout rate
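A small usage sketch with a dummy mel frame. The forward signature is an assumption; the pre-net is assumed to map the last dimension from in_dim to hidden_dim:
>>> import torch
>>> from ttslearn.tacotron.decoder import Prenet
>>> prenet = Prenet(in_dim=80, layers=2, hidden_dim=256, dropout=0.5)
>>> frame = torch.randn(2, 80)  # (batch, in_dim) previous mel frame
>>> out = prenet(frame)         # assumed shape: (batch, hidden_dim)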
class ttslearn.tacotron.decoder.Decoder(encoder_hidden_dim=512, out_dim=80, layers=2, hidden_dim=1024, prenet_layers=2, prenet_hidden_dim=256, prenet_dropout=0.5, zoneout=0.1, reduction_factor=1, attention_hidden_dim=128, attention_conv_channels=32, attention_conv_kernel_size=31)
Decoder of Tacotron 2.
- Parameters
encoder_hidden_dim (int) – dimension of encoder hidden layer
out_dim (int) – dimension of output
layers (int) – number of LSTM layers
hidden_dim (int) – dimension of hidden layer
prenet_layers (int) – number of pre-net layers
prenet_hidden_dim (int) – dimension of pre-net hidden layer
prenet_dropout (float) – dropout rate of pre-net
zoneout (float) – zoneout rate
reduction_factor (int) – reduction factor
attention_hidden_dim (int) – dimension of attention hidden layer
attention_conv_channels (int) – number of attention convolution channels
attention_conv_kernel_size (int) – kernel size of attention convolution
forward(encoder_outs, in_lens, decoder_targets=None)
Forward step
- Parameters
encoder_outs (torch.Tensor) – encoder outputs
in_lens (torch.Tensor) – input lengths
decoder_targets (torch.Tensor) – decoder targets for teacher-forcing.
- Returns
tuple of outputs, stop token prediction, and attention weights
- Return type
tuple
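A minimal teacher-forcing sketch with dummy data, chaining the Encoder above with the Decoder. Shapes and the unpacking of the returned tuple follow the documentation above but are otherwise assumptions:
>>> import torch
>>> from ttslearn.tacotron.encoder import Encoder
>>> from ttslearn.tacotron.decoder import Decoder
>>> encoder = Encoder(num_vocab=51, hidden_dim=512)
>>> decoder = Decoder(encoder_hidden_dim=512, out_dim=80)
>>> seqs = torch.randint(1, 51, (2, 30))       # (batch, max_len) symbol IDs
>>> in_lens = torch.tensor([30, 24])           # input lengths
>>> decoder_targets = torch.randn(2, 120, 80)  # (batch, frames, out_dim) target mel frames
>>> encoder_outs = encoder(seqs, in_lens)
>>> outs, logits, att_ws = decoder(encoder_outs, in_lens, decoder_targets)  # outputs, stop token prediction, attention weights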
Post-Net¶
Tacotron 2¶
class ttslearn.tacotron.tacotron2.Tacotron2(num_vocab=51, embed_dim=512, encoder_hidden_dim=512, encoder_conv_layers=3, encoder_conv_channels=512, encoder_conv_kernel_size=5, encoder_dropout=0.5, attention_hidden_dim=128, attention_conv_channels=32, attention_conv_kernel_size=31, decoder_out_dim=80, decoder_layers=2, decoder_hidden_dim=1024, decoder_prenet_layers=2, decoder_prenet_hidden_dim=256, decoder_prenet_dropout=0.5, decoder_zoneout=0.1, postnet_layers=5, postnet_channels=512, postnet_kernel_size=5, postnet_dropout=0.5, reduction_factor=1)
Tacotron 2
This implementation does not include the WaveNet vocoder used in the original Tacotron 2.
- Parameters
num_vocab (int) – size of the vocabulary
embed_dim (int) – dimension of embeddings
encoder_hidden_dim (int) – dimension of encoder hidden units
encoder_conv_layers (int) – number of encoder convolution layers
encoder_conv_channels (int) – number of encoder convolution channels
encoder_conv_kernel_size (int) – kernel size of the encoder convolutions
encoder_dropout (float) – dropout rate of the encoder convolutions
attention_hidden_dim (int) – dimension of attention hidden units
attention_conv_channels (int) – number of attention convolution channels
attention_conv_kernel_size (int) – kernel size of the attention convolution
decoder_out_dim (int) – dimension of decoder outputs
decoder_layers (int) – number of decoder LSTM layers
decoder_hidden_dim (int) – dimension of decoder hidden units
decoder_prenet_layers (int) – number of pre-net layers
decoder_prenet_hidden_dim (int) – dimension of pre-net hidden units
decoder_prenet_dropout (float) – dropout rate of the pre-net
decoder_zoneout (float) – zoneout rate
postnet_layers (int) – number of post-net layers
postnet_channels (int) – number of post-net channels
postnet_kernel_size (int) – kernel size of the post-net
postnet_dropout (float) – dropout rate of the post-net
reduction_factor (int) – reduction factor
forward(seq, in_lens, decoder_targets)
Forward step
- Parameters
seq (torch.Tensor) – input sequence
in_lens (torch.Tensor) – input sequence lengths
decoder_targets (torch.Tensor) – target sequence
- Returns
tuple of decoder outputs, outputs after the post-net, stop token prediction, and attention weights
- Return type
tuple
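A minimal end-to-end sketch with dummy tensors, following the forward signature above. Shapes are assumptions; in practice the symbol IDs come from the text frontend and the decoder targets are log-mel spectrogram frames:
>>> import torch
>>> from ttslearn.tacotron.tacotron2 import Tacotron2
>>> model = Tacotron2(num_vocab=51, decoder_out_dim=80)
>>> seq = torch.randint(1, 51, (2, 30))        # (batch, max_len) symbol IDs
>>> in_lens = torch.tensor([30, 24])           # input lengths
>>> decoder_targets = torch.randn(2, 120, 80)  # (batch, frames, decoder_out_dim)
>>> outs, outs_fine, logits, att_ws = model(seq, in_lens, decoder_targets)  # outputs before/after post-net, stop token prediction, attention weights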
Generation utility¶
The generation utilities provide functions to:
- Synthesize a waveform with the Griffin-Lim algorithm
- Synthesize a waveform
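The function names are not shown on this page, so as an illustration of the Griffin-Lim step only, the sketch below inverts a mel spectrogram to a waveform with librosa (a generic example, not this module's API; sr, n_fft, and hop_length are assumed values):
>>> import numpy as np
>>> import librosa
>>> mel = np.random.rand(80, 200)  # (n_mels, frames) dummy power mel spectrogram
>>> wav = librosa.feature.inverse.mel_to_audio(mel, sr=16000, n_fft=1024, hop_length=256)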