ttslearn.tacotron

A module for text-to-speech based on Tacotron 2.

TTS

The TTS functionality is accessible from ttslearn.tacotron.*

class ttslearn.tacotron.tts.Tacotron2TTS(model_dir=None, device='cpu')[source]

Tacotron 2 based text-to-speech

Parameters
  • model_dir (str) – model directory. A pre-trained model (ID: tacotron2) is used if None.

  • device (str) – cpu or cuda.

Examples

>>> from ttslearn.tacotron import Tacotron2TTS
>>> engine = Tacotron2TTS()
>>> wav, sr = engine.tts("一貫学習にチャレンジしましょう!")

set_device(device)[source]

Set device for the TTS models

Parameters

device (str) – cpu or cuda.

tts(text, griffin_lim=False, tqdm=<class 'tqdm.std.tqdm'>)[source]

Run TTS

Parameters
  • text (str) – Input text

  • griffin_lim (bool, optional) – Use Griffin-Lim algorithm or not. Defaults to False.

  • tqdm (object, optional) – tqdm class for progress reporting. Defaults to tqdm.std.tqdm.

Returns

audio array (np.int16) and sampling rate (int)

Return type

tuple
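
Examples

A minimal sketch of saving the result to a WAV file; scipy is assumed to be available, and "output.wav" is a hypothetical path:

>>> from scipy.io import wavfile
>>> from ttslearn.tacotron import Tacotron2TTS
>>> engine = Tacotron2TTS()
>>> # griffin_lim=True trades quality for speed (no neural vocoder)
>>> wav, sr = engine.tts("一貫学習にチャレンジしましょう!", griffin_lim=True)
>>> wavfile.write("output.wav", sr, wav)  # wav is already np.int16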

Text processing frontend

Open JTalk (Japanese)

pp_symbols

Extract a phoneme + prosody symbol sequence from input full-context labels

text_to_sequence

Convert phoneme + prosody symbols to sequence of numbers

sequence_to_text

Convert sequence of numbers to phoneme + prosody symbols

num_vocab

Get the vocabulary size
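
A minimal round-trip sketch, assuming the Japanese frontend is importable as ttslearn.tacotron.frontend.openjtalk and that full-context labels come from pyopenjtalk.extract_fullcontext:

>>> import pyopenjtalk
>>> from ttslearn.tacotron.frontend.openjtalk import pp_symbols, text_to_sequence, sequence_to_text, num_vocab
>>> labels = pyopenjtalk.extract_fullcontext("こんにちは")  # full-context labels
>>> symbols = pp_symbols(labels)      # phoneme + prosody symbols
>>> seq = text_to_sequence(symbols)   # symbols -> integer IDs
>>> assert sequence_to_text(seq) == symbols  # round trip recovers the symbols
>>> n = num_vocab()                   # vocabulary size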

Text (English)

text_to_sequence

Convert text to sequence of numbers

sequence_to_text

Convert sequence of numbers to text

num_vocab

Get the vocabulary size
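
A corresponding sketch for the English frontend, assuming it is importable as ttslearn.tacotron.frontend.text:

>>> from ttslearn.tacotron.frontend.text import text_to_sequence, sequence_to_text
>>> seq = text_to_sequence("Hello, world!")   # text -> integer IDs
>>> text = sequence_to_text(seq)              # IDs back to (normalized) text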

Encoder

class ttslearn.tacotron.encoder.Encoder(num_vocab=51, embed_dim=512, hidden_dim=512, conv_layers=3, conv_channels=512, conv_kernel_size=5, dropout=0.5)[source]

Encoder of Tacotron 2

Parameters
  • num_vocab (int) – number of vocabularies

  • embed_dim (int) – dimension of embeddings

  • hidden_dim (int) – dimension of hidden units

  • conv_layers (int) – number of convolutional layers

  • conv_channels (int) – number of convolutional channels

  • conv_kernel_size (int) – size of convolutional kernel

  • dropout (float) – dropout rate

forward(seqs, in_lens)[source]

Forward step

Parameters
  • seqs (torch.Tensor) – input sequences

  • in_lens (torch.Tensor) – input sequence lengths

Returns

encoded sequences

Return type

torch.Tensor
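
Examples

A shape-checking sketch with dummy inputs; the expected output shape (batch, time, hidden_dim) is an assumption inferred from the defaults above:

>>> import torch
>>> from ttslearn.tacotron.encoder import Encoder
>>> encoder = Encoder(num_vocab=51)
>>> seqs = torch.randint(1, 51, (2, 10))  # (batch, time) phoneme IDs
>>> in_lens = torch.tensor([10, 8])       # valid lengths, sorted in decreasing order
>>> out = encoder(seqs, in_lens)          # expected: (2, 10, 512)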

Attention

class ttslearn.tacotron.attention.BahdanauAttention(encoder_dim=512, decoder_dim=1024, hidden_dim=128)[source]

Bahdanau-style attention

This is an attention mechanism originally used in Tacotron.

Parameters
  • encoder_dim (int) – dimension of encoder outputs

  • decoder_dim (int) – dimension of decoder outputs

  • hidden_dim (int) – dimension of hidden state

reset()[source]

Reset the internal buffer

forward(encoder_outs, src_lens, decoder_state, mask=None)[source]

Forward step

Parameters
  • encoder_outs (torch.FloatTensor) – encoder outputs

  • src_lens (list) – lengths of the input sequences in the batch

  • decoder_state (torch.FloatTensor) – decoder hidden state

  • mask (torch.FloatTensor) – mask for padding

class ttslearn.tacotron.attention.LocationSensitiveAttention(encoder_dim=512, decoder_dim=1024, hidden_dim=128, conv_channels=32, conv_kernel_size=31)[source]

Location-sensitive attention

This is an attention mechanism used in Tacotron 2.

Parameters
  • encoder_dim (int) – dimension of encoder outputs

  • decoder_dim (int) – dimension of decoder outputs

  • hidden_dim (int) – dimension of hidden state

  • conv_channels (int) – number of channels of convolutional layer

  • conv_kernel_size (int) – size of convolutional kernel

reset()[source]

Reset the internal buffer

forward(encoder_outs, src_lens, decoder_state, att_prev, mask=None)[source]

Forward step

Parameters
  • encoder_outs (torch.FloatTensor) – encoder outputs

  • src_lens (list) – lengths of the input sequences in the batch

  • decoder_state (torch.FloatTensor) – decoder hidden state

  • att_prev (torch.FloatTensor) – previous attention weight

  • mask (torch.FloatTensor) – mask for padding
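
Examples

A single-step sketch with dummy tensors, assuming the module returns a (context, weights) pair and that att_prev=None initializes the attention weights on the first step:

>>> import torch
>>> from ttslearn.tacotron.attention import LocationSensitiveAttention
>>> attn = LocationSensitiveAttention()
>>> encoder_outs = torch.randn(1, 20, 512)  # (batch, time, encoder_dim)
>>> decoder_state = torch.randn(1, 1024)    # (batch, decoder_dim)
>>> attn.reset()                            # clear the internal buffer per utterance
>>> context, weights = attn(encoder_outs, [20], decoder_state, None)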

Decoder

class ttslearn.tacotron.decoder.Prenet(in_dim, layers=2, hidden_dim=256, dropout=0.5)[source]

Pre-Net of Tacotron/Tacotron 2.

Parameters
  • in_dim (int) – dimension of input

  • layers (int) – number of pre-net layers

  • hidden_dim (int) – dimension of hidden layer

  • dropout (float) – dropout rate

forward(x)[source]

Forward step

Parameters

x (torch.Tensor) – input tensor

Returns

output tensor

Return type

torch.Tensor
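
Examples

A minimal sketch; the output dimension follows hidden_dim above:

>>> import torch
>>> from ttslearn.tacotron.decoder import Prenet
>>> prenet = Prenet(in_dim=80)       # e.g., one mel-spectrogram frame as input
>>> y = prenet(torch.randn(4, 80))   # expected: (4, 256)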

class ttslearn.tacotron.decoder.Decoder(encoder_hidden_dim=512, out_dim=80, layers=2, hidden_dim=1024, prenet_layers=2, prenet_hidden_dim=256, prenet_dropout=0.5, zoneout=0.1, reduction_factor=1, attention_hidden_dim=128, attention_conv_channels=32, attention_conv_kernel_size=31)[source]

Decoder of Tacotron 2.

Parameters
  • encoder_hidden_dim (int) – dimension of encoder hidden layer

  • out_dim (int) – dimension of output

  • layers (int) – number of LSTM layers

  • hidden_dim (int) – dimension of hidden layer

  • prenet_layers (int) – number of pre-net layers

  • prenet_hidden_dim (int) – dimension of pre-net hidden layer

  • prenet_dropout (float) – dropout rate of pre-net

  • zoneout (float) – zoneout rate

  • reduction_factor (int) – reduction factor

  • attention_hidden_dim (int) – dimension of attention hidden layer

  • attention_conv_channels (int) – number of attention convolution channels

  • attention_conv_kernel_size (int) – kernel size of attention convolution

forward(encoder_outs, in_lens, decoder_targets=None)[source]

Forward step

Parameters
  • encoder_outs (torch.Tensor) – encoder outputs

  • in_lens (torch.Tensor) – input lengths

  • decoder_targets (torch.Tensor) – decoder targets for teacher-forcing.

Returns

tuple of outputs, stop token prediction, and attention weights

Return type

tuple
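
Examples

A teacher-forcing sketch with dummy tensors, matching the return tuple described above; shapes are assumptions based on the defaults:

>>> import torch
>>> from ttslearn.tacotron.decoder import Decoder
>>> decoder = Decoder()
>>> encoder_outs = torch.randn(1, 20, 512)  # (batch, time, encoder_hidden_dim)
>>> in_lens = torch.tensor([20])
>>> targets = torch.randn(1, 60, 80)        # (batch, frames, out_dim) mel targets
>>> outs, logits, att_ws = decoder(encoder_outs, in_lens, targets)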

Post-Net

class ttslearn.tacotron.postnet.Postnet(in_dim, layers=5, channels=512, kernel_size=5, dropout=0.5)[source]

Post-Net of Tacotron 2

Parameters
  • in_dim (int) – dimension of input

  • layers (int) – number of layers

  • channels (int) – number of channels

  • kernel_size (int) – kernel size

  • dropout (float) – dropout rate

forward(xs)[source]

Forward step

Parameters

xs (torch.Tensor) – input sequence

Returns

output sequence

Return type

torch.Tensor
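
Examples

A sketch assuming channel-first input (batch, in_dim, frames), as the 1-D convolutions suggest; the post-net output has the same shape as its input:

>>> import torch
>>> from ttslearn.tacotron.postnet import Postnet
>>> postnet = Postnet(in_dim=80)
>>> xs = torch.randn(1, 80, 100)  # (batch, in_dim, frames)
>>> residual = postnet(xs)        # same shape as xs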

Tacotron 2

class ttslearn.tacotron.tacotron2.Tacotron2(num_vocab=51, embed_dim=512, encoder_hidden_dim=512, encoder_conv_layers=3, encoder_conv_channels=512, encoder_conv_kernel_size=5, encoder_dropout=0.5, attention_hidden_dim=128, attention_conv_channels=32, attention_conv_kernel_size=31, decoder_out_dim=80, decoder_layers=2, decoder_hidden_dim=1024, decoder_prenet_layers=2, decoder_prenet_hidden_dim=256, decoder_prenet_dropout=0.5, decoder_zoneout=0.1, postnet_layers=5, postnet_channels=512, postnet_kernel_size=5, postnet_dropout=0.5, reduction_factor=1)[source]

Tacotron 2

This implementation does not include the WaveNet vocoder of Tacotron 2.

Parameters
  • num_vocab (int) – size of the vocabulary

  • embed_dim (int) – dimension of embedding

  • encoder_hidden_dim (int) – dimension of hidden unit

  • encoder_conv_layers (int) – the number of convolution layers

  • encoder_conv_channels (int) – the number of convolution channels

  • encoder_conv_kernel_size (int) – kernel size of convolution

  • encoder_dropout (float) – dropout rate of convolution

  • attention_hidden_dim (int) – dimension of hidden unit

  • attention_conv_channels (int) – the number of convolution channels

  • attention_conv_kernel_size (int) – kernel size of convolution

  • decoder_out_dim (int) – dimension of output

  • decoder_layers (int) – the number of decoder layers

  • decoder_hidden_dim (int) – dimension of hidden unit

  • decoder_prenet_layers (int) – the number of prenet layers

  • decoder_prenet_hidden_dim (int) – dimension of hidden unit

  • decoder_prenet_dropout (float) – dropout rate of prenet

  • decoder_zoneout (float) – zoneout rate

  • postnet_layers (int) – the number of postnet layers

  • postnet_channels (int) – the number of postnet channels

  • postnet_kernel_size (int) – kernel size of postnet

  • postnet_dropout (float) – dropout rate of postnet

  • reduction_factor (int) – reduction factor

forward(seq, in_lens, decoder_targets)[source]

Forward step

Parameters
  • seq (torch.Tensor) – input sequence

  • in_lens (torch.Tensor) – input sequence lengths

  • decoder_targets (torch.Tensor) – target sequence

Returns

tuple of outputs, outputs (after post-net), stop token prediction, and attention weights

Return type

tuple

inference(seq)[source]

Inference step

Parameters

seq (torch.Tensor) – input sequence

Returns

tuple of outputs, outputs (after post-net), stop token prediction, and attention weights

Return type

tuple
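
Examples

An end-to-end sketch of both entry points with dummy tensors; the four-element return tuple follows the descriptions above, and inference is assumed to accept an unbatched sequence:

>>> import torch
>>> from ttslearn.tacotron.tacotron2 import Tacotron2
>>> model = Tacotron2()
>>> seq = torch.randint(1, 51, (1, 20))  # (batch, time) phoneme IDs
>>> in_lens = torch.tensor([20])
>>> targets = torch.randn(1, 60, 80)     # (batch, frames, decoder_out_dim)
>>> outs, outs_fine, logits, att_ws = model(seq, in_lens, targets)
>>> outs, outs_fine, logits, att_ws = model.inference(seq[0])  # autoregressive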

Generation utility

synthesis_griffin_lim

Synthesize waveform with the Griffin-Lim algorithm.

synthesis

Synthesize waveform.