ttslearn.contrib

A module for advanced implementations.

TTS

class ttslearn.contrib.tacotron2_pwg.Tacotron2PWGTTS(model_dir=None, device='cpu')[source]

Fast Tacotron 2 based text-to-speech with Parallel WaveGAN

The WaveNet vocoder in Tacotron 2 is replaced with Parallel WaveGAN for fast, real-time inference. Both single-speaker and multi-speaker Tacotron 2 models are supported.

Parameters
  • model_dir (str) – model directory. A pre-trained model (ID: tacotron2_pwg_jsut24k) is used if None.

  • device (str) – cpu or cuda.

Examples

Single-speaker TTS

>>> from ttslearn.contrib import Tacotron2PWGTTS
>>> engine = Tacotron2PWGTTS()
>>> wav, sr = engine.tts("発展的な音声合成です!")

Multi-speaker TTS

>>> from ttslearn.contrib import Tacotron2PWGTTS
>>> from ttslearn.pretrained import retrieve_pretrained_model
>>> model_dir = retrieve_pretrained_model("multspk_tacotron2_pwg_jvs24k")
>>> engine = Tacotron2PWGTTS(model_dir)
>>> wav, sr = engine.tts("じぇーぶいえすコーパス10番目の話者です。", spk_id=10)

Note

This class supports not only Parallel WaveGAN but also any vocoder model supported by kan-bayashi/ParallelWaveGAN. For example, HiFi-GAN or MelGAN can be used without any code changes.
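For example, pointing model_dir at a directory containing a vocoder trained with kan-bayashi/ParallelWaveGAN is sufficient (a sketch; the path below is a placeholder):

>>> from ttslearn.contrib import Tacotron2PWGTTS
>>> # hypothetical directory holding a Tacotron 2 checkpoint and a HiFi-GAN vocoder
>>> engine = Tacotron2PWGTTS(model_dir="path/to/hifigan_model_dir")
>>> wav, sr = engine.tts("発展的な音声合成です!")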

set_device(device)[source]

Set device for the TTS models

Parameters

device (str) – cpu or cuda.
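Examples

A minimal usage sketch (requires a CUDA-enabled PyTorch build):

>>> from ttslearn.contrib import Tacotron2PWGTTS
>>> engine = Tacotron2PWGTTS()
>>> engine.set_device("cuda")  # move the underlying TTS models to the GPU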

tts(text, tqdm=None, spk_id=None)[source]

Run TTS

Parameters
  • text (str) – Input text

  • tqdm (obj) – tqdm progress bar

  • spk_id (int) – speaker ID. This should be specified only for multi-speaker models.

Returns

audio array (np.int16) and sampling rate (int)

Return type

tuple
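Since the returned waveform is an np.int16 array, it can be written to a WAV file with scipy.io.wavfile, for example (a usage sketch; the output file name is arbitrary):

>>> import scipy.io.wavfile
>>> from ttslearn.contrib import Tacotron2PWGTTS
>>> engine = Tacotron2PWGTTS()
>>> wav, sr = engine.tts("発展的な音声合成です!")
>>> scipy.io.wavfile.write("out.wav", sr, wav)  # sr is the sampling rate in Hz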

Multi-speaker Tacotron 2

class ttslearn.contrib.multispk_tacotron2.MultiSpkTacotron2(num_vocab=51, embed_dim=512, encoder_hidden_dim=512, encoder_conv_layers=3, encoder_conv_channels=512, encoder_conv_kernel_size=5, encoder_dropout=0.5, attention_hidden_dim=128, attention_conv_channels=32, attention_conv_kernel_size=31, decoder_out_dim=80, decoder_layers=2, decoder_hidden_dim=1024, decoder_prenet_layers=2, decoder_prenet_hidden_dim=256, decoder_prenet_dropout=0.5, decoder_zoneout=0.1, postnet_layers=5, postnet_channels=512, postnet_kernel_size=5, postnet_dropout=0.5, reduction_factor=1, n_spks=100, spk_emb_dim=64)[source]

Multi-speaker Tacotron 2

This implementation does not include the WaveNet vocoder of Tacotron 2.

Parameters
  • num_vocab (int) – the size of the vocabulary

  • embed_dim (int) – dimension of token embeddings

  • encoder_hidden_dim (int) – dimension of encoder hidden units

  • encoder_conv_layers (int) – the number of encoder convolution layers

  • encoder_conv_channels (int) – the number of encoder convolution channels

  • encoder_conv_kernel_size (int) – kernel size of encoder convolutions

  • encoder_dropout (float) – dropout rate of encoder convolutions

  • attention_hidden_dim (int) – dimension of attention hidden units

  • attention_conv_channels (int) – the number of attention convolution channels

  • attention_conv_kernel_size (int) – kernel size of attention convolutions

  • decoder_out_dim (int) – dimension of decoder outputs (e.g., the number of mel bins)

  • decoder_layers (int) – the number of decoder layers

  • decoder_hidden_dim (int) – dimension of decoder hidden units

  • decoder_prenet_layers (int) – the number of pre-net layers

  • decoder_prenet_hidden_dim (int) – dimension of pre-net hidden units

  • decoder_prenet_dropout (float) – dropout rate of the pre-net

  • decoder_zoneout (float) – zoneout rate of the decoder

  • postnet_layers (int) – the number of post-net layers

  • postnet_channels (int) – the number of post-net channels

  • postnet_kernel_size (int) – kernel size of post-net convolutions

  • postnet_dropout (float) – dropout rate of the post-net

  • reduction_factor (int) – reduction factor (the number of frames generated per decoder step)

  • n_spks (int) – the number of speakers

  • spk_emb_dim (int) – dimension of speaker embeddings

forward(seq, in_lens, decoder_targets, spk_ids)[source]

Forward step

Parameters
  • seq (torch.Tensor) – input sequence

  • in_lens (torch.Tensor) – input sequence lengths

  • decoder_targets (torch.Tensor) – target sequence

  • spk_ids (torch.Tensor) – speaker IDs

Returns

tuple of outputs, outputs (after post-net), stop token predictions, and attention weights

Return type

tuple
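A minimal sketch with dummy tensors (the shapes below are assumptions inferred from the parameter descriptions above, not part of the documented API):

>>> import torch
>>> from ttslearn.contrib.multispk_tacotron2 import MultiSpkTacotron2
>>> model = MultiSpkTacotron2(num_vocab=51, n_spks=100)
>>> seq = torch.randint(1, 51, (2, 20))        # (batch, time): token IDs
>>> in_lens = torch.tensor([20, 16])           # input lengths per utterance
>>> decoder_targets = torch.randn(2, 100, 80)  # (batch, frames, decoder_out_dim)
>>> spk_ids = torch.tensor([0, 1])             # one speaker ID per utterance
>>> outs, outs_fine, logits, att_ws = model(seq, in_lens, decoder_targets, spk_ids)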

inference(seq, spk_id)[source]

Inference step

Parameters
  • seq (torch.Tensor) – input sequence

  • spk_id (int) – speaker ID

Returns

tuple of outputs, outputs (after post-net), stop token predictions, and attention weights

Return type

tuple
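A minimal sketch (the 1-D shape of seq is an assumption; token IDs for a single utterance):

>>> import torch
>>> from ttslearn.contrib.multispk_tacotron2 import MultiSpkTacotron2
>>> model = MultiSpkTacotron2(num_vocab=51, n_spks=100)
>>> seq = torch.randint(1, 51, (20,))  # token IDs for one utterance
>>> outs, outs_fine, logits, att_ws = model.inference(seq, spk_id=0)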

Utility for multi-speaker training

class ttslearn.contrib.multispk_util.Dataset(in_paths, out_paths, spk_paths)[source]

Dataset for numpy files

Parameters
  • in_paths (list) – List of paths to input files

  • out_paths (list) – List of paths to output files

  • spk_paths (list) – List of paths to speaker ID files

__getitem__(idx)[source]

Get a tuple of input, target, and speaker ID

Parameters

idx (int) – index of the example

Returns

input, target and speaker ID in numpy format

Return type

tuple

__len__()[source]

Returns the size of the dataset

Returns

size of the dataset

Return type

int
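A construction sketch (the directory layout below is hypothetical; any matching lists of .npy paths will do):

>>> from pathlib import Path
>>> from ttslearn.contrib.multispk_util import Dataset
>>> # hypothetical dump directories containing precomputed features
>>> in_paths = sorted(Path("dump/in_tacotron").glob("*.npy"))
>>> out_paths = sorted(Path("dump/out_tacotron").glob("*.npy"))
>>> spk_paths = sorted(Path("dump/spk").glob("*.npy"))
>>> dataset = Dataset(in_paths, out_paths, spk_paths)
>>> in_feats, out_feats, spk_id = dataset[0]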

DataLoader

collate_fn_ms_tacotron

Collate function for multi-speaker Tacotron.

get_data_loaders

Get data loaders for training and validation.
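A sketch of using the collate function with a standard PyTorch DataLoader (assuming collate_fn_ms_tacotron can be passed directly as a collate_fn; check its signature for extra options such as the reduction factor):

>>> from torch.utils.data import DataLoader
>>> from ttslearn.contrib.multispk_util import collate_fn_ms_tacotron
>>> # "dataset" is a Dataset instance as constructed above
>>> data_loader = DataLoader(dataset, batch_size=8, collate_fn=collate_fn_ms_tacotron)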

Helper for training

setup

Setup for training.