
This module provides the dataset abstraction. In this library, a dataset represents a fixed-size set of features (e.g., acoustic, linguistic, or duration features) composed of multiple utterances, and it supports iteration and indexing.


To build datasets and represent a variety of features (linguistic, duration, acoustic, etc.) in a unified way, we define a couple of interfaces:

  1. FileDataSource
  2. Dataset

The former is an abstraction of file data sources: it specifies where to find the data and how to process them. Any FileDataSource must implement:

  • collect_files: specifies where to find source files (wav, lab, cmp, bin, etc.).
  • collect_features: specifies how to collect features (e.g., load them from a file or run some feature extraction logic).

The latter is an abstraction of a dataset. Any dataset must implement the Dataset interface:

  • __getitem__: returns features (typically a two-dimensional numpy.ndarray).
  • __len__: returns the size of dataset (e.g., number of utterances).

One important point is that we use numpy.ndarray to represent features (although there may be exceptions). For example:

  • F0 trajectory as a T x 1 array, where T is the number of frames.
  • Spectrogram as a T x D array, where D is the feature dimension.
  • Linguistic features as T x D array.
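
The two interfaces described above can be illustrated with a small, self-contained sketch. It does not use nnmnkwii itself: ToyFileDataSource and ToyDataset are hypothetical names, the "files" are plain integers, and the features are synthetic T x D arrays. It only shows how the four methods fit together.

```python
import numpy as np


class ToyFileDataSource:
    """Hypothetical data source following the two-method interface."""

    def __init__(self, lengths, dim):
        self.lengths = lengths  # number of frames T for each fake utterance
        self.dim = dim          # feature dimension D

    def collect_files(self):
        # A real source would return file paths; here each "path" is an index.
        return list(range(len(self.lengths)))

    def collect_features(self, path):
        # A real source would load or extract features from the file.
        T = self.lengths[path]
        return np.full((T, self.dim), float(path), dtype=np.float32)


class ToyDataset:
    """Hypothetical dataset exposing __getitem__ and __len__."""

    def __init__(self, data_source):
        self.data_source = data_source
        self.collected_files = data_source.collect_files()

    def __getitem__(self, idx):
        # Features are computed on demand, one utterance at a time.
        return self.data_source.collect_features(self.collected_files[idx])

    def __len__(self):
        return len(self.collected_files)


dataset = ToyDataset(ToyFileDataSource(lengths=[5, 8, 3], dim=4))
shapes = [dataset[i].shape for i in range(len(dataset))]
print(shapes)  # prints [(5, 4), (8, 4), (3, 4)]
```
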
class nnmnkwii.datasets.FileDataSource[source]

File data source interface.

Users are expected to implement a custom data source for their own data. All file data sources must implement this interface.


Collect features given path(s).

Parameters:args – File path or tuple of file paths
Returns:T x D features represented by 2d array.
Return type:2darray

Collect data source files

Returns:List of files, or tuple of list if you need multiple files to collect features.
Return type:List or tuple of list
class nnmnkwii.datasets.Dataset[source]

Dataset represents a fixed-sized set of features composed of multiple utterances.


By combining FileDataSource and Dataset, we define some dataset implementations that can be used in typical situations.


Note that we do not provide special iterator implementations (e.g., mini-batch iteration, multiprocessing, etc.). Users are expected to use a dataset together with another iterator implementation. PyTorch users can use the PyTorch DataLoader for mini-batch iteration and multiprocessing. Our dataset interface is the same as PyTorch's, so the PyTorch DataLoader can be used seamlessly. See the tutorials for practical usage.
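
As a rough illustration of what such an external iterator does, here is a dependency-free sketch of mini-batch iteration over any object that implements __getitem__ and __len__. The helper iterate_minibatches is hypothetical and is not part of nnmnkwii or PyTorch; in practice a full-featured loader such as the PyTorch DataLoader would take its place.

```python
import random


def iterate_minibatches(dataset, batch_size, shuffle=False, seed=0):
    """Yield lists of items from any object with __getitem__ and __len__."""
    indices = list(range(len(dataset)))
    if shuffle:
        # A seeded generator keeps the shuffled order reproducible.
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]


# A plain list satisfies the same two-method interface, so it stands in
# for a dataset of utterances here.
toy_dataset = ["utt0", "utt1", "utt2", "utt3", "utt4"]
batches = list(iterate_minibatches(toy_dataset, batch_size=2))
print(batches)  # prints [['utt0', 'utt1'], ['utt2', 'utt3'], ['utt4']]
```
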

Dataset that supports utterance-wise iteration

class nnmnkwii.datasets.FileSourceDataset(file_data_source)[source]

The most basic dataset implementation. It supports utterance-wise iteration and has a utility (the asarray method) to convert the dataset to a three-dimensional numpy.ndarray.

Speech features typically have a different number of time frames for each utterance, so we cannot simply represent a dataset as a single array. To address this, the dataset class represents a set of features as an N x T^max x D array padded with zeros, where N is the number of utterances, T^max is the maximum number of frames and D is the feature dimension.

Although this dataset loads features on demand when indexed, if you are dealing with a relatively small dataset, it might be useful to convert it to an array and then process it with numpy/scipy functionality.
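
The zero-padding scheme described above can be sketched in a few lines of numpy. This is an illustration only, not the library's asarray implementation; the helper name pad_to_3d and the error raised for a too-short padded_length are assumptions.

```python
import numpy as np


def pad_to_3d(utterances, padded_length=None, dtype=np.float32):
    """Stack T_i x D arrays into one N x T^max x D array with zero padding."""
    max_len = max(len(x) for x in utterances)
    if padded_length is None:
        padded_length = max_len
    if padded_length < max_len:
        raise ValueError("padded_length is shorter than the longest utterance")
    dim = utterances[0].shape[1]
    out = np.zeros((len(utterances), padded_length, dim), dtype=dtype)
    for n, x in enumerate(utterances):
        # Copy the real frames; the remaining frames stay zero.
        out[n, :len(x)] = x
    return out


utterances = [np.ones((t, 4), dtype=np.float32) for t in (5, 8, 3)]
padded = pad_to_3d(utterances)
print(padded.shape)  # prints (3, 8, 4)
```

Passing an explicit padded_length, e.g. pad_to_3d(utterances, 10), yields an array of shape (3, 10, 4) instead.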


FileDataSource – Data source to specify 1) where to find data to be loaded and 2) how to collect features from them.


ndarray – Collected files are stored.

Parameters:file_data_source (FileDataSource) – File data source.


>>> from nnmnkwii.util import example_file_data_sources_for_acoustic_model
>>> from nnmnkwii.datasets import FileSourceDataset
>>> X, Y = example_file_data_sources_for_acoustic_model()
>>> X, Y = FileSourceDataset(X), FileSourceDataset(Y)
>>> for (x, y) in zip(X, Y):
...     print(x.shape, y.shape)
(578, 425) (578, 187)
(675, 425) (675, 187)
(606, 425) (606, 187)
>>> X.asarray(1000).shape
(3, 1000, 425)
>>> Y.asarray(1000).shape
(3, 1000, 187)
asarray(padded_length=None, dtype=<class 'numpy.float32'>, padded_length_guess=1000, verbose=0)[source]

Convert dataset to numpy array.

This tries to load the entire dataset into a single 3D numpy array.

  • padded_length (int) – Maximum number of time frames to expect. If None, it is set to the actual maximum time length.
  • dtype (numpy.dtype) – Numpy dtype.
  • padded_length_guess (int) – Initial guess of the maximum time length of the padded dataset array. Used if padded_length is None.

Array of shape N x T^max x D if padded_length is None, otherwise N x padded_length x D.

Return type:


class nnmnkwii.datasets.PaddedFileSourceDataset(file_data_source, padded_length)[source]

Basic dataset with padding. Very similar to FileSourceDataset, it supports utterance-wise iteration and has a utility (the asarray method) to convert the dataset to a three-dimensional numpy.ndarray.

The difference from FileSourceDataset is that this class returns padded features as a T^max x D array from __getitem__, while FileSourceDataset returns an unpadded T x D array.

  • file_data_source (FileDataSource) – File data source.
  • padded_length (int) – Padded length.





>>> from nnmnkwii.util import example_file_data_sources_for_acoustic_model
>>> from nnmnkwii.datasets import PaddedFileSourceDataset
>>> X, Y = example_file_data_sources_for_acoustic_model()
>>> X, Y = PaddedFileSourceDataset(X, 1000), PaddedFileSourceDataset(Y, 1000)
>>> for (x, y) in zip(X, Y):
...     print(x.shape, y.shape)
(1000, 425) (1000, 187)
(1000, 425) (1000, 187)
(1000, 425) (1000, 187)
>>> X.asarray().shape
(3, 1000, 425)
>>> Y.asarray().shape
(3, 1000, 187)
asarray(dtype=<class 'numpy.float32'>, verbose=0)[source]

Convert dataset to numpy array.

This tries to load the entire dataset into a single 3D numpy array.

  • padded_length (int) – Number of maximum time frames to be expected. If None, it is set to actual maximum time length.
  • dtype (numpy.dtype) – Numpy dtype.
  • padded_length_guess – (int): Initial guess of max time length of padded dataset array. Used if padded_length is None.

Array of shape N x T^max x D if padded_length is None, otherwise N x padded_length x D.

Return type:


class nnmnkwii.datasets.MemoryCacheDataset(dataset, cache_size=777)[source]

A thin dataset wrapper class that has simple cache functionality. It supports utterance-wise iteration.

  • dataset (Dataset) – Dataset implementation to wrap.
  • cache_size (int) – Cache size (utterance unit).

Dataset – Dataset


OrderedDict – Loaded utterances. Keys are utterance indices and values are numpy arrays.


int – Cache size.


>>> from nnmnkwii.util import example_file_data_sources_for_acoustic_model
>>> from nnmnkwii.datasets import FileSourceDataset
>>> X, Y = example_file_data_sources_for_acoustic_model()
>>> X, Y = FileSourceDataset(X), FileSourceDataset(Y)
>>> from nnmnkwii.datasets import MemoryCacheDataset
>>> X, Y = MemoryCacheDataset(X), MemoryCacheDataset(Y)
>>> X.cached_utterances
>>> for (x, y) in zip(X, Y):
...     print(x.shape, y.shape)
(578, 425) (578, 187)
(675, 425) (675, 187)
(606, 425) (606, 187)
>>> len(X.cached_utterances)
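
To make the caching behaviour concrete, here is a minimal, library-independent sketch of a wrapper that keeps loaded items in an OrderedDict keyed by utterance index. The class name CachedDataset is hypothetical, and evicting the oldest inserted entry once the cache is full is an assumption of this sketch rather than a documented property of MemoryCacheDataset.

```python
from collections import OrderedDict


class CachedDataset:
    """Hypothetical wrapper that keeps recently loaded items in memory."""

    def __init__(self, dataset, cache_size=2):
        self.dataset = dataset
        self.cache_size = cache_size
        self.cached_utterances = OrderedDict()

    def __getitem__(self, idx):
        if idx in self.cached_utterances:
            # Cache hit: no access to the wrapped dataset is needed.
            return self.cached_utterances[idx]
        item = self.dataset[idx]
        self.cached_utterances[idx] = item
        if len(self.cached_utterances) > self.cache_size:
            # Evict the oldest inserted entry (assumed policy).
            self.cached_utterances.popitem(last=False)
        return item

    def __len__(self):
        return len(self.dataset)


wrapped = CachedDataset(["a", "b", "c"], cache_size=2)
items = [wrapped[i] for i in range(len(wrapped))]
print(items, list(wrapped.cached_utterances))  # prints ['a', 'b', 'c'] [1, 2]
```
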

Dataset that supports frame-wise iteration

class nnmnkwii.datasets.MemoryCacheFramewiseDataset(dataset, lengths, cache_size=777)[source]

A thin dataset wrapper class that has simple cache functionality. It supports frame-wise iteration. Unlike the utterance-wise datasets, you need to explicitly give the number of time frames for each utterance at construction, since the class has to know the size of the dataset to implement __len__.


If you access the dataset randomly, make sure that the cache size is sufficiently large to avoid reloading files many times.

  • dataset (Dataset) – Dataset implementation to wrap.
  • lengths (list) – Frame lengths for each utterance.
  • cache_size (int) – Cache size (utterance unit).

Dataset – Dataset


OrderedDict – Loaded utterances.


int – Cache size.

>>> from nnmnkwii.util import example_file_data_sources_for_acoustic_model
>>> from nnmnkwii.datasets import FileSourceDataset
>>> from nnmnkwii.datasets import MemoryCacheFramewiseDataset
>>> X, Y = example_file_data_sources_for_acoustic_model()
>>> X, Y = FileSourceDataset(X), FileSourceDataset(Y)
>>> len(X)
>>> lengths = [len(x) for x in X] # collect frame lengths
>>> X = MemoryCacheFramewiseDataset(X, lengths)
>>> Y = MemoryCacheFramewiseDataset(Y, lengths)
>>> len(X)
>>> X[0].shape
>>> Y[0].shape
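
The reason the per-utterance lengths are required can be seen from a small sketch of frame-wise indexing: a global frame index has to be mapped to an utterance and to a frame within it, and the total length is the sum of all lengths. The class FramewiseView below is hypothetical and uses nested lists in place of feature arrays; the actual implementation may differ.

```python
from bisect import bisect_right
from itertools import accumulate


class FramewiseView:
    """Hypothetical frame-wise view over a sequence of utterances."""

    def __init__(self, utterances, lengths):
        self.utterances = utterances
        # cumulative[i] is the number of frames up to and including utterance i.
        self.cumulative = list(accumulate(lengths))

    def __len__(self):
        return self.cumulative[-1] if self.cumulative else 0

    def __getitem__(self, frame_idx):
        if not 0 <= frame_idx < len(self):
            raise IndexError(frame_idx)
        # First utterance whose cumulative frame count exceeds frame_idx.
        utt_idx = bisect_right(self.cumulative, frame_idx)
        start = self.cumulative[utt_idx - 1] if utt_idx > 0 else 0
        return self.utterances[utt_idx][frame_idx - start]


utterances = [["a0", "a1"], ["b0", "b1", "b2"], ["c0"]]
lengths = [len(u) for u in utterances]
view = FramewiseView(utterances, lengths)
frames = [view[i] for i in range(len(view))]
print(len(view), frames)  # prints 6 ['a0', 'a1', 'b0', 'b1', 'b2', 'c0']
```
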

Builtin data sources

There are a couple of builtin file data sources for typical datasets to make them easy to work with. With the following data source implementations, you only need to implement collect_features, which defines what features you want from a wav file or text (depending on the data source). If you want maximum flexibility in accessing a dataset, you may want to implement your own data source instead of using the builtin ones.

For example, to extract acoustic features from the wav files of CMU Arctic, you can write:

from nnmnkwii.preprocessing import trim_zeros_frames
from nnmnkwii.datasets import FileSourceDataset
from nnmnkwii.datasets import cmu_arctic
from scipy.io import wavfile
import numpy as np
import pysptk
import pyworld

class MyFileDataSource(cmu_arctic.WavFileDataSource):
    def __init__(self, data_root, speakers, max_files=100):
        super(MyFileDataSource, self).__init__(
            data_root, speakers, max_files=max_files)

    def collect_features(self, path):
        """Compute mel-cepstrum given a wav file."""
        # Load the waveform (scipy.io.wavfile is assumed here as the wav reader)
        fs, x = wavfile.read(path)
        x = x.astype(np.float64)
        # Estimate F0 with DIO and refine it with StoneMask
        f0, timeaxis = pyworld.dio(x, fs, frame_period=5)
        f0 = pyworld.stonemask(x, f0, timeaxis, fs)
        # Spectral envelope, then conversion to mel-cepstrum
        spectrogram = pyworld.cheaptrick(x, f0, timeaxis, fs)
        spectrogram = trim_zeros_frames(spectrogram)
        mc = pysptk.sp2mc(spectrogram, order=24, alpha=0.41)
        return mc.astype(np.float32)

DATA_ROOT = "/home/ryuichi/data/cmu_arctic/" # your data path
data_source = MyFileDataSource(DATA_ROOT, speakers=["clb"], max_files=100)

# 100 wav files of `clb` speaker will be collected
X = FileSourceDataset(data_source)
assert len(X) == 100

for x in X:
    # do anything on acoustic features (e.g., save to disk)
    pass

More realistic examples can be found in the tests directory of nnmnkwii and in the tutorial notebooks in nnmnkwii_gallery.

CMU Arctic (en)

You can download data from

class nnmnkwii.datasets.cmu_arctic.WavFileDataSource(data_root, speakers, labelmap=None, max_files=None)[source]

Wav file data source for CMU Arctic dataset.

The data source collects wav files from CMU Arctic. Users are expected to inherit the class and implement collect_features method, which defines how features are computed given a wav file path.

  • data_root (str) – Data root.
  • speakers (list) – List of speakers to find. Supported speaker names are awb, bdl, clb, jmk, ksp, rms and slt.
  • labelmap (dict[optional]) – Dict of speaker labels. If None, labels are assigned incrementally (i.e., 0, 1, 2) to the specified speakers.
  • max_files (int) – Total number of files to be collected.

numpy.ndarray – Speaker labels paired with collected files. Stored in collect_files. This is useful to build multi-speaker models.
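
The incremental label assignment and the pairing of labels with collected files can be sketched as follows. The helpers build_labelmap and labels_for_files, as well as the file names, are hypothetical and only illustrate the idea; they are not part of the library.

```python
def build_labelmap(speakers, labelmap=None):
    """Return a speaker-name -> integer-label dict."""
    if labelmap is None:
        # Default: 0, 1, 2, ... in the order the speakers were specified.
        labelmap = {speaker: idx for idx, speaker in enumerate(speakers)}
    return labelmap


def labels_for_files(files_per_speaker, labelmap):
    """Pair every collected file with the label of its speaker."""
    paths, labels = [], []
    for speaker, files in files_per_speaker.items():
        for path in files:
            paths.append(path)
            labels.append(labelmap[speaker])
    return paths, labels


labelmap = build_labelmap(["clb", "slt"])
paths, labels = labels_for_files(
    {"clb": ["clb_a.wav", "clb_b.wav"], "slt": ["slt_a.wav"]}, labelmap)
print(labelmap, labels)  # prints {'clb': 0, 'slt': 1} [0, 0, 1]
```
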


Collect wav files for specific speakers.

Returns:List of collected wav files.
Return type:list

VCTK (en)

You can download data (15GB) from


Note that VCTK data sources do not collect files for speaker 315, since no transcriptions are available for that speaker.

class nnmnkwii.datasets.vctk.TranscriptionDataSource(data_root, speakers=['225', '226', '227', '228', '229', '230', '231', '232', '233', '234', '236', '237', '238', '239', '240', '241', '243', '244', '245', '246', '247', '248', '249', '250', '251', '252', '253', '254', '255', '256', '257', '258', '259', '260', '261', '262', '263', '264', '265', '266', '267', '268', '269', '270', '271', '272', '273', '274', '275', '276', '277', '278', '279', '280', '281', '282', '283', '284', '285', '286', '287', '288', '292', '293', '294', '295', '297', '298', '299', '300', '301', '302', '303', '304', '305', '306', '307', '308', '310', '311', '312', '313', '314', '316', '317', '318', '323', '326', '329', '330', '333', '334', '335', '336', '339', '340', '341', '343', '345', '347', '351', '360', '361', '362', '363', '364', '374', '376'], labelmap=None, max_files=None)[source]

Transcription data source for VCTK dataset.

The data source collects text transcriptions from VCTK. Users are expected to inherit the class and implement collect_features method, which defines how features are computed given a transcription.

  • data_root (str) – Data root.
  • speakers (list) – List of speakers to find. Speaker ids must be str. For supported speaker names, please refer to available_speakers defined in the module.
  • labelmap (dict[optional]) – Dict of speaker labels. If None, labels are assigned incrementally (i.e., 0, 1, 2) to the specified speakers.
  • max_files (int) – Total number of files to be collected.

dict – Dict of speaker information dicts. Keys are speaker ids (str) and each value is a dict of speaker information consisting of AGE, GENDER and REGION.


numpy.ndarray – Speaker labels paired with collected files. Stored in collect_files. This is useful to build multi-speaker models.


Collect data source files

Returns:List of files, or tuple of list if you need multiple files to collect features.
Return type:List or tuple of list
class nnmnkwii.datasets.vctk.WavFileDataSource(data_root, speakers=['225', '226', '227', '228', '229', '230', '231', '232', '233', '234', '236', '237', '238', '239', '240', '241', '243', '244', '245', '246', '247', '248', '249', '250', '251', '252', '253', '254', '255', '256', '257', '258', '259', '260', '261', '262', '263', '264', '265', '266', '267', '268', '269', '270', '271', '272', '273', '274', '275', '276', '277', '278', '279', '280', '281', '282', '283', '284', '285', '286', '287', '288', '292', '293', '294', '295', '297', '298', '299', '300', '301', '302', '303', '304', '305', '306', '307', '308', '310', '311', '312', '313', '314', '316', '317', '318', '323', '326', '329', '330', '333', '334', '335', '336', '339', '340', '341', '343', '345', '347', '351', '360', '361', '362', '363', '364', '374', '376'], labelmap=None, max_files=None)[source]

Wav file data source for VCTK dataset.

The data source collects wav files from VCTK. Users are expected to inherit the class and implement collect_features method, which defines how features are computed given a wav file path.

  • data_root (str) – Data root.
  • speakers (list) – List of speakers to find. Speaker ids must be str. For supported speaker names, please refer to available_speakers defined in the module.
  • labelmap (dict[optional]) – Dict of speaker labels. If None, labels are assigned incrementally (i.e., 0, 1, 2) to the specified speakers.
  • max_files (int) – Total number of files to be collected.

dict – Dict of speaker information dicts. Keys are speaker ids (str) and each value is a dict of speaker information consisting of AGE, GENDER and REGION.


numpy.ndarray – Speaker labels paired with collected files. Stored in collect_files. This is useful to build multi-speaker models.


Collect data source files

Returns:List of files, or tuple of list if you need multiple files to collect features.
Return type:List or tuple of list

LJ-Speech (en)

You can download data (2.6GB) from

class nnmnkwii.datasets.ljspeech.TranscriptionDataSource(data_root, normalized=False)[source]

Transcription data source for LJSpeech dataset.

The data source collects text transcriptions from LJSpeech. Users are expected to inherit the class and implement collect_features method, which defines how features are computed given a transcription.

  • data_root (str) – Data root.
  • normalized (bool) – Collect normalized transcriptions or not.

numpy.ndarray – Metadata, shape (num_files x 3).


Collect text transcriptions.


Note that it returns list of transcriptions (str), not file paths.

Returns:List of text transcription.
Return type:list
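
Because a transcription data source hands strings rather than file paths to collect_features, a subclass typically turns text into some numeric representation. The sketch below shows one possible shape of that logic without using nnmnkwii: ToyTranscriptionSource and text_to_ids are hypothetical, the character-level vocabulary is an arbitrary choice, and plain lists stand in for arrays.

```python
def text_to_ids(text, vocabulary):
    """Map each character to an integer id; unknown characters map to 0."""
    return [vocabulary.get(ch, 0) for ch in text.lower()]


class ToyTranscriptionSource:
    """Hypothetical source whose 'files' are the transcriptions themselves."""

    def __init__(self, transcriptions):
        self.transcriptions = transcriptions
        chars = sorted(set("".join(transcriptions).lower()))
        # Id 0 is reserved for characters outside the vocabulary.
        self.vocabulary = {ch: i + 1 for i, ch in enumerate(chars)}

    def collect_files(self):
        # Returns text, not paths.
        return self.transcriptions

    def collect_features(self, text):
        return text_to_ids(text, self.vocabulary)


source = ToyTranscriptionSource(["ab", "ba c"])
features = [source.collect_features(t) for t in source.collect_files()]
print(features)  # prints [[2, 3], [3, 2, 1, 4]]
```
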
class nnmnkwii.datasets.ljspeech.WavFileDataSource(data_root)[source]

Wav file data source for LJSpeech dataset.

The data source collects wav files from LJSpeech. Users are expected to inherit the class and implement collect_features method, which defines how features are computed given a wav file path.

Parameters:data_root (str) – Data root.

numpy.ndarray – Metadata, shape (num_files x 3).


Collect wav files.

Returns:List of wav files.
Return type:list

Voice Conversion Challenge (VCC) 2016 (en)

You can download training data (181MB) and evaluation data (~56 MB) from

class nnmnkwii.datasets.vcc2016.WavFileDataSource(data_root, speakers, labelmap=None, max_files=None, training_data_root=None, evaluation_data_root=None, training=True)[source]

Wav file data source for Voice Conversion Challenge (VCC) 2016 dataset.

The data source collects wav files from VCC2016 dataset. Users are expected to inherit the class and implement collect_features method, which defines how features are computed given a wav file path.


VCC2016 datasets are composed of training data and evaluation data, which can be downloaded separately. data_root should point to the directory that contains both the training and evaluation data.

For example, the directory structure should look like this:

> tree -d ~/data/vcc2016/
├── evaluation_all
│   ├── SF1
│   ├── SF2
│   ├── SF3
│   ├── SM1
│   ├── SM2
│   ├── TF1
│   ├── TF2
│   ├── TM1
│   ├── TM2
│   └── TM3
└── vcc2016_training
    ├── SF1
    ├── SF2
    ├── SF3
    ├── SM1
    ├── SM2
    ├── TF1
    ├── TF2
    ├── TM1
    ├── TM2
    └── TM3
  • data_root (str) – Data root. By default, training and evaluation data are assumed to be placed at ${data_root}/vcc2016_training and ${data_root}/evaluation_all, respectively.
  • speakers (list) – List of speakers to find. Supported speaker names are SF1, SF2, SF3, SM1, SM2, TF1, TF2, TM1, TM2 and TM3.
  • labelmap (dict[optional]) – Dict of speaker labels. If None, labels are assigned incrementally (i.e., 0, 1, 2) to the specified speakers.
  • max_files (int) – Total number of files to be collected.
  • training_data_root – If specified, training data is searched for in this directory. If None, set to ${data_root}/vcc2016_training.
  • evaluation_data_root – If specified, evaluation data is searched for in this directory. If None, set to ${data_root}/evaluation_all.
  • training (bool) – Whether it collects training data or not. If False, it collects evaluation data.
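
The way the directory-related parameters interact can be summarised in a short sketch. The function resolve_vcc2016_dirs is hypothetical and only mirrors the defaults stated in the parameter list above; it does not touch the file system and is not the library's implementation.

```python
import os


def resolve_vcc2016_dirs(data_root, speakers, training=True,
                         training_data_root=None, evaluation_data_root=None):
    """Return one directory per speaker for the selected subset."""
    if training_data_root is None:
        training_data_root = os.path.join(data_root, "vcc2016_training")
    if evaluation_data_root is None:
        evaluation_data_root = os.path.join(data_root, "evaluation_all")
    # The training flag selects which of the two roots is searched.
    root = training_data_root if training else evaluation_data_root
    return [os.path.join(root, speaker) for speaker in speakers]


train_dirs = resolve_vcc2016_dirs("data/vcc2016", ["SF1", "TM1"])
eval_dirs = resolve_vcc2016_dirs("data/vcc2016", ["SF1"], training=False)
custom_dirs = resolve_vcc2016_dirs(
    "data/vcc2016", ["SF1"], training_data_root="elsewhere/train")
print(train_dirs, eval_dirs, custom_dirs)
```
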

numpy.ndarray – Speaker labels paired with collected files. Stored in collect_files. This is useful to build multi-speaker models.


Collect wav files for specific speakers.

Returns:List of collected wav files.
Return type:list

Voice statistics (ja)

You can download data (~720MB) from

class nnmnkwii.datasets.voice_statistics.TranscriptionDataSource(data_root, column='sentence', max_files=None)[source]

Transcription data source for VoiceStatistics dataset

Users are expected to inherit the class and implement collect_features method, which defines how features are computed given a transcription.

  • data_root (str) – Data root.
  • column (str) – sentence, yomi or monophone.
  • max_files (int) – Total number of files to be collected.
transcriptions (list): Transcriptions.

Collect text transcriptions.


Note that it returns list of transcriptions (str), not file paths.

Returns:List of text transcription.
Return type:list
class nnmnkwii.datasets.voice_statistics.WavFileDataSource(data_root, speakers, labelmap=None, max_files=None, emotions=['normal'])[source]

Wav file data source for Voice-statistics dataset.

The data source collects wav files from voice-statistics. Users are expected to inherit the class and implement collect_features method, which defines how features are computed given a wav file path.

  • data_root (str) – Data root
  • speakers (list) – List of speakers to load. Supported speaker names are fujitou, tsuchiya and uemura.
  • labelmap (dict[optional]) – Dict of speaker labels. If None, labels are assigned incrementally (i.e., 0, 1, 2) to the specified speakers.
  • max_files (int) – Total number of files to be collected.
  • emotions (list) – List of emotions we use. Supported names of emotions are angry, happy and normal.

numpy.ndarray – List of speaker identifiers determined by labelmap. Stored in collect_files.


Collect wav files for specific speakers.

Returns:List of collected wav files.
Return type:list

JSUT (ja)

JSUT (Japanese speech corpus of Saruwatari Lab, University of Tokyo).

You can download data (2.7GB) from

class nnmnkwii.datasets.jsut.TranscriptionDataSource(data_root, subsets=['basic5000'], validate=True)[source]

Transcription data source for JSUT dataset.

The data source collects text transcriptions from JSUT. Users are expected to inherit the class and implement collect_features method, which defines how features are computed given a transcription.

  • data_root (str) – Data root.
  • subsets (list) – Subsets. Supported subset names are basic5000, countersuffix26, loanword128, onomatopee300, precedent130, repeat500, travel1000, utparaphrase512 and voiceactress100. Default is ['basic5000'].
class nnmnkwii.datasets.jsut.WavFileDataSource(data_root, subsets=['basic5000'], validate=True)[source]

Wav file data source for JSUT dataset.

The data source collects wav files from JSUT. Users are expected to inherit the class and implement collect_features method, which defines how features are computed given a wav file path.

  • data_root (str) – Data root.
  • subsets (list) – Subsets. Supported subset names are basic5000, countersuffix26, loanword128, onomatopee300, precedent130, repeat500, travel1000, utparaphrase512 and voiceactress100. Default is ['basic5000'].