Datasets

This module provides dataset abstractions. In this library, a dataset represents a fixed-size set of features (e.g., acoustic features, linguistic features, duration features, etc.) composed of multiple utterances, supporting iteration and indexing.

Interface

To build datasets and represent a variety of features (linguistic, duration, acoustic, etc.) in a unified way, we define a couple of interfaces:

  1. FileDataSource
  2. Dataset

The former is an abstraction of file data sources; it specifies where to find the data and how to process them. Any FileDataSource must implement:

  • collect_files: specifies where to find source files (wav, lab, cmp, bin, etc.).
  • collect_features: specifies how to collect features (e.g., simply load them from a file, or run some feature extraction logic).
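As a concrete illustration, a custom data source might collect `.npy` feature files from a directory and load each one in `collect_features`. The sketch below is an assumption for the example (file layout, names, and shapes are made up), and it uses a plain class so the snippet stands alone; in practice the class would subclass `nnmnkwii.datasets.FileDataSource`.

```python
import os
import tempfile
from glob import glob

import numpy as np


class NpyFileDataSource:
    """Hypothetical data source: one .npy file per utterance.

    Written as a plain class so the snippet is self-contained; a real
    implementation would subclass nnmnkwii.datasets.FileDataSource.
    """

    def __init__(self, data_root):
        self.data_root = data_root

    def collect_files(self):
        # Where to find the source files.
        return sorted(glob(os.path.join(self.data_root, "*.npy")))

    def collect_features(self, path):
        # How to turn one file into a T x D feature array.
        return np.load(path)


# Tiny demo with fabricated data: two utterances, D = 4.
root = tempfile.mkdtemp()
for i, T in enumerate([10, 12]):
    np.save(os.path.join(root, f"utt{i}.npy"), np.zeros((T, 4), dtype=np.float32))

source = NpyFileDataSource(root)
files = source.collect_files()
features = [source.collect_features(f) for f in files]
print([x.shape for x in features])  # [(10, 4), (12, 4)]
```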

The latter is an abstraction of a dataset. Any dataset must implement the Dataset interface:

  • __getitem__: returns features (typically a two-dimensional numpy.ndarray).
  • __len__: returns the size of the dataset (e.g., the number of utterances).

One important point is that we use numpy.ndarray to represent features (though there might be exceptions). For example,

  • F0 trajectory as a T x 1 array, where T is the number of frames.
  • Spectrogram as a T x D array, where D is the number of feature dimensions.
  • Linguistic features as a T x D array.

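With hypothetical frame counts and dimensions (the numbers below are assumptions for illustration), these conventions look like:

```python
import numpy as np

T = 100  # number of frames (made up for the example)

f0 = np.zeros((T, 1))            # F0 trajectory: T x 1
spectrogram = np.zeros((T, 60))  # spectral features: T x D with D = 60
linguistic = np.zeros((T, 425))  # linguistic features: T x D

print(f0.shape, spectrogram.shape, linguistic.shape)  # (100, 1) (100, 60) (100, 425)
```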
class nnmnkwii.datasets.FileDataSource[source]

File data source interface.

Users are expected to implement a custom data source for their own data. All file data sources must implement this interface.

collect_features(*args)[source]

Collect features given path(s).

Parameters: args – File path, or a tuple of file paths.
Returns: T x D features represented by a 2d array.
Return type: 2darray
collect_files()[source]

Collect data source files

Returns: List of files, or a tuple of lists if you need multiple files to collect features.
Return type: List, or tuple of lists
class nnmnkwii.datasets.Dataset[source]

Dataset represents a fixed-sized set of features composed of multiple utterances.

Implementation

With the combination of FileDataSource and Dataset, we define some dataset implementations that can be used in typical situations.

Note

Note that we don’t provide a special iterator implementation (e.g., mini-batch iteration, multiprocessing, etc.). Users are expected to combine the datasets with other iterator implementations. PyTorch users can use the PyTorch DataLoader for mini-batch iteration and multiprocessing. Our dataset interface is exactly the same as PyTorch’s, so our datasets work with the PyTorch DataLoader seamlessly.
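For instance, the zero-padding collation that one would pass to a DataLoader as a `collate_fn` for variable-length utterances can be sketched in plain NumPy (the batch contents below are made up, and the function name is a placeholder, not part of any library):

```python
import numpy as np


def collate_pad(batch):
    """Pad a list of T_i x D arrays into one B x T_batchmax x D array,
    mimicking a simple collate_fn for variable-length utterances."""
    max_len = max(len(x) for x in batch)
    D = batch[0].shape[1]
    out = np.zeros((len(batch), max_len, D), dtype=batch[0].dtype)
    for i, x in enumerate(batch):
        out[i, : len(x)] = x  # copy the utterance; the tail stays zero
    return out


# Three utterances with different frame counts, D = 4.
batch = [np.ones((5, 4)), np.ones((8, 4)), np.ones((3, 4))]
padded = collate_pad(batch)
print(padded.shape)  # (3, 8, 4)
```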

Dataset that supports utterance-wise iteration

class nnmnkwii.datasets.FileSourceDataset(file_data_source)[source]

The most basic dataset implementation. It supports utterance-wise iteration and has a utility (the asarray method) to convert the dataset to a three-dimensional numpy.ndarray.

Speech features typically have different time resolutions (numbers of frames), so we cannot simply represent a dataset as an array. To address this, the dataset class represents the set of features as an N x T^max x D array by zero-padding, where N is the number of utterances, T^max is the maximum number of frames, and D is the feature dimension, respectively.
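The padding scheme can be sketched as follows: given a fixed padded length (analogous to the `asarray(padded_length)` call below), each T x D utterance is copied into a zero-initialized N x T^max x D array. The frame counts match the doctest example; D = 4 is made up to keep the demo small.

```python
import numpy as np

# Three utterances with different frame counts T, same dimension D = 4.
utterances = [np.ones((578, 4)), np.ones((675, 4)), np.ones((606, 4))]

padded_length = 1000  # must be >= the longest utterance
N, D = len(utterances), utterances[0].shape[1]
X = np.zeros((N, padded_length, D), dtype=np.float32)
for i, x in enumerate(utterances):
    X[i, : len(x)] = x  # frames beyond len(x) remain zero

print(X.shape)  # (3, 1000, 4)
```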

While this dataset loads features on demand when indexed, if you are dealing with a relatively small dataset, it might be useful to convert it to an array and then work on it with numpy/scipy functionality.

file_data_source

FileDataSource – Data source specifying 1) where to find the data to be loaded and 2) how to collect features from it.

collected_files

ndarray – The collected files.

Parameters:file_data_source (FileDataSource) – File data source.

Examples

>>> from nnmnkwii.util import example_file_data_sources_for_acoustic_model
>>> from nnmnkwii.datasets import FileSourceDataset
>>> X, Y = example_file_data_sources_for_acoustic_model()
>>> X, Y = FileSourceDataset(X), FileSourceDataset(Y)
>>> for (x, y) in zip(X, Y):
...     print(x.shape, y.shape)
...
(578, 425) (578, 187)
(675, 425) (675, 187)
(606, 425) (606, 187)
>>> X.asarray(1000).shape
(3, 1000, 425)
>>> Y.asarray(1000).shape
(3, 1000, 187)
asarray(padded_length, dtype=<class 'numpy.float32'>)[source]

Convert dataset to numpy array.

This tries to load the entire dataset into a single 3d numpy array.

Parameters: padded_length (int) – Maximum number of time frames expected.
Returns: N x T^max x D array.
Return type: 3d-array
class nnmnkwii.datasets.PaddedFileSourceDataset(file_data_source, padded_length)[source]

A basic dataset with padding. Very similar to FileSourceDataset: it supports utterance-wise iteration and has a utility (the asarray method) to convert the dataset to a three-dimensional numpy.ndarray.

The difference from FileSourceDataset is that this returns padded features as a T^max x D array from __getitem__, while FileSourceDataset returns an unpadded T x D array.

Parameters:
  • file_data_source (FileDataSource) – File data source.
  • padded_length (int) – Padded length.
file_data_source

FileDataSource

padded_length

int

Examples

>>> from nnmnkwii.util import example_file_data_sources_for_acoustic_model
>>> from nnmnkwii.datasets import PaddedFileSourceDataset
>>> X, Y = example_file_data_sources_for_acoustic_model()
>>> X, Y = PaddedFileSourceDataset(X, 1000), PaddedFileSourceDataset(Y, 1000)
>>> for (x, y) in zip(X, Y):
...     print(x.shape, y.shape)
...
(1000, 425) (1000, 187)
(1000, 425) (1000, 187)
(1000, 425) (1000, 187)
>>> X.asarray().shape
(3, 1000, 425)
>>> Y.asarray().shape
(3, 1000, 187)
class nnmnkwii.datasets.MemoryCacheDataset(dataset, cache_size=777)[source]

A thin dataset wrapper class with simple cache functionality. It supports utterance-wise iteration.

Parameters:
  • dataset (Dataset) – Dataset implementation to wrap.
  • cache_size (int) – Cache size (utterance unit).
dataset

Dataset – Dataset

cached_utterances

OrderedDict – Loaded utterances. Keys are utterance indices and values are numpy arrays.

cache_size

int – Cache size.

Examples

>>> from nnmnkwii.util import example_file_data_sources_for_acoustic_model
>>> from nnmnkwii.datasets import FileSourceDataset
>>> X, Y = example_file_data_sources_for_acoustic_model()
>>> X, Y = FileSourceDataset(X), FileSourceDataset(Y)
>>> from nnmnkwii.datasets import MemoryCacheDataset
>>> X, Y = MemoryCacheDataset(X), MemoryCacheDataset(Y)
>>> X.cached_utterances
OrderedDict()
>>> for (x, y) in zip(X, Y):
...     print(x.shape, y.shape)
...
(578, 425) (578, 187)
(675, 425) (675, 187)
(606, 425) (606, 187)
>>> len(X.cached_utterances)
3
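The caching behavior can be sketched as a thin wrapper around any indexable dataset: an OrderedDict keyed by utterance index, with oldest-first eviction when the cache is full. The class and the toy underlying dataset below are illustrations, not the library's actual code.

```python
from collections import OrderedDict

import numpy as np


class TinyCacheDataset:
    """Minimal sketch of an utterance-level cache (illustrative only)."""

    def __init__(self, dataset, cache_size=777):
        self.dataset = dataset
        self.cache_size = cache_size
        self.cached_utterances = OrderedDict()

    def __getitem__(self, idx):
        if idx not in self.cached_utterances:
            if len(self.cached_utterances) >= self.cache_size:
                # Evict the oldest entry (FIFO) to bound memory use.
                self.cached_utterances.popitem(last=False)
            self.cached_utterances[idx] = self.dataset[idx]
        return self.cached_utterances[idx]

    def __len__(self):
        return len(self.dataset)


# Toy underlying dataset: a list of T x D arrays.
base = [np.zeros((t, 2)) for t in (5, 6, 7)]
cached = TinyCacheDataset(base, cache_size=2)
_ = cached[0], cached[1], cached[2]  # the third access evicts index 0
print(sorted(cached.cached_utterances.keys()))  # [1, 2]
```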

Dataset that supports frame-wise iteration

class nnmnkwii.datasets.MemoryCacheFramewiseDataset(dataset, lengths, cache_size=777)[source]

A thin dataset wrapper class with simple cache functionality. It supports frame-wise iteration. Unlike the utterance-wise datasets above, you will need to explicitly give the number of time frames for each utterance at construction, since the class has to know the size of the dataset to implement __len__.

Note

If you are doing random access to the dataset, be careful to give a sufficiently large cache size to avoid frequent file re-loading.

Parameters:
  • dataset (Dataset) – Dataset implementation to wrap.
  • lengths (list) – Frame lengths for each utterance.
  • cache_size (int) – Cache size (utterance unit).
dataset

Dataset – Dataset

cached_utterances

OrderedDict – Loaded utterances.

cache_size

int – Cache size.

Examples
>>> from nnmnkwii.util import example_file_data_sources_for_acoustic_model
>>> from nnmnkwii.datasets import FileSourceDataset
>>> from nnmnkwii.datasets import MemoryCacheFramewiseDataset
>>> X, Y = example_file_data_sources_for_acoustic_model()
>>> X, Y = FileSourceDataset(X), FileSourceDataset(Y)
>>> len(X)
3
>>> lengths = [len(x) for x in X] # collect frame lengths
>>> X = MemoryCacheFramewiseDataset(X, lengths)
>>> Y = MemoryCacheFramewiseDataset(Y, lengths)
>>> len(X)
1859
>>> X[0].shape
(425,)
>>> Y[0].shape
(187,)
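Frame-wise indexing over variable-length utterances reduces to mapping a global frame index to an (utterance, frame) pair via cumulative lengths. A sketch of that mapping, using the frame lengths from the example above (the helper function name is made up for illustration):

```python
import numpy as np

lengths = [578, 675, 606]          # frames per utterance, as in the example
cumsum = np.cumsum([0] + lengths)  # [0, 578, 1253, 1859]


def global_to_local(frame_idx):
    """Map a global frame index to (utterance index, frame within utterance)."""
    utt = int(np.searchsorted(cumsum, frame_idx, side="right")) - 1
    return utt, frame_idx - int(cumsum[utt])


print(cumsum[-1])             # 1859 == len(dataset)
print(global_to_local(0))     # (0, 0)
print(global_to_local(578))   # (1, 0) -- first frame of the second utterance
print(global_to_local(1858))  # (2, 605) -- last frame overall
```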