Datasets¶
This module provides dataset abstractions. In this library, a dataset represents a fixed-sized set of features (e.g., acoustic features, linguistic features, duration features, etc.) composed of multiple utterances, supporting iteration and indexing.
Interface¶
To build datasets and represent a variety of features (linguistic, duration, acoustic, etc.) in a unified way, we define a couple of interfaces: FileDataSource and Dataset.
The former is an abstraction of file data sources; it specifies where to find the data and how to process it. Any FileDataSource must implement:

- collect_files: specifies where to find source files (wav, lab, cmp, bin, etc.).
- collect_features: specifies how to collect features (just load from a file, or run some feature extraction logic, etc.).

The latter is an abstraction of a dataset. Any dataset must implement the Dataset interface:

- __getitem__: returns features (typically a two-dimensional numpy.ndarray).
- __len__: returns the size of the dataset (e.g., the number of utterances).
One important point is that we use numpy.ndarray to represent features (there might be exceptions, though). For example:

- F0 trajectory as a T x 1 array, where T is the number of frames.
- Spectrogram as a T x D array, where D is the number of feature dimensions.
- Linguistic features as a T x D array.
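For concreteness, a minimal object satisfying the Dataset interface could look like the following sketch; the class name, lengths, and feature dimension are made up for illustration and are not part of the library.

import numpy as np

# A toy Dataset-like object: __getitem__ returns a T x D feature matrix and
# __len__ returns the number of utterances. Zeros stand in for real features.
class ToyDataset(object):
    def __init__(self):
        self.lengths = [578, 675, 606]  # frames per utterance (arbitrary)
        self.dim = 425                  # feature dimension (arbitrary)

    def __getitem__(self, idx):
        return np.zeros((self.lengths[idx], self.dim), dtype=np.float32)  # T x D

    def __len__(self):
        return len(self.lengths)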
class nnmnkwii.datasets.FileDataSource[source]¶
File data source interface.
Users are expected to implement a custom data source for their own data. All file data sources must implement this interface.
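As a hedged illustration, a custom data source for pre-computed features stored as .npy files might look like the sketch below; the directory layout and file format are assumptions for this example, not part of the library. Only the two methods of the interface are required.

import numpy as np
from glob import glob
from os.path import join
from nnmnkwii.datasets import FileDataSource

class NPYDataSource(FileDataSource):  # hypothetical data source
    def __init__(self, data_root):
        self.data_root = data_root

    def collect_files(self):
        # Where to find the source files.
        return sorted(glob(join(self.data_root, "*.npy")))

    def collect_features(self, path):
        # How to turn one collected file into a T x D feature matrix.
        return np.load(path).astype(np.float32)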
Implementation¶
With a combination of FileDataSource and Dataset, we define some dataset implementations that can be used in typical situations.
Note
Note that we don't provide a special iterator implementation (e.g., mini-batch iteration, multiprocessing, etc.). Users are expected to combine the datasets with another iterator implementation. PyTorch users can use PyTorch's DataLoader for mini-batch iteration and multiprocessing. Our dataset interface is exactly the same as PyTorch's, so PyTorch's DataLoader can be used seamlessly.
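As a sketch of the PyTorch usage suggested above (assuming PyTorch is installed; this is not library code), a padded dataset can be passed directly to torch.utils.data.DataLoader, since its fixed-length items can be stacked into mini-batches by the default collation:

from torch.utils.data import DataLoader
from nnmnkwii.util import example_file_data_sources_for_acoustic_model
from nnmnkwii.datasets import PaddedFileSourceDataset

X, _ = example_file_data_sources_for_acoustic_model()
X = PaddedFileSourceDataset(X, 1000)  # each item is a fixed-length T^max x D array

data_loader = DataLoader(X, batch_size=2, shuffle=True, num_workers=0)
for x_batch in data_loader:
    print(x_batch.shape)  # a torch tensor of shape (batch, T^max, D)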
Dataset that supports utterance-wise iteration¶
class nnmnkwii.datasets.FileSourceDataset(file_data_source)[source]¶
The most basic dataset implementation. It supports utterance-wise iteration and has a utility (the asarray method) to convert the dataset to a three-dimensional numpy.ndarray.

Speech features typically have different time resolutions, so we cannot simply represent the dataset as an array. To address this, the dataset class represents the set of features as an N x T^max x D array by padding with zeros, where N is the number of utterances, T^max is the maximum number of frames, and D is the feature dimension, respectively.

While this dataset loads features on demand when indexed, if you are dealing with a relatively small dataset, it might be useful to convert it to an array and then do whatever you like with numpy/scipy functionality.
file_data_source¶
FileDataSource – Data source to specify 1) where to find data to be loaded and 2) how to collect features from them.

collected_files¶
ndarray – Collected files are stored.

Parameters: file_data_source (FileDataSource) – File data source.

Examples
>>> from nnmnkwii.util import example_file_data_sources_for_acoustic_model
>>> from nnmnkwii.datasets import FileSourceDataset
>>> X, Y = example_file_data_sources_for_acoustic_model()
>>> X, Y = FileSourceDataset(X), FileSourceDataset(Y)
>>> for (x, y) in zip(X, Y):
...     print(x.shape, y.shape)
...
(578, 425) (578, 187)
(675, 425) (675, 187)
(606, 425) (606, 187)
>>> X.asarray(1000).shape
(3, 1000, 425)
>>> Y.asarray(1000).shape
(3, 1000, 187)
class nnmnkwii.datasets.PaddedFileSourceDataset(file_data_source, padded_length)[source]¶
Basic dataset with padding. Very similar to FileSourceDataset, it supports utterance-wise iteration and has a utility (the asarray method) to convert the dataset to a three-dimensional numpy.ndarray.

The difference from FileSourceDataset is that this returns padded features as a T^max x D array at __getitem__, while FileSourceDataset returns unpadded T x D arrays.

Parameters:
- file_data_source (FileDataSource) – File data source.
- padded_length (int) – Padded length.
file_data_source¶
FileDataSource

padded_length¶
int

Examples
>>> from nnmnkwii.util import example_file_data_sources_for_acoustic_model
>>> from nnmnkwii.datasets import PaddedFileSourceDataset
>>> X, Y = example_file_data_sources_for_acoustic_model()
>>> X, Y = PaddedFileSourceDataset(X, 1000), PaddedFileSourceDataset(Y, 1000)
>>> for (x, y) in zip(X, Y):
...     print(x.shape, y.shape)
...
(1000, 425) (1000, 187)
(1000, 425) (1000, 187)
(1000, 425) (1000, 187)
>>> X.asarray().shape
(3, 1000, 425)
>>> Y.asarray().shape
(3, 1000, 187)
class nnmnkwii.datasets.MemoryCacheDataset(dataset, cache_size=777)[source]¶
A thin dataset wrapper class that has simple cache functionality. It supports utterance-wise iteration.

Parameters:
- dataset (Dataset) – Dataset.
- cache_size (int) – Cache size.
dataset¶
Dataset – Dataset

cached_utterances¶
OrderedDict – Loaded utterances. Keys are utterance indices and values are numpy arrays.

cache_size¶
int – Cache size.

Examples
>>> from nnmnkwii.util import example_file_data_sources_for_acoustic_model
>>> from nnmnkwii.datasets import FileSourceDataset
>>> X, Y = example_file_data_sources_for_acoustic_model()
>>> X, Y = FileSourceDataset(X), FileSourceDataset(Y)
>>> from nnmnkwii.datasets import MemoryCacheDataset
>>> X, Y = MemoryCacheDataset(X), MemoryCacheDataset(Y)
>>> X.cached_utterances
OrderedDict()
>>> for (x, y) in zip(X, Y):
...     print(x.shape, y.shape)
...
(578, 425) (578, 187)
(675, 425) (675, 187)
(606, 425) (606, 187)
>>> len(X.cached_utterances)
3
Dataset that supports frame-wise iteration¶
class nnmnkwii.datasets.MemoryCacheFramewiseDataset(dataset, lengths, cache_size=777)[source]¶
A thin dataset wrapper class that has simple cache functionality. It supports frame-wise iteration. Unlike the utterance-wise datasets above, you will need to explicitly give the number of time frames for each utterance at construction, since the class has to know the size of the dataset to implement __len__.

Note

If you are doing random access to the dataset, please make sure to give a sufficiently large cache size to avoid frequent file re-loading (see the sketch after the examples below).

Parameters:
- dataset (Dataset) – Dataset.
- lengths (list) – Number of time frames for each utterance.
- cache_size (int) – Cache size.
dataset¶
Dataset – Dataset

cached_utterances¶
OrderedDict – Loaded utterances.

cache_size¶
int – Cache size.

Examples
>>> from nnmnkwii.util import example_file_data_sources_for_acoustic_model
>>> from nnmnkwii.datasets import FileSourceDataset
>>> from nnmnkwii.datasets import MemoryCacheFramewiseDataset
>>> X, Y = example_file_data_sources_for_acoustic_model()
>>> X, Y = FileSourceDataset(X), FileSourceDataset(Y)
>>> len(X)
3
>>> lengths = [len(x) for x in X]  # collect frame lengths
>>> X = MemoryCacheFramewiseDataset(X, lengths)
>>> Y = MemoryCacheFramewiseDataset(Y, lengths)
>>> len(X)
1859
>>> X[0].shape
(425,)
>>> Y[0].shape
(187,)
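The following sketch (not library code) illustrates the cache-size note above: for shuffled frame-level access, setting the cache size to cover all utterances avoids repeated file re-loading.

import numpy as np
from nnmnkwii.util import example_file_data_sources_for_acoustic_model
from nnmnkwii.datasets import FileSourceDataset, MemoryCacheFramewiseDataset

X, _ = example_file_data_sources_for_acoustic_model()
X = FileSourceDataset(X)
lengths = [len(x) for x in X]
# Cache covers all utterances, so random frame access never re-loads files.
X = MemoryCacheFramewiseDataset(X, lengths, cache_size=len(lengths))

for i in np.random.permutation(len(X))[:8]:
    frame = X[int(i)]  # a single D-dimensional frame vector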