Datasets¶
This module provides dataset abstractions. In this library, a dataset represents a fixed-size set of features (e.g., acoustic features, linguistic features, duration features, etc.) composed of multiple utterances, and supports iteration and indexing.
Interface¶
To build datasets and represent a variety of features (linguistic, duration, acoustic, etc.) in a unified way, we define a couple of interfaces.
The former is an abstraction of file data sources: where to find the data and how to process it. Any FileDataSource must implement:
collect_files
: specifies where to find source files (wav, lab, cmp, bin, etc.).
collect_features
: specifies how to collect features (just load from a file, or run some feature extraction logic, etc.).
The latter is an abstraction of a dataset. Any dataset must implement the
Dataset
interface:
__getitem__
: returns features (typically a two-dimensional numpy.ndarray).
__len__
: returns the size of the dataset (e.g., the number of utterances).
One important point is that we use numpy.ndarray
to represent features
(there might be exceptions, though). For example:
- F0 trajectory as a T x 1 array, where T is the number of frames.
- Spectrogram as a T x D array, where D is the number of feature dimensions.
- Linguistic features as a T x D array.
- class nnmnkwii.datasets.FileDataSource[source]¶
File data source interface.
Users are expected to implement a custom data source for their own data. All file data sources must implement this interface.
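As an illustration, a minimal custom data source might look like the following. This is a sketch only: the class name, the directory layout, and the text-file format are hypothetical, and a real implementation would subclass nnmnkwii.datasets.FileDataSource; what matters is that the two required methods are present.

```python
from glob import glob
from os.path import join

import numpy as np


class MyTextFeatureSource:
    """Hypothetical data source that loads 1-D features from .txt files."""

    def __init__(self, data_root):
        self.data_root = data_root

    def collect_files(self):
        # 1) Where to find the source files.
        return sorted(glob(join(self.data_root, "*.txt")))

    def collect_features(self, path):
        # 2) How to turn one file into a T x D feature array.
        return np.loadtxt(path).reshape(-1, 1)
```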
Implementation¶
Combining FileDataSource
and Dataset
, we define
several dataset implementations that can be used in typical situations.
Note
Note that we don’t provide special iterator implementations (e.g., mini-batch iteration, multiprocessing, etc.). Users are expected to combine the datasets with other iterator implementations. PyTorch users can use the PyTorch DataLoader for mini-batch iteration and multiprocessing. Our dataset interface is exactly the same as PyTorch’s, so it works with the PyTorch DataLoader seamlessly.
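Because the interface is just __getitem__ plus __len__, any generic batching loop can consume such a dataset. A plain-Python sketch of mini-batch iteration (the ToyDataset and iter_minibatches names are stand-ins for illustration, not part of the library; a PyTorch DataLoader does this and much more):

```python
import numpy as np


class ToyDataset:
    """Stand-in dataset obeying the __getitem__/__len__ interface."""

    def __init__(self, n=10, dim=3):
        self.data = [np.full(dim, i, dtype=float) for i in range(n)]

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return len(self.data)


def iter_minibatches(dataset, batch_size, shuffle=False, seed=0):
    """Yield lists of items, batch_size at a time."""
    indices = np.arange(len(dataset))
    if shuffle:
        np.random.RandomState(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]


batches = list(iter_minibatches(ToyDataset(10), batch_size=4))
```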
Dataset that supports utterance-wise iteration¶
- class nnmnkwii.datasets.FileSourceDataset(file_data_source)[source]¶
The most basic dataset implementation. It supports utterance-wise iteration and has a utility (the
asarray
method) to convert the dataset to a three-dimensional numpy.ndarray.
Speech features typically have different time resolutions, so we cannot simply represent a dataset as an array. To address this issue, the dataset class represents a set of features as an
N x T^max x D
array by padding with zeros, where N
is the number of utterances, T^max
is the maximum number of frames and D
is the dimension of the features, respectively.
While this dataset loads features on demand during indexing, if you are dealing with a relatively small dataset, it might be useful to convert it to an array and then do whatever you like with numpy/scipy functionality.
- file_data_source¶
FileDataSource – Data source that specifies 1) where to find the data to be loaded and 2) how to collect features from it.
- collected_files¶
ndarray – The collected files are stored here.
Parameters:
file_data_source (FileDataSource) – File data source.
Examples
>>> from nnmnkwii.util import example_file_data_sources_for_acoustic_model
>>> from nnmnkwii.datasets import FileSourceDataset
>>> X, Y = example_file_data_sources_for_acoustic_model()
>>> X, Y = FileSourceDataset(X), FileSourceDataset(Y)
>>> for (x, y) in zip(X, Y):
...     print(x.shape, y.shape)
...
(578, 425) (578, 187)
(675, 425) (675, 187)
(606, 425) (606, 187)
>>> X.asarray(1000).shape
(3, 1000, 425)
>>> Y.asarray(1000).shape
(3, 1000, 187)
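The zero-padding that asarray performs can be sketched in plain numpy. This is illustrative only (not the library's implementation); the function name to_padded_array is hypothetical:

```python
import numpy as np


def to_padded_array(features, padded_length=None):
    """Stack variable-length T_i x D arrays into one N x T^max x D array,
    zero-padding the time axis (a sketch of what asarray does)."""
    max_len = max(len(x) for x in features)
    T = padded_length if padded_length is not None else max_len
    assert T >= max_len, "padded_length must cover the longest utterance"
    D = features[0].shape[1]
    out = np.zeros((len(features), T, D), dtype=features[0].dtype)
    for i, x in enumerate(features):
        out[i, :len(x)] = x
    return out


# Shapes chosen to mirror the example above.
X = [np.ones((578, 425)), np.ones((675, 425)), np.ones((606, 425))]
```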
-
class nnmnkwii.datasets.PaddedFileSourceDataset(file_data_source, padded_length)[source]¶
Basic dataset with padding. Very similar to
FileSourceDataset
, it supports utterance-wise iteration and has a utility (the asarray
method) to convert the dataset to a three-dimensional numpy.ndarray.
The difference from
FileSourceDataset
is that this class returns padded features as a T^max x D
array at __getitem__
, while FileSourceDataset
returns a non-padded T x D
array.
Parameters:
- file_data_source (FileDataSource) – File data source.
- padded_length (int) – Padded length.
-
file_data_source
¶ FileDataSource
-
padded_length
¶ int
Examples
>>> from nnmnkwii.util import example_file_data_sources_for_acoustic_model
>>> from nnmnkwii.datasets import PaddedFileSourceDataset
>>> X, Y = example_file_data_sources_for_acoustic_model()
>>> X, Y = PaddedFileSourceDataset(X, 1000), PaddedFileSourceDataset(Y, 1000)
>>> for (x, y) in zip(X, Y):
...     print(x.shape, y.shape)
...
(1000, 425) (1000, 187)
(1000, 425) (1000, 187)
(1000, 425) (1000, 187)
>>> X.asarray().shape
(3, 1000, 425)
>>> Y.asarray().shape
(3, 1000, 187)
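The per-utterance padding performed at __getitem__ reduces to zero-extending a single T x D array along the time axis. A minimal sketch (the function name pad_utterance is hypothetical, not the library's API):

```python
import numpy as np


def pad_utterance(x, padded_length):
    """Zero-pad one T x D utterance to padded_length x D, a sketch of the
    array PaddedFileSourceDataset's __getitem__ returns."""
    T, D = x.shape
    assert padded_length >= T, "padded_length must be >= number of frames"
    out = np.zeros((padded_length, D), dtype=x.dtype)
    out[:T] = x
    return out
```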
-
class nnmnkwii.datasets.MemoryCacheDataset(dataset, cache_size=777)[source]¶
A thin dataset wrapper class that provides simple caching. It supports utterance-wise iteration.
Parameters:
- dataset (Dataset) – Dataset.
- cache_size (int) – Cache size.
- dataset¶
Dataset – Dataset.
- cached_utterances¶
OrderedDict – Loaded utterances. Keys are utterance indices and values are numpy arrays.
- cache_size¶
int – Cache size.
Examples
>>> from nnmnkwii.util import example_file_data_sources_for_acoustic_model
>>> from nnmnkwii.datasets import FileSourceDataset
>>> X, Y = example_file_data_sources_for_acoustic_model()
>>> X, Y = FileSourceDataset(X), FileSourceDataset(Y)
>>> from nnmnkwii.datasets import MemoryCacheDataset
>>> X, Y = MemoryCacheDataset(X), MemoryCacheDataset(Y)
>>> X.cached_utterances
OrderedDict()
>>> for (x, y) in zip(X, Y):
...     print(x.shape, y.shape)
...
(578, 425) (578, 187)
(675, 425) (675, 187)
(606, 425) (606, 187)
>>> len(X.cached_utterances)
3
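The caching behaviour can be sketched with an OrderedDict that evicts the oldest entry once cache_size is exceeded. This is illustrative only and not the library's exact implementation; the class name CachedDataset is hypothetical:

```python
from collections import OrderedDict


class CachedDataset:
    """Sketch of an utterance-level cache around any __getitem__/__len__
    dataset (hypothetical; not nnmnkwii's actual MemoryCacheDataset)."""

    def __init__(self, dataset, cache_size=777):
        self.dataset = dataset
        self.cache_size = cache_size
        self.cached_utterances = OrderedDict()

    def __getitem__(self, idx):
        if idx not in self.cached_utterances:
            if len(self.cached_utterances) >= self.cache_size:
                # Evict the oldest cached utterance.
                self.cached_utterances.popitem(last=False)
            self.cached_utterances[idx] = self.dataset[idx]
        return self.cached_utterances[idx]

    def __len__(self):
        return len(self.dataset)


# Any indexable object works as the wrapped dataset, e.g. a plain list.
X = CachedDataset(list(range(100)), cache_size=3)
```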
-
Dataset that supports frame-wise iteration¶
-
class nnmnkwii.datasets.MemoryCacheFramewiseDataset(dataset, lengths, cache_size=777)[source]¶
A thin dataset wrapper class that provides simple caching. It supports frame-wise iteration. Unlike the utterance-wise datasets above, you need to give the number of time frames for each utterance explicitly at construction, since the class has to know the size of the dataset to implement
__len__
.
.Note
If you do random access on the dataset, be careful to set a sufficiently large cache size to avoid frequent file re-loading.
Parameters:
- dataset (Dataset) – Dataset.
- lengths (list) – Number of time frames for each utterance.
- cache_size (int) – Cache size.
- dataset¶
Dataset – Dataset.
- cached_utterances¶
OrderedDict – Loaded utterances.
- cache_size¶
int – Cache size.
Examples
>>> from nnmnkwii.util import example_file_data_sources_for_acoustic_model
>>> from nnmnkwii.datasets import FileSourceDataset
>>> from nnmnkwii.datasets import MemoryCacheFramewiseDataset
>>> X, Y = example_file_data_sources_for_acoustic_model()
>>> X, Y = FileSourceDataset(X), FileSourceDataset(Y)
>>> len(X)
3
>>> lengths = [len(x) for x in X]  # collect frame lengths
>>> X = MemoryCacheFramewiseDataset(X, lengths)
>>> Y = MemoryCacheFramewiseDataset(Y, lengths)
>>> len(X)
1859
>>> X[0].shape
(425,)
>>> Y[0].shape
(187,)
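The frame-wise indexing itself reduces to mapping a global frame index onto an (utterance, frame-within-utterance) pair via cumulative lengths. A numpy sketch of that bookkeeping (the function name is hypothetical; the lengths mirror the example above):

```python
import numpy as np


def frame_to_utterance_index(lengths, frame_idx):
    """Map a global frame index to (utterance index, frame within utterance),
    given per-utterance frame lengths."""
    cumlen = np.cumsum(lengths)
    # First utterance whose cumulative length exceeds frame_idx.
    utt_idx = int(np.searchsorted(cumlen, frame_idx, side="right"))
    offset = frame_idx - (cumlen[utt_idx - 1] if utt_idx > 0 else 0)
    return utt_idx, int(offset)


lengths = [578, 675, 606]  # frame lengths as in the example above
```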
-