Datasets¶
This module provides dataset abstractions. In this library, a dataset represents a fixed-sized set of features (e.g., acoustic features, linguistic features, duration features, etc.) composed of multiple utterances, supporting iteration and indexing.
Interface¶
To build datasets and represent a variety of features (linguistic, duration, acoustic, etc.) in a unified way, we define a couple of interfaces: FileDataSource and Dataset.
The former is an abstraction of file data sources: where to find the data and how to process it. Any FileDataSource must implement:
collect_files
: specifies where to find source files (wav, lab, cmp, bin, etc.).
collect_features
: specifies how to collect features (just load from a file, or run some feature extraction logic, etc.).
The latter is an abstraction of a dataset. Any dataset must implement the
Dataset
interface:
__getitem__
: returns features (typically, a two-dimensional numpy.ndarray).
__len__
: returns the size of the dataset (e.g., the number of utterances).
One important point is that we use numpy.ndarray to represent features
(there might be exceptions, though). For example:

F0 trajectory as a T x 1 array, where T is the number of frames.
Spectrogram as a T x D array, where D is the number of feature dimensions.
Linguistic features as a T x D array.
class nnmnkwii.datasets.FileDataSource[source]¶
File data source interface.
Users are expected to implement a custom data source for their own data. All file data sources must implement this interface.
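For illustration, here is a minimal sketch of a custom FileDataSource; the directory layout and the use of one .npy feature file per utterance are assumptions made for the example, not part of the library:

from glob import glob
from os.path import join

import numpy as np

from nnmnkwii.datasets import FileDataSource


class NpyFileDataSource(FileDataSource):
    """Hypothetical data source: one T x D feature matrix per .npy file."""

    def __init__(self, data_root):
        self.data_root = data_root

    def collect_files(self):
        # Where to find the source files
        return sorted(glob(join(self.data_root, "*.npy")))

    def collect_features(self, path):
        # How to turn one collected file into a feature matrix
        return np.load(path)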
Implementation¶
With a combination of FileDataSource and Dataset, we define
several dataset implementations that can be used in typical situations.
Note
Note that we don’t provide special iterator implementations (e.g., mini-batch iteration, multiprocessing, etc.). Users are expected to combine the dataset with their own iterator implementation. PyTorch users can use PyTorch's DataLoader for mini-batch iteration and multiprocessing. Our dataset interface is exactly the same as PyTorch's, so DataLoader works seamlessly. See the tutorials for practical usage, and the sketch below.
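A minimal, hedged sketch of that combination, assuming PyTorch is installed and using PaddedFileSourceDataset (described below) so that all utterances share the same length and PyTorch's default batching works:

from torch.utils.data import DataLoader

from nnmnkwii.datasets import PaddedFileSourceDataset
from nnmnkwii.util import example_file_data_sources_for_acoustic_model

X, _ = example_file_data_sources_for_acoustic_model()
dataset = PaddedFileSourceDataset(X, padded_length=1000)

# Mini-batch iteration with shuffling; each batch is a float tensor of
# shape (batch_size, padded_length, D).
loader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=0)
for batch in loader:
    print(batch.shape)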
Dataset that supports utterance-wise iteration¶
class nnmnkwii.datasets.FileSourceDataset(file_data_source)[source]¶
Most basic dataset implementation. It supports utterance-wise iteration and has a utility (asarray method) to convert the dataset to a three-dimensional numpy.ndarray.
Speech features typically have different time resolutions, so we cannot simply represent a dataset as an array. To address the issue, the dataset class represents a set of features as an N x T^max x D array by padding with zeros, where N is the number of utterances, T^max is the maximum number of frames and D is the feature dimension, respectively.
While this dataset loads features on-demand when indexed, if you are dealing with a relatively small dataset, it might be useful to convert it to an array and then do whatever you like with numpy/scipy functionality.
file_data_source¶
Data source to specify 1) where to find data to be loaded and 2) how to collect features from them.
Type
FileDataSource
collected_files¶
Collected files are stored.
Type
ndarray
Parameters
file_data_source (FileDataSource) – File data source.
Examples
>>> from nnmnkwii.util import example_file_data_sources_for_acoustic_model
>>> from nnmnkwii.datasets import FileSourceDataset
>>> X, Y = example_file_data_sources_for_acoustic_model()
>>> X, Y = FileSourceDataset(X), FileSourceDataset(Y)
>>> for (x, y) in zip(X, Y):
...     print(x.shape, y.shape)
...
(578, 425) (578, 187)
(675, 425) (675, 187)
(606, 425) (606, 187)
>>> X.asarray(1000).shape
(3, 1000, 425)
>>> Y.asarray(1000).shape
(3, 1000, 187)
asarray(padded_length=None, dtype=<class 'numpy.float32'>, padded_length_guess=1000, verbose=0)[source]¶
Convert the dataset to a numpy array.
This tries to load the entire dataset into a single 3d numpy array.
Parameters
padded_length (int) – Number of maximum time frames to be expected. If None, it is set to the actual maximum time length.
dtype (numpy.dtype) – Numpy dtype.
padded_length_guess (int) – Initial guess of the maximum time length of the padded dataset array. Used if padded_length is None.
Returns
Array of shape N x T^max x D if padded_length is None, otherwise N x padded_length x D.
Return type
3d-array
class nnmnkwii.datasets.PaddedFileSourceDataset(file_data_source, padded_length)[source]¶
Basic dataset with padding. Very similar to FileSourceDataset, it supports utterance-wise iteration and has a utility (asarray method) to convert the dataset to a three-dimensional numpy.ndarray.
The difference from FileSourceDataset is that this returns padded features as a T^max x D array at __getitem__, while FileSourceDataset returns a non-padded T x D array.
Parameters
file_data_source (FileDataSource) – File data source.
padded_length (int) – Padded length.
file_data_source¶
Type
FileDataSource
Examples
>>> from nnmnkwii.util import example_file_data_sources_for_acoustic_model
>>> from nnmnkwii.datasets import PaddedFileSourceDataset
>>> X, Y = example_file_data_sources_for_acoustic_model()
>>> X, Y = PaddedFileSourceDataset(X, 1000), PaddedFileSourceDataset(Y, 1000)
>>> for (x, y) in zip(X, Y):
...     print(x.shape, y.shape)
...
(1000, 425) (1000, 187)
(1000, 425) (1000, 187)
(1000, 425) (1000, 187)
>>> X.asarray().shape
(3, 1000, 425)
>>> Y.asarray().shape
(3, 1000, 187)
asarray(dtype=<class 'numpy.float32'>, verbose=0)[source]¶
Convert the dataset to a numpy array.
This tries to load the entire dataset into a single 3d numpy array.
Parameters
dtype (numpy.dtype) – Numpy dtype.
Returns
Array of shape N x padded_length x D, where padded_length is the value given at construction.
Return type
3d-array
class nnmnkwii.datasets.MemoryCacheDataset(dataset, cache_size=777)[source]¶
A thin dataset wrapper class that has simple cache functionality. It supports utterance-wise iteration.
Parameters
dataset (Dataset) – Dataset implementation to wrap.
cache_size (int) – Cache size (in utterances).
cached_utterances¶
Loaded utterances. Keys are utterance indices and values are numpy arrays.
Type
OrderedDict
Examples
>>> from nnmnkwii.util import example_file_data_sources_for_acoustic_model
>>> from nnmnkwii.datasets import FileSourceDataset
>>> X, Y = example_file_data_sources_for_acoustic_model()
>>> X, Y = FileSourceDataset(X), FileSourceDataset(Y)
>>> from nnmnkwii.datasets import MemoryCacheDataset
>>> X, Y = MemoryCacheDataset(X), MemoryCacheDataset(Y)
>>> X.cached_utterances
OrderedDict()
>>> for (x, y) in zip(X, Y):
...     print(x.shape, y.shape)
...
(578, 425) (578, 187)
(675, 425) (675, 187)
(606, 425) (606, 187)
>>> len(X.cached_utterances)
3
Dataset that supports frame-wise iteration¶
class nnmnkwii.datasets.MemoryCacheFramewiseDataset(dataset, lengths, cache_size=777)[source]¶
A thin dataset wrapper class that has simple cache functionality. It supports frame-wise iteration. Unlike the utterance-wise datasets, you need to give the number of time frames for each utterance explicitly at construction, since the class has to know the size of the dataset to implement __len__.
Note
If you do random access to the dataset, be careful to give a sufficiently large cache size to avoid frequent file re-loading.
Parameters
dataset (Dataset) – Dataset implementation to wrap.
lengths (list) – Frame lengths for each utterance.
cache_size (int) – Cache size (in utterances).
cached_utterances¶
Loaded utterances.
Type
OrderedDict
Examples
>>> from nnmnkwii.util import example_file_data_sources_for_acoustic_model
>>> from nnmnkwii.datasets import FileSourceDataset
>>> from nnmnkwii.datasets import MemoryCacheFramewiseDataset
>>> X, Y = example_file_data_sources_for_acoustic_model()
>>> X, Y = FileSourceDataset(X), FileSourceDataset(Y)
>>> len(X)
3
>>> lengths = [len(x) for x in X]  # collect frame lengths
>>> X = MemoryCacheFramewiseDataset(X, lengths)
>>> Y = MemoryCacheFramewiseDataset(Y, lengths)
>>> len(X)
1859
>>> X[0].shape
(425,)
>>> Y[0].shape
(187,)
Builtin data sources¶
There are a couple of builtin file data sources for typical datasets to make it
easy to work with them. With the following data source implementations,
you only need to implement collect_features, which
defines what features you want from a wav file or text (depending on the data source).
If you want maximum flexibility in accessing a dataset, you may want to implement your
own data source instead of using the builtin ones.
For example, if you want to extract acoustic features from CMU Arctic wav files, you can write:
from nnmnkwii.preprocessing import trim_zeros_frames
from nnmnkwii.datasets import FileSourceDataset
from nnmnkwii.datasets import cmu_arctic
from scipy.io import wavfile
import numpy as np
import pysptk
import pyworld

class MyFileDataSource(cmu_arctic.WavFileDataSource):
    def __init__(self, data_root, speakers, max_files=100):
        super(MyFileDataSource, self).__init__(
            data_root, speakers, max_files=max_files)

    def collect_features(self, path):
        """Compute mel-cepstrum given a wav file."""
        fs, x = wavfile.read(path)
        x = x.astype(np.float64)
        f0, timeaxis = pyworld.dio(x, fs, frame_period=5)
        f0 = pyworld.stonemask(x, f0, timeaxis, fs)
        spectrogram = pyworld.cheaptrick(x, f0, timeaxis, fs)
        spectrogram = trim_zeros_frames(spectrogram)
        mc = pysptk.sp2mc(spectrogram, order=24, alpha=0.41)
        return mc.astype(np.float32)

DATA_ROOT = "/home/ryuichi/data/cmu_arctic/"  # your data path
data_source = MyFileDataSource(DATA_ROOT, speakers=["clb"], max_files=100)

# 100 wav files of `clb` speaker will be collected
X = FileSourceDataset(data_source)
assert len(X) == 100

for x in X:
    # do anything on acoustic features (e.g., save to disk)
    pass
More complete examples can be found in the tests directory of nnmnkwii and in the tutorial notebooks in nnmnkwii_gallery.
CMU Arctic (en)¶
You can download data from http://festvox.org/cmu_arctic/.
class nnmnkwii.datasets.cmu_arctic.WavFileDataSource(data_root, speakers, labelmap=None, max_files=None)[source]¶
Wav file data source for CMU Arctic dataset.
The data source collects wav files from CMU Arctic. Users are expected to inherit the class and implement the collect_features method, which defines how features are computed given a wav file path.
Parameters
data_root (str) – Data root.
speakers (list) – List of speakers to find. Supported speaker names are aew, ahw, aup, awb, axb, bdl, clb, eey, fem, gka, jmk, ksp, ljm, lnh, rms, rxr, slp and slt.
labelmap (dict[optional]) – Dict of speaker labels. If None, labels are assigned incrementally (i.e., 0, 1, 2) for the specified speakers.
max_files (int) – Total number of files to be collected.
labels¶
Speaker labels paired with collected files. Stored in collect_files. This is useful to build multi-speaker models.
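As a hedged sketch of a multi-speaker setup, reusing the MyFileDataSource class from the example above (the data path is an assumption):

from nnmnkwii.datasets import FileSourceDataset

# Two speakers; with labelmap=None, labels are assigned incrementally
# (clb -> 0, slt -> 1).
data_source = MyFileDataSource(
    "/path/to/cmu_arctic", speakers=["clb", "slt"], max_files=200)
X = FileSourceDataset(data_source)

# Files (and hence labels) are collected when the dataset is constructed,
# so data_source.labels[i] is the speaker label for X[i].
assert len(data_source.labels) == len(X)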
VCTK (en)¶
You can download data (15GB) from http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html.
Note
Note that VCTK data sources don't collect files for speaker 315, since there
are no transcriptions available for speaker 315's entries.
class nnmnkwii.datasets.vctk.TranscriptionDataSource
(data_root, speakers=['225', '226', '227', '228', '229', '230', '231', '232', '233', '234', '236', '237', '238', '239', '240', '241', '243', '244', '245', '246', '247', '248', '249', '250', '251', '252', '253', '254', '255', '256', '257', '258', '259', '260', '261', '262', '263', '264', '265', '266', '267', '268', '269', '270', '271', '272', '273', '274', '275', '276', '277', '278', '279', '280', '281', '282', '283', '284', '285', '286', '287', '288', '292', '293', '294', '295', '297', '298', '299', '300', '301', '302', '303', '304', '305', '306', '307', '308', '310', '311', '312', '313', '314', '316', '317', '318', '323', '326', '329', '330', '333', '334', '335', '336', '339', '340', '341', '343', '345', '347', '351', '360', '361', '362', '363', '364', '374', '376'], labelmap=None, max_files=None)[source]¶ Transcription data source for VCTK dataset.
The data source collects text transcriptions from VCTK. Users are expected to inherit the class and implement the collect_features method, which defines how features are computed given a transcription.
Parameters
data_root (str) – Data root.
speakers (list) – List of speakers to find. Speaker ids must be str. For supported speaker names, please refer to available_speakers defined in the module.
labelmap (dict[optional]) – Dict of speaker labels. If None, labels are assigned incrementally (i.e., 0, 1, 2) for the specified speakers.
max_files (int) – Total number of files to be collected.
speaker_info¶
Dict of speaker information dicts. Keys are speaker ids (str) and each value is speaker information consisting of AGE, GENDER and REGION.
labels¶
Speaker labels paired with collected files. Stored in collect_files. This is useful to build multi-speaker models.
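As a minimal, hedged sketch of using the VCTK transcription source (the data path is an assumption, and returning the raw text is only a placeholder for real feature extraction):

from nnmnkwii.datasets import FileSourceDataset, vctk


class RawTextDataSource(vctk.TranscriptionDataSource):
    def collect_features(self, text):
        # `text` is a single transcription; any text-to-feature conversion
        # (e.g., character or phoneme encoding) could go here instead.
        return text


data_source = RawTextDataSource("/path/to/VCTK-Corpus", speakers=["225"])
texts = FileSourceDataset(data_source)
print(len(texts), texts[0])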
class nnmnkwii.datasets.vctk.WavFileDataSource(data_root, speakers=['225', '226', '227', '228', '229', '230', '231', '232', '233', '234', '236', '237', '238', '239', '240', '241', '243', '244', '245', '246', '247', '248', '249', '250', '251', '252', '253', '254', '255', '256', '257', '258', '259', '260', '261', '262', '263', '264', '265', '266', '267', '268', '269', '270', '271', '272', '273', '274', '275', '276', '277', '278', '279', '280', '281', '282', '283', '284', '285', '286', '287', '288', '292', '293', '294', '295', '297', '298', '299', '300', '301', '302', '303', '304', '305', '306', '307', '308', '310', '311', '312', '313', '314', '316', '317', '318', '323', '326', '329', '330', '333', '334', '335', '336', '339', '340', '341', '343', '345', '347', '351', '360', '361', '362', '363', '364', '374', '376'], labelmap=None, max_files=None)[source]¶
Wav file data source for VCTK dataset.
The data source collects wav files from VCTK. Users are expected to inherit the class and implement the collect_features method, which defines how features are computed given a wav file path.
Parameters
data_root (str) – Data root.
speakers (list) – List of speakers to find. Speaker ids must be str. For supported speaker names, please refer to available_speakers defined in the module.
labelmap (dict[optional]) – Dict of speaker labels. If None, labels are assigned incrementally (i.e., 0, 1, 2) for the specified speakers.
max_files (int) – Total number of files to be collected.
speaker_info¶
Dict of speaker information dicts. Keys are speaker ids (str) and each value is speaker information consisting of AGE, GENDER and REGION.
labels¶
Speaker labels paired with collected files. Stored in collect_files. This is useful to build multi-speaker models.
LJ-Speech (en)¶
You can download data (2.6GB) from https://keithito.com/LJ-Speech-Dataset/.
class nnmnkwii.datasets.ljspeech.TranscriptionDataSource(data_root, normalized=False)[source]¶
Transcription data source for LJSpeech dataset.
The data source collects text transcriptions from LJSpeech. Users are expected to inherit the class and implement the collect_features method, which defines how features are computed given a transcription.
Parameters
data_root (str) – Data root.
normalized (bool) – Whether to use normalized transcriptions.
metadata¶
Metadata, shape (num_files x 3).
class nnmnkwii.datasets.ljspeech.WavFileDataSource(data_root)[source]¶
Wav file data source for LJSpeech dataset.
The data source collects wav files from LJSpeech. Users are expected to inherit the class and implement the collect_features method, which defines how features are computed given a wav file path.
Parameters
data_root (str) – Data root.
metadata¶
Metadata, shape (num_files x 3).
Voice Conversion Challenge (VCC) 2016 (en)¶
You can download training data (181MB) and evaluation data (~56 MB) from http://datashare.is.ed.ac.uk/handle/10283/2211.
class nnmnkwii.datasets.vcc2016.WavFileDataSource(data_root, speakers, labelmap=None, max_files=None, training_data_root=None, evaluation_data_root=None, training=True)[source]¶
Wav file data source for Voice Conversion Challenge (VCC) 2016 dataset.
The data source collects wav files from the VCC2016 dataset. Users are expected to inherit the class and implement the collect_features method, which defines how features are computed given a wav file path.
Note
VCC2016 datasets are composed of training data and evaluation data, which can be downloaded separately. data_root should point to the directory that contains both the training and evaluation data.
The directory structure should look like the following, for example:
> tree -d ~/data/vcc2016/
/home/ryuichi/data/vcc2016/
├── evaluation_all
│   ├── SF1
│   ├── SF2
│   ├── SF3
│   ├── SM1
│   ├── SM2
│   ├── TF1
│   ├── TF2
│   ├── TM1
│   ├── TM2
│   └── TM3
└── vcc2016_training
    ├── SF1
    ├── SF2
    ├── SF3
    ├── SM1
    ├── SM2
    ├── TF1
    ├── TF2
    ├── TM1
    ├── TM2
    └── TM3
Parameters
data_root (str) – Data root. By default, it's assumed that training and evaluation data are placed at ${data_root}/vcc2016_training and ${data_root}/evaluation_all, respectively.
speakers (list) – List of speakers to find. Supported speaker names are SF1, SF2, SF3, SM1, SM2, TF1, TF2, TM1, TM2 and TM3.
labelmap (dict[optional]) – Dict of speaker labels. If None, labels are assigned incrementally (i.e., 0, 1, 2) for the specified speakers.
max_files (int) – Total number of files to be collected.
training_data_root – If specified, training data is searched for in this directory. If None, it is set to ${data_root}/vcc2016_training.
evaluation_data_root – If specified, evaluation data is searched for in this directory. If None, it is set to ${data_root}/evaluation_all.
training (bool) – Whether to collect training data. If False, evaluation data is collected.
labels¶
Speaker labels paired with collected files. Stored in collect_files. This is useful to build multi-speaker models.
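As a hedged sketch of collecting VCC2016 training data for a source/target speaker pair (the data path is an assumption, and the feature extraction is only a placeholder that loads the waveform):

import numpy as np
from scipy.io import wavfile

from nnmnkwii.datasets import FileSourceDataset, vcc2016


class RawAudioDataSource(vcc2016.WavFileDataSource):
    def collect_features(self, path):
        # Placeholder feature extraction: just load the raw waveform
        _, x = wavfile.read(path)
        return x.astype(np.float32)


DATA_ROOT = "/path/to/vcc2016"  # contains vcc2016_training/ and evaluation_all/
src = FileSourceDataset(RawAudioDataSource(DATA_ROOT, speakers=["SF1"]))
tgt = FileSourceDataset(RawAudioDataSource(DATA_ROOT, speakers=["TM1"]))

# With training=True (the default), training utterances are collected;
# source/target pairs can then be aligned (e.g., with DTW) for voice conversion.
print(len(src), len(tgt))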
Voice statistics (ja)¶
You can download data (~720MB) from https://voice-statistics.github.io/.
class nnmnkwii.datasets.voice_statistics.TranscriptionDataSource(data_root, column='sentence', max_files=None)[source]¶
Transcription data source for Voice Statistics dataset.
Users are expected to inherit the class and implement the collect_features method, which defines how features are computed given a transcription.
Parameters
data_root (str) – Data root.
column (str) – Transcription column to use (e.g., 'sentence').
max_files (int) – Total number of files to be collected.
Attributes:
transcriptions (list): Transcriptions.
class nnmnkwii.datasets.voice_statistics.WavFileDataSource(data_root, speakers, labelmap=None, max_files=None, emotions=None)[source]¶
Wav file data source for Voice Statistics dataset.
The data source collects wav files from voice-statistics. Users are expected to inherit the class and implement the collect_features method, which defines how features are computed given a wav file path.
Parameters
data_root (str) – Data root.
speakers (list) – List of speakers to load. Supported speaker names are fujitou, tsuchiya and uemura.
labelmap (dict[optional]) – Dict of speaker labels. If None, labels are assigned incrementally (i.e., 0, 1, 2) for the specified speakers.
max_files (int) – Total number of files to be collected.
emotions (list) – List of emotions to use. Supported emotion names are angry, happy and normal.
labels¶
List of speaker identifiers determined by labelmap. Stored in collect_files.
JSUT (ja)¶
JSUT (Japanese speech corpus of Saruwatari Lab, University of Tokyo).
You can download data (2.7GB) from https://sites.google.com/site/shinnosuketakamichi/publication/jsut.
class nnmnkwii.datasets.jsut.TranscriptionDataSource(data_root, subsets=None, validate=True)[source]¶
Transcription data source for JSUT dataset.
The data source collects text transcriptions from JSUT. Users are expected to inherit the class and implement the collect_features method, which defines how features are computed given a transcription.
class nnmnkwii.datasets.jsut.WavFileDataSource(data_root, subsets=None, validate=True)[source]¶
Wav file data source for JSUT dataset.
The data source collects wav files from JSUT. Users are expected to inherit the class and implement the collect_features method, which defines how features are computed given a wav file path.
JVS (ja)¶
JVS: free Japanese multi-speaker voice corpus
You can download data from https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus.
class nnmnkwii.datasets.jvs.TranscriptionDataSource
(data_root, speakers=['jvs001', 'jvs002', 'jvs003', 'jvs004', 'jvs005', 'jvs006', 'jvs007', 'jvs008', 'jvs009', 'jvs010', 'jvs011', 'jvs012', 'jvs013', 'jvs014', 'jvs015', 'jvs016', 'jvs017', 'jvs018', 'jvs019', 'jvs020', 'jvs021', 'jvs022', 'jvs023', 'jvs024', 'jvs025', 'jvs026', 'jvs027', 'jvs028', 'jvs029', 'jvs030', 'jvs031', 'jvs032', 'jvs033', 'jvs034', 'jvs035', 'jvs036', 'jvs037', 'jvs038', 'jvs039', 'jvs040', 'jvs041', 'jvs042', 'jvs043', 'jvs044', 'jvs045', 'jvs046', 'jvs047', 'jvs048', 'jvs049', 'jvs050', 'jvs051', 'jvs052', 'jvs053', 'jvs054', 'jvs055', 'jvs056', 'jvs057', 'jvs058', 'jvs059', 'jvs060', 'jvs061', 'jvs062', 'jvs063', 'jvs064', 'jvs065', 'jvs066', 'jvs067', 'jvs068', 'jvs069', 'jvs070', 'jvs071', 'jvs072', 'jvs073', 'jvs074', 'jvs075', 'jvs076', 'jvs077', 'jvs078', 'jvs079', 'jvs080', 'jvs081', 'jvs082', 'jvs083', 'jvs084', 'jvs085', 'jvs086', 'jvs087', 'jvs088', 'jvs089', 'jvs090', 'jvs091', 'jvs092', 'jvs093', 'jvs094', 'jvs095', 'jvs096', 'jvs097', 'jvs098', 'jvs099', 'jvs100'], categories=None, labelmap=None, max_files=None)[source]¶ Transcription data source for JVS dataset
The data source collects text transcriptions from JVS. Users are expected to inherit the class and implement the collect_features method, which defines how features are computed given a transcription.
Parameters
data_root (str) – Data root.
speakers (list) – List of speakers to find. Speaker ids must be str. For supported speaker names, please refer to available_speakers defined in the module.
categories (list) – List of categories to collect; each item should be one of "parallel", "nonpara" and "whisper".
labelmap (dict[optional]) – Dict of speaker labels. If None, labels are assigned incrementally (i.e., 0, 1, 2) for the specified speakers.
max_files (int) – Total number of files to be collected.
speaker_info¶
Dict of speaker information dicts. Keys are speaker ids (str) and each value is speaker information consisting of gender, minf0 and maxf0.
labels¶
Speaker labels paired with collected files. Stored in collect_files. This is useful to build multi-speaker models.
class nnmnkwii.datasets.jvs.WavFileDataSource
(data_root, speakers=['jvs001', 'jvs002', 'jvs003', 'jvs004', 'jvs005', 'jvs006', 'jvs007', 'jvs008', 'jvs009', 'jvs010', 'jvs011', 'jvs012', 'jvs013', 'jvs014', 'jvs015', 'jvs016', 'jvs017', 'jvs018', 'jvs019', 'jvs020', 'jvs021', 'jvs022', 'jvs023', 'jvs024', 'jvs025', 'jvs026', 'jvs027', 'jvs028', 'jvs029', 'jvs030', 'jvs031', 'jvs032', 'jvs033', 'jvs034', 'jvs035', 'jvs036', 'jvs037', 'jvs038', 'jvs039', 'jvs040', 'jvs041', 'jvs042', 'jvs043', 'jvs044', 'jvs045', 'jvs046', 'jvs047', 'jvs048', 'jvs049', 'jvs050', 'jvs051', 'jvs052', 'jvs053', 'jvs054', 'jvs055', 'jvs056', 'jvs057', 'jvs058', 'jvs059', 'jvs060', 'jvs061', 'jvs062', 'jvs063', 'jvs064', 'jvs065', 'jvs066', 'jvs067', 'jvs068', 'jvs069', 'jvs070', 'jvs071', 'jvs072', 'jvs073', 'jvs074', 'jvs075', 'jvs076', 'jvs077', 'jvs078', 'jvs079', 'jvs080', 'jvs081', 'jvs082', 'jvs083', 'jvs084', 'jvs085', 'jvs086', 'jvs087', 'jvs088', 'jvs089', 'jvs090', 'jvs091', 'jvs092', 'jvs093', 'jvs094', 'jvs095', 'jvs096', 'jvs097', 'jvs098', 'jvs099', 'jvs100'], categories=None, labelmap=None, max_files=None)[source]¶ WavFile data source for JVS dataset.
The data source collects wav files from JVS. Users are expected to inherit the class and implement the collect_features method, which defines how features are computed given a wav file path.
Parameters
data_root (str) – Data root.
speakers (list) – List of speakers to find. Speaker ids must be str. For supported speaker names, please refer to available_speakers defined in the module.
categories (list) – List of categories to collect; each item should be one of "parallel", "nonpara" and "whisper".
labelmap (dict[optional]) – Dict of speaker labels. If None, labels are assigned incrementally (i.e., 0, 1, 2) for the specified speakers.
max_files (int) – Total number of files to be collected.
speaker_info¶
Dict of speaker information dicts. Keys are speaker ids (str) and each value is speaker information consisting of gender, minf0 and maxf0.
labels¶
Speaker labels paired with collected files. Stored in collect_files. This is useful to build multi-speaker models.
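Similarly, a hedged sketch of collecting only the "parallel" recordings for a couple of JVS speakers (the data path is an assumption, and the feature extraction is only a placeholder):

import numpy as np
from scipy.io import wavfile

from nnmnkwii.datasets import FileSourceDataset, jvs


class RawAudioDataSource(jvs.WavFileDataSource):
    def collect_features(self, path):
        # Placeholder feature extraction: just load the raw waveform
        _, x = wavfile.read(path)
        return x.astype(np.float32)


data_source = RawAudioDataSource(
    "/path/to/jvs_ver1",
    speakers=["jvs001", "jvs002"],
    categories=["parallel"])
X = FileSourceDataset(data_source)
# labels pairs each collected file with its speaker label (jvs001 -> 0, jvs002 -> 1)
print(len(X), len(data_source.labels))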