We propose a novel description-based controllable text-to-speech (TTS) method with cross-lingual control capability. To address the lack of audio-description paired data in the target language, we combine a TTS model trained on the target language with a description control model trained on another language, which maps input text descriptions to the conditional features of the TTS model. These two models share disentangled timbre and style representations based on self-supervised learning (SSL), allowing for disentangled voice control, such as controlling speaking styles while retaining the original timbre. Furthermore, because the SSL-based timbre and style representations are language-agnostic, combining the TTS and description control models while sharing the same embedding space effectively enables cross-lingual control of voice characteristics. Experiments on English and Japanese TTS demonstrate that our method achieves high naturalness and controllability for both languages, even though no Japanese audio-description pairs are used.
Figure 1: Overview of our proposed TTS framework. A NANSY++ model (blue) is used as the backbone model to obtain disentangled representations and convert them back to the waveform. The TTS acoustic model (red) converts text to NANSY++ features. It utilizes a style encoder and feature aggregators (aggr.) to extract style embeddings. The description control model (green) predicts style and timbre embeddings from the input text descriptions.
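The data flow in Figure 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `VoiceCondition`, `control_style_only`, and `synthesize` are hypothetical names, the three models are passed in as plain callables, and embeddings are shown as lists of floats. None of this is the authors' code or the NANSY++ API.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional

Vector = List[float]


@dataclass
class VoiceCondition:
    """Disentangled conditioning shared by both models (hypothetical container)."""
    timbre: Vector
    style: Vector


def control_style_only(reference: VoiceCondition,
                       predicted: VoiceCondition) -> VoiceCondition:
    """Keep the reference timbre and take only the style from the description."""
    return VoiceCondition(timbre=reference.timbre, style=predicted.style)


def synthesize(text: str,
               description: str,
               describe: Callable[[str], VoiceCondition],
               acoustic_model: Callable[[str, VoiceCondition], Any],
               backbone: Callable[[Any], Any],
               reference: Optional[VoiceCondition] = None) -> Any:
    """Description -> embeddings -> backbone features -> waveform.

    `describe` stands in for the description control model, `acoustic_model`
    for the TTS acoustic model, and `backbone` for the waveform synthesis
    side of the backbone model. All three are assumed placeholders.
    """
    predicted = describe(description)
    # With a reference voice, only the speaking style is changed.
    condition = (predicted if reference is None
                 else control_style_only(reference, predicted))
    features = acoustic_model(text, condition)
    return backbone(features)
```

Because the description control model and the acoustic model only meet at the shared embedding space, either callable can in principle be trained on a different language, which is the property the cross-lingual setup relies on.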
Comparison of different systems
English TTS
Utterance ID: 260_123288_000004_000001
``Descriptions of the speaker's voice are very masculine, very adult-like, very middle-aged, very thick, very tensed, very powerful, bright, slightly soft, muffled, fluent. A man speaks quickly with average pitch."
Reference
CL-NANSY-TTS-w2v (ours)
CL-NANSY-TTS
CL-PromptTTS++-w2v
ML-PromptTTS++
CL-PromptTTS++
ZS-NANSY-TTS-w2v
Utterance ID: 1284_1180_000053_000001
``Descriptions of the speaker's voice are slightly masculine, feminine, very gender-neutral, adult-like, slightly thick, slightly tensed, slightly weak, bright, slightly soft, very clear. A woman speaks with low pitch and normal speed."
Reference
CL-NANSY-TTS-w2v (ours)
CL-NANSY-TTS
CL-PromptTTS++-w2v
ML-PromptTTS++
CL-PromptTTS++
ZS-NANSY-TTS-w2v
Utterance ID: 1089_134686_000021_000000
``Descriptions of the speaker's voice are very masculine, very adult-like, middle-aged, very thick, slightly tensed, slightly weak, slightly bright, slightly soft, slightly clear, slightly fluent. A low-pitched man speaks slowly."
Reference
CL-NANSY-TTS-w2v (ours)
CL-NANSY-TTS
CL-PromptTTS++-w2v
ML-PromptTTS++
CL-PromptTTS++
ZS-NANSY-TTS-w2v
Utterance ID: 1580_141084_000031_000000
``Descriptions of the speaker's voice are very feminine, adult-like, slightly middle-aged, thin, very tensed, very powerful, very bright, very hard, very clear, very fluent. A woman speaks with normal pitch and normal speaking speed."
Reference
CL-NANSY-TTS-w2v (ours)
CL-NANSY-TTS
CL-PromptTTS++-w2v
ML-PromptTTS++
CL-PromptTTS++
ZS-NANSY-TTS-w2v
Japanese TTS
Note that the descriptions of speaker characteristics are taken from the English references.
Utterance ID: 260_123288_000004_000001
``Descriptions of the speaker's voice are very masculine, very adult-like, very middle-aged, very thick, very tensed, very powerful, bright, slightly soft, muffled, fluent. A man speaks quickly with average pitch."
Reference
CL-NANSY-TTS-w2v (ours)
CL-NANSY-TTS
CL-PromptTTS++
CL-PromptTTS++-w2v
ZS-NANSY-TTS-w2v
Utterance ID: 1284_1180_000053_000001
``Descriptions of the speaker's voice are slightly masculine, feminine, very gender-neutral, adult-like, slightly thick, slightly tensed, slightly weak, bright, slightly soft, very clear. A woman speaks with low pitch and normal speed."
Reference
CL-NANSY-TTS-w2v (ours)
CL-NANSY-TTS
CL-PromptTTS++
CL-PromptTTS++-w2v
ZS-NANSY-TTS-w2v
Utterance ID: 1089_134686_000021_000000
``Descriptions of the speaker's voice are very masculine, very adult-like, middle-aged, very thick, slightly tensed, slightly weak, slightly bright, slightly soft, slightly clear, slightly fluent. A low-pitched man speaks slowly."
Reference
CL-NANSY-TTS-w2v (ours)
CL-NANSY-TTS
CL-PromptTTS++
CL-PromptTTS++-w2v
ZS-NANSY-TTS-w2v
Utterance ID: 1580_141084_000031_000000
``Descriptions of the speaker's voice are very feminine, adult-like, slightly middle-aged, thin, very tensed, very powerful, very bright, very hard, very clear, very fluent. A woman speaks with normal pitch and normal speaking speed."
Reference
CL-NANSY-TTS-w2v (ours)
CL-NANSY-TTS
CL-PromptTTS++
CL-PromptTTS++-w2v
ZS-NANSY-TTS-w2v
Style controllability samples
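The samples in this section pair a fixed voice description with one sentence that sets pitch or speaking speed at one of five levels. A minimal sketch of how such prompts could be assembled follows. The helper `build_prompts` is hypothetical, and joining the two parts with a period and a space is an assumption, since the page does not state how the description and the control sentence are concatenated.

```python
from typing import List

# Level wording copied from the prompts listed on this page.
PITCH_LEVELS = ["very low", "low", "normal", "high", "very high"]
SPEED_LEVELS = ["very slow", "slow", "normal", "fast", "very fast"]


def build_prompts(voice_description: str, attribute: str) -> List[str]:
    """Return five prompts that vary one attribute and keep the rest fixed."""
    if attribute == "pitch":
        sentences = [f"He speaks with a {level} pitch" for level in PITCH_LEVELS]
    elif attribute == "speed":
        sentences = [f"He speaks with a {level} speaking speed"
                     for level in SPEED_LEVELS]
    else:
        raise ValueError(f"unsupported attribute: {attribute!r}")
    # Assumed joining rule: description, period, space, control sentence.
    base = voice_description.rstrip(". ")
    return [f"{base}. {sentence}." for sentence in sentences]


if __name__ == "__main__":
    description = ("Descriptions of the speaker's voice are very feminine, "
                   "adult-like, slightly middle-aged, thin, very clear")
    for level, prompt in enumerate(build_prompts(description, "pitch"), start=1):
        print(level, prompt)
```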
Pitch control (en)
Utterance ID: 260_123288_000004_000001
``Descriptions of the speaker's voice are very masculine, very adult-like, very middle-aged, very thick, very tensed, very powerful, bright, slightly soft, muffled, fluent"
1. He speaks with a very low pitch
2. He speaks with a low pitch
3. He speaks with a normal pitch
4. He speaks with a high pitch
5. He speaks with a very high pitch
Utterance ID: 1284_1180_000053_000001
``Descriptions of the speaker's voice are slightly masculine, feminine, very gender-neutral, adult-like, slightly thick, slightly tensed, slightly weak, bright, slightly soft, very clear"
1. He speaks with a very low pitch
2. He speaks with a low pitch
3. He speaks with a normal pitch
4. He speaks with a high pitch
5. He speaks with a very high pitch
Utterance ID: 1089_134686_000021_000000
``Descriptions of the speaker's voice are very masculine, very adult-like, middle-aged, very thick, slightly tensed, slightly weak, slightly bright, slightly soft, slightly clear, slightly fluent"
1. He speaks with a very low pitch
2. He speaks with a low pitch
3. He speaks with a normal pitch
4. He speaks with a high pitch
5. He speaks with a very high pitch
Utterance ID: 1580_141084_000031_000000
``Descriptions of the speaker's voice are very feminine, adult-like, slightly middle-aged, thin, very tensed, very powerful, very bright, very hard, very clear, very fluent"
1. He speaks with a very low pitch
2. He speaks with a low pitch
3. He speaks with a normal pitch
4. He speaks with a high pitch
5. He speaks with a very high pitch
Speed control (en)
Utterance ID: 260_123288_000004_000001
``Descriptions of the speaker's voice are very masculine, very adult-like, very middle-aged, very thick, very tensed, very powerful, bright, slightly soft, muffled, fluent"
1. He speaks with a very slow speaking speed
2. He speaks with a slow speaking speed
3. He speaks with a normal speaking speed
4. He speaks with a fast speaking speed
5. He speaks with a very fast speaking speed
Utterance ID: 1284_1180_000053_000001
``Descriptions of the speaker's voice are slightly masculine, feminine, very gender-neutral, adult-like, slightly thick, slightly tensed, slightly weak, bright, slightly soft, very clear"
1. He speaks with a very slow speaking speed
2. He speaks with a slow speaking speed
3. He speaks with a normal speaking speed
4. He speaks with a fast speaking speed
5. He speaks with a very fast speaking speed
Utterance ID: 1089_134686_000021_000000
``Descriptions of the speaker's voice are very masculine, very adult-like, middle-aged, very thick, slightly tensed, slightly weak, slightly bright, slightly soft, slightly clear, slightly fluent"
1. He speaks with a very slow speaking speed
2. He speaks with a slow speaking speed
3. He speaks with a normal speaking speed
4. He speaks with a fast speaking speed
5. He speaks with a very fast speaking speed
Utterance ID: 1580_141084_000031_000000
``Descriptions of the speaker's voice are very feminine, adult-like, slightly middle-aged, thin, very tensed, very powerful, very bright, very hard, very clear, very fluent"
1. He speaks with a very slow speaking speed
2. He speaks with a slow speaking speed
3. He speaks with a normal speaking speed
4. He speaks with a fast speaking speed
5. He speaks with a very fast speaking speed
Pitch control (ja)
Utterance ID: 260_123288_000004_000001
``Descriptions of the speaker's voice are very masculine, very adult-like, very middle-aged, very thick, very tensed, very powerful, bright, slightly soft, muffled, fluent"
1. He speaks with a very low pitch
2. He speaks with a low pitch
3. He speaks with a normal pitch
4. He speaks with a high pitch
5. He speaks with a very high pitch
Utterance ID: 1284_1180_000053_000001
``Descriptions of the speaker's voice are slightly masculine, feminine, very gender-neutral, adult-like, slightly thick, slightly tensed, slightly weak, bright, slightly soft, very clear"
1. He speaks with a very low pitch
2. He speaks with a low pitch
3. He speaks with a normal pitch
4. He speaks with a high pitch
5. He speaks with a very high pitch
Utterance ID: 1089_134686_000021_000000
``Descriptions of the speaker's voice are very masculine, very adult-like, middle-aged, very thick, slightly tensed, slightly weak, slightly bright, slightly soft, slightly clear, slightly fluent"
1. He speaks with a very low pitch
2. He speaks with a low pitch
3. He speaks with a normal pitch
4. He speaks with a high pitch
5. He speaks with a very high pitch
Utterance ID: 1580_141084_000031_000000
``Descriptions of the speaker's voice are very feminine, adult-like, slightly middle-aged, thin, very tensed, very powerful, very bright, very hard, very clear, very fluent"
1. He speaks with a very low pitch
2. He speaks with a low pitch
3. He speaks with a normal pitch
4. He speaks with a high pitch
5. He speaks with a very high pitch
Speed control (ja)
Utterance ID: 260_123288_000004_000001
``Descriptions of the speaker's voice are very masculine, very adult-like, very middle-aged, very thick, very tensed, very powerful, bright, slightly soft, muffled, fluent"
1. He speaks with a very slow speaking speed
2. He speaks with a slow speaking speed
3. He speaks with a normal speaking speed
4. He speaks with a fast speaking speed
5. He speaks with a very fast speaking speed
Utterance ID: 1284_1180_000053_000001
``Descriptions of the speaker's voice are slightly masculine, feminine, very gender-neutral, adult-like, slightly thick, slightly tensed, slightly weak, bright, slightly soft, very clear"
1. He speaks with a very slow speaking speed
2. He speaks with a slow speaking speed
3. He speaks with a normal speaking speed
4. He speaks with a fast speaking speed
5. He speaks with a very fast speaking speed
Utterance ID: 1089_134686_000021_000000
``Descriptions of the speaker's voice are very masculine, very adult-like, middle-aged, very thick, slightly tensed, slightly weak, slightly bright, slightly soft, slightly clear, slightly fluent"
1. He speaks with a very slow speaking speed
2. He speaks with a slow speaking speed
3. He speaks with a normal speaking speed
4. He speaks with a fast speaking speed
5. He speaks with a very fast speaking speed
Utterance ID: 1580_141084_000031_000000
``Descriptions of the speaker's voice are very feminine, adult-like, slightly middle-aged, thin, very tensed, very powerful, very bright, very hard, very clear, very fluent"