Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control

Ryuichi Yamamoto, Yuma Shirahata, Masaya Kawamura, Kentaro Tachibana

Sep 9, 2024

Preprint: arXiv:2409.17452 (Submitted to ICASSP 2025)

Abstract

We propose a novel description-based controllable text-to-speech (TTS) method with cross-lingual control capability. To address the lack of audio-description paired data in the target language, we combine a TTS model trained on the target language with a description control model trained on another language, which maps input text descriptions to the conditional features of the TTS model. These two models share disentangled timbre and style representations based on self-supervised learning (SSL), allowing for disentangled voice control, such as controlling speaking styles while retaining the original timbre. Furthermore, because the SSL-based timbre and style representations are language-agnostic, combining the TTS and description control models while sharing the same embedding space effectively enables cross-lingual control of voice characteristics. Experiments on English and Japanese TTS demonstrate that our method achieves high naturalness and controllability for both languages, even though no Japanese audio-description pairs are used.

Figure

Figure 1: Overview of our proposed TTS framework. A NANSY++ model (blue) is used as the backbone model to obtain disentangled representations and convert them back to the waveform. The TTS acoustic model (red) converts text to NANSY++ features. It utilizes a style encoder and feature aggregators (aggr.) to extract style embeddings. The description control model (green) predicts style and timbre embeddings from the input text descriptions.
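
The data flow in Figure 1 can be sketched roughly as follows. This is a toy illustration only: every function name, class name, and dimension below is hypothetical, and the "models" are deterministic stubs rather than trained networks, so nothing here reflects the authors' actual implementation. The sketch only shows that the description control model and the TTS acoustic model meet at a shared (timbre, style) embedding interface, which is what lets the two be trained on different languages.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List

EMB_DIM = 4  # hypothetical embedding size, chosen only for this sketch


@dataclass
class VoiceEmbeddings:
    """Shared interface between the two models (assumed structure)."""
    timbre: List[float]
    style: List[float]


def _pseudo_embedding(text: str, dim: int = EMB_DIM) -> List[float]:
    """Stand-in for a learned encoder: hash the text into `dim` floats in [0, 1]."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i] / 255.0 for i in range(dim)]


def description_control_model(description: str) -> VoiceEmbeddings:
    """Stub for the green block: text description -> timbre and style embeddings."""
    return VoiceEmbeddings(
        timbre=_pseudo_embedding("timbre:" + description),
        style=_pseudo_embedding("style:" + description),
    )


def tts_acoustic_model(text: str, voice: VoiceEmbeddings) -> Dict[str, object]:
    """Stub for the red block: target-language text + embeddings -> backbone features."""
    return {"text": text, "timbre": voice.timbre, "style": voice.style}


def backbone_synthesizer(features: Dict[str, object]) -> List[float]:
    """Stub for the blue block: features -> waveform (placeholder silent samples)."""
    return [0.0] * len(str(features["text"]))


def synthesize(text: str, description: str) -> List[float]:
    """Chain the three stubs; the description may be in a different language from the text."""
    voice = description_control_model(description)
    features = tts_acoustic_model(text, voice)
    return backbone_synthesizer(features)


waveform = synthesize("こんにちは", "A man speaks quickly with average pitch.")
```

In this sketch the Japanese input text and the English description never interact directly; they are connected only through the embeddings returned by `description_control_model`.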

Comparison of different systems

English TTS

Utterance ID: 260_123288_000004_000001

"Descriptions of the speaker's voice are very masculine, very adult-like, very middle-aged, very thick, very tensed, very powerful, bright, slightly soft, muffled, fluent. A man speaks quickly with average pitch."

Reference	CL-NANSY-TTS-w2v (ours)	CL-NANSY-TTS

CL-PromptTTS++-w2v	ML-PromptTTS++	CL-PromptTTS++

ZS-NANSY-TTS-w2v

Utterance ID: 1284_1180_000053_000001

"Descriptions of the speaker's voice are slightly masculine, feminine, very gender-neutral, adult-like, slightly thick, slightly tensed, slightly weak, bright, slightly soft, very clear. A woman speaks with low pitch and normal speed."

Reference	CL-NANSY-TTS-w2v (ours)	CL-NANSY-TTS

CL-PromptTTS++-w2v	ML-PromptTTS++	CL-PromptTTS++

ZS-NANSY-TTS-w2v

Utterance ID: 1089_134686_000021_000000

"Descriptions of the speaker's voice are very masculine, very adult-like, middle-aged, very thick, slightly tensed, slightly weak, slightly bright, slightly soft, slightly clear, slightly fluent. A low-pitched man speaks slowly."

Reference	CL-NANSY-TTS-w2v (ours)	CL-NANSY-TTS

CL-PromptTTS++-w2v	ML-PromptTTS++	CL-PromptTTS++

ZS-NANSY-TTS-w2v

Utterance ID: 1580_141084_000031_000000

"Descriptions of the speaker's voice are very feminine, adult-like, slightly middle-aged, thin, very tensed, very powerful, very bright, very hard, very clear, very fluent. A woman speaks with normal pitch and normal speaking speed."

Reference	CL-NANSY-TTS-w2v (ours)	CL-NANSY-TTS

CL-PromptTTS++-w2v	ML-PromptTTS++	CL-PromptTTS++

ZS-NANSY-TTS-w2v

Japanese TTS

Note that the descriptions of speaker characteristics are taken from the English references.

Utterance ID: 260_123288_000004_000001

Reference	CL-NANSY-TTS-w2v (ours)	CL-NANSY-TTS

CL-PromptTTS++	CL-PromptTTS++-w2v	ZS-NANSY-TTS-w2v

Utterance ID: 1284_1180_000053_000001

Reference	CL-NANSY-TTS-w2v (ours)	CL-NANSY-TTS

CL-PromptTTS++	CL-PromptTTS++-w2v	ZS-NANSY-TTS-w2v

Utterance ID: 1089_134686_000021_000000

Reference	CL-NANSY-TTS-w2v (ours)	CL-NANSY-TTS

CL-PromptTTS++	CL-PromptTTS++-w2v	ZS-NANSY-TTS-w2v

Utterance ID: 1580_141084_000031_000000

Reference	CL-NANSY-TTS-w2v (ours)	CL-NANSY-TTS

CL-PromptTTS++	CL-PromptTTS++-w2v	ZS-NANSY-TTS-w2v

Style controllability samples

Pitch control (en)

Utterance ID: 260_123288_000004_000001

"Descriptions of the speaker's voice are very masculine, very adult-like, very middle-aged, very thick, very tensed, very powerful, bright, slightly soft, muffled, fluent"

1. He speaks with a very low pitch
2. He speaks with a low pitch
3. He speaks with a normal pitch
4. He speaks with a high pitch
5. He speaks with a very high pitch

1	2	3

4	5
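
As a rough illustration of how the five prompts for each sample could be assembled, the sketch below appends one style sentence to a fixed speaker description. The level names and sentence templates are copied from the pitch-control and speed-control lists on this page; the function name `build_prompts` and the way the two parts are joined (a period and a space) are assumptions made for this example, not a documented detail of the authors' pipeline.

```python
from typing import List

# Level names as they appear in the prompt lists on this page.
PITCH_LEVELS = ["very low", "low", "normal", "high", "very high"]
SPEED_LEVELS = ["very slow", "slow", "normal", "fast", "very fast"]


def build_prompts(base_description: str, attribute: str) -> List[str]:
    """Return five prompts that vary one attribute while keeping the description fixed.

    `attribute` is either "pitch" or "speed" (hypothetical argument names).
    """
    if attribute == "pitch":
        sentences = [f"He speaks with a {level} pitch" for level in PITCH_LEVELS]
    elif attribute == "speed":
        sentences = [f"He speaks with a {level} speaking speed" for level in SPEED_LEVELS]
    else:
        raise ValueError(f"unknown attribute: {attribute!r}")

    # Assumed joining rule: description, period, space, style sentence, period.
    base = base_description.rstrip(". ")
    return [f"{base}. {sentence}." for sentence in sentences]


description = (
    "Descriptions of the speaker's voice are very masculine, very adult-like, "
    "very middle-aged, very thick, very tensed, very powerful, bright, "
    "slightly soft, muffled, fluent"
)
for index, prompt in enumerate(build_prompts(description, "pitch"), start=1):
    print(index, prompt)
```

Because only the final sentence changes, any difference between the five samples can be attributed to the controlled attribute rather than to the rest of the description.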

Utterance ID: 1284_1180_000053_000001

"Descriptions of the speaker's voice are slightly masculine, feminine, very gender-neutral, adult-like, slightly thick, slightly tensed, slightly weak, bright, slightly soft, very clear"

1. He speaks with a very low pitch
2. He speaks with a low pitch
3. He speaks with a normal pitch
4. He speaks with a high pitch
5. He speaks with a very high pitch

1	2	3

4	5

Utterance ID: 1089_134686_000021_000000

"Descriptions of the speaker's voice are very masculine, very adult-like, middle-aged, very thick, slightly tensed, slightly weak, slightly bright, slightly soft, slightly clear, slightly fluent"

1. He speaks with a very low pitch
2. He speaks with a low pitch
3. He speaks with a normal pitch
4. He speaks with a high pitch
5. He speaks with a very high pitch

1	2	3

4	5

Utterance ID: 1580_141084_000031_000000

"Descriptions of the speaker's voice are very feminine, adult-like, slightly middle-aged, thin, very tensed, very powerful, very bright, very hard, very clear, very fluent"

1. He speaks with a very low pitch
2. He speaks with a low pitch
3. He speaks with a normal pitch
4. He speaks with a high pitch
5. He speaks with a very high pitch

1	2	3

4	5

Speed control (en)

Utterance ID: 260_123288_000004_000001

"Descriptions of the speaker's voice are very masculine, very adult-like, very middle-aged, very thick, very tensed, very powerful, bright, slightly soft, muffled, fluent"

1. He speaks with a very slow speaking speed
2. He speaks with a slow speaking speed
3. He speaks with a normal speaking speed
4. He speaks with a fast speaking speed
5. He speaks with a very fast speaking speed

1	2	3

4	5

Utterance ID: 1284_1180_000053_000001

"Descriptions of the speaker's voice are slightly masculine, feminine, very gender-neutral, adult-like, slightly thick, slightly tensed, slightly weak, bright, slightly soft, very clear"

1. He speaks with a very slow speaking speed
2. He speaks with a slow speaking speed
3. He speaks with a normal speaking speed
4. He speaks with a fast speaking speed
5. He speaks with a very fast speaking speed

1	2	3

4	5

Utterance ID: 1089_134686_000021_000000

"Descriptions of the speaker's voice are very masculine, very adult-like, middle-aged, very thick, slightly tensed, slightly weak, slightly bright, slightly soft, slightly clear, slightly fluent"

1. He speaks with a very slow speaking speed
2. He speaks with a slow speaking speed
3. He speaks with a normal speaking speed
4. He speaks with a fast speaking speed
5. He speaks with a very fast speaking speed

1	2	3

4	5

Utterance ID: 1580_141084_000031_000000

"Descriptions of the speaker's voice are very feminine, adult-like, slightly middle-aged, thin, very tensed, very powerful, very bright, very hard, very clear, very fluent"

1. He speaks with a very slow speaking speed
2. He speaks with a slow speaking speed
3. He speaks with a normal speaking speed
4. He speaks with a fast speaking speed
5. He speaks with a very fast speaking speed

1	2	3

4	5

Pitch control (ja)

Utterance ID: 260_123288_000004_000001

"Descriptions of the speaker's voice are very masculine, very adult-like, very middle-aged, very thick, very tensed, very powerful, bright, slightly soft, muffled, fluent"

1. He speaks with a very low pitch
2. He speaks with a low pitch
3. He speaks with a normal pitch
4. He speaks with a high pitch
5. He speaks with a very high pitch

1	2	3

4	5

Utterance ID: 1284_1180_000053_000001

"Descriptions of the speaker's voice are slightly masculine, feminine, very gender-neutral, adult-like, slightly thick, slightly tensed, slightly weak, bright, slightly soft, very clear"

1. He speaks with a very low pitch
2. He speaks with a low pitch
3. He speaks with a normal pitch
4. He speaks with a high pitch
5. He speaks with a very high pitch

1	2	3

4	5

Utterance ID: 1089_134686_000021_000000

"Descriptions of the speaker's voice are very masculine, very adult-like, middle-aged, very thick, slightly tensed, slightly weak, slightly bright, slightly soft, slightly clear, slightly fluent"

1. He speaks with a very low pitch
2. He speaks with a low pitch
3. He speaks with a normal pitch
4. He speaks with a high pitch
5. He speaks with a very high pitch

1	2	3

4	5

Utterance ID: 1580_141084_000031_000000

"Descriptions of the speaker's voice are very feminine, adult-like, slightly middle-aged, thin, very tensed, very powerful, very bright, very hard, very clear, very fluent"

1. He speaks with a very low pitch
2. He speaks with a low pitch
3. He speaks with a normal pitch
4. He speaks with a high pitch
5. He speaks with a very high pitch

1	2	3

4	5

Speed control (ja)

Utterance ID: 260_123288_000004_000001

"Descriptions of the speaker's voice are very masculine, very adult-like, very middle-aged, very thick, very tensed, very powerful, bright, slightly soft, muffled, fluent"

1. He speaks with a very slow speaking speed
2. He speaks with a slow speaking speed
3. He speaks with a normal speaking speed
4. He speaks with a fast speaking speed
5. He speaks with a very fast speaking speed

1	2	3

4	5

Utterance ID: 1284_1180_000053_000001

"Descriptions of the speaker's voice are slightly masculine, feminine, very gender-neutral, adult-like, slightly thick, slightly tensed, slightly weak, bright, slightly soft, very clear"

1. He speaks with a very slow speaking speed
2. He speaks with a slow speaking speed
3. He speaks with a normal speaking speed
4. He speaks with a fast speaking speed
5. He speaks with a very fast speaking speed

1	2	3

4	5

Utterance ID: 1089_134686_000021_000000

"Descriptions of the speaker's voice are very masculine, very adult-like, middle-aged, very thick, slightly tensed, slightly weak, slightly bright, slightly soft, slightly clear, slightly fluent"

1. He speaks with a very slow speaking speed
2. He speaks with a slow speaking speed
3. He speaks with a normal speaking speed
4. He speaks with a fast speaking speed
5. He speaks with a very fast speaking speed

1	2	3

4	5

Utterance ID: 1580_141084_000031_000000

"Descriptions of the speaker's voice are very feminine, adult-like, slightly middle-aged, thin, very tensed, very powerful, very bright, very hard, very clear, very fluent"

1. He speaks with a very slow speaking speed
2. He speaks with a slow speaking speed
3. He speaks with a normal speaking speed
4. He speaks with a fast speaking speed
5. He speaks with a very fast speaking speed

1	2	3

4	5

Ryuichi Yamamoto

Engineer/Researcher

I am an engineer/researcher passionate about speech synthesis. I love to write code and enjoy open-source collaboration on GitHub. Please feel free to reach out on Twitter and GitHub.