Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control

Preprint: arXiv:2409.17452 (Submitted to ICASSP 2025)

Abstract

We propose a novel description-based controllable text-to-speech (TTS) method with cross-lingual control capability. To address the lack of audio-description paired data in the target language, we combine a TTS model trained on the target language with a description control model trained on another language, which maps input text descriptions to the conditional features of the TTS model. These two models share disentangled timbre and style representations based on self-supervised learning (SSL), allowing for disentangled voice control, such as controlling speaking styles while retaining the original timbre. Furthermore, because the SSL-based timbre and style representations are language-agnostic, combining the TTS and description control models while sharing the same embedding space effectively enables cross-lingual control of voice characteristics. Experiments on English and Japanese TTS demonstrate that our method achieves high naturalness and controllability for both languages, even though no Japanese audio-description pairs are used.

Figure

Figure 1: Overview of our proposed TTS framework. A NANSY++ model (blue) is used as the backbone model to obtain disentangled representations and convert them back to the waveform. The TTS acoustic model (red) converts text to NANSY++ features. It utilizes a style encoder and feature aggregators (aggr.) to extract style embeddings. The description control model (green) predicts style and timbre embeddings from the input text descriptions.
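
As a rough illustration of how the components in Figure 1 fit together, the sketch below wires a description control model to a TTS acoustic model through shared timbre and style embeddings. It is a minimal toy under stated assumptions, not the authors' implementation: `VoiceCondition`, `toy_description_control`, `toy_acoustic_model`, `synthesize_features`, and `restyle` are hypothetical names, and the "models" are deterministic stand-ins that only show the interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass(frozen=True)
class VoiceCondition:
    """Timbre and style embeddings living in one shared embedding space."""
    timbre: Vector
    style: Vector


def toy_description_control(description: str, dim: int = 4) -> VoiceCondition:
    """Hypothetical stand-in for the description control model.

    A real model would be a trained network; this toy version only
    produces deterministic vectors so the data flow can be followed.
    """
    timbre = [0.0] * dim
    style = [0.0] * dim
    for position, word in enumerate(description.lower().split()):
        timbre[len(word) % dim] += 1.0  # toy "timbre" statistic
        style[position % dim] += 1.0    # toy "style" statistic
    return VoiceCondition(timbre=timbre, style=style)


def toy_acoustic_model(text: str, condition: VoiceCondition) -> List[Vector]:
    """Hypothetical stand-in for the TTS acoustic model.

    Emits one feature frame per input character; each frame is just the
    concatenated style and timbre vectors.
    """
    return [condition.style + condition.timbre for _ in text]


def synthesize_features(
    text: str,
    description: str,
    control: Callable[[str], VoiceCondition] = toy_description_control,
    acoustic: Callable[[str, VoiceCondition], List[Vector]] = toy_acoustic_model,
) -> List[Vector]:
    """Description -> (timbre, style) -> acoustic features for the text."""
    return acoustic(text, control(description))


def restyle(original: VoiceCondition, new_style: VoiceCondition) -> VoiceCondition:
    """Keep the original timbre and take only the style from another condition."""
    return VoiceCondition(timbre=original.timbre, style=new_style.style)
```

Because the two models only communicate through `VoiceCondition`, the control model and the acoustic model can in principle be trained on different languages, which is the property the abstract relies on for cross-lingual control. `restyle` mirrors the idea of changing speaking style while retaining timbre.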

Comparison of different systems

English TTS

Utterance ID: 260_123288_000004_000001

"Descriptions of the speaker's voice are very masculine, very adult-like, very middle-aged, very thick, very tensed, very powerful, bright, slightly soft, muffled, fluent. A man speaks quickly with average pitch."

Audio samples: Reference, CL-NANSY-TTS-w2v (ours), CL-NANSY-TTS, CL-PromptTTS++-w2v, ML-PromptTTS++, CL-PromptTTS++, ZS-NANSY-TTS-w2v

Utterance ID: 1284_1180_000053_000001

"Descriptions of the speaker's voice are slightly masculine, feminine, very gender-neutral, adult-like, slightly thick, slightly tensed, slightly weak, bright, slightly soft, very clear. A woman speaks with low pitch and normal speed."

Audio samples: Reference, CL-NANSY-TTS-w2v (ours), CL-NANSY-TTS, CL-PromptTTS++-w2v, ML-PromptTTS++, CL-PromptTTS++, ZS-NANSY-TTS-w2v

Utterance ID: 1089_134686_000021_000000

"Descriptions of the speaker's voice are very masculine, very adult-like, middle-aged, very thick, slightly tensed, slightly weak, slightly bright, slightly soft, slightly clear, slightly fluent. A low-pitched man speaks slowly."

Audio samples: Reference, CL-NANSY-TTS-w2v (ours), CL-NANSY-TTS, CL-PromptTTS++-w2v, ML-PromptTTS++, CL-PromptTTS++, ZS-NANSY-TTS-w2v

Utterance ID: 1580_141084_000031_000000

"Descriptions of the speaker's voice are very feminine, adult-like, slightly middle-aged, thin, very tensed, very powerful, very bright, very hard, very clear, very fluent. A woman speaks with normal pitch and normal speaking speed."

Audio samples: Reference, CL-NANSY-TTS-w2v (ours), CL-NANSY-TTS, CL-PromptTTS++-w2v, ML-PromptTTS++, CL-PromptTTS++, ZS-NANSY-TTS-w2v

Japanese TTS

Note that the descriptions of speaker characteristics are taken from the English references.

Utterance ID: 260_123288_000004_000001

"Descriptions of the speaker's voice are very masculine, very adult-like, very middle-aged, very thick, very tensed, very powerful, bright, slightly soft, muffled, fluent. A man speaks quickly with average pitch."

Audio samples: Reference, CL-NANSY-TTS-w2v (ours), CL-NANSY-TTS, CL-PromptTTS++, CL-PromptTTS++-w2v, ZS-NANSY-TTS-w2v

Utterance ID: 1284_1180_000053_000001

"Descriptions of the speaker's voice are slightly masculine, feminine, very gender-neutral, adult-like, slightly thick, slightly tensed, slightly weak, bright, slightly soft, very clear. A woman speaks with low pitch and normal speed."

Audio samples: Reference, CL-NANSY-TTS-w2v (ours), CL-NANSY-TTS, CL-PromptTTS++, CL-PromptTTS++-w2v, ZS-NANSY-TTS-w2v

Utterance ID: 1089_134686_000021_000000

"Descriptions of the speaker's voice are very masculine, very adult-like, middle-aged, very thick, slightly tensed, slightly weak, slightly bright, slightly soft, slightly clear, slightly fluent. A low-pitched man speaks slowly."

Audio samples: Reference, CL-NANSY-TTS-w2v (ours), CL-NANSY-TTS, CL-PromptTTS++, CL-PromptTTS++-w2v, ZS-NANSY-TTS-w2v

Utterance ID: 1580_141084_000031_000000

"Descriptions of the speaker's voice are very feminine, adult-like, slightly middle-aged, thin, very tensed, very powerful, very bright, very hard, very clear, very fluent. A woman speaks with normal pitch and normal speaking speed."

Audio samples: Reference, CL-NANSY-TTS-w2v (ours), CL-NANSY-TTS, CL-PromptTTS++, CL-PromptTTS++-w2v, ZS-NANSY-TTS-w2v

Style controllability samples
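
Each example in this section pairs a fixed voice description with one of five style sentences that vary a single attribute. The sketch below generates those prompts. It is a minimal illustration: the function names are hypothetical, and joining the description and the style sentence with ". " is an assumption based on the format of the descriptions in the comparison section, not a confirmed detail of the authors' pipeline.

```python
from typing import List

# Five-level scales, copied from the style sentences listed on this page.
PITCH_LEVELS = ["very low", "low", "normal", "high", "very high"]
SPEED_LEVELS = ["very slow", "slow", "normal", "fast", "very fast"]


def style_sentences(attribute: str) -> List[str]:
    """Return the five style sentences for 'pitch' or 'speed'."""
    if attribute == "pitch":
        return [f"He speaks with a {level} pitch" for level in PITCH_LEVELS]
    if attribute == "speed":
        return [f"He speaks with a {level} speaking speed" for level in SPEED_LEVELS]
    raise ValueError(f"unsupported attribute: {attribute!r}")


def build_prompts(base_description: str, attribute: str) -> List[str]:
    """Combine one voice description with each style sentence.

    Assumption: the two parts are joined with '. ' into a single prompt.
    """
    return [f"{base_description}. {sentence}" for sentence in style_sentences(attribute)]
```

For example, `build_prompts(description, "pitch")` yields five prompts that differ only in the pitch level, so any change across the five samples can be attributed to the style sentence rather than to the voice description.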

Pitch control (en)

Utterance ID: 260_123288_000004_000001

"Descriptions of the speaker's voice are very masculine, very adult-like, very middle-aged, very thick, very tensed, very powerful, bright, slightly soft, muffled, fluent"

  1. He speaks with a very low pitch
  2. He speaks with a low pitch
  3. He speaks with a normal pitch
  4. He speaks with a high pitch
  5. He speaks with a very high pitch

Utterance ID: 1284_1180_000053_000001

"Descriptions of the speaker's voice are slightly masculine, feminine, very gender-neutral, adult-like, slightly thick, slightly tensed, slightly weak, bright, slightly soft, very clear"

  1. He speaks with a very low pitch
  2. He speaks with a low pitch
  3. He speaks with a normal pitch
  4. He speaks with a high pitch
  5. He speaks with a very high pitch

Utterance ID: 1089_134686_000021_000000

"Descriptions of the speaker's voice are very masculine, very adult-like, middle-aged, very thick, slightly tensed, slightly weak, slightly bright, slightly soft, slightly clear, slightly fluent"

  1. He speaks with a very low pitch
  2. He speaks with a low pitch
  3. He speaks with a normal pitch
  4. He speaks with a high pitch
  5. He speaks with a very high pitch

Utterance ID: 1580_141084_000031_000000

"Descriptions of the speaker's voice are very feminine, adult-like, slightly middle-aged, thin, very tensed, very powerful, very bright, very hard, very clear, very fluent"

  1. He speaks with a very low pitch
  2. He speaks with a low pitch
  3. He speaks with a normal pitch
  4. He speaks with a high pitch
  5. He speaks with a very high pitch

Speed control (en)

Utterance ID: 260_123288_000004_000001

"Descriptions of the speaker's voice are very masculine, very adult-like, very middle-aged, very thick, very tensed, very powerful, bright, slightly soft, muffled, fluent"

  1. He speaks with a very slow speaking speed
  2. He speaks with a slow speaking speed
  3. He speaks with a normal speaking speed
  4. He speaks with a fast speaking speed
  5. He speaks with a very fast speaking speed

Utterance ID: 1284_1180_000053_000001

"Descriptions of the speaker's voice are slightly masculine, feminine, very gender-neutral, adult-like, slightly thick, slightly tensed, slightly weak, bright, slightly soft, very clear"

  1. He speaks with a very slow speaking speed
  2. He speaks with a slow speaking speed
  3. He speaks with a normal speaking speed
  4. He speaks with a fast speaking speed
  5. He speaks with a very fast speaking speed

Utterance ID: 1089_134686_000021_000000

"Descriptions of the speaker's voice are very masculine, very adult-like, middle-aged, very thick, slightly tensed, slightly weak, slightly bright, slightly soft, slightly clear, slightly fluent"

  1. He speaks with a very slow speaking speed
  2. He speaks with a slow speaking speed
  3. He speaks with a normal speaking speed
  4. He speaks with a fast speaking speed
  5. He speaks with a very fast speaking speed

Utterance ID: 1580_141084_000031_000000

"Descriptions of the speaker's voice are very feminine, adult-like, slightly middle-aged, thin, very tensed, very powerful, very bright, very hard, very clear, very fluent"

  1. He speaks with a very slow speaking speed
  2. He speaks with a slow speaking speed
  3. He speaks with a normal speaking speed
  4. He speaks with a fast speaking speed
  5. He speaks with a very fast speaking speed

Pitch control (ja)

Utterance ID: 260_123288_000004_000001

"Descriptions of the speaker's voice are very masculine, very adult-like, very middle-aged, very thick, very tensed, very powerful, bright, slightly soft, muffled, fluent"

  1. He speaks with a very low pitch
  2. He speaks with a low pitch
  3. He speaks with a normal pitch
  4. He speaks with a high pitch
  5. He speaks with a very high pitch

Utterance ID: 1284_1180_000053_000001

"Descriptions of the speaker's voice are slightly masculine, feminine, very gender-neutral, adult-like, slightly thick, slightly tensed, slightly weak, bright, slightly soft, very clear"

  1. He speaks with a very low pitch
  2. He speaks with a low pitch
  3. He speaks with a normal pitch
  4. He speaks with a high pitch
  5. He speaks with a very high pitch

Utterance ID: 1089_134686_000021_000000

"Descriptions of the speaker's voice are very masculine, very adult-like, middle-aged, very thick, slightly tensed, slightly weak, slightly bright, slightly soft, slightly clear, slightly fluent"

  1. He speaks with a very low pitch
  2. He speaks with a low pitch
  3. He speaks with a normal pitch
  4. He speaks with a high pitch
  5. He speaks with a very high pitch

Utterance ID: 1580_141084_000031_000000

"Descriptions of the speaker's voice are very feminine, adult-like, slightly middle-aged, thin, very tensed, very powerful, very bright, very hard, very clear, very fluent"

  1. He speaks with a very low pitch
  2. He speaks with a low pitch
  3. He speaks with a normal pitch
  4. He speaks with a high pitch
  5. He speaks with a very high pitch

Speed control (ja)

Utterance ID: 260_123288_000004_000001

"Descriptions of the speaker's voice are very masculine, very adult-like, very middle-aged, very thick, very tensed, very powerful, bright, slightly soft, muffled, fluent"

  1. He speaks with a very slow speaking speed
  2. He speaks with a slow speaking speed
  3. He speaks with a normal speaking speed
  4. He speaks with a fast speaking speed
  5. He speaks with a very fast speaking speed

Utterance ID: 1284_1180_000053_000001

"Descriptions of the speaker's voice are slightly masculine, feminine, very gender-neutral, adult-like, slightly thick, slightly tensed, slightly weak, bright, slightly soft, very clear"

  1. He speaks with a very slow speaking speed
  2. He speaks with a slow speaking speed
  3. He speaks with a normal speaking speed
  4. He speaks with a fast speaking speed
  5. He speaks with a very fast speaking speed

Utterance ID: 1089_134686_000021_000000

"Descriptions of the speaker's voice are very masculine, very adult-like, middle-aged, very thick, slightly tensed, slightly weak, slightly bright, slightly soft, slightly clear, slightly fluent"

  1. He speaks with a very slow speaking speed
  2. He speaks with a slow speaking speed
  3. He speaks with a normal speaking speed
  4. He speaks with a fast speaking speed
  5. He speaks with a very fast speaking speed

Utterance ID: 1580_141084_000031_000000

"Descriptions of the speaker's voice are very feminine, adult-like, slightly middle-aged, thin, very tensed, very powerful, very bright, very hard, very clear, very fluent"

  1. He speaks with a very slow speaking speed
  2. He speaks with a slow speaking speed
  3. He speaks with a normal speaking speed
  4. He speaks with a fast speaking speed
  5. He speaks with a very fast speaking speed
Ryuichi Yamamoto
Engineer/Researcher

I am an engineer/researcher passionate about speech synthesis. I love to write code and enjoy open-source collaboration on GitHub. Please feel free to reach out on Twitter and GitHub.