{
"cells": [
{
"cell_type": "markdown",
"id": "instrumental-moore",
"metadata": {},
"source": [
"# Advanced TTS demos"
]
},
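{
"cell_type": "markdown",
"id": "prescription-funeral",
"metadata": {},
"source": [
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/r9y9/ttslearn/blob/master/notebooks/ch11_Advanced-demos.ipynb)\n",
"\n",
"This page (in notebook form) presents advanced speech synthesis demos that use the non-autoregressive neural vocoders touched on briefly in Chapter 11.\n",
"The book covers only the JSUT corpus, but this page also includes demos built on other corpora, such as multi-speaker speech synthesis with the JVS corpus.\n",
"Note that the demos on this page are not explained in the book.\n",
"\n",
"The non-autoregressive neural vocoders are implemented with [kan-bayashi/ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN).\n",
"The book does not walk through an implementation of multi-speaker speech synthesis, but it can be achieved with minor modifications to the material in Chapters 9 and 10.\n",
"Interested readers can refer to the source code in extra_recipes."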
]
},
{
"cell_type": "markdown",
"id": "interim-essex",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "markdown",
"id": "married-measure",
"metadata": {},
"source": [
"### Installing ttslearn"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "behavioral-circular",
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"# Install ttslearn only if it is not already available\n",
"try:\n",
"    import ttslearn\n",
"except ImportError:\n",
"    !pip install ttslearn"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "significant-peninsula",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'0.2.2'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import ttslearn\n",
"ttslearn.__version__"
]
},
{
"cell_type": "markdown",
"id": "grateful-testing",
"metadata": {},
"source": [
"### Importing packages"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "stupid-anthony",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Populating the interactive namespace from numpy and matplotlib\n"
]
}
],
"source": [
"%pylab inline\n",
"import IPython\n",
"from IPython.display import Audio\n",
"import librosa\n",
"import librosa.display\n",
"from tqdm.notebook import tqdm\n",
"import torch\n",
"import random"
]
},
{
"cell_type": "markdown",
"id": "fuzzy-tablet",
"metadata": {},
"source": [
"## JSUT"
]
},
{
"cell_type": "markdown",
"id": "weekly-crack",
"metadata": {},
"source": [
"### Tacotron + Parallel WaveGAN (16kHz)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "missing-mirror",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Device: cuda\n",
"CPU times: user 211 ms, sys: 7.53 ms, total: 218 ms\n",
"Wall time: 314 ms\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from ttslearn.contrib import Tacotron2PWGTTS\n",
"\n",
"# Use the GPU when available, otherwise fall back to the CPU\n",
"device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
"print(\"Device:\", device)\n",
"\n",
"# Pretrained Tacotron 2 + Parallel WaveGAN engine\n",
"pwg_engine = Tacotron2PWGTTS(device=device)\n",
"\n",
"# Synthesize a waveform from text and report the elapsed time\n",
"%time wav, sr = pwg_engine.tts(\"あらゆる現実を、すべて自分のほうへねじ曲げたのだ。\")\n",
"IPython.display.display(Audio(wav, rate=sr))"
]
},
{
"cell_type": "markdown",
"id": "opposite-hotel",
"metadata": {},
"source": [
"### Tacotron + Parallel WaveGAN (24kHz)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "exclusive-interpretation",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 197 ms, sys: 3.81 ms, total: 200 ms\n",
"Wall time: 201 ms\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from ttslearn.pretrained import create_tts_engine\n",
"\n",
"pwg_engine = create_tts_engine(\"tacotron2_pwg_jsut24k\", device=device)\n",
"\n",
"%time wav, sr = pwg_engine.tts(\"あらゆる現実を、すべて自分のほうへねじ曲げたのだ。\")\n",
"IPython.display.display(Audio(wav, rate=sr))"
]
},
{
"cell_type": "markdown",
"id": "banned-granny",
"metadata": {},
"source": [
"### Tacotron + HiFi-GAN (24kHz)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "empirical-boutique",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 444 ms, sys: 2.96 ms, total: 447 ms\n",
"Wall time: 187 ms\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from ttslearn.pretrained import create_tts_engine\n",
"\n",
"# Tacotron 2 + HiFi-GAN vocoder (24 kHz, JSUT)\n",
"hifigan_engine = create_tts_engine(\"tacotron2_hifipwg_jsut24k\", device=device)\n",
"\n",
"%time wav, sr = hifigan_engine.tts(\"あらゆる現実を、すべて自分のほうへねじ曲げたのだ。\")\n",
"IPython.display.display(Audio(wav, rate=sr))"
]
},
{
"cell_type": "markdown",
"id": "answering-longer",
"metadata": {},
"source": [
"## JVS"
]
},
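{
"cell_type": "markdown",
"id": "speaker-embedding-note",
"metadata": {},
"source": [
"As noted in the introduction, multi-speaker synthesis needs only minor changes to the models from Chapters 9 and 10.\n",
"One common way to make such a change is to look up a learned embedding for each speaker and add it to the encoder outputs.\n",
"The cell below is a minimal sketch of that idea under stated assumptions: the class name `SpeakerConditionedEncoder`, the stand-in `nn.Linear` encoder, and all dimensions are hypothetical, and this is not necessarily how extra_recipes implements it."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "speaker-embedding-sketch",
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"from torch import nn\n",
"\n",
"\n",
"class SpeakerConditionedEncoder(nn.Module):\n",
"    \"\"\"Hypothetical wrapper that adds a learned speaker embedding to encoder outputs.\"\"\"\n",
"\n",
"    def __init__(self, encoder, num_speakers, hidden_dim):\n",
"        super().__init__()\n",
"        self.encoder = encoder\n",
"        # One trainable vector per speaker\n",
"        self.spk_embed = nn.Embedding(num_speakers, hidden_dim)\n",
"\n",
"    def forward(self, inputs, spk_ids):\n",
"        # inputs: (batch, time, input_dim), spk_ids: (batch,)\n",
"        enc_out = self.encoder(inputs)  # (batch, time, hidden_dim)\n",
"        spk = self.spk_embed(spk_ids).unsqueeze(1)  # (batch, 1, hidden_dim)\n",
"        # Broadcasting adds the same speaker vector at every time step\n",
"        return enc_out + spk\n",
"\n",
"\n",
"# Stand-in encoder for illustration only; a real model would use a text encoder\n",
"encoder = nn.Linear(8, 16)\n",
"model = SpeakerConditionedEncoder(encoder, num_speakers=100, hidden_dim=16)\n",
"out = model(torch.randn(2, 5, 8), torch.tensor([0, 9]))\n",
"print(out.shape)  # torch.Size([2, 5, 16])"
]
},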
{
"cell_type": "markdown",
"id": "rural-carrier",
"metadata": {},
"source": [
"### Multi-speaker Tacotron + Parallel WaveGAN (16kHz)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "working-allowance",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Speaker: jvs001\n",
"タコスと寿司、あなたはどっちが好きですか?わたしはタコスが好きです。\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Speaker: jvs010\n",
"タコスと寿司、あなたはどっちが好きですか?わたしは寿司が好きです。\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Speaker: jvs030\n",
"タコスと寿司、あなたはどっちが好きですか?わたしはタコスが好きです。\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Speaker: jvs050\n",
"タコスと寿司、あなたはどっちが好きですか?わたしは寿司が好きです。\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Speaker: jvs100\n",
"タコスと寿司、あなたはどっちが好きですか?わたしは寿司が好きです。\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"pwg_engine = create_tts_engine(\"multspk_tacotron2_pwg_jvs16k\", device=device)\n",
"for spk in [\"jvs001\", \"jvs010\", \"jvs030\", \"jvs050\", \"jvs100\"]:\n",
"    # Vary the sentence ending at random: 寿司 with probability 0.8, タコス otherwise\n",
"    food = \"寿司\" if random.random() > 0.2 else \"タコス\"\n",
"    text = f\"タコスと寿司、あなたはどっちが好きですか?わたしは{food}が好きです。\"\n",
"    # Map the JVS speaker name to its speaker ID\n",
"    wav, sr = pwg_engine.tts(text, spk_id=pwg_engine.spk2id[spk])\n",
"    print(f\"Speaker: {spk}\")\n",
"    print(text)\n",
"    IPython.display.display(Audio(wav, rate=sr))"
]
},
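{
"cell_type": "markdown",
"id": "reproducible-text-note",
"metadata": {},
"source": [
"The loop above picks the sentence ending with `random.random()`, so the printed text changes from run to run.\n",
"If reproducible demo sentences are preferred, one option is to draw from a dedicated, seeded `random.Random` instance.\n",
"The cell below is a minimal sketch; the helper name `make_text` and the seed value are illustrative choices, not part of ttslearn.\n",
"Using a separate generator also leaves the global random state untouched."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "reproducible-text-sketch",
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"\n",
"\n",
"def make_text(rng):\n",
"    \"\"\"Hypothetical helper: build the demo sentence with a random ending.\"\"\"\n",
"    food = \"寿司\" if rng.random() > 0.2 else \"タコス\"\n",
"    return f\"タコスと寿司、あなたはどっちが好きですか?わたしは{food}が好きです。\"\n",
"\n",
"\n",
"# Dedicated generator with a fixed seed (1234 is an arbitrary choice)\n",
"rng = random.Random(1234)\n",
"texts = [make_text(rng) for _ in range(5)]\n",
"\n",
"# Re-creating the generator with the same seed reproduces the same sentences\n",
"rng_again = random.Random(1234)\n",
"assert texts == [make_text(rng_again) for _ in range(5)]\n",
"\n",
"for text in texts:\n",
"    print(text)"
]
},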
{
"cell_type": "markdown",
"id": "convinced-spectrum",
"metadata": {},
"source": [
"### Multi-speaker Tacotron + Parallel WaveGAN (24kHz)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "optimum-paper",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Speaker: jvs001\n",
"タコスと寿司、あなたはどっちが好きですか?わたしは寿司が好きです。\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Speaker: jvs010\n",
"タコスと寿司、あなたはどっちが好きですか?わたしは寿司が好きです。\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Speaker: jvs030\n",
"タコスと寿司、あなたはどっちが好きですか?わたしは寿司が好きです。\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Speaker: jvs050\n",
"タコスと寿司、あなたはどっちが好きですか?わたしは寿司が好きです。\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Speaker: jvs100\n",
"タコスと寿司、あなたはどっちが好きですか?わたしは寿司が好きです。\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"pwg_engine = create_tts_engine(\"multspk_tacotron2_pwg_jvs24k\", device=device)\n",
"for spk in [\"jvs001\", \"jvs010\", \"jvs030\", \"jvs050\", \"jvs100\"]:\n",
"    # Vary the sentence ending at random: 寿司 with probability 0.8, タコス otherwise\n",
"    food = \"寿司\" if random.random() > 0.2 else \"タコス\"\n",
"    text = f\"タコスと寿司、あなたはどっちが好きですか?わたしは{food}が好きです。\"\n",
"    # Map the JVS speaker name to its speaker ID\n",
"    wav, sr = pwg_engine.tts(text, spk_id=pwg_engine.spk2id[spk])\n",
"    print(f\"Speaker: {spk}\")\n",
"    print(text)\n",
"    IPython.display.display(Audio(wav, rate=sr))"
]
},
{
"cell_type": "markdown",
"id": "moving-multimedia",
"metadata": {},
"source": [
"### Multi-speaker Tacotron + HiFi-GAN (24kHz)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "forced-damages",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Speaker: jvs001\n",
"タコスと寿司、あなたはどっちが好きですか?わたしは寿司が好きです。\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Speaker: jvs010\n",
"タコスと寿司、あなたはどっちが好きですか?わたしはタコスが好きです。\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Speaker: jvs030\n",
"タコスと寿司、あなたはどっちが好きですか?わたしはタコスが好きです。\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Speaker: jvs050\n",
"タコスと寿司、あなたはどっちが好きですか?わたしは寿司が好きです。\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Speaker: jvs100\n",
"タコスと寿司、あなたはどっちが好きですか?わたしは寿司が好きです。\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Multi-speaker Tacotron 2 + HiFi-GAN vocoder (24 kHz, JVS)\n",
"hifigan_engine = create_tts_engine(\"multspk_tacotron2_hifipwg_jvs24k\", device=device)\n",
"for spk in [\"jvs001\", \"jvs010\", \"jvs030\", \"jvs050\", \"jvs100\"]:\n",
"    # Vary the sentence ending at random: 寿司 with probability 0.8, タコス otherwise\n",
"    food = \"寿司\" if random.random() > 0.2 else \"タコス\"\n",
"    text = f\"タコスと寿司、あなたはどっちが好きですか?わたしは{food}が好きです。\"\n",
"    # Map the JVS speaker name to its speaker ID\n",
"    wav, sr = hifigan_engine.tts(text, spk_id=hifigan_engine.spk2id[spk])\n",
"    print(f\"Speaker: {spk}\")\n",
"    print(text)\n",
"    IPython.display.display(Audio(wav, rate=sr))"
]
},
{
"cell_type": "markdown",
"id": "experimental-anniversary",
"metadata": {},
"source": [
"## Common Voice (ja)"
]
},
{
"cell_type": "markdown",
"id": "regional-efficiency",
"metadata": {},
"source": [
"### Multi-speaker Tacotron + Parallel WaveGAN (16kHz)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "alleged-dating",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Speaker ID: 5\n",
"今日の天気は、晴れ時々曇りです。\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Speaker ID: 6\n",
"明日の天気は、晴れです。\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Speaker ID: 12\n",
"今日の天気は、晴れ時々曇りです。\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Speaker ID: 15\n",
"今日の天気は、晴れ時々曇りです。\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Speaker ID: 19\n",
"今日の天気は、晴れです。\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"pwg_engine = create_tts_engine(\"multspk_tacotron2_pwg_cv16k\", device=device)\n",
"# NOTE: some speakers' voices contain a significant amount of noise (e.g., speaker 0)\n",
"for spk_id in [5, 6, 12, 15, 19]:\n",
"    # Build a random weather sentence: 今日 or 明日, then 晴れ時々曇り or 晴れ (each with probability 0.5)\n",
"    day = \"今日\" if random.random() > 0.5 else \"明日\"\n",
"    weather = \"晴れ時々曇り\" if random.random() > 0.5 else \"晴れ\"\n",
"    text = f\"{day}の天気は、{weather}です。\"\n",
"    wav, sr = pwg_engine.tts(text, spk_id=spk_id)\n",
"    print(f\"Speaker ID: {spk_id}\")\n",
"    print(text)\n",
"    IPython.display.display(Audio(wav, rate=sr))"
]
},
{
"cell_type": "markdown",
"id": "oriented-wrong",
"metadata": {},
"source": [
"### Multi-speaker Tacotron + Parallel WaveGAN (24kHz)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "cosmetic-requirement",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Speaker ID: 5\n",
"今日の天気は、晴れ時々曇りです。\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Speaker ID: 6\n",
"今日の天気は、晴れ時々曇りです。\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Speaker ID: 12\n",
"今日の天気は、晴れ時々曇りです。\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Speaker ID: 15\n",
"明日の天気は、晴れ時々曇りです。\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Speaker ID: 19\n",
"今日の天気は、晴れ時々曇りです。\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"pwg_engine = create_tts_engine(\"multspk_tacotron2_pwg_cv24k\", device=device)\n",
"# NOTE: some speakers' voices contain a significant amount of noise (e.g., speaker 0)\n",
"for spk_id in [5, 6, 12, 15, 19]:\n",
"    # Build a random weather sentence: 今日 or 明日, then 晴れ時々曇り or 晴れ (each with probability 0.5)\n",
"    day = \"今日\" if random.random() > 0.5 else \"明日\"\n",
"    weather = \"晴れ時々曇り\" if random.random() > 0.5 else \"晴れ\"\n",
"    text = f\"{day}の天気は、{weather}です。\"\n",
"    wav, sr = pwg_engine.tts(text, spk_id=spk_id)\n",
"    print(f\"Speaker ID: {spk_id}\")\n",
"    print(text)\n",
"    IPython.display.display(Audio(wav, rate=sr))"
]
},
{
"cell_type": "markdown",
"id": "athletic-contents",
"metadata": {},
"source": [
"## References\n",
"\n",
"- Parallel WaveGAN: https://arxiv.org/abs/1910.11480\n",
"- HiFi-GAN: https://arxiv.org/abs/2010.05646\n",
"- Implementations of GAN-based non-autoregressive neural vocoders, including Parallel WaveGAN: https://github.com/kan-bayashi/ParallelWaveGAN"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}