{ "cells": [ { "cell_type": "markdown", "id": "instrumental-moore", "metadata": {}, "source": [ "# Advanced TTS demos" ] }, { "cell_type": "markdown", "id": "prescription-funeral", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/r9y9/ttslearn/blob/master/notebooks/ch11_Advanced-demos.ipynb)\n", "\n", "This page (in notebook form) presents advanced speech synthesis demos that use the non-autoregressive neural vocoders briefly mentioned in Chapter 11.\n", "The book covers only the JSUT corpus, but this page also includes demos built on other corpora, such as multi-speaker speech synthesis with the JVS corpus.\n", "Note that the demos on this page are not explained in the book.\n", "\n", "The non-autoregressive neural vocoders are implemented with [kan-bayashi/ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN).\n", "The book does not walk through the implementation of multi-speaker speech synthesis, but it can be achieved with minor modifications to the material in Chapters 9 and 10.\n", "Interested readers should refer to the source code in extra_recipes." ] }, { "cell_type": "markdown", "id": "interim-essex", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "markdown", "id": "married-measure", "metadata": {}, "source": [ "### Installing ttslearn" ] }, { "cell_type": "code", "execution_count": 1, "id": "behavioral-circular", "metadata": {}, "outputs": [], "source": [ "%%capture\n", "try:\n", " import ttslearn\n", "except ImportError:\n", " !pip install ttslearn" ] }, { "cell_type": "code", "execution_count": 2, "id": "significant-peninsula", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'0.2.2'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import ttslearn\n", "ttslearn.__version__" ] }, { "cell_type": "markdown", "id": "grateful-testing", "metadata": {}, "source": [ "### Importing packages" ] }, { "cell_type": "code", "execution_count": 3, "id": "stupid-anthony", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Populating the interactive namespace from numpy and matplotlib\n" ] } ], "source": [ "%pylab inline\n", "import IPython\n", "from IPython.display import Audio\n", "import librosa\n", "import librosa.display\n", 
"from tqdm.notebook import tqdm\n", "import torch\n", "import random" ] }, { "cell_type": "markdown", "id": "fuzzy-tablet", "metadata": {}, "source": [ "## JSUT" ] }, { "cell_type": "markdown", "id": "weekly-crack", "metadata": {}, "source": [ "### Tacotron + Parallel WaveGAN (16kHz)" ] }, { "cell_type": "code", "execution_count": 4, "id": "missing-mirror", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Device: cuda\n", "CPU times: user 211 ms, sys: 7.53 ms, total: 218 ms\n", "Wall time: 314 ms\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from ttslearn.contrib import Tacotron2PWGTTS\n", "\n", "if torch.cuda.is_available():\n", " device = torch.device(\"cuda\")\n", "else:\n", " device = torch.device(\"cpu\")\n", "print(\"Device:\", device)\n", "\n", "pwg_engine = Tacotron2PWGTTS(device=device)\n", "\n", "%time wav, sr = pwg_engine.tts(\"あらゆる現実を、すべて自分のほうへねじ曲げたのだ。\")\n", "IPython.display.display(Audio(wav, rate=sr))" ] }, { "cell_type": "markdown", "id": "opposite-hotel", "metadata": {}, "source": [ "### Tacotron + Parallel WaveGAN (24kHz)" ] }, { "cell_type": "code", "execution_count": 5, "id": "exclusive-interpretation", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 197 ms, sys: 3.81 ms, total: 200 ms\n", "Wall time: 201 ms\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from ttslearn.pretrained import create_tts_engine\n", "\n", "pwg_engine = create_tts_engine(\"tacotron2_pwg_jsut24k\", device=device)\n", "\n", "%time wav, sr = pwg_engine.tts(\"あらゆる現実を、すべて自分のほうへねじ曲げたのだ。\")\n", "IPython.display.display(Audio(wav, rate=sr))" ] }, { "cell_type": "markdown", "id": "banned-granny", "metadata": {}, "source": [ "### Tacotron + HiFi-GAN (24kHz)" ] }, { "cell_type": "code", "execution_count": 
6, "id": "empirical-boutique", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 444 ms, sys: 2.96 ms, total: 447 ms\n", "Wall time: 187 ms\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from ttslearn.pretrained import create_tts_engine\n", "\n", "pwg_engine = create_tts_engine(\"tacotron2_hifipwg_jsut24k\", device=device)\n", "\n", "%time wav, sr = pwg_engine.tts(\"あらゆる現実を、すべて自分のほうへねじ曲げたのだ。\")\n", "IPython.display.display(Audio(wav, rate=sr))" ] }, { "cell_type": "markdown", "id": "answering-longer", "metadata": {}, "source": [ "## JVS " ] }, { "cell_type": "markdown", "id": "rural-carrier", "metadata": {}, "source": [ "### Multi-speaker Tacotron + Parallel WaveGAN (16kHz)" ] }, { "cell_type": "code", "execution_count": 7, "id": "working-allowance", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Speaker: jvs001\n", "タコスと寿司、あなたはどっちが好きですか?わたしはタコスが好きです。\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Speaker: jvs010\n", "タコスと寿司、あなたはどっちが好きですか?わたしは寿司が好きです。\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Speaker: jvs030\n", "タコスと寿司、あなたはどっちが好きですか?わたしはタコスが好きです。\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Speaker: jvs050\n", "タコスと寿司、あなたはどっちが好きですか?わたしは寿司が好きです。\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Speaker: jvs100\n", 
"タコスと寿司、あなたはどっちが好きですか?わたしは寿司が好きです。\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pwg_engine = create_tts_engine(\"multspk_tacotron2_pwg_jvs16k\", device=device)\n", "for spk in [\"jvs001\", \"jvs010\", \"jvs030\", \"jvs050\", \"jvs100\"]:\n", " text = \"タコスと寿司、あなたはどっちが好きですか?わたしは\" + (\"寿司\" if random.random() > 0.2 else \"タコス\") + \"が好きです。\"\n", " wav, sr = pwg_engine.tts(text, spk_id=pwg_engine.spk2id[spk])\n", " print(f\"Speaker: {spk}\")\n", " print(text)\n", " IPython.display.display(Audio(wav, rate=sr))" ] }, { "cell_type": "markdown", "id": "convinced-spectrum", "metadata": {}, "source": [ "### Multi-speaker Tacotron + Parallel WaveGAN (24kHz)" ] }, { "cell_type": "code", "execution_count": 8, "id": "optimum-paper", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Speaker: jvs001\n", "タコスと寿司、あなたはどっちが好きですか?わたしは寿司が好きです。\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Speaker: jvs010\n", "タコスと寿司、あなたはどっちが好きですか?わたしは寿司が好きです。\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Speaker: jvs030\n", "タコスと寿司、あなたはどっちが好きですか?わたしは寿司が好きです。\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Speaker: jvs050\n", "タコスと寿司、あなたはどっちが好きですか?わたしは寿司が好きです。\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Speaker: jvs100\n", "タコスと寿司、あなたはどっちが好きですか?わたしは寿司が好きです。\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" 
] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pwg_engine = create_tts_engine(\"multspk_tacotron2_pwg_jvs24k\", device=device)\n", "for spk in [\"jvs001\", \"jvs010\", \"jvs030\", \"jvs050\", \"jvs100\"]:\n", " text = \"タコスと寿司、あなたはどっちが好きですか?わたしは\" + (\"寿司\" if random.random() > 0.2 else \"タコス\") + \"が好きです。\"\n", " wav, sr = pwg_engine.tts(text, spk_id=pwg_engine.spk2id[spk])\n", " print(f\"Speaker: {spk}\")\n", " print(text)\n", " IPython.display.display(Audio(wav, rate=sr))" ] }, { "cell_type": "markdown", "id": "moving-multimedia", "metadata": {}, "source": [ "### Multi-speaker Tacotron + HiFi-GAN (24kHz)" ] }, { "cell_type": "code", "execution_count": 9, "id": "forced-damages", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Speaker: jvs001\n", "タコスと寿司、あなたはどっちが好きですか?わたしは寿司が好きです。\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Speaker: jvs010\n", "タコスと寿司、あなたはどっちが好きですか?わたしはタコスが好きです。\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Speaker: jvs030\n", "タコスと寿司、あなたはどっちが好きですか?わたしはタコスが好きです。\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Speaker: jvs050\n", "タコスと寿司、あなたはどっちが好きですか?わたしは寿司が好きです。\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Speaker: jvs100\n", "タコスと寿司、あなたはどっちが好きですか?わたしは寿司が好きです。\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pwg_engine = 
create_tts_engine(\"multspk_tacotron2_hifipwg_jvs24k\", device=device)\n", "for spk in [\"jvs001\", \"jvs010\", \"jvs030\", \"jvs050\", \"jvs100\"]:\n", " text = \"タコスと寿司、あなたはどっちが好きですか?わたしは\" + (\"寿司\" if random.random() > 0.2 else \"タコス\") + \"が好きです。\"\n", " wav, sr = pwg_engine.tts(text, spk_id=pwg_engine.spk2id[spk])\n", " print(f\"Speaker: {spk}\")\n", " print(text)\n", " IPython.display.display(Audio(wav, rate=sr))" ] }, { "cell_type": "markdown", "id": "experimental-anniversary", "metadata": {}, "source": [ "## Common voice (ja)" ] }, { "cell_type": "markdown", "id": "regional-efficiency", "metadata": {}, "source": [ "### Multi-speaker Tacotron + Parallel WaveGAN (16kHz)" ] }, { "cell_type": "code", "execution_count": 10, "id": "alleged-dating", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Speaker ID: 5\n", "今日の天気は、晴れ時々曇りです。\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Speaker ID: 6\n", "明日の天気は、晴れです。\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Speaker ID: 12\n", "今日の天気は、晴れ時々曇りです。\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Speaker ID: 15\n", "今日の天気は、晴れ時々曇りです。\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Speaker ID: 19\n", "今日の天気は、晴れです。\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pwg_engine = create_tts_engine(\"multspk_tacotron2_pwg_cv16k\", device=device)\n", "# NOTE: some 
speakers' voices have a significant amount of noise (e.g., speaker 0)\n", "for spk_id in [5, 6, 12, 15, 19]:\n", " text = (\"今日\" if random.random() > 0.5 else \"明日\") + \"の天気は、\" + (\"晴れ時々曇り\" if random.random() > 0.5 else \"晴れ\") + \"です。\"\n", " wav, sr = pwg_engine.tts(text, spk_id=spk_id)\n", " print(f\"Speaker ID: {spk_id}\")\n", " print(text)\n", " IPython.display.display(Audio(wav, rate=sr))" ] }, { "cell_type": "markdown", "id": "oriented-wrong", "metadata": {}, "source": [ "### Multi-speaker Tacotron + Parallel WaveGAN (24kHz)" ] }, { "cell_type": "code", "execution_count": 11, "id": "cosmetic-requirement", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Speaker ID: 5\n", "今日の天気は、晴れ時々曇りです。\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Speaker ID: 6\n", "今日の天気は、晴れ時々曇りです。\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Speaker ID: 12\n", "今日の天気は、晴れ時々曇りです。\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Speaker ID: 15\n", "明日の天気は、晴れ時々曇りです。\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Speaker ID: 19\n", "今日の天気は、晴れ時々曇りです。\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pwg_engine = create_tts_engine(\"multspk_tacotron2_pwg_cv24k\", device=device)\n", "# NOTE: some speakers' voices have a significant amount of noise (e.g., speaker 0)\n", "for spk_id in [5, 6, 12, 15, 19]:\n", " text = (\"今日\" if 
random.random() > 0.5 else \"明日\") + \"の天気は、\" + (\"晴れ時々曇り\" if random.random() > 0.5 else \"晴れ\") + \"です。\"\n", " wav, sr = pwg_engine.tts(text, spk_id=spk_id)\n", " print(f\"Speaker ID: {spk_id}\")\n", " print(text)\n", " IPython.display.display(Audio(wav, rate=sr))" ] }, { "cell_type": "markdown", "id": "athletic-contents", "metadata": {}, "source": [ "## References\n", "\n", "- Parallel WaveGAN: https://arxiv.org/abs/1910.11480\n", "- HiFi-GAN: https://arxiv.org/abs/2010.05646\n", "- Implementation of GAN-based non-autoregressive neural vocoders, including Parallel WaveGAN: https://github.com/kan-bayashi/ParallelWaveGAN" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.6" } }, "nbformat": 4, "nbformat_minor": 5 }