Audio Samples from "Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search"

Abstract: Recently, text-to-speech (TTS) models such as FastSpeech and ParaNet have been proposed to generate mel-spectrograms from text in parallel. Despite the advantage, the parallel TTS models cannot be trained without guidance from autoregressive TTS models as their external aligners. In this work, we propose Glow-TTS, a flow-based generative model for parallel TTS that does not require any external aligner. By combining the properties of flows and dynamic programming, the proposed model searches for the most probable monotonic alignment between text and the latent representation of speech on its own. We demonstrate that enforcing hard monotonic alignments enables robust TTS, which generalizes to long utterances, and employing generative flows enables fast, diverse, and controllable speech synthesis. Glow-TTS obtains an order-of-magnitude speed-up over the autoregressive model, Tacotron 2, at synthesis with comparable speech quality. We further show that our model can be easily extended to a multi-speaker setting.
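The alignment search mentioned in the abstract can be illustrated with a small dynamic program. The following is only a sketch under stated assumptions, not the authors' implementation: the function name `monotonic_alignment_search` and the input `log_prob` (a text-by-frame table of log-likelihoods) are hypothetical, and the alignment is assumed to start at the first token, end at the last token, and either stay on the same token or advance by one token at each frame.

```python
from typing import List

NEG_INF = float("-inf")


def monotonic_alignment_search(log_prob: List[List[float]]) -> List[int]:
    """Return, for every speech frame j, the index of the text token it aligns to.

    log_prob[i][j] is assumed to hold the log-likelihood of frame j under token i.
    """
    n_text = len(log_prob)
    n_frames = len(log_prob[0])
    if n_text > n_frames:
        raise ValueError("need at least one frame per text token")

    # q[i][j]: best cumulative log-likelihood of any monotonic path that
    # covers frames 0..j and places frame j on token i.
    q = [[NEG_INF] * n_frames for _ in range(n_text)]
    q[0][0] = log_prob[0][0]
    for j in range(1, n_frames):
        # Token i is reachable at frame j only if i <= j.
        for i in range(min(j + 1, n_text)):
            stay = q[i][j - 1]
            advance = q[i - 1][j - 1] if i > 0 else NEG_INF
            q[i][j] = max(stay, advance) + log_prob[i][j]

    # Backtrack from the last token at the last frame to recover the path.
    alignment = [0] * n_frames
    i = n_text - 1
    for j in range(n_frames - 1, -1, -1):
        alignment[j] = i
        if j > 0 and i > 0 and q[i - 1][j - 1] >= q[i][j - 1]:
            i -= 1
    return alignment


# Toy example: two tokens, four frames.
toy = [
    [0.0, 0.0, -5.0, -5.0],  # token 0 explains the first two frames well
    [-5.0, -5.0, 0.0, 0.0],  # token 1 explains the last two frames well
]
print(monotonic_alignment_search(toy))
```

Counting how many frames map to each token in the returned list gives a per-token duration, which is one plausible way such an alignment could supervise a duration predictor.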

Quality Improvement after Paper Submission*

This result was not included in the paper. After submission, we found that two modifications improve the synthesis quality of Glow-TTS: 1) switching the vocoder to HiFi-GAN (https://arxiv.org/abs/2010.05646) to reduce noise, and 2) inserting a blank token between every two input tokens to improve pronunciation. Specifically, we used the vocoder fine-tuned with Tacotron 2, which is provided as a pretrained model in the repo (https://github.com/jik876/hifi-gan). If you're interested, please listen to the three samples below.
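The second modification, the blank token, can be pictured with a few lines of Python. This is a minimal sketch, not the authors' code: the helper name `intersperse_blank` and the choice of `0` as the blank id are hypothetical, and whether blanks are also placed at the sequence boundaries is not stated here, so the sketch inserts them strictly between tokens.

```python
from typing import List


def intersperse_blank(tokens: List[int], blank_id: int) -> List[int]:
    """Insert blank_id between every two consecutive tokens."""
    if not tokens:
        return []
    # Start from all blanks, then drop the real tokens into the even slots.
    out = [blank_id] * (2 * len(tokens) - 1)
    out[0::2] = tokens
    return out


# Hypothetical token ids for a three-symbol input; 0 is assumed to be the blank.
print(intersperse_blank([5, 7, 9], blank_id=0))
```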

Abstract

Glow-TTS*

But Mrs. Solomons could not resist the temptation to dabble in stolen goods, and she was found shipping watches of the wrong category to New York.

Glow-TTS*

When Mr. and Mrs. Dursley woke up on the dull, gray Tuesday our story starts, there was nothing about the cloudy sky outside to suggest that strange and mysterious things would soon be happening all over the country. Mr. Dursley hummed as he picked out his most boring tie for work, and Mrs. Dursley gossiped away happily as she wrestled a screaming Dudley into his high chair.

Glow-TTS*

Single Speaker TTS

But Mrs. Solomons could not resist the temptation to dabble in stolen goods, and she was found shipping watches of the wrong category to New York.

GT GT(WaveGlow) Glow-TTS (T=0.333) Glow-TTS (T=0.5) Glow-TTS (T=0.667) Tacotron 2

The route chosen from the airport to Main Street was the normal one, except where Harwood Street was selected as the means of access to Main Street

GT GT(WaveGlow) Glow-TTS (T=0.333) Glow-TTS (T=0.5) Glow-TTS (T=0.667) Tacotron 2

Mr. Buxton's friends at once paid the forty shillings, and the boy was released.

GT GT(WaveGlow) Glow-TTS (T=0.333) Glow-TTS (T=0.5) Glow-TTS (T=0.667) Tacotron 2

perhaps the tales that travelers told him were exaggerated as travelers' tales are likely to be,

GT GT(WaveGlow) Glow-TTS (T=0.333) Glow-TTS (T=0.5) Glow-TTS (T=0.667) Tacotron 2

Diversity

at a distance from the prison.

Same Gaussian noise ε, different temperature T=0.1 Same Gaussian noise ε, different temperature T=0.333 Same Gaussian noise ε, different temperature T=0.667 Same Gaussian noise ε, different temperature T=1.0
Different Gaussian noise ε1, same temperature T=0.667 Different Gaussian noise ε2, same temperature T=0.667 Different Gaussian noise ε3, same temperature T=0.667
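The two rows above vary either the temperature or the noise sample. A rough sketch of how those two knobs could interact is given below. It assumes the latent is drawn as mean + T · std · ε with ε from a standard normal distribution, which is a common convention for flow-based models and is an assumption here, not something this page states. The names `draw_noise` and `latent_from_noise`, and the toy mean and standard deviation values, are hypothetical.

```python
import random
from typing import List


def draw_noise(n: int, seed: int) -> List[float]:
    """Draw n standard-normal values; a fixed seed reproduces the same noise."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def latent_from_noise(mean: List[float], std: List[float],
                      noise: List[float], temperature: float) -> List[float]:
    """Assumed sampling rule: z = mean + temperature * std * noise."""
    return [m + temperature * s * e for m, s, e in zip(mean, std, noise)]


mean = [0.2, -0.1, 0.4]  # toy prior statistics, for illustration only
std = [1.0, 0.5, 2.0]

# Same noise, different temperature: samples move along one direction.
eps = draw_noise(len(mean), seed=0)
for t in (0.1, 0.333, 0.667, 1.0):
    print(t, latent_from_noise(mean, std, eps, t))

# Different noise, same temperature: samples spread in different directions.
for seed in (1, 2, 3):
    print(seed, latent_from_noise(mean, std, draw_noise(len(mean), seed), 0.667))
```

Under this assumption, a temperature of zero collapses every sample onto the prior mean, while larger temperatures move the latent further from it.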

Length Robustness

When Mr. and Mrs. Dursley woke up on the dull, gray Tuesday our story starts, there was nothing about the cloudy sky outside to suggest that strange and mysterious things would soon be happening all over the country. Mr. Dursley hummed as he picked out his most boring tie for work, and Mrs. Dursley gossiped away happily as she wrestled a screaming Dudley into his high chair.

Glow-TTS Tacotron 2

"Sorry," he grunted, as the tiny old man stumbled and almost fell. It was a few seconds before Mr. Dursley realized that the man was wearing a violet cloak. He didn't seem at all upset at being almost knocked to the ground. On the contrary, his face split into a wide smile and he said in a squeaky voice that made passersby stare, "Don't be sorry, my dear sir, for nothing could upset me today! Rejoice, for You-Know-Who has gone at last! Even Muggles like yourself should be celebrating, this happy, happy day!"

Glow-TTS Tacotron 2

If the motorcycle was huge, it was nothing to the man sitting astride it. He was almost twice as tall as a normal man and at least five times as wide. He looked simply too big to be allowed, and so wild - long tangles of bushy black hair and beard hid most of his face, he had hands the size of trash can lids, and his feet in their leather boots were like baby dolphins. In his vast, muscular arms he was holding a bundle of blankets.

Glow-TTS Tacotron 2

Length Controllability

Peter Piper picked a peck of pickled peppers. How many pickled peppers did Peter Piper pick?

Predicted Duration x 0.50 Predicted Duration x 0.75 Predicted Duration x 1.00 Predicted Duration x 1.25 Predicted Duration x 1.50

For a while the preacher addresses himself to the congregation at large, who listen attentively

Predicted Duration x 0.50 Predicted Duration x 0.75 Predicted Duration x 1.00 Predicted Duration x 1.25 Predicted Duration x 1.50

The nature of the protective assignment

Predicted Duration x 0.50 Predicted Duration x 0.75 Predicted Duration x 1.00 Predicted Duration x 1.25 Predicted Duration x 1.50
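The "Predicted Duration x k" labels above correspond to multiplying each token's predicted duration by a scalar before synthesis. The sketch below is illustrative only: the helper names `scale_durations` and `expand_tokens`, the toy durations, and the use of rounding up to whole frames are assumptions, not details taken from this page.

```python
import math
from typing import List


def scale_durations(predicted: List[float], length_scale: float) -> List[int]:
    """Scale per-token durations (in frames) and round each up to a whole frame."""
    return [math.ceil(d * length_scale) for d in predicted]


def expand_tokens(durations: List[int]) -> List[int]:
    """Repeat each token index by its duration to get a frame-level alignment."""
    frames: List[int] = []
    for token_index, n_frames in enumerate(durations):
        frames.extend([token_index] * n_frames)
    return frames


predicted = [2.0, 3.0, 1.0]  # toy per-token durations in frames
for scale in (0.5, 1.0, 1.5):
    durations = scale_durations(predicted, scale)
    print(scale, durations, len(expand_tokens(durations)))
```

A scale below 1.0 shortens the utterance and a scale above 1.0 lengthens it, which matches the ordering of the samples in this section.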

More examples of Controllability & Diversity

I noticed when I went out that the light was on, end quote,

Noise ε0, Temperature T=0.1 Noise ε0, Temperature T=0.333 Noise ε0, Temperature T=0.667 Noise ε0, Temperature T=1.0
Noise ε0, Temperature T=0.667 Noise ε1, Temperature T=0.667 Noise ε2, Temperature T=0.667 Noise ε3, Temperature T=0.667

On the seventh July, eighteen thirty-seven,

Noise ε0, Temperature T=0.1 Noise ε0, Temperature T=0.333 Noise ε0, Temperature T=0.667 Noise ε0, Temperature T=1.0
Noise ε0, Temperature T=0.667 Noise ε1, Temperature T=0.667 Noise ε2, Temperature T=0.667 Noise ε3, Temperature T=0.667

contracted with sheriffs and conveners to work by the job.

Noise ε0, Temperature T=0.1 Noise ε0, Temperature T=0.333 Noise ε0, Temperature T=0.667 Noise ε0, Temperature T=1.0
Noise ε0, Temperature T=0.667 Noise ε1, Temperature T=0.667 Noise ε2, Temperature T=0.667 Noise ε3, Temperature T=0.667

eleven. If I am alive and taken prisoner,

Noise ε0, Temperature T=0.1 Noise ε0, Temperature T=0.333 Noise ε0, Temperature T=0.667 Noise ε0, Temperature T=1.0
Noise ε0, Temperature T=0.667 Noise ε1, Temperature T=0.667 Noise ε2, Temperature T=0.667 Noise ε3, Temperature T=0.667

He was never satisfied with anything.

Noise ε0, Temperature T=0.1 Noise ε0, Temperature T=0.333 Noise ε0, Temperature T=0.667 Noise ε0, Temperature T=1.0
Noise ε0, Temperature T=0.667 Noise ε1, Temperature T=0.667 Noise ε2, Temperature T=0.667 Noise ε3, Temperature T=0.667

Multi Speaker TTS

When terminating below the ankles it was held down by a slender strap passing under the foot.

GT GT(WaveGlow) Glow-TTS (T=0.333) Glow-TTS (T=0.5) Glow-TTS (T=0.667) Tacotron 2

He was at the head of his class in rhetoric.

GT GT(WaveGlow) Glow-TTS (T=0.333) Glow-TTS (T=0.5) Glow-TTS (T=0.667) Tacotron 2

And he looks hungry enough."

GT GT(WaveGlow) Glow-TTS (T=0.333) Glow-TTS (T=0.5) Glow-TTS (T=0.667) Tacotron 2

Speaker Dependent Duration

Some have accepted this as a miracle without any physical explanation

LibriTTS 103 LibriTTS 7178 LibriTTS 229 LibriTTS 2002

Scientists at the CERN laboratory say they have discovered a new particle.

LibriTTS 103 LibriTTS 7178 LibriTTS 229 LibriTTS 2002

Voice Conversion

Samples on the diagonal are ground-truth mel-spectrograms reconstructed through WaveGlow.
From\To LibriTTS 2836 LibriTTS 6476 LibriTTS 118 LibriTTS 446
LibriTTS 2836
LibriTTS 6476
LibriTTS 118
LibriTTS 446