Audio Samples from "Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search"
Abstract: Recently, text-to-speech (TTS) models such as FastSpeech and ParaNet have been proposed to generate mel-spectrograms from text in parallel. Despite the advantage, the parallel TTS models cannot be trained without guidance from autoregressive TTS models as their external aligners. In this work, we propose Glow-TTS, a flow-based generative model for parallel TTS that does not require any external aligner. By combining the properties of flows and dynamic programming, the proposed model searches for the most probable monotonic alignment between text and the latent representation of speech on its own. We demonstrate that enforcing hard monotonic alignments enables robust TTS, which generalizes to long utterances, and employing generative flows enables fast, diverse, and controllable speech synthesis. Glow-TTS obtains an order-of-magnitude speed-up over the autoregressive model, Tacotron 2, at synthesis with comparable speech quality. We further show that our model can be easily extended to a multi-speaker setting.
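The alignment search mentioned in the abstract can be illustrated with a small dynamic program. The following is a minimal sketch, not the authors' implementation: it assumes a precomputed table `log_probs[i][j]` holding the log-likelihood of latent frame `j` under the prior of text token `i`, and the function name `monotonic_alignment_search` is hypothetical. It fills a table of best cumulative scores where each frame either stays on the current token or advances to the next one, then backtracks to recover the most probable monotonic path.

```python
import math


def monotonic_alignment_search(log_probs):
    """Return path[j] = index of the token aligned to frame j.

    log_probs[i][j] is assumed to be the log-likelihood of frame j
    under token i (an illustrative input, not taken from the paper's code).
    The path is monotonic and uses every token at least once.
    """
    n_tokens = len(log_probs)
    n_frames = len(log_probs[0])
    if n_tokens > n_frames:
        raise ValueError("need at least one frame per token")

    neg_inf = -math.inf
    # q[i][j]: best cumulative score for aligning frames 0..j to tokens 0..i
    q = [[neg_inf] * n_frames for _ in range(n_tokens)]
    q[0][0] = log_probs[0][0]
    for j in range(1, n_frames):
        # token i cannot be reached before frame i
        for i in range(min(j + 1, n_tokens)):
            stay = q[i][j - 1]
            advance = q[i - 1][j - 1] if i > 0 else neg_inf
            q[i][j] = max(stay, advance) + log_probs[i][j]

    # Backtrack from the last token at the last frame.
    path = [0] * n_frames
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        path[j] = i
        if j > 0 and i > 0 and (i == j or q[i - 1][j - 1] > q[i][j - 1]):
            i -= 1
    return path


example = [
    [0.0, 0.0, -5.0, -5.0],  # token 0 fits the first two frames
    [-5.0, -5.0, 0.0, 0.0],  # token 1 fits the last two frames
]
alignment = monotonic_alignment_search(example)
```

In this toy example the first two frames are assigned to token 0 and the last two to token 1. The number of frames assigned to each token is what a duration predictor would be trained to reproduce.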
Quality Improvement after Paper Submission*
This result was not included in the paper. Recently, we found that two modifications improve the synthesis quality of Glow-TTS: 1) switching the vocoder to HiFi-GAN (https://arxiv.org/abs/2010.05646) to reduce noise, and 2) inserting a blank token between every two input tokens to improve pronunciation. Specifically, we used the vocoder fine-tuned with Tacotron 2, which is provided as a pretrained model in the repo (https://github.com/jik876/hifi-gan). If you're interested, please listen to the three samples below.
Abstract
Glow-TTS*
But Mrs. Solomons could not resist the temptation to dabble in stolen goods, and she was found shipping watches of the wrong category to New York.
Glow-TTS*
When Mr. and Mrs. Dursley woke up on the dull, gray Tuesday our story starts, there was nothing about the cloudy sky outside to suggest that strange and mysterious things would soon be happening all over the country. Mr. Dursley hummed as he picked out his most boring tie for work, and Mrs. Dursley gossiped away happily as she wrestled a screaming Dudley into his high chair.
Glow-TTS*
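The blank-token modification described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used to produce the samples: the helper name `intersperse_blank` and the choice of `0` as the blank token ID are hypothetical, and an actual implementation may also add blanks at the start and end of the sequence.

```python
def intersperse_blank(tokens, blank_id=0):
    """Place blank_id between every pair of adjacent input tokens.

    blank_id=0 is an assumed placeholder; the real ID depends on the
    symbol table of the text front end.
    """
    if not tokens:
        return []
    out = [blank_id] * (2 * len(tokens) - 1)
    out[0::2] = tokens  # original tokens occupy the even positions
    return out


padded = intersperse_blank([5, 7, 9])
```

Here a three-token input becomes a five-token input, with the original tokens at even positions and blanks at odd positions.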
Single Speaker TTS
But Mrs. Solomons could not resist the temptation to dabble in stolen goods, and she was found shipping watches of the wrong category to New York.
GT
GT(WaveGlow)
Glow-TTS (T=0.333)
Glow-TTS (T=0.5)
Glow-TTS (T=0.667)
Tacotron 2
The route chosen from the airport to Main Street was the normal one, except where Harwood Street was selected as the means of access to Main Street.
GT
GT(WaveGlow)
Glow-TTS (T=0.333)
Glow-TTS (T=0.5)
Glow-TTS (T=0.667)
Tacotron 2
Mr. Buxton's friends at once paid the forty shillings, and the boy was released.
GT
GT(WaveGlow)
Glow-TTS (T=0.333)
Glow-TTS (T=0.5)
Glow-TTS (T=0.667)
Tacotron 2
perhaps the tales that travelers told him were exaggerated as travelers' tales are likely to be,
GT
GT(WaveGlow)
Glow-TTS (T=0.333)
Glow-TTS (T=0.5)
Glow-TTS (T=0.667)
Tacotron 2
Diversity
at a distance from the prison.
Same Gaussian noise ε, different temperature T=0.1
Same Gaussian noise ε, different temperature T=0.333
Same Gaussian noise ε, different temperature T=0.667
Same Gaussian noise ε, different temperature T=1.0
Different Gaussian noise ε1, same temperature T=0.667
Different Gaussian noise ε2, same temperature T=0.667
Different Gaussian noise ε3, same temperature T=0.667
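The two axes of variation in the samples above (noise ε and temperature T) can be illustrated as follows. This is a hedged sketch that assumes the latent is drawn as mean + T · std · ε from a Gaussian prior, with T scaling the standard deviation; the function names, the seeds, and the toy `mean` and `std` values are all hypothetical and stand in for what a text encoder would predict.

```python
import random


def draw_noise(n, seed):
    """Draw n standard-normal values; the seed picks which ε is used."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def sample_latent(mean, std, temperature, noise):
    """Assumed sampling rule: z = mean + temperature * std * noise."""
    return [m + temperature * s * e for m, s, e in zip(mean, std, noise)]


mean = [1.0, 2.0]  # toy prior statistics (illustrative values only)
std = [0.5, 0.5]

# Same noise, different temperatures: deviation from the mean scales with T.
eps = draw_noise(len(mean), seed=0)
z_by_temperature = {t: sample_latent(mean, std, t, eps) for t in (0.1, 0.333, 0.667, 1.0)}

# Different noise, same temperature: a different point at the same spread.
z_by_noise = [sample_latent(mean, std, 0.667, draw_noise(len(mean), seed=s)) for s in (1, 2, 3)]
```

Under this assumption, fixing ε and raising T moves the latent further from the prior mean along the same direction, while fixing T and changing ε moves it in a different direction at a comparable distance.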
Length Robustness
When Mr. and Mrs. Dursley woke up on the dull, gray Tuesday our story starts, there was nothing about the cloudy sky outside to suggest that strange and mysterious things would soon be happening all over the country. Mr. Dursley hummed as he picked out his most boring tie for work, and Mrs. Dursley gossiped away happily as she wrestled a screaming Dudley into his high chair.
Glow-TTS
Tacotron 2
"Sorry," he grunted, as the tiny old man stumbled and almost fell. It was a few seconds before Mr. Dursley realized that the man was wearing a violet cloak. He didn't seem at all upset at being almost knocked to the ground. On the contrary, his face split into a wide smile and he said in a squeaky voice that made passersby stare, "Don't be sorry, my dear sir, for nothing could upset me today! Rejoice, for You-Know-Who has gone at last! Even Muggles like yourself should be celebrating, this happy, happy day!"
Glow-TTS
Tacotron 2
If the motorcycle was huge, it was nothing to the man sitting astride it. He was almost twice as tall as a normal man and at least five times as wide. He looked simply too big to be allowed, and so wild - long tangles of bushy black hair and beard hid most of his face, he had hands the size of trash can lids, and his feet in their leather boots were like baby dolphins. In his vast, muscular arms he was holding a bundle of blankets.
Glow-TTS
Tacotron 2
Length Controllability
Peter Piper picked a peck of pickled peppers. How many pickled peppers did Peter Piper pick?
Predicted Duration x 0.50
Predicted Duration x 0.75
Predicted Duration x 1.00
Predicted Duration x 1.25
Predicted Duration x 1.50
For a while the preacher addresses himself to the congregation at large, who listen attentively
Predicted Duration x 0.50
Predicted Duration x 0.75
Predicted Duration x 1.00
Predicted Duration x 1.25
Predicted Duration x 1.50
The nature of the protective assignment
Predicted Duration x 0.50
Predicted Duration x 0.75
Predicted Duration x 1.00
Predicted Duration x 1.25
Predicted Duration x 1.50
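The duration multipliers listed above can be read as a scale factor applied to each token's predicted duration before the token sequence is expanded into frames. The sketch below illustrates that idea under stated assumptions: the function names are hypothetical, the toy durations are made up, and rounding up with `math.ceil` is an assumption about how fractional frame counts are handled.

```python
import math


def scale_durations(predicted, length_scale):
    """Multiply each predicted duration (in frames) and round up to an integer."""
    return [math.ceil(d * length_scale) for d in predicted]


def expand(token_ids, durations):
    """Repeat each token ID for its number of frames."""
    frames = []
    for token, duration in zip(token_ids, durations):
        frames.extend([token] * duration)
    return frames


predicted = [2.0, 3.0]  # toy per-token durations, illustrative only
fast = scale_durations(predicted, 0.50)
slow = scale_durations(predicted, 1.50)
fast_frames = expand([7, 8], fast)
```

With these toy values, a factor of 0.50 yields fewer frames per token (faster speech) and a factor of 1.50 yields more (slower speech), while the token order is unchanged.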
More examples of Controllability & Diversity
I noticed when I went out that the light was on, end quote,
Noise ε0, Temperature T=0.1
Noise ε0, Temperature T=0.333
Noise ε0, Temperature T=0.667
Noise ε0, Temperature T=1.0
Noise ε0, Temperature T=0.667
Noise ε1, Temperature T=0.667
Noise ε2, Temperature T=0.667
Noise ε3, Temperature T=0.667
On the seventh July, eighteen thirty-seven,
Noise ε0, Temperature T=0.1
Noise ε0, Temperature T=0.333
Noise ε0, Temperature T=0.667
Noise ε0, Temperature T=1.0
Noise ε0, Temperature T=0.667
Noise ε1, Temperature T=0.667
Noise ε2, Temperature T=0.667
Noise ε3, Temperature T=0.667
contracted with sheriffs and conveners to work by the job.
Noise ε0, Temperature T=0.1
Noise ε0, Temperature T=0.333
Noise ε0, Temperature T=0.667
Noise ε0, Temperature T=1.0
Noise ε0, Temperature T=0.667
Noise ε1, Temperature T=0.667
Noise ε2, Temperature T=0.667
Noise ε3, Temperature T=0.667
eleven. If I am alive and taken prisoner,
Noise ε0, Temperature T=0.1
Noise ε0, Temperature T=0.333
Noise ε0, Temperature T=0.667
Noise ε0, Temperature T=1.0
Noise ε0, Temperature T=0.667
Noise ε1, Temperature T=0.667
Noise ε2, Temperature T=0.667
Noise ε3, Temperature T=0.667
He was never satisfied with anything.
Noise ε0, Temperature T=0.1
Noise ε0, Temperature T=0.333
Noise ε0, Temperature T=0.667
Noise ε0, Temperature T=1.0
Noise ε0, Temperature T=0.667
Noise ε1, Temperature T=0.667
Noise ε2, Temperature T=0.667
Noise ε3, Temperature T=0.667
Multi Speaker TTS
When terminating below the ankles it was held down by a slender strap passing under the foot.
GT
GT(WaveGlow)
Glow-TTS (T=0.333)
Glow-TTS (T=0.5)
Glow-TTS (T=0.667)
Tacotron 2
He was at the head of his class in rhetoric.
GT
GT(WaveGlow)
Glow-TTS (T=0.333)
Glow-TTS (T=0.5)
Glow-TTS (T=0.667)
Tacotron 2
And he looks hungry enough."
GT
GT(WaveGlow)
Glow-TTS (T=0.333)
Glow-TTS (T=0.5)
Glow-TTS (T=0.667)
Tacotron 2
Speaker Dependent Duration
Some have accepted this as a miracle without any physical explanation
LibriTTS 103
LibriTTS 7178
LibriTTS 229
LibriTTS 2002
Scientists at the CERN laboratory say they have discovered a new particle.
LibriTTS 103
LibriTTS 7178
LibriTTS 229
LibriTTS 2002
Voice Conversion
Samples on the diagonal are ground-truth mel-spectrograms reconstructed through WaveGlow.