Audio Samples from "Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech"

Abstract: Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural-sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on LJ Speech, a single-speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.
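
As a rough illustration of how a MOS figure like the one mentioned in the abstract is commonly summarized, the sketch below averages listener ratings and attaches an approximate 95% confidence interval using a normal approximation. The function name `mos_with_ci` and the ratings are hypothetical and not taken from the paper, whose evaluation protocol may differ.

```python
import math
from statistics import mean, stdev


def mos_with_ci(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    Assumes independent ratings and a normal approximation; this is an
    illustrative convention, not the paper's documented procedure.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mos = mean(ratings)
    # Standard error of the mean, scaled by the normal quantile z.
    half_width = z * stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical listener ratings on a 1-5 naturalness scale.
ratings = [5, 4, 4, 5, 3, 4, 5, 4]
mos, ci = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With real listening-test data, each system in the tables below would get its own list of ratings, and overlapping intervals would suggest that two systems are not clearly separated.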

Single Speaker (LJ Speech Dataset)
Multi-Speaker (VCTK Dataset)
Voice Conversion
Speech Variation
Ablation Study

Single Speaker (LJ Speech Dataset)

Text	that not more than one bottle of wine or one quart of beer could be issued at one time. No account was taken of the amount of liquors admitted in one day,	The prisoner had nothing to deal with but wooden panels, and by dint of cutting and chopping he got both the lower panels out.	have now come into general use and are obviously a great improvement on the ordinary "modern style" in use in England, which is in fact the Bodoni type	At two:thirty-eight p.m., Eastern Standard Time, Lyndon Baines Johnson took the oath of office as the thirty-sixth President of the United States.	The boy declared he saw no one, and accordingly passed through without paying the toll of a penny.
Ground Truth
Tacotron 2 + HiFi-GAN
Tacotron 2 + HiFi-GAN (fine-tuned)
Glow-TTS + HiFi-GAN
Glow-TTS + HiFi-GAN (fine-tuned)
VITS (DDP)
VITS

Multi-Speaker (VCTK Dataset)

Text	The teacher would have approved.	The rainbow is a division of white light into many beautiful colors.	There was great support all round the route.	Brown is an interesting man, but he is not desperate.	Military action is the only option we have on the table today.
Ground Truth
Tacotron 2 + HiFi-GAN
Tacotron 2 + HiFi-GAN (fine-tuned)
Glow-TTS + HiFi-GAN
Glow-TTS + HiFi-GAN (fine-tuned)
VITS

Voice Conversion

From\To	VCTK 260	VCTK 287	VCTK 247	VCTK 330	VCTK 310
VCTK 260
VCTK 287
VCTK 247
VCTK 330
VCTK 310

Speech Variation

Text	How much variation is there?

VITS
Tacotron 2 + HiFi-GAN (fine-tuned)
Glow-TTS + HiFi-GAN (fine-tuned)
VITS (multi-speaker)

Ablation Study

Text	that not more than one bottle of wine or one quart of beer could be issued at one time. No account was taken of the amount of liquors admitted in one day,	The prisoner had nothing to deal with but wooden panels, and by dint of cutting and chopping he got both the lower panels out.	have now come into general use and are obviously a great improvement on the ordinary "modern style" in use in England, which is in fact the Bodoni type	At two:thirty-eight p.m., Eastern Standard Time, Lyndon Baines Johnson took the oath of office as the thirty-sixth President of the United States.	The boy declared he saw no one, and accordingly passed through without paying the toll of a penny.
Ground Truth
VITS (300k training)
w/o Normalizing Flow (300k training)
w/ Mel-Spectrogram (300k training)
