Hierarchical Generative Modeling for Controllable Speech Synthesis
Abstract
This paper proposes a neural sequence-to-sequence text-to-speech (TTS) model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions. The model is formulated as a conditional generative model based on the variational autoencoder (VAE) framework, with two levels of hierarchical latent variables. The first level is a categorical variable, which represents attribute groups (e.g. clean/noisy) and provides interpretability. The second level, conditioned on the first, is a multivariate Gaussian variable, which characterizes specific attribute configurations (e.g. noise level, speaking rate) and enables disentangled fine-grained control over these attributes. This amounts to using a Gaussian mixture model (GMM) for the latent distribution. Extensive evaluation demonstrates the model's ability to control the aforementioned attributes. In particular, we train a high-quality controllable TTS model on real found data, which is capable of inferring speaker and style attributes from a noisy utterance and using them to synthesize clean speech with controllable speaking style.
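The two-level latent structure described in the abstract can be illustrated with a small sketch. It is not the paper's implementation: it only shows, under assumed values, how a categorical variable y followed by a Gaussian variable z conditioned on y gives a GMM prior over z. The function names, the diagonal covariance, and the two-component, two-dimensional parameters are all hypothetical choices made for illustration.

```python
import math
import random

def sample_latent(weights, means, stds, rng):
    """Draw (y, z): y ~ Categorical(weights), z | y ~ N(means[y], diag(stds[y]^2))."""
    y = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    z = [rng.gauss(m, s) for m, s in zip(means[y], stds[y])]
    return y, z

def log_prior(z, weights, means, stds):
    """Log density of z under the mixture: log sum_y p(y) N(z; mu_y, diag(sigma_y^2))."""
    log_terms = []
    for w, mu, sd in zip(weights, means, stds):
        # Diagonal Gaussian log density, summed over dimensions.
        log_n = sum(
            -0.5 * math.log(2 * math.pi) - math.log(s) - 0.5 * ((x - m) / s) ** 2
            for x, m, s in zip(z, mu, sd)
        )
        log_terms.append(math.log(w) + log_n)
    # Log-sum-exp over components for numerical stability.
    top = max(log_terms)
    return top + math.log(sum(math.exp(t - top) for t in log_terms))

# Hypothetical prior: two attribute groups, two latent dimensions.
# These numbers are illustrative only and do not come from the paper.
WEIGHTS = [0.5, 0.5]
MEANS = [[-2.0, 0.0], [2.0, 0.0]]
STDS = [[0.5, 1.0], [0.5, 1.0]]

rng = random.Random(0)
y, z = sample_latent(WEIGHTS, MEANS, STDS, rng)
```

In this toy setting, choosing y picks an attribute group, and moving z along one dimension around that component's mean stands in for the fine-grained control the abstract describes. How the actual model maps z to speech is outside the scope of the sketch.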