The Synthesis Company of San Francisco Mountain Logo

Abstract

This paper proposes a non-parallel voice conversion (VC) method using a variant of the conditional variational autoencoder (VAE) called an auxiliary classifier VAE. The proposed method has two key features. First, it adopts fully convolutional architectures to construct the encoder and decoder networks so that the networks can learn conversion rules that capture the time dependencies in the acoustic feature sequences of source and target speech. Second, it uses information-theoretic regularization for the model training to ensure that the information in the attribute class label will not be lost in the conversion process. With regular conditional VAEs, the encoder and decoder are free to ignore the attribute class label input. This can be problematic since in such a situation, the attribute class label will have little effect on controlling the voice characteristics of input speech at test time. Such situations can be avoided by introducing an auxiliary classifier and training the encoder and decoder so that the attribute classes of the decoder outputs are correctly predicted by the classifier. We also present several ways to convert the feature sequence of input speech using the trained encoder and decoder and compare them in terms of audio quality through objective and subjective evaluations. We confirmed experimentally that the proposed method outperformed baseline non-parallel VC systems and performed comparably to an open-source parallel VC system trained using a parallel corpus in a speaker identity conversion task.

ACVAE-VC: Non-Parallel Voice Conversion With Auxiliary Classifier Variational Autoencoder

Citations Over TimeTop 10% of 2019 papers

Abstract

Related Papers