PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS
Abstract
This paper introduces PnG BERT, a new encoder model for neural TTS. This model is augmented from the original BERT model by taking both phoneme and grapheme representations of text as input, as well as the word-level alignment between them. It can be pre-trained on a large text corpus in a self-supervised manner, and fine-tuned in a TTS task. Experimental results show that a neural TTS model using a pre-trained PnG BERT as its encoder yields more natural prosody and more accurate pronunciation than a baseline model using only phoneme input with no pre-training. Subjective side-by-side preference evaluations show that raters have no statistically significant preference between the speech synthesized using PnG BERT and ground-truth recordings from professional speakers.
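The abstract's core idea is that BERT receives both a phoneme and a grapheme rendering of the same text, with a word-level alignment between the two. The sketch below is a minimal, hypothetical illustration of how such an input could be assembled, not the authors' code: it assumes BERT-style [CLS]/[SEP] tokens, one segment ID per sub-sequence, and shared word-position indices as the alignment signal. The function name, token values, and ID scheme are all assumptions for illustration.

```python
from typing import List, Tuple


def build_png_bert_input(
    words: List[Tuple[List[str], List[str]]],
) -> Tuple[List[str], List[int], List[int]]:
    """Assemble a PnG-BERT-style input sequence (illustrative sketch).

    Each entry in `words` pairs one word's phonemes with its graphemes.
    Returns (tokens, segment_ids, word_positions): segment 0 marks the
    phoneme sub-sequence, segment 1 the grapheme sub-sequence, and tokens
    belonging to the same word share a word position in both segments,
    which is one plausible way to encode the word-level alignment.
    """
    tokens, segments, word_pos = ["[CLS]"], [0], [0]

    # Phoneme sub-sequence first: one shared word position per source word.
    for i, (phonemes, _) in enumerate(words, start=1):
        for p in phonemes:
            tokens.append(p)
            segments.append(0)
            word_pos.append(i)
    tokens.append("[SEP]")
    segments.append(0)
    word_pos.append(len(words) + 1)

    # Grapheme sub-sequence second, reusing the same word positions so the
    # model can relate the two representations of each word.
    for i, (_, graphemes) in enumerate(words, start=1):
        for g in graphemes:
            tokens.append(g)
            segments.append(1)
            word_pos.append(i)
    tokens.append("[SEP]")
    segments.append(1)
    word_pos.append(len(words) + 1)

    return tokens, segments, word_pos


if __name__ == "__main__":
    # Toy example with CMU-style phonemes and character graphemes (assumed).
    sent = [
        (["HH", "AH0", "L", "OW1"], ["h", "e", "l", "l", "o"]),
        (["W", "ER1", "L", "D"], ["w", "o", "r", "l", "d"]),
    ]
    for token, seg, pos in zip(*build_png_bert_input(sent)):
        print(token, seg, pos)
```

Under these assumptions, pre-training would run masked-language-model training over such sequences on plain text, and fine-tuning would plug the resulting encoder into the TTS model, as the abstract describes.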