A Multi-Genre Urdu Broadcast Speech Recognition System
Abstract
This paper reports the development of a multi-genre Urdu Broadcast (BC) speech corpus and a Large Vocabulary Continuous Speech Recognition (LVCSR) system. A BC speech corpus of 98 hours from 453 speakers is collected and annotated. For acoustic modeling, a Time-delay Neural Network (TDNN) is trained on alignments obtained from prior Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) training. For language modeling, 3-gram, 4-gram, and Recurrent Neural Network (RNN) based models are developed on a text corpus of 188 million words. The developed models are tested on 4.3 hours of unseen multi-genre BC speech, and the best Word Error Rate (WER) of 18.59% is achieved using the RNN-based Language Model (LM). Moreover, a detailed word error analysis is carried out to compare the errors made by humans with those made by the Automatic Speech Recognition (ASR) system. The results show that humans and the ASR system misrecognize words in similar ways.
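For readers unfamiliar with the metric quoted above, WER is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition. The function name `word_error_rate` is hypothetical, whitespace tokenization is an assumption, and the scoring tool and text normalization actually used for the reported 18.59% are not stated in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumes whitespace tokenization; real evaluations usually apply
    language-specific text normalization first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors while the denominator is the reference length, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.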