Towards Fast and Accurate Streaming End-To-End ASR
Abstract
End-to-end (E2E) models fold the acoustic, pronunciation, and language models of a conventional speech recognition system into a single neural network with far fewer parameters, making them suitable for on-device applications. For example, the recurrent neural network transducer (RNN-T), a streaming E2E model, has shown promising potential for on-device ASR [1]. For such applications, quality and latency are two critical factors. We propose to reduce the E2E model's latency by extending the RNN-T endpointer (RNN-T EP) model [2] with additional early and late penalties. By further applying the minimum word error rate (MWER) training technique [3], we achieve an 8.0% relative word error rate (WER) reduction and a 130 ms 90th-percentile latency reduction over [2] on a Voice Search test set. We also experiment with a second-pass Listen, Attend and Spell (LAS) rescorer [4]. Although it does not directly improve first-pass latency, the large WER reduction provides extra room to trade WER for latency. RNN-T EP + LAS, together with MWER training, brings an 18.7% relative WER reduction and a 160 ms 90th-percentile latency reduction compared to the originally proposed RNN-T EP model [2].
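The early and late penalties mentioned above discourage the model from emitting the end-of-sentence token too far before or after the reference end of speech. As a minimal illustrative sketch (the function name, buffer size, and penalty scales below are assumptions, not the paper's exact formulation), such a scheme can be expressed as an additive term on the log-probability of emitting `</s>` at a given frame:

```python
def eos_penalty(t, t_eos, t_buffer=2, alpha_early=1.0, alpha_late=1.0):
    """Hedged sketch of an early/late endpointer penalty.

    Returns an additive penalty (in log-probability space) for emitting
    the end-of-sentence token </s> at frame t, given the reference
    end-of-speech frame t_eos. Emitting before t_eos incurs an early
    penalty; emitting more than t_buffer frames after t_eos incurs a
    late penalty that grows per frame. All names and scales here are
    illustrative assumptions.
    """
    if t < t_eos:
        # Early emission: penalty grows with distance before t_eos.
        return -alpha_early * (t_eos - t)
    if t > t_eos + t_buffer:
        # Late emission: penalty grows once past the grace buffer.
        return -alpha_late * (t - t_eos - t_buffer)
    return 0.0  # Within the acceptable window: no penalty.
```

In training, a term like this would be added to the per-frame `</s>` log-posterior, so the model learns to close the hypothesis near the true utterance end, which is what drives the latency reductions reported above.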