Joint Endpointing and Decoding with End-to-end Models
Top 10% of 2019 papers
Abstract
The tradeoff between word error rate (WER) and latency is critical for streaming automatic speech recognition (ASR) applications. We want the system to endpoint and close the microphone as quickly as possible, without degrading WER. Conventional ASR systems rely on a separately trained endpointing module, which interacts with the acoustic, pronunciation, and language model (AM, PM, and LM) components and can result in a higher WER or larger latency. In keeping with the all-neural spirit of end-to-end (E2E) models, which fold the AM, PM, and LM into a single neural network, in this work we look at folding the endpointer into this E2E model to assist with the endpointing task. We refer to this jointly optimized model, which performs both recognition and endpointing, as an E2E endpointer. On a large vocabulary Voice Search task, we show that the combination of such an E2E endpointer with a conventional endpointer results in no quality degradation, while reducing latency by more than a factor of 2 compared to using a separate endpointer with the E2E model.
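The combination of the two endpointers described above can be illustrated with a minimal sketch. This is not the paper's implementation; the end-of-query symbol `<eoq>`, the posterior `threshold`, and the decision function are illustrative assumptions for how an E2E endpointer signal might be fused with a conventional one.

```python
# Hedged sketch: fuse an E2E endpointer (an end-of-query token emitted by
# the recognition model) with a conventional, separately trained endpointer.
# All names and thresholds here are assumptions for illustration only.

EOQ = "<eoq>"  # assumed end-of-query symbol in the E2E model's vocabulary


def should_close_mic(step_posteriors, conventional_eos, threshold=0.9):
    """Close the microphone when either the conventional endpointer fires
    or the E2E model assigns high probability to the end-of-query token."""
    e2e_fires = step_posteriors.get(EOQ, 0.0) >= threshold
    return conventional_eos or e2e_fires


# Toy decode loop over per-step token posteriors (posterior dict,
# conventional endpointer decision) for a short utterance.
steps = [
    ({"hello": 0.90, EOQ: 0.01}, False),
    ({"world": 0.60, EOQ: 0.05}, False),
    ({EOQ: 0.95}, False),  # E2E endpointer fires at this step
]
close_at = next(
    i for i, (post, conv) in enumerate(steps) if should_close_mic(post, conv)
)
print(close_at)
```

Taking the earlier of the two signals is what allows the joint system to cut latency: the E2E endpointer can fire before the conventional one detects trailing silence, while the conventional endpointer remains as a fallback.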
Related Papers
- → A Spelling Correction Model for End-to-end Speech Recognition (2019), 140 cited
- → Capitalization Normalization for Language Modeling with an Accurate and Efficient Hierarchical RNN Model (2022), 7 cited
- → Speech recognition experiments using multi-span statistical language models (1999), 6 cited
- → Deep Learning Based Language Modeling for Domain-Specific Speech Recognition (2017), 1 cited