Joint Endpointing and Decoding with End-to-end Models
Top 10% of 2019 papers
Abstract
The tradeoff between word error rate (WER) and latency is critical for streaming automatic speech recognition (ASR) applications. We want the system to endpoint and close the microphone as quickly as possible, without degrading WER. Conventional ASR systems rely on a separately trained endpointing module, which interacts with the acoustic, pronunciation, and language model (AM, PM, and LM) components and can result in a higher WER or larger latency. In keeping with the all-neural spirit of end-to-end (E2E) models, which fold the AM, PM, and LM into a single neural network, in this work we look at folding the endpointer into this E2E model to assist with the endpointing task. We refer to this jointly optimized model, which performs both recognition and endpointing, as an E2E endpointer. On a large vocabulary Voice Search task, we show that the combination of such an E2E endpointer with a conventional endpointer results in no quality degradation, while reducing latency by more than a factor of 2 compared to using a separate endpointer with the E2E model.
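The combination of the two endpointers described above can be illustrated with a minimal sketch. This is not the paper's implementation; the end-of-query symbol `<eoq>`, the posterior `threshold`, and the decision function are illustrative assumptions for how an E2E endpointer signal might be fused with a conventional one.

```python
# Hedged sketch: fuse an E2E endpointer (an end-of-query token emitted by
# the recognition model) with a conventional, separately trained endpointer.
# All names and thresholds here are assumptions for illustration only.

EOQ = "<eoq>"  # assumed end-of-query symbol in the E2E model's vocabulary


def should_close_mic(step_posteriors, conventional_eos, threshold=0.9):
    """Close the microphone when either the conventional endpointer fires
    or the E2E model assigns high probability to the end-of-query token."""
    e2e_fires = step_posteriors.get(EOQ, 0.0) >= threshold
    return conventional_eos or e2e_fires


# Toy decode loop over per-step token posteriors (posterior dict,
# conventional endpointer decision) for a short utterance.
steps = [
    ({"hello": 0.90, EOQ: 0.01}, False),
    ({"world": 0.60, EOQ: 0.05}, False),
    ({EOQ: 0.95}, False),  # E2E endpointer fires at this step
]
close_at = next(
    i for i, (post, conv) in enumerate(steps) if should_close_mic(post, conv)
)
print(close_at)
```

Taking the earlier of the two signals is what allows the joint system to cut latency: the E2E endpointer can fire before the conventional one detects trailing silence, while the conventional endpointer remains as a fallback.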
Related Papers
- → A Spelling Correction Model for End-to-end Speech Recognition (2019), 140 cited
- → Capitalization Normalization for Language Modeling with an Accurate and Efficient Hierarchical RNN Model (2022), 7 cited
- → Speech recognition experiments using multi-span statistical language models (1999), 6 cited
- → Deep Learning Based Language Modeling for Domain-Specific Speech Recognition (2017), 1 cited