Multistate Encoding with End-To-End Speech RNN Transducer Network
Abstract
Recurrent Neural Network Transducer (RNN-T) models [1] for automatic speech recognition (ASR) provide high recognition accuracy. Such end-to-end (E2E) models combine the acoustic, pronunciation, and language models (AM, PM, LM) of a conventional ASR system into a single neural network, dramatically reducing complexity and model size. In this paper, we propose a technique for incorporating contextual signals, such as intelligent assistant device state or dialog state, directly into RNN-T models. We explore different encoding methods and demonstrate that RNN-T models can effectively utilize such context. Our technique yields a relative Word Error Rate (WER) reduction of up to 10.4% on a variety of contextual recognition tasks. We also demonstrate that proper regularization can be used to model context independently for improved overall quality.
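The abstract mentions encoding contextual signals directly into the model but does not detail the mechanism here. A minimal sketch of one plausible encoding method is a one-hot context vector concatenated to every acoustic frame fed to the RNN-T encoder; the state names, dimensions, and function names below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical set of device/dialog states used as contextual signals.
CONTEXT_STATES = ["none", "media_playing", "timer_set", "dialog_followup"]

def encode_context(state: str) -> np.ndarray:
    """One-hot encode a context state (all zeros for unknown states)."""
    vec = np.zeros(len(CONTEXT_STATES), dtype=np.float32)
    if state in CONTEXT_STATES:
        vec[CONTEXT_STATES.index(state)] = 1.0
    return vec

def augment_features(frames: np.ndarray, state: str) -> np.ndarray:
    """Concatenate the context one-hot vector to every acoustic frame,
    so the encoder sees the context signal at each time step."""
    ctx = encode_context(state)
    tiled = np.tile(ctx, (frames.shape[0], 1))  # shape (T, num_states)
    return np.concatenate([frames, tiled], axis=1)

# Example: 100 frames of 80-dim log-mel features plus a 4-dim context vector.
frames = np.random.randn(100, 80).astype(np.float32)
augmented = augment_features(frames, "timer_set")
```

An alternative design, also consistent with the abstract's "different encoding methods", would be to learn an embedding per context state instead of a fixed one-hot vector, letting the network place related states near each other.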