Deep Context: End-to-end Contextual Speech Recognition
Top 10% of 2018 papers by citations
Abstract
In automatic speech recognition (ASR), what a user says depends on the particular context she is in. Typically, this context is represented as a set of word n-grams. In this work, we present a novel, all-neural, end-to-end (E2E) ASR system that utilizes such context. Our approach, which we refer to as Contextual Listen, Attend and Spell (CLAS), jointly optimizes the ASR components along with embeddings of the context n-grams. During inference, the CLAS system can be presented with context phrases which might contain out-of-vocabulary (OOV) terms not seen during training. We compare our proposed system to a more traditional contextualization approach, which performs shallow fusion between independently trained LAS and contextual n-gram models during beam search. Across a number of tasks, we find that the proposed CLAS system outperforms the baseline method by as much as 68% relative WER, indicating the advantage of joint optimization over individually trained components.
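As a rough illustration of the shallow-fusion baseline the abstract contrasts with CLAS, beam-search hypotheses can be rescored by interpolating the E2E model's log-probability with a score from an external contextual model. The interpolation weight `lam`, the helper names, and the toy scores below are illustrative assumptions, not values or code from the paper:

```python
import math

def shallow_fusion_score(las_log_prob: float,
                         context_log_prob: float,
                         lam: float) -> float:
    """Interpolate the LAS score with an external contextual biasing score.

    This is the generic shallow-fusion form: the two models are trained
    independently and only combined at decode time via the weight `lam`.
    """
    return las_log_prob + lam * context_log_prob

def rescore_beam(hypotheses, lam):
    """hypotheses: list of (text, las_log_prob, context_log_prob) tuples.

    Returns the hypothesis text with the highest fused score.
    """
    return max(hypotheses,
               key=lambda h: shallow_fusion_score(h[1], h[2], lam))[0]

# Toy beam: the contextual model strongly favors "call joan" because
# "joan" appears among the user's context phrases (e.g., a contact name).
beam = [
    ("call joe", math.log(0.60), math.log(0.01)),   # no context match
    ("call joan", math.log(0.35), math.log(0.90)),  # matches a context phrase
]

print(rescore_beam(beam, lam=0.0))  # biasing off: acoustically likelier "call joe"
print(rescore_beam(beam, lam=0.5))  # biasing on: context flips the choice to "call joan"
```

The key limitation this sketch makes visible, and which motivates CLAS, is that `lam` must be tuned by hand and the two models never see each other during training; CLAS instead learns to attend to embedded context phrases jointly with the rest of the network.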