Leveraging Language ID in Multilingual End-to-End Speech Recognition
Abstract
Recent advances in end-to-end speech recognition have made it possible to build multilingual models capable of recognizing speech in multiple languages. Multilingual models can outperform their monolingual counterparts, depending on the amount of training data and the relatedness of languages. However, in some cases, these models rely on having perfect knowledge of the language being spoken; that is, they expect to be provided with an external language ID that augments the input features or modulates internal layers of the network. In this paper, we introduce a novel technique for inferring the language ID in a streaming fashion using RNN-T, and a novel loss function that pressures the model to identify the language after as few frames as possible. The output of this streaming language-ID model is used in training and inference of a multilingual recognition model. We show the effectiveness of our approach through experiments on two sets of languages, one consisting of different dialects of Arabic, and the other consisting of Nordic languages, Finnish and Dutch.
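The abstract mentions an external language ID that "augments the input features". One common way to do this is to append a one-hot language vector to every acoustic frame. The sketch below illustrates that general idea only; it is not the paper's implementation, and the function names, the language list, and the toy feature values are all hypothetical.

```python
from typing import List, Sequence


def one_hot(index: int, size: int) -> List[float]:
    """Return a one-hot vector of length `size` with a 1.0 at `index`."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} out of range for size {size}")
    return [1.0 if i == index else 0.0 for i in range(size)]


def append_language_id(
    frames: Sequence[Sequence[float]],
    lang_index: int,
    num_languages: int,
) -> List[List[float]]:
    """Concatenate the same one-hot language vector onto every frame.

    Assumption: the language is fixed for the whole utterance, so each
    frame receives an identical language-ID suffix.
    """
    lang_vec = one_hot(lang_index, num_languages)
    return [list(frame) + lang_vec for frame in frames]


# Hypothetical language inventory and toy 2-dimensional "acoustic" frames.
LANGUAGES = ["lang_a", "lang_b", "lang_c"]
frames = [[0.1, 0.2], [0.3, 0.4]]

augmented = append_language_id(
    frames, LANGUAGES.index("lang_b"), len(LANGUAGES)
)
```

Each frame grows from 2 to 5 dimensions here. In a streaming setting such as the one the abstract describes, the language vector would come from a model's running prediction rather than from a known label, so it could differ from frame to frame.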