Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss
Abstract
In this paper, we present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system. Transformer computation blocks based on self-attention are used to encode both audio and label sequences independently. The activations from the audio and label encoders are combined with a feed-forward layer to compute a probability distribution over the label space for every combination of acoustic frame position and label history. This is similar to the Recurrent Neural Network Transducer (RNN-T) model, which uses RNNs for information encoding instead of Transformer encoders. The model is trained with the RNN-T loss, which is well suited to streaming decoding. We present results on the LibriSpeech dataset showing that limiting the left context for self-attention in the Transformer layers makes decoding computationally tractable for streaming, with only a slight degradation in accuracy. We also show that the full-attention version of our model surpasses state-of-the-art accuracy on the LibriSpeech benchmarks. Finally, our results show that the gap between the full-attention and limited-attention versions of our model can be bridged by attending to a limited number of future frames.
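To make the two ideas in the abstract concrete, here is a minimal NumPy sketch, not the paper's implementation: a hypothetical `joint_distribution` combines audio-encoder activations (one per frame) and label-encoder activations (one per label-history position) through an additive feed-forward layer and a softmax over the label space for every (frame, history) pair, and a hypothetical `limited_context_mask` builds the kind of self-attention mask that restricts each frame to a bounded left context plus a small number of future frames. All names, shapes, and the tanh combination are illustrative assumptions.

```python
import numpy as np

def limited_context_mask(num_frames: int, left_context: int,
                         right_context: int) -> np.ndarray:
    """Boolean mask: frame t may attend to frames in
    [t - left_context, t + right_context]. right_context = 0 gives a
    fully streamable encoder; a small positive value trades latency
    for accuracy, as the abstract describes. (Sketch, not the paper's code.)"""
    idx = np.arange(num_frames)
    offset = idx[None, :] - idx[:, None]  # offset[t, s] = s - t
    return (offset >= -left_context) & (offset <= right_context)

def joint_distribution(audio_act: np.ndarray, label_act: np.ndarray,
                       w_joint: np.ndarray, b_joint: np.ndarray) -> np.ndarray:
    """Combine audio activations (T, D) and label activations (U, D) into
    a (T, U, V) distribution over V output symbols: one softmax per
    (acoustic frame position, label history) pair."""
    combined = np.tanh(audio_act[:, None, :] + label_act[None, :, :])  # (T, U, D)
    logits = combined @ w_joint + b_joint                              # (T, U, V)
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))          # stable softmax
    return exp / exp.sum(axis=-1, keepdims=True)

# Toy shapes: T=6 frames, U=4 history positions, D=8 model dim, V=10 symbols.
rng = np.random.default_rng(0)
T, U, D, V = 6, 4, 8, 10
probs = joint_distribution(rng.normal(size=(T, D)), rng.normal(size=(U, D)),
                           rng.normal(size=(D, V)), np.zeros(V))
mask = limited_context_mask(T, left_context=2, right_context=1)
print(probs.shape)            # (6, 4, 10): a distribution for every (t, u)
print(mask.astype(int))       # banded attention pattern for streaming
```

The (T, U, V) tensor of per-pair distributions is exactly what an RNN-T-style loss consumes; swapping the RNN encoders of the original RNN-T for self-attention blocks with this banded mask is the change the paper studies.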