ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context
Abstract
Convolutional neural networks (CNNs) have shown promising results for end-to-end speech recognition, albeit still behind RNN/transformer-based models in performance. In this paper, we study how to bridge this gap and go beyond with a novel CNN-RNN-transducer architecture, which we call ContextNet. ContextNet features a fully convolutional encoder that incorporates global context information into convolution layers by adding squeeze-and-excitation modules. In addition, we propose a simple scaling method on the widths of ContextNet that achieves a good trade-off between computation and accuracy. We demonstrate that on the widely used LibriSpeech benchmark, ContextNet achieves a word error rate (WER) of 2.1%/4.6% without an external language model (LM), 1.9%/4.1% with an LM, and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets. This compares to the best previously published result of 2.0%/4.6% with an LM and 3.9%/11.3% with 20M parameters. The superiority of the proposed ContextNet model is also verified on a much larger internal dataset.
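As a rough illustration of how a squeeze-and-excitation module can feed global context back into a 1D convolutional encoder, here is a minimal PyTorch sketch. The class name, reduction ratio, and layer sizes are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SqueezeExcite1d(nn.Module):
    """Minimal squeeze-and-excitation block for (batch, channel, time) features.

    Sketch only: the reduction ratio and layer sizes are assumptions,
    not the exact ContextNet configuration.
    """

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        context = x.mean(dim=-1)          # squeeze: global average over time
        weights = self.fc(context)        # excite: per-channel gating weights
        return x * weights.unsqueeze(-1)  # rescale each channel by its weight


# Usage: gate the output of a 1D conv layer with global context.
feats = torch.randn(4, 256, 100)     # (batch, channels, frames)
gated = SqueezeExcite1d(256)(feats)  # same shape, globally reweighted
```

Because the gating weights are computed from a pool over the entire utterance, every frame's output is modulated by global context, which is how a purely convolutional encoder with limited receptive fields can still exploit utterance-level information.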