Improving Audio-visual Speech Recognition Performance with Cross-modal Student-teacher Training
Abstract
In this paper, we propose a cross-modal student-teacher learning framework that makes full use of abundant external acoustic data, in addition to a given task-specific audio-visual training database, to improve speech recognition performance under low signal-to-noise-ratio (SNR) and acoustically mismatched conditions. First, a teacher model is trained on large audio-only databases. Next, a student deep neural network (DNN) model is trained on a small audio-visual database to minimize the Kullback-Leibler (KL) divergence between its output and the posterior distribution of the teacher. We evaluate the proposed approach on phone recognition with the NTCD-TIMIT database under both matched and mismatched acoustic conditions. Compared to a DNN recognition system trained on the original audio-visual data only, the proposed solution reduces the phone error rate (PER) from 26.7% to 21.3% in the matched acoustic scenario. Under mismatched conditions, the PER is reduced from 47.9% to 42.9%. Moreover, we show that the posteriors generated by the teacher carry environmental information, which allows the proposed student-teacher learning to act as a form of environment-aware training; consistent PER reductions are observed across all SNR conditions.
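The core of the framework is a knowledge-distillation objective: the student's audio-visual posterior is pulled toward the frozen audio-only teacher's posterior via a KL-divergence loss. The sketch below illustrates this objective in PyTorch; the feature dimensions, network shapes, and all variable names are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of the cross-modal student-teacher objective described in
# the abstract. Dimensions, layer sizes, and names are assumptions for
# illustration, not values from the paper.
AUDIO_DIM, VISUAL_DIM, NUM_PHONES = 40, 30, 39

teacher = nn.Sequential(  # assumed pre-trained on large audio-only data
    nn.Linear(AUDIO_DIM, 512), nn.ReLU(), nn.Linear(512, NUM_PHONES)
)
student = nn.Sequential(  # trained on the small audio-visual database
    nn.Linear(AUDIO_DIM + VISUAL_DIM, 512), nn.ReLU(),
    nn.Linear(512, NUM_PHONES)
)

def student_teacher_loss(audio, visual):
    # Teacher posterior over phones, computed with gradients disabled
    # so only the student is updated.
    with torch.no_grad():
        teacher_post = F.softmax(teacher(audio), dim=-1)
    # Student sees concatenated audio-visual features.
    student_log_post = F.log_softmax(
        student(torch.cat([audio, visual], dim=-1)), dim=-1
    )
    # KL(teacher || student): F.kl_div expects log-probabilities for the
    # student and probabilities for the teacher target.
    return F.kl_div(student_log_post, teacher_post, reduction="batchmean")

# Usage on a dummy minibatch of 8 frames.
loss = student_teacher_loss(
    torch.randn(8, AUDIO_DIM), torch.randn(8, VISUAL_DIM)
)
loss.backward()
```

Because the teacher is frozen, gradients flow only through the student, so knowledge from the abundant audio-only data is transferred without retraining the teacher on the small audio-visual set.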