Towards Robust Speech Emotion Recognition Using Deep Residual Networks for Speech Enhancement
Abstract
The use of deep learning (DL) architectures for speech enhancement has recently improved the robustness of voice applications under diverse noise conditions. These improvements are usually evaluated based on the perceptual quality of the enhanced audio or on the performance of automatic speech recognition (ASR) systems. We are interested instead in the usefulness of these algorithms in the field of speech emotion recognition (SER), and specifically in whether an enhancement architecture can effectively remove noise while preserving enough information for an SER algorithm to accurately identify emotion in speech. We first show how a scalable DL architecture can be trained to enhance audio signals in a large number of unseen environments, and go on to show how that can benefit common SER pipelines in terms of noise robustness. Our results show that incorporating a speech enhancement architecture is beneficial, especially for low signal-to-noise ratio (SNR) conditions.
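To make the notion of an SNR condition concrete, the following is a minimal sketch of how clean speech and noise could be mixed at a chosen SNR in dB. It is an illustration only: the abstract does not describe how noisy data were constructed, and the names `mean_power`, `mix_at_snr`, and `measured_snr_db` are hypothetical helpers, not part of the paper or of any library. It relies on the standard definition SNR_dB = 10 log10(P_speech / P_noise).

```python
import math
import random


def mean_power(signal):
    """Average power of a signal given as a list of float samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech + noise has the requested SNR (dB).

    Hypothetical helper for illustration; not the paper's pipeline.
    Returns (mixture, scaled_noise).
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Loop the noise so it covers the whole utterance.
    noise = [noise[i % len(noise)] for i in range(len(speech))]
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_noise == 0.0:
        raise ValueError("noise has zero power")
    # Solve 10*log10(p_speech / (gain^2 * p_noise)) = snr_db for gain.
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    scaled = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled)]
    return mixture, scaled


def measured_snr_db(speech, noise):
    """SNR in dB computed from the separate speech and noise components."""
    return 10.0 * math.log10(mean_power(speech) / mean_power(noise))


# Toy stand-ins for real audio: a 220 Hz tone as "speech", Gaussian noise.
rng = random.Random(0)
speech = [math.sin(2.0 * math.pi * 220.0 * t / 16000.0) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(4000)]

for target in (-5.0, 0.0, 10.0):
    mixture, scaled = mix_at_snr(speech, noise, target)
    print(target, round(measured_snr_db(speech, scaled), 6))
```

Lower targets correspond to louder noise relative to the speech, which is the low-SNR regime the abstract singles out as benefiting most from enhancement.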