RawNet: Advanced End-to-End Deep Neural Network Using Raw Waveforms for Text-Independent Speaker Verification
Abstract
Recently, direct modeling of raw waveforms using deep neural networks has been widely studied for a number of tasks in audio domains. In speaker verification, however, utilization of raw waveforms is in its preliminary phase, requiring further investigation. In this study, we explore end-to-end deep neural networks that input raw waveforms to improve various aspects: front-end speaker embedding extraction including model architecture, pre-training scheme, additional objective functions, and back-end classification. Adjustment of model architecture using a pre-training scheme can extract speaker embeddings, giving a significant improvement in performance. Additional objective functions simplify the process of extracting speaker embeddings by merging conventional two-phase processes: extracting utterance-level features such as i-vectors or x-vectors and the feature enhancement phase, e.g., linear discriminant analysis. Effective back-end classification models that suit the proposed speaker embedding are also explored. We propose an end-to-end system that comprises two deep neural networks, one front-end for utterance-level speaker embedding extraction and the other for back-end classification. Experiments conducted on the VoxCeleb1 dataset demonstrate that the proposed model achieves state-of-the-art performance among systems without data augmentation. The proposed system is also comparable to the state-of-the-art x-vector system that adopts data augmentation.
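The two-network structure described in the abstract (a front-end that maps a raw waveform to an utterance-level speaker embedding, and a back-end that classifies a pair of embeddings) can be illustrated with a minimal sketch. This is not the RawNet architecture: the layer types, sizes, random weights, and the function names `frontend_embedding` and `backend_score` are all assumptions chosen only to show how data flows between the two stages.

```python
import numpy as np

rng = np.random.default_rng(0)

def frontend_embedding(waveform, filters, proj):
    """Hypothetical front-end: raw samples -> utterance-level embedding.

    Splits the waveform into non-overlapping frames (a strided 1-D
    convolution), applies ReLU, averages over time, then projects.
    """
    kernel = filters.shape[1]
    n_frames = len(waveform) // kernel
    frames = waveform[: n_frames * kernel].reshape(n_frames, kernel)
    feats = np.maximum(frames @ filters.T, 0.0)   # (n_frames, n_filters)
    pooled = feats.mean(axis=0)                   # temporal average pooling
    return proj @ pooled                          # (emb_dim,)

def backend_score(emb_a, emb_b, w1, w2):
    """Hypothetical back-end: pair of embeddings -> same-speaker score.

    A small MLP on the concatenated pair; the input combination used
    here is an assumption, not taken from the paper.
    """
    x = np.concatenate([emb_a, emb_b])
    hidden = np.maximum(w1 @ x, 0.0)
    logit = w2 @ hidden
    return 1.0 / (1.0 + np.exp(-logit))

# Illustrative sizes only (assumed, untrained random weights).
kernel, n_filters, emb_dim, n_hidden = 400, 8, 4, 16
filters = rng.normal(scale=0.1, size=(n_filters, kernel))
proj = rng.normal(scale=0.1, size=(emb_dim, n_filters))
w1 = rng.normal(scale=0.1, size=(n_hidden, 2 * emb_dim))
w2 = rng.normal(scale=0.1, size=n_hidden)

# Stand-ins for one second of enrollment and test audio at 16 kHz.
enroll_wave = rng.normal(size=16000)
test_wave = rng.normal(size=16000)

emb_enroll = frontend_embedding(enroll_wave, filters, proj)
emb_test = frontend_embedding(test_wave, filters, proj)
score = backend_score(emb_enroll, emb_test, w1, w2)
```

With untrained weights the score carries no meaning; in a real system both networks would be trained, and the score would be compared against a threshold to accept or reject the claimed identity.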
Related Papers
- Adversarial Regularization for End-to-End Robust Speaker Verification (2019), 40 citations
- STC-Innovation Speaker Recognition Systems for Far-Field Speaker Verification Challenge 2020 (2020), 9 citations
- Text-independent speaker verification using speaker clustering and support vector machines (2003), 6 citations
- Joint Speaker Encoder and Neural Back-End Model for Fully End-to-End Automatic Speaker Verification with Multiple Enrollment Utterances (2023), 1 citation