HM-CONFORMER: A Conformer-Based Audio Deepfake Detection System with Hierarchical Pooling and Multi-Level Classification Token Aggregation Methods
Abstract
Audio deepfake detection (ADD) is the task of detecting spoofing attacks generated by text-to-speech or voice conversion systems. Spoofing evidence, which helps distinguish spoofed from bona fide utterances, may exist either locally or globally in the input features. The Conformer, which combines Transformer and CNN components, has a structure well suited to capturing both. However, because the Conformer was designed for sequence-to-sequence tasks, applying it directly to ADD may be sub-optimal. To address this limitation, we propose HM-Conformer, which adopts two components: (1) a hierarchical pooling method that progressively reduces the sequence length to eliminate duplicated information, and (2) a multi-level classification token aggregation method that uses classification tokens to gather information from different blocks. Owing to these components, HM-Conformer can efficiently detect spoofing evidence by processing various sequence lengths and aggregating them. In experiments on the ASVspoof 2021 Deepfake dataset, HM-Conformer achieved an EER of 15.71%, showing competitive performance compared to recent systems.
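The two components can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the abstract does not specify the pooling operator, the number of stages, or how the classification tokens are combined. The sketch assumes mean pooling with stride 2, three stages, and simple concatenation of the per-stage tokens. The function `toy_block` is a hypothetical stand-in for a Conformer block, and all names are illustrative.

```python
def mean_pool(frames, stride=2):
    """Average non-overlapping groups of `stride` frames.

    Each frame is a list of floats; a trailing incomplete group is dropped.
    """
    pooled = []
    for start in range(0, len(frames) - stride + 1, stride):
        group = frames[start:start + stride]
        pooled.append([sum(vals) / stride for vals in zip(*group)])
    return pooled


def toy_block(cls_token, frames):
    """Hypothetical stand-in for a Conformer block.

    The classification token absorbs the mean of the frames; the frames
    themselves pass through unchanged.
    """
    dim = len(cls_token)
    frame_mean = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    new_cls = [(c + m) / 2 for c, m in zip(cls_token, frame_mean)]
    return new_cls, frames


def hierarchical_forward(frames, num_stages=3, stride=2):
    """Run the toy stages, pooling between them and collecting CLS tokens."""
    dim = len(frames[0])
    cls_token = [0.0] * dim
    collected = []  # one classification token per stage
    lengths = []    # sequence length seen by each stage
    for stage in range(num_stages):
        cls_token, frames = toy_block(cls_token, frames)
        collected.append(list(cls_token))
        lengths.append(len(frames))
        if stage < num_stages - 1:
            # Hierarchical pooling: shorten the sequence before the next stage.
            frames = mean_pool(frames, stride)
    # Multi-level aggregation (assumed here to be plain concatenation).
    aggregated = [value for token in collected for value in token]
    return aggregated, lengths


if __name__ == "__main__":
    # 16 frames with 4 features each.
    frames = [[float(i)] * 4 for i in range(16)]
    aggregated, lengths = hierarchical_forward(frames)
    print(lengths)          # sequence length at each stage
    print(len(aggregated))  # num_stages * feature dimension
```

With 16 input frames, the three stages see sequences of 16, 8, and 4 frames, and the aggregated vector holds one token per stage. A real system would feed this vector to a classifier that outputs a spoof or bona fide score.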