Multimodal Emotion Recognition via the Fusion of Mamba and Liquid Neural Networks with Cross-Modal Alignment
Abstract
This paper proposes a novel multimodal emotion recognition framework, Sparse Alignment and Liquid-Mamba (SALM), which integrates the complementary strengths of Mamba networks and Liquid Neural Networks (LNNs). To capture neural dynamics, high-resolution EEG spectrograms are generated via the Short-Time Fourier Transform (STFT), while heatmap features extracted from facial images, videos, speech, and text are aligned through entropy-regularized Sinkhorn and Greenkhorn optimal transport algorithms. The aligned representations are then fused to mitigate semantic disparities across modalities. SALM leverages sparse alignment for efficient cross-modal mapping and employs the Liquid-Mamba architecture to build a robust, generalizable classifier. Extensive experiments on benchmark datasets show that SALM consistently outperforms state-of-the-art methods in both classification accuracy and generalization ability.
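As a concrete illustration of the spectrogram step, the following is a minimal sketch using `scipy.signal.stft`; the sampling rate, window length, and overlap below are hypothetical placeholders, since the abstract does not specify the paper's actual STFT settings.

```python
import numpy as np
from scipy.signal import stft

fs = 256                          # assumed EEG sampling rate (Hz); not given in the paper
eeg = np.random.randn(10 * fs)    # placeholder single-channel EEG segment (10 s)

# Short-Time Fourier Transform -> time-frequency representation
f, t, Z = stft(eeg, fs=fs, nperseg=256, noverlap=192)
spectrogram = np.abs(Z) ** 2      # power spectrogram, treated as an image-like input
```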
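The entropy-regularized Sinkhorn alignment named above can be sketched as follows. This is a generic textbook implementation rather than the paper's code; the cost matrix, regularization strength `eps`, and iteration count are illustrative assumptions.

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.05, n_iters=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    C: (n, m) cost matrix between features of two modalities
    a: (n,) source marginal, b: (m,) target marginal (each sums to 1)
    Returns the (n, m) transport plan P = diag(u) K diag(v).
    """
    K = np.exp(-C / eps)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                # column scaling to match b
        u = a / (K @ v)                  # row scaling to match a
    return u[:, None] * K * v[None, :]

# Illustrative use: align n EEG tokens with m facial-feature tokens
n, m = 8, 10
X, Y = np.random.randn(n, 4), np.random.randn(m, 4)   # toy modality features
C = np.linalg.norm(X[:, None] - Y[None, :], axis=-1)  # pairwise Euclidean cost
P = sinkhorn(C, np.full(n, 1 / n), np.full(m, 1 / m))
```

Greenkhorn follows the same row/column scaling scheme but greedily updates a single row or column per iteration instead of all of them at once.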