Abstractive Summarization for Video: A Revisit in Multistage Fusion Network With Forget Gate
Citations over time: top 11% of 2022 papers
Abstract
Multimodal abstractive summarization for videos is an emerging task that aims to generate a summary from multi-source information (i.e., video and audio transcript). The challenge is to merge long multimodal sequences so as to capture rich semantic information without letting noise from one lengthy modality sequence degrade the other and thus hurt the whole model. To address this issue, we propose a multistage fusion network with forget gate (MFFG), which selectively integrates multi-source information through cross-fusion during encoding and hierarchical fusion during decoding, and we design a fusion forget gate module to suppress the flow of potential multimodal noise from the long multi-source sequences. Meanwhile, considering that the source text in this task is lengthy and follows the same distribution as the output summary text, we inherit part of the MFFG structure and propose a variant, the single-stage fusion network with forget gate (SFFG), which simplifies the fusion schema and leverages the long source text to enhance the representation of the target summary. Experimental results on the How2 and How2-300 datasets demonstrate the superiority of the two multimodal fusion methods. Further, we provide an ASR-transcribed version of the How2 dataset to evaluate model performance under noisy scenarios, and experimental results show clear advantages of our proposed models over prior systems.
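The abstract does not give the gate equations, so the following is only a minimal sketch of the general idea of gated cross-modal fusion, not the MFFG implementation. It assumes that text positions attend over video features with scaled dot-product attention, and that a sigmoid gate computed from the text and the attended video features scales how much video information flows into the fused representation. The function names, the gate parameterization (`w_gate`, `b_gate`), and the residual form are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cross_attend(query, context):
    """Scaled dot-product attention: each query position attends over context."""
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)    # (Lq, Lc)
    return softmax(scores, axis=-1) @ context  # (Lq, d)


def fuse_with_forget_gate(text, video, w_gate, b_gate):
    """Hypothetical gated fusion: text (Lt, d) and video (Lv, d) -> (Lt, d)."""
    attended = cross_attend(text, video)                 # video features aligned to text
    gate_in = np.concatenate([text, attended], axis=-1)  # (Lt, 2d)
    forget = sigmoid(gate_in @ w_gate + b_gate)          # (Lt, d), values in (0, 1)
    # Assumed residual form: the gate scales the video contribution per dimension.
    return text + forget * attended


rng = np.random.default_rng(0)
d = 8
text = rng.normal(size=(5, d))    # 5 transcript positions
video = rng.normal(size=(12, d))  # 12 video frames
w_gate = rng.normal(scale=0.1, size=(2 * d, d))
b_gate = np.zeros(d)

fused = fuse_with_forget_gate(text, video, w_gate, b_gate)
print(fused.shape)  # (5, 8)
```

In this sketch, a strongly negative gate bias drives the gate toward zero, so the fused output falls back to the text representation alone. That is the sense in which such a gate can suppress noise from the other modality.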