Spatiotemporal Feature Fusion for Video Summarization
Abstract
Video summarization (VS) is a crucial process for condensing video content into a concise, informative representation, improving accessibility and the user experience. This work introduces a new approach to static VS based on spatiotemporal features derived from long short-term memory (LSTM) and pretrained convolutional neural network (CNN) models. It uses a dual-CNN to identify keyframes by extracting features from benchmark datasets that contain user-generated summaries as the ground truth. Additionally, the incorporation of self-organizing map (SOM) clustering into the dual-CNN model is investigated and found to outperform alternative clustering strategies. The spatiotemporal VS method selects the most representative frames from the extracted spatiotemporal features. Unlike traditional methods, it does not require training on specific VS datasets, eliminating the need for extensive labeled data. Compared with existing state-of-the-art techniques, the proposed approach demonstrates promising results, consistently generating high-quality video summaries across various content categories. It achieved average F-scores of 84.7%, 86.4%, 61.9%, and 53.6% on the Open Video, YouTube, TVSum, and SumMe benchmark datasets, respectively, showing its effectiveness in producing informative video summaries.
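The keyframe-selection step described above can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: it assumes per-frame feature vectors have already been extracted (e.g. by a pretrained CNN; here they are stand-in NumPy arrays), and it uses a small hand-rolled one-dimensional SOM, then picks the frame nearest each trained SOM unit as a keyframe.

```python
import numpy as np

def train_som(features, n_units=5, n_iters=200, lr0=0.5, seed=0):
    """Train a tiny 1-D self-organizing map on per-frame feature vectors.

    features: array of shape (n_frames, dim), assumed to come from a
    pretrained CNN in the paper's pipeline (synthetic here).
    """
    rng = np.random.default_rng(seed)
    dim = features.shape[1]
    sigma0 = n_units / 2.0
    weights = rng.normal(size=(n_units, dim))
    positions = np.arange(n_units)
    for t in range(n_iters):
        # Decay learning rate and neighborhood width over time.
        lr = lr0 * np.exp(-t / n_iters)
        sigma = sigma0 * np.exp(-t / n_iters)
        x = features[rng.integers(len(features))]
        # Best-matching unit: SOM node closest to the sampled frame.
        bmu = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
        # Gaussian neighborhood pulls nearby units toward the sample.
        h = np.exp(-((positions - bmu) ** 2) / (2 * sigma ** 2))
        weights += lr * h[:, None] * (x - weights)
    return weights

def select_keyframes(features, weights):
    """Return one representative frame index per SOM unit (deduplicated)."""
    idx = {int(np.argmin(np.linalg.norm(features - w, axis=1)))
           for w in weights}
    return sorted(idx)

# Usage with synthetic "frame features" (stand-in for CNN descriptors).
rng = np.random.default_rng(1)
frames = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 8))
                    for c in (-2.0, 0.0, 2.0)])
som = train_som(frames, n_units=3)
keyframes = select_keyframes(frames, som)
```

The deduplication in `select_keyframes` matters because several SOM units can converge near the same cluster; the summary should not repeat a frame.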