Spatiotemporal Feature Fusion for Video Summarization
Abstract
Video summarization (VS) is a crucial process for condensing video content into a concise, informative representation, improving accessibility and the user experience. This work introduces a new approach to static VS based on spatiotemporal features derived from long short-term memory (LSTM) and pretrained convolutional neural network (CNN) models. It uses a dual-CNN to identify keyframes by extracting features from benchmark datasets that contain user-generated summaries as the ground truth. Additionally, the incorporation of self-organizing map (SOM) clustering into the dual-CNN model is investigated and found to outperform alternative clustering strategies. The spatiotemporal VS method selects the most representative frames from the extracted spatiotemporal features. Unlike traditional methods, it does not require training on specific VS datasets, eliminating the need for extensive labeled data. Compared with existing state-of-the-art techniques, the proposed approach demonstrates promising results, consistently generating high-quality video summaries across various content categories. It achieved average F-scores of 84.7%, 86.4%, 61.9%, and 53.6% on the Open Video, YouTube, TVSum, and SumMe benchmark datasets, respectively, showing its effectiveness in producing informative video summaries.
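The keyframe-selection step described above can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: it assumes per-frame feature vectors have already been extracted (e.g. by a pretrained CNN; here they are stand-in NumPy arrays), and it uses a small hand-rolled one-dimensional SOM, then picks the frame nearest each trained SOM unit as a keyframe.

```python
import numpy as np

def train_som(features, n_units=5, n_iters=200, lr0=0.5, seed=0):
    """Train a tiny 1-D self-organizing map on per-frame feature vectors.

    features: array of shape (n_frames, dim), assumed to come from a
    pretrained CNN in the paper's pipeline (synthetic here).
    """
    rng = np.random.default_rng(seed)
    dim = features.shape[1]
    sigma0 = n_units / 2.0
    weights = rng.normal(size=(n_units, dim))
    positions = np.arange(n_units)
    for t in range(n_iters):
        # Decay learning rate and neighborhood width over time.
        lr = lr0 * np.exp(-t / n_iters)
        sigma = sigma0 * np.exp(-t / n_iters)
        x = features[rng.integers(len(features))]
        # Best-matching unit: SOM node closest to the sampled frame.
        bmu = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
        # Gaussian neighborhood pulls nearby units toward the sample.
        h = np.exp(-((positions - bmu) ** 2) / (2 * sigma ** 2))
        weights += lr * h[:, None] * (x - weights)
    return weights

def select_keyframes(features, weights):
    """Return one representative frame index per SOM unit (deduplicated)."""
    idx = {int(np.argmin(np.linalg.norm(features - w, axis=1)))
           for w in weights}
    return sorted(idx)

# Usage with synthetic "frame features" (stand-in for CNN descriptors).
rng = np.random.default_rng(1)
frames = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 8))
                    for c in (-2.0, 0.0, 2.0)])
som = train_som(frames, n_units=3)
keyframes = select_keyframes(frames, som)
```

The deduplication in `select_keyframes` matters because several SOM units can converge near the same cluster; the summary should not repeat a frame.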