Deep Local Video Feature for Action Recognition
Abstract
We investigate the problem of representing an entire video using CNN features for human action recognition. End-to-end training of CNNs/RNNs on whole videos is currently infeasible due to GPU memory limitations, so a common practice is to use sampled frames as inputs along with the video-level labels as supervision. However, the global video labels may not be suitable for all of the temporally local samples, as videos often contain content besides the action of interest. We therefore propose to instead treat the deep networks trained on local inputs as local feature extractors. The local features are then aggregated to form global features, which are used to assign video-level labels through a second classification stage. We investigate a number of design choices for this local feature approach. Experimental results on the HMDB51 and UCF101 datasets show that simple maximum pooling of the sparsely sampled local features leads to a significant performance improvement.
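As a minimal sketch of the aggregation idea in the abstract — per-frame (local) CNN features max-pooled into a single video-level feature that a second-stage classifier would then label — the following toy example uses random vectors in place of real CNN descriptors; the function name and dimensions are illustrative assumptions, not from the paper:

```python
import numpy as np

def aggregate_max_pool(local_features):
    """Max-pool per-frame CNN features into one video-level feature.

    local_features: array of shape (num_sampled_frames, feature_dim),
    one row per sparsely sampled frame. Returns a (feature_dim,) vector
    holding the element-wise maximum across frames.
    """
    return local_features.max(axis=0)

# Toy data standing in for CNN outputs: 5 sampled frames, 8-dim features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 8))

video_feature = aggregate_max_pool(frames)
assert video_feature.shape == (8,)
# Each dimension keeps the strongest response seen in any sampled frame,
# so a local detection of the action can dominate irrelevant content.
assert np.all(video_feature >= frames.min(axis=0))
```

The pooled `video_feature` would then be fed to a separate video-level classifier (e.g. a linear SVM), which is the second classification stage the abstract refers to.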