ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification
Abstract
In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video. We do so by integrating state-of-the-art two-stream networks [42] with learnable spatio-temporal feature aggregation [6]. The resulting architecture is end-to-end trainable for whole-video classification. We investigate different strategies for pooling across space and time and for combining signals from the different streams. We find that: (i) it is important to pool jointly across space and time, but (ii) appearance and motion streams are best aggregated into their own separate representations. Finally, we show that our representation outperforms the two-stream base architecture by a large margin (13% relative) and also outperforms other baselines with comparable base architectures on the HMDB51, UCF101, and Charades video classification benchmarks.
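As a rough illustration of the aggregation the abstract describes, below is a minimal sketch of a NetVLAD-style layer [6] that pools convolutional features jointly over the spatial grid and the frame axis, in the spirit of finding (i). This is an assumed PyTorch rendering, not the paper's released implementation; the class name `SpatioTemporalVLAD` and the default values of `feature_dim` and `num_clusters` are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalVLAD(nn.Module):
    """NetVLAD-style aggregation pooled jointly across space and time (sketch)."""

    def __init__(self, feature_dim: int = 512, num_clusters: int = 64):
        super().__init__()
        self.num_clusters = num_clusters
        # 1x1 conv yields per-location soft-assignment logits to K clusters.
        self.assign = nn.Conv2d(feature_dim, num_clusters, kernel_size=1)
        # Learnable cluster centers (anchors), one D-dim center per cluster.
        self.centers = nn.Parameter(torch.randn(num_clusters, feature_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, D, H, W) -- conv-layer features for T sampled frames of one video.
        T, D, H, W = x.shape
        soft = F.softmax(self.assign(x), dim=1)        # (T, K, H, W)
        feats = x.reshape(T, D, -1)                    # (T, D, H*W)
        soft = soft.reshape(T, self.num_clusters, -1)  # (T, K, H*W)
        # Assignment-weighted residuals to each center, summed over all spatial
        # locations AND all frames in one go -- the joint space-time pooling.
        vlad = torch.einsum('tkn,tdn->kd', soft, feats) \
             - soft.sum(dim=(0, 2)).unsqueeze(1) * self.centers  # (K, D)
        vlad = F.normalize(vlad, dim=1)                # intra-normalization per cluster
        return F.normalize(vlad.flatten(), dim=0)      # final L2-normalized descriptor

if __name__ == "__main__":
    frames = torch.randn(25, 512, 7, 7)   # 25 frames, 512-d features on a 7x7 grid
    video_descriptor = SpatioTemporalVLAD()(frames)
    print(video_descriptor.shape)          # torch.Size([32768]) = K * D
```

Per finding (ii), one such layer would be instantiated for each stream (appearance and motion), with the two aggregated vectors kept as separate representations rather than pooled into a single one.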
Related Papers
- A improved pooling method for convolutional neural networks (2024), 119 citations
- Pooling in high-throughput drug screening (2009)
- A fully trainable network with RNN-based pooling (2019), 22 citations
- Alpha-Pooling for Convolutional Neural Networks (2018)
- A Fully Trainable Network with RNN-based Pooling (2017), 1 citation