Temporal Pyramid Pooling Based Relation Network for Action Recognition
Abstract
Efficient spatiotemporal representations play a vital role in video understanding. In this study, we propose a novel temporal pyramid pooling based relation network (TPPRN) that learns spatiotemporal representations in an end-to-end fashion. Specifically, TPPRN pools, at multiple temporal scales, the high-level features that a convolutional neural network extracts from sampled frames. The features of same-length segments are then concatenated so that the network can reason about the relations within segments of the same length. Finally, the different relations are aggregated to make a comprehensive prediction. Our first contribution is a carefully designed sampling strategy: it splits a video evenly into three clips and uniformly samples four frames from each, which reduces computation and memory cost. Our second contribution is multi-scale temporal pyramid pooling, which provides segments of various granularities for the relation module to reason over. Experimental results on two standard benchmarks, HMDB-51 and UCF-101, demonstrate the effectiveness of the learned spatiotemporal representations, and the proposed TPPRN achieves performance comparable to the state-of-the-art.
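The two contributions described above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names, the choice of average pooling, and the pyramid scales `(2, 3, 4)` are assumptions made for clarity; the paper only specifies the 3-clip / 4-frame sampling and multi-scale pooling over same-length segments.

```python
import numpy as np

def sample_frame_indices(num_frames, num_clips=3, frames_per_clip=4):
    """Split a video into `num_clips` even clips and sample
    `frames_per_clip` frames uniformly within each clip (3x4 = 12
    frames total, matching the sampling strategy in the abstract)."""
    clip_len = num_frames / num_clips
    indices = []
    for c in range(num_clips):
        start = c * clip_len
        # Uniformly spaced positions inside this clip.
        offsets = np.linspace(start, start + clip_len - 1, frames_per_clip)
        indices.extend(int(round(o)) for o in offsets)
    return indices

def temporal_pyramid_segments(features, scales=(2, 3, 4)):
    """features: (T, D) array of per-frame CNN features.

    For each scale s (scales here are an illustrative choice), pool
    consecutive frames into non-overlapping s-frame segments (average
    pooling assumed), then concatenate the segment features of the same
    length. Each concatenated vector would be the input to a relation
    module at that scale; the per-scale relation outputs are what TPPRN
    aggregates for the final prediction."""
    T, D = features.shape
    pyramid = {}
    for s in scales:
        segs = [features[i:i + s].mean(axis=0)
                for i in range(0, T - s + 1, s)]
        pyramid[s] = np.concatenate(segs)
    return pyramid
```

For a 120-frame video, `sample_frame_indices(120)` yields 12 ordered indices (4 per 40-frame clip), and with 12 sampled frames the pyramid produces 6, 4, and 3 segments at scales 2, 3, and 4 respectively.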