Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning
2023, pp. 2214–2224
Abstract
We present a simple approach that turns a ViT encoder into an efficient video model that works seamlessly with both image and video inputs. By sparsely sampling the inputs, the model can train and run inference on both input modalities. The model is easily scalable and can be adapted to large-scale pre-trained ViTs without requiring full finetuning. The model achieves SOTA results (https://sites.google.com/view/tubevit).
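To make the idea of sparsely sampling the inputs concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a video stored as a (T, H, W, C) array and cuts fixed-size tubes at a stride larger than the tube itself, so most of the volume is skipped. The function name `sample_sparse_tubes`, the tube shapes, and the strides are all hypothetical choices for illustration. An image is handled here by treating it as a one-frame video, which is also an assumption.

```python
import numpy as np

def sample_sparse_tubes(video, tube_shape, stride):
    """Cut tubes of size tube_shape out of a (T, H, W, C) array at the given stride.

    Hypothetical helper: when the stride exceeds the tube size, the tubes do not
    cover the whole volume, so far fewer tokens are produced than with dense
    patching. Each tube is flattened into one token vector.
    """
    t, h, w = tube_shape
    st, sh, sw = stride
    T, H, W, _ = video.shape
    tubes = [
        video[z:z + t, y:y + h, x:x + w].reshape(-1)
        for z in range(0, T - t + 1, st)
        for y in range(0, H - h + 1, sh)
        for x in range(0, W - w + 1, sw)
    ]
    return np.stack(tubes)

# A 32-frame video: 8x8x8 tubes sampled every 16 frames and every 32 pixels.
video = np.zeros((32, 224, 224, 3), dtype=np.float32)
video_tokens = sample_sparse_tubes(video, tube_shape=(8, 8, 8), stride=(16, 32, 32))

# An image treated as a one-frame video, tokenized with ordinary 16x16 patches.
image = video[:1]
image_tokens = sample_sparse_tubes(image, tube_shape=(1, 16, 16), stride=(1, 16, 16))

print(video_tokens.shape, image_tokens.shape)
```

Under these assumed settings the video yields 98 tokens, fewer than the 196 tokens of a single densely patched frame. Both token sets would still need a linear projection to a common width before a shared encoder could consume them, a step this sketch leaves out.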