End-to-end Generative Pretraining for Multimodal Video Captioning
Abstract
Recent video and language pretraining frameworks lack the ability to generate sentences. We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos that can be effectively used for generative tasks such as multimodal video captioning. Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly. To overcome the lack of captions in unlabelled videos, we leverage the future utterance as an additional text source and propose a bidirectional generation objective: we generate future utterances given the present multimodal context, and also the present utterance given future observations. With this objective, we train an encoder-decoder model end-to-end to generate a caption from raw pixels and transcribed speech directly. Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks, as well as for other video understanding tasks such as VideoQA, video retrieval and action classification.
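To make the bidirectional objective concrete, here is a minimal PyTorch sketch, not the paper's implementation: the `BiDirGenPretrainer` class, the 2048-d pre-extracted frame features, the plain Transformer encoder/decoder, and the concatenation-based multimodal fusion are all illustrative assumptions. It shows the two generation directions sharing a single encoder-decoder, with their losses summed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiDirGenPretrainer(nn.Module):
    """Toy encoder-decoder trained with a bidirectional generation loss."""

    def __init__(self, vocab_size=10000, feat_dim=2048, d_model=256,
                 nhead=4, num_layers=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.frame_proj = nn.Linear(feat_dim, d_model)  # visual features -> model dim
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def encode(self, frames, utterance_ids):
        """Fuse visual frames and transcribed speech into one multimodal memory."""
        vis = self.frame_proj(frames)        # (B, T_v, d_model)
        txt = self.token_emb(utterance_ids)  # (B, T_t, d_model)
        return self.encoder(torch.cat([vis, txt], dim=1))

    def generation_loss(self, memory, target_ids):
        """Teacher-forced next-token loss for decoding the target utterance."""
        tgt_in, tgt_out = target_ids[:, :-1], target_ids[:, 1:]
        tgt = self.token_emb(tgt_in)
        mask = nn.Transformer.generate_square_subsequent_mask(
            tgt.size(1)).to(tgt.device)      # causal mask over target tokens
        logits = self.lm_head(self.decoder(tgt, memory, tgt_mask=mask))
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tgt_out.reshape(-1))

    def forward(self, present_frames, present_ids, future_frames, future_ids):
        # Forward direction: present multimodal context -> future utterance.
        loss_fwd = self.generation_loss(
            self.encode(present_frames, present_ids), future_ids)
        # Backward direction: future observations -> present utterance.
        loss_bwd = self.generation_loss(
            self.encode(future_frames, future_ids), present_ids)
        return loss_fwd + loss_bwd


# Toy usage: batch of 2 clips, 8 frames each, utterances of 12 tokens.
model = BiDirGenPretrainer()
loss = model(torch.randn(2, 8, 2048), torch.randint(0, 10000, (2, 12)),
             torch.randn(2, 8, 2048), torch.randint(0, 10000, (2, 12)))
loss.backward()
```

Note that MV-GPT is trained end-to-end from raw pixels, whereas the sketch assumes fixed frame features for brevity; the point it illustrates is only how both generation directions reuse the same encoder-decoder weights with a summed loss.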