FLAVA: A Foundational Language And Vision Alignment Model
Abstract
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks. Such models are generally either cross-modal (contrastive) or multi-modal (with earlier fusion), but not both, and they often target only specific modalities or tasks. A promising direction would be to use a single holistic universal model, as a "foundation", that targets all modalities at once: a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate impressive performance on a wide range of 35 tasks spanning these target modalities.
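As a rough illustration of the distinction the abstract draws, the sketch below pairs a CLIP-style cross-modal contrastive loss over pooled unimodal embeddings with a small "earlier fusion" transformer over the concatenated image and text sequences. This is not FLAVA's actual architecture or training code; every module, dimension, and name here is a hypothetical placeholder standing in for real encoders such as a ViT and a BERT-style text model.

```python
# Hedged sketch, not FLAVA's implementation: shows how a single model can
# carry both a cross-modal (contrastive) objective and a multi-modal
# (fusion) encoder, as the abstract describes. All sizes are made up.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDualObjectiveModel(nn.Module):
    def __init__(self, dim=256, num_heads=4, num_fusion_layers=2):
        super().__init__()
        # Stand-ins for real unimodal encoders (e.g. ViT / BERT).
        self.image_encoder = nn.Linear(768, dim)  # hypothetical patch features -> dim
        self.text_encoder = nn.Linear(512, dim)   # hypothetical token features -> dim
        # Multimodal fusion: a small transformer over the joined sequences.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_fusion_layers)
        self.logit_scale = nn.Parameter(torch.tensor(1.0))

    def forward(self, image_feats, text_feats):
        img = self.image_encoder(image_feats)  # (B, Ni, dim)
        txt = self.text_encoder(text_feats)    # (B, Nt, dim)

        # Cross-modal contrastive objective (CLIP-style) on pooled embeddings:
        # matched image/text pairs on the diagonal are the positives.
        img_g = F.normalize(img.mean(dim=1), dim=-1)
        txt_g = F.normalize(txt.mean(dim=1), dim=-1)
        logits = self.logit_scale.exp() * img_g @ txt_g.t()
        targets = torch.arange(logits.size(0), device=logits.device)
        contrastive_loss = (F.cross_entropy(logits, targets) +
                            F.cross_entropy(logits.t(), targets)) / 2

        # Multimodal "earlier fusion": a joint transformer over both sequences,
        # whose outputs would feed fusion-level heads (e.g. matching, VQA).
        fused = self.fusion(torch.cat([img, txt], dim=1))  # (B, Ni+Nt, dim)
        return contrastive_loss, fused

# Usage with random stand-in features:
model = ToyDualObjectiveModel()
loss, fused = model(torch.randn(8, 16, 768), torch.randn(8, 12, 512))
```

In a combined model along these lines, the contrastive branch supports cross-modal retrieval-style tasks while the fusion branch supports tasks that need joint reasoning over both modalities; the abstract's point is that prior models typically chose one branch or the other.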