TAP: Text-Aware Pre-training for Text-VQA and Text-Caption
Abstract
In this paper, we propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks. These two tasks aim at reading and understanding scene text in images for question answering and image caption generation, respectively. In contrast to conventional vision-language pre-training that fails to capture scene text and its relationship with the visual and text modalities, TAP explicitly incorporates scene text (generated from OCR engines) during pre-training. With three pre-training tasks, including masked language modeling (MLM), image-text (contrastive) matching (ITM), and relative (spatial) position prediction (RPP), pre-training with scene text effectively helps the model learn a better aligned representation among the three modalities: text word, visual object, and scene text. Due to this aligned representation learning, even pre-trained on the same downstream task dataset, TAP already boosts the absolute accuracy on the TextVQA dataset by +5.4%, compared with a non-TAP baseline. To further improve the performance, we build a large-scale scene text-related image-text dataset based on the Conceptual Caption dataset, named OCR-CC, which contains 1.4 million images with scene text. Pre-trained on this OCR-CC dataset, our approach outperforms the state of the art by large margins on multiple tasks, i.e., +8.3% accuracy on TextVQA, +8.6% accuracy on ST-VQA, and +10.2 CIDEr score on TextCaps.
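To make the idea of masked language modeling over scene text concrete, the following is a minimal sketch and not the authors' implementation. It assumes the text input is a simple concatenation of question words and OCR tokens, and that each token is masked independently with a fixed probability. The function name `mask_tokens`, the `[MASK]` string, and the default `mask_prob` of 0.15 are illustrative assumptions that the abstract does not specify.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, chosen for illustration


def mask_tokens(question, ocr_tokens, mask_prob=0.15, seed=0):
    """Mask tokens in a joint [question words + scene-text tokens] sequence.

    Returns (masked_sequence, labels). labels[i] holds the original token
    at masked positions and None everywhere else, so a model would only
    be asked to recover the masked entries.
    """
    rng = random.Random(seed)  # seeded for reproducible masking
    sequence = list(question) + list(ocr_tokens)
    masked, labels = [], []
    for token in sequence:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


if __name__ == "__main__":
    question = ["what", "brand", "is", "the", "laptop"]
    ocr = ["dell", "inspiron"]  # made-up OCR output for illustration
    masked, labels = mask_tokens(question, ocr, mask_prob=0.3, seed=1)
    print(masked)
    print(labels)
```

Because scene-text tokens sit in the same sequence as the question words, a masked OCR token would have to be recovered from the surrounding words (and, in a full model, from the visual features), which is one plausible reading of how pre-training with scene text encourages alignment across modalities.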