Exploiting the Textual Potential from Vision-Language Pre-training for Text-based Person Search
Abstract
Text-based Person Search (TPS) aims to retrieve pedestrians that match a text description rather than a query image. Recent Vision-Language Pre-training (VLP) models can bring transferable knowledge to downstream TPS tasks, resulting in more efficient performance gains. However, existing VLP-enhanced TPS methods use only the pre-trained visual encoder, neglecting the corresponding textual representation and breaking the modality alignment learned from large-scale pre-training. In this paper, we explore how to fully exploit the textual potential of VLP in TPS tasks. We build on our proposed VLP-TPS baseline model, which is the first TPS model in which both modalities are pre-trained. We propose the Multi-Integrity Description Constraints (MIDC) to enhance the robustness of the textual modality by incorporating different components of the fine-grained corpus during training. Inspired by the prompt approach for zero-shot classification with VLP models, we propose the Dynamic Attribute Prompt (DAP) to provide a unified corpus of fine-grained attributes as language hints for the image modality. Extensive experiments show that our proposed TPS framework achieves state-of-the-art performance, exceeding the previous best method by a margin.
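As a rough illustration of the retrieval setting described in the abstract, the sketch below ranks a small gallery against a text query by cosine similarity in a shared embedding space, and fills a prompt template with fine-grained attributes. It is not the paper's DAP or MIDC: the template string, the function names, and the toy three-dimensional vectors (which stand in for the outputs of pre-trained text and image encoders) are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_attribute_prompts(attributes,
                            template="A photo of a person wearing {}."):
    """Fill a hypothetical prompt template with fine-grained attributes."""
    return [template.format(attr) for attr in attributes]


def rank_gallery(text_embedding, gallery):
    """Return gallery IDs sorted by similarity to the text embedding."""
    scores = {pid: cosine(text_embedding, emb) for pid, emb in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical attribute prompts (assumed template, not from the paper).
prompts = build_attribute_prompts(["a red jacket", "a black backpack"])

# Toy embeddings standing in for encoder outputs in a shared space.
gallery = {
    "person_a": [0.9, 0.1, 0.0],
    "person_b": [0.1, 0.9, 0.2],
}
query = [1.0, 0.0, 0.1]

ranking = rank_gallery(query, gallery)
print(prompts)
print(ranking)
```

In a real system the toy vectors would be replaced by embeddings from a jointly pre-trained text encoder and image encoder, which is where preserving the pre-trained modality alignment matters.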