FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback
Top 10% of 2022 papers
Abstract
Fashion image retrieval based on a query pair of a reference image and natural language feedback is a challenging task that requires models to assess fashion-related information from the visual and textual modalities simultaneously. We propose a new vision-language transformer-based model, FashionVLP, that brings the prior knowledge contained in large image-text corpora to the domain of fashion image retrieval, and combines visual information from multiple levels of context to effectively capture fashion-related information. While queries are encoded through the transformer layers, our asymmetric design adopts a novel attention-based approach for fusing target image features without involving text or transformer layers in the process. Extensive results show that FashionVLP achieves state-of-the-art performance on benchmark datasets, with a large 23% relative improvement on the challenging FashionIQ dataset, which contains complex natural language feedback.
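The asymmetric design described in the abstract can be sketched at a high level: the query side would pass reference-image and feedback-text tokens through transformer layers, while the target side would fuse multi-level image features with lightweight attention only. The minimal numpy sketch below illustrates just that attention-based target fusion and a cosine-similarity retrieval score; all names here (`attention_fuse`, the learned context vector `ctx`) are hypothetical stand-ins, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(features, ctx):
    """Fuse multi-level image features via attention (no transformer, no text).

    features: (n_levels, d) array, e.g. whole-image, cropped, and region-level
              embeddings of the target image (assumed inputs).
    ctx:      (d,) stand-in for a learned context vector that scores each level.
    Returns a single (d,) fused target embedding.
    """
    scores = features @ ctx / np.sqrt(features.shape[1])  # scaled dot-product scores
    weights = softmax(scores)                             # one weight per context level
    return weights @ features                             # weighted sum over levels

def cosine(a, b):
    """Retrieval score between query and target embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy example with random features (d = 8, three context levels).
rng = np.random.default_rng(0)
d = 8
target_feats = rng.standard_normal((3, d))  # multi-level target image features
ctx = rng.standard_normal(d)
target_emb = attention_fuse(target_feats, ctx)

query_emb = rng.standard_normal(d)  # stand-in for the transformer-encoded query
sim = cosine(query_emb, target_emb)
```

The point of the asymmetry is efficiency at retrieval time: target embeddings like `target_emb` can be precomputed for the whole catalog without running text or transformer layers, so only the query passes through the expensive encoder.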