Learning Visual N-Grams from Web Data
Top 14% of 2017 papers by citations
Abstract
Real-world image recognition systems need to recognize tens of thousands of classes that constitute a plethora of visual concepts. The traditional approach of annotating thousands of images per class for training is infeasible in such a scenario, prompting the use of webly supervised data. This paper explores the training of image-recognition systems on large numbers of images and associated user comments, without using manually labeled images. In particular, we develop visual n-gram models that can predict arbitrary phrases that are relevant to the content of an image. Our visual n-gram models are feed-forward convolutional networks trained using new loss functions that are inspired by n-gram models commonly used in language modeling. We demonstrate the merits of our models in phrase prediction, phrase-based image retrieval, relating images and captions, and zero-shot transfer.
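The loss functions mentioned above borrow from smoothed n-gram language models, where the probability of each word is an interpolation of higher- and lower-order n-gram estimates. As a toy illustration of that idea (not the paper's actual loss), the sketch below scores a phrase with Jelinek-Mercer style interpolation of bigram and unigram probabilities; the hard-coded probability tables stand in for the image-conditional predictions that the paper's convolutional network would produce.

```python
import math

# Hypothetical stand-ins for image-conditional phrase probabilities;
# in the paper these would come from a convolutional network.
unigram = {"a": 0.2, "dog": 0.5, "park": 0.3}
bigram = {("a", "dog"): 0.8}  # p(dog | a); unseen bigrams back off to unigrams

def smoothed_log_likelihood(phrase, lam=0.7):
    """Jelinek-Mercer style interpolation of bigram and unigram scores.

    lam weights the bigram estimate; (1 - lam) weights the unigram
    back-off, so phrases with unseen bigrams still get nonzero probability.
    """
    ll = math.log(unigram[phrase[0]])  # first word scored by the unigram model
    for prev, word in zip(phrase, phrase[1:]):
        p = lam * bigram.get((prev, word), 0.0) + (1 - lam) * unigram[word]
        ll += math.log(p)
    return ll
```

For example, `smoothed_log_likelihood(["a", "dog"])` combines the unigram score of "a" with the interpolated score of "dog" given "a"; maximizing such a smoothed likelihood over web-comment phrases is the flavor of objective the paper builds on.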
Related Papers
- Smoothed n-gram based models for tweet language identification: A case study of the Brazilian and European Portuguese national varieties (2017), 24 citations
- Improving N-gram language modeling for code-switching speech recognition (2017), 16 citations
- Modeling of term-distance and term-occurrence information for improving n-gram language model performance (2013)
- Improving language modeling by using distance and co-occurrence information of word-pairs and its application to LVCSR (2014), 2 citations
- Developing a method to build Japanese speech recognition system based on 3-gram language model expansion with Google database (2013)