Zero-Shot Event Detection Using Multi-modal Fusion of Weakly Supervised Concepts
Abstract
Current state-of-the-art systems for visual content analysis require large training sets for each class of interest, and performance degrades rapidly with fewer examples. In this paper, we present a general framework for the zero-shot learning problem of performing high-level event detection with no training exemplars, using only textual descriptions. This task goes beyond the traditional zero-shot framework of adapting a given set of classes with training data to unseen classes. We leverage video and image collections with free-form text descriptions from widely available web sources to learn a large bank of concepts, in addition to using several off-the-shelf concept detectors, speech, and video text for representing videos. We utilize natural language processing technologies to generate event description features. The extracted features are then projected to a common high-dimensional space using text expansion, and similarity is computed in this space. We present extensive experimental results on the large TRECVID MED [26] corpus to demonstrate our approach. Our results show that the proposed concept detection methods significantly outperform current attribute classifiers such as Classemes [34], ObjectBank [21], and SUN attributes [28]. Further, we find that fusion, both within and between modalities, is crucial for optimal performance.
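The core zero-shot scoring step described above (projecting a video's concept-detector responses and a text-expanded event description into a common space, then computing similarity there) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the concept vocabulary, the weight values, and the function names are hypothetical, and cosine similarity stands in for whatever similarity measure the full system uses.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit length; leave the zero vector unchanged."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def event_score(video_concept_scores, event_text_weights):
    """Cosine similarity between a video's concept-detector responses and a
    text-expanded event description, both expressed over the same concept bank."""
    a = l2_normalize(np.asarray(video_concept_scores, dtype=float))
    b = l2_normalize(np.asarray(event_text_weights, dtype=float))
    return float(a @ b)

# Toy 5-concept vocabulary (illustrative), e.g.
# ["dog", "ball", "grass", "car", "crowd"]
video = [0.9, 0.7, 0.8, 0.1, 0.0]  # detector confidences on one video
event = [1.0, 1.0, 0.5, 0.0, 0.0]  # text-expansion weights for a hypothetical event
print(round(event_score(video, event), 3))  # high score: concepts overlap
```

In a real system the two vectors would come from different modalities (visual concept detectors vs. NLP text expansion); the point is that once both live in the same concept space, ranking videos for an unseen event reduces to a similarity computation, with no event-specific training data.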