Image description through fusion based recurrent multi-modal learning
Abstract
Current research in computer vision and machine learning has demonstrated strong performance in detecting and recognizing objects in natural images. These promising results have inspired research towards solving more complex multi-modal learning problems in the image and video domains, such as automatic annotation, segmentation, labelling, and generic understanding. Although solutions exist for one or more of these problems, the approaches have been application-specific. This paper introduces an end-to-end trainable Fusion-based Recurrent Multi-Modal (FRMM) model to address multi-modal applications. FRMM allows each input modality to be independent in terms of architecture, parameters, and length of input sequences. FRMM image description models seamlessly blend convolutional neural network feature descriptors with sequential language data in a recurrent framework. For training and testing we used the Flickr30K and MSCOCO datasets, demonstrating state-of-the-art description results.
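To make the fusion idea concrete, the sketch below shows one way a recurrent step could combine a precomputed CNN image descriptor with a word embedding, each modality keeping its own projection parameters before the fused signal drives the recurrent update. This is a minimal illustration only, not the paper's actual FRMM architecture; all dimensions, weight names, and the summation-based fusion rule are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions (illustrative only, not taken from the paper).
IMG_DIM, EMB_DIM, HID_DIM, VOCAB = 2048, 300, 512, 10000

# Each modality keeps its own parameters, mirroring FRMM's
# modality-independence; all weights here are hypothetical.
W_img = rng.standard_normal((HID_DIM, IMG_DIM)) * 0.01  # image branch
W_emb = rng.standard_normal((HID_DIM, EMB_DIM)) * 0.01  # language branch
W_h = rng.standard_normal((HID_DIM, HID_DIM)) * 0.01    # recurrent weights
W_out = rng.standard_normal((VOCAB, HID_DIM)) * 0.01    # next-word scores

def step(h, img_feat, word_emb):
    """One recurrent step: project each modality, fuse, update state."""
    fused = W_img @ img_feat + W_emb @ word_emb  # fusion by summation
    h = np.tanh(W_h @ h + fused)                 # recurrent update
    logits = W_out @ h                           # scores over vocabulary
    return h, logits

# Stand-ins for a CNN feature descriptor and a short word sequence.
img_feat = rng.standard_normal(IMG_DIM)
h = np.zeros(HID_DIM)
for _ in range(5):
    word_emb = rng.standard_normal(EMB_DIM)
    h, logits = step(h, img_feat, word_emb)

print(h.shape, logits.shape)  # (512,) (10000,)
```

In practice the image branch and language branch could differ in depth and sequence length, since each branch has its own parameters and only meets the other at the fusion point.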