ThaiTC: Thai Transformer-based Image Captioning
Abstract
Image captioning is a long-studied task. Earlier approaches combined a convolutional neural network (CNN) for feature extraction with a recurrent neural network (RNN) for text generation; for Thai in particular, these approaches need to be revisited now that transformers are in widespread use. This paper proposes ThaiTC, an end-to-end image captioning model that pairs a pretrained vision transformer (ViT) with a pretrained Thai text transformer, leveraging the transformer architecture on both sides. We experiment to find the combination of pretrained vision transformer and Thai text transformer that works best for Thai image captioning, and evaluate on three Thai image captioning datasets with different challenges: 1) Travel, 2) Food, and 3) Flickr30k (translated). We also freeze the vision transformer's weights when training on captioning datasets with few images. From the experiments, we found that ThaiTC performs much better on the Food and Flickr30k datasets than on the Travel dataset, allowing us to automatically generate captions about food and travel.
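The architecture described above, a pretrained ViT encoder coupled to a pretrained Thai text transformer decoder with the encoder optionally frozen for small datasets, can be sketched with the Hugging Face transformers library. This is a minimal illustration under stated assumptions, not the authors' released code: the checkpoint names `google/vit-base-patch16-224-in21k` and `airesearch/wangchanberta-base-att-spm-uncased` are stand-ins for whichever pretrained vision and Thai text models the paper actually pairs.

```python
# Minimal sketch of a ViT encoder + Thai text-transformer decoder captioner.
# Checkpoint names below are assumptions, not the paper's exact choices.
from PIL import Image
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

ENCODER = "google/vit-base-patch16-224-in21k"                 # pretrained vision transformer
DECODER = "airesearch/wangchanberta-base-att-spm-uncased"     # pretrained Thai text transformer

# Couple the two pretrained models; cross-attention from decoder to ViT
# features is added automatically by VisionEncoderDecoderModel.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(ENCODER, DECODER)
tokenizer = AutoTokenizer.from_pretrained(DECODER)
image_processor = ViTImageProcessor.from_pretrained(ENCODER)

# Tell the model how to start and pad generated Thai captions.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Freeze the ViT weights so that, on small captioning datasets,
# only the text decoder (and cross-attention) is trained.
for param in model.encoder.parameters():
    param.requires_grad = False

# Caption a single image (after fine-tuning on a Thai captioning dataset).
pixel_values = image_processor(
    Image.open("example.jpg").convert("RGB"), return_tensors="pt"
).pixel_values
caption_ids = model.generate(pixel_values, max_length=32)
print(tokenizer.decode(caption_ids[0], skip_special_tokens=True))
```

Freezing the encoder is the standard trade-off for low-data regimes: the ViT's general visual features are kept intact while the smaller number of trainable decoder parameters reduces overfitting on datasets like Travel or Food.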