Abstract
In this work, we propose a novel region-based and time-varying attention network (RTAN) for image captioning, which determines both where and when to attend to an image. The RTAN is composed of a region-based attention network (RAN) and a time-varying attention network (TAN). In the RAN, we integrate a region proposal network with a soft attention mechanism, so that the model can locate objects accurately in an image and focus on the object most relevant to the next word. In the TAN, we design a time-varying gate that determines whether visual information is needed to generate the next word. For example, when the next word is a non-visual word such as "the" or "to", our model predicts it based more on semantic information than on visual information. Compared with existing methods, the advantage of the proposed RTAN is twofold: (1) it extracts more discriminative visual information; (2) it can attend to semantic information alone when predicting non-visual words. The effectiveness of RTAN is verified on the MSCOCO and Flickr30k datasets.
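The two mechanisms described above can be sketched in a few lines: soft attention pools region-proposal features into a visual context, and a sigmoid gate decides how much of that context (versus the purely semantic hidden state) feeds the next-word prediction. This is a minimal illustrative sketch, not the authors' implementation; all weight names, shapes, and the bilinear attention score are assumptions.

```python
# Hypothetical sketch of one RTAN decoding step. The parameterization
# (bilinear attention score, scalar sigmoid gate) is an assumption made
# for illustration, not the paper's exact formulation.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rtan_step(h_t, regions, W_att, w_gate):
    """One decoding step.

    h_t     : (d,)   hidden state of the language model (semantic info)
    regions : (k, d) features of k region proposals (RAN input)
    W_att   : (d, d) soft-attention projection (assumed bilinear score)
    w_gate  : (d,)   time-varying gate weights (TAN)
    """
    # RAN: soft attention over region-proposal features
    scores = regions @ (W_att @ h_t)   # (k,) relevance of each region
    alpha = softmax(scores)            # attention weights sum to 1
    v_t = alpha @ regions              # attended visual context

    # TAN: gate beta_t near 0 => rely on semantic state only
    # (e.g. when the next word is a non-visual word like "the")
    beta_t = sigmoid(w_gate @ h_t)
    context = beta_t * v_t + (1.0 - beta_t) * h_t
    return context, alpha, beta_t

d, k = 8, 5
h_t = rng.standard_normal(d)
regions = rng.standard_normal((k, d))
context, alpha, beta_t = rtan_step(
    h_t, regions, rng.standard_normal((d, d)), rng.standard_normal(d)
)
print(alpha, beta_t)
```

In a full captioning model, `context` would be projected onto the vocabulary to score the next word; the key point the sketch shows is that the gate is recomputed at every time step, so visual grounding can be switched off for function words.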
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grants 61673402, 61273270, and 60802069, by the Natural Science Foundation of Guangdong Province under Grants 2017A030311029 and 2016B010109002, by the Science and Technology Program of Guangzhou, China, under Grant 201704020180, and by the Fundamental Research Funds for the Central Universities of China.
Cite this article
Wang, W., Hu, H. Image Captioning Using Region-Based Attention Joint with Time-Varying Attention. Neural Process Lett 50, 1005–1017 (2019). https://doi.org/10.1007/s11063-019-10005-z