Abstract
In this work, we propose a novel region-based and time-varying attention network (RTAN) for image captioning, which determines both where and when to attend to an image. The RTAN is composed of a region-based attention network (RAN) and a time-varying attention network (TAN). In the RAN, we integrate a region proposal network with a soft attention mechanism, so that the model can locate objects accurately in an image and focus on the object most relevant to the next word. In the TAN, we design a time-varying gate that determines whether visual information is needed to generate the next word. For example, when the next word is a non-visual word such as "the" or "to", our model predicts it based more on semantic information than on visual information. Compared with existing methods, the advantage of the proposed RTAN is twofold: (1) it extracts more discriminative visual information; (2) it can attend to semantic information alone when predicting non-visual words. The effectiveness of RTAN is verified on the MSCOCO and Flickr30k datasets.
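The two mechanisms described above can be sketched in a few lines: soft attention pools region-proposal features into a visual context, and a sigmoid gate decides how much of that context (versus the purely semantic hidden state) feeds the next-word prediction. This is a minimal illustrative sketch, not the authors' implementation; all weight names, shapes, and the bilinear attention score are assumptions.

```python
# Hypothetical sketch of one RTAN decoding step. The parameterization
# (bilinear attention score, scalar sigmoid gate) is an assumption made
# for illustration, not the paper's exact formulation.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rtan_step(h_t, regions, W_att, w_gate):
    """One decoding step.

    h_t     : (d,)   hidden state of the language model (semantic info)
    regions : (k, d) features of k region proposals (RAN input)
    W_att   : (d, d) soft-attention projection (assumed bilinear score)
    w_gate  : (d,)   time-varying gate weights (TAN)
    """
    # RAN: soft attention over region-proposal features
    scores = regions @ (W_att @ h_t)   # (k,) relevance of each region
    alpha = softmax(scores)            # attention weights sum to 1
    v_t = alpha @ regions              # attended visual context

    # TAN: gate beta_t near 0 => rely on semantic state only
    # (e.g. when the next word is a non-visual word like "the")
    beta_t = sigmoid(w_gate @ h_t)
    context = beta_t * v_t + (1.0 - beta_t) * h_t
    return context, alpha, beta_t

d, k = 8, 5
h_t = rng.standard_normal(d)
regions = rng.standard_normal((k, d))
context, alpha, beta_t = rtan_step(
    h_t, regions, rng.standard_normal((d, d)), rng.standard_normal(d)
)
print(alpha, beta_t)
```

In a full captioning model, `context` would be projected onto the vocabulary to score the next word; the key point the sketch shows is that the gate is recomputed at every time step, so visual grounding can be switched off for function words.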
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grants 61673402, 61273270, and 60802069, by the Natural Science Foundation of Guangdong Province under Grants 2017A030311029 and 2016B010109002, by the Science and Technology Program of Guangzhou, China, under Grant 201704020180, and by the Fundamental Research Funds for the Central Universities of China.
Cite this article
Wang, W., Hu, H. Image Captioning Using Region-Based Attention Joint with Time-Varying Attention. Neural Process Lett 50, 1005–1017 (2019). https://doi.org/10.1007/s11063-019-10005-z