Abstract
This paper presents a residual network tailored to generating high-quality image captions that faithfully interpret the relevant content of an image. The research develops a Residual Attention Generative Adversarial Network (RAGAN), which incorporates attention-based residual learning into a Generative Adversarial Network (GAN) to improve the diversity and fidelity of the generated captions. By selecting words from attended feature maps, RAGAN generates high-quality captions faster, increases the diversity of the captions produced, and raises language-metric scores. The generator is designed as an encoder-decoder network that operates in an unsupervised manner, with residual learning adopted between the encoder and the decoder. The discriminator is connected to a language-evaluator unit, which feeds its assessment back to the generator and discriminator to either positively or negatively influence the captioning process. Experiments show that the proposed RAGAN outperforms state-of-the-art GAN models.
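The abstract does not give implementation details, but the residual link between encoder and decoder, and the sign-based feedback from the language evaluator, can be illustrated with a minimal sketch. Everything here (layer shapes, the linear encoder/decoder, the tanh evaluator) is a hypothetical stand-in, not the authors' actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(image, W_enc):
    # Encode the image into a feature map (hypothetical linear layer + ReLU).
    return np.maximum(W_enc @ image, 0.0)

def decoder(features, residual, W_dec):
    # Decode word logits; the residual (skip) connection carries the encoder
    # features directly into the decoder input, as in residual learning.
    return W_dec @ (features + residual)

# Hypothetical dimensions: 64-d image vector, 32-d feature map, 10-word vocabulary.
image = rng.standard_normal(64)
W_enc = rng.standard_normal((32, 64))
W_dec = rng.standard_normal((10, 32))

features = encoder(image, W_enc)
logits = decoder(features, features, W_dec)  # residual = encoder output

# A stand-in language evaluator scores the chosen word; the sign of the score
# is fed back to generator and discriminator as a positive or negative signal.
caption_word = int(np.argmax(logits))
evaluator_score = float(np.tanh(logits[caption_word]))
feedback = 1 if evaluator_score > 0 else -1
print(caption_word, feedback)
```

The point of the sketch is only the data flow: the encoder output re-enters the decoder through a skip connection, and a scalar evaluation signal closes the loop back to both adversarial networks.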
Data availability
Data included in article/supplementary material/referenced in article.
Funding
Not applicable.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Human and animal rights
This research does not involve human participants or animals; hence, statements on informed consent and animal welfare do not apply.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Deepak, G., Gali, S., Sonker, A. et al. Automatic image captioning system using a deep learning approach. Soft Comput (2023). https://doi.org/10.1007/s00500-023-08544-8