
Research on image caption generation method based on multi-modal pre-training model and text mixup optimization

  • Original Paper
  • Published:
Signal, Image and Video Processing

Abstract

In recent years, multi-modal pre-training models have demonstrated remarkable cross-modal representation capabilities, driving multi-modal downstream tasks, and image caption generation in particular, toward more comprehensive end-to-end frameworks and significantly improving their performance. However, current approaches have yet to fully integrate scene-related multi-modal information into the end-to-end output; as a result, object descriptions are often inaccurate and generalization across different scenes remains weak. To address these challenges, this paper presents MTMixIC, an image caption generation model that combines multi-modal pre-training with text mixup optimization. Leveraging the close relationship between images and text, it translates images directly into descriptive text, even in zero-shot scenarios, achieving end-to-end caption generation. This overcomes the limitations of single-modal pre-training models, which struggle with multi-modal characteristics, and improves caption quality. In addition, a text mixup optimization network refines the generated captions by exploiting correlations between multi-modal features and captions, preserving semantic accuracy while better matching the annotation style of the scene; this mitigates the weak performance of zero-shot caption generation and strengthens generalization. Experiments on the MS COCO 2014, MS COCO 2017, Flickr8k, Sydney, and RSICD datasets, among others, show higher accuracy, fluency, and generalization than traditional architectures. The proposed model offers a novel approach and contributes to the evolution of multi-modal downstream tasks.
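To make the text mixup idea concrete, the sketch below interpolates two caption embeddings with a Beta-distributed weight, following the general mixup formulation (interpolating pairs of training examples) applied to text features. This is a minimal illustration only: the function name, the 512-dimensional embeddings, and the choice to mix at the sentence-embedding level are assumptions for demonstration and do not reproduce the MTMixIC optimization network itself.

    import numpy as np

    def mixup_text_embeddings(emb_a, emb_b, alpha=0.4, rng=None):
        """Blend two caption embeddings with a Beta(alpha, alpha) weight,
        in the spirit of mixup applied to text features (illustrative only)."""
        rng = rng if rng is not None else np.random.default_rng()
        lam = rng.beta(alpha, alpha)               # mixing coefficient in (0, 1)
        mixed = lam * emb_a + (1.0 - lam) * emb_b  # convex combination of features
        return mixed, lam

    # Hypothetical 512-dimensional sentence embeddings of two reference captions.
    cap_a = np.random.randn(512)
    cap_b = np.random.randn(512)
    mixed, lam = mixup_text_embeddings(cap_a, cap_b)
    print(f"lambda = {lam:.3f}, mixed embedding shape = {mixed.shape}")

In training, the same coefficient lam would typically also weight the corresponding caption targets or losses, which is how mixup acts as a regularizer; the exact mixing strategy used by MTMixIC may differ from this sketch.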





Funding

This study was funded by the Science and Technology Project in Xi'an (No. 22GXFW0123). The authors would like to thank the anonymous reviewers for their helpful comments and suggestions.

Author information

Authors and Affiliations

Authors

Contributions

Jing-Tao Sun and Xuan Min wrote the main manuscript text. All authors reviewed the manuscript.

Corresponding author

Correspondence to Xuan Min.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Sun, JT., Min, X. Research on image caption generation method based on multi-modal pre-training model and text mixup optimization. SIViP (2024). https://doi.org/10.1007/s11760-024-03268-0

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s11760-024-03268-0

Keywords
