An Investigation of CNN-CARU for Image Captioning

  • Conference paper
4th International Conference on Electronics and Signal Processing (ICESP 2023)

Abstract

The goal of image captioning is to extract the essential information from an image and describe its content in natural language. Such a description can be obtained directly from a human-understandable description of an image of interest (retrieval-based captioning, matching the image's objects and their actions), or it can be generated by an encoder–decoder neural network. The challenge for the learning model is that it must project the media feature into natural language, producing the description in a different feature domain, so it may misidentify scene or semantic elements. In this chapter, we address these challenges by introducing a novel image captioning framework that combines generation and retrieval. A CNN-CARU model is introduced, in which the image is first encoded by a CNN-based network and multiple captions are then generated for the target image by a CARU recurrent network.
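
To make the two-stage pipeline concrete, the sketch below implements it in PyTorch: pooled CNN features initialise the decoder state, and a CARU-style recurrent cell emits the caption tokens. The gating follows the content-adaptive update described in Chan, Ke, and Im's CARU work (a GRU-like cell whose update gate is additionally modulated by the projected input); the layer sizes, weight names, and the pooled-feature interface are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a CNN-CARU captioner (assumed interfaces, not the
# paper's released code). The CARU cell below follows the content-adaptive
# gating idea of Chan, Ke, and Im's CARU: a GRU-like update gate that is
# further modulated by the projected current input.
import torch
import torch.nn as nn


class CARUCell(nn.Module):
    """Content-Adaptive Recurrent Unit (sketch)."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.x_proj = nn.Linear(input_size, hidden_size)   # content feature
        self.h_proj = nn.Linear(hidden_size, hidden_size)  # history feature
        self.z_gate = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        xt = self.x_proj(x)                           # projected input
        n = torch.tanh(self.h_proj(h) + xt)           # candidate state
        z = torch.sigmoid(self.z_gate(torch.cat([x, h], dim=-1)))
        l = torch.sigmoid(xt) * z                     # content-adaptive gate
        return (1.0 - l) * h + l * n                  # blended hidden state


class CNNCARUCaptioner(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 256,
                 hidden_dim: int = 512, feat_dim: int = 2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(feat_dim, hidden_dim)  # CNN feature -> h0
        self.cell = CARUCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats: torch.Tensor, tokens: torch.Tensor):
        # feats: (B, feat_dim) pooled CNN features; tokens: (B, T) word ids.
        h = torch.tanh(self.init_h(feats))
        logits = []
        for t in range(tokens.size(1)):               # teacher forcing
            h = self.cell(self.embed(tokens[:, t]), h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)             # (B, T, vocab)


# Usage: pooled features from any CNN backbone (e.g. a ResNet) stand in
# for the encoder; training would minimise cross-entropy over the
# shifted caption tokens.
model = CNNCARUCaptioner(vocab_size=10000)
feats = torch.randn(4, 2048)                          # stand-in CNN features
tokens = torch.randint(0, 10000, (4, 12))
print(model(feats, tokens).shape)                     # torch.Size([4, 12, 10000])
```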


Acknowledgements

This work is supported by the Macao Polytechnic University (Research Project RP/FCA-06/2023).

Author information

Corresponding author

Correspondence to Ka-Hou Chan.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Im, S. K., & Chan, K. H. (2024). An Investigation of CNN-CARU for Image Captioning. In: Yeom, S. (Ed.), 4th International Conference on Electronics and Signal Processing (ICESP 2023). Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-031-36670-3_2

  • DOI: https://doi.org/10.1007/978-3-031-36670-3_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-36669-7

  • Online ISBN: 978-3-031-36670-3

  • eBook Packages: Engineering (R0)
