An Investigation of CNN-CARU for Image Captioning

  • Conference paper
4th International Conference on Electronics and Signal Processing (ICESP 2023)

Abstract

The goal of image captioning is to extract the essential information from an image and describe its content in natural language. Such a description can be obtained directly from a human-understandable description of an image of interest (retrieval-based captioning, matching the image's objects and their actions), or it can be generated by an encoder–decoder neural network. The challenge for the learning model is that it must project the media feature into natural language, producing the description in a different feature domain, so it may misidentify scene or semantic elements. In this chapter, we address these challenges by introducing a novel image captioning framework that combines generation and retrieval. A CNN-CARU model is introduced, in which the image is first encoded by a CNN-based network and multiple captions are then generated for the target image by a CARU recurrent network.
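
To make the two-stage pipeline concrete, the sketch below implements it in PyTorch: pooled CNN features initialise the decoder state, and a CARU-style recurrent cell emits the caption tokens. The gating follows the content-adaptive update described in Chan, Ke, and Im's CARU work (a GRU-like cell whose update gate is additionally modulated by the projected input); the layer sizes, weight names, and the pooled-feature interface are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a CNN-CARU captioner (assumed interfaces, not the
# paper's released code). The CARU cell below follows the content-adaptive
# gating idea of Chan, Ke, and Im's CARU: a GRU-like update gate that is
# further modulated by the projected current input.
import torch
import torch.nn as nn


class CARUCell(nn.Module):
    """Content-Adaptive Recurrent Unit (sketch)."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.x_proj = nn.Linear(input_size, hidden_size)   # content feature
        self.h_proj = nn.Linear(hidden_size, hidden_size)  # history feature
        self.z_gate = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        xt = self.x_proj(x)                           # projected input
        n = torch.tanh(self.h_proj(h) + xt)           # candidate state
        z = torch.sigmoid(self.z_gate(torch.cat([x, h], dim=-1)))
        l = torch.sigmoid(xt) * z                     # content-adaptive gate
        return (1.0 - l) * h + l * n                  # blended hidden state


class CNNCARUCaptioner(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 256,
                 hidden_dim: int = 512, feat_dim: int = 2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(feat_dim, hidden_dim)  # CNN feature -> h0
        self.cell = CARUCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats: torch.Tensor, tokens: torch.Tensor):
        # feats: (B, feat_dim) pooled CNN features; tokens: (B, T) word ids.
        h = torch.tanh(self.init_h(feats))
        logits = []
        for t in range(tokens.size(1)):               # teacher forcing
            h = self.cell(self.embed(tokens[:, t]), h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)             # (B, T, vocab)


# Usage: pooled features from any CNN backbone (e.g. a ResNet) stand in
# for the encoder; training would minimise cross-entropy over the
# shifted caption tokens.
model = CNNCARUCaptioner(vocab_size=10000)
feats = torch.randn(4, 2048)                          # stand-in CNN features
tokens = torch.randint(0, 10000, (4, 12))
print(model(feats, tokens).shape)                     # torch.Size([4, 12, 10000])
```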


Acknowledgements

This work is supported by the Macao Polytechnic University (Research Project RP/FCA-06/2023).

Author information

Corresponding author

Correspondence to Ka-Hou Chan.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Im, S. K., & Chan, K. H. (2024). An Investigation of CNN-CARU for Image Captioning. In: Yeom, S. (Ed.), 4th International Conference on Electronics and Signal Processing (ICESP 2023). Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-031-36670-3_2

  • DOI: https://doi.org/10.1007/978-3-031-36670-3_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-36669-7

  • Online ISBN: 978-3-031-36670-3

  • eBook Packages: Engineering (R0)
