Abstract
Video captioning aims to generate natural language descriptions of video content. Traditional methods extract visual features and the interactive relationships between objects, but they ignore the isolation of video features and the semantic hierarchy. This paper proposes a multi-level video captioning method based on semantic space (S-MLM) to address these problems. S-MLM extracts visual elements and visual relationships at different levels and aggregates this visual information layer by layer, building visual features from low level to high level. From a semantic perspective, it constructs a multi-level structured semantic graph that relies on no external knowledge base, instead using the video's own information as guidance to enhance feature representation and improve semantic understanding. Experiments on the MSVD and MSR-VTT datasets show that S-MLM further improves video captioning performance.
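Since the paper's implementation is not reproduced here, the following is only a minimal sketch of the layer-by-layer aggregation idea described in the abstract; the module names, feature dimensions, and fusion rule are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class LevelAggregator(nn.Module):
    """Fuses the aggregate of the level below with the current level's features."""
    def __init__(self, dim):
        super().__init__()
        # Hypothetical fusion rule: concatenate and project back to `dim`.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, lower, current):
        return torch.relu(self.fuse(torch.cat([lower, current], dim=-1)))

class MultiLevelEncoder(nn.Module):
    """Aggregates multi-level visual features (e.g., object-, relation-,
    frame-level) layer by layer into a high-level representation."""
    def __init__(self, dim=512, num_levels=3):
        super().__init__()
        self.aggregators = nn.ModuleList(
            LevelAggregator(dim) for _ in range(num_levels - 1)
        )

    def forward(self, levels):
        # `levels` is a list of (batch, dim) tensors ordered low -> high.
        agg = levels[0]
        for aggregator, feats in zip(self.aggregators, levels[1:]):
            agg = aggregator(agg, feats)
        return agg  # aggregated high-level visual feature

# Usage: three levels of pooled features for a batch of 2 videos.
encoder = MultiLevelEncoder(dim=512, num_levels=3)
features = [torch.randn(2, 512) for _ in range(3)]
print(encoder(features).shape)  # torch.Size([2, 512])
```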
Data Availability
The data that support the findings of this study are openly available. The MSR-VTT dataset is available at https://disk.pku.edu.cn/#/link/BE39AF93BE1882FF987BAC900202B266. The MSVD dataset is available at https://disk.pku.edu.cn/#/link/CC02BD15907BFFF63E5AAE4BF353A202. The VATEX dataset is available at https://hyper.ai/datasets/17484. The YouCookII dataset is available at https://hyper.ai/datasets/17147. The TVC dataset is available at https://tvr.cs.unc.edu/tvc.html.
Acknowledgements
This work was supported by the Fundamental Research Funds for the Central Universities under grant B220202019, the National Natural Science Foundation of China under grant 62276090, the Top Talent of Changzhou "14th Five-Year Plan" High-Level Health Talents Training Project (Grant No. 2022260), the 2023 Soochow University Graduate Education Reform Achievement Award Cultivation Project (KY20231517), and the Key Research and Development Program of Jiangsu under grants BK20192004 and BE2018004-04.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest with respect to the authorship or publication of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yao, X., Zeng, Y., Gu, M. et al. Multi-level video captioning method based on semantic space. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-18372-z