Abstract
Video captioning aims to generate natural language descriptions of video content. Traditional methods extract visual features and the interactive relationships between objects, but they ignore the isolation of video features and the semantic hierarchy. This paper proposes a multi-level video captioning method based on semantic space (S-MLM) to address these problems. S-MLM extracts visual elements and visual relationships at different levels and aggregates this visual information layer by layer, building visual features from low level to high level. From a semantic perspective, it constructs a multi-level structured semantic graph that relies on no external knowledge base, instead using the video's own information as guidance to enhance feature representation and improve semantic understanding. Experiments on the MSVD and MSR-VTT datasets show that S-MLM further improves video captioning performance.
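Since the paper's implementation is not reproduced here, the following is only a minimal sketch of the layer-by-layer aggregation idea described in the abstract; the module names, feature dimensions, and fusion rule are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class LevelAggregator(nn.Module):
    """Fuses the aggregate of the level below with the current level's features."""
    def __init__(self, dim):
        super().__init__()
        # Hypothetical fusion rule: concatenate and project back to `dim`.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, lower, current):
        return torch.relu(self.fuse(torch.cat([lower, current], dim=-1)))

class MultiLevelEncoder(nn.Module):
    """Aggregates multi-level visual features (e.g., object-, relation-,
    frame-level) layer by layer into a high-level representation."""
    def __init__(self, dim=512, num_levels=3):
        super().__init__()
        self.aggregators = nn.ModuleList(
            LevelAggregator(dim) for _ in range(num_levels - 1)
        )

    def forward(self, levels):
        # `levels` is a list of (batch, dim) tensors ordered low -> high.
        agg = levels[0]
        for aggregator, feats in zip(self.aggregators, levels[1:]):
            agg = aggregator(agg, feats)
        return agg  # aggregated high-level visual feature

# Usage: three levels of pooled features for a batch of 2 videos.
encoder = MultiLevelEncoder(dim=512, num_levels=3)
features = [torch.randn(2, 512) for _ in range(3)]
print(encoder(features).shape)  # torch.Size([2, 512])
```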
Data Availability
The data that support the findings of this study are openly available. The MSR-VTT dataset is available at https://disk.pku.edu.cn/#/link/BE39AF93BE1882FF987BAC900202B266. The MSVD dataset is available at https://disk.pku.edu.cn/#/link/CC02BD15907BFFF63E5AAE4BF353A202. The VATEX dataset is available at https://hyper.ai/datasets/17484. The YouCookII dataset is available at https://hyper.ai/datasets/17147. The TVC dataset is available at https://tvr.cs.unc.edu/tvc.html.
Acknowledgements
This work was supported by the Fundamental Research Funds for the Central Universities under grant B220202019, the National Natural Science Foundation of China under grant 62276090, the Top Talent of Changzhou "14th Five-Year Plan" High-Level Health Talents Training Project (Grant No. 2022260), the 2023 Soochow University Graduate Education Reform Achievement Award Cultivation Project (KY20231517), and the Key Research and Development Program of Jiangsu under grants BK20192004 and BE2018004-04.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest with respect to the authorship or publication of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yao, X., Zeng, Y., Gu, M. et al. Multi-level video captioning method based on semantic space. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-18372-z