Multi-level video captioning method based on semantic space

Published in Multimedia Tools and Applications

Abstract

Video captioning aims to generate natural-language descriptions of video content. Traditional methods extract visual features and the interaction features between objects, but they overlook two problems: video features remain isolated from one another, and the semantic hierarchy of the video is ignored. This paper proposes a multi-level video captioning method based on semantic space (S-MLM) to address these problems. S-MLM extracts visual elements and visual relationships at different levels and aggregates this visual information layer by layer, building visual features from low level to high level. From a semantic perspective, it constructs a multi-level structured semantic graph that does not rely on external knowledge bases; instead, it uses the video's own information as guidance to enhance feature representation and improve semantic understanding. Experiments on the MSVD and MSR-VTT datasets show that the proposed method further improves video captioning performance.
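To make the layer-by-layer aggregation described above concrete, the following is a minimal sketch of how such a visual hierarchy could be realised in PyTorch: object-level features are pooled into frame-level features, which are then pooled into a video-level feature that is fed back to enrich each frame. All module names, dimensions, and the attention-based pooling are assumptions made for illustration; this is not the authors' S-MLM implementation, which additionally builds the multi-level semantic graph described in the abstract.

```python
# Illustrative sketch only: a hierarchical (low-to-high level) visual feature
# aggregator. Module names, dimensions, and the attention pooling are assumed
# for exposition and do not reproduce the S-MLM architecture.
import torch
import torch.nn as nn


class AttentionPool(nn.Module):
    """Pools a set of features into one vector with learned attention weights."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                          # x: (batch, num_items, dim)
        w = torch.softmax(self.score(x), dim=1)    # (batch, num_items, 1)
        return (w * x).sum(dim=1)                  # (batch, dim)


class HierarchicalAggregator(nn.Module):
    """Aggregates object-level features into frame-level features, then
    frame-level features into a video-level feature, layer by layer."""
    def __init__(self, dim=512):
        super().__init__()
        self.object_to_frame = AttentionPool(dim)
        self.frame_to_video = AttentionPool(dim)
        self.fuse = nn.Linear(2 * dim, dim)        # mixes frame and video context

    def forward(self, object_feats):
        # object_feats: (batch, num_frames, num_objects, dim)
        b, t, n, d = object_feats.shape
        frame_feats = self.object_to_frame(object_feats.view(b * t, n, d)).view(b, t, d)
        video_feat = self.frame_to_video(frame_feats)            # (batch, dim)
        # enrich each frame-level feature with the global video context
        video_ctx = video_feat.unsqueeze(1).expand(-1, t, -1)
        enriched_frames = self.fuse(torch.cat([frame_feats, video_ctx], dim=-1))
        return enriched_frames, video_feat


if __name__ == "__main__":
    feats = torch.randn(2, 8, 5, 512)              # 2 videos, 8 frames, 5 objects
    frames, video = HierarchicalAggregator()(feats)
    print(frames.shape, video.shape)               # (2, 8, 512) and (2, 512)
```

The same pattern extends to further levels (for example, clip-level features) by stacking additional pooling stages, which is the sense in which low-level visual information is aggregated into high-level features.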


Data Availability

The data that support the findings of this study are openly available. The MSR-VTT dataset is available at https://disk.pku.edu.cn/#/link/BE39AF93BE1882FF987BAC900202B266. The MSVD dataset is available at https://disk.pku.edu.cn/#/link/CC02BD15907BFFF63E5AAE4BF353A202. The VATEX dataset is available at https://hyper.ai/datasets/17484. The YouCookII dataset is available at https://hyper.ai/datasets/17147. The TVC dataset is available at https://tvr.cs.unc.edu/tvc.html.


Acknowledgements

This work was supported by the Fundamental Research Funds for the Central Universities (B220202019), the National Natural Science Foundation of China under grant 62276090, the Top Talent of Changzhou "The 14th Five-Year Plan" High-Level Health Talents Training Project (Grant No. 2022260), the 2023 Soochow University Graduate Education Reform Achievement Award Cultivation Project (KY20231517), and the Key Research and Development Program of Jiangsu under grants BK20192004 and BE2018004-04.

Author information

Corresponding author

Correspondence to Min Gu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest with respect to the authorship or publication of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Yao, X., Zeng, Y., Gu, M. et al. Multi-level video captioning method based on semantic space. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-18372-z

