Abstract
Video description aims to translate the visual content of a video into appropriate natural language. Most existing works focus only on describing factual content and pay insufficient attention to the emotions in the video, so the generated sentences often lack flexibility and vividness. In this work, a model based on fact enhancement and emotion awakening is proposed to describe videos, making the sentences more attractive and colorful. First, a multi-layer sequential network is built with a deep incremental learning strategy, and a multi-stage training method is used to optimize the model sufficiently. Second, modules for fact inspiration, fact reinforcement, and emotion awakening are constructed layer by layer to discover more facts and embed emotions naturally. The three modules are trained cumulatively to fully mine the factual and emotional information. Two public datasets, EmVidCap-S and EmVidCap, are employed to evaluate the proposed model. The experimental results show that the proposed model outperforms not only the baseline models but also other popular methods.
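The layer-by-layer scheme in the abstract can be pictured as three stacked sequential modules optimized stage by stage. The sketch below is a minimal, hypothetical PyTorch rendering of that idea, not the authors' implementation: all class and function names (CaptionLayer, IncrementalCaptioner, train_stage) are invented for illustration, and freezing earlier modules at each stage is only one plausible reading of "cumulatively trained".

```python
# Hypothetical sketch only: module names and the freezing strategy are
# assumptions for illustration, not the paper's actual implementation.
import torch
import torch.nn as nn

class CaptionLayer(nn.Module):
    """One sequential (LSTM) layer standing in for a fact/emotion module."""
    def __init__(self, in_dim, hid_dim, vocab_size):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, hid_dim, batch_first=True)
        self.head = nn.Linear(hid_dim, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(x)          # hidden states feed the next layer
        return h, self.head(h)      # logits supervise this layer's stage

class IncrementalCaptioner(nn.Module):
    """Three modules stacked layer by layer, mirroring the abstract:
    fact inspiration -> fact reinforcement -> emotion awakening."""
    def __init__(self, feat_dim=512, hid_dim=256, vocab_size=1000):
        super().__init__()
        self.stages = nn.ModuleList([
            CaptionLayer(feat_dim, hid_dim, vocab_size),  # fact inspiration
            CaptionLayer(hid_dim, hid_dim, vocab_size),   # fact reinforcement
            CaptionLayer(hid_dim, hid_dim, vocab_size),   # emotion awakening
        ])

    def forward(self, video_feats):
        logits, h = [], video_feats
        for layer in self.stages:
            h, out = layer(h)
            logits.append(out)
        return logits

def train_stage(model, stage, loader, epochs=1):
    """One step of the multi-stage schedule: freeze earlier modules and
    optimize only the module added at this stage."""
    for i, layer in enumerate(model.stages):
        for p in layer.parameters():
            p.requires_grad_(i == stage)
    opt = torch.optim.Adam(model.stages[stage].parameters(), lr=1e-4)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, targets in loader:
            out = model(feats)[stage]  # supervise the newly added layer
            loss = ce(out.reshape(-1, out.size(-1)), targets.reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()

# Toy run: 4 clips, 8 frames of 512-d features, 8-token target captions.
loader = [(torch.randn(4, 8, 512), torch.randint(0, 1000, (4, 8)))]
model = IncrementalCaptioner()
for stage in range(3):              # incremental, stage-by-stage optimization
    train_stage(model, stage, loader)
```

Freezing the already-trained layers keeps each stage's optimization focused on the newly added capacity; the paper's actual schedule may instead fine-tune all layers jointly at each stage.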
Data availability
All the experimental data of this study are included in the paper and can be consulted through its charts and tables. The EmVidCap datasets (including EmVidCap-S and EmVidCap-L) employed in this work can be obtained by contacting the authors. The MSVD dataset can be downloaded from http://www.cs.utexas.edu/users/ml/clamp/videoDescription/Youtubeclips.tar.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (Nos. 62362041, 62062041, 62362003), the Scientific Research Foundation of the Education Bureau of Jiangxi Province (No. GJJ211009), the Jiangxi Provincial Natural Science Foundation (Nos. 20212BAB202020, 20232BAB202017), and the Ph.D. Research Initiation Project of Jinggangshan University (No. JZB1923).
Ethics declarations
Conflict of interest
All authors declare that they have no conflict of interest with other people or organizations.
Ethical approval
This article does not contain any studies with human participants performed by any of the authors.
Informed consent
Informed consent was obtained from all individual participants included in the study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Tang, P., Rao, H., Zhang, A. et al. Video emotional description with fact reinforcement and emotion awaking. J Ambient Intell Human Comput 15, 2839–2852 (2024). https://doi.org/10.1007/s12652-024-04779-x