Abstract
Video description aims to translate the visual content of a video into appropriate natural language. Most existing works focus only on describing factual content and pay insufficient attention to the emotions in the video, so the generated sentences often lack flexibility and vividness. In this work, a model based on fact enhancement and emotion awakening is proposed to describe videos, making the sentences more attractive and colorful. First, a multi-layer sequential network is built with a deep incremental learning strategy, and a multi-stage training method is used to optimize the model sufficiently. Second, modules for fact inspiration, fact reinforcement, and emotion awakening are constructed layer by layer to discover more facts and embed emotions naturally. The three modules are trained cumulatively to fully mine the factual and emotional information. Two public datasets, EmVidCap-S and EmVidCap, are employed to evaluate the proposed model. The experimental results show that the proposed model outperforms not only the baseline models but also other popular methods.
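The layer-by-layer scheme in the abstract can be pictured as three stacked sequential modules optimized stage by stage. The sketch below is a minimal, hypothetical PyTorch rendering of that idea, not the authors' implementation: all class and function names (CaptionLayer, IncrementalCaptioner, train_stage) are invented for illustration, and freezing earlier modules at each stage is only one plausible reading of "cumulatively trained".

```python
# Hypothetical sketch only: module names and the freezing strategy are
# assumptions for illustration, not the paper's actual implementation.
import torch
import torch.nn as nn

class CaptionLayer(nn.Module):
    """One sequential (LSTM) layer standing in for a fact/emotion module."""
    def __init__(self, in_dim, hid_dim, vocab_size):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, hid_dim, batch_first=True)
        self.head = nn.Linear(hid_dim, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(x)          # hidden states feed the next layer
        return h, self.head(h)      # logits supervise this layer's stage

class IncrementalCaptioner(nn.Module):
    """Three modules stacked layer by layer, mirroring the abstract:
    fact inspiration -> fact reinforcement -> emotion awakening."""
    def __init__(self, feat_dim=512, hid_dim=256, vocab_size=1000):
        super().__init__()
        self.stages = nn.ModuleList([
            CaptionLayer(feat_dim, hid_dim, vocab_size),  # fact inspiration
            CaptionLayer(hid_dim, hid_dim, vocab_size),   # fact reinforcement
            CaptionLayer(hid_dim, hid_dim, vocab_size),   # emotion awakening
        ])

    def forward(self, video_feats):
        logits, h = [], video_feats
        for layer in self.stages:
            h, out = layer(h)
            logits.append(out)
        return logits

def train_stage(model, stage, loader, epochs=1):
    """One step of the multi-stage schedule: freeze earlier modules and
    optimize only the module added at this stage."""
    for i, layer in enumerate(model.stages):
        for p in layer.parameters():
            p.requires_grad_(i == stage)
    opt = torch.optim.Adam(model.stages[stage].parameters(), lr=1e-4)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, targets in loader:
            out = model(feats)[stage]  # supervise the newly added layer
            loss = ce(out.reshape(-1, out.size(-1)), targets.reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()

# Toy run: 4 clips, 8 frames of 512-d features, 8-token target captions.
loader = [(torch.randn(4, 8, 512), torch.randint(0, 1000, (4, 8)))]
model = IncrementalCaptioner()
for stage in range(3):              # incremental, stage-by-stage optimization
    train_stage(model, stage, loader)
```

Freezing the already-trained layers keeps each stage's optimization focused on the newly added capacity; the paper's actual schedule may instead fine-tune all layers jointly at each stage.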
Data availability
All the experimental data of this study are included in the paper and can be consulted through its charts and tables. The EmVidCap datasets (including EmVidCap-S and EmVidCap-L) employed in this work can be obtained by contacting the authors. The MSVD dataset can be downloaded from http://www.cs.utexas.edu/users/ml/clamp/videoDescription/Youtubeclips.tar.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (Nos. 62362041, 62062041, 62362003), the Scientific Research Foundation of the Education Bureau of Jiangxi Province (No. GJJ211009), the Jiangxi Provincial Natural Science Foundation (Nos. 20212BAB202020, 20232BAB202017), and the Ph.D. Research Initiation Project of Jinggangshan University (No. JZB1923).
Ethics declarations
Conflict of interest
All authors declare that they have no conflict of interest with other people or organizations.
Ethical approval
This article does not contain any studies with human participants performed by any of the authors.
Informed consent
Informed consent was obtained from all individual participants included in the study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Tang, P., Rao, H., Zhang, A. et al. Video emotional description with fact reinforcement and emotion awaking. J Ambient Intell Human Comput 15, 2839–2852 (2024). https://doi.org/10.1007/s12652-024-04779-x