A customizable framework for multimodal emotion recognition using ensemble of deep neural network models

Regular Paper · Published in Multimedia Systems

Abstract

Multimodal emotion recognition in videos of human oration, commonly called opinion videos, in which speakers express their views on various topics, has applications across many domains. The field is actively researched with the aim of producing accurate and efficient recognition architectures, and this study shares that objective while exploring novel concepts in emotion recognition. The proposed framework uses cross-dataset training and testing so that the resulting architecture and models are not restricted to the domain of the input. It combines benchmark datasets with ensemble learning so that any bias in an individual model can be countered by the learnings of the other models. To this end, three benchmark datasets, ISEAR, RAVDESS, and FER-2013, are used to train independent models for the three modalities of text, audio, and images; an additional dataset is used alongside ISEAR to train the text model. The unimodal models are then combined and tested on the benchmark multimodal dataset CMU-MOSEI. The text model uses ELMo embeddings with an RNN, the audio model uses a simple DNN, and the image model uses a 2D CNN after pre-processing. Their outputs are aggregated with the stacking technique to produce the final prediction. The complete architecture can thus be used as partially pre-trained, for prediction on the individual modalities, and partially trainable, for stacking the unimodal results into an emotion prediction that reflects the quality of the input. On the CMU-MOSEI dataset, the framework achieves an accuracy of 86.60% and an F1-score of 0.84.
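To make the stacking step concrete, below is a minimal sketch of how the frozen unimodal models' outputs could be combined by a trainable meta-learner. It assumes six emotion classes, random placeholder probabilities in place of the real text (ELMo + RNN), audio (DNN), and image (2D CNN) outputs, and a logistic-regression combiner; the abstract does not specify the meta-learner, so all of these are illustrative assumptions rather than the authors' exact implementation.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Placeholder softmax outputs for N clips over C emotion classes. In the
    # framework these would come from the pre-trained text, audio, and image
    # models, which stay frozen ("partially pre-trained") at this stage.
    rng = np.random.default_rng(seed=42)
    N, C = 200, 6  # hypothetical clip count and class count
    p_text = rng.dirichlet(np.ones(C), size=N)
    p_audio = rng.dirichlet(np.ones(C), size=N)
    p_image = rng.dirichlet(np.ones(C), size=N)
    y = rng.integers(0, C, size=N)  # placeholder emotion labels

    # Stacking: concatenate the unimodal class probabilities into meta-features
    # and fit the trainable combiner on top (logistic regression is an assumed
    # stand-in for whatever meta-learner the full paper uses).
    X_meta = np.hstack([p_text, p_audio, p_image])  # shape (N, 3 * C)
    meta_learner = LogisticRegression(max_iter=1000).fit(X_meta, y)

    # Only this stacking layer needs (re)training on new data, which is what
    # makes the overall architecture "partially trainable".
    print(meta_learner.predict(X_meta[:5]))

Because only the combiner is retrained, the same pre-trained unimodal models can be reused across input domains, which is the cross-dataset property the abstract emphasizes.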


Data availability

The datasets generated and/or analyzed during the current study are available from the corresponding author upon reasonable request.


Author information

Corresponding author

Correspondence to Shashank Mouli Satapathy.

Ethics declarations

Conflict of interest

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. The authors have no competing interests to declare that are relevant to the content of this article.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Communicated by X. Li.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dixit, C., Satapathy, S.M. A customizable framework for multimodal emotion recognition using ensemble of deep neural network models. Multimedia Systems 29, 3151–3168 (2023). https://doi.org/10.1007/s00530-023-01188-6

