A customizable framework for multimodal emotion recognition using ensemble of deep neural network models

Regular Paper · Published in Multimedia Systems

Abstract

Multimodal emotion recognition in videos of human oration, commonly called opinion videos, in which speakers express their views on various topics, has applications across many domains. The field is actively researched with the aim of producing accurate and efficient recognition architectures, and this study shares that objective while exploring novel concepts in emotion recognition. The proposed framework uses cross-dataset training and testing so that the resulting architecture and models are not restricted to the domain of the input. It combines benchmark datasets with ensemble learning so that any bias in an individual model can be countered by the learnings of the other models. To this end, three benchmark datasets, ISEAR, RAVDESS, and FER-2013, are used to train independent models for the three modalities of text, audio, and images; an additional dataset is used alongside ISEAR to train the text model. The unimodal models are then combined and tested on the benchmark multimodal dataset CMU-MOSEI. The text model uses ELMo embeddings with an RNN, the audio model uses a simple DNN, and the image model uses a 2D CNN after pre-processing. Their outputs are aggregated with the stacking technique to produce the final prediction. The complete architecture can thus be used as partially pre-trained, for prediction on the individual modalities, and partially trainable, for stacking the unimodal results into an emotion prediction that reflects the quality of the input. On the CMU-MOSEI dataset, the framework achieves an accuracy of 86.60% and an F1-score of 0.84.
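To make the stacking step concrete, below is a minimal sketch of how the frozen unimodal models' outputs could be combined by a trainable meta-learner. It assumes six emotion classes, random placeholder probabilities in place of the real text (ELMo + RNN), audio (DNN), and image (2D CNN) outputs, and a logistic-regression combiner; the abstract does not specify the meta-learner, so all of these are illustrative assumptions rather than the authors' exact implementation.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Placeholder softmax outputs for N clips over C emotion classes. In the
    # framework these would come from the pre-trained text, audio, and image
    # models, which stay frozen ("partially pre-trained") at this stage.
    rng = np.random.default_rng(seed=42)
    N, C = 200, 6  # hypothetical clip count and class count
    p_text = rng.dirichlet(np.ones(C), size=N)
    p_audio = rng.dirichlet(np.ones(C), size=N)
    p_image = rng.dirichlet(np.ones(C), size=N)
    y = rng.integers(0, C, size=N)  # placeholder emotion labels

    # Stacking: concatenate the unimodal class probabilities into meta-features
    # and fit the trainable combiner on top (logistic regression is an assumed
    # stand-in for whatever meta-learner the full paper uses).
    X_meta = np.hstack([p_text, p_audio, p_image])  # shape (N, 3 * C)
    meta_learner = LogisticRegression(max_iter=1000).fit(X_meta, y)

    # Only this stacking layer needs (re)training on new data, which is what
    # makes the overall architecture "partially trainable".
    print(meta_learner.predict(X_meta[:5]))

Because only the combiner is retrained, the same pre-trained unimodal models can be reused across input domains, which is the cross-dataset property the abstract emphasizes.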


Data availability

The datasets generated and/or analyzed during the current study are available from the corresponding author upon reasonable request.


Author information

Corresponding author

Correspondence to Shashank Mouli Satapathy.

Ethics declarations

Conflict of interest

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. The authors have no competing interests to declare that are relevant to the content of this article.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Communicated by X. Li.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dixit, C., Satapathy, S.M. A customizable framework for multimodal emotion recognition using ensemble of deep neural network models. Multimedia Systems 29, 3151–3168 (2023). https://doi.org/10.1007/s00530-023-01188-6

