
Self-supervised Learning for Speech Emotion Recognition Task Using Audio-visual Features and Distil Hubert Model on BAVED and RAVDESS Databases

Published in Journal of Systems Science and Systems Engineering

Abstract

Existing pre-trained models such as Distil HuBERT excel at uncovering hidden patterns and enabling accurate recognition across diverse data types, including audio and visual information. We harnessed this capability to develop a deep learning model that uses Distil HuBERT to jointly learn these combined features for speech emotion recognition (SER). Our experiments highlight its distinct advantages: it significantly outperforms Wav2vec 2.0 in both offline and real-time accuracy on the RAVDESS and BAVED datasets. Although it slightly trails HuBERT in offline accuracy, Distil HuBERT delivers comparable performance at a fraction of the model size, making it well suited to resource-constrained environments such as mobile devices. The distillation does involve a trade-off between offline and real-time performance: Distil HuBERT achieved 96.33% offline accuracy on the BAVED database and 87.01% on the RAVDESS database, while in real-time evaluation accuracy decreased to 79.3% on BAVED and 77.87% on RAVDESS. This drop is likely caused by the challenges of real-time processing, including latency and noise, yet the model still performs strongly in practical scenarios. Distil HuBERT therefore emerges as a compelling choice for SER, particularly when accuracy is prioritized over real-time processing, and its compact size further enhances its potential in resource-limited settings across a wide range of applications.
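To make the approach concrete, here is a minimal sketch, not the authors' implementation, of using a pre-trained Distil HuBERT backbone as a feature extractor for SER. It assumes the publicly released ntu-spml/distilhubert checkpoint on the Hugging Face Hub; the EmotionClassifier head, the 8-class label set (matching the eight RAVDESS emotion categories), and the mean-pooling strategy are illustrative assumptions, and the paper's audio-visual fusion is omitted.

```python
import torch
import torch.nn as nn
from transformers import AutoFeatureExtractor, AutoModel

# Load the distilled backbone; "ntu-spml/distilhubert" is the publicly
# released Distil HuBERT checkpoint on the Hugging Face Hub.
feature_extractor = AutoFeatureExtractor.from_pretrained("ntu-spml/distilhubert")
backbone = AutoModel.from_pretrained("ntu-spml/distilhubert")
backbone.eval()  # inference mode; a real pipeline would freeze or fine-tune these weights

class EmotionClassifier(nn.Module):
    """Hypothetical head: mean-pool frame embeddings, then a linear layer."""
    def __init__(self, hidden_size: int, num_emotions: int = 8):
        super().__init__()
        self.head = nn.Linear(hidden_size, num_emotions)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        pooled = hidden_states.mean(dim=1)  # (batch, frames, dim) -> (batch, dim)
        return self.head(pooled)

classifier = EmotionClassifier(backbone.config.hidden_size)  # hidden_size = 768

# One second of dummy 16 kHz audio stands in for a RAVDESS/BAVED utterance.
waveform = torch.randn(16000)
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000,
                           return_tensors="pt")
with torch.no_grad():
    frames = backbone(**inputs).last_hidden_state  # (1, ~49 frames, 768)
logits = classifier(frames)                        # (1, num_emotions)
print(logits.softmax(dim=-1))
```

The real-time figures quoted above would come from feeding the same backbone successive short audio chunks; the latency and noise effects mentioned in the abstract arise in that streaming setting, which this offline sketch does not model.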

Data Availability

All data generated or analysed during this study are included in this published article (and its supplementary information files).

Acknowledgments

We would like to express our heartfelt gratitude to the referees who generously dedicated their time and expertise to review and provide valuable feedback on the manuscript. Their insightful comments and constructive suggestions have significantly contributed to improving the quality and clarity of the paper.

We are truly grateful for their thorough evaluation, which helped identify areas for improvement, refine the methodology, and enhance the overall coherence of the research. Their expertise and attention to detail have undoubtedly elevated the scholarly contribution of this work.

Author information

Corresponding author

Correspondence to Karim Dabbabi.

Ethics declarations

The authors declare no conflict of interest.

Additional information

Karim Dabbabi received his doctorate in electronics from the Faculty of Sciences of Tunis in July 2019. Prior to that, he obtained a Research Master’s degree in automatic and signal processing from the National School of Engineers of Tunis and a Professional Master’s degree in embedded electronics from the Higher School of Sciences and Technology of Hammam Sousse. In 2010, he earned his first university diploma in biomedical engineering from the Higher Institute of Medical Technologies of Tunis. Dr. Dabbabi’s research focuses on four main axes: automatic speech recognition, natural language processing (NLP), computer vision (CV), and biomedical engineering. In these thematic areas, he has explored various machine learning and deep learning algorithms, as well as transformer-based models and large language models (LLMs). His work encompasses supervised, semi-supervised, unsupervised, and active learning approaches. Dr. Dabbabi has actively contributed to the advancement of these fields through his research and has published extensively in reputable conferences and journals. He is a member of several professional societies and has received recognition for his contributions to the field, including awards for his outstanding research achievements.

Abdelkarim Mars obtained his doctorate from the University of Grenoble, specializing in speech and natural language processing (NLP), with a focus on Arabic NLP and large language models (LLMs). He serves as an Assistant Professor, actively researching speech recognition and NLP, with a particular focus on multilingual applications in Arabic. Dr. Mars is known for his work on optimizing LLMs for cross-linguistic contexts. His publications span prestigious journals and conferences, where he has contributed significant advancements in language technologies. He is a prominent member of various professional networks, driving innovation in his field.


About this article


Cite this article

Dabbabi, K., Mars, A. Self-supervised Learning for Speech Emotion Recognition Task Using Audio-visual Features and Distil Hubert Model on BAVED and RAVDESS Databases. J. Syst. Sci. Syst. Eng. (2024). https://doi.org/10.1007/s11518-024-5607-y

