
Speech emotion recognition by using complex MFCC and deep sequential model


Abstract

Speech Emotion Recognition (SER) is one of the front-line research areas. Inferring emotion is difficult for a machine because emotions are subjective and annotation is challenging. Nevertheless, researchers consider SER feasible because speech is quasi-stationary and emotions are declarative finite states. This paper addresses emotion classification using Complex Mel Frequency Cepstral Coefficients (c-MFCC) as the representative feature and a deep sequential model as the classifier. The experimental setup is speaker independent and accommodates marginal variations in the underlying phonemes. Testing has been carried out on the RAVDESS and TESS databases. Conceptually, the proposed model is oriented towards prosody observance. The main contributions of this work are twofold: first, introducing the concept of c-MFCC and investigating it as a robust cue of emotion, thereby achieving a significant improvement in accuracy; second, establishing a correlation between MFCC-based accuracy and Russell's emotional circumplex pattern. According to Russell's 2D circumplex model, emotional signals are combinations of several psychological dimensions, even though they are perceived as discrete categories. The results of this work are obtained from a deep sequential LSTM model. The proposed c-MFCC are found to be more robust to signal framing and more informative in terms of spectral roll-off, and are therefore put forward as the input to the classifier. For the RAVDESS database, the best accuracy achieved is 78.8% for fourteen classes, which improves to 91.6% for gender-integrated eight classes and 98.5% for affect-separated six classes. Although the RAVDESS dataset contains two analogous sentences, the reported results are for the complete dataset, without any phonetic separation of the samples; the proposed method therefore appears to be largely independent of the underlying phonemes. The results are presented and discussed in the form of confusion matrices.
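The full feature-extraction and classifier details are not available in this preview, but the abstract names the two building blocks: cepstral features computed from the complex spectrum (c-MFCC) and a deep sequential LSTM classifier. The sketch below is a minimal, hypothetical Python rendering of such a pipeline; the phase handling inside complex_mfcc (unwrapped STFT phase pushed through the same mel/DCT chain as the magnitude) and the layer sizes in build_classifier are assumptions, not the authors' exact formulation.

```python
# Minimal sketch of a c-MFCC + LSTM pipeline (assumed formulation, not the paper's exact method).
import numpy as np
import librosa
from scipy.fftpack import dct
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense, Dropout

def complex_mfcc(y, sr, n_fft=1024, hop=256, n_mels=40, n_coeff=13):
    """Cepstral coefficients from both magnitude and phase of the complex STFT."""
    spec = librosa.stft(y, n_fft=n_fft, hop_length=hop)           # complex spectrum
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)

    # Standard MFCC-style path: mel-filtered log power spectrum, then DCT.
    mag = np.abs(spec) ** 2
    mag_cep = dct(np.log(mel_fb @ mag + 1e-10), axis=0, norm='ortho')[:n_coeff]

    # Assumption: the "complex" part uses the frequency-unwrapped phase,
    # projected through the same mel filter bank and DCT.
    phase = np.unwrap(np.angle(spec), axis=0)
    ph_cep = dct(mel_fb @ phase, axis=0, norm='ortho')[:n_coeff]

    return np.vstack([mag_cep, ph_cep]).T                          # (frames, 2 * n_coeff)

def build_classifier(n_frames, n_features, n_classes):
    """Deep sequential LSTM classifier, as named in the abstract (layer sizes assumed)."""
    model = Sequential([
        Input(shape=(n_frames, n_features)),
        LSTM(128, return_sequences=True),
        LSTM(64),
        Dropout(0.3),
        Dense(n_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```

A typical use would extract complex_mfcc frames per utterance, pad or truncate them to a fixed number of frames, and train build_classifier on sequences labelled with the RAVDESS/TESS emotion categories (for example, the eight gender-integrated classes mentioned above).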



Data availability

The databases analyzed in this work (RAVDESS and TESS) are publicly available.

Code availability

Software application.

References

  1. Abdel-Hamid O, Mohamed A-r, Jiang H, Deng L, Penn G, Yu D (2014) Convolutional neural networks for speech recognition. IEEE/ACM Trans Audio Speech Language Process 22(10):1533–1545

  2. Alsteris LD, Paliwal KK (2007) Short-time phase spectrum in speech processing: A review and some experimental results. Digital Signal Process 17(3):578–616 ISSN 1051-2004

  3. Alsteris LD, Paliwal KK (2006) Further intelligibility results from human listening tests using the short-time phase spectrum. Speech Communication 48(6):727–736

  4. Anagnostopoulos C-N, Iliou T, Giannoukos I (2015) Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011. Artif Intell Rev 43(2):155–177

  5. Attabi Y, Dumouchel P (2013) Anchor models for emotion recognition from speech. IEEE Trans Affective Comput 4(3):280–290

  6. El Ayadi M, Kamel MS, Karray F (2011) Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recognition 44(3):572–587

  7. Burkhardt F, Paeschke A, Rolfes M, Sendlmeier WF, Weiss B (2005) A database of German emotional speech. Interspeech 5:1517–1520

  8. de Pinto MG, Polignano M, Lops P, Semeraro G (2020) Emotions Understanding Model from Spoken Language using Deep Neural Networks and Mel-Frequency Cepstral Coefficients, IEEE Conference on Evolving and Adaptive Intelligent Systems (EAIS)

  9. Rabiner LR, Schafer RW (1978) Digital processing of speech signals, 1st edn. Prentice-Hall

  10. Er MB (2020) A novel approach for classification of speech emotions based on deep and acoustic features. IEEE Access 8:221640–221653. https://doi.org/10.1109/ACCESS.2020.3043201

  11. Gaich A, Mowlaee P (2015) On speech quality estimation on phase-aware single-channel speech enhancement. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane (Australia), pp 216–220

  12. Gao Y et al (2017) Speech emotion recognition using local and global features. In: Brain Informatics, Beijing, China

  13. Ghaleb E, Popa M, Asteriadis S (2019) Multimodal and temporal perception of audio-visual cues for emotion recognition. In: 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII), Cambridge, United Kingdom, pp 552–558

  14. Golik P, Tüske Z, Schlüter R, Ney H (2015) Convolutional neural networks for acoustic modeling of raw time signal in LVCSR. In: 16th Annual Conference of the International Speech Communication Association (Interspeech)

  15. Han K, Yu D, Tashev I (2014) “Speech emotion recognition using deep neural network and extreme learning machine,” in Proceedings of the Annual Conference of the International Speech Communication Association. [Online]. Available: https://www.microsoft.com/en-us/research/publication/speech-emotion-recognition-using-deep-neural-network-and-extreme-learning-machine/

  16. Hinton G, … Kingsbury B (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process Mag 29(6):82–97

  17. Huang C, Gong W, Fu W, Feng D (2014) A research of speech emotion recognition based on deep belief network and SVM. Mathematical Problems in Engineering 2014:749604, 7 pages

  18. Dupuis K, Pichora-Fuller MK (2010) Toronto emotional speech set (TESS). [Online]. Available: https://tspace.library.utoronto.ca/handle/1807/24487

  19. Kleinschmidt T, Sridharan S, Mason M (2011) The use of phase in complex spectrum subtraction for robust speech recognition. Computer Speech & Language 25(3):585–600. https://doi.org/10.1016/j.csl.2010.09.001

  20. Koutsogiannaki M, Simantiraki O, Degottex G, Stylianou Y (2014) The importance of phase on voice quality assessment, In Proceedings of the 15th Annual Conference of the International Speech Communication Association (INTERSPEECH), Singapore. 1653–1657

  21. Liu Y, Li Y, Yuan Y (2018) A complete canonical correlation analysis for multiview learning. In: 25th IEEE International Conference on Image Processing (ICIP), Athens, pp 3254–3258

  22. Maly A, Mahale PMB (2016) On the importance of harmonic phase modification for improved speech signal reconstruction. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 584–588

  23. McCowan I, Dean D, McLaren M, Vogt R, Sridharan S (2011) The delta-phase spectrum with application to voice activity detection and speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing 19(7):2026–2038

  24. Mower E, Mataric MJ, Narayanan S (2011) A framework for automatic human emotion classification using emotion profiles. IEEE Trans Audio, Speech Language Process 19(5):1057–1070

  25. Muthusamy H, Polat K, Yaacob S (2015) Improved Emotion Recognition Using Gaussian Mixture Model and Extreme Learning Machine in Speech and Glottal Signal. Math Problems Eng:394083. https://doi.org/10.1155/2015/394083

  26. Rabiner LR, Schafer RW (2009) Theory and applications of digital speech processing. Pearson

  27. Rajak R, Mall R (2019) "Emotion recognition from audio, dimensional and discrete categorization using CNNs," TENCON 2019–2019 IEEE Region 10 Conference (TENCON), Kochi, India, pp. 301–305

  28. Shahin I, Nassif AB, Hamsa S (2019) Emotion recognition using hybrid Gaussian mixture model and deep neural network. IEEE Access 7:26777–26787

  29. Gold B, Morgan N (1999) Speech and audio signal processing: processing and perception of speech and music. Wiley

  30. Stolar MN, Lech M, Stolar SJ, Allen NB (2018) Detection of adolescent depression from speech using optimised spectral roll-off parameters. Biomed J Sci Techn Res

  31. Trigeorgis G et al. (2016) "Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network," 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5200–5204, https://doi.org/10.1109/ICASSP.2016.7472669

  32. Trochidis K, Delbé C, Bigand E (2011) Investigation of the relationships between audio features and induced emotions in contemporary Western music

  33. Tzirakis P, Zhang J, Schuller BW (2018) “End-to-end speech emotion recognition using deep neural networks,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP). pp. 5089–5093

  34. Ververidis D, Kotropoulos C (2006) Emotional speech recognition: resources, features, and methods. Speech Communication 48:1162–1181

  35. Wang K, An N, Li BN, Zhang Y, Li L (2015) Speech emotion recognition using Fourier parameters. IEEE Trans Affect Comput 6(1):69–75

  36. Xu C, Cao T, Feng Z, Dong C (2012) “ Multi-modal fusion emotion recognition based on HMM and ANN”. In: Khachidze V., Wang T., Siddiqui S., Liu V., Cappuccio S., Lim A. (eds) Contemporary Research on E-business Technology and Strategy. iCETS . Communications in Computer and Information Science, vol 332. Springer, Berlin, Heidelberg

  37. Zhang S, Zhang S, Huang T, Gao W (2018) Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching. IEEE Trans Multimedia 20(6):1576–1590. [Online]. Available: ieeexplore.ieee.org/abstract/document/8085174/

  38. Rebai I, BenAyed Y, Mahdi W, Lorré J-P (2017) Improving speech recognition using data augmentation and acoustic model fusion. Procedia Comput Sci 112:316–322. https://doi.org/10.1016/j.procs.2017.08.003

  39. Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5):e0196391. https://doi.org/10.1371/journal.pone.0196391

Download references

Author information

Corresponding author

Correspondence to Suprava Patnaik.

Ethics declarations

Conflicts of interest/competing interests

The author has no conflicts of interest or competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Patnaik, S. Speech emotion recognition by using complex MFCC and deep sequential model. Multimed Tools Appl 82, 11897–11922 (2023). https://doi.org/10.1007/s11042-022-13725-y

