Abstract
Speech Emotion Recognition (SER) is an active front-line research area. Inferring emotion is difficult for a machine because emotions are subjective and annotation is challenging. Nevertheless, SER is considered feasible because speech is quasi-stationary and emotions can be modeled as a finite set of discrete states. This paper addresses emotion classification using Complex Mel Frequency Cepstral Coefficients (c-MFCC) as the representative feature and a deep sequential model as the classifier. The experimental setup is speaker independent and accommodates marginal variation in the underlying phonemes. Testing was carried out on the RAVDESS and TESS databases. Conceptually, the proposed model is sensitive to prosody. The main contributions of this work are twofold: first, introducing the concept of c-MFCC and investigating it as a robust cue of emotion, leading to a significant improvement in accuracy; second, establishing a correlation between MFCC-based accuracy and Russell's emotional circumplex pattern. As per Russell's 2D circumplex model, emotional signals are combinations of several psychological dimensions, even though they are perceived as discrete categories. Results of this work are obtained from a deep sequential LSTM model. The proposed c-MFCC are found to be more robust to signal framing and more informative in terms of spectral roll-off, and are therefore put forward as the input to the classifier. For the RAVDESS database the best accuracy achieved is 78.8% for fourteen classes, which improves to 91.6% for gender-integrated eight classes and 98.5% for affect-separated six classes. Although the RAVDESS dataset contains two analogous sentences, the reported results are for the complete dataset, without any phonetic separation of the samples; the proposed method thus appears to be semi-commutative with respect to phonemes. Results obtained from this study are presented and discussed in the form of confusion matrices.
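The abstract does not spell out the exact c-MFCC computation, but the general idea behind a "complex" cepstral feature is to retain the phase term that magnitude-only MFCCs discard. The following is a minimal illustrative sketch of that idea for a single speech frame, using NumPy; the function name, frame length, and coefficient count are assumptions for illustration, not the paper's definitive pipeline (which also involves the mel filter bank).

```python
import numpy as np

def complex_cepstrum_frame(frame, n_coeff=13):
    """Illustrative complex-cepstrum coefficients for one speech frame.

    Unlike a magnitude-only cepstrum, the log of the *complex* spectrum
    (log|X| + j * unwrapped phase) is inverted, so phase information
    survives into the cepstral coefficients.
    """
    windowed = frame * np.hamming(len(frame))          # taper frame edges
    spectrum = np.fft.fft(windowed)
    log_spec = (np.log(np.abs(spectrum) + 1e-10)       # log magnitude
                + 1j * np.unwrap(np.angle(spectrum)))  # unwrapped phase
    cepstrum = np.fft.ifft(log_spec).real              # back to quefrency domain
    return cepstrum[:n_coeff]                          # keep low-quefrency coefficients

# Toy usage: a 25 ms frame at 16 kHz containing a 220 Hz tone
sr = 16000
t = np.arange(int(0.025 * sr)) / sr
frame = np.sin(2 * np.pi * 220 * t)
coeffs = complex_cepstrum_frame(frame)
print(coeffs.shape)  # (13,)
```

In a full SER pipeline, such per-frame vectors would be stacked over time and fed as a sequence to the LSTM classifier described in the paper.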
Data availability
The databases analyzed in this work are publicly available.
Code availability
Software application.
References
Abdel-Hamid O, Mohamed A-r, Jiang H, Deng L, Penn G, Yu D (2014) Convolutional neural networks for speech recognition. IEEE/ACM Trans Audio Speech Lang Process 22(10):1533–1545
Alsteris LD, Paliwal KK (2007) Short-time phase spectrum in speech processing: A review and some experimental results. Digital Signal Process 17(3):578–616 ISSN 1051-2004
Alsteris LD, Paliwal KK (2006) Further intelligibility results from human listening tests using the short-time phase spectrum. Speech Communication 48(6):727–736
Anagnostopoulos C-N, Iliou T, Giannoukos I (2015) Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011. Artif Intell Rev 43(2):155–177
Attabi Y, Dumouchel P (2013) Anchor models for emotion recognition from speech. IEEE Trans Affective Comput 4(3):280–290
Ayadi ME, Kamel MS, Karray F (2011) Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recognit 44(3):572–587
Burkhardt F, Paeschke A, Rolfes M, Sendlmeier WF, Weiss B (2005) A database of German emotional speech. Interspeech 5:1517–1520
de Pinto MG, Polignano M, Lops P, Semeraro G (2020) Emotions Understanding Model from Spoken Language using Deep Neural Networks and Mel-Frequency Cepstral Coefficients, IEEE Conference on Evolving and Adaptive Intelligent Systems (EAIS)
Rabiner LR, Schafer RW (1978) Digital processing of speech signals, 1st edn. Prentice-Hall
Er MB (2020) A novel approach for classification of speech emotions based on deep and acoustic features. IEEE Access 8:221640–221653. https://doi.org/10.1109/ACCESS.2020.3043201
Gaich A, Mowlaee P (2015) On speech quality estimation on phase-aware single-channel speech enhancement. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane (Australia), pp 216–220
Gao Y et al. (2017) Speech emotion recognition using local and global features. Brain Informatics, Beijing, China
Ghaleb E, Popa M, Asteriadis S (2019) Multimodal and temporal perception of audio-visual cues for emotion recognition. 8th International Conference on Affective Computing and Intelligent Interaction (ACII), Cambridge, United Kingdom, pp 552–558
Golik P, Tuske Z, Schluter R, Ney H (2015) Convolutional Neural Networks for Acoustic Modeling of Raw Time Signal in LVCSR, 16th Annual Conference of the International Speech Communication Association
Han K, Yu D, Tashev I (2014) “Speech emotion recognition using deep neural network and extreme learning machine,” in Proceedings of the Annual Conference of the International Speech Communication Association. [Online]. Available: https://www.microsoft.com/en-us/research/publication/speech-emotion-recognition-using-deep-neural-network-and-extreme-learning-machine/
Hinton G, … Kingsbury B (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process Mag 29(6):82–97
Huang C, Gong W, Fu W, Feng D (2014) A research of speech emotion recognition based on deep belief network and SVM. Mathematical Problems in Engineering, vol 2014, Article ID 749604, 7 pages
Dupuis K, Pichora-Fuller MK (2010) Toronto emotional speech set (TESS). [Online]. Available: https://tspace.library.utoronto.ca/handle/1807/24487
Kleinschmidt T, Sridharan S, Mason M (2011) The use of phase in complex spectrum subtraction for robust speech recognition. Comput Speech Lang 25(3):585–600. https://doi.org/10.1016/j.csl.2010.09.001
Koutsogiannaki M, Simantiraki O, Degottex G, Stylianou Y (2014) The importance of phase on voice quality assessment, In Proceedings of the 15th Annual Conference of the International Speech Communication Association (INTERSPEECH), Singapore. 1653–1657
Liu Y, Li Y, Yuan Y (2018) A complete canonical correlation analysis for multiview learning. 25th IEEE International Conference on Image Processing (ICIP), Athens, pp 3254–3258
Maly A, Mahale PMB (2016) On the importance of harmonic phase modification for improved speech signal reconstruction. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 584–588
McCowan I, Dean D, McLaren M, Vogt R, Sridharan S (2011) The delta-phase spectrum with application to voice activity detection and speaker recognition. IEEE Trans Audio Speech Lang Process 19(7):2026–2038
Mower E, Mataric MJ, Narayanan S (2011) A framework for automatic human emotion classification using emotion profiles. IEEE Trans Audio, Speech Language Process 19(5):1057–1070
Muthusamy H, Polat K, Yaacob S (2015) Improved Emotion Recognition Using Gaussian Mixture Model and Extreme Learning Machine in Speech and Glottal Signal. Math Problems Eng:394083. https://doi.org/10.1155/2015/394083
Rabiner LR, Schafer RW (2009) Theory and application of digital speech processing: Pearson
Rajak R, Mall R (2019) "Emotion recognition from audio, dimensional and discrete categorization using CNNs," TENCON 2019–2019 IEEE Region 10 Conference (TENCON), Kochi, India, pp. 301–305
Shahin I, Nassif AB, Hamsa S (2019) Emotion recognition using hybrid Gaussian mixture model and deep neural network. IEEE Access 7:26777–26787
Gold B, Morgan N (1999) Speech and audio signal processing: processing and perception of speech and music. Wiley
Stolar MN, Lech M, Stolar SJ, Allen NB (2018) Detection of adolescent depression from speech using optimised spectral roll-off parameters. Biomed J Sci Techn Res
Trigeorgis G et al. (2016) "Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network," 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5200–5204, https://doi.org/10.1109/ICASSP.2016.7472669
Trochidis K, Delbé C, Bigand E (2011) Investigation of the relationships between audio features and induced emotions in contemporary Western music
Tzirakis P, Zhang J, Schuller BW (2018) “End-to-end speech emotion recognition using deep neural networks,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP). pp. 5089–5093
Ververidis D, Kotropoulos C (2006) Emotional speech recognition: resources, features, and methods. Speech Commun 48:1162–1181
Wang K, An N, Li BN, Zhang Y, Li L (2015) Speech emotion recognition using Fourier parameters. IEEE Trans Affect Comput 6(1):69–75
Xu C, Cao T, Feng Z, Dong C (2012) Multi-modal fusion emotion recognition based on HMM and ANN. In: Khachidze V, Wang T, Siddiqui S, Liu V, Cappuccio S, Lim A (eds) Contemporary Research on E-business Technology and Strategy. iCETS. Communications in Computer and Information Science, vol 332. Springer, Berlin, Heidelberg
Zhang S, Zhang S, Huang T, Gao W (2018) Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching. IEEE Trans Multimedia 20(6):1576–1590. [Online]. Available: ieeexplore.ieee.org/abstract/document/8085174/
Rebai I, BenAyed Y, Mahdi W, Lorré J-P (2017) Improving speech recognition using data augmentation and acoustic model fusion. Procedia Comput Sci 112:316–322. https://doi.org/10.1016/j.procs.2017.08.003
Livingstone SR, Russo FA (2018) The Ryerson audio-visual database of emotional speech and song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5):e0196391. https://doi.org/10.1371/journal.pone.0196391
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflicts of interest/competing interests
The author has no conflicts of interest or competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Patnaik, S. Speech emotion recognition by using complex MFCC and deep sequential model. Multimed Tools Appl 82, 11897–11922 (2023). https://doi.org/10.1007/s11042-022-13725-y
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11042-022-13725-y