Advanced Sequence Learning Approaches for Emotion Recognition Using Speech Signals

  • Chapter
  • In: Intelligent Multimedia Signal Processing for Smart Ecosystems

Abstract

Speech is the most significant and successful mode of human communication, and it can also serve as a channel for human-computer interaction (HCI). Sensor-based recognition of emotions from audio is an emerging field of HCI research, and emotion recognition remains an open problem that is crucial to real-time applications. Recognizing human emotions is difficult when analysing and predicting a person's behaviour from a collection of audio clips. This chapter covers sequence learning for emotion recognition using long short-term memory (LSTM) networks, gated recurrent units (GRUs), and their variants, such as multi-layer (deep) LSTM and bidirectional LSTM networks. The strengths and weaknesses of current LSTM/GRU-based emotion recognition systems are assessed and discussed. We also examine the drawbacks of traditional recurrent neural networks (RNNs) and explain why LSTMs outperform them.
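To make the setting concrete, the following is a minimal sketch of the kind of sequence-learning pipeline the chapter surveys: a stacked bidirectional LSTM classifier over per-frame MFCC features. This is not the authors' actual model; the feature dimension, layer sizes, pooling strategy, and eight-class emotion set are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the chapter's model): a deep
# bidirectional LSTM that maps a sequence of MFCC frames extracted
# from an audio clip to utterance-level emotion scores.
import torch
import torch.nn as nn

class BiLSTMEmotionClassifier(nn.Module):
    def __init__(self, n_mfcc=40, hidden=128, n_layers=2, n_emotions=8):
        super().__init__()
        # Stacked ("deep") bidirectional LSTM over the MFCC frame sequence.
        self.lstm = nn.LSTM(
            input_size=n_mfcc,
            hidden_size=hidden,
            num_layers=n_layers,
            batch_first=True,
            bidirectional=True,
            dropout=0.3,
        )
        # Forward and backward hidden states are concatenated, hence 2*hidden.
        self.classifier = nn.Linear(2 * hidden, n_emotions)

    def forward(self, x):
        # x: (batch, time, n_mfcc) -- one MFCC vector per audio frame.
        outputs, _ = self.lstm(x)
        # Mean-pool over time to obtain one utterance-level embedding.
        utterance = outputs.mean(dim=1)
        return self.classifier(utterance)

# Example: a batch of 4 utterances, 300 frames each, 40 MFCCs per frame.
model = BiLSTMEmotionClassifier()
logits = model(torch.randn(4, 300, 40))  # -> (4, 8) emotion scores
```

Swapping `nn.LSTM` for `nn.GRU` (same constructor arguments) gives the GRU variant discussed in the chapter, which uses fewer gates per cell and no separate cell state, trading some modelling capacity for fewer parameters.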



Acknowledgments

This research was supported by a grant from the National Research Foundation of Korea (NRF), funded by the Korean government through the Ministry of Science and ICT under Grant NRF-2020R1F1A1060659.

Author information

Correspondence to Mustaqeem Khan or Soonil Kwon.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Khan, M., Ishaq, M., Swain, M., Kwon, S. (2023). Advanced Sequence Learning Approaches for Emotion Recognition Using Speech Signals. In: Parah, S.A., Hurrah, N.N., Khan, E. (eds) Intelligent Multimedia Signal Processing for Smart Ecosystems. Springer, Cham. https://doi.org/10.1007/978-3-031-34873-0_13

  • DOI: https://doi.org/10.1007/978-3-031-34873-0_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-34872-3

  • Online ISBN: 978-3-031-34873-0

  • eBook Packages: Computer Science, Computer Science (R0)
