Abstract
Speech is the most significant and successful method of human communication, and it can also serve as a channel for human-computer interaction (HCI). Using sensors to identify emotions from audio is an emerging area of HCI research. Emotion recognition remains a persistent challenge that is crucial to real-time applications, and recognizing human emotions is difficult when analysing and predicting a person's behaviour from a collection of audio clips. This chapter covers sequence learning for emotion recognition using LSTMs, GRUs, and their variants, such as multi-layer (deep) LSTM and bidirectional LSTM networks. The strengths and weaknesses of current LSTM/GRU-based emotion recognition systems are assessed and discussed. We also examine the drawbacks of traditional RNNs and why LSTMs outperform them.
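To make the sequence-learning idea concrete, the sketch below implements a single LSTM time step from scratch and runs it over a toy sequence of frame-level speech features. All dimensions, weights, and the random features are illustrative assumptions, not the chapter's actual model; the point is to show the gating mechanism (forget, input, output gates plus a cell state) that lets LSTMs retain long-range context where plain RNNs suffer from vanishing gradients.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    The forget (f), input (i), and output (o) gates and the candidate
    update (g) are computed from the current input x and the previous
    hidden state h_prev. The cell state c carries long-range
    information through the sequence, which is what distinguishes an
    LSTM from a plain RNN.
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b          # stacked pre-activations, shape (4H,)
    f = sigmoid(z[0:H])                 # forget gate
    i = sigmoid(z[H:2 * H])             # input gate
    o = sigmoid(z[2 * H:3 * H])         # output gate
    g = np.tanh(z[3 * H:4 * H])         # candidate cell update
    c = f * c_prev + i * g              # new cell state
    h = o * np.tanh(c)                  # new hidden state
    return h, c

# Toy run over a sequence of hypothetical frame-level speech features
# (e.g. 13-dimensional MFCC vectors); sizes are illustrative only.
rng = np.random.default_rng(0)
D, H, T = 13, 8, 20                     # feature dim, hidden dim, frames
W = rng.normal(0, 0.1, (4 * H, D))
U = rng.normal(0, 0.1, (4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for t in range(T):
    x_t = rng.normal(size=D)
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape)                          # final hidden state summarizing the utterance
```

In a full emotion recognition system, the final hidden state (or the sequence of hidden states, in the bidirectional case) would feed a classifier over emotion labels; stacking several such layers yields the deep LSTM variants discussed in this chapter.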
Acknowledgments
This research was made possible thanks to a grant from the National Research Foundation of Korea, which was sponsored by the Korean government through the Ministry of Science and ICT under Grant NRF-2020R1F1A1060659.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this chapter
Khan, M., Ishaq, M., Swain, M., Kwon, S. (2023). Advanced Sequence Learning Approaches for Emotion Recognition Using Speech Signals. In: Parah, S.A., Hurrah, N.N., Khan, E. (eds) Intelligent Multimedia Signal Processing for Smart Ecosystems. Springer, Cham. https://doi.org/10.1007/978-3-031-34873-0_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-34872-3
Online ISBN: 978-3-031-34873-0