Advanced Sequence Learning Approaches for Emotion Recognition Using Speech Signals

  • Chapter
  • In: Intelligent Multimedia Signal Processing for Smart Ecosystems

Abstract

Speech is the most significant and successful mode of human communication, and it can also serve as a channel for human-computer interaction (HCI). Sensor-based recognition of emotions from audio is an emerging field of HCI research, and emotion recognition remains an open problem that is crucial to real-time applications. Recognizing human emotions is difficult when analysing and predicting a person's behaviour from a collection of audio clips. This chapter covers sequence learning for emotion recognition using long short-term memory (LSTM) networks, gated recurrent units (GRUs), and their variants, such as multi-layer (deep) LSTM and bidirectional LSTM networks. The strengths and weaknesses of current LSTM/GRU-based emotion recognition systems are assessed and discussed. We also examine the drawbacks of traditional recurrent neural networks (RNNs) and explain why LSTMs outperform them.
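To make the setting concrete, the following is a minimal sketch of the kind of sequence-learning pipeline the chapter surveys: a stacked bidirectional LSTM classifier over per-frame MFCC features. This is not the authors' actual model; the feature dimension, layer sizes, pooling strategy, and eight-class emotion set are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the chapter's model): a deep
# bidirectional LSTM that maps a sequence of MFCC frames extracted
# from an audio clip to utterance-level emotion scores.
import torch
import torch.nn as nn

class BiLSTMEmotionClassifier(nn.Module):
    def __init__(self, n_mfcc=40, hidden=128, n_layers=2, n_emotions=8):
        super().__init__()
        # Stacked ("deep") bidirectional LSTM over the MFCC frame sequence.
        self.lstm = nn.LSTM(
            input_size=n_mfcc,
            hidden_size=hidden,
            num_layers=n_layers,
            batch_first=True,
            bidirectional=True,
            dropout=0.3,
        )
        # Forward and backward hidden states are concatenated, hence 2*hidden.
        self.classifier = nn.Linear(2 * hidden, n_emotions)

    def forward(self, x):
        # x: (batch, time, n_mfcc) -- one MFCC vector per audio frame.
        outputs, _ = self.lstm(x)
        # Mean-pool over time to obtain one utterance-level embedding.
        utterance = outputs.mean(dim=1)
        return self.classifier(utterance)

# Example: a batch of 4 utterances, 300 frames each, 40 MFCCs per frame.
model = BiLSTMEmotionClassifier()
logits = model(torch.randn(4, 300, 40))  # -> (4, 8) emotion scores
```

Swapping `nn.LSTM` for `nn.GRU` (same constructor arguments) gives the GRU variant discussed in the chapter, which uses fewer gates per cell and no separate cell state, trading some modelling capacity for fewer parameters.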



Acknowledgments

This research was supported by a grant from the National Research Foundation of Korea (NRF), funded by the Korean government through the Ministry of Science and ICT under Grant NRF-2020R1F1A1060659.

Author information

Correspondence to Mustaqeem Khan or Soonil Kwon.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Khan, M., Ishaq, M., Swain, M., Kwon, S. (2023). Advanced Sequence Learning Approaches for Emotion Recognition Using Speech Signals. In: Parah, S.A., Hurrah, N.N., Khan, E. (eds) Intelligent Multimedia Signal Processing for Smart Ecosystems. Springer, Cham. https://doi.org/10.1007/978-3-031-34873-0_13

  • DOI: https://doi.org/10.1007/978-3-031-34873-0_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-34872-3

  • Online ISBN: 978-3-031-34873-0

  • eBook Packages: Computer Science, Computer Science (R0)
