
TWACapsNet: a capsule network with two-way attention mechanism for speech emotion recognition

  • Application of soft computing
  • Published in: Soft Computing

Abstract

Speech emotion recognition (SER) is a challenging task, and a typical convolutional neural network (CNN) cannot handle speech data well on its own, because a CNN tends to capture local information while ignoring global characteristics. This paper proposes a Capsule Network with a Two-Way Attention Mechanism (TWACapsNet for short) for the SER problem. TWACapsNet accepts spatial and spectral features as inputs, and a convolutional layer and a capsule layer are deployed to process these two types of features along two separate ways. Two attention mechanisms are then designed to enhance the information obtained from the spatial and spectral features, and the outputs of the two ways are finally combined to form the decision. The advantage of TWACapsNet is verified by experiments on multiple SER data sets: the proposed method outperforms widely deployed neural network models on three typical SER data sets, and the combination of the two ways contributes to its higher and more stable performance.
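The two-way design described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the linear `branch` stage stands in for the paper's convolutional and capsule layers, the additive attention is a generic form rather than the paper's specific mechanisms, and all shapes, weights, and names (`spatial`, `spectral`, `w_feat`, `w_attn`) are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def branch(features, w_feat, w_attn):
    """One 'way': a linear stand-in for the conv/capsule stage,
    followed by simple additive attention over the time axis."""
    h = np.tanh(features @ w_feat)           # (T, d) hidden states
    weights = softmax(h @ w_attn, axis=0)    # (T, 1) attention weights
    return (weights * h).sum(axis=0)         # (d,) attended summary

T, f, d, n_classes = 50, 40, 16, 4
spatial  = rng.standard_normal((T, f))   # e.g. frame-level spatial features
spectral = rng.standard_normal((T, f))   # e.g. frame-level spectral features

w1, a1 = rng.standard_normal((f, d)), rng.standard_normal((d, 1))
w2, a2 = rng.standard_normal((f, d)), rng.standard_normal((d, 1))
w_out  = rng.standard_normal((2 * d, n_classes))

# Each feature type is processed by its own way, then the two
# attended summaries are fused for the final decision.
fused = np.concatenate([branch(spatial, w1, a1), branch(spectral, w2, a2)])
probs = softmax(fused @ w_out)           # shape (n_classes,), sums to ~1
```

The key point the sketch captures is that the two ways never share weights: each feature type gets its own transform and its own attention, and only the attended summaries are concatenated before classification.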


Data availability

Enquiries about data availability should be directed to the authors.


Acknowledgements

This work is supported by the National Natural Science Foundation of China (No. 61772023), National Key Research and Development Program of China (No. 2019QY1803), and Fujian Science and Technology Plan Industry-University-Research Cooperation Project (No.2021H6015).

Funding

This work is supported by the National Natural Science Foundation of China (No. 61772023), Fujian Science and Technology Plan Industry-University-Research Cooperation Project (No. 2021H6015), and Fujian Province Social Science Planning General Project (FJ2020B062).

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Kun-Hong Liu or Liyan Chen.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Human and animal rights

This article does not contain any studies with human participants performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wen, XC., Liu, KH., Luo, Y. et al. TWACapsNet: a capsule network with two-way attention mechanism for speech emotion recognition. Soft Comput (2023). https://doi.org/10.1007/s00500-023-08957-5

