Abstract
Speech Emotion Recognition (SER) is a challenging task, and the typical convolutional neural network (CNN) cannot well handle the speech data directly. Because CNN tends to understand local information and ignores the overall characteristics. This paper proposes a Capsule Network with Two-Way Attention MechanismTWACapsNet for short) for the SER problem. TWACapsNet accepts the spatial and spectral features as inputs, and the convolutional layer and the capsule layer are deployed to process these two types of features in two ways separately. After that, two attention mechanisms are designed to enhance the information obtained from the spatial and spectral features. Finally, the results of these two ways are combined to form the final decision. The advantage of TWACapsNet is verified by experiments on multiple SER data sets, and experimental results show that the proposed method outperforms the widely-deployed neural network models on three typical SER data sets. Furthermore, the combination of the two ways contributes to the higher and more stable performance of TWACapsNet.
Similar content being viewed by others
Data availability
Enquiries about data availability should be directed to the authors.
References
Abdel-Hamid L (2020) Egyptian Arabic speech emotion recognition using prosodic, spectral and wavelet features. Speech Commun 122:19–30
Abdel-Hamid O, Mohamed A, Jiang H, Deng L, Penn G, Yu D (2014) Convolutional neural networks for speech recognition. IEEE/ACM Trans Audio, Speech Lang Process 22:1533–1545
Albornoz E, Milone DH, Rufiner HL (2011) Spoken emotion recognition using hierarchical classifiers. Comput Speech Lang 25:556–570
Bakkouri I, Afdel K (2020) Computer-aided diagnosis (cad) system based on multi-layer feature fusion network for skin lesion recognition in dermoscopy images. Multimed Tools Appl 79:20483–20518
Bakkouri I, Afdel K (2022) Mlca2f: Multi-level context attentional feature fusion for Covid-19 lesion segmentation from CT scans. Signal, Image and Video Processing, pp 1–8
Bandela SR, Kumar TK (2021) Unsupervised feature selection and NMF de-noising for robust speech emotion recognition. Appl Acoust 172:107645
Burgan H (2022) Comparison of different ANN (FFBP GRNN RBF) algorithms and multiple linear regression for daily streamflow prediction in kocasu river-turkey. Fresenius Environ Bull 31:4699–4708
Burkhardt F, Paeschke A, Rolfes M, Sendlmeier WF, Weiss B (2005) A database of german emotional speech, In: INTERSPEECH 2005 - Eurospeech, 9th European conference on speech communication and technology, Lisbon, Portugal, 2005
Busso C, Bulut M, Lee CC, Kazemzadeh A, Mower E, Kim S, Chang JN, Lee S, Narayanan SS (2008) Iemocap: Interactive emotional dyadic motion capture database, Springer. pp 335–359
Chaudhari PR, Alex JSR (2016) Selection of features for emotion recognition from speech. Indian J Sci Technol 9:1–5
George ML, Lakshmi NVSSR, Nagarajan SM, Mahapatra RP, Muthukumaran V, Sivaram M (2022) Intelligent recognition system for viewpoint variations on gait and speech using CNN-Capsnet. Int J Intell Comput Cybern 15:363–382
Göçeri E (2020) Capsnet topology to classify tumours from brain images and comparative evaluation. IET Image Process 14:882–889
Gudmalwar AP, Rama Rao CV, Dutta A (2018) Improving the performance of the speaker emotion recognition based on low dimension prosody features vector. Int J Speech Technol 22:521–531
Jackson P, Haq S (2014) Surrey audio-visual expressed emotion (savee) database. University of Surrey, Guildford
Jalal MA, Loweimi E, Moore RK, Hain T (2019) Learning temporal clusters using capsule routing for speech emotion recognition, In: Proceedings of interspeech 2019, ISCA. pp 1701–1705
Li D, Zhou Y, Wang Z, Gao D (2021) Exploiting the potentialities of features for speech emotion recognition. Inf Sci 548:328–343
Liu J, Zhang C, Jiang X (2022) Imbalanced fault diagnosis of rolling bearing using improved MSR-GAN and feature enhancement-driven Capsnet. Mech Syst Signal Process 168:108664
McFee B, Raffel C, Liang D, Ellis D, Mcvicar M, Battenberg E, Nieto O (2015) librosa: audio and music signal analysis in python, pp 18–24
Menghan S, Baochen J, **g Y (2011) Vocal emotion recognition based on HMM and GMM for mandarin speech. IEEE Computer Society, USA, pp 27–30
Mnih V, Heess N, Graves A, Kavukcuoglu K (2014) Recurrent models of visual attention, In: Proceedings of the 27th international conference on neural information processing systems - Volume 2, MIT Press, Cambridge, MA, USA. pp 2204-2212
Mustaqeem Kwon S (2020) MLT-DNet: speech emotion recognition using 1d dilated CNN based on multi-learning trick approach. Expert Syst Appl 167:114177
Nagarajan S, Nettimi SSS, Kumar LS, Nath MK, Kanhe A (2020) Speech emotion recognition using cepstral features extracted with novel triangular filter banks based on bark and erb frequency scales. Digital Signal Process 104:102763
Özseven T (2019) A novel feature selection method for speech emotion recognition. Appl Acoust 146:320–326
Sabour S, Frosst N, Hinton GE (2017) Dynamic routing between capsules. ar**v preprint ar**v:1710.09829
Subhashree R, Rathna G (2016) Speech emotion recognition: performance analysis based on fused algorithms and GMM modelling. Indian J Sci Technol 9:1–18
Sun L, Zou B, Fu S, Chen J, Wang F (2019) Speech emotion recognition based on DNN-decision tree SVM model. Speech Commun 115:29–37
Tao J, Liu F, Zhang M, Jia H (2008) Design of speech corpus for mandarin text to speech, In: The Blizzard Challenge 2008 workshop
Wen X, Ye J, Luo Y, Xu Y, Wang X, Wu C, Liu K (2022) CTL-MTNet: a novel capsnet and transfer learning-based mixed task net for single-corpus and cross-corpus speech emotion recognition. IJCAI 2022. Austria, Vienna, pp 2305–2311
Wen XC, Liu KH, Zhang WM, Jiang K (2021) The application of capsule neural network based cnn for speech emotion recognition, In: 2020 25th international conference on pattern recognition (ICPR), pp 9356–9362. https://doi.org/10.1109/ICPR48806.2021.9412360
Wu X, Cao Y, Lu H, Liu S, Wang D, Wu Z, Liu X, Meng HM (2021) Speech emotion recognition using sequential capsule networks, pp 1–1. https://doi.org/10.1109/TASLP.2021.3120586
Wu X, Liu S, Cao Y, Li X, Yu J, Dai D, Ma X, Hu S, Wu Z, Liu X, Meng H (2019) Speech emotion recognition using capsule networks, In: ICASSP 2019 - 2019 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 6695–6699. https://doi.org/10.1109/ICASSP.2019.8683163
Wöllmer M, Schuller B, Eyben F, Rigoll G (2010) Combining long short-term memory and dynamic Bayesian networks for incremental emotion-sensitive artificial listening. IEEE J Select Topics Signal Process 4:867–881
**e Y, Liang R, Liang Z, Huang C, Schuller B (2019) Speech emotion classification using attention-based LSTM. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) PP, pp 1–1
**e Y, Zhu F, Wang J, Liang R, Zhao L, Tang G (2018) Long-short term memory for emotional recognition with variable length speech, In: 2018 First Asian conference on affective computing and intelligent interaction (ACII Asia), IEEE. pp 1–4
Yang Z, Yang D, Dyer C, He X, Smola A, Hovy E (2016) Hierarchical attention networks for document classification, In: Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, pp 1480–1489
Ye J, Wen X, Wang X, Xu Y, Luo Y, Wu C, Chen L, Liu K (2022) GM-TCNet: Gated multi-scale temporal convolutional network using emotion causality for speech emotion recognition. Speech Commun 145:21–35
Ye J, Wen X, Wei Y, Xu Y, Liu K, Shan H (2023) Temporal modeling matters: a novel temporal emotional modeling approach for speech emotion recognition, In: IEEE international conference on acoustics, speech and signal processing (ICASSP), Rhodes Island, Greece, 2023, pp 1–5
Yeh SL, Lin YS, Lee CC (2019) An interaction-aware attention network for speech emotion recognition in spoken dialogs, In: ICASSP 2019 - 2019 IEEE international conference on acoustics, speech and signal processing (ICASSP)
Zhao J, Mao X, Chen L (2019) Speech emotion recognition using deep 1d & 2d CNN lSTM networks. Biomed Signal Process Control 47:312–323
Acknowledgements
This work is supported by the National Natural Science Foundation of China (No. 61772023), National Key Research and Development Program of China (No. 2019QY1803), and Fujian Science and Technology Plan Industry-University-Research Cooperation Project (No.2021H6015).
Funding
This work is supported by the National Natural Science Foundation of China (No. 61772023), Fujian Science and Technology Plan Industry-University-Research Cooperation Project (No. 2021H6015), and Fujian Province Social Science Planning General Project (FJ2020B062).
Author information
Authors and Affiliations
Corresponding authors
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Human and animal rights
This article does not contain any studies with human participants performed by any of the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wen, XC., Liu, KH., Luo, Y. et al. TWACapsNet: a capsule network with two-way attention mechanism for speech emotion recognition. Soft Comput (2023). https://doi.org/10.1007/s00500-023-08957-5
Accepted:
Published:
DOI: https://doi.org/10.1007/s00500-023-08957-5