TWACapsNet: a capsule network with two-way attention mechanism for speech emotion recognition

Wen, Xin-Cheng; Liu, Kun-Hong; Luo, Yan; Ye, Jiaxin; Chen, Liyan

doi:10.1007/s00500-023-08957-5

Application of soft computing
Published: 17 August 2023

Soft Computing (2023)

Xin-Cheng Wen¹,
Kun-Hong Liu²,
Yan Luo³,
Jiaxin Ye⁴ &
Liyan Chen² (ORCID: orcid.org/0000-0002-1222-8876)

Abstract

Speech emotion recognition (SER) is a challenging task, and a typical convolutional neural network (CNN) cannot handle speech data well directly, because a CNN tends to capture local information and ignore the overall characteristics. This paper proposes a Capsule Network with Two-Way Attention Mechanism (TWACapsNet for short) for the SER problem. TWACapsNet accepts spatial and spectral features as inputs, and a convolutional layer and a capsule layer are deployed to process these two types of features separately, in two ways. After that, two attention mechanisms are designed to enhance the information obtained from the spatial and spectral features. Finally, the results of the two ways are combined to form the final decision. The advantage of TWACapsNet is verified by experiments on multiple SER data sets, and the experimental results show that the proposed method outperforms widely deployed neural network models on three typical SER data sets. Furthermore, the combination of the two ways contributes to the higher and more stable performance of TWACapsNet.
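The two-way idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify layer sizes, the form of the attention mechanisms, or the fusion rule, so everything below is an assumption made for illustration. In particular, the attention is plain softmax pooling over frames with a hypothetical query vector, the capsule step is a single projection followed by the standard capsule squashing nonlinearity (no routing), the fusion is a simple average of the two branches' class scores, and all names (`attention_pool`, `branch_scores`, `fuse`) and dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def squash(v, axis=-1, eps=1e-9):
    # Capsule squashing: keeps the vector direction and maps its norm into [0, 1).
    sq = np.sum(v * v, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * v / np.sqrt(sq + eps)

def attention_pool(feats, query):
    # feats: (T, D) frame-level features; query: (D,) hypothetical learned vector.
    weights = softmax(feats @ query)   # (T,) attention weights over frames
    return weights @ feats             # (D,) attention-weighted summary

def branch_scores(feats, query, class_caps):
    # class_caps: (C, D, K) hypothetical projection to C class capsules of size K.
    pooled = attention_pool(feats, query)
    caps = squash(np.einsum("d,cdk->ck", pooled, class_caps))  # (C, K)
    return np.linalg.norm(caps, axis=-1)  # capsule lengths used as class scores

def fuse(scores_a, scores_b):
    # Late fusion by averaging the two branches (an assumption, not the paper's rule).
    return softmax((scores_a + scores_b) / 2.0)

# Toy usage with random stand-ins for the two feature types.
rng = np.random.default_rng(0)
T, D, C, K = 50, 16, 4, 8                      # frames, feature dim, classes, capsule dim
spatial = rng.normal(size=(T, D))              # stand-in for "spatial" features
spectral = rng.normal(size=(T, D))             # stand-in for "spectral" features
q_a, q_b = rng.normal(size=D), rng.normal(size=D)
w_a, w_b = rng.normal(size=(C, D, K)), rng.normal(size=(C, D, K))

probs = fuse(branch_scores(spatial, q_a, w_a), branch_scores(spectral, q_b, w_b))
predicted_class = int(np.argmax(probs))
```

Here each branch yields one score per emotion class, and the final decision is taken from the fused distribution. A trained model would learn the query vectors and projections rather than sampling them at random.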

Data availability

Enquiries about data availability should be directed to the authors.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (No. 61772023), the National Key Research and Development Program of China (No. 2019QY1803), and the Fujian Science and Technology Plan Industry-University-Research Cooperation Project (No. 2021H6015).

Funding

This work is supported by the National Natural Science Foundation of China (No. 61772023), Fujian Science and Technology Plan Industry-University-Research Cooperation Project (No. 2021H6015), and Fujian Province Social Science Planning General Project (FJ2020B062).

Author information

Kun-Hong Liu and Liyan Chen contributed equally to this paper.

Authors and Affiliations

Department of Computer Science, Harbin Institute of Technology (Shenzhen), Shenzhen, China
Xin-Cheng Wen
School of Film, Xiamen University, Xiamen, China
Kun-Hong Liu & Liyan Chen
School of Software and Microelectronics, Peking University, Beijing, China
Yan Luo
Institute of Science and Technology for Brain-inspired Intelligence, Fudan University, Shanghai, China
Jiaxin Ye

Corresponding authors

Correspondence to Kun-Hong Liu or Liyan Chen.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Human and animal rights

This article does not contain any studies with human participants performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wen, XC., Liu, KH., Luo, Y. et al. TWACapsNet: a capsule network with two-way attention mechanism for speech emotion recognition. Soft Comput (2023). https://doi.org/10.1007/s00500-023-08957-5

Accepted: 16 June 2023
Published: 17 August 2023
DOI: https://doi.org/10.1007/s00500-023-08957-5
