Abstract
Long short-term memory (LSTM) networks have proven effective for modeling sequential data, yet they can struggle to capture long-term temporal dependencies accurately. LSTM plays a central role in speech enhancement by modeling the temporal dependencies of speech signals. This paper introduces a variable-neurons-based LSTM that captures long-term temporal dependencies by progressively reducing the number of neurons across layers without loss of information. Skip connections between nonadjacent layers are added to prevent vanishing gradients, and an attention mechanism in these connections highlights important features and spectral components. The proposed LSTM is inherently causal, making it well suited to real-time processing because it does not rely on future information. Training uses combined acoustic feature sets for improved performance, and the models estimate two time–frequency masks: the ideal ratio mask (IRM) and the ideal binary mask (IBM). Comprehensive evaluation with perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) showed that the proposed LSTM architecture improves speech intelligibility and perceptual quality. Composite measures of residual noise (Cbak) and speech distortion (Csig) further substantiated these gains. The proposed model achieved a 16.21% improvement in STOI and a 0.69 improvement in PESQ on the TIMIT database; on the LibriSpeech database, STOI and PESQ improved by 16.41% and 0.71 over the noisy mixtures. The proposed LSTM architecture outperforms deep neural networks (DNNs) under both stationary and nonstationary background noise. An automatic speech recognition (ASR) system built with the Kaldi toolkit was trained on the enhanced speech to evaluate word error rate (WER); with the proposed LSTM at the front end, WER fell to a notable 15.13% across the different noisy backgrounds.
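The two training targets named in the abstract have standard closed-form definitions over the magnitude spectra of the clean speech and the noise. A minimal NumPy sketch follows; the exponent `beta` and the IBM local criterion `lc_db` are conventional defaults from the masking literature, not values taken from this paper:

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, beta=0.5):
    """IRM: a soft gain in [0, 1] for each time-frequency bin."""
    num = speech_mag ** 2
    return (num / (num + noise_mag ** 2 + 1e-12)) ** beta

def ideal_binary_mask(speech_mag, noise_mag, lc_db=0.0):
    """IBM: 1 where the local SNR exceeds the criterion lc_db, else 0."""
    snr_db = 20.0 * np.log10((speech_mag + 1e-12) / (noise_mag + 1e-12))
    return (snr_db > lc_db).astype(np.float32)

# Applying an estimated mask to the noisy magnitude spectrum yields the
# enhanced estimate: enhanced_mag = mask * noisy_mag
```

In bins dominated by speech the IRM approaches 1 and the IBM fires; in noise-dominated bins both masks suppress the signal, the IBM abruptly and the IRM gradually.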
Data Availability
The LibriSpeech dataset used in this research is available at https://www.openslr.org/12.
Ethics declarations
Ethics Approval and Consent to Participate
This article does not contain any studies on human participants or animals performed by any of the authors.
Competing Interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, J., Saleem, N. & Gunawan, T.S. Towards Efficient Recurrent Architectures: A Deep LSTM Neural Network Applied to Speech Enhancement and Recognition. Cogn Comput 16, 1221–1236 (2024). https://doi.org/10.1007/s12559-024-10288-y