Abstract
Long short-term memory (LSTM) networks have proven effective for modeling sequential data, yet they can struggle to capture long-term temporal dependencies accurately. LSTM plays a central role in speech enhancement by modeling the temporal dependencies of speech signals. This paper introduces a variable-neurons-based LSTM that captures long-term temporal dependencies by progressively reducing the number of neurons across layers without loss of information. Skip connections between nonadjacent layers are added to prevent vanishing gradients, and an attention mechanism in these connections highlights important features and spectral components. The proposed LSTM is inherently causal, making it well suited to real-time processing because it does not rely on future information. Training uses combined acoustic feature sets for improved performance, and the models estimate two time–frequency masks: the ideal ratio mask (IRM) and the ideal binary mask (IBM). Comprehensive evaluation with perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) showed that the proposed LSTM architecture improves speech intelligibility and perceptual quality. Composite measures of residual noise (Cbak) and speech distortion (Csig) further substantiated these gains. The proposed model achieved a 16.21% improvement in STOI and a 0.69 improvement in PESQ on the TIMIT database; on the LibriSpeech database, STOI and PESQ improved by 16.41% and 0.71 over the noisy mixtures. The proposed LSTM architecture outperforms deep neural networks (DNNs) under both stationary and nonstationary background noise. An automatic speech recognition (ASR) system built with the Kaldi toolkit was trained on the enhanced speech to evaluate word error rate (WER); with the proposed LSTM at the front end, WER fell to a notable 15.13% across the different noisy backgrounds.
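The two training targets named in the abstract have standard closed-form definitions over the magnitude spectra of the clean speech and the noise. A minimal NumPy sketch follows; the exponent `beta` and the IBM local criterion `lc_db` are conventional defaults from the masking literature, not values taken from this paper:

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, beta=0.5):
    """IRM: a soft gain in [0, 1] for each time-frequency bin."""
    num = speech_mag ** 2
    return (num / (num + noise_mag ** 2 + 1e-12)) ** beta

def ideal_binary_mask(speech_mag, noise_mag, lc_db=0.0):
    """IBM: 1 where the local SNR exceeds the criterion lc_db, else 0."""
    snr_db = 20.0 * np.log10((speech_mag + 1e-12) / (noise_mag + 1e-12))
    return (snr_db > lc_db).astype(np.float32)

# Applying an estimated mask to the noisy magnitude spectrum yields the
# enhanced estimate: enhanced_mag = mask * noisy_mag
```

In bins dominated by speech the IRM approaches 1 and the IBM fires; in noise-dominated bins both masks suppress the signal, the IBM abruptly and the IRM gradually.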
Data Availability
The LibriSpeech dataset used in this research is available at https://www.openslr.org/12.
Ethics declarations
Ethics Approval and Consent to Participate
This article does not contain any studies on human participants or animals performed by any of the authors.
Competing Interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, J., Saleem, N. & Gunawan, T.S. Towards Efficient Recurrent Architectures: A Deep LSTM Neural Network Applied to Speech Enhancement and Recognition. Cogn Comput 16, 1221–1236 (2024). https://doi.org/10.1007/s12559-024-10288-y