Towards Efficient Recurrent Architectures: A Deep LSTM Neural Network Applied to Speech Enhancement and Recognition

Abstract

Long short-term memory (LSTM) networks have proven effective at modeling sequential data and play a central role in speech enhancement, where they capture the temporal dependencies in speech signals. They may nevertheless struggle to capture long-term temporal dependencies accurately. This paper introduces a variable-neurons-based LSTM that captures long-term temporal dependencies by progressively reducing the neuron representation across layers without loss of information. Skip connections between nonadjacent layers counteract vanishing gradients, and an attention mechanism in these connections highlights important features and spectral components. The architecture is inherently causal, making it well suited to real-time processing without relying on future information. The models are trained on combined acoustic feature sets and estimate two time–frequency masks: the ideal ratio mask (IRM) and the ideal binary mask (IBM). Comprehensive evaluation with perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) shows that the proposed LSTM architecture improves speech intelligibility and perceptual quality, and the composite measures for residual noise distortion (Cbak) and speech distortion (Csig) corroborate these gains. The proposed model improves STOI by 16.21% and PESQ by 0.69 over noisy mixtures on the TIMIT database; on the LibriSpeech database, the improvements are 16.41% (STOI) and 0.71 (PESQ). The architecture outperforms deep neural networks (DNNs) under a range of stationary and nonstationary background noise conditions. Finally, an automatic speech recognition (ASR) system is trained on the enhanced speech using the Kaldi toolkit and evaluated by word error rate (WER); with the proposed LSTM as a front end, WERs are reduced markedly, reaching 15.13% across the different noisy backgrounds.
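The two training targets named in the abstract have standard definitions in the mask-based speech-enhancement literature. The Python sketch below is illustrative rather than the authors' implementation: it computes oracle IRM and IBM masks from time-aligned clean and noise signals using the common textbook formulas, then applies a mask to the noisy spectrogram. The sample rate, STFT window length, IRM exponent, and IBM threshold are assumptions; the paper's exact settings are not given in the abstract.

```python
# Illustrative sketch (not the authors' code) of the two training targets the
# abstract names: the ideal ratio mask (IRM) and the ideal binary mask (IBM).
# Sample rate, STFT window, and the IBM threshold below are assumptions.
import numpy as np
from scipy.signal import stft, istft

FS = 16000       # assumed sample rate (TIMIT and LibriSpeech are 16 kHz)
NPERSEG = 512    # assumed STFT window length
LC_DB = 0.0      # assumed IBM local SNR criterion, in dB

def oracle_masks(clean, noise):
    """Oracle IRM and IBM from time-aligned clean speech and noise."""
    _, _, S = stft(clean, fs=FS, nperseg=NPERSEG)
    _, _, N = stft(noise, fs=FS, nperseg=NPERSEG)
    ps, pn = np.abs(S) ** 2, np.abs(N) ** 2
    irm = np.sqrt(ps / (ps + pn + 1e-12))             # soft mask in [0, 1]
    snr_db = 10.0 * np.log10((ps + 1e-12) / (pn + 1e-12))
    ibm = (snr_db > LC_DB).astype(np.float32)         # hard 0/1 mask
    return irm, ibm

def apply_mask(noisy, mask):
    """Mask the noisy spectrogram and resynthesize a time-domain signal."""
    _, _, Y = stft(noisy, fs=FS, nperseg=NPERSEG)
    _, x_hat = istft(Y * mask, fs=FS, nperseg=NPERSEG)
    return x_hat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(FS) / FS                            # one second of audio
    clean = np.sin(2 * np.pi * 220 * t)               # toy stand-in for speech
    noise = 0.3 * rng.standard_normal(FS)
    irm, ibm = oracle_masks(clean, noise)
    enhanced = apply_mask(clean + noise, irm)
```

In a system like the one described, the LSTM would be trained to predict such masks from noisy acoustic features; at inference, the predicted mask replaces the oracle one before resynthesis, and metrics such as STOI and PESQ are then computed between the clean and enhanced waveforms.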


Data Availability

The LibriSpeech dataset used in this research is available at https://www.openslr.org/12.

Author information

Corresponding author

Correspondence to Nasir Saleem.

Ethics declarations

Ethics Approval and Consent to Participate

This article does not contain any studies on human participants or animals performed by any of the authors.

Competing Interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, J., Saleem, N. & Gunawan, T.S. Towards Efficient Recurrent Architectures: A Deep LSTM Neural Network Applied to Speech Enhancement and Recognition. Cogn Comput 16, 1221–1236 (2024). https://doi.org/10.1007/s12559-024-10288-y
