A Framework Combining Separate and Joint Training for Neural Vocoder-Based Monaural Speech Enhancement

Conference paper, Man-Machine Speech Communication (NCMMSC 2023)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 2006)

Abstract

Conventional single-channel speech enhancement methods have predominantly enhanced the amplitude spectrum while preserving the original noisy phase spectrum, which can introduce speech distortion; meanwhile, the intricacy of complex spectra and raw waveforms makes models that operate on them difficult to train. In this paper, we introduce a novel framework that uses the Mel-spectrogram as an intermediate feature for speech enhancement. It integrates a denoising network and a deep generative network vocoder, allowing the speech waveform to be reconstructed without explicit phase estimation. The denoising network, a recurrent convolutional autoencoder, is trained to map the Mel-spectrogram of noisy speech to that of clean speech, producing an enhanced spectrum. This enhanced spectrum is then fed to a high-fidelity, high-generation-speed vocoder, which synthesizes the enhanced speech waveform. After the two modules are pre-trained separately, they are stacked for joint training. Experimental results show that this approach surpasses conventional models in speech quality. Notably, our method adapts well to both the Chinese dataset CSMSC and the English dataset VoiceBank+DEMAND, underscoring its considerable promise for real-world applications.
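To make the separate-then-joint training scheme concrete, below is a minimal PyTorch-style sketch of the pipeline described above. The layer sizes, module structure, and loss functions are illustrative assumptions rather than the authors' exact configuration, and the vocoder (a high-fidelity neural generator, e.g. HiFi-GAN-style) is only stubbed in comments.

```python
# Minimal sketch of the two-stage pipeline; all hyperparameters here are
# assumptions for illustration, not the paper's actual configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MelDenoiser(nn.Module):
    """Recurrent convolutional autoencoder over Mel-spectrograms (sizes assumed)."""

    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.rnn = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.decoder = nn.Conv1d(2 * hidden, n_mels, kernel_size=5, padding=2)

    def forward(self, noisy_mel: torch.Tensor) -> torch.Tensor:
        # noisy_mel: (batch, n_mels, frames)
        h = self.encoder(noisy_mel)             # (batch, hidden, frames)
        h, _ = self.rnn(h.transpose(1, 2))      # (batch, frames, 2 * hidden)
        return self.decoder(h.transpose(1, 2))  # enhanced Mel, (batch, n_mels, frames)


# Stage 1: pre-train the denoiser on (noisy, clean) Mel-spectrogram pairs.
denoiser = MelDenoiser()
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)
noisy_mel = torch.randn(4, 80, 100)  # dummy batch standing in for real features
clean_mel = torch.randn(4, 80, 100)
opt.zero_grad()
F.l1_loss(denoiser(noisy_mel), clean_mel).backward()
opt.step()

# Stage 2 (sketched): pre-train the vocoder on clean speech separately.
# Stage 3 (sketched): stack the modules and fine-tune jointly end to end.
# vocoder = ...  # e.g. a HiFi-GAN-style generator mapping Mel -> waveform
# enhanced_wav = vocoder(denoiser(noisy_mel))
# joint_loss = F.l1_loss(enhanced_wav, clean_wav)  # plus any adversarial terms
```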


Acknowledgements

I would like to express my deepest gratitude to my supervisor, Fei Wen, for his guidance throughout this project. This research was funded by Scientific and Technological Innovation 2030 under Grant 2021ZD0110900 and the Key Research and Development Program of Jiangsu Province under Grant BE2022059. This work was done in the X-LANCE lab.

Author information

Correspondence to Wenbing Jiang.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Pan, Q., Jiang, W., Zhuo, Q., Yu, K. (2024). A Framework Combining Separate and Joint Training for Neural Vocoder-Based Monaural Speech Enhancement. In: Jia, J., Ling, Z., Chen, X., Li, Y., Zhang, Z. (eds) Man-Machine Speech Communication. NCMMSC 2023. Communications in Computer and Information Science, vol 2006. Springer, Singapore. https://doi.org/10.1007/978-981-97-0601-3_16

  • DOI: https://doi.org/10.1007/978-981-97-0601-3_16

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-0600-6

  • Online ISBN: 978-981-97-0601-3

  • eBook Packages: Computer Science, Computer Science (R0)
