Abstract
Conventional single-channel speech enhancement methods have predominantly focused on enhancing the magnitude spectrum while retaining the noisy phase spectrum, which can introduce speech distortion; at the same time, the intricate nature of complex spectra and waveform features makes direct training difficult. In this paper, we introduce a novel framework that uses the Mel-spectrogram as an intermediate feature for speech enhancement. It combines a denoising network with a deep generative vocoder, allowing the speech waveform to be reconstructed without estimating the phase. The denoising network, a recurrent convolutional autoencoder, is trained to map the Mel-spectrogram of noisy speech to that of clean speech, producing an enhanced spectrum. This enhanced spectrum is then fed to a high-fidelity, fast-generation vocoder, which synthesizes the enhanced speech waveform. After the two modules are pre-trained separately, they are stacked for joint training. Experimental results show that this approach surpasses conventional models in speech quality. Notably, our method adapts well to both the Chinese dataset CSMSC and the English dataset VoiceBank+DEMAND, underscoring its promise for real-world applications.
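The two-stage pipeline described above (denoiser on Mel-spectrograms, then a vocoder that synthesizes the waveform) can be sketched schematically. This is a minimal illustration only: the function names (`denoise_mel`, `vocoder`) and the toy linear/upsampling stand-ins are assumptions for exposition, not the paper's actual recurrent convolutional autoencoder or neural vocoder.

```python
import numpy as np

N_MELS, N_FRAMES, HOP = 80, 100, 256  # illustrative feature dimensions

def denoise_mel(noisy_mel, W):
    # Stand-in for the denoising network: a single per-frame linear map,
    # which in the real system would be trained to match clean Mels.
    return W @ noisy_mel  # shape (N_MELS, N_FRAMES)

def vocoder(mel):
    # Stand-in for the generative vocoder: collapse each frame to one
    # value and upsample it to HOP waveform samples.
    return np.repeat(mel.mean(axis=0), HOP)  # shape (N_FRAMES * HOP,)

# Stage 1 (separate pre-training) would fit W and the vocoder independently;
# Stage 2 (joint training) would fine-tune the stacked pair end to end.
W = np.eye(N_MELS)
noisy_mel = np.random.rand(N_MELS, N_FRAMES)
enhanced_mel = denoise_mel(noisy_mel, W)
waveform = vocoder(enhanced_mel)
assert waveform.shape == (N_FRAMES * HOP,)
```

The key property the sketch shows is that the waveform is reconstructed entirely from the enhanced Mel-spectrogram, so no phase estimate of the noisy signal is ever needed.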
Acknowledgements
I would like to express my deepest gratitude to my supervisor, Fei Wen, for his guidance throughout this project. This research was funded by Scientific and Technological Innovation 2030 under Grant 2021ZD0110900 and the Key Research and Development Program of Jiangsu Province under Grant BE2022059. This work was done at the X-LANCE Lab.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Pan, Q., Jiang, W., Zhuo, Q., Yu, K. (2024). A Framework Combining Separate and Joint Training for Neural Vocoder-Based Monaural Speech Enhancement. In: Jia, J., Ling, Z., Chen, X., Li, Y., Zhang, Z. (eds) Man-Machine Speech Communication. NCMMSC 2023. Communications in Computer and Information Science, vol 2006. Springer, Singapore. https://doi.org/10.1007/978-981-97-0601-3_16
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0600-6
Online ISBN: 978-981-97-0601-3