Abstract
Conventional single-channel speech enhancement methods have predominantly focused on enhancing the magnitude spectrum while retaining the noisy phase spectrum, which can introduce speech distortion; at the same time, the intricate nature of complex spectra and waveform features makes direct training difficult. In this paper, we introduce a novel framework that uses the Mel-spectrogram as an intermediate feature for speech enhancement. It combines a denoising network with a deep generative vocoder, allowing the speech waveform to be reconstructed without estimating the phase. The denoising network, a recurrent convolutional autoencoder, is trained to map the Mel-spectrogram of noisy speech to that of clean speech, producing an enhanced spectrum. This enhanced spectrum is then fed to a high-fidelity, fast-generation vocoder, which synthesizes the enhanced speech waveform. After the two modules are pre-trained separately, they are stacked for joint training. Experimental results show that this approach surpasses conventional models in speech quality. Notably, our method adapts well to both the Chinese dataset CSMSC and the English dataset VoiceBank+DEMAND, underscoring its promise for real-world applications.
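The two-stage pipeline described above (denoiser on Mel-spectrograms, then a vocoder that synthesizes the waveform) can be sketched schematically. This is a minimal illustration only: the function names (`denoise_mel`, `vocoder`) and the toy linear/upsampling stand-ins are assumptions for exposition, not the paper's actual recurrent convolutional autoencoder or neural vocoder.

```python
import numpy as np

N_MELS, N_FRAMES, HOP = 80, 100, 256  # illustrative feature dimensions

def denoise_mel(noisy_mel, W):
    # Stand-in for the denoising network: a single per-frame linear map,
    # which in the real system would be trained to match clean Mels.
    return W @ noisy_mel  # shape (N_MELS, N_FRAMES)

def vocoder(mel):
    # Stand-in for the generative vocoder: collapse each frame to one
    # value and upsample it to HOP waveform samples.
    return np.repeat(mel.mean(axis=0), HOP)  # shape (N_FRAMES * HOP,)

# Stage 1 (separate pre-training) would fit W and the vocoder independently;
# Stage 2 (joint training) would fine-tune the stacked pair end to end.
W = np.eye(N_MELS)
noisy_mel = np.random.rand(N_MELS, N_FRAMES)
enhanced_mel = denoise_mel(noisy_mel, W)
waveform = vocoder(enhanced_mel)
assert waveform.shape == (N_FRAMES * HOP,)
```

The key property the sketch shows is that the waveform is reconstructed entirely from the enhanced Mel-spectrogram, so no phase estimate of the noisy signal is ever needed.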
Acknowledgements
I would like to express my deepest gratitude to my supervisor, Fei Wen, for his guidance throughout this project. This research was funded by Scientific and Technological Innovation 2030 under Grant 2021ZD0110900 and the Key Research and Development Program of Jiangsu Province under Grant BE2022059. This work was done at the X-LANCE Lab.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Pan, Q., Jiang, W., Zhuo, Q., Yu, K. (2024). A Framework Combining Separate and Joint Training for Neural Vocoder-Based Monaural Speech Enhancement. In: Jia, J., Ling, Z., Chen, X., Li, Y., Zhang, Z. (eds) Man-Machine Speech Communication. NCMMSC 2023. Communications in Computer and Information Science, vol 2006. Springer, Singapore. https://doi.org/10.1007/978-981-97-0601-3_16
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0600-6
Online ISBN: 978-981-97-0601-3