MelMAE-VC: Extending Masked Autoencoders to Voice Conversion

  • Conference paper
  • Neural Information Processing (ICONIP 2023)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1965)


Abstract

Voice conversion is a technique that generates speech whose textual content is identical to that of a source utterance and whose timbre resembles that of a reference utterance. This paper proposes MelMAE-VC, a neural network for non-parallel many-to-many voice conversion that uses pre-trained Masked Autoencoders (MAEs) for representation learning. The network consists mainly of transformer layers and contains no recurrent units, aiming for better scalability and parallel computing capability. In the pre-training phase we follow a scheme similar to the image-based MAE: a portion of the input spectrogram is concealed, and the model is trained on a vanilla autoencoding task. The encoder yields a latent representation from the visible subset of the full spectrogram; the decoder then reconstructs the full spectrogram from the representation of the visible patches alone. To perform voice conversion, we adopt the pre-trained encoder to extract preliminary features and use a speaker embedder to control the timbre of synthesized spectrograms. The style transfer decoder can be either a simple autoencoder or a conditional variational autoencoder (CVAE) that mixes timbre and text information from different utterances. The training objective for the voice conversion model is a hybrid loss that combines reconstruction loss, style loss, and stochastic similarity. Results show that our model speeds up and simplifies training and offers better modularity and scalability while achieving performance similar to that of other models.
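
To make the masking scheme and the training objective concrete, the sketch below shows MAE-style random patch masking on a patchified mel-spectrogram, together with one plausible reading of the hybrid loss. This is a minimal illustration rather than the paper's implementation: the function names, the L1 reconstruction and MSE style terms, the KL divergence standing in for "stochastic similarity", and the loss weights are all assumptions.

    import torch
    import torch.nn.functional as F

    def random_mask_patches(patches: torch.Tensor, mask_ratio: float = 0.75):
        """MAE-style masking: keep a random subset of spectrogram patches.

        patches: (batch, num_patches, patch_dim), a mel-spectrogram split
        into non-overlapping patches. Returns the visible patches plus the
        indices needed to restore the original patch order in the decoder.
        """
        b, n, d = patches.shape
        n_keep = int(n * (1.0 - mask_ratio))
        noise = torch.rand(b, n, device=patches.device)  # one random score per patch
        shuffle = noise.argsort(dim=1)                   # random permutation of patches
        restore = shuffle.argsort(dim=1)                 # inverse permutation
        keep = shuffle[:, :n_keep]
        visible = patches.gather(1, keep.unsqueeze(-1).expand(-1, -1, d))
        return visible, keep, restore

    def hybrid_vc_loss(recon, target, style_pred, style_ref, mu, logvar,
                       w_style=1.0, w_kl=0.01):
        """One plausible combination of the three terms named in the abstract:
        reconstruction loss, style loss, and a stochastic similarity term
        (here a CVAE KL divergence). Weights are illustrative only."""
        l_recon = F.l1_loss(recon, target)            # spectrogram reconstruction
        l_style = F.mse_loss(style_pred, style_ref)   # match the speaker embedding
        l_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return l_recon + w_style * l_style + w_kl * l_kl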


Notes

  1. In our configuration, this duration is approximately 5.94 s. It is neither too short for the context-based ASR method to process nor too long to exceed the average duration of audio files. One set of parameters consistent with this duration is sketched below.
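
The exact signal-processing settings are not given on this page, but one configuration consistent with the quoted duration is a 22.05 kHz sampling rate, a 512-sample hop, and 256 spectrogram frames, since 256 × 512 / 22050 ≈ 5.94 s. A quick check (all three values are assumptions, not the paper's stated parameters):

    # Hypothetical STFT settings that reproduce the ~5.94 s segment duration;
    # the page does not state the actual values used.
    sample_rate = 22050  # Hz (assumed)
    hop_length = 512     # samples between frames (assumed)
    num_frames = 256     # frames per training segment (assumed)

    duration = num_frames * hop_length / sample_rate
    print(f"{duration:.2f} s")  # -> 5.94 s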


Author information

Correspondence to Yuhao Wang.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Wang, Y., Gu, Y. (2024). MelMAE-VC: Extending Masked Autoencoders to Voice Conversion. In: Luo, B., Cheng, L., Wu, Z.G., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1965. Springer, Singapore. https://doi.org/10.1007/978-981-99-8145-8_37


  • DOI: https://doi.org/10.1007/978-981-99-8145-8_37

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8144-1

  • Online ISBN: 978-981-99-8145-8

  • eBook Packages: Computer Science, Computer Science (R0)
