MelMAE-VC: Extending Masked Autoencoders to Voice Conversion

  • Conference paper
  • Neural Information Processing (ICONIP 2023)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1965)


Abstract

Voice conversion is a technique that generates speech whose textual content is identical to that of a source utterance and whose timbre resembles that of a reference utterance. This paper proposes MelMAE-VC, a neural network for non-parallel many-to-many voice conversion that uses pre-trained Masked Autoencoders (MAEs) for representation learning. The network consists mainly of transformer layers and contains no recurrent units, aiming for better scalability and parallel computing capability. In the pre-training phase we follow a scheme similar to the image-based MAE: a portion of the input spectrogram is concealed, and the model is trained on a vanilla autoencoding task. The encoder yields a latent representation from the visible subset of the full spectrogram; the decoder then reconstructs the full spectrogram from the representation of the visible patches alone. To perform voice conversion, we adopt the pre-trained encoder to extract preliminary features and use a speaker embedder to control the timbre of synthesized spectrograms. The style transfer decoder can be either a simple autoencoder or a conditional variational autoencoder (CVAE) that mixes timbre and text information from different utterances. The training objective for the voice conversion model is a hybrid loss that combines reconstruction loss, style loss, and stochastic similarity. Results show that our model speeds up and simplifies training and offers better modularity and scalability while achieving performance similar to that of other models.
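
To make the masking scheme and the training objective concrete, the sketch below shows MAE-style random patch masking on a patchified mel-spectrogram, together with one plausible reading of the hybrid loss. This is a minimal illustration rather than the paper's implementation: the function names, the L1 reconstruction and MSE style terms, the KL divergence standing in for "stochastic similarity", and the loss weights are all assumptions.

    import torch
    import torch.nn.functional as F

    def random_mask_patches(patches: torch.Tensor, mask_ratio: float = 0.75):
        """MAE-style masking: keep a random subset of spectrogram patches.

        patches: (batch, num_patches, patch_dim), a mel-spectrogram split
        into non-overlapping patches. Returns the visible patches plus the
        indices needed to restore the original patch order in the decoder.
        """
        b, n, d = patches.shape
        n_keep = int(n * (1.0 - mask_ratio))
        noise = torch.rand(b, n, device=patches.device)  # one random score per patch
        shuffle = noise.argsort(dim=1)                   # random permutation of patches
        restore = shuffle.argsort(dim=1)                 # inverse permutation
        keep = shuffle[:, :n_keep]
        visible = patches.gather(1, keep.unsqueeze(-1).expand(-1, -1, d))
        return visible, keep, restore

    def hybrid_vc_loss(recon, target, style_pred, style_ref, mu, logvar,
                       w_style=1.0, w_kl=0.01):
        """One plausible combination of the three terms named in the abstract:
        reconstruction loss, style loss, and a stochastic similarity term
        (here a CVAE KL divergence). Weights are illustrative only."""
        l_recon = F.l1_loss(recon, target)            # spectrogram reconstruction
        l_style = F.mse_loss(style_pred, style_ref)   # match the speaker embedding
        l_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return l_recon + w_style * l_style + w_kl * l_kl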


Notes

  1. In our configuration, this duration is approximately 5.94 s. It is neither too short for the context-based ASR method to process nor too long to exceed the average duration of audio files. One set of parameters consistent with this duration is sketched below.
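
The exact signal-processing settings are not given on this page, but one configuration consistent with the quoted duration is a 22.05 kHz sampling rate, a 512-sample hop, and 256 spectrogram frames, since 256 × 512 / 22050 ≈ 5.94 s. A quick check (all three values are assumptions, not the paper's stated parameters):

    # Hypothetical STFT settings that reproduce the ~5.94 s segment duration;
    # the page does not state the actual values used.
    sample_rate = 22050  # Hz (assumed)
    hop_length = 512     # samples between frames (assumed)
    num_frames = 256     # frames per training segment (assumed)

    duration = num_frames * hop_length / sample_rate
    print(f"{duration:.2f} s")  # -> 5.94 s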


Author information

Correspondence to Yuhao Wang.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Wang, Y., Gu, Y. (2024). MelMAE-VC: Extending Masked Autoencoders to Voice Conversion. In: Luo, B., Cheng, L., Wu, Z.G., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1965. Springer, Singapore. https://doi.org/10.1007/978-981-99-8145-8_37


  • DOI: https://doi.org/10.1007/978-981-99-8145-8_37

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8144-1

  • Online ISBN: 978-981-99-8145-8

  • eBook Packages: Computer Science, Computer Science (R0)
