Abstract
Voice conversion generates speech whose linguistic content matches a source utterance while its timbre resembles a reference utterance. This paper proposes MelMAE-VC, a neural network for non-parallel many-to-many voice conversion that uses pre-trained Masked Autoencoders (MAEs) for representation learning. The network consists mainly of transformer layers, with no recurrent units, for better scalability and parallel computing capability. In the pre-training phase we follow a scheme similar to the image-based MAE: a portion of the input spectrogram is concealed, and the model is trained on a vanilla autoencoding task. The encoder produces a latent representation from the visible subset of spectrogram patches, and the decoder reconstructs the full spectrogram from that representation alone. For voice conversion, we adopt the pre-trained encoder to extract preliminary features and use a speaker embedder to control the timbre of synthesized spectrograms. The style-transfer decoder can be either a simple autoencoder or a conditional variational autoencoder (CVAE) that mixes timbre and content information from different utterances. Training optimizes a hybrid loss that combines reconstruction loss, style loss, and stochastic similarity. Results show that our model speeds up and simplifies training and offers better modularity and scalability while achieving performance comparable to other models.
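To make the pre-training scheme concrete, the following PyTorch sketch masks a random subset of spectrogram patches, encodes only the visible ones, and reconstructs the full spectrogram. This is a minimal illustration of the MAE idea the abstract describes, not the authors' implementation; all names and sizes (`MaskedSpectrogramAE`, `patch_dim`, `mask_ratio`, layer counts) are our assumptions.

```python
import torch
import torch.nn as nn

class MaskedSpectrogramAE(nn.Module):
    """Minimal MAE-style pre-training sketch for mel-spectrograms.

    Hypothetical setup: the spectrogram is split into fixed-size
    patches; a random subset is hidden, and the decoder must
    reconstruct the full spectrogram from the visible patches alone.
    Positional embeddings are omitted for brevity; a real MAE adds them.
    """

    def __init__(self, patch_dim=320, embed_dim=256, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_dim, embed_dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True),
            num_layers=4)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True),
            num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.head = nn.Linear(embed_dim, patch_dim)  # patch reconstruction

    def forward(self, patches):                      # (B, N, patch_dim)
        B, N, _ = patches.shape
        n_keep = int(N * (1.0 - self.mask_ratio))
        # A random per-sample permutation selects the visible subset.
        perm = torch.rand(B, N, device=patches.device).argsort(dim=1)
        keep = perm[:, :n_keep]
        visible = torch.gather(
            patches, 1, keep.unsqueeze(-1).expand(-1, -1, patches.size(-1)))
        # Encode only the visible patches.
        latent = self.encoder(self.patch_embed(visible))
        # Scatter encoded tokens back; masked slots get the mask token.
        full = self.mask_token.expand(B, N, -1).clone()
        full.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, latent.size(-1)),
                      latent)
        recon = self.head(self.decoder(full))        # (B, N, patch_dim)
        return nn.functional.mse_loss(recon, patches)
```

A pre-training step then reduces to `loss = model(patches); loss.backward()`. The abstract's hybrid voice-conversion loss (reconstruction + style + stochastic similarity) and the speaker embedder sit on top of an encoder pre-trained this way; their exact forms are not specified in this excerpt.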
Notes
1. In our configuration, this duration is approximately 5.94 s; it is neither too short for the context-based ASR method to process nor so long that it exceeds the average duration of the audio files.
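For intuition, one hypothetical configuration consistent with this figure (our assumption; the paper's actual sample rate, hop size, and frame count are not given in this excerpt):

```python
# Hypothetical STFT settings (assumptions, not taken from the paper):
sample_rate = 22050   # Hz
hop_length = 512      # samples between consecutive spectrogram frames
num_frames = 256      # spectrogram length in frames

duration = num_frames * hop_length / sample_rate
print(f"{duration:.2f} s")  # -> 5.94 s
```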
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wang, Y., Gu, Y. (2024). MelMAE-VC: Extending Masked Autoencoders to Voice Conversion. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1965. Springer, Singapore. https://doi.org/10.1007/978-981-99-8145-8_37
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8144-1
Online ISBN: 978-981-99-8145-8