MaskMel-Prosody-CycleGAN-VC: High-Quality Cross-Lingual Voice Conversion

  • Conference paper
  • In: Proceedings of 3rd International Conference on Artificial Intelligence, Robotics, and Communication (ICAIRC 2023)

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 1172)


Abstract

Voice conversion aims to change the timbre of the source speaker's voice to that of the target speaker without altering the speech content. Cross-lingual voice conversion must be trained on non-parallel data in two different languages, and the differences in prosody and pronunciation across languages make the task especially challenging. Previous CycleGAN-based voice conversion studies used only a single pipeline for spectrum mapping. We instead train two pipelines based on CycleGAN-VC2, one for spectrum mapping and one for prosody mapping, and decompose F0 (an important prosodic factor) into different time scales by the continuous wavelet transform (CWT), which matches the hierarchical nature of F0 and preserves the speaker's prosodic characteristics. In addition, we use MaskMel features as the input of the spectrum-mapping pipeline to improve the quality of the converted speech, and a self-trained MelGAN vocoder combines the F0 and mel features to synthesize the output waveform. In this way, we achieve high similarity and naturalness in cross-lingual voice conversion. MOS test results show that the proposed MaskMel-Prosody-CycleGAN framework outperforms the CycleGAN-VC2 baseline in our experiments.
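The CWT decomposition of F0 mentioned in the abstract can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: the Mexican-hat mother wavelet, the dyadic scale choice, and the function names (`mexican_hat`, `cwt_f0`) are ours.

```python
import numpy as np

def mexican_hat(t):
    # Ricker (Mexican hat) mother wavelet
    return (2 / (np.sqrt(3) * np.pi ** 0.25)) * (1 - t ** 2) * np.exp(-t ** 2 / 2)

def cwt_f0(logf0, num_scales=10, base_scale=1.0):
    """Decompose a (normalised, interpolated) log-F0 contour into
    dyadic time scales via a continuous wavelet transform."""
    n = len(logf0)
    components = np.zeros((num_scales, n))
    for i in range(num_scales):
        scale = base_scale * 2 ** i
        # truncate the wavelet support to the signal length so that
        # np.convolve(..., mode="same") keeps exactly n samples
        half = min(int(5 * scale), (n - 1) // 2)
        t = np.arange(-half, half + 1) / scale
        kernel = mexican_hat(t) / np.sqrt(scale)
        components[i] = np.convolve(logf0, kernel, mode="same")
    return components

# toy contour: a slow phrase-level trend plus a faster accent pattern
t = np.linspace(0, 1, 400)
logf0 = np.sin(2 * np.pi * 1.5 * t) + 0.3 * np.sin(2 * np.pi * 12 * t)
comps = cwt_f0(logf0)
print(comps.shape)  # (10, 400)
```

Small scales respond to fast, syllable-level F0 movement while large scales capture the slow phrase contour, which is the hierarchical separation the prosody pipeline exploits.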



Author information

Corresponding author

Correspondence to Yanyan Xu.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Yan, S., Chen, S., Xu, Y., Ke, D. (2024). MaskMel-Prosody-CycleGAN-VC: High-Quality Cross-Lingual Voice Conversion. In: Yadav, S., Arya, Y., Pandey, S.M., Gherabi, N., Karras, D.A. (eds) Proceedings of 3rd International Conference on Artificial Intelligence, Robotics, and Communication. ICAIRC 2023. Lecture Notes in Electrical Engineering, vol 1172. Springer, Singapore. https://doi.org/10.1007/978-981-97-2200-6_2


  • DOI: https://doi.org/10.1007/978-981-97-2200-6_2

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-2199-3

  • Online ISBN: 978-981-97-2200-6

  • eBook Packages: Computer Science (R0)
