Abstract
Voice conversion aims to change the timbre of the source speaker to that of the target speaker without changing the speech content. Cross-lingual voice conversion has to be trained on non-parallel data in two different languages, and differences in prosody and pronunciation across languages pose additional challenges. Previous CycleGAN-based voice conversion studies used only a single pipeline for spectrum mapping. We separately train two pipelines based on CycleGAN-VC2 to obtain better spectrum mapping and prosody mapping, and we decompose F0 (an important prosodic factor) into different time scales by the continuous wavelet transform (CWT), so as to better capture the hierarchical nature of F0 and preserve the speaker's prosodic characteristics. In addition, we use MaskMel features as the input of the spectrum-mapping pipeline to improve the quality of the converted speech, and a self-trained MelGAN vocoder re-synthesizes speech from the combined F0 and mel features. In this way, we achieve high similarity and naturalness in cross-lingual voice conversion. MOS test results show that the proposed MaskMel-Prosody-CycleGAN framework outperforms the CycleGAN-VC2 baseline in our experiments.
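The abstract describes decomposing F0 into several time scales with CWT before prosody mapping. A minimal sketch of one common way to do this is given below, assuming PyWavelets, a Mexican-hat wavelet, ten one-octave-spaced scales, and a 5 ms frame shift; these choices are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the paper's code): decompose a log-F0
# contour into multiple temporal scales with a continuous wavelet transform.
import numpy as np
import pywt


def cwt_decompose_f0(f0, num_scales=10, dt=0.005, base_scale=0.005):
    """Decompose an interpolated, normalized log-F0 contour into `num_scales`
    components, each capturing prosodic variation at a different time scale."""
    lf0 = np.log(np.where(f0 > 0, f0, np.nan))
    # Interpolate unvoiced (NaN) frames and z-normalize before the transform.
    idx = np.arange(len(lf0))
    voiced = ~np.isnan(lf0)
    lf0 = np.interp(idx, idx[voiced], lf0[voiced])
    lf0 = (lf0 - lf0.mean()) / (lf0.std() + 1e-8)
    # One-octave spacing between adjacent scales (an assumed design choice).
    scales = base_scale * (2.0 ** np.arange(num_scales)) / dt
    coeffs, _ = pywt.cwt(lf0, scales, wavelet="mexh", sampling_period=dt)
    return coeffs  # shape: (num_scales, num_frames)
```

The per-scale components can then be treated as additional prosodic features for the prosody-mapping pipeline and summed back (after rescaling) to reconstruct an F0 contour for synthesis.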
References
Sisman B, Yamagishi J, King S et al (2021) An overview of voice conversion and its challenges: from statistical modeling to deep learning. IEEE/ACM Trans Audio Speech Lang Process. https://doi.org/10.1109/TASLP.2020.3038524
Liu R et al (2020) WaveTTS: Tacotron-based TTS with joint time-frequency domain loss
Kain AB, Hosom JP, Niu X et al (2007) Improving the intelligibility of dysarthric speech
Nakamura K, Toda T, Saruwatari H et al (2012) Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech. Speech Commun 54(1):134–146. https://doi.org/10.1016/j.specom.2011.07.007
Kaneko T, Kameoka H, Hiramatsu K et al (2017) Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks. In: Interspeech. https://doi.org/10.21437/Interspeech.2017-970
Benisty H, Malah D (2011) Voice conversion using GMM with enhanced global variance. In: INTERSPEECH 2011, 12th annual conference of the International Speech Communication Association, Florence, Italy, August 27–31, 2011
Desai S (2010) Spectral mapping using artificial neural networks for intra-lingual and cross-lingual voice conversion
Mohammadi SH, Kain A (2015) Voice conversion using deep neural networks with speaker-independent pre-training. In: 2014 IEEE Spoken Language Technology Workshop (SLT). IEEE
Oyamada K et al (2016) Non-native speech conversion with consistency-aware recursive network and generative adversarial network. In: 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
Kaneko T, Kameoka H, Hiramatsu K, Kashino K (2017) Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks. In: Proceedings of Interspeech, pp 1283–1287
Nakashika T, Takiguchi T, Ariki Y (2014) High-order sequence modeling using speaker-dependent recurrent temporal restricted Boltzmann machines for voice conversion
Sun L et al (2015) Voice conversion using deep bidirectional long short-term memory based recurrent neural networks. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE
Zhou Y, Tian X, Das RK et al (2020) Many-to-many cross-lingual voice conversion with a jointly trained speaker embedding network. IEEE
Ding S, Zhao G, Gutierrez-Osuna R (2020) Improving the speaker identity of non-parallel many-to-many voice conversion with adversarial speaker recognition. In: Proceedings of INTERSPEECH 2020, pp 776–780
Sun L, Wang H, Kang S et al (2016) Personalized, cross-lingual TTS using phonetic posteriorgrams. https://doi.org/10.21437/Interspeech.2016-1043
Hsu CC, Hwang HT, Wu YC et al (2016) Voice conversion from non-parallel corpora using variational auto-encoder. IEEE
Kaneko T, Kameoka H (2018) CycleGAN-VC: non-parallel voice conversion using cycle-consistent adversarial networks. In: 2018 26th European Signal Processing Conference (EUSIPCO). IEEE
Lee S, Ko BG, Lee K et al (2020) Many-to-many voice conversion using conditional cycle-consistent adversarial networks. https://doi.org/10.1109/ICASSP40776.2020.9053726
Kaneko T, Kameoka H, Tanaka K et al (2019) CycleGAN-VC2: improved CycleGAN-based non-parallel voice conversion. IEEE
Zhou Y, Tian X, Xu H et al (2019) Cross-lingual voice conversion with bilingual phonetic posteriorgram and average modeling. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Zhou Y, Tian X, Yılmaz E, Das RK, Li H (2019) A modularized neural network with language-specific output layers for cross-lingual voice conversion. In: IEEE ASRU, pp 160–167
Zhou Y, Tian X, Li H (2020) Multi-task WaveRNN with an integrated architecture for cross-lingual voice conversion. IEEE Signal Process Lett 27:1310–1314
Sisman B, Zhang M, Dong M, Li H (2019) On the study of generative adversarial networks for cross-lingual voice conversion. In: IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp 144–151. https://doi.org/10.1109/ASRU46091.2019.9003939
Ho TV, Akagi M (2021) Cross-lingual voice conversion with controllable speaker individuality using variational autoencoder and star generative adversarial network. IEEE Access 9:47503–47515. https://doi.org/10.1109/ACCESS.2021.3063519
Morise M, Yokomori F, Ozawa K (2016) WORLD: a vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans Inf Syst 99(7):1877–1884
Kumar K, Kumar R, De Boissiere T et al (2019) MelGAN: generative adversarial networks for conditional waveform synthesis. arXiv preprint arXiv:1910.06711. https://doi.org/10.48550/arXiv.1910.06711
Helander E, Nurminen J, Gabbouj M (2009) Analysis of LSF frame selection in voice conversion
Zhou K, Sisman B, Li H (2020) Transforming spectrum and prosody for emotional voice conversion with non-parallel training data. Speech Commun. https://doi.org/10.1016/j.specom.2020.05.004
Sisman B, Li H (2018) Wavelet analysis of speaker dependent and independent prosody for voice conversion. INTERSPEECH
Vainio M, Suni A, Aalto D (2013) Continuous wavelet transform for analysis of speech prosody
Tokuda K (2017) Flexible speech synthesis based on hidden Markov models. https://doi.org/10.1109/jproc.2013.2251852
Suni A, Aalto D, Raitio T et al (2013) Wavelets for intonation modeling in HMM speech synthesis. In: Speech Synthesis Workshop (SSW8)
Kaneko T, Kameoka H (2017) Parallel-data-free voice conversion using cycle-consistent adversarial networks. arXiv preprint arXiv:1711.11293
Zhao Y, Huang WC, Tian X et al (2020) Voice Conversion Challenge 2020: intra-lingual semi-parallel and cross-lingual voice conversion
King S, Karaiskos V (2010) The Blizzard Challenge 2010
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Yan, S., Chen, S., Xu, Y., Ke, D. (2024). MaskMel-Prosody-CycleGAN-VC: High-Quality Cross-Lingual Voice Conversion. In: Yadav, S., Arya, Y., Pandey, S.M., Gherabi, N., Karras, D.A. (eds) Proceedings of 3rd International Conference on Artificial Intelligence, Robotics, and Communication. ICAIRC 2023. Lecture Notes in Electrical Engineering, vol 1172. Springer, Singapore. https://doi.org/10.1007/978-981-97-2200-6_2
DOI: https://doi.org/10.1007/978-981-97-2200-6_2
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2199-3
Online ISBN: 978-981-97-2200-6