Abstract
Voice conversion aims to change the timbre of the source speaker to that of the target speaker without changing the speech content. Cross-lingual voice conversion has to be trained on non-parallel data in two different languages, and differences in prosody and pronunciation across languages pose additional challenges. Previous CycleGAN-based voice conversion studies used only a single pipeline for spectrum mapping. We separately train two pipelines based on CycleGAN-VC2 to obtain better spectrum mapping and prosody mapping, and we decompose F0 (an important prosodic factor) into different time scales by the continuous wavelet transform (CWT), so as to better capture the hierarchical nature of F0 and preserve the speaker's prosodic characteristics. In addition, we use MaskMel features as the input of the spectrum-mapping pipeline to improve the quality of the converted speech, and a self-trained MelGAN vocoder re-synthesizes speech from the combined F0 and mel features. In this way, we achieve high similarity and naturalness in cross-lingual voice conversion. MOS test results show that the proposed MaskMel-Prosody-CycleGAN framework outperforms the CycleGAN-VC2 baseline in our experiments.
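The abstract describes decomposing F0 into several time scales with CWT before prosody mapping. A minimal sketch of one common way to do this is given below, assuming PyWavelets, a Mexican-hat wavelet, ten one-octave-spaced scales, and a 5 ms frame shift; these choices are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the paper's code): decompose a log-F0
# contour into multiple temporal scales with a continuous wavelet transform.
import numpy as np
import pywt


def cwt_decompose_f0(f0, num_scales=10, dt=0.005, base_scale=0.005):
    """Decompose an interpolated, normalized log-F0 contour into `num_scales`
    components, each capturing prosodic variation at a different time scale."""
    lf0 = np.log(np.where(f0 > 0, f0, np.nan))
    # Interpolate unvoiced (NaN) frames and z-normalize before the transform.
    idx = np.arange(len(lf0))
    voiced = ~np.isnan(lf0)
    lf0 = np.interp(idx, idx[voiced], lf0[voiced])
    lf0 = (lf0 - lf0.mean()) / (lf0.std() + 1e-8)
    # One-octave spacing between adjacent scales (an assumed design choice).
    scales = base_scale * (2.0 ** np.arange(num_scales)) / dt
    coeffs, _ = pywt.cwt(lf0, scales, wavelet="mexh", sampling_period=dt)
    return coeffs  # shape: (num_scales, num_frames)
```

The per-scale components can then be treated as additional prosodic features for the prosody-mapping pipeline and summed back (after rescaling) to reconstruct an F0 contour for synthesis.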
References
Sisman B, Yamagishi J, King S et al (2021) An overview of voice conversion and its challenges: from statistical modeling to deep learning. IEEE/ACM Trans Audio Speech Lang Process. https://doi.org/10.1109/TASLP.2020.3038524
Liu R et al (2020) WaveTTS: Tacotron-based TTS with joint time-frequency domain loss
Kain AB, Hosom JP, Niu X et al (2007) Improving the intelligibility of dysarthric speech
Nakamura K, Toda T, Saruwatari H et al (2012) Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech. Speech Commun 54(1):134–146. https://doi.org/10.1016/j.specom.2011.07.007
Kaneko T, Kameoka H, Hiramatsu K et al (2017) Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks. In: Interspeech. https://doi.org/10.21437/Interspeech.2017-970
Benisty H, Malah D (2011) Voice conversion using GMM with enhanced global variance. In: INTERSPEECH 2011, 12th annual conference of the International Speech Communication Association, Florence, Italy, August 27–31, 2011
Desai S (2010) Spectral mapping using artificial neural networks for intra-lingual and cross-lingual voice conversion
Mohammadi SH, Kain A (2015) Voice conversion using deep neural networks with speaker-independent pre-training. In: 2014 IEEE Spoken Language Technology Workshop (SLT). IEEE
Oyamada K et al (2016) Non-native speech conversion with consistency-aware recursive network and generative adversarial network. In: 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
Kaneko T, Kameoka H, Hiramatsu K, Kashino K (2017) Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks. In: Proceedings of Interspeech, pp 1283–1287
Nakashika T, Takiguchi T, Ariki Y (2014) High-order sequence modeling using speaker-dependent recurrent temporal restricted Boltzmann machines for voice conversion
Sun L et al (2015) Voice conversion using deep bidirectional long short-term memory based recurrent neural networks. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE
Zhou Y, Tian X, Das RK et al (2020) Many-to-many cross-lingual voice conversion with a jointly trained speaker embedding network. IEEE
Ding S, Zhao G, Gutierrez-Osuna R (2020) Improving the speaker identity of non-parallel many-to-many voice conversion with adversarial speaker recognition. In: Proceedings of INTERSPEECH 2020, pp 776–780
Sun L, Wang H, Kang S et al (2016) Personalized, cross-lingual TTS using phonetic posteriorgrams. https://doi.org/10.21437/Interspeech.2016-1043
Hsu CC, Hwang HT, Wu YC et al (2016) Voice conversion from non-parallel corpora using variational auto-encoder. IEEE
Kaneko T, Kameoka H (2018) CycleGAN-VC: non-parallel voice conversion using cycle-consistent adversarial networks. In: 2018 26th European Signal Processing Conference (EUSIPCO). IEEE
Lee S, Ko BG, Lee K et al (2020) Many-to-many voice conversion using conditional cycle-consistent adversarial networks. https://doi.org/10.1109/ICASSP40776.2020.9053726
Kaneko T, Kameoka H, Tanaka K et al (2019) CycleGAN-VC2: improved CycleGAN-based non-parallel voice conversion. IEEE
Zhou Y, Tian X, Xu H et al (2019) Cross-lingual voice conversion with bilingual phonetic posteriorgram and average modeling. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Zhou Y, Tian X, Yılmaz E, Das RK, Li H (2019) A modularized neural network with language-specific output layers for cross-lingual voice conversion. In: IEEE ASRU, pp 160–167
Zhou Y, Tian X, Li H (2020) Multi-task WaveRNN with an integrated architecture for cross-lingual voice conversion. IEEE Signal Process Lett 27:1310–1314
Sisman B, Zhang M, Dong M, Li H (2019) On the study of generative adversarial networks for cross-lingual voice conversion. In: IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp 144–151. https://doi.org/10.1109/ASRU46091.2019.9003939
Ho TV, Akagi M (2021) Cross-lingual voice conversion with controllable speaker individuality using variational autoencoder and star generative adversarial network. IEEE Access 9:47503–47515. https://doi.org/10.1109/ACCESS.2021.3063519
Morise M, Yokomori F, Ozawa K (2016) WORLD: a vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans Inf Syst 99(7):1877–1884
Kumar K, Kumar R, De Boissiere T et al (2019) MelGAN: generative adversarial networks for conditional waveform synthesis. arXiv preprint arXiv:1910.06711. https://doi.org/10.48550/arXiv.1910.06711
Helander E, Nurminen J, Gabbouj M (2009) Analysis of LSF frame selection in voice conversion
Zhou K, Sisman B, Li H (2020) Transforming spectrum and prosody for emotional voice conversion with non-parallel training data. Speech Commun. https://doi.org/10.1016/j.specom.2020.05.004
Sisman B, Li H (2018) Wavelet analysis of speaker dependent and independent prosody for voice conversion. INTERSPEECH
Vainio M, Suni A, Aalto D (2013) Continuous wavelet transform for analysis of speech prosody
Tokuda K (2017) Flexible speech synthesis based on hidden Markov models. https://doi.org/10.1109/jproc.2013.2251852
Suni A, Aalto D, Raitio T et al (2013) Wavelets for intonation modeling in HMM speech synthesis. In: Speech Synthesis Workshop (SSW8)
Kaneko T, Kameoka H (2017) Parallel-data-free voice conversion using cycle-consistent adversarial networks. arXiv preprint arXiv:1711.11293
Zhao Y, Huang WC, Tian X et al (2020) Voice Conversion Challenge 2020: intra-lingual semi-parallel and cross-lingual voice conversion
King S, Karaiskos V (2010) The Blizzard Challenge 2010
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Yan, S., Chen, S., Xu, Y., Ke, D. (2024). MaskMel-Prosody-CycleGAN-VC: High-Quality Cross-Lingual Voice Conversion. In: Yadav, S., Arya, Y., Pandey, S.M., Gherabi, N., Karras, D.A. (eds) Proceedings of 3rd International Conference on Artificial Intelligence, Robotics, and Communication. ICAIRC 2023. Lecture Notes in Electrical Engineering, vol 1172. Springer, Singapore. https://doi.org/10.1007/978-981-97-2200-6_2
DOI: https://doi.org/10.1007/978-981-97-2200-6_2
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2199-3
Online ISBN: 978-981-97-2200-6