Abstract
In this work, we introduce a framework for cross-lingual speech synthesis that combines an upstream Voice Conversion (VC) model with a downstream Text-To-Speech (TTS) model. The proposed framework consists of four stages. In the first two stages, we use a VC model to convert utterances in the target locale to the voice of the target speaker. In the third stage, the converted data is combined with the linguistic features and durations from recordings in the target language, and is then used to train a single-speaker acoustic model. In the final stage, we train a locale-independent vocoder. Our evaluations show that the proposed paradigm outperforms state-of-the-art approaches based on training a large multilingual TTS model. In addition, our experiments demonstrate that our approach is robust across model architectures, languages, speakers, and amounts of data. Moreover, our solution is especially beneficial in low-resource settings.
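The four-stage pipeline described in the abstract can be sketched as the following minimal skeleton. This is an illustrative outline only, not the authors' implementation: every function and variable name here is a hypothetical stand-in, and each stage is a stub that merely records what it would consume and produce.

```python
# Illustrative sketch of the four-stage cross-lingual TTS training pipeline.
# All names are hypothetical; each stage is a stub standing in for real training.

def train_and_apply_vc(multi_speaker_corpus, target_speaker):
    """Stages 1-2: train a voice-conversion model, then convert target-locale
    utterances into the target speaker's voice, yielding synthetic data."""
    return {"stage": "voice_conversion", "output": "converted_utterances"}

def train_acoustic_model(converted_utterances, linguistic_features, durations):
    """Stage 3: combine the converted audio with linguistic features and
    durations from target-language recordings to train a single-speaker
    acoustic model."""
    return {"stage": "acoustic_model"}

def train_vocoder():
    """Stage 4: train a locale-independent vocoder."""
    return {"stage": "vocoder"}

def run_pipeline():
    vc = train_and_apply_vc("multi_speaker_corpus", "target_speaker")
    acoustic = train_acoustic_model(vc["output"], "linguistic_features",
                                    "durations")
    vocoder = train_vocoder()
    return [vc["stage"], acoustic["stage"], vocoder["stage"]]

print(run_pipeline())  # ['voice_conversion', 'acoustic_model', 'vocoder']
```

The key design point the sketch mirrors is the decoupling: the VC stage produces target-speaker data in the target locale first, so the downstream acoustic model can be trained as an ordinary single-speaker model rather than a large multilingual one.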
D. Piotrowski, R. Korzeniowski and A. Falai—Equal contribution.
S. Cygert—Work done while at Amazon.
Notes
1. We use boldface for the highest score per aspect if the gap between the baseline and the proposed system is statistically significant.
2. We use the ISO 639-1 nomenclature to denote locales.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Piotrowski, D. et al. (2024). Cross-Lingual Knowledge Distillation via Flow-Based Voice Conversion for Robust Polyglot Text-to-Speech. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1961. Springer, Singapore. https://doi.org/10.1007/978-981-99-8126-7_20
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8125-0
Online ISBN: 978-981-99-8126-7
eBook Packages: Computer Science (R0)