Cross-Lingual Knowledge Distillation via Flow-Based Voice Conversion for Robust Polyglot Text-to-Speech

  • Conference paper
  • Neural Information Processing (ICONIP 2023)

Abstract

In this work, we introduce a framework for cross-lingual speech synthesis that couples an upstream Voice Conversion (VC) model with a downstream Text-To-Speech (TTS) model. The proposed framework consists of four stages. In the first two stages, we use a VC model to convert utterances in the target locale to the voice of the target speaker. In the third stage, the converted data is combined with the linguistic features and durations from recordings in the target language, and used to train a single-speaker acoustic model. The final stage entails training a locale-independent vocoder. Our evaluations show that the proposed paradigm outperforms state-of-the-art approaches based on training a large multilingual TTS model. In addition, our experiments demonstrate that the approach is robust across model architectures, languages, speakers, and amounts of data, and that it is especially beneficial in low-resource settings.
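To make the data flow between the four stages concrete, below is a minimal Python sketch of the pipeline. It is not the authors' implementation: every name in it (Utterance, FlowVC, the fit/convert methods, and the acoustic_model and vocoder objects) is a hypothetical placeholder, and the real components (a flow-based VC teacher, a single-speaker acoustic model, a universal vocoder) are reduced to stubs.

```python
"""Illustrative sketch of the four-stage pipeline from the abstract.

All classes and functions here are hypothetical placeholders, not the
paper's actual models.
"""
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker_id: str   # speaker of the recording
    locale: str       # ISO 639-1 code, e.g. "de"
    mel: list         # mel-spectrogram frames (placeholder type)
    phonemes: list    # linguistic features extracted from the recording
    durations: list   # per-phoneme durations from the recording


class FlowVC:
    """Hypothetical upstream flow-based voice-conversion teacher."""

    def fit(self, corpus):
        """Stage 1: train the VC model on multi-speaker recordings."""

    def convert(self, utt, target_speaker):
        """Stage 2: re-render the audio in the target speaker's voice.

        The linguistic features and durations of the source recording
        are left untouched; only the speaker identity changes.
        """
        return Utterance(target_speaker, utt.locale, utt.mel,
                         utt.phonemes, utt.durations)


def build_polyglot_voice(target_locale_corpus, multi_locale_corpus,
                         target_speaker, acoustic_model, vocoder):
    """Chain stages 1-4 as described in the abstract."""
    # Stages 1-2: convert target-locale utterances to the target voice.
    vc = FlowVC()
    vc.fit(target_locale_corpus + multi_locale_corpus)
    converted = [vc.convert(u, target_speaker) for u in target_locale_corpus]

    # Stage 3: pair the converted audio with the linguistic features and
    # durations of the original recordings, then train a single-speaker
    # acoustic model on those pairs.
    acoustic_model.fit(
        inputs=[(u.phonemes, u.durations) for u in converted],
        targets=[u.mel for u in converted],
    )

    # Stage 4: train a locale-independent vocoder on data from all
    # locales, so one vocoder can serve any target language.
    vocoder.fit([u.mel for u in multi_locale_corpus] +
                [u.mel for u in converted])
    return acoustic_model, vocoder
```

The one property the sketch is meant to preserve is that the downstream acoustic model stays strictly single-speaker: it only ever sees target-speaker audio produced by the VC teacher, paired with the linguistic features and durations of the original recordings.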

D. Piotrowski, R. Korzeniowski and A. Falai—Equal contribution.

S. Cygert—Work done while at Amazon.

Notes

  1. We use boldface for the highest score per aspect when the gap between the baseline and the proposed system is statistically significant.

  2. We use the ISO 639-1 nomenclature to denote locales (e.g., "en" for English, "fr" for French).

Author information

Corresponding author

Correspondence to Alessio Falai.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Piotrowski, D., et al. (2024). Cross-Lingual Knowledge Distillation via Flow-Based Voice Conversion for Robust Polyglot Text-to-Speech. In: Luo, B., Cheng, L., Wu, Z.G., Li, H., Li, C. (eds.) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol. 1961. Springer, Singapore. https://doi.org/10.1007/978-981-99-8126-7_20

  • DOI: https://doi.org/10.1007/978-981-99-8126-7_20

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8125-0

  • Online ISBN: 978-981-99-8126-7

  • eBook Packages: Computer Science, Computer Science (R0)
