Abstract
In this work, we introduce a framework for cross-lingual speech synthesis that combines an upstream Voice Conversion (VC) model with a downstream Text-To-Speech (TTS) model. The proposed framework consists of four stages. In the first two stages, we use a VC model to convert utterances in the target locale to the voice of the target speaker. In the third stage, the converted data is combined with the linguistic features and durations from recordings in the target language, and is then used to train a single-speaker acoustic model. In the final stage, we train a locale-independent vocoder. Our evaluations show that the proposed paradigm outperforms state-of-the-art approaches based on training a large multilingual TTS model. In addition, our experiments demonstrate that our approach is robust across model architectures, languages, speakers, and amounts of data. Moreover, our solution is especially beneficial in low-resource settings.
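The four-stage pipeline described in the abstract can be sketched as the following minimal skeleton. This is an illustrative outline only, not the authors' implementation: every function and variable name here is a hypothetical stand-in, and each stage is a stub that merely records what it would consume and produce.

```python
# Illustrative sketch of the four-stage cross-lingual TTS training pipeline.
# All names are hypothetical; each stage is a stub standing in for real training.

def train_and_apply_vc(multi_speaker_corpus, target_speaker):
    """Stages 1-2: train a voice-conversion model, then convert target-locale
    utterances into the target speaker's voice, yielding synthetic data."""
    return {"stage": "voice_conversion", "output": "converted_utterances"}

def train_acoustic_model(converted_utterances, linguistic_features, durations):
    """Stage 3: combine the converted audio with linguistic features and
    durations from target-language recordings to train a single-speaker
    acoustic model."""
    return {"stage": "acoustic_model"}

def train_vocoder():
    """Stage 4: train a locale-independent vocoder."""
    return {"stage": "vocoder"}

def run_pipeline():
    vc = train_and_apply_vc("multi_speaker_corpus", "target_speaker")
    acoustic = train_acoustic_model(vc["output"], "linguistic_features",
                                    "durations")
    vocoder = train_vocoder()
    return [vc["stage"], acoustic["stage"], vocoder["stage"]]

print(run_pipeline())  # ['voice_conversion', 'acoustic_model', 'vocoder']
```

The key design point the sketch mirrors is the decoupling: the VC stage produces target-speaker data in the target locale first, so the downstream acoustic model can be trained as an ordinary single-speaker model rather than a large multilingual one.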
D. Piotrowski, R. Korzeniowski and A. Falai—Equal contribution.
S. Cygert—Work done while at Amazon.
Notes
1. We use boldface for the highest score per aspect if the gap between the baseline and the proposed system is statistically significant.
2. We use the ISO 639-1 nomenclature to denote locales.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Piotrowski, D. et al. (2024). Cross-Lingual Knowledge Distillation via Flow-Based Voice Conversion for Robust Polyglot Text-to-Speech. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1961. Springer, Singapore. https://doi.org/10.1007/978-981-99-8126-7_20
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8125-0
Online ISBN: 978-981-99-8126-7
eBook Packages: Computer Science (R0)