Accent-VITS: Accent Transfer for End-to-End TTS

  • Conference paper
Man-Machine Speech Communication (NCMMSC 2023)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 2006)


Abstract

Accent transfer aims to transfer an accent from a source speaker to synthetic speech in the target speaker’s voice. The main challenge is how to effectively disentangle speaker timbre and accent, which are entangled in speech. This paper presents a VITS-based [7] end-to-end accent transfer model named Accent-VITS. Building on the main structure of VITS, Accent-VITS makes substantial improvements to enable effective and stable accent transfer. We leverage a hierarchical CVAE structure to model accent pronunciation information and acoustic features, respectively, using bottleneck features and mel spectrograms as constraints. Moreover, the text-to-wave mapping in VITS is decomposed into text-to-accent and accent-to-wave mappings in Accent-VITS. In this way, the disentanglement of accent and speaker timbre becomes more stable and effective. Experiments on multi-accent and Mandarin datasets show that Accent-VITS achieves higher speaker similarity, accent similarity and speech naturalness compared with a strong baseline (Demos: https://anonymous-accentvits.github.io/AccentVITS/).
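To make the hierarchical structure concrete, the sketch below illustrates the two-level CVAE idea from the abstract in PyTorch: a text-to-accent level whose latent is constrained by accent-bearing bottleneck (BN) features, and an accent-to-wave level whose latent is constrained by the mel spectrogram and conditioned on speaker identity. Every module name, dimension, and layer choice here is an illustrative assumption, not the paper's actual architecture; durations are ignored and all inputs are assumed pre-aligned at frame level.

import torch
import torch.nn as nn

class Reparam(nn.Module):
    """Projects an input to (mu, logvar) and samples z = mu + sigma * eps."""
    def __init__(self, in_dim, z_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, 2 * z_dim)

    def forward(self, x):
        mu, logvar = self.proj(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar

class HierarchicalAccentCVAE(nn.Module):
    """Hypothetical two-level CVAE: text -> accent latent (BN-constrained),
    then accent latent + speaker -> acoustic latent (mel-constrained)."""
    def __init__(self, n_phones=100, n_speakers=10, d=192, bn_dim=256, mel_dim=80):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, d)
        self.bn_enc = nn.Sequential(nn.Linear(bn_dim, d), nn.Tanh())
        self.accent_post = Reparam(d, d)       # q(z_a | BN features)
        self.accent_prior = Reparam(d, d)      # p(z_a | text)
        self.spk_emb = nn.Embedding(n_speakers, d)
        self.mel_enc = nn.Sequential(nn.Linear(mel_dim, d), nn.Tanh())
        self.acoustic_post = Reparam(d, d)     # q(z | mel, speaker)
        self.acoustic_prior = Reparam(d, d)    # p(z | z_a, speaker)
        self.decoder = nn.Linear(d, mel_dim)   # stand-in for the waveform decoder

    def forward(self, phones, bn_feats, mels, spk_id):
        h_text = self.phone_emb(phones)                    # (B, T, d)
        # Level 1 (text-to-accent): accent posterior from BN features,
        # prior predicted from text only.
        z_a, mu_q1, lv_q1 = self.accent_post(self.bn_enc(bn_feats))
        _, mu_p1, lv_p1 = self.accent_prior(h_text)
        # Level 2 (accent-to-wave): acoustic posterior from mel + speaker,
        # prior from the accent latent + speaker (timbre enters only here).
        spk = self.spk_emb(spk_id).unsqueeze(1)            # (B, 1, d)
        z, mu_q2, lv_q2 = self.acoustic_post(self.mel_enc(mels) + spk)
        _, mu_p2, lv_p2 = self.acoustic_prior(z_a + spk)
        recon = self.decoder(z)
        # Training would pair a reconstruction loss with one KL(q || p) term
        # per level, plus VITS-style adversarial and duration losses.
        return recon, (mu_q1, lv_q1, mu_p1, lv_p1), (mu_q2, lv_q2, mu_p2, lv_p2)

Because only the second level sees the speaker embedding, the accent latent z_a is pushed to carry pronunciation rather than timbre, which is the disentanglement the text-to-accent/accent-to-wave decomposition is after.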


Notes

  1. https://www.data-baker.com/open_source.html.

References

  1. Acuna, D., Law, M.T., Zhang, G., Fidler, S.: Domain adversarial training: a game perspective. In: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, 25–29 April 2022. OpenReview.net (2022)

  2. Dai, D., et al.: Cloning one’s voice using very limited data in the wild. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23–27 May 2022, pp. 8322–8326. IEEE (2022)

  3. Desplanques, B., Thienpondt, J., Demuynck, K.: ECAPA-TDNN: emphasized channel attention, propagation and aggregation in TDNN based speaker verification. In: Meng, H., Xu, B., Zheng, T.F. (eds.) Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25–29 October 2020, pp. 3830–3834. ISCA (2020)

  4. Dinh, L., Sohl-Dickstein, J., Bengio, S.: Density estimation using real NVP. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, 24–26 April 2017, Conference Track Proceedings. OpenReview.net (2017)

  5. Ganin, Y., et al.: Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17, 59:1–59:35 (2016)

  6. Goodfellow, I.J., et al.: Generative adversarial nets. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, Montreal, Quebec, Canada, 8–13 December 2014, pp. 2672–2680 (2014)

  7. Kim, J., Kong, J., Son, J.: Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18–24 July 2021, Virtual Event. Proceedings of Machine Learning Research, vol. 139, pp. 5530–5540. PMLR (2021)

  8. Kolluru, B., Wan, V., Latorre, J., Yanagisawa, K., Gales, M.J.F.: Generating multiple-accent pronunciations for TTS using joint sequence model interpolation. In: Li, H., Meng, H.M., Ma, B., Chng, E., Xie, L. (eds.) INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, 14–18 September 2014, pp. 1273–1277. ISCA (2014)

  9. Kong, J., Kim, J., Bae, J.: HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, 6–12 December 2020, virtual (2020)

  10. Lee, S., Kim, S., Lee, J., Song, E., Hwang, M., Lee, S.: HierSpeech: bridging the gap between text and speech by hierarchical variational inference using self-supervised representations for speech synthesis. In: NeurIPS (2022)

  11. Li, J., Deng, L., Gong, Y., Haeb-Umbach, R.: An overview of noise-robust automatic speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 22(4), 745–777 (2014). https://doi.org/10.1109/TASLP.2014.2304637

  12. Liu, R., Sisman, B., Gao, G., Li, H.: Controllable accented text-to-speech synthesis. CoRR abs/2209.10804 (2022). https://doi.org/10.48550/arXiv.2209.10804

  13. Liu, S., Yang, S., Su, D., Yu, D.: Referee: towards reference-free cross-speaker style transfer with low-quality data for expressive speech synthesis. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23–27 May 2022, pp. 6307–6311. IEEE (2022)

  14. Loots, L., Niesler, T.: Automatic conversion between pronunciations of different English accents. Speech Commun. 53(1), 75–84 (2011)

  15. de Mareüil, P.B., Vieru-Dimulescu, B.: The contribution of prosody to the perception of foreign accent. Phonetica 63(4), 247–267 (2006)

  16. Ren, Y., et al.: FastSpeech: fast, robust and controllable text to speech. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, Vancouver, BC, Canada, 8–14 December 2019, pp. 3165–3174 (2019)

  17. Rezende, D.J., Mohamed, S.: Variational inference with normalizing flows. In: Bach, F.R., Blei, D.M. (eds.) Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015. JMLR Workshop and Conference Proceedings, vol. 37, pp. 1530–1538. JMLR.org (2015)

  18. Shu, R., Bui, H.H., Narui, H., Ermon, S.: A DIRT-T approach to unsupervised domain adaptation. In: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018, Conference Track Proceedings. OpenReview.net (2018)

  19. Sun, L., Li, K., Wang, H., Kang, S., Meng, H.M.: Phonetic posteriorgrams for many-to-one voice conversion without parallel data training. In: IEEE International Conference on Multimedia and Expo, ICME 2016, Seattle, WA, USA, 11–15 July 2016, pp. 1–6. IEEE Computer Society (2016)

  20. Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017, pp. 5998–6008 (2017)

  21. Yao, Z., et al.: WeNet: production oriented streaming and non-streaming end-to-end speech recognition toolkit. In: Hermansky, H., Cernocký, H., Burget, L., Lamel, L., Scharenborg, O., Motlícek, P. (eds.) Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August–3 September 2021, pp. 4054–4058. ISCA (2021)

  22. Zhang, B., et al.: WenetSpeech: a 10000+ hours multi-domain Mandarin corpus for speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23–27 May 2022, pp. 6182–6186. IEEE (2022)

  23. Zhang, Y., Cong, J., Xue, H., Xie, L., Zhu, P., Bi, M.: VISinger: variational inference with adversarial learning for end-to-end singing voice synthesis. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23–27 May 2022, pp. 7237–7241. IEEE (2022)

  24. Zhang, Y., Wang, Z., Yang, P., Sun, H., Wang, Z., Xie, L.: AccentSpeech: learning accent from crowd-sourced data for target speaker TTS with accents. In: Lee, K.A., Lee, H., Lu, Y., Dong, M. (eds.) 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022, Singapore, 11–14 December 2022, pp. 76–80. IEEE (2022)

  25. Zhou, X., Zhang, M., Zhou, Y., Wu, Z., Li, H.: Accented text-to-speech synthesis with limited data. CoRR abs/2305.04816 (2023). https://doi.org/10.48550/arXiv.2305.04816


Author information


Corresponding author

Correspondence to Lei Xie.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Ma, L. et al. (2024). Accent-VITS: Accent Transfer for End-to-End TTS. In: Jia, J., Ling, Z., Chen, X., Li, Y., Zhang, Z. (eds) Man-Machine Speech Communication. NCMMSC 2023. Communications in Computer and Information Science, vol 2006. Springer, Singapore. https://doi.org/10.1007/978-981-97-0601-3_17


  • DOI: https://doi.org/10.1007/978-981-97-0601-3_17

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-0600-6

  • Online ISBN: 978-981-97-0601-3

  • eBook Packages: Computer Science; Computer Science (R0)
