Abstract
Accent transfer aims to transfer an accent from a source speaker to synthetic speech in the target speaker’s voice. The main challenge is how to effectively disentangle speaker timbre and accent, which are entangled in speech. This paper presents a VITS-based [7] end-to-end accent transfer model named Accent-VITS. Based on the main structure of VITS, Accent-VITS makes substantial improvements to enable effective and stable accent transfer. We leverage a hierarchical CVAE structure to model accent pronunciation information and acoustic features, respectively, using bottleneck features and mel spectrograms as constraints. Moreover, the text-to-wave mapping in VITS is decomposed into text-to-accent and accent-to-wave mappings in Accent-VITS. In this way, the disentanglement of accent and speaker timbre becomes more stable and effective. Experiments on multi-accent and Mandarin datasets show that Accent-VITS achieves higher speaker similarity, accent similarity and speech naturalness than a strong baseline (Demos: https://anonymous-accentvits.github.io/AccentVITS/).
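The decomposition described in the abstract can be sketched as a two-stage pipeline: a text-to-accent stage that produces an accent latent constrained by bottleneck features, and an accent-to-wave stage that fuses that latent with the target speaker's timbre embedding to produce acoustic features and a waveform. The sketch below is purely illustrative; the module names, dimensions, and random linear projections are assumptions standing in for the paper's trained networks, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(in_dim, out_dim):
    # A random projection standing in for a trained network module.
    W = rng.standard_normal((in_dim, out_dim)) * 0.01
    return lambda x: x @ W

# Stage 1 (text-to-accent): maps phoneme features to an accent latent,
# which the paper constrains with accent-specific bottleneck (BN) features.
text_to_accent = layer(128, 256)      # phoneme embedding -> BN-like latent

# Stage 2 (accent-to-wave): maps the accent latent, fused with a speaker
# embedding, to mel features (constraint in the paper) and then waveform.
accent_to_mel = layer(256 + 64, 80)   # latent + speaker timbre -> mel
mel_to_wave   = layer(80, 240)        # decoder/vocoder stage (upsampling)

T = 50                                              # number of frames
phonemes = rng.standard_normal((T, 128))            # accent-bearing input
speaker  = np.tile(rng.standard_normal(64), (T, 1)) # target speaker timbre

# Accent information is isolated before speaker timbre is injected,
# which is the point of the text-to-accent / accent-to-wave split.
accent_latent = text_to_accent(phonemes)                               # (T, 256)
mel  = accent_to_mel(np.concatenate([accent_latent, speaker], axis=1)) # (T, 80)
wave = mel_to_wave(mel).reshape(-1)                                    # (T*240,)

print(accent_latent.shape, mel.shape, wave.shape)
```

The design point the sketch illustrates is that speaker timbre enters only in the second stage, so the first-stage latent carries accent pronunciation information alone, making the disentanglement explicit in the architecture.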
References
Acuna, D., Law, M.T., Zhang, G., Fidler, S.: Domain adversarial training: a game perspective. In: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, 25–29 April 2022. OpenReview.net (2022)
Dai, D., et al.: Cloning one’s voice using very limited data in the wild. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23–27 May 2022, pp. 8322–8326. IEEE (2022)
Desplanques, B., Thienpondt, J., Demuynck, K.: ECAPA-TDNN: emphasized channel attention, propagation and aggregation in TDNN based speaker verification. In: Meng, H., Xu, B., Zheng, T.F. (eds.) Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25–29 October 2020, pp. 3830–3834. ISCA (2020)
Dinh, L., Sohl-Dickstein, J., Bengio, S.: Density estimation using real NVP. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, 24–26 April 2017, Conference Track Proceedings. OpenReview.net (2017)
Ganin, Y., et al.: Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17, 59:1–59:35 (2016)
Goodfellow, I.J., et al.: Generative adversarial nets. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, Montreal, Quebec, Canada, 8–13 December 2014, pp. 2672–2680 (2014)
Kim, J., Kong, J., Son, J.: Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18–24 July 2021, Virtual Event. Proceedings of Machine Learning Research, vol. 139, pp. 5530–5540. PMLR (2021)
Kolluru, B., Wan, V., Latorre, J., Yanagisawa, K., Gales, M.J.F.: Generating multiple-accent pronunciations for TTS using joint sequence model interpolation. In: Li, H., Meng, H.M., Ma, B., Chng, E., Xie, L. (eds.) INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, 14–18 September 2014, pp. 1273–1277. ISCA (2014)
Kong, J., Kim, J., Bae, J.: HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, 6–12 December 2020, virtual (2020)
Lee, S., Kim, S., Lee, J., Song, E., Hwang, M., Lee, S.: HierSpeech: bridging the gap between text and speech by hierarchical variational inference using self-supervised representations for speech synthesis. In: NeurIPS (2022)
Li, J., Deng, L., Gong, Y., Haeb-Umbach, R.: An overview of noise-robust automatic speech recognition. IEEE ACM Trans. Audio Speech Lang. Process. 22(4), 745–777 (2014). https://doi.org/10.1109/TASLP.2014.2304637
Liu, R., Sisman, B., Gao, G., Li, H.: Controllable accented text-to-speech synthesis. CoRR abs/2209.10804 (2022). https://doi.org/10.48550/arXiv.2209.10804
Liu, S., Yang, S., Su, D., Yu, D.: Referee: towards reference-free cross-speaker style transfer with low-quality data for expressive speech synthesis. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23–27 May 2022, pp. 6307–6311. IEEE (2022)
Loots, L., Niesler, T.: Automatic conversion between pronunciations of different English accents. Speech Commun. 53(1), 75–84 (2011)
de Mareüil, P.B., Vieru-Dimulescu, B.: The contribution of prosody to the perception of foreign accent. Phonetica 63(4), 247–267 (2006)
Ren, Y., et al.: FastSpeech: fast, robust and controllable text to speech. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, Vancouver, BC, Canada, 8–14 December 2019, pp. 3165–3174 (2019)
Rezende, D.J., Mohamed, S.: Variational inference with normalizing flows. In: Bach, F.R., Blei, D.M. (eds.) Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015. JMLR Workshop and Conference Proceedings, vol. 37, pp. 1530–1538. JMLR.org (2015)
Shu, R., Bui, H.H., Narui, H., Ermon, S.: A DIRT-T approach to unsupervised domain adaptation. In: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018, Conference Track Proceedings. OpenReview.net (2018)
Sun, L., Li, K., Wang, H., Kang, S., Meng, H.M.: Phonetic posteriorgrams for many-to-one voice conversion without parallel data training. In: IEEE International Conference on Multimedia and Expo, ICME 2016, Seattle, WA, USA, 11–15 July 2016, pp. 1–6. IEEE Computer Society (2016)
Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017, pp. 5998–6008 (2017)
Yao, Z., et al.: WeNet: production oriented streaming and non-streaming end-to-end speech recognition toolkit. In: Hermansky, H., Cernocký, H., Burget, L., Lamel, L., Scharenborg, O., Motlícek, P. (eds.) Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August–3 September 2021, pp. 4054–4058. ISCA (2021)
Zhang, B., et al.: WenetSpeech: a 10000+ hours multi-domain Mandarin corpus for speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23–27 May 2022, pp. 6182–6186. IEEE (2022)
Zhang, Y., Cong, J., Xue, H., Xie, L., Zhu, P., Bi, M.: VISinger: variational inference with adversarial learning for end-to-end singing voice synthesis. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23–27 May 2022, pp. 7237–7241. IEEE (2022)
Zhang, Y., Wang, Z., Yang, P., Sun, H., Wang, Z., Xie, L.: AccentSpeech: learning accent from crowd-sourced data for target speaker TTS with accents. In: Lee, K.A., Lee, H., Lu, Y., Dong, M. (eds.) 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022, Singapore, 11–14 December 2022, pp. 76–80. IEEE (2022)
Zhou, X., Zhang, M., Zhou, Y., Wu, Z., Li, H.: Accented text-to-speech synthesis with limited data. CoRR abs/2305.04816 (2023). https://doi.org/10.48550/arXiv.2305.04816
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Ma, L. et al. (2024). Accent-VITS: Accent Transfer for End-to-End TTS. In: Jia, J., Ling, Z., Chen, X., Li, Y., Zhang, Z. (eds) Man-Machine Speech Communication. NCMMSC 2023. Communications in Computer and Information Science, vol 2006. Springer, Singapore. https://doi.org/10.1007/978-981-97-0601-3_17
Print ISBN: 978-981-97-0600-6
Online ISBN: 978-981-97-0601-3