Abstract
Accent transfer aims to transfer an accent from a source speaker to synthetic speech in the target speaker’s voice. The main challenge is how to effectively disentangle speaker timbre and accent, which are entangled in speech. This paper presents a VITS-based [7] end-to-end accent transfer model named Accent-VITS. Based on the main structure of VITS, Accent-VITS makes substantial improvements to enable effective and stable accent transfer. We leverage a hierarchical CVAE structure to model accent pronunciation information and acoustic features, respectively, using bottleneck features and mel spectrograms as constraints. Moreover, the text-to-wave mapping in VITS is decomposed into text-to-accent and accent-to-wave mappings in Accent-VITS. In this way, the disentanglement of accent and speaker timbre becomes more stable and effective. Experiments on multi-accent and Mandarin datasets show that Accent-VITS achieves higher speaker similarity, accent similarity and speech naturalness than a strong baseline (Demos: https://anonymous-accentvits.github.io/AccentVITS/).
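The decomposition described in the abstract can be sketched as a two-stage pipeline: a text-to-accent stage that produces an accent latent constrained by bottleneck features, and an accent-to-wave stage that fuses that latent with the target speaker's timbre embedding to produce acoustic features and a waveform. The sketch below is purely illustrative; the module names, dimensions, and random linear projections are assumptions standing in for the paper's trained networks, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(in_dim, out_dim):
    # A random projection standing in for a trained network module.
    W = rng.standard_normal((in_dim, out_dim)) * 0.01
    return lambda x: x @ W

# Stage 1 (text-to-accent): maps phoneme features to an accent latent,
# which the paper constrains with accent-specific bottleneck (BN) features.
text_to_accent = layer(128, 256)      # phoneme embedding -> BN-like latent

# Stage 2 (accent-to-wave): maps the accent latent, fused with a speaker
# embedding, to mel features (constraint in the paper) and then waveform.
accent_to_mel = layer(256 + 64, 80)   # latent + speaker timbre -> mel
mel_to_wave   = layer(80, 240)        # decoder/vocoder stage (upsampling)

T = 50                                              # number of frames
phonemes = rng.standard_normal((T, 128))            # accent-bearing input
speaker  = np.tile(rng.standard_normal(64), (T, 1)) # target speaker timbre

# Accent information is isolated before speaker timbre is injected,
# which is the point of the text-to-accent / accent-to-wave split.
accent_latent = text_to_accent(phonemes)                               # (T, 256)
mel  = accent_to_mel(np.concatenate([accent_latent, speaker], axis=1)) # (T, 80)
wave = mel_to_wave(mel).reshape(-1)                                    # (T*240,)

print(accent_latent.shape, mel.shape, wave.shape)
```

The design point the sketch illustrates is that speaker timbre enters only in the second stage, so the first-stage latent carries accent pronunciation information alone, making the disentanglement explicit in the architecture.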
References
Acuna, D., Law, M.T., Zhang, G., Fidler, S.: Domain adversarial training: a game perspective. In: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, 25–29 April 2022. OpenReview.net (2022)
Dai, D., et al.: Cloning one’s voice using very limited data in the wild. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23–27 May 2022, pp. 8322–8326. IEEE (2022)
Desplanques, B., Thienpondt, J., Demuynck, K.: ECAPA-TDNN: emphasized channel attention, propagation and aggregation in TDNN based speaker verification. In: Meng, H., Xu, B., Zheng, T.F. (eds.) Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25–29 October 2020, pp. 3830–3834. ISCA (2020)
Dinh, L., Sohl-Dickstein, J., Bengio, S.: Density estimation using real NVP. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, 24–26 April 2017, Conference Track Proceedings. OpenReview.net (2017)
Ganin, Y., et al.: Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17, 59:1–59:35 (2016)
Goodfellow, I.J., et al.: Generative adversarial nets. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, Montreal, Quebec, Canada, 8–13 December 2014, pp. 2672–2680 (2014)
Kim, J., Kong, J., Son, J.: Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18–24 July 2021, Virtual Event. Proceedings of Machine Learning Research, vol. 139, pp. 5530–5540. PMLR (2021)
Kolluru, B., Wan, V., Latorre, J., Yanagisawa, K., Gales, M.J.F.: Generating multiple-accent pronunciations for TTS using joint sequence model interpolation. In: Li, H., Meng, H.M., Ma, B., Chng, E., Xie, L. (eds.) INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, 14–18 September 2014, pp. 1273–1277. ISCA (2014)
Kong, J., Kim, J., Bae, J.: HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, 6–12 December 2020, virtual (2020)
Lee, S., Kim, S., Lee, J., Song, E., Hwang, M., Lee, S.: HierSpeech: bridging the gap between text and speech by hierarchical variational inference using self-supervised representations for speech synthesis. In: NeurIPS (2022)
Li, J., Deng, L., Gong, Y., Haeb-Umbach, R.: An overview of noise-robust automatic speech recognition. IEEE ACM Trans. Audio Speech Lang. Process. 22(4), 745–777 (2014). https://doi.org/10.1109/TASLP.2014.2304637
Liu, R., Sisman, B., Gao, G., Li, H.: Controllable accented text-to-speech synthesis. CoRR abs/2209.10804 (2022). https://doi.org/10.48550/arXiv.2209.10804
Liu, S., Yang, S., Su, D., Yu, D.: Referee: towards reference-free cross-speaker style transfer with low-quality data for expressive speech synthesis. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23–27 May 2022, pp. 6307–6311. IEEE (2022)
Loots, L., Niesler, T.: Automatic conversion between pronunciations of different English accents. Speech Commun. 53(1), 75–84 (2011)
de Mareüil, P.B., Vieru-Dimulescu, B.: The contribution of prosody to the perception of foreign accent. Phonetica 63(4), 247–267 (2006)
Ren, Y., et al.: FastSpeech: fast, robust and controllable text to speech. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, Vancouver, BC, Canada, 8–14 December 2019, pp. 3165–3174 (2019)
Rezende, D.J., Mohamed, S.: Variational inference with normalizing flows. In: Bach, F.R., Blei, D.M. (eds.) Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015. JMLR Workshop and Conference Proceedings, vol. 37, pp. 1530–1538. JMLR.org (2015)
Shu, R., Bui, H.H., Narui, H., Ermon, S.: A DIRT-T approach to unsupervised domain adaptation. In: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018, Conference Track Proceedings. OpenReview.net (2018)
Sun, L., Li, K., Wang, H., Kang, S., Meng, H.M.: Phonetic posteriorgrams for many-to-one voice conversion without parallel data training. In: IEEE International Conference on Multimedia and Expo, ICME 2016, Seattle, WA, USA, 11–15 July 2016, pp. 1–6. IEEE Computer Society (2016)
Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017, pp. 5998–6008 (2017)
Yao, Z., et al.: WeNet: production oriented streaming and non-streaming end-to-end speech recognition toolkit. In: Hermansky, H., Cernocký, H., Burget, L., Lamel, L., Scharenborg, O., Motlícek, P. (eds.) Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August–3 September 2021, pp. 4054–4058. ISCA (2021)
Zhang, B., et al.: WenetSpeech: a 10000+ hours multi-domain Mandarin corpus for speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23–27 May 2022, pp. 6182–6186. IEEE (2022)
Zhang, Y., Cong, J., Xue, H., Xie, L., Zhu, P., Bi, M.: VISinger: variational inference with adversarial learning for end-to-end singing voice synthesis. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23–27 May 2022, pp. 7237–7241. IEEE (2022)
Zhang, Y., Wang, Z., Yang, P., Sun, H., Wang, Z., Xie, L.: AccentSpeech: learning accent from crowd-sourced data for target speaker TTS with accents. In: Lee, K.A., Lee, H., Lu, Y., Dong, M. (eds.) 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022, Singapore, 11–14 December 2022, pp. 76–80. IEEE (2022)
Zhou, X., Zhang, M., Zhou, Y., Wu, Z., Li, H.: Accented text-to-speech synthesis with limited data. CoRR abs/2305.04816 (2023). https://doi.org/10.48550/arXiv.2305.04816
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Ma, L. et al. (2024). Accent-VITS: Accent Transfer for End-to-End TTS. In: Jia, J., Ling, Z., Chen, X., Li, Y., Zhang, Z. (eds) Man-Machine Speech Communication. NCMMSC 2023. Communications in Computer and Information Science, vol 2006. Springer, Singapore. https://doi.org/10.1007/978-981-97-0601-3_17
Print ISBN: 978-981-97-0600-6
Online ISBN: 978-981-97-0601-3