Abstract
Pronunciation assessment remains a subjective task that depends on a pronunciation reference held as canonical. Whether a second language (L2) speaker is able to replicate that reference is decided by an assessor who judges the identity of the sounds produced. Assessors are known to be biased by their perception of the speaker, so defining a standard for L2 pronunciation is crucial in formal assessment. In Computer Assisted Pronunciation Assessment (CAPA), defining a pronunciation standard for L2 is not trivial because little L2 data is annotated for mispronunciations. Inspired by the assessor’s bias, this work explores an alternative to the conventional Automatic Speech Recognition approach for CAPA: using speaker metadata along with acoustic observations for mispronunciation detection. A Bidirectional Long Short-Term Memory network combined with self-attention was used to detect pronunciation errors in short speech segments. It was found that the use of categorical metadata can have a positive effect on the classification of mispronounced segments, depending on the sparsity and balance of the classes. It was also found that different assessors can be influenced differently by information about the speaker’s linguistic background. The effect of the metadata was tested on data from Dutch children learning English as an L2 in schools across the Netherlands. The limited speaker diversity of the corpus made the task challenging and worth exploring further.
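As a rough illustration of pairing categorical speaker metadata with acoustic observations, the sketch below one-hot encodes each metadata field and appends the result to every acoustic frame of a segment. This is only one plausible way to combine the two inputs and is an assumption, not the method reported in the paper. The field names (`home_language`, `school_year`), their categories, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence


def one_hot(value: str, categories: Sequence[str]) -> List[float]:
    """One-hot encode `value`; a value outside `categories` maps to all zeros."""
    return [1.0 if value == c else 0.0 for c in categories]


def append_metadata(
    frames: Sequence[Sequence[float]],
    metadata: Dict[str, str],
    vocab: Dict[str, Sequence[str]],
) -> List[List[float]]:
    """Append the same one-hot metadata vector to every acoustic frame."""
    meta_vec: List[float] = []
    # Iterate over fields in sorted order so the layout is deterministic.
    for field in sorted(vocab):
        meta_vec.extend(one_hot(metadata.get(field, ""), vocab[field]))
    return [list(frame) + meta_vec for frame in frames]


# Hypothetical metadata fields and categories, for illustration only.
vocab = {
    "home_language": ["dutch", "other"],
    "school_year": ["y1", "y2", "y3"],
}
# Two toy acoustic frames with two features each.
frames = [[0.1, 0.2], [0.3, 0.4]]
speaker = {"home_language": "other", "school_year": "y2"}

augmented = append_metadata(frames, speaker, vocab)
```

Each augmented frame could then be fed to a sequence classifier. Sparse or unbalanced categories yield one-hot dimensions that are rarely active, which is consistent with the abstract's remark that the benefit depends on the sparsity and balance of the classes.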
Jose Antonio Lopez Saenz is a doctoral student at the University of Sheffield supported by the Programa de Becas en el Extranjero of CONACYT, fellowship number 661687. We also thank ITSLanguage BV for providing the data.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Saenz, J.A.L., Hain, T. (2021). Use of Speaker Metadata for Improving Automatic Pronunciation Assessment. In: Espinosa-Anke, L., Martín-Vide, C., Spasić, I. (eds) Statistical Language and Speech Processing. SLSP 2021. Lecture Notes in Computer Science, vol 13062. Springer, Cham. https://doi.org/10.1007/978-3-030-89579-2_6
DOI: https://doi.org/10.1007/978-3-030-89579-2_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89578-5
Online ISBN: 978-3-030-89579-2
eBook Packages: Computer Science, Computer Science (R0)