Use of Speaker Metadata for Improving Automatic Pronunciation Assessment

Saenz, Jose Antonio Lopez; Hain, Thomas

doi:10.1007/978-3-030-89579-2_6

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13062)

Included in the following conference series:

International Conference on Statistical Language and Speech Processing

Abstract

Pronunciation assessment remains a subjective task that depends on a pronunciation reference held as canonical. Whether a second language (L2) speaker is able to replicate that reference is decided by an assessor who perceives the identity of the sounds produced. Assessors are known to be biased by their perception of the speaker, so defining a standard for L2 pronunciation is crucial in a formal assessment. In Computer Assisted Pronunciation Assessment (CAPA), defining a pronunciation standard for L2 is not trivial because little L2 data is annotated for mispronunciations. Inspired by this assessor bias, this work explores an alternative to the conventional Automatic Speech Recognition (ASR) approach to CAPA, using speaker metadata alongside acoustic observations for mispronunciation detection. A Bidirectional Long Short-Term Memory (BLSTM) network combined with self-attention was used to detect pronunciation errors in short speech segments. It was found that categorical metadata can improve the classification of mispronounced segments, depending on the sparsity and balance of the classes. It was also found that different assessors can be influenced differently by information about the speaker's linguistic background. The effect of the metadata was tested on data from Dutch children learning English as an L2 in schools across the Netherlands. The limited speaker diversity of the corpus made the task challenging and worth further exploration.
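The abstract describes fusing categorical speaker metadata with acoustic observations in a BLSTM-plus-self-attention classifier. The sketch below illustrates only the fusion idea and is not the authors' implementation. It rests on several assumptions: the BLSTM encoder is omitted and its frame-level outputs are replaced by given vectors, self-attention is simplified to single-query dot-product attention pooling, the metadata fields (`school`, `age_group`) and all weights are hypothetical, and the output is a logistic mispronunciation probability for one segment.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical metadata fields and category inventories (illustrative only;
# the abstract does not list the fields the authors actually used).
METADATA_VOCAB: Dict[str, List[str]] = {
    "school": ["school_a", "school_b", "school_c"],
    "age_group": ["10-12", "13-15"],
}


def one_hot_metadata(speaker: Dict[str, str]) -> List[float]:
    """One one-hot block per field, concatenated; missing or unseen values give an all-zero block."""
    vec: List[float] = []
    for field, categories in METADATA_VOCAB.items():
        value = speaker.get(field)
        vec.extend(1.0 if value == c else 0.0 for c in categories)
    return vec


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Sequence[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames: Sequence[Sequence[float]], query: Sequence[float]) -> List[float]:
    """Single-query dot-product attention: a softmax-weighted average of frame vectors."""
    weights = softmax([dot(query, f) for f in frames])
    dim = len(frames[0])
    return [sum(w * f[i] for w, f in zip(weights, frames)) for i in range(dim)]


def mispronunciation_prob(
    frames: Sequence[Sequence[float]],
    speaker: Dict[str, str],
    query: Sequence[float],
    weights: Sequence[float],
    bias: float,
) -> float:
    """Logistic score over the concatenation [pooled acoustics ; one-hot metadata]."""
    features = attention_pool(frames, query) + one_hot_metadata(speaker)
    if len(features) != len(weights):
        raise ValueError("weights must match the fused feature dimension")
    z = dot(weights, features) + bias
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Stand-ins for frame-level BLSTM outputs of one short speech segment.
    frames = [[0.2, -0.1], [0.5, 0.3], [0.1, 0.0]]
    speaker = {"school": "school_b", "age_group": "10-12"}
    weights = [0.8, -0.4] + [0.0, 0.3, 0.0] + [0.1, 0.0]  # arbitrary, untrained
    print(mispronunciation_prob(frames, speaker, [1.0, 1.0], weights, bias=-0.2))
```

Concatenating one-hot blocks at the classifier input is only one simple fusion choice. With sparse or imbalanced categories, many of these columns are rarely active in the training data, which is worth keeping in mind when reading the abstract's finding that the benefit of metadata depends on class sparsity and balance.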

Jose Antonio Lopez Saenz is a doctoral student at the University of Sheffield, funded by the CONACYT Programa de Becas en el Extranjero (fellowship number 661687). We also thank ITSLanguage BV for providing the data.


Author information

Authors and Affiliations

Speech and Hearing Research, Department of Computer Science, University of Sheffield, Sheffield, S10 2TN, UK
Jose Antonio Lopez Saenz & Thomas Hain

Corresponding author

Correspondence to Jose Antonio Lopez Saenz.

Editor information

Editors and Affiliations

Cardiff University, Cardiff, UK
Luis Espinosa-Anke
Rovira i Virgili University, Tarragona, Spain
Carlos Martín-Vide
School of Computer Science and Informatics, Cardiff University, Cardiff, UK
Irena Spasić

About this paper

Cite this paper

Saenz, J.A.L., Hain, T. (2021). Use of Speaker Metadata for Improving Automatic Pronunciation Assessment. In: Espinosa-Anke, L., Martín-Vide, C., Spasić, I. (eds) Statistical Language and Speech Processing. SLSP 2021. Lecture Notes in Computer Science, vol 13062. Springer, Cham. https://doi.org/10.1007/978-3-030-89579-2_6

DOI: https://doi.org/10.1007/978-3-030-89579-2_6
Published: 17 October 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89578-5
Online ISBN: 978-3-030-89579-2
eBook Packages: Computer Science, Computer Science (R0)
