Use of Speaker Metadata for Improving Automatic Pronunciation Assessment

  • Conference paper
Statistical Language and Speech Processing (SLSP 2021)

Abstract

Pronunciation assessment remains a subjective task that depends on a pronunciation reference held as canonical. Whether a second language (L2) speaker is able to replicate that reference is decided by an assessor who perceives the identity of the sounds produced. The assessor is known to be biased by their perception of the speaker, so defining a standard for L2 pronunciation is crucial in formal assessment. In Computer Assisted Pronunciation Assessment (CAPA), defining a pronunciation standard for L2 is not trivial because little L2 data is annotated for mispronunciations. Inspired by the assessor’s bias, this work explores an alternative to a conventional Automatic Speech Recognition approach for CAPA that uses speaker metadata along with acoustic observations for mispronunciation detection. A Bidirectional Long Short-Term Memory (BLSTM) network combined with self-attention was used to detect pronunciation errors in short speech segments. It was found that categorical metadata can have a positive effect on the classification of mispronounced segments, depending on the sparsity and balance of the classes. It was also found that different assessors can be influenced differently by information about the speaker’s linguistic background. The effect of the metadata was tested on data from Dutch children learning English as an L2 in schools across the Netherlands. The limited speaker diversity of the corpus made the task challenging and worth exploring further.
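As a rough illustration of combining acoustic observations with categorical speaker metadata, the sketch below pools a sequence of frame vectors with attention weights and appends a one-hot metadata vector to form a segment representation. It is not the authors' model: the recurrent encoder is omitted (the frame vectors stand in for its outputs), the attention is simplified to a single query vector, and the names `attention_pool`, `segment_representation`, and the `region_*` categories are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Pool frame vectors into one vector using scaled dot-product weights.

    frames: list of equal-length float lists (stand-ins for encoder outputs).
    query:  float list of the same length (would be learned in a real model).
    """
    dim = len(query)
    scores = [
        sum(f * q for f, q in zip(frame, query)) / math.sqrt(dim)
        for frame in frames
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return pooled, weights


def one_hot(value, categories):
    """Encode a categorical metadata value as a one-hot vector."""
    return [1.0 if value == c else 0.0 for c in categories]


def segment_representation(frames, query, metadata_value, categories):
    """Concatenate the pooled acoustic vector with the metadata one-hot."""
    pooled, _ = attention_pool(frames, query)
    return pooled + one_hot(metadata_value, categories)


# Toy inputs: three 2-dimensional frames and a hypothetical metadata field.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.5, 0.5]
categories = ["region_a", "region_b", "region_c"]  # hypothetical labels

representation = segment_representation(frames, query, "region_b", categories)
```

In this sketch the resulting vector would be passed to a classifier that labels the segment as correctly pronounced or mispronounced. A one-hot encoding grows with the number of categories, which is one way sparse or unbalanced metadata classes can affect such a classifier.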

Jose Antonio Lopez Saenz is a doctoral student at the University of Sheffield, supported by the Programa de Becas en el Extranjero of CONACYT under fellowship number 661687. We also thank ITSLanguage BV for providing the data.

Author information

Corresponding author

Correspondence to Jose Antonio Lopez Saenz.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Saenz, J.A.L., Hain, T. (2021). Use of Speaker Metadata for Improving Automatic Pronunciation Assessment. In: Espinosa-Anke, L., Martín-Vide, C., Spasić, I. (eds) Statistical Language and Speech Processing. SLSP 2021. Lecture Notes in Computer Science, vol 13062. Springer, Cham. https://doi.org/10.1007/978-3-030-89579-2_6

  • DOI: https://doi.org/10.1007/978-3-030-89579-2_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-89578-5

  • Online ISBN: 978-3-030-89579-2

  • eBook Packages: Computer Science, Computer Science (R0)
