Recognition of Heavily Accented and Emotional Speech of English and Czech Holocaust Survivors Using Various DNN Architectures

  • Conference paper
  • Speech and Computer (SPECOM 2021)

Abstract

The MALACH Project [6] verified the possibility of using automatic speech recognition (ASR) methods to search for information in large multilingual archives of Holocaust testimonies. After the end of the MALACH project, in which we participated, we continued working towards completing and implementing the project's objectives, with priority given to two languages: Czech and English. We have developed and implemented a full-text search system that can be used by experts and by the general public in the MALACH Centre for Visual History and the Jewish Museum in Prague. ASR is the key technology that ensures the functioning of the whole information retrieval process. To ensure the highest quality of search, we are constantly striving to improve this technology using state-of-the-art methods. This article presents the latest results obtained in extensive experiments with various DNN architectures for ASR of the English and Czech MALACH archives. The paper is therefore one of the first responses to M. Picheny's call [10] to the speech community to reconsider the very difficult task of recognizing the strongly emotional and heavily accented speech of Holocaust survivors.

Notes

  1. Speakers were selected according to several parameters, not only nationality. Place of birth (a better indicator than nationality), the length of transcribed speech, and gender formed the basis for the selection of a representative test set. It is well known that the place where speakers spent their childhood fundamentally influences the way they speak. This process resulted in these 10 testimonies: 00026, 00055 (only the 3rd tape), 01032, 19894 (only the 3rd and 10th tapes), 20806, 22984, 28430, 32907, 33414 and 34024 (MALACH IntCode) [21].
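     As an illustration only (not the authors' actual selection procedure), the following hypothetical Python sketch shows how interview metadata could be filtered into a small representative test set using the three criteria above: place of birth, length of transcribed speech, and gender. The Testimony fields, thresholds, and balancing rule are assumptions made for the example.

         # Hypothetical sketch: building a representative test set from interview
         # metadata. Field names, thresholds and the balancing rule are assumptions
         # for illustration; they are not taken from the paper.
         from dataclasses import dataclass

         @dataclass
         class Testimony:
             int_code: str             # MALACH IntCode, e.g. "00026"
             birth_place: str          # region where the speaker grew up (assumed field)
             transcribed_hours: float  # amount of manually transcribed speech
             gender: str               # "F" or "M"

         def select_test_set(testimonies, per_region=2, min_hours=0.5):
             """Pick up to `per_region` speakers for each place of birth,
             requiring a minimum amount of transcribed speech and holding a
             slot open for the other gender within a region."""
             by_region = {}
             # consider speakers with the most transcribed speech first
             for t in sorted(testimonies, key=lambda x: x.transcribed_hours, reverse=True):
                 if t.transcribed_hours < min_hours:
                     continue
                 chosen = by_region.setdefault(t.birth_place, [])
                 if len(chosen) >= per_region:
                     continue
                 # skip a speaker whose gender is already covered while the
                 # region still lacks the other gender
                 genders = {c.gender for c in chosen}
                 if t.gender in genders and len(genders) < 2:
                     continue
                 chosen.append(t)
             return [t for group in by_region.values() for t in group]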

References

  1. Abdel-Hamid, O., Mohamed, A., Jiang, H., Deng, L., Penn, G., Yu, D.: Convolutional neural networks for speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 22(10), 1533–1545 (2014). https://doi.org/10.1109/TASLP.2014.2339736

  2. Byrne, W., et al.: Automatic recognition of spontaneous speech for access to multilingual oral history archives. IEEE Trans. Speech Audio Process. 12(4), 420–435 (2004). https://doi.org/10.1109/TSA.2004.828702

  3. Dehak, N., Kenny, P.J., Dehak, R., Dumouchel, P., Ouellet, P.: Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 19(4), 788–798 (2011). https://doi.org/10.1109/TASL.2010.2064307

  4. Ghahremani, P., Manohar, V., Povey, D., Khudanpur, S.: Acoustic modelling from the signal domain using CNNs. In: Interspeech 2016, pp. 3434–3438 (2016). https://doi.org/10.21437/Interspeech.2016-1495

  5. Hadian, H., Sameti, H., Povey, D., Khudanpur, S.: Flat-start single-stage discriminatively trained HMM-based models for ASR. IEEE ACM Trans. Audio Speech Lang. Process. 26(11), 1949–1961 (2018). https://doi.org/10.1109/TASLP.2018.2848701

  6. MALACH project (2006). https://malach.umiacs.umd.edu/

  7. Mihajlik, P., Fegyó, T., Németh, B., Tüske, Z., Trón, V.: Towards automatic transcription of large spoken archives in agglutinating languages – Hungarian ASR for the MALACH project. In: Matoušek, V., Mautner, P. (eds.) TSD 2007. LNCS (LNAI), vol. 4629, pp. 342–349. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74628-7_45

  8. Novak, J.R., Nobuaki, M., Keikichi, H.: Phonetisaurus: Exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework. Nat. Lang. Eng. 22(6), 907–938 (2016). https://doi.org/10.1017/S1351324915000315

  9. Peddinti, V., Povey, D., Khudanpur, S.: A time delay neural network architecture for efficient modeling of long temporal contexts. In: Interspeech 2015, pp. 3214–3218 (2015)

  10. Picheny, M., Tüske, Z., Kingsbury, B., Audhkhasi, K., Cui, X., Saon, G.: Challenging the boundaries of speech recognition: the MALACH corpus. In: Interspeech 2019, pp. 326–330 (2019). https://doi.org/10.21437/Interspeech.2019-1907

  11. Povey, D., et al.: Semi-orthogonal low-rank matrix factorization for deep neural networks. In: Interspeech 2018, pp. 3743–3747 (2018). https://doi.org/10.21437/Interspeech.2018-1417

  12. Povey, D., et al.: The Kaldi speech recognition toolkit. In: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (2011)

  13. Povey, D., et al.: Purely sequence-trained neural networks for ASR based on lattice-free MMI. In: Interspeech 2016, pp. 2751–2755 (2016). https://doi.org/10.21437/Interspeech.2016-595

  14. Psutka, J., Hoidekr, J., Ircing, P., Psutka, J.V.: Recognition of spontaneous speech - some problems and their solutions. In: CITSA 2006, pp. 169–172. IIIS (2006)

  15. Psutka, J., Ircing, P., Psutka, J.V., Hajič, J., Byrne, W., Mírovský, J.: Automatic transcription of Czech, Russian and Slovak spontaneous speech in the MALACH project. In: Eurospeech 2005, pp. 1349–1352. ISCA (2005)

  16. Psutka, J., et al.: Large vocabulary ASR for spontaneous Czech in the MALACH project. In: Eurospeech 2003, pp. 1821–1824. ISCA (2003)

  17. Psutka, J., Švec, J., Psutka, J.V., Vaněk, J., Pražák, A., Šmídl, L.: Fast phonetic/lexical searching in the archives of the Czech Holocaust testimonies: advancing towards the MALACH project visions. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2010. LNCS (LNAI), vol. 6231, pp. 385–391. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15760-8_49

  18. Psutka, J., et al.: System for fast lexical and phonetic spoken term detection in a Czech cultural heritage archive. EURASIP J. Audio Speech Music Process. 2011(1), 1–10 (2011). https://doi.org/10.1186/1687-4722-2011-10

  19. Psutka, J.V., et al.: USC-SFI MALACH interviews and transcripts Czech (2014). https://catalog.ldc.upenn.edu/LDC2014S04

  20. Ramabhadran, B., Huang, J., Picheny, M.: Towards automatic transcription of large spoken archives - English ASR for the MALACH project. In: ICASSP 2003, p. I (2003). https://doi.org/10.1109/ICASSP.2003.1198756

  21. Ramabhadran, B., et al.: USC-SFI MALACH interviews and transcripts English (2012). https://catalog.ldc.upenn.edu/LDC2012S05

  22. Stanislav, P., Švec, J., Ircing, P.: An engine for online video search in large archives of the Holocaust testimonies. In: Interspeech 2016, pp. 2352–2353 (2016)

  23. Švec, J., Psutka, J., Trmal, J., Šmídl, L., Ircing, P., Sedmidubský, J.: On the use of grapheme models for searching in large spoken archives. In: ICASSP 2018, pp. 6259–6263 (2018). https://doi.org/10.1109/ICASSP.2018.8461774

  24. Vaněk, J., Trmal, J., Psutka, J.V., Psutka, J.: Optimized acoustic likelihoods computation for NVIDIA and ATI/AMD graphics processors. IEEE Trans. Audio Speech Lang. Process. 20(6), 1818–1828 (2012). https://doi.org/10.1109/TASL.2012.2190928

  25. Veselý, K., Ghoshal, A., Burget, L., Povey, D.: Sequence-discriminative training of deep neural networks. In: Interspeech 2013, pp. 2345–2349 (2013)

  26. Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., Lang, K.J.: Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust. Speech Signal Process. 37(3), 328–339 (1989). https://doi.org/10.1109/29.21701

  27. Wang, D., Wang, X., Lv, S.: An overview of end-to-end automatic speech recognition. Symmetry 11(8) (2019). https://doi.org/10.3390/sym11081018

  28. Young, S.: The HTK hidden Markov model toolkit: design and philosophy, vol. 2, pp. 2–44. Entropic Cambridge Research Laboratory, Ltd. (1994)

  29. Zhang, X., Trmal, J., Povey, D., Khudanpur, S.: Improving deep neural network acoustic models using generalized maxout networks. In: ICASSP 2014, pp. 215–219 (2014). https://doi.org/10.1109/ICASSP.2014.6853589

Acknowledgements

This paper was supported by the Technology Agency of the Czech Republic, project no. TN01000024.

Author information

Corresponding author

Correspondence to Josef V. Psutka.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Psutka, J.V., Pražák, A., Vaněk, J. (2021). Recognition of Heavily Accented and Emotional Speech of English and Czech Holocaust Survivors Using Various DNN Architectures. In: Karpov, A., Potapova, R. (eds.) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science, vol. 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_50

  • DOI: https://doi.org/10.1007/978-3-030-87802-3_50

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-87801-6

  • Online ISBN: 978-3-030-87802-3

  • eBook Packages: Computer Science, Computer Science (R0)
