Abstract
The MALACH project [6] verified the feasibility of using automatic speech recognition (ASR) to search for information in large multilingual archives of Holocaust testimonies. After the end of the MALACH project, in which we participated, we continued working towards the completion and deployment of the project's objectives, with priority given to two languages: Czech and English. We have developed and deployed a full-text search system that can be used both by experts and by the general public at the MALACH Centre for Visual History and the Jewish Museum in Prague. ASR is the key technology underpinning the whole information retrieval process. To ensure the highest search quality, we continually refine this technology using state-of-the-art methods. This article presents the latest results obtained in extensive experiments with various DNN architectures for ASR of the English and Czech MALACH archives. The paper is thus one of the first responses to M. Picheny's call [10] to the speech community to revisit the very difficult task of recognizing the strongly emotional and heavily accented speech of Holocaust survivors.
Notes
- 1.
Speakers were selected according to several parameters, not only nationality. Place of birth (more informative than nationality), length of transcribed speech, and gender formed the basis for selecting a representative test set. It is well known that the place where speakers spent their childhood fundamentally influences the way they speak. This process resulted in these 10 testimonies: 00026, 00055 (3rd tape only), 01032, 19894 (3rd and 10th tapes only), 20806, 22984, 28430, 32907, 33414 and 34024 (MALACH IntCode) [21].
References
Abdel-Hamid, O., Mohamed, A., Jiang, H., Deng, L., Penn, G., Yu, D.: Convolutional neural networks for speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 22(10), 1533–1545 (2014). https://doi.org/10.1109/TASLP.2014.2339736
Byrne, W., et al.: Automatic recognition of spontaneous speech for access to multilingual oral history archives. IEEE Trans. Speech Audio Process. 12(4), 420–435 (2004). https://doi.org/10.1109/TSA.2004.828702
Dehak, N., Kenny, P.J., Dehak, R., Dumouchel, P., Ouellet, P.: Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 19(4), 788–798 (2011). https://doi.org/10.1109/TASL.2010.2064307
Ghahremani, P., Manohar, V., Povey, D., Khudanpur, S.: Acoustic modelling from the signal domain using CNNs. In: Interspeech 2016, pp. 3434–3438 (2016). https://doi.org/10.21437/Interspeech.2016-1495
Hadian, H., Sameti, H., Povey, D., Khudanpur, S.: Flat-start single-stage discriminatively trained HMM-based models for ASR. IEEE/ACM Trans. Audio Speech Lang. Process. 26(11), 1949–1961 (2018). https://doi.org/10.1109/TASLP.2018.2848701
MALACH project (2006). https://malach.umiacs.umd.edu/
Mihajlik, P., Fegyó, T., Németh, B., Tüske, Z., Trón, V.: Towards automatic transcription of large spoken archives in agglutinating languages – Hungarian ASR for the MALACH project. In: Matoušek, V., Mautner, P. (eds.) TSD 2007. LNCS (LNAI), vol. 4629, pp. 342–349. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74628-7_45
Novak, J.R., Nobuaki, M., Keikichi, H.: Phonetisaurus: Exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework. Nat. Lang. Eng. 22(6), 907–938 (2016). https://doi.org/10.1017/S1351324915000315
Peddinti, V., Povey, D., Khudanpur, S.: A time delay neural network architecture for efficient modeling of long temporal contexts. In: Interspeech 2015, pp. 3214–3218 (2015)
Picheny, M., Tüske, Z., Kingsbury, B., Audhkhasi, K., Cui, X., Saon, G.: Challenging the boundaries of speech recognition: the MALACH corpus. In: Interspeech 2019, pp. 326–330 (2019). https://doi.org/10.21437/Interspeech.2019-1907
Povey, D., et al.: Semi-orthogonal low-rank matrix factorization for deep neural networks. In: Interspeech 2018, pp. 3743–3747 (2018). https://doi.org/10.21437/Interspeech.2018-1417
Povey, D., et al.: The Kaldi speech recognition toolkit. In: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (2011)
Povey, D., et al.: Purely sequence-trained neural networks for ASR based on lattice-free MMI. In: Interspeech 2016, pp. 2751–2755 (2016). https://doi.org/10.21437/Interspeech.2016-595
Psutka, J., Hoidekr, J., Ircing, P., Psutka, J.V.: Recognition of spontaneous speech - some problems and their solutions. In: CITSA 2006, pp. 169–172. IIIS (2006)
Psutka, J., Ircing, P., Psutka, J.V., Hajič, J., Byrne, W., Mírovský, J.: Automatic transcription of Czech, Russian and Slovak spontaneous speech in the MALACH project. In: Eurospeech 2005, pp. 1349–1352. ISCA (2005)
Psutka, J., et al.: Large vocabulary ASR for spontaneous Czech in the MALACH project. In: Eurospeech 2003, pp. 1821–1824. ISCA (2003)
Psutka, J., Švec, J., Psutka, J.V., Vaněk, J., Pražák, A., Šmídl, L.: Fast phonetic/lexical searching in the archives of the Czech Holocaust testimonies: advancing towards the MALACH project visions. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2010. LNCS (LNAI), vol. 6231, pp. 385–391. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15760-8_49
Psutka, J., et al.: System for fast lexical and phonetic spoken term detection in a Czech cultural heritage archive. EURASIP J. Audio Speech Music Process. 2011(1), 1–10 (2011). https://doi.org/10.1186/1687-4722-2011-10
Psutka, J.V., et al.: USC-SFI MALACH interviews and transcripts Czech (2014). https://catalog.ldc.upenn.edu/LDC2014S04
Ramabhadran, B., Huang, J., Picheny, M.: Towards automatic transcription of large spoken archives - English ASR for the MALACH project. In: ICASSP 2003, p. I (2003). https://doi.org/10.1109/ICASSP.2003.1198756
Ramabhadran, B., et al.: USC-SFI MALACH interviews and transcripts English (2012). https://catalog.ldc.upenn.edu/LDC2012S05
Stanislav, P., Švec, J., Ircing, P.: An engine for online video search in large archives of the holocaust testimonies. In: Interspeech 2016, pp. 2352–2353 (2016)
Švec, J., Psutka, J., Trmal, J., Šmídl, L., Ircing, P., Sedmidubský, J.: On the use of grapheme models for searching in large spoken archives. In: ICASSP 2018, pp. 6259–6263 (2018). https://doi.org/10.1109/ICASSP.2018.8461774
Vaněk, J., Trmal, J., Psutka, J.V., Psutka, J.: Optimized acoustic likelihoods computation for NVIDIA and ATI/AMD graphics processors. IEEE Trans. Audio Speech Lang. Process. 20(6), 1818–1828 (2012). https://doi.org/10.1109/TASL.2012.2190928
Veselý, K., Ghoshal, A., Burget, L., Povey, D.: Sequence-discriminative training of deep neural networks. In: Interspeech 2013, pp. 2345–2349 (2013)
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., Lang, K.J.: Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust. Speech Signal Process. 37(3), 328–339 (1989). https://doi.org/10.1109/29.21701
Wang, D., Wang, X., Lv, S.: An overview of end-to-end automatic speech recognition. Symmetry 11(8) (2019). https://doi.org/10.3390/sym11081018
Young, S.: The HTK hidden Markov model toolkit: design and philosophy, vol. 2, pp. 2–44. Entropic Cambridge Research Laboratory, Ltd. (1994)
Zhang, X., Trmal, J., Povey, D., Khudanpur, S.: Improving deep neural network acoustic models using generalized maxout networks. In: ICASSP 2014, pp. 215–219 (2014). https://doi.org/10.1109/ICASSP.2014.6853589
Acknowledgements
This paper was supported by the Technology Agency of the Czech Republic, project no. TN01000024.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Psutka, J.V., Pražák, A., Vaněk, J. (2021). Recognition of Heavily Accented and Emotional Speech of English and Czech Holocaust Survivors Using Various DNN Architectures. In: Karpov, A., Potapova, R. (eds.) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science, vol. 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_50
DOI: https://doi.org/10.1007/978-3-030-87802-3_50
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87801-6
Online ISBN: 978-3-030-87802-3
eBook Packages: Computer Science (R0)