Abstract
The MALACH project [6] verified the feasibility of using automatic speech recognition (ASR) to search for information in large multilingual archives of Holocaust testimonies. After the end of the MALACH project, in which we participated, we continued working towards the completion and deployment of the project's objectives, with priority given to two languages: Czech and English. We have developed and deployed a full-text search system that can be used both by experts and by the general public at the MALACH Centre for Visual History and the Jewish Museum in Prague. ASR is the key technology underpinning the whole information retrieval process. To ensure the highest search quality, we continually refine this technology using state-of-the-art methods. This article presents the latest results obtained in extensive experiments with various DNN architectures for ASR of the English and Czech MALACH archives. The paper is thus one of the first responses to M. Picheny's call [10] to the speech community to revisit the very difficult task of recognizing the strongly emotional and heavily accented speech of Holocaust survivors.
Notes
- 1.
Speakers were selected according to several parameters, not only nationality. Place of birth (more informative than nationality), length of transcribed speech, and gender formed the basis for selecting a representative test set. It is well known that the place where speakers spent their childhood fundamentally influences the way they speak. This process resulted in these 10 testimonies: 00026, 00055 (3rd tape only), 01032, 19894 (3rd and 10th tapes only), 20806, 22984, 28430, 32907, 33414 and 34024 (MALACH IntCode) [21].
References
Abdel-Hamid, O., Mohamed, A., Jiang, H., Deng, L., Penn, G., Yu, D.: Convolutional neural networks for speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 22(10), 1533–1545 (2014). https://doi.org/10.1109/TASLP.2014.2339736
Byrne, W., et al.: Automatic recognition of spontaneous speech for access to multilingual oral history archives. IEEE Trans. Speech Audio Process. 12(4), 420–435 (2004). https://doi.org/10.1109/TSA.2004.828702
Dehak, N., Kenny, P.J., Dehak, R., Dumouchel, P., Ouellet, P.: Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 19(4), 788–798 (2011). https://doi.org/10.1109/TASL.2010.2064307
Ghahremani, P., Manohar, V., Povey, D., Khudanpur, S.: Acoustic modelling from the signal domain using CNNs. In: Interspeech 2016, pp. 3434–3438 (2016). https://doi.org/10.21437/Interspeech.2016-1495
Hadian, H., Sameti, H., Povey, D., Khudanpur, S.: Flat-start single-stage discriminatively trained HMM-based models for ASR. IEEE/ACM Trans. Audio Speech Lang. Process. 26(11), 1949–1961 (2018). https://doi.org/10.1109/TASLP.2018.2848701
MALACH project (2006). https://malach.umiacs.umd.edu/
Mihajlik, P., Fegyó, T., Németh, B., Tüske, Z., Trón, V.: Towards automatic transcription of large spoken archives in agglutinating languages – Hungarian ASR for the MALACH project. In: Matoušek, V., Mautner, P. (eds.) TSD 2007. LNCS (LNAI), vol. 4629, pp. 342–349. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74628-7_45
Novak, J.R., Nobuaki, M., Keikichi, H.: Phonetisaurus: Exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework. Nat. Lang. Eng. 22(6), 907–938 (2016). https://doi.org/10.1017/S1351324915000315
Peddinti, V., Povey, D., Khudanpur, S.: A time delay neural network architecture for efficient modeling of long temporal contexts. In: Interspeech 2015, pp. 3214–3218 (2015)
Picheny, M., Tüske, Z., Kingsbury, B., Audhkhasi, K., Cui, X., Saon, G.: Challenging the boundaries of speech recognition: the MALACH corpus. In: Interspeech 2019, pp. 326–330 (2019). https://doi.org/10.21437/Interspeech.2019-1907
Povey, D., et al.: Semi-orthogonal low-rank matrix factorization for deep neural networks. In: Interspeech 2018, pp. 3743–3747 (2018). https://doi.org/10.21437/Interspeech.2018-1417
Povey, D., et al.: The Kaldi speech recognition toolkit. In: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (2011)
Povey, D., et al.: Purely sequence-trained neural networks for ASR based on lattice-free MMI. In: Interspeech 2016, pp. 2751–2755 (2016). https://doi.org/10.21437/Interspeech.2016-595
Psutka, J., Hoidekr, J., Ircing, P., Psutka, J.V.: Recognition of spontaneous speech - some problems and their solutions. In: CITSA 2006, pp. 169–172. IIIS (2006)
Psutka, J., Ircing, P., Psutka, J.V., Hajič, J., Byrne, W., Mírovský, J.: Automatic transcription of Czech, Russian and Slovak spontaneous speech in the MALACH project. In: Eurospeech 2005, pp. 1349–1352. ISCA (2005)
Psutka, J., et al.: Large vocabulary ASR for spontaneous Czech in the MALACH project. In: Eurospeech 2003, pp. 1821–1824. ISCA (2003)
Psutka, J., Švec, J., Psutka, J.V., Vaněk, J., Pražák, A., Šmídl, L.: Fast phonetic/lexical searching in the archives of the Czech Holocaust testimonies: advancing towards the MALACH project visions. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2010. LNCS (LNAI), vol. 6231, pp. 385–391. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15760-8_49
Psutka, J., et al.: System for fast lexical and phonetic spoken term detection in a Czech cultural heritage archive. EURASIP J. Audio Speech Music Process. 2011(1), 1–10 (2011). https://doi.org/10.1186/1687-4722-2011-10
Psutka, J.V., et al.: USC-SFI MALACH interviews and transcripts Czech (2014). https://catalog.ldc.upenn.edu/LDC2014S04
Ramabhadran, B., Huang, J., Picheny, M.: Towards automatic transcription of large spoken archives - English ASR for the MALACH project. In: ICASSP 2003, p. I (2003). https://doi.org/10.1109/ICASSP.2003.1198756
Ramabhadran, B., et al.: USC-SFI MALACH interviews and transcripts English (2012). https://catalog.ldc.upenn.edu/LDC2012S05
Stanislav, P., Švec, J., Ircing, P.: An engine for online video search in large archives of the holocaust testimonies. In: Interspeech 2016, pp. 2352–2353 (2016)
Švec, J., Psutka, J., Trmal, J., Šmídl, L., Ircing, P., Sedmidubský, J.: On the use of grapheme models for searching in large spoken archives. In: ICASSP 2018, pp. 6259–6263 (2018). https://doi.org/10.1109/ICASSP.2018.8461774
Vaněk, J., Trmal, J., Psutka, J.V., Psutka, J.: Optimized acoustic likelihoods computation for NVIDIA and ATI/AMD graphics processors. IEEE Trans. Audio Speech Lang. Process. 20(6), 1818–1828 (2012). https://doi.org/10.1109/TASL.2012.2190928
Veselý, K., Ghoshal, A., Burget, L., Povey, D.: Sequence-discriminative training of deep neural networks. In: Interspeech 2013, pp. 2345–2349 (2013)
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., Lang, K.J.: Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust. Speech Signal Process. 37(3), 328–339 (1989). https://doi.org/10.1109/29.21701
Wang, D., Wang, X., Lv, S.: An overview of end-to-end automatic speech recognition. Symmetry 11(8) (2019). https://doi.org/10.3390/sym11081018
Young, S.: The HTK hidden Markov model toolkit: design and philosophy, vol. 2, pp. 2–44. Entropic Cambridge Research Laboratory, Ltd. (1994)
Zhang, X., Trmal, J., Povey, D., Khudanpur, S.: Improving deep neural network acoustic models using generalized maxout networks. In: ICASSP 2014, pp. 215–219 (2014). https://doi.org/10.1109/ICASSP.2014.6853589
Acknowledgements
This paper was supported by the Technology Agency of the Czech Republic, project no. TN01000024.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Psutka, J.V., Pražák, A., Vaněk, J. (2021). Recognition of Heavily Accented and Emotional Speech of English and Czech Holocaust Survivors Using Various DNN Architectures. In: Karpov, A., Potapova, R. (eds.) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science, vol. 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_50
DOI: https://doi.org/10.1007/978-3-030-87802-3_50
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87801-6
Online ISBN: 978-3-030-87802-3
eBook Packages: Computer Science (R0)