Transfer Learning Using Whisper for Dysarthric Automatic Speech Recognition

Rathod, Siddharth; Charola, Monil; Patil, Hemant A.

doi:10.1007/978-3-031-48309-7_46

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14338)

Included in the following conference series:

International Conference on Speech and Computer

Abstract

Dysarthria is a motor speech disorder that affects an individual’s ability to articulate words, making speech recognition a challenging task. Automatic Speech Recognition (ASR) technologies have the potential to greatly benefit individuals with dysarthria by providing them with a means of communication through computing and portable digital devices. These technologies can serve as an interaction medium, enabling dysarthric patients to communicate with other people and with computers. In this paper, we propose a transfer learning approach that uses the Whisper model to develop a dysarthric ASR system. Whisper (Web-scale Supervised Pretraining for Speech Recognition) is a multitask model trained on a range of speech-related tasks, such as multilingual speech transcription, speech translation, voice activity detection, and language identification, using 680,000 hours of labeled audio data. Using the proposed Whisper-based approach with a Bi-LSTM classifier, we obtained an average word recognition accuracy of 59.78% on 155 words of the UA-Speech corpus.
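As a rough illustration of the reported metric, the sketch below computes word recognition accuracy for an isolated-word task and averages it over speakers. The abstract does not say how the average is taken, so averaging per speaker is an assumption here, and the function names, speaker IDs, and predictions are all hypothetical toy values rather than UA-Speech data or the authors' evaluation code.

```python
from collections import defaultdict


def word_accuracy(pairs):
    """Fraction of (reference, predicted) word pairs that match exactly."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no utterances to score")
    correct = sum(1 for ref, hyp in pairs if ref == hyp)
    return correct / len(pairs)


def average_accuracy(results):
    """Return (mean of per-speaker accuracies, per-speaker accuracies).

    `results` is an iterable of (speaker_id, reference_word, predicted_word).
    Averaging per speaker first is an assumption; pooling all utterances
    would instead weight each speaker by the number of recordings.
    """
    by_speaker = defaultdict(list)
    for speaker, ref, hyp in results:
        by_speaker[speaker].append((ref, hyp))
    if not by_speaker:
        raise ValueError("no results to score")
    per_speaker = {s: word_accuracy(p) for s, p in by_speaker.items()}
    mean = sum(per_speaker.values()) / len(per_speaker)
    return mean, per_speaker


# Made-up predictions for two hypothetical speakers (not UA-Speech data).
toy = [
    ("spk_a", "command", "command"),
    ("spk_a", "delete", "delete"),
    ("spk_a", "paste", "taste"),
    ("spk_a", "enter", "enter"),
    ("spk_b", "command", "comment"),
    ("spk_b", "delete", "delete"),
]
mean_acc, per_speaker = average_accuracy(toy)
print(f"{mean_acc:.2%}")  # prints 62.50%
```

With these toy values the first speaker scores 3 of 4 and the second 1 of 2, so the per-speaker mean is 62.50%, whereas pooling all six utterances would give 4 of 6. The choice of averaging scheme therefore matters when speakers contribute different numbers of recordings.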

Acknowledgments

The authors sincerely thank the Ministry of Electronics and Information Technology (MeitY), New Delhi, Govt. of India, for supporting the project ‘Speech Technologies in Indian Languages BHASHINI’ (Grant ID: 11(1)2022-HCC (TDIL)).

Author information

Authors and Affiliations

Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, India
Siddharth Rathod, Monil Charola & Hemant A. Patil

Corresponding author

Correspondence to Siddharth Rathod.

Editor information

Editors and Affiliations

St. Petersburg Federal Research Center of the Russian Academy of Sciences, St. Petersburg, Russia
Alexey Karpov
Koneru Lakshmaiah Education Foundation, Vaddeswaram, India
K. Samudravijaya
Indian Institute of Information Technology Dharwad, Dharwad, India
K. T. Deepak
Indian Institute of Technology Dharwad, Dharwad, India
Rajesh M. Hegde
KIIT Group of Colleges, Gurugram, India
Shyam S. Agrawal
Indian Institute of Technology Dharwad, Dharwad, India
S. R. Mahadeva Prasanna

About this paper

Cite this paper

Rathod, S., Charola, M., Patil, H.A. (2023). Transfer Learning Using Whisper for Dysarthric Automatic Speech Recognition. In: Karpov, A., Samudravijaya, K., Deepak, K.T., Hegde, R.M., Agrawal, S.S., Prasanna, S.R.M. (eds) Speech and Computer. SPECOM 2023. Lecture Notes in Computer Science, vol 14338. Springer, Cham. https://doi.org/10.1007/978-3-031-48309-7_46

DOI: https://doi.org/10.1007/978-3-031-48309-7_46
Published: 22 November 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-48308-0
Online ISBN: 978-3-031-48309-7
eBook Packages: Computer Science, Computer Science (R0)
