Abstract
This paper investigates the spectral envelope characteristics arising from vocal-tract resonance structure, together with fine-level excitation-source features, within short-time Fourier transform (STFT) magnitude spectra for the assessment of dysarthria. A single-channel convolutional neural network (CNN) operating on time-frequency representations such as the STFT spectrogram (STFT-SPEC) or Mel-spectrogram (MEL-SPEC) cannot be guaranteed to capture source and system information simultaneously, because its convolution uses a fixed filter size. Motivated by this observation, this study first examines the influence of convolution filter size on a CNN-based automated dysarthria assessment system. An approach is then introduced to capture both resonance structure and fine-level features through a multi-channel CNN. In the proposed approach, the STFT-SPEC is decomposed by a one-level discrete wavelet transform (DWT) to separate the slowly varying spectral structure from the fine-level features. The resulting decomposed coefficients in four directions are taken as the inputs to a multi-channel CNN that captures source and system features using convolution filters of different sizes. Experimental results on the UA-Speech corpus validate the efficacy of the proposed multi-channel CNN approach, which achieves notable gains in accuracy and F1 score (60.86% and 48.52%) over a single-channel CNN using STFT-SPEC (46.45% and 40.97%), MEL-SPEC (48.86% and 38.20%), or MEL-SPEC appended with delta and delta-delta coefficients (52.40% and 42.84%) for dysarthria assessment in a speaker-independent and text-independent mode.
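The front-end described in the abstract can be illustrated with a minimal sketch (not the authors' code): a magnitude STFT spectrogram is computed, and a one-level 2-D DWT splits it into an approximation sub-band (the slowly varying spectral envelope) and three directional detail sub-bands (fine structure), which together form the four input channels of the multi-channel CNN. PyWavelets is assumed for the DWT; window length, hop size, and the Haar wavelet are illustrative choices, not values taken from the paper.

```python
import numpy as np
import pywt  # PyWavelets, assumed available

def stft_spec(x, n_fft=256, hop=128):
    # Magnitude STFT spectrogram via framed FFT with a Hann window.
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (freq, time)

def dwt_channels(spec, wavelet="haar"):
    # One-level 2-D DWT: approximation coefficients carry the slowly
    # varying resonance (system) structure; the horizontal, vertical,
    # and diagonal detail coefficients carry fine-level (source) cues.
    cA, (cH, cV, cD) = pywt.dwt2(spec, wavelet)
    # Stack the four directional sub-bands as CNN input channels.
    return np.stack([cA, cH, cV, cD])

x = np.random.randn(16000)          # 1 s stand-in signal at 16 kHz
spec = stft_spec(x)                 # (129 freq bins, 124 frames)
channels = dwt_channels(spec)
print(channels.shape)               # 4 sub-bands, each halved in both axes
```

Each sub-band can then be routed to its own CNN branch with a filter size matched to the scale of the information it carries (larger filters for the envelope channel, smaller ones for the detail channels).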
Data Availability
The speech database employed for experimental assessments in this paper can be accessed via the following link: http://www.isle.illinois.edu/sst/data/UASpeech/. The data can be obtained by sending a request email to Mark Hasegawa-Johnson, specifying your username and affiliated institution.
Funding
The authors declare that they received no financial support from any source for this work.
Author information
Contributions
All authors contributed equally
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ahmad, M.T., Pradhan, G. & Singh, J.P. Modeling Source and System Features Through Multi-channel Convolutional Neural Network for Improving Intelligibility Assessment of Dysarthric Speech. Circuits Syst Signal Process (2024). https://doi.org/10.1007/s00034-024-02739-6