Key points

Question: Which machine learning method performs best in predicting several neurodegenerative syndromes?

Findings: In a comparison of support vector machine, random forest, gradient boosting, and deep feed-forward neural networks, the neural networks performed best in classifying different neurodegenerative syndromes based on pre-structured volume measures.

Meaning: Even with pre-structured data, deep neural networks are most promising.

Introduction

In light of the demographic shift and the pending shortage of resources in healthcare systems across the globe, computer-aided methods are expected to shoulder some of the challenges. Supportive technology will find its way into the clinic to assist physicians in finding the correct diagnosis [1]. The implementation of artificial intelligence into clinical routine is already happening, and it is only a matter of time until medical decisions rely on algorithms in conjunction with the experience of physicians.

In the case of neurodegenerative syndromes, brain imaging can provide important MRI-morphological biomarkers in the form of atrophy patterns. While some focal atrophy patterns are quite disease-specific [2,3,4,5,6,7,8], which has even led to their incorporation into diagnostic criteria [9,10,11], neuroimaging findings for other diseases might be less conclusive [12]. Moreover, detecting and interpreting these signs correctly requires highly trained and specialized neuroradiologists, an expertise that is not available everywhere.

For analyzing the complex multivariate and nonlinear relationships in high dimensional data derived from MRI data, machine learning algorithms are superior to standard inferential statistics [13, 14]. For the classification of neurological and psychiatric diseases, support vector machines (SVM) based on imaging-derived data have been the most popular method [14]. SVMs have proven to be a suitable approach at least in binary differentiations of patients from healthy controls [13,14,15]. A few studies further used SVM to differentiate disease entities from each other—a more complex approach that simulates the process of differential diagnosis. In a previous study, we assessed the performance of SVM to differentiate two dementia syndromes from each other [16, 17]. In another study, SVM was used to classify various parkinsonian syndromes based on the results of volumetric MRI analysis [18]. While SVM produced satisfactory results, other methodological approaches were not assessed further.

In recent years, deep learning methods have become more and more popular for pattern recognition tasks such as the classification of image and text data, but also of structured data [19]. Deep learning methods process data on several levels. In this way, more and more abstract representations are generated up to the class as the most abstract form of representation [19]. Deep neural networks (DNNs) in particular have proven to be highly proficient in predicting diagnoses based on imaging data of the eye, skin, or lung [20,21,22] and will most likely become a key component of imaging diagnostics in the future. Hopefully, these advanced models will be able to capture more complex atrophy patterns in the human brain than SVM approaches and might assist radiologists with their assessment in the future.

Accordingly, in this work we compare these models for the classification of neurodegenerative syndromes based on atlas-based volume measures in a very large dataset that includes numerous diseases. Besides DNN and SVM, we apply two ensemble learning methods, random forest (RF) and gradient boosting (GB), which have proven proficient in many classification challenges dealing with similar data [14, 23]. Preprocessing the images into structured data via atlas-based volumetry is useful for clinical purposes because it normalizes the data, thereby reduces inter-center variability, guarantees complete anonymization of the data, and decreases computing time when training the models.

The syndromes considered in this study all belong to the neurodegenerative disease spectrum. They range from Alzheimer’s disease (AD) and frontotemporal lobar degeneration, with its subtypes behavioral variant frontotemporal dementia (bvFTD) and primary progressive aphasia (PPA) in its three subforms (semantic variant (svPPA), nonfluent-agrammatic variant (nfvPPA), and logopenic variant (lvPPA)), to atypical Parkinson syndromes such as corticobasal syndrome (CBS), progressive supranuclear palsy (PSP), multiple system atrophy with cerebellar features (MSA-C), and MSA with predominant parkinsonism (MSA-P), as well as idiopathic Parkinson’s disease (PD).

This use case is exemplary for imaging-derived structural data and can be transferred to other use cases in the biomedical sciences. By including ten different neurodegenerative diseases in addition to a control cohort, our approach best mirrors the work of radiologists in clinical routine, i.e., first categorizing a brain scan as normal or abnormal and then defining the neurodegenerative entity in the differential diagnostic process. We hypothesize (i) that neurodegenerative diseases can be classified with reasonable accuracy from structural brain imaging data, in particular if they are characterized by specific atrophy patterns, and (ii) that DNNs perform better than SVM.

Methods

Subjects and demographic characteristics

The study included multi-centric data from 940 subjects, i.e., 124 healthy controls and 816 patients from the German Research Consortium of Frontotemporal Lobar Degeneration (www.ftld.de) [24] and from the German Atypical Parkinson Consortium Study Group [18, 25]. The patient cohort consisted of 72 patients with AD, 146 patients with bvFTD, 26 patients with CBS, 30 patients with lvPPA, 21 patients with MSA-C, 60 patients with MSA-P, 58 patients with nfvPPA, 203 patients with PD, 154 patients with PSP, and 46 patients with svPPA.

Figure 1 and Table 1 provide an overview of the age and gender distribution of the study cohort. Age distribution was compared with the Kruskal-Wallis test and post hoc with a Wilcoxon rank-sum test between all pairs of samples (Bonferroni-corrected). Patients with AD were significantly older than patients with bvFTD (p < 0.05). Patients with PSP were significantly older than healthy controls (p < 0.001) and patients with MSA-C (p < 0.05), MSA-P (p < 0.001), PD (p < 0.05), bvFTD (p < 0.001), and svPPA (p < 0.001). Furthermore, patients with bvFTD were significantly younger than patients with nfvPPA (p < 0.001). Also, patients with svPPA were significantly younger than patients with nfvPPA (p < 0.05).
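The testing scheme described above (an omnibus Kruskal-Wallis test followed by Bonferroni-corrected pairwise rank-sum tests) can be sketched as follows. This is a minimal illustration and not the analysis code of the study: the function name, the group labels chosen, and the synthetic ages are assumptions, and SciPy's Mann-Whitney U test is used as the equivalent of the Wilcoxon rank-sum test.

```python
from itertools import combinations

import numpy as np
from scipy.stats import kruskal, mannwhitneyu


def compare_age_across_groups(ages_by_group):
    """Omnibus Kruskal-Wallis test, then Bonferroni-corrected pairwise rank-sum tests."""
    h_stat, p_omnibus = kruskal(*ages_by_group.values())
    pairs = list(combinations(ages_by_group, 2))
    pairwise = {}
    for a, b in pairs:
        # The Mann-Whitney U test is equivalent to the Wilcoxon rank-sum test.
        _, p_raw = mannwhitneyu(ages_by_group[a], ages_by_group[b],
                                alternative="two-sided")
        # Bonferroni: multiply by the number of comparisons and cap at 1.
        pairwise[(a, b)] = min(1.0, p_raw * len(pairs))
    return h_stat, p_omnibus, pairwise


# Synthetic ages for illustration only (not the study data).
rng = np.random.default_rng(0)
ages = {
    "HC": rng.normal(60, 8, 40),
    "bvFTD": rng.normal(61, 8, 40),
    "PSP": rng.normal(72, 8, 40),
}
h_stat, p_omnibus, pairwise = compare_age_across_groups(ages)
for pair, p_adj in pairwise.items():
    print(pair, round(p_adj, 4))
```

With eleven groups, as in this cohort, the number of pairwise comparisons is 55, so each raw p-value would be multiplied by 55 under this correction.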

Fig. 1

Violin plot of the age and gender distribution of the cohort sample. The dashed line indicates the mean, and the dotted line indicates the standard deviation. AD, Alzheimer’s disease; bvFTD, behavioral variant frontotemporal dementia; CBS, corticobasal syndrome; lvPPA, logopenic variant primary progressive aphasia; MSA-C, multiple system atrophy (cerebellar dysfunction subtype); MSA-P, multiple system atrophy (parkinsonian subtype); nfvPPA, nonfluent variant primary progressive aphasia; PD, Parkinson’s disease; PSP, progressive supranuclear palsy; svPPA, semantic variant primary progressive aphasia

Table 1 Demographic characteristics for patients and healthy controls

Gender distribution was tested pairwise with the Fisher test (Bonferroni-corrected) post hoc if the chi-square test indicated significant differences (chi-square = 38.855, p < 0.001). The gender distribution significantly differed between patients with bvFTD and PD (p < 0.001), MSA-P (p < 0.05), and PSP (p < 0.05). Furthermore, there was a significant difference in gender distribution between patients with PD and svPPA (p < 0.05).
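The two-stage procedure for gender distribution can be sketched in the same way: a chi-square test on the full group-by-gender table, followed by pairwise Fisher exact tests only if the omnibus test is significant. The counts, group labels, and function name below are hypothetical and serve only to show the control flow.

```python
from itertools import combinations

import numpy as np
from scipy.stats import chi2_contingency, fisher_exact


def compare_gender_distribution(counts, alpha=0.05):
    """counts maps a group name to (n_female, n_male)."""
    table = np.array(list(counts.values()))
    chi2, p_omnibus, dof, _ = chi2_contingency(table)
    pairwise = {}
    if p_omnibus < alpha:  # post hoc tests only after a significant omnibus test
        pairs = list(combinations(counts, 2))
        for a, b in pairs:
            # 2x2 table: two groups (rows) by two genders (columns).
            _, p_raw = fisher_exact([counts[a], counts[b]])
            pairwise[(a, b)] = min(1.0, p_raw * len(pairs))  # Bonferroni
    return chi2, p_omnibus, dof, pairwise


# Hypothetical counts for illustration only (not the study data).
counts = {"group_a": (30, 10), "group_b": (12, 28), "group_c": (20, 20)}
chi2, p_omnibus, dof, pairwise = compare_gender_distribution(counts)
print(round(chi2, 2), dof, pairwise)
```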

The study was conducted according to the Declaration of Helsinki. It was approved by the local ethics committees of all participating centers. Patients, participants, caregivers, or legal representatives gave written informed consent for the study.

Imaging acquisition and analysis

Standardized structural MRI head scans were acquired at multiple German university hospitals. Every subject underwent a T1-weighted three-dimensional (1-mm isovoxel resolution) magnetization-prepared rapid gradient echo (MPRAGE) brain scan [18, 24, 26]. The MPRAGE data were converted to ANALYZE 7.5 format, and the file names were pseudonymized before further processing. Whereas standardized operating procedures (SOPs) were applied throughout data acquisition, including MRI, in the German Research Consortium of Frontotemporal Lobar Degeneration, no sequence adjustment or homogenization between the centers was done in the German Atypical Parkinson Consortium Study. Instead, the MPRAGE sequence from the clinical routine at each center was used (for further information on MRI parameters, see papers and supplemental materials [4, 17, 18, 25, 27]). Atlas-based volumetric analysis of the MPRAGE data was done using the LONI Probabilistic Brain Atlas (LPBA40) [28], and further masks were derived from this atlas. The volumes of the atlas structures served as the input vector for the models. A detailed description of all image processing steps and the 63 atlas structures included can be found in [18]. Before being used as predictive features, all volume results were corrected for intracranial volume (ICV).

Training and evaluation of classifiers

In order to reduce the bias of the existing sampling distribution, we used 5-fold cross-validation with the full dataset: in each fold, models were trained on 80% of the data (4 folds) and tested on the remaining 20% (1 fold). The folds were selected randomly, and the experiments were repeated ten times. Thus, we trained and evaluated 50 models of each type (see Fig. 2). In each training iteration, we optimized the learning parameters and hyperparameters of the RF, GB, and SVM models using Bayesian optimization. In contrast to a grid search or a random search, Bayesian optimization searches sequentially and thus takes every previous search step into account. This leads to better optimization results [31]. For the weight update, we used Adam.
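The resampling scheme (ten repetitions of a randomly drawn 5-fold split, giving 50 train/test partitions per model type) can be illustrated with a short standard-library sketch. The function name and the seed are assumptions made for this example; the sketch covers only the index bookkeeping, not model training or hyperparameter optimization.

```python
import random


def repeated_kfold_splits(n_samples, n_folds=5, n_repeats=10, seed=0):
    """Yield (train_indices, test_indices) for every fold of every repetition."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # folds are drawn randomly in each repetition
        for k in range(n_folds):
            test = indices[k::n_folds]  # every n_folds-th shuffled index
            test_set = set(test)
            train = [i for i in indices if i not in test_set]
            yield train, test


# 940 subjects, as in the study cohort.
splits = list(repeated_kfold_splits(n_samples=940))
print(len(splits))                            # number of train/test partitions
print(len(splits[0][0]), len(splits[0][1]))   # training and test set sizes
```

With 940 subjects, each partition holds 752 training and 188 test subjects, and every subject is tested exactly once per repetition.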

Model-wise performance measures can be found in Table 2. Among the models evaluated in this study, the DNN yielded the best classification results, with a Cohen’s kappa score slightly larger than 0.4 and a total model accuracy of approximately 0.5. The second-best performance was obtained with SVM, followed by GB and RF. Furthermore, the variability over the 50 runs was lowest for DNNs, which is reflected by the lowest standard deviation. This indicates that, among the models compared, DNNs are the most reliable across different simulations.
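To make the relation between the two reported measures concrete: Cohen's kappa rescales the observed accuracy p_o by the agreement p_e expected by chance from the class frequencies, kappa = (p_o - p_e) / (1 - p_e). The following sketch computes both measures from true and predicted labels. The function name and the toy labels are assumptions for illustration and do not reproduce any result of the study.

```python
from collections import Counter


def accuracy_and_kappa(y_true, y_pred):
    """Return overall accuracy and Cohen's kappa for two label sequences."""
    n = len(y_true)
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: product of the marginal class frequencies, summed over classes.
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n**2
    kappa = (p_observed - p_expected) / (1 - p_expected)
    return p_observed, kappa


# Toy labels for illustration only.
y_true = ["PD", "PD", "PD", "PSP", "PSP", "AD"]
y_pred = ["PD", "PD", "PSP", "PSP", "AD", "AD"]
accuracy, kappa = accuracy_and_kappa(y_true, y_pred)
print(accuracy, kappa)  # accuracy is 4/6; kappa is 0.5 up to floating-point rounding
```

In a multi-class setting with unequal class sizes, kappa is lower than accuracy because part of the observed agreement would also be reached by a classifier that only follows the class frequencies.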

Model-wise performance measures are shown separately for each of the classes, i.e., diseases, in Table 3. Whereas some diseases such as PSP, svPPA, MSA-P, bvFTD, and PD reached relatively high classification performance, other classes, i.e., healthy controls and AD, reached intermediate values, and others such as lvPPA, MSA-C, and nfvPPA reached relatively low performance. Of note, CBS was characterized by the lowest performance results. The order of model performance across the whole cohort (DNN > SVM > GB > RF) was also observed for AD and bvFTD, whereas the other classes showed a more complex picture.

Table 2 Metrics for model comparison

Importance of brain regions

The LIME method allowed us to assess the contribution of each brain region to the classification of each syndrome within a model. A complete listing with the weighting of all brain regions for all models is publicly available in the project repository. In the interest of clarity, we display the five most important brain regions for all models for three selected pathologies with well-known atrophy patterns (i.e., AD, PSP, and svPPA; see Table 3). Note that the weighting of brain regions was averaged over all patients who were classified correctly by the respective model. All models independently identified the key regions, such as the midbrain for PSP, the inferior temporal gyrus on the left side for svPPA, and the hippocampus for AD.
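The aggregation step described above, i.e., averaging per-patient region weights over the correctly classified patients of one class and listing the highest-ranked regions, can be sketched as follows. The sketch assumes that the per-patient weights are already available as a matrix (patients by regions). The function name, the ranking by absolute mean weight, and all numbers are assumptions for illustration; the explanation step itself is not shown.

```python
import numpy as np


def top_regions(weights, y_true, y_pred, region_names, target_class, k=5):
    """Average explanation weights over correctly classified patients of one class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Keep only patients of the target class that the model classified correctly.
    correct = (y_true == target_class) & (y_pred == target_class)
    mean_weights = np.asarray(weights, dtype=float)[correct].mean(axis=0)
    # Assumption: regions are ranked by the absolute value of the mean weight.
    order = np.argsort(-np.abs(mean_weights))[:k]
    return [(region_names[i], float(mean_weights[i])) for i in order]


# Made-up weights for four patients and three regions (illustration only).
regions = ["midbrain", "hippocampus", "inferior temporal gyrus (left)"]
weights = [
    [0.5, 0.1, -0.2],  # PSP, classified correctly
    [0.3, 0.1, -0.4],  # PSP, classified correctly
    [0.9, 0.9, 0.9],   # PSP, misclassified, therefore excluded
    [0.0, 0.0, 0.0],   # AD
]
y_true = ["PSP", "PSP", "PSP", "AD"]
y_pred = ["PSP", "PSP", "AD", "AD"]
ranking = top_regions(weights, y_true, y_pred, regions, target_class="PSP", k=2)
print(ranking)
```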

Table 3 Class-wise performance metrics for multi-syndrome classification