Introduction

Atherosclerotic cardiovascular disease (ASCVD), defined as nonfatal acute myocardial infarction, coronary heart disease (CHD) death, and stroke, has become the leading cause of morbidity and mortality worldwide [1]. In China, cardiovascular disease (CVD) accounts for 38.9% of deaths among females and 35.5% of deaths among males [18]. Many studies have applied machine learning (ML) to the evaluation of various diseases, with good results [19,20,21]. Although the application of ML to cardiology has expanded rapidly [22,23,24], few direct comparisons have been made between ML and traditional ASCVD models [25, 26], and none of these studies included a Chinese population.

The present study aimed to establish ML-based risk prediction models from a dataset that integrates demographic, behavioral, psychological, electrocardiographic, and echocardiographic variables to predict ASCVD in a community-based general population in Northeast China. We also compared the performance of the ML algorithms with that of traditional Cox regression models (Pooled Cohort Equations [PCE] and China-PAR) to evaluate which approach provided superior predictive performance.

Methods

Study population

The Northeast China Rural Cardiovascular Health Study (NCRCHS) is a prospective population cohort of 11,956 participants aged ≥ 35 years, selected by multistage, stratified, random cluster sampling and recruited between Jan 9, 2013, and Aug 23, 2013, from rural residents of Liaoning Province, China. Demographics, physical status and vital signs, medical histories, echocardiography data, ECG exams, and laboratory data were collected. Consistent with the target population of contemporary risk prediction scores, participants were included if they were aged between 35 and 85 years and had no history of CVD. Of the 11,956 participants assessed for eligibility, 10,349 (86.6%) completed at least one follow-up visit.

Clinical demographics

At baseline, trained staff conducted face-to-face interviews using standardized questionnaires to collect detailed information on clinical demographics (sex, age, marital status, education, ethnicity, etc.), lifestyle factors, and personal and family medical histories (heart disease, stroke, diabetes, hypertension, etc.) (Supplementary file 1). Weight (to the nearest 0.1 kg), height, and waist circumference (to the nearest 0.1 cm) were measured. The body mass index (BMI) was calculated as weight in kilograms divided by height in meters squared. Blood pressure (BP) was measured three times after 5 min of rest using a standardized automatic electronic sphygmomanometer (HEM-907; Omron, Tokyo, Japan), and the mean systolic and diastolic values were calculated. The questionnaire was checked by trained staff at the end of each participant’s follow-up to ensure that the data collected were complete and accurate. Paper questionnaires were manually double-entered and then saved. Blood samples were collected from all participants in the morning after > 12 h of overnight fasting.

Hypertension was defined as a mean systolic blood pressure > 140 mmHg and/or a mean diastolic blood pressure > 90 mmHg, or the use of antihypertensive medications. Diabetes mellitus was defined by medical history and/or use of insulin or oral hypoglycemic agents. Participants were considered current smokers or drinkers if they had smoked or consumed alcohol at any point in the 3 months before the date of the ECG examination.
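The derived baseline variables described above can be illustrated with a minimal sketch. The function names and the example readings are hypothetical and are not taken from the study code; only the formulas (BMI as kg/m², the mean of three BP readings, and the > 140/90 mmHg or medication rule) follow the text.

```python
from statistics import mean


def bmi(weight_kg, height_cm):
    """Body mass index: weight in kilograms divided by height in meters squared."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


def mean_bp(readings):
    """Average a list of (systolic, diastolic) readings in mmHg."""
    sbp = mean(r[0] for r in readings)
    dbp = mean(r[1] for r in readings)
    return sbp, dbp


def has_hypertension(readings, on_antihypertensives=False):
    """Mean SBP > 140 mmHg and/or mean DBP > 90 mmHg, or antihypertensive use."""
    sbp, dbp = mean_bp(readings)
    return sbp > 140 or dbp > 90 or on_antihypertensives


# Hypothetical participant: three readings taken after 5 min of rest.
example_readings = [(150, 85), (146, 88), (148, 86)]
example_bmi = bmi(weight_kg=70.0, height_cm=175.0)
example_htn = has_hypertension(example_readings)
```

Here `example_htn` is true because the mean systolic pressure exceeds 140 mmHg even though the mean diastolic pressure does not exceed 90 mmHg.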

Electrocardiography and echocardiography measurements

Standard 12-lead ECGs (MAC 5500, GE Healthcare, Little Chalfont, UK) were recorded in the resting supine position at baseline and were analyzed automatically with the MUSE Cardiology Information System, version 7.0.0 (GE Marquette™ 12SL™ ECG analysis program) [27]. A total of 645 parameters were derived from the unprocessed digital ECG data by the GE system. Of these, 201 parameters (9 not lead-specific and 192 [16 × 12] lead-specific), comprising temporarily stored relative coordinate points (the start point of the P wave, etc.) and calculated values (QTc Framingham, QTc Fridericia, etc.), were excluded. The remaining 444 variables were used for analysis.

Transthoracic Doppler echocardiography (Vivid; GE Healthcare, Connecticut, USA), comprising M-mode, two-dimensional, spectral, and color Doppler formats, was performed by sonographers with a 3.0-MHz transducer. Three professional doctors read and analyzed the echocardiograms and had the option to consult two additional specialists if questions or uncertainties arose. A total of 9 echocardiographic parameters were used for analysis.

Outcome assessment

The primary endpoints were stroke and CHD. Health status, hospital admissions, outpatient diagnoses, and deaths of each participant were followed up from 2015 to 2018. Two physicians independently reviewed medical records, categorized the events, and specified the event dates. Stroke was defined as a sudden onset of focal neurological dysfunction lasting at least 24 h or until death, or lasting less than 24 h but with a clinically relevant brain lesion. CHD was defined to include any myocardial infarction (MI), resuscitated cardiac arrest, definite angina, probable angina followed by revascularization, and CHD death.

Machine learning

Feature selection

In this study, a total of 635 candidate variables were collected, and 84 variables with a missing ratio greater than 10% were excluded. For the remaining variables, missing values were imputed using multiple imputation. The filtered dataset included 551 variables: 98 demographic, behavioral, and psychological variables (age, sex, BP, BMI, lifestyle, biochemical tests, etc.), 444 ECG parameters, and 9 echocardiographic parameters. Detailed descriptions of the features are available in Table S1. Feature selection was implemented using Recursive Feature Elimination (RFE), which reduces the feature dimension and retains the most discriminative information by selecting the most relevant variables and removing redundant ones. During the recursion, an optimal subset of candidates is generated by repeatedly eliminating the least important features from the complete feature set (Figure S1).
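The two filtering steps described above (dropping variables with more than 10% missing values, then recursively eliminating the weakest features) can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: the toy data are invented, and the importance function is a hypothetical stand-in (absolute Pearson correlation with the outcome) for the model-based importance that an RFE run would normally recompute after refitting in each round.

```python
def missing_ratio(column):
    """Fraction of missing (None) values in one variable."""
    return sum(v is None for v in column) / len(column)


def drop_sparse(columns, max_missing=0.10):
    """Keep variables whose missing ratio does not exceed max_missing."""
    return {k: v for k, v in columns.items() if missing_ratio(v) <= max_missing}


def abs_corr(x, y):
    """Absolute Pearson correlation; used here only as a toy importance score."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return abs(cov) / (vx * vy) ** 0.5 if vx and vy else 0.0


def rfe(features, importance_fn, n_keep, step=1):
    """Repeatedly rescore the active features and drop the least important."""
    active = list(features)
    while len(active) > n_keep:
        scores = importance_fn(active)  # a real model would be refit here
        n_drop = min(step, len(active) - n_keep)
        weakest = sorted(active, key=lambda f: scores[f])[:n_drop]
        active = [f for f in active if f not in weakest]
    return active


# Toy data (hypothetical): outcome y and four candidate variables.
y = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "age": [40, 45, 50, 52, 60, 66, 70, 75],
    "sbp": [118, 125, 130, 128, 150, 145, 160, 155],
    "noise": [3, 1, 4, 1, 5, 9, 2, 6],
    "sparse": [None, None, 1, 2, None, 3, 4, 5],  # 37.5% missing
}
kept = drop_sparse(columns)
selected = rfe(
    list(kept),
    importance_fn=lambda active: {f: abs_corr(kept[f], y) for f in active},
    n_keep=2,
)
```

In this toy example the sparse variable is removed by the missingness filter and the weakly associated variable is eliminated by the recursion, leaving `age` and `sbp`.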

Feature importance

To determine the major predictors of ASCVD in our study population, permutation feature importance was computed from the final model. Permutation feature importance quantifies the importance of each feature as the increase in the model's prediction error after the feature's values are permuted. A feature is considered important if permuting its values decreases the discriminative capability of the model, because the model depends on that feature for prediction. A feature is unimportant if the mean area under the receiver operating characteristic curve (AUC) remains unchanged after its values are permuted, because the model ignores that feature for prediction.
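A minimal sketch of permutation importance measured as the mean drop in AUC is given below. The `predict` function, the toy data, and the number of repeats are assumptions made for illustration; in the study the scored model would be the final fitted classifier.

```python
import random


def auc(y_true, scores):
    """AUC as the probability that an event outranks a non-event (ties count 0.5)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Return the baseline AUC and the mean AUC drop after shuffling each column."""
    rng = random.Random(seed)
    baseline = auc(y, [predict(row) for row in X])
    drops = []
    for j in range(len(X[0])):
        total = 0.0
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)  # break the link between feature j and the outcome
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            total += baseline - auc(y, [predict(r) for r in X_perm])
        drops.append(total / n_repeats)
    return baseline, drops


# Toy data (hypothetical): column 0 is informative, column 1 is ignored by the model.
X = [[40, 3], [45, 1], [50, 4], [52, 1], [60, 5], [66, 9], [70, 2], [75, 6]]
y = [0, 0, 0, 0, 1, 1, 1, 1]


def predict(row):
    """Stand-in risk score that uses only the first feature."""
    return row[0]


baseline_auc, auc_drops = permutation_importance(predict, X, y)
```

Because the stand-in model ignores the second column, shuffling it leaves the AUC unchanged, whereas shuffling the first column lowers it.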

Model building and testing

The dataset was randomly split into training (80%) and testing (20%) sets. Model development included trials of several ML classifiers: Artificial Neural Network (ANN), Random Forest (RF), Gradient Boosting Machine (GBM), K Nearest Neighbors (KNN), Adaptive Boosting (AdaBoost), Support Vector Machine (SVM), and Categorical Boosting (CatBoost). Models were trained on the optimal feature subset and evaluated with stratified 10-fold cross-validation on the training set, and a grid search was used to determine appropriate hyperparameters for each ML model [9, 28] (Table S2). To address class imbalance, we assigned larger weights to minority-class samples, which increases their misclassification cost. We then evaluated the performance of the PCE (White), China-PAR, recalibrated PCE (White), recalibrated China-PAR, and ML-based risk prediction models in terms of discrimination, calibration, net benefit, and net reclassification improvement (NRI) (Table S3).
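The split and the class weighting described above can be sketched as follows. Two points are assumptions rather than details reported in the text: the 80/20 split is stratified by outcome here, and the weights use the common "balanced" formula n / (k × n_c), where n is the sample size, k the number of classes, and n_c the size of class c. The simulated outcome vector is hypothetical.

```python
import random
from collections import Counter


def stratified_split(y, test_frac=0.20, seed=0):
    """Split indices so that each class contributes test_frac of its members to the test set."""
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test += idx[:n_test]
        train += idx[n_test:]
    return sorted(train), sorted(test)


def balanced_class_weights(y):
    """Weight each class inversely to its frequency: n / (k * n_c)."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}


# Hypothetical outcome vector with a 4.5% event rate.
y_all = [1] * 45 + [0] * 955
train_idx, test_idx = stratified_split(y_all)
y_train = [y_all[i] for i in train_idx]
weights = balanced_class_weights(y_train)
```

With these weights, misclassifying an event costs roughly 21 times as much as misclassifying a non-event in this toy example, which is the intended effect of up-weighting the minority class.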

Statistical analysis

Categorical variables are presented as count (%), and continuous variables are reported as mean ± SD. The Brier score and Matthews correlation coefficient (MCC) were used to assess overall performance [29, 30]. The calibration of the models was tested with the Hosmer–Lemeshow χ2 statistic [31]. Pairwise comparisons of AUCs between all predictive models were performed using the DeLong test [32]. Decision-curve analysis (DCA) was used to quantify the net benefit of each risk prediction model [33]. Statistical significance was defined as two-tailed P < 0.05. All analyses were performed with R version 4.1.2.
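For readers less familiar with these metrics, the sketch below shows how the Brier score, the MCC, and the Hosmer–Lemeshow χ2 statistic can be computed from observed outcomes and predicted probabilities. It is written in Python for illustration, whereas the study analyses were run in R. The classification threshold used for the MCC and the number of risk groups are assumptions, since the text does not report them.

```python
from math import sqrt


def brier_score(y, p):
    """Mean squared difference between predicted probability and outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def mcc(y, p, threshold=0.5):
    """Matthews correlation coefficient after dichotomizing p at threshold."""
    pred = [int(pi >= threshold) for pi in p]
    tp = sum(a == 1 and b == 1 for a, b in zip(y, pred))
    tn = sum(a == 0 and b == 0 for a, b in zip(y, pred))
    fp = sum(a == 0 and b == 1 for a, b in zip(y, pred))
    fn = sum(a == 1 and b == 0 for a, b in zip(y, pred))
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def hosmer_lemeshow(y, p, groups=10):
    """Sum over risk groups of (O - E)^2 / (E * (1 - E / n))."""
    order = sorted(range(len(y)), key=lambda i: p[i])
    chi2 = 0.0
    for g in range(groups):
        idx = order[g * len(y) // groups:(g + 1) * len(y) // groups]
        n = len(idx)
        observed = sum(y[i] for i in idx)
        expected = sum(p[i] for i in idx)
        if n and 0 < expected < n:  # skip empty or degenerate groups
            chi2 += (observed - expected) ** 2 / (expected * (1 - expected / n))
    return chi2


# Toy example (hypothetical values).
y_obs = [0, 0, 1, 1]
p_hat = [0.1, 0.2, 0.8, 0.9]
toy_brier = brier_score(y_obs, p_hat)
toy_mcc = mcc(y_obs, p_hat)
toy_hl = hosmer_lemeshow(y_obs, p_hat, groups=2)
```

The Hosmer–Lemeshow statistic is conventionally compared with a χ2 distribution with (groups − 2) degrees of freedom; a smaller value indicates closer agreement between predicted and observed event counts.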

Results

Study population

A total of 9,609 participants (mean [SD] age: 53.4 [10.4] years; 46.3% male) with digital ECG and echocardiography data who were free of CVD at baseline were included in the final cohort (Fig. 1). During a median follow-up of 4.7 (IQR, 4.4–4.9) years, 431 (4.5%) participants developed ASCVD (Table 1). In total, 7,688 participants were included in the training cohort and 1,921 in the test cohort.

Fig. 1

Flow chart of inclusion of participants for final analyses

Table 1 Baseline clinical characteristics

Performance of traditional ASCVD prediction models

Figure 2 depicts the discrimination and calibration of the PCE (White) and China-PAR models before and after recalibration. All models showed moderate discrimination, with the highest discrimination shown by the China-PAR model (AUC 0.780). However, all models exhibited poor calibration, with Hosmer–Lemeshow χ2 values greater than 18 (p < 0.05). The Brier score ranged from 0.043 to 0.057 and the MCC from 0.186 to 0.194 (Table 2).

Fig. 2

Discrimination and Calibration of Contemporary Prediction Models in Our Cohort. A, Receiver operating characteristic (ROC) curve analysis for contemporary prediction models. B, Hosmer–Lemeshow calibration plots of contemporary prediction models. Abbreviations: ReChina-PAR, Recalibrated China-PAR; RePCE, Recalibrated PCE

Table 2 Performance of risk prediction models in the test cohort

Comparison of ML-based ASCVD models and traditional models

Through stepwise model building and the RFE algorithm, the final ML-based ASCVD models were reduced to 30 key predictor variables (Table 3). Figure 3 depicts the discrimination and calibration of the ML classifiers. The ANN algorithm outperformed the other classifiers, with the greatest AUC value and the best consistency between predicted and observed risk. The AUC of the ANN model was 0.800, numerically higher than that of the China-PAR and PCE models, although the differences did not reach statistical significance (p = 0.12 and 0.08, respectively). Calibration of the ANN model was clearly improved: the Hosmer–Lemeshow χ2 value was 9.1 (p = 0.33), and the Brier score and MCC of the ANN model were 0.041 and 0.216, respectively, indicating better overall performance than the traditional regression models. Decision curve analysis (DCA) demonstrated that the ANN model provided a greater net benefit across a range of thresholds (Fig. 4). At a threshold of 5%, the ANN model had the greatest net benefit (0.017) among all models (Table 2). We also assessed the NRI of the ML models relative to the traditional models: the ANN model correctly classified more events and more non-events than China-PAR, PCE, recalibrated China-PAR, and recalibrated PCE (NRI: 0.355, 0.089, 0.088, and 0.098, respectively; all p < 0.05) (Table 4).
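The net benefit and NRI reported above can be illustrated with a short sketch. Net benefit at a threshold probability p_t is TP/n − (FP/n) × p_t / (1 − p_t). For the NRI, the text does not state whether a categorical or a continuous version was used, so the two-category version at a single threshold shown here is an assumption, and the predicted risks are invented for illustration.

```python
def net_benefit(y, p, threshold):
    """Decision-curve net benefit: TP/n - FP/n * pt / (1 - pt)."""
    n = len(y)
    tp = sum(yi == 1 and pi >= threshold for yi, pi in zip(y, p))
    fp = sum(yi == 0 and pi >= threshold for yi, pi in zip(y, p))
    return tp / n - fp / n * threshold / (1 - threshold)


def nri(y, p_old, p_new, threshold):
    """Two-category NRI: net upward moves in events minus net upward moves in non-events."""
    def net_up(label):
        rows = [(o >= threshold, n >= threshold)
                for yi, o, n in zip(y, p_old, p_new) if yi == label]
        up = sum((not o) and n for o, n in rows)    # reclassified to high risk
        down = sum(o and (not n) for o, n in rows)  # reclassified to low risk
        return (up - down) / len(rows)
    return net_up(1) - net_up(0)


# Hypothetical predicted risks from an old and a new model for ten participants.
y_obs = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
p_old = [0.06, 0.03, 0.07, 0.02, 0.02, 0.01, 0.03, 0.02, 0.01, 0.04]
p_new = [0.08, 0.09, 0.03, 0.02, 0.02, 0.01, 0.03, 0.02, 0.01, 0.04]

nb_old = net_benefit(y_obs, p_old, threshold=0.05)
nb_new = net_benefit(y_obs, p_new, threshold=0.05)
toy_nri = nri(y_obs, p_old, p_new, threshold=0.05)
```

In this toy example the new model moves one event above the 5% threshold and one non-event below it, so both its net benefit and its NRI are higher than those of the old model.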

Table 3 Predictor variables in ASCVD models

Fig. 3

Discrimination and Calibration of Machine Learning-based ASCVD Models in Test Cohort. A, Receiver operating characteristic (ROC) curve analysis for machine learning-based ASCVD prediction models. B, Hosmer–Lemeshow calibration plots of machine learning-based ASCVD prediction models. Abbreviations: ASCVD, Atherosclerotic cardiovascular disease; ANN, Artificial Neural Network; RF, Random Forest; GBM, Gradient Boosting Machine; KNN, K Nearest Neighbor; Adaboost, Adaptive Boosting; SVM, Support Vector Machine; Catboost, Categorical Boosting

Fig. 4

Decision Curves for PCE, China-PAR and Machine Learning-based Models. Abbreviations: PCE, Pooled Cohort Equations; ReChina-PAR, Recalibrated China-PAR; ANN, Artificial Neural Network; RF, Random Forest; GBM, Gradient Boosting Machine; KNN, K Nearest Neighbor; Adaboost, Adaptive Boosting; SVM, Support Vector Machine; Catboost, Categorical Boosting

Table 4 Net reclassification improvement (NRI) in the test set

Variable importance

The leading predictors of the ANN ASCVD model are shown in Fig. 5, and a complete table of feature importance is available in Table 3. Age, systolic blood pressure (SBP), R Area in V2, Max R Amplitude, and I.T Area (Full) in V2 were the most important features for predicting ASCVD.

Fig. 5

Radar Plot for the Ten Most Important Predictors of ASCVD. As the feature importance values spanned several orders of magnitude, a base-10 logarithmic transformation was applied to facilitate plotting

Discussion

In a community-based general population of 11,956 adults from Northeast China, we developed seven ML models based on different algorithms. After extensive evaluation, the ANN model was chosen as the best model. The ANN model includes 30 predictors and can accurately and efficiently predict 5-year ASCVD risk in individuals with no history of CVD. Compared with the traditional regression models (China-PAR and PCE), the ANN model showed higher discrimination, better calibration, greater net benefit, and improved NRI in predicting ASCVD. In addition, our study ranked candidate variables by their importance in predicting ASCVD, which may help ASCVD risk stratification and management.

Early detection of high-risk individuals is the most effective approach to reducing the escalating incidence of ASCVD across multiple countries, yet available prediction models have not improved substantially [34]. The PCE integrates several cardiovascular risk predictors to assess an individual’s 10-year risk of ASCVD and to guide treatment decisions. Because the distribution of cardiovascular risk factors differs between Asian and Western populations [35], clinical decisions may be influenced by over- or underestimation of risk. In the Korean KHS cohort, the PCE provided moderate discrimination, but absolute 10-year ASCVD risk was overestimated by 56.5% for men and underestimated by 27.9% for women [8]. In the CHERRY Study in Southeastern China, the PCE overestimated risk by 63% in men and underestimated it by 34% in women [9]. In the “stroke belt” of Northern China, the Fangshan Cohort Study found that the PCE underestimated risk by 76.2% for men and 88.2% for women, with poor calibration [10].

The China-PAR model was derived from multiple contemporary Chinese cohorts, but external validation studies of the model are limited. In the CHERRY study, the China-PAR model underestimated risk by 20% in men and 40% in women [9]. In the Fangshan Cohort, the China-PAR model overestimated risk by 29.4% in women [10]. When the PCE and China-PAR models were applied to our cohort, the PCE overestimated risk by 63.8% in men but underestimated it by 10.3% in women, whereas the China-PAR model underestimated risk by 55.5% in men and 52.4% in women. After recalibration, the recalibrated PCE overestimated risk by 144.8% in men and 129.9% in women, and the recalibrated China-PAR model overestimated risk by 98.3% in men and 64.7% in women. All PCE and China-PAR models had poor calibration despite good discrimination. A potential reason for the differences across Chinese populations is regional disparity [36]. Residents of Northeast China tend to have a diet high in sodium and fat, which contributes to the high prevalence of ASCVD, up to 12.6%, in this area [37].
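Percentages of over- and underestimation such as those quoted above are commonly obtained by comparing the number of events a model expects with the number actually observed. The sketch below assumes that convention, (expected − observed) / observed × 100; the text does not state the exact formula used, and the counts are hypothetical rather than study data.

```python
def percent_miscalibration(expected_events, observed_events):
    """Relative difference between expected and observed events, in percent.

    Positive values indicate overestimation of risk; negative values
    indicate underestimation.
    """
    return (expected_events - observed_events) / observed_events * 100.0


# Hypothetical counts for illustration only (not study data).
over = percent_miscalibration(expected_events=82.0, observed_events=50.0)
under = percent_miscalibration(expected_events=22.5, observed_events=50.0)
```

Under this convention, a model that expects 82 events where 50 occur overestimates risk by about 64%, and one that expects 22.5 events underestimates it by about 55%.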

These contemporary ASCVD risk calculators are parsimonious models based on a limited number of clinical risk variables, so the potential influence of intricate and hidden interactions between weaker predictors may be overlooked. With the expansion of artificial intelligence, ML algorithms have emerged as highly effective methods for solving medical prediction problems in large-scale datasets and for supporting guideline-directed management based on risk assessment [38]. Consistent with previous studies [25, 39,40,41], our ANN-based ASCVD prediction model exhibited improved prediction performance compared with conventional models (Table S4). The AUC of the ANN model was higher by 0.023, 0.02, 0.021, and 0.02 than that of the PCE (White), China-PAR, recalibrated PCE, and recalibrated China-PAR models, respectively (p = 0.08, 0.12, 0.12, and 0.12), while calibration was substantially better (HL χ2 = 9.1 vs. 37.3, 67.6, 126.6, and 18.6). In addition, DCA, which accounts for the consequences of false negatives (undertriage) and false positives (overtriage), suggests an increased net clinical benefit with use of the ANN model compared with traditional models if the ideal risk threshold for medical consultation lies between 5 and 10%. The NRI results also highlight the ability of the ANN model to augment ASCVD prediction and provide a better risk stratification strategy for patients.

The observed incremental gains compared with traditional methods demonstrate the potential value of machine learning algorithms. The major advantage of ML algorithms over linear models is their capacity to capture the complex underlying interactions among many features and to improve out-of-sample predictions [42]. Although ML algorithms are inherently complex and difficult to interpret, ML models remain attractive because of their more accurate predictions and their capacity to assimilate and evaluate large amounts of complex healthcare data.

Limitations

Our study has several limitations. First, the ML algorithm cannot assess the independent effect of each variable on events, so it may be difficult to identify specific treatments to reduce individual risk. However, as the volume of data increases, ML algorithms allow for more in-depth prospective studies to identify causative factors and interaction mechanisms. Second, longer follow-up is needed considering the chronic, progressive course of ASCVD. Third, we excluded 84 variables with > 10% missing data that might have predictive value, and the imputation of missing data might bias the analysis. However, multiple imputation is considered an accurate method for handling missing data. Fourth, external validation studies are required to demonstrate the accuracy of the model’s predictions in diverse populations. Finally, the models were established using blood pressure and glucose values from the initial assessment, which may have changed during the follow-up period; such changes were not taken into account in the model.

Conclusion

In this contemporary cohort from Northeast China, we observed that the PCE and China-PAR models provided adequate discrimination but poor calibration in predicting 5-year ASCVD risk. In contrast, the ANN-based model incorporating 30 clinical variables outperformed the PCE and China-PAR models, even after their recalibration. Our study highlights the limitations of traditional risk prediction models for ASCVD and demonstrates the potential of machine learning algorithms to improve risk prediction accuracy. Further studies are required to demonstrate the benefits of ML algorithms and to support clinicians' triage decision-making.