Background

With the development of information technology, the volume of medical data is growing rapidly. Many researchers analyze electronic medical record data to inform medical diagnosis, treatment, and prognosis, and machine learning methods have been widely used in the medical field. However, medical data may suffer from outliers and class imbalance, both of which can degrade the performance of machine learning models [1, 2]. Therefore, outliers and imbalanced data need to be handled effectively during modeling to improve prediction accuracy.

Outlier detection is the process of finding observations that lie far from the majority of observations. Many studies have shown that removing outliers improves classification accuracy [3,4,5,6,7]. Podgorelec et al. and Li et al. used outlier detection techniques to remove detected outliers from the training set and thereby improved the classification accuracy of machine learning methods [5, 7]. There are many outlier detection techniques, and there is no consensus on which should be used. The cross-validated committees filter (CVCF) [8] is an ensemble filter based on majority voting. CVCF has no complicated parameter settings and does not require a threshold to separate outliers from inliers [8, 9]. Therefore, this study adopts CVCF as an example for outlier detection and removal in modeling.

The performance of machine learning can be affected by class imbalance [1]. In general, classifier performance decreases as the imbalance ratio (IR, the ratio of the majority class to the minority class) increases. However, IR is not the only factor affecting classifier performance; class overlap also contributes to the decrease [10]. Even when the IR is not very high, classifier performance can drop significantly when the classes overlap heavily. A hybrid resampling method, the synthetic minority oversampling technique combined with the edited nearest neighbor rule (SMOTEENN) [11], was proposed not only to balance the training set but also to remove noisy examples lying on the wrong side of the decision border, which may be introduced by SMOTE [11]. Moreover, some studies showed that models performed better after hybrid resampling than after a single resampling method [11, 12]. Therefore, several commonly used resampling methods, namely random under-sampling (RUS), random over-sampling (ROS), adaptive synthetic sampling (ADASYN) [13], Borderline SMOTE [14], and SMOTEENN, are used to balance the training set.

Machine learning methods can discover non-linear relationships and explore deeper information in data, and they have great potential for prediction. Although machine learning methods are widely used, their performance varies from one dataset to another, and no single method performs well on all data. For example, in the field of intracerebral hemorrhage (ICH) mortality and prognosis prediction, Guo et al. used logistic regression (LR), random forest (RF), support vector machine (SVM), and other methods to predict the 90-day functional outcome of patients with ICH, and LR had the highest AUC of 0.89 [15]. Bacchi et al. used four methods, including LR, RF, decision trees (DT), and artificial neural networks (ANN), to predict in-hospital mortality of patients with stroke, and LR performed best with an AUC of 0.90 [16]. Nie et al. used nearest neighbors, DT, ANN, AdaBoost, and RF to predict in-hospital mortality of patients with cerebral hemorrhage in intensive care units, and RF had the highest AUC of 0.819 [17]. Four other studies also achieved good performance (high AUC) using RF [18,19,20,21]. Lim et al. used SVM to predict 30-day mortality and 90-day poor functional outcome of ICH patients, with good AUC values of 0.9 and 0.883, respectively [22].

Stacking ensemble learning [23], which combines different single classifiers, usually performs better than any single classifier [24]. It has been used increasingly in medicine in recent years and has achieved good performance, for example, in predicting the prognosis of patients with glioma [25], predicting adult outcomes in childhood-onset ADHD [26], and predicting the recurrence of colorectal cancer [27]. Therefore, we use stacking ensemble learning to combine different machine learning methods that have been applied in the prognosis and mortality prediction of patients with ICH.

In this study, we propose a joint modeling strategy to provide a reference for physicians and researchers who want to build their own models. It consists of outlier detection and removal, data balancing, model fitting and prediction, and performance evaluation.

Materials and methods

Data sources

This is a retrospective study, and the data were extracted from the database of the Comprehensive Data Collection and Decision Support System for health statistics in Sichuan Province (CDCDS). This database was built by the Sichuan government on January 1, 2017, and covers all ICH admissions in the province. It includes medical record information from all general hospitals and community hospitals in Sichuan. We collected medical record information for all ICH patients admitted in 2017–2019. Patients were identified using the International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM). Patients with nontraumatic intracerebral hemorrhage (code I61) were included in the study.

Medical record information includes the clinical and radiological information of the patient at the time of hospitalization. Clinical variables included age, gender, Glasgow Coma Scale (GCS) score at admission, the presence of chronic comorbidity (hypertension and diabetes), treatment (surgery or not), and infection status. GCS score at admission was estimated and determined by physicians. Hypertension and diabetes were either diagnosed by doctors or self-reported by patients. Treatment refers to whether patients underwent surgery while in the hospital. Infection refers to whether patients developed an infection after surgery.

Radiological variables were estimated and determined by physicians using head computed tomography (CT) scans; they include ICH location (supratentorial superficial, supratentorial deep, cerebellar, brain stem, intraventricular hemorrhage (IVH)) and hematoma volume (measured by the ABC/2 method). These variables were regularly collected during the hospitalization of patients with ICH.

The outcome of this study was whether patients died within 90 days after discharge. The 90-day mortality data were obtained from the Ministry of Civil Affairs through unique personal identification numbers.

Variable selection

We divided age into five categories (40–54, 55–64, 65–74, 75–84, ≥ 85 years). According to clinical criteria, GCS score at admission was divided into three categories (13–15, 9–12, 3–8), indicating mild, moderate, and severe coma, respectively.
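For illustration, these cut-points can be applied with `pandas.cut`; the ages and GCS scores below are hypothetical, and only the bin edges follow the study's categories:

```python
import pandas as pd

age = pd.Series([45, 60, 70, 80, 90])   # hypothetical ages
gcs = pd.Series([15, 10, 5])            # hypothetical GCS scores at admission

# Right-closed bins matching 40-54, 55-64, 65-74, 75-84, >=85.
age_cat = pd.cut(age, bins=[40, 54, 64, 74, 84, float("inf")],
                 labels=["40-54", "55-64", "65-74", "75-84", ">=85"],
                 include_lowest=True)
# 13-15 = mild, 9-12 = moderate, 3-8 = severe coma.
gcs_cat = pd.cut(gcs, bins=[3, 8, 12, 15],
                 labels=["severe", "moderate", "mild"],
                 include_lowest=True)
```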

In this study, the data have only 10 independent variables, so they are not high-dimensional, and univariate analysis was used to select variables. Because all variables are categorical, the chi-square test or Fisher's exact test was used for selection.
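This univariate screen can be sketched with SciPy; the contingency tables below are made up, and the small-expected-count fallback to Fisher's exact test reflects common practice rather than a rule reported by the study:

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

def univariate_test(table, alpha=0.05):
    """Test one categorical predictor against the outcome.

    Uses the chi-square test; for a 2x2 table with any expected
    count below 5, falls back to Fisher's exact test.
    Returns (p_value, selected) where selected means p < alpha.
    """
    table = np.asarray(table)
    _, p, _, expected = chi2_contingency(table)
    if table.shape == (2, 2) and expected.min() < 5:
        _, p = fisher_exact(table)
    return p, p < alpha

# Hypothetical predictor-by-outcome counts (rows: levels, cols: alive/dead).
p_strong, keep_strong = univariate_test([[300, 40], [150, 90]])
p_weak, keep_weak = univariate_test([[2, 3], [3, 2]])
```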

The results of the univariate analysis showed that age and diabetes were not statistically significant. Considering that the P value of age was close to 0.05 and that age is an important factor for ICH, the age variable was retained for modeling. Therefore, excluding diabetes, 9 predictors were used for modeling: age, gender, GCS score at admission, hypertension, surgery, infection, ICH location, supratentorial hemorrhage volume, and infratentorial hemorrhage volume.

Joint modeling strategy

Physicians can use the information of patients with ICH at the time of hospitalization to predict 90-day mortality after discharge. After ICH patients are admitted to the hospital and treated (i.e., after the relevant variables have been collected), physicians can advise patients (whether to continue treatment or not) based on clinical experience and the model's prediction. However, physicians and researchers must consider many factors in modeling, such as outliers, imbalanced data, model selection, and parameter tuning. This study shows the use of different methods for handling outliers and imbalanced data, and for model selection. The joint modeling strategy includes the following steps: outlier detection and removal, data balancing, model fitting and prediction, and performance evaluation. To emphasize the importance of outlier removal and data balancing, we compared model performance with and without the corresponding processing. The flow chart is shown in Fig. 1.

Fig. 1

The joint modeling strategy flowchart

We used 10-fold cross-validation (CV) to estimate the results of the models, keeping the IR of each fold the same. The final results were the averages over the 10 test sets, and the 95% confidence interval (95% CI) was estimated from the results of the 10 test sets.
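A sketch of this evaluation scheme on synthetic stand-in data (the real features and model differ): `StratifiedKFold` is what keeps the IR identical across folds, and the 95% CI here is a t interval over the 10 fold scores, which is one common choice:

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced synthetic data standing in for the ICH cohort.
X, y = make_classification(n_samples=600, weights=[0.7, 0.3], random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # same IR per fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")

mean_auc = scores.mean()
# t interval with 9 degrees of freedom over the 10 fold scores.
half = stats.t.ppf(0.975, df=len(scores) - 1) * stats.sem(scores)
ci95 = (mean_auc - half, mean_auc + half)
```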

Step 1 outlier detection and removal

In this study, we used CVCF to detect and remove outliers. R 4.0.2 and the "NoiseFiltersR" package were used to implement CVCF, with its parameters left at the package defaults. We removed the outliers detected by CVCF from the training set before further analysis.

In this study, missing values were not processed because there were no missing values.

Step 2 data balancing

Although the IR of this study is not very high, we still wanted to provide physicians with a reference for imbalanced-data processing methods.

Five resampling methods, namely random under-sampling (RUS), random over-sampling (ROS), adaptive synthetic sampling (ADASYN), Borderline SMOTE, and SMOTEENN, were used to balance the training set according to the outcome variable. Python 3.8.3 and the imbalanced-learn (scikit-learn-contrib) library were used to implement the resampling methods, with their parameters set to the default values.

Step 3 model fitting and prediction

Stacking ensemble learning was used to combine different machine learning methods that have been applied to the prediction of outcomes in patients with ICH.

Stacking consists of a two-stage modeling process. In the first stage, different methods (base classifiers) are trained on the training set. In the second stage, a meta classifier is trained with the outputs of the base classifiers as input and the true labels of the training set as output. In this study, five commonly used methods, logistic regression (LR), random forest (RF), artificial neural network (ANN), support vector machine (SVM), and k-nearest neighbors (KNN), were used as base classifiers. There is no general criterion for selecting the meta classifier; therefore LR, a classical method, was chosen. The stacking model is shown in Fig. 2.
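The two-stage scheme maps directly onto scikit-learn's `StackingClassifier`, which generates the base-classifier outputs for stage two via internal cross-validation; the hyperparameters below are illustrative defaults, not the tuned values of Table 2, and the data are synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Stage one: the five base classifiers used in this study.
base = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(random_state=0)),
    ("ann", MLPClassifier(max_iter=1000, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
    ("knn", KNeighborsClassifier()),
]
# Stage two: LR as the meta classifier, trained on the base outputs.
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
stack.fit(X_tr, y_tr)
acc = stack.score(X_te, y_te)
```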

Fig. 2

Stacking model

Ensemble learning generally includes bagging, boosting, and stacking, so we also compared the three ensemble learning approaches. For bagging, random forest (RF) was chosen because it is commonly used and robust [28]. For boosting, we chose AdaBoost, one of the most famous and classic methods [29]. All combinations of the joint modeling strategy are shown in Table 1. The optimal parameters for each model were selected by grid search using 5-fold cross-validation, and the parameter settings are shown in Table 2.
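Grid search with 5-fold CV can be sketched as follows; the grid shown is hypothetical (the actual grids are those listed in Table 2), and the data are a synthetic stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# Hypothetical RF grid; each candidate setting is scored by 5-fold CV.
param_grid = {"n_estimators": [100, 200], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)
best_params = search.best_params_
```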

Table 1 All combinations of joint modeling strategy
Table 2 The parameter settings

Step 4 performance evaluation

We used the confusion matrix for performance evaluation [30]. The confusion matrix tabulates the counts of predicted versus actual values. In this study, six indicators were selected to evaluate model performance: accuracy, sensitivity (recall), specificity, precision (positive predictive value, PPV), F1 score, and the area under the receiver operating characteristic curve (AUC). We chose 0.5 as the threshold for obtaining the threshold-based metrics. For all six indicators, larger values indicate better model performance.
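The five threshold-based indicators follow directly from the confusion-matrix cells; a small self-contained sketch (the labels and probabilities below are made up, and AUC, being rank-based, would be computed separately, e.g. with `sklearn.metrics.roc_auc_score`):

```python
import numpy as np

def evaluate(y_true, y_prob, threshold=0.5):
    """Confusion-matrix indicators at a fixed probability threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    sens = tp / (tp + fn)          # sensitivity / recall
    spec = tn / (tn + fp)          # specificity
    prec = tp / (tp + fp)          # precision / PPV
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sens,
        "specificity": spec,
        "precision": prec,
        "f1": 2 * prec * sens / (prec + sens),
    }

metrics = evaluate([1, 1, 0, 0, 1, 0], [0.9, 0.4, 0.2, 0.6, 0.8, 0.1])
```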

All analyses were performed using R 4.0.2 and Python 3.8.3.

Results

Descriptive analysis and variable selection

A total of 4207 patients with ICH were considered in this study. The baseline characteristics for all patients are presented in Table 3. Among 4207 patients, 2909 (69.15%) survived 90 days after discharge and 1298 (30.85%) died within 90 days after discharge. In the univariate analyses, age group and diabetes were not statistically significant. Considering that 99.76% of the patients in this study did not have diabetes, and diabetes was not statistically significant, diabetes was not included in the prediction models in this study.

Table 3 Patient baseline characteristics

Comparison of training set with and without CVCF

Figure 3 shows the average performance of LR, RF, ANN, SVM, KNN, stacking, and AdaBoost on the training set with and without CVCF. As the figure shows, with CVCF the accuracy, specificity, and precision of all models improved, whereas the sensitivity decreased. The AUC with CVCF was higher than without CVCF for every model except stacking. Similarly, the F1 score of all models except LR improved with CVCF. Overall, removing the detected outliers from the training set could improve the performance of some machine learning models.

Fig. 3

The average performance of 7 models on training set with and without CVCF

Comparison of training set with and without resampling

We calculated the performance of the 7 models under each resampling method and ranked the methods from largest to smallest; a smaller rank indicates a better-performing resampling method on this study's data. Table 4 shows the average performance of the 7 models under each resampling method, and Table 5 shows the rank of the average performance of each resampling method.

As illustrated in Tables 4 and 5, the accuracy, specificity, and precision of the training set without resampling were better than those with resampling, whereas the sensitivity was the opposite. Among the five resampling methods, SMOTEENN showed the greatest increase in sensitivity. The resampling methods can improve the sensitivity of models, but at the cost of reduced specificity. For AUC and F1 score, different models performed differently under different resampling methods. Averaged across models, the AUC of the training set with ROS was the highest, and the F1 score with RUS was the highest, followed by ROS. Taking all indicators into account, the training set with RUS performed best, followed by ROS and no resampling.

Table 4 The average performance of 7 models under each resampling method
Table 5 The rank of the average performance of each resampling method

Comparison of 7 models

Table 4 shows the performance of each model under different resampling methods. Table 6 shows the rank of the average performance of each model.

As illustrated in Tables 4 and 6, different models performed differently under different resampling methods. The average accuracy, specificity, AUC, and precision of RF were the highest, indicating that RF performed best in distinguishing between patient survival and death. Stacking performed well on F1 score (ranked 1st) and sensitivity (ranked 2nd). Taking all indicators into account, RF performed best, followed by ANN, AdaBoost, and stacking. Compared with LR, SVM, and KNN, the ensemble learning methods performed better. For physicians who do not know which model to choose, ensemble learning may be a good choice.

Table 6 The rank of the average performance of each model

Comparison of all 84 combinations of the joint modeling strategy

Table 7 shows the performance of all 84 combinations of the joint modeling strategy; the performance with 95% CI is shown in Additional file 1. Eight combinations performed best in terms of accuracy (0.816): AdaBoost, CVCF + ANN, CVCF + SVM, CVCF + Stacking, CVCF + RUS + Stacking, CVCF + BSMOTE + SVM, CVCF + SMOTEENN + SVM, and CVCF + SMOTEENN + AdaBoost. For sensitivity, the best combination was SMOTEENN + Stacking (0.662); for specificity, CVCF + KNN (0.987); for AUC, Stacking (0.756); for precision, CVCF + SVM (0.938); and for F1 score, AdaBoost (0.602). Taken together, combinations of CVCF and ensemble learning performed best.

Table 7 The performance of all 84 combinations of joint modeling strategy

Discussion

Taking ICH as an example, this study presented a joint modeling strategy considering outliers, imbalanced data, model selection, parameter tuning, in order to provide a reference for physicians and researchers interested in constructing similar models. The results of this study show that it is necessary to adopt a joint modeling strategy that considers multiple processing and modeling methods, which can improve the performance of models.

The results of this study illustrate that removing detected outliers from the training set can improve model performance. Patients with ICH may deteriorate or even die after discharge because of competing risks, such as recurrence of ICH, thrombus dislodgement, and infection. We did not collect information about these competing risks and therefore could not predict them. Deaths that were unpredictable from the information we collected were removed from the training set by CVCF but kept in the test set, as similar situations may still occur in future datasets. This may explain why the sensitivity of models trained with CVCF decreased compared to those trained without it. In addition, iForest [31] is also a good choice for outlier detection but requires multiple attempts to select optimal parameters. There were only ten variables in this study, so variable selection was relatively simple. With more variables, more complex methods can be considered, such as the Least Absolute Shrinkage and Selection Operator (LASSO) [32].

In terms of data balancing, because of the low IR in this study, none of the resampling methods improved model performance compared with no resampling. Nevertheless, our comparison of 5 resampling methods provides some insights. Given the large number of minority samples in this study, ROS achieved the best AUC, which is consistent with the findings of Batista et al. [11], who showed that SMOTE + Tomek and SMOTE + ENN were more suitable for data sets with a small number of minority instances. For data sets with a larger number of minority instances, ROS can be a good choice because it is less computationally expensive and provides results competitive with more complex methods [11].

For model selection, this study showed that ensemble learning might be a good choice, such as RF, AdaBoost, and Stacking. For stacking, researchers can choose methods commonly used in their fields as base classifiers. The most classic LR was selected as the meta classifier in this study, and researchers can try other more complex methods as meta classifier to obtain better performance.

This study has some strengths. Firstly, the data set is large and comes from a multi-center population in Sichuan Province. Secondly, stacking was used to combine several common machine learning methods. Finally, a joint modeling strategy considering outliers, imbalanced data, model selection, and parameter tuning was presented to achieve good prediction performance.

Meanwhile, this study inevitably has several limitations. Firstly, this study is a retrospective design with the inherent risk of bias and lack of a validation cohort. Secondly, this study did not have information about early withdrawal of care, which was an important confounder in ICH research.

The results of this study could shed light on future work in several ways. First, external validation is needed to test the generalizability of this model. In addition, more predictive factors could be considered to improve prediction performance. Finally, the parameters in this model were selected automatically by grid search, which may result in sub-optimal parameter selection. Further work could expand the range of the parameter search and consider a more comprehensive selection of base and meta classifiers to improve predictive performance.

Conclusion

This study used the information of patients with ICH at the time of hospitalization to predict 90-day mortality after discharge. We proposed a joint modeling strategy that takes into account outliers, imbalanced data, model selection, and parameter tuning, in order to provide a reference for physicians and researchers. This study illustrated the importance of outlier detection and removal for machine learning and showed that ensemble learning might be a good modeling strategy. Because of the low IR in this study, we did not find an obvious improvement from resampling in terms of accuracy, specificity, and precision. However, our results also confirmed that ROS performed comparably to more complex methods on AUC when the number of minority samples is large.