Background

The novel Coronavirus Disease 2019 (COVID-19) outbreak, which began in Wuhan, Hubei province, China, in December 2019, had infected over 5.7 million people globally by May 29th, 2020 [1]. As severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) spreads globally, it places great strain on the health care system of every country. To save more lives, more attention should be focused on severely ill patients, so it is critical to distinguish severely ill patients from mildly ill ones. Possible risk factors for progression to severe illness include, but are not limited to, older age and pre-existing chronic medical conditions such as lung disease, heart failure, and cerebrovascular disease […]

Statistical analysis

Statistical analysis was performed using R software, version 3.6.3. Continuous variables were expressed as the median with interquartile range and analyzed using the Wilcoxon signed-rank test or the Pearson correlation test. Categorical variables were presented as numbers (percentages) and analyzed using the chi-squared test or Fisher's exact test. Repeated-measures data from different time points were compared by repeated-measures analysis of variance. Multivariate logistic regression analysis was used to identify risk factors for disease progression.

Sparse principal component analysis (SPCA) and model evaluation

SPCA was performed using the R software package sparsepca (https://github.com/erichson/spca) [9]. The continuous clinical variables of the Hefei Cohort, including age and all of the above laboratory indicators, were used. The data were centered and scaled by subtracting the mean of each variable and dividing by its standard deviation, so that all variables had unit variance. In the SPCA process, the sparsity-controlling parameter alpha was varied from 0.0001 to 0.002 in steps of 0.0001 for better variable selection, and for each alpha value, the cumulative variance and the number of variables selected in the top principal components (PCs) were calculated. The PCs produced by SPCA were then subjected to multivariate logistic regression for disease severity prediction. The prediction models using PCs were evaluated using the receiver operating characteristic (ROC) curve, and the area under the curve (AUC) was calculated. The accuracy, sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) were also calculated. For clinical net benefit assessment [10], decision curve analysis was performed using the rmda package (http://mdbrown.github.io/rmda/).
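The evaluation metrics named above can be illustrated with a short sketch. The study itself used R; the Python below is only an illustration. The function name `classification_metrics`, the default cut-off of 0.5, and the rule "predicted severe if probability >= cut-off" are assumptions made for the example, not details reported in the paper.

```python
def classification_metrics(y_true, y_prob, cutoff=0.5):
    """Confusion-matrix metrics for a binary severity label (1 = severe, 0 = mild).

    `cutoff` is a hypothetical threshold; the study's actual cut-points are not
    reproduced here.
    """
    tp = fp = tn = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= cutoff else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1

    def ratio(numerator, denominator):
        # Return NaN rather than raising when a denominator is empty.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # severe patients correctly flagged
        "specificity": ratio(tn, tn + fp),  # mild patients correctly cleared
        "ppv": ratio(tp, tp + fp),          # flagged patients who are severe
        "npv": ratio(tn, tn + fn),          # cleared patients who are mild
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }
```

All five quantities derive from the same four counts, so they all depend on the chosen cut-off, whereas the AUC summarizes performance over every possible cut-off.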

Independent cohort validation

The 169 COVID-19 patients of the Nanchang Cohort were used as an independent validation cohort. Using the scaled clinical markers of each patient, the PCs of each patient were calculated according to the corresponding PC loading matrix derived from the Hefei Cohort. The resulting prediction models were then used to predict disease severity in this independent cohort, and the prediction efficiency was estimated using ROC analysis. The sensitivity, specificity, PPV, NPV and accuracy of each marker were also calculated. The clinical net benefit was evaluated using decision curve analysis.
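The projection step described above can be sketched as follows. This is a Python illustration rather than the authors' R code. It assumes that the validation data are standardized with the means and standard deviations of the training cohort, which the text does not state explicitly. The helper names `fit_scaler`, `scale` and `project` are hypothetical.

```python
from statistics import mean, stdev


def fit_scaler(train_rows):
    """Column means and sample standard deviations of the training cohort."""
    columns = list(zip(*train_rows))
    return [mean(c) for c in columns], [stdev(c) for c in columns]


def scale(rows, means, sds):
    """Center and scale each variable with previously fitted statistics."""
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def project(scaled_rows, loadings):
    """Multiply scaled data (patients x variables) by a loading matrix
    (variables x components) to obtain principal component scores."""
    n_components = len(loadings[0])
    return [
        [sum(x * loadings[j][k] for j, x in enumerate(row))
         for k in range(n_components)]
        for row in scaled_rows
    ]
```

Because a sparse loading matrix contains many zeros, only the few variables with non-zero loadings contribute to each score.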

Results

Demographics and baseline laboratory test results

The demographics and first clinical laboratory test results of the 82 COVID-19 patients in the Hefei Cohort are shown in Table 1. Compared with the 54 mildly ill COVID-19 patients, most of the 28 severely ill patients were male and had comorbidities. Severely ill COVID-19 patients were also older and showed increased white blood cell count (WBC), neutrophil count (NEU), neutrophil percentage (NEU%), aspartate aminotransferase (AST), alanine aminotransferase (ALT), gamma-glutamyltransferase (GGT), glucose (Glu), urea, lactic dehydrogenase (LDH), serum amyloid A (SAA), C-reactive protein (CRP), procalcitonin (PCT), interleukin-6 (IL-6), D-Dimer (DD), and myohemoglobin (MYO). Meanwhile, the lymphocyte count (LYM), lymphocyte percentage (LYM%), albumin (Alb), calcium (Ca), phosphorus (P), and fibrinogen (FIB) levels of severely ill patients were significantly decreased.

Table 1 Demographics and baseline laboratory markers of mild and severe ill COVID-19 patients in Hefei Cohort

Results of Sparse principal component analysis (SPCA)

When disease severity was predicted by multivariate logistic regression using the clinical laboratory indicators directly, the model fit did not converge. Because clinical laboratory markers are often correlated with each other, we used SPCA to reduce the dimensionality of the data and extract a small number of PCs that summarize these dozens of markers.

Using the sparsepca package, SPCA was performed on the 44 clinical variables, and the alpha parameter was varied from 0.0001 to 0.002 in steps of 0.0001. In these SPCA models, the cumulative variance of the first 13 PCs was greater than 80% of the total variance. For the model at each alpha, the cumulative variance of the first 13 PCs was summed and the number of variables selected in the first 13 PCs was counted (Fig. 1a). As alpha increases, the cumulative variance decreases gradually and the number of selected variables falls sharply. When alpha is 0.0015, the first 13 PCs account for 80.8% of the variance of the original data, and only 30 variables are selected in the 13 PCs. Based on the variance-sparsity trade-off [11], the SPCA model with an alpha of 0.0015 was used for further analysis.
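The grid search over alpha can be sketched in a few lines. The paper does not give a formal selection criterion beyond the variance-sparsity trade-off, so the rule used here is an assumption for illustration: among models retaining at least 80% cumulative variance, take the one with the fewest selected variables. The function names are hypothetical, and the `results` tuples would come from fitting SPCA at each alpha, which is not shown.

```python
def alpha_grid(first_step=1, last_step=20, step=0.0001):
    """Sparsity parameters 0.0001, 0.0002, ..., 0.0020.

    Built from integer multiples and rounded to avoid floating-point drift.
    """
    return [round(k * step, 4) for k in range(first_step, last_step + 1)]


def pick_alpha(results, min_variance=0.80):
    """Choose alpha from (alpha, cumulative_variance, n_selected_variables) tuples.

    Assumed rule: keep models whose cumulative variance is at least
    `min_variance`, then prefer the fewest selected variables, breaking
    ties in favor of higher retained variance.
    """
    eligible = [r for r in results if r[1] >= min_variance]
    if not eligible:
        return None
    best = min(eligible, key=lambda r: (r[2], -r[1]))
    return best[0]
```

Making the rule explicit makes clear that the choice of alpha depends on the variance floor, which is itself a judgment call.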

Fig. 1

Results of the sparse principal component analysis using clinical data of the Hefei cohort. Sparse principal component analysis (SPCA) was performed on the 44 clinical variables of the Hefei cohort, and the alpha parameter was varied from 0.0001 to 0.002 in steps of 0.0001. For the model at each alpha, the cumulative variance of the first 13 principal components (PCs) was summed and the number of variables selected in the first 13 PCs was counted. a Variance at different alpha values in SPCA, with the number of clinical variables selected in the 13 PCs of each SPCA model annotated at each point. b Distribution of the coronavirus disease 2019 (COVID-19) patients projected onto the principal components of SPCA with an alpha of 0.0015. COVID-19 patients were plotted according to each patient's first (X-axis) and 12th (Y-axis) principal component values. c Scatter plot of the clinical markers selected in the first and 12th principal components of SPCA with an alpha of 0.0015. The 44 clinical variables were plotted according to each variable's first (X-axis) and 12th (Y-axis) principal component loadings. The first (X-axis) and 12th (Y-axis) principal components accounted for 17.8% and 2.9% of the total variance of the 44 clinical markers, respectively

The patient distribution and variable loadings from SPCA with an alpha of 0.0015 are shown in Fig. 1b, c. The mildly and severely ill COVID-19 patients were separated along the PC1 direction (X-axis) in the patient distribution plot (Fig. 1b). Each PC depends on fewer than five clinical variables. An additional table file (Additional file 1: Table S1) shows this in more detail.

Next, the 13 PCs were subjected to multivariate logistic regression for disease progression prediction. Using both stepwise logistic regression and logistic regression with an L1 penalty (glmnet package, https://cran.r-project.org/package=glmnet), two of the 13 PCs were finally selected for the prediction model: the first PC (PC1) and the 12th PC (PC12), both of which showed significant association with the disease severity classification (Table 2). This model was named Model-A for further analysis.
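A logistic model with two predictors turns PC scores into a probability of severe disease, as the following sketch shows. The coefficients are passed in as parameters because the fitted values from Table 2 are not reproduced here. The function and parameter names are hypothetical.

```python
import math


def logistic(z):
    """Numerically stable logistic (inverse logit) function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def model_a_probability(pc1, pc12, intercept, coef_pc1, coef_pc12):
    """Predicted probability of severe illness from two PC scores.

    `intercept`, `coef_pc1` and `coef_pc12` stand in for the fitted
    regression coefficients, which must be supplied by the caller.
    """
    return logistic(intercept + coef_pc1 * pc1 + coef_pc12 * pc12)
```

Setting `coef_pc12` to zero reduces this to a one-predictor model based on PC1 alone.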

Table 2 Multivariate logistic regression of 13 principal components produced by SPCA for disease severity prediction of COVID-19 patients

According to the PC loading matrix (Additional file 1: Table S1) and the variable loading plot of SPCA (Fig. 1c), PC1 depends on NEU%, LYM%, LYM, and MONO, while PC12 depends only on DD and LDH. Since the NEU%, LYM%, LYM, and MONO in PC1 can be obtained from a single routine blood test, and PC1 accounted for 17.8% of the total variance, Model-A was further simplified to PC1 alone, which was named the Lymphocyte-Monocyte-Neutrophil index, abbreviated as LMN index.

The relationships of Model-A and the LMN index with clinical variables were assessed. Both showed significant correlation with CD8+ lymphocyte counts (Fig. 2a, b). Meanwhile, higher Model-A probabilities and LMN indices were observed in patients with comorbidities and older age (Fig. 2c–f). Furthermore, Model-A probabilities and LMN indices at different time points during hospitalization were investigated, and both decreased significantly as treatment took effect and before discharge (P < 0.001, Fig. 3a, b). Mildly and severely ill patients showed clearly different trends (P < 0.001) in Model-A probability and LMN index. Both the Model-A probabilities and the LMN indices of mildly ill patients fell sharply (Fig. 3a, b, green lines), while those of severely ill patients declined slowly (Fig. 3a, b, red lines).

Fig. 2

The association between the prediction models and clinical characteristics of COVID-19 patients. In COVID-19 patients, CD8+ T lymphocytes correlated negatively with Model-A probability (a) and LMN index (b), while patient age correlated positively with Model-A probability (c) and LMN index (d). COVID-19 patients with comorbidities had higher Model-A probabilities (e) and LMN indices (f). Abbreviations: COVID-19, coronavirus disease 2019; Model-A, prediction model based on the first and 12th principal components produced by sparse principal component analysis; LMN index, lymphocyte–monocyte–neutrophil index, a simplified version of Model-A

Fig. 3

The dynamics of the prediction models in COVID-19 patients from hospital admission to discharge. For all patients, results at three time points during hospitalization were collected: the first time point upon hospitalization (After Admission), the midpoint of hospitalization (Middle Stage), and the last laboratory test before hospital discharge (Before Discharge). With the results of the three time points, the Model-A probability and LMN index were calculated and plotted. a Dynamics of Model-A probability for individual patients (P < 0.05); b dynamics of LMN index for individual patients (P < 0.05). Abbreviations: COVID-19, coronavirus disease 2019; Model-A, prediction model based on the first and 12th principal components produced by sparse principal component analysis; LMN index, lymphocyte–monocyte–neutrophil index, a simplified version of Model-A

Prediction efficiency evaluation

ROC analysis was then used to estimate the disease severity classification performance of Model-A and the LMN index. The AUC and accuracy of Model-A for predicting the disease severity of COVID-19 patients were 0.867 and 0.726, respectively, in the Hefei cohort (Table 3). The corresponding values for the LMN index were 0.837 and 0.793 (Table 3).
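The AUC reported above can be read as the probability that a randomly chosen severely ill patient receives a higher score than a randomly chosen mildly ill patient. The sketch below computes it in that pairwise form. It is an illustration in Python, not the procedure used in the study, and the function name `auc` is an assumption.

```python
def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    Counts the fraction of (severe, mild) pairs in which the severe patient
    has the higher score; tied scores count as half.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present to compute AUC")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

Unlike accuracy, this quantity does not depend on a cut-off, which is why a model can rank higher on AUC yet lower on accuracy than another, as seen for Model-A and the LMN index in the Hefei cohort.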

Table 3 Summary of severity prediction efficiency of COVID-19 patients using Model-A, LMN index and other markers in Hefei cohort

Because several laboratory markers are classical predictors of disease severity, we also compared the prediction results of these markers, which are summarized in Table 3. Model-A showed the best performance, and the LMN index showed robust prediction performance compared with classical predictors, including the neutrophil-to-lymphocyte ratio (NLR), a promising predictor of severe COVID-19 [12, 13]. To assess the clinical net benefit of Model-A and the LMN index, we also performed decision curve analysis (Fig. 4a). Although the curves of all the markers were intertwined, Model-A gave a slightly greater net benefit, while the LMN index showed performance similar to that of albumin and NLR.
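Decision curve analysis plots net benefit against the threshold probability. The sketch below uses the standard net benefit definition: true positives per patient minus false positives per patient, the latter weighted by the odds of the threshold. It is a Python illustration and does not call the rmda package. The function names and the rule "treat if probability >= threshold" are assumptions.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold.

    NB = TP/n - (FP/n) * threshold / (1 - threshold), with 0 < threshold < 1.
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    weight = threshold / (1.0 - threshold)
    return tp / n - (fp / n) * weight


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy that treats every patient as severe."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)
```

A marker is useful at a given threshold only if its net benefit exceeds both the treat-all reference and zero, the net benefit of treating no one.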

Fig. 4

Decision curve analysis of prediction models produced by SPCA. Decision curve analysis of prediction models in the training Hefei Cohort (a) and the independent validation Nanchang Cohort (b). Model-A showed slightly better net benefit in both the Hefei Cohort and the Nanchang Cohort. Abbreviations: ALB, albumin; COVID-19, coronavirus disease 2019; Model-A, prediction model based on the first and 12th principal components produced by sparse principal component analysis; LMN index, lymphocyte–monocyte–neutrophil index, a simplified version of Model-A; NLR, neutrophil-to-lymphocyte ratio

Independent cohort validation

To validate the severity prediction efficiency of Model-A and the LMN index, the laboratory indicators of the Nanchang Cohort (Additional file 1: Table S2) were scaled, and PC1 and PC12 of each patient were calculated using the PC loading matrix of the Hefei Cohort. Model-A probabilities and LMN indices were then predicted. ROC analysis was used to estimate the severity prediction efficiency, and the AUC and accuracy of Model-A in the Nanchang Cohort for disease progression prediction were 0.835 and 0.757, respectively (Table 4). Meanwhile, the AUC and accuracy of the LMN index were 0.800 and 0.740 in this independent cohort. As in the training cohort (Table 3), Model-A gave the best efficiency, and the LMN index showed prediction efficiency comparable to that of classical clinical variables such as NLR and albumin (Table 4). In decision curve analysis (Fig. 4b), all the curves were intertwined and Model-A gave slightly better performance, while the LMN index showed performance similar to that of albumin and NLR.

Table 4 Summary of severity prediction efficiency of COVID-19 patients in the independent cohort (Nanchang Cohort)

Discussion

Since the outbreak of COVID-19, the number of patients worldwide has increased drastically, putting massive pressure on the health care system of every country. To save as many lives as possible, more resources should be focused on severely ill patients. Several studies have sought predictors of COVID-19 disease progression, such as the neutrophil-to-lymphocyte ratio [12, 13], thrombocytopenia [5], DD, and IL-6 [7]. Dozens of other laboratory indicators are also used for disease severity prediction. In the present study, we used SPCA to extract principal components of laboratory indicators. In the SPCA model with an alpha of 0.0015, the first 13 PCs accounted for 80.8% of the total variance of the 44 clinical variables. Using logistic regression, Model-A, based on PC1 and PC12, was derived and showed the best prediction efficiency in the training cohort (Hefei Cohort, AUC = 0.867) as well as in the independent validation cohort (Nanchang Cohort, AUC = 0.835). Because PC1, which depends on routine blood test markers, accounted for 17.8% of the total variance, Model-A was further simplified to the LMN index, which predicts disease severity using PC1 alone. The LMN index also showed satisfactory prediction efficiency in the Hefei Cohort (AUC = 0.837) as well as in the independent Nanchang Cohort (AUC = 0.800). In decision curve analysis, Model-A showed slightly better performance in both the Hefei Cohort and the Nanchang Cohort, and the LMN index performed comparably to albumin and NLR.

In the clinical laboratory, combinations of test items are very common, and the indicators in these combinations are usually correlated with each other. For example, in routine blood examination, neutrophil counts are usually negatively related to lymphocyte counts, and in liver function examination, serum ALT usually changes in parallel with AST. This property of laboratory markers is called collinearity, and it could enhance diagnostic accuracy. However, the collinearity of these laboratory markers makes it difficult for traditional multivariate statistical analysis to include all the significant indicators. This is why PCA was used in this study: it can extract distinct PCs from a group of highly correlated variables as combinations of the original variables [14, 15]. Furthermore, a sparsity-controlling parameter alpha was introduced into PCA for better variable selection, which gives the so-called SPCA [16, 17]. In this study, the alpha value was varied from 0.0001 to 0.002, and when alpha was set to 0.0015, the 13 PCs accounted for 80.8% of the total variance of the 44 clinical variables and depended on only 30 variables. Thus this SPCA model balanced variance and sparsity [11] and could represent the original 44 variables. Furthermore, the sparsepca package [9] used in the current study implements a recently published method for SPCA, which offers some immediate improvements over previously proposed SPCA algorithms, such as a much faster and more scalable algorithm and robustness to outliers.

In the disease severity prediction model Model-A, PC1 depends on four clinical markers (NEU%, MONO, LYM%, and LYM), while PC12 depends only on DD and LDH. Several previous studies have confirmed the relationship between decreased LYM and increased NEU in severely ill COVID-19, SARS and MERS patients [18,19,20,21,22,24]. A small improvement is still an improvement, so Model-A does bring a net benefit for patients, whereas in DCA the LMN index showed performance merely comparable to that of Alb and NLR.

Finally, Model-A shows the best prediction efficiency for disease severity in COVID-19 patients, and the LMN index, which depends on four routine blood test markers, is very economical for clinical application. Both therefore have the potential for clinical use in COVID-19 treatment and possibly in the treatment of other diseases. This use of SPCA for extracting clinical variables may also point to a new application direction for SPCA.

Our study also has some weaknesses. Clinical characteristics other than laboratory markers, which are also risk factors for disease progression, were not considered in this study. More clinical characteristics should be included for model training in the future. In addition, the sample size was small, which may have affected the statistical results, and bias may exist in the data standardization process, model training and cut-point selection. In the future, with more patients enrolled to optimize the above processes, a more accurate prediction model could be produced.

Conclusions

In this study, using the SPCA method for feature selection and dimensionality reduction, the prediction model Model-A and the LMN index were derived; both showed significant association with clinical outcomes and robust efficiency in predicting the disease severity of COVID-19 patients. Model-A and the LMN index may have the potential for clinical application and could help classify patients so as to save more lives.