1 Introduction

COVID-19 is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection, and the WHO declared a global pandemic of COVID-19 on March 11, 2020 [1]. The spike subunit of SARS-CoV-2 infects host cells by binding to the human angiotensin-converting enzyme 2 receptor, and the main pathogenic mechanisms include direct viral toxicity, endothelial and microvascular damage, a hypercoagulable state leading to thrombosis and micro-thrombosis, dysregulation of the immune system, and induction of a hyperinflammatory state [2]. After infection with SARS-CoV-2, patients show marked hematological changes, including elevated white blood cell counts, elevated neutrophil counts, thrombocytopenia, decreased CD8+ T lymphocyte counts, and elevated levels of certain biomarkers (e.g., C-reactive protein, procalcitonin, the cytokine IL-6, ferritin, and serum amyloid A).

Many patients experience lingering sequelae as they recover from the virus. Post-acute sequelae of COVID-19 (PASC) is a developing complication of SARS-CoV-2 infection that causes persistent symptoms in patients who have not fully recovered after the acute phase, defined as extending from 4 weeks to no more than 12 weeks after the initial onset of symptoms; it can also lead to multi-organ system sequelae after the acute phase [2,3,4]. PASC is usually nonspecific, with the most common symptoms being fatigue (15–87%), dyspnea (10–71%), chest pain or tightness (12–44%), cough (17–34%), and sleep disturbance (24–26%) [5,6,7,8,9]. The most common gastrointestinal symptoms of PASC include nausea, vomiting, diarrhea, anorexia, and loss of appetite [10,11,12]. Studies have shown that 79.7% of patients with acute COVID-19 develop olfactory loss, especially patients with coexisting fever, and the loss of smell persists in up to 56% of these patients after their recovery from acute infection [13, 14]. PASC severity may be related to multi-omics signatures, including SARS-CoV-2 mRNA detected in blood and distinct phenotypes of CD8+ T and CD4+ T cells [15]. PASC is more prevalent in women than in men [5]. However, the symptoms are more severe and the risk of death is higher in men than in women [16]. A total of 54% of women experienced fatigue 10 weeks after the initial symptoms of COVID-19 [17]. Some studies have also confirmed that older age is associated with PASC [18, 19]. This association may be due to the higher prevalence of comorbidities and severe disease in older age groups. Severe disease in these cases may trigger a highly inflammatory immune response, which leads to a prolonged recovery period from COVID-19. Lower household income is also correlated with the occurrence of PASC, probably because low-income people cannot work from home, lack adequate personal protective equipment, and have overcrowded living conditions that expose them to higher viral doses [20, 21]. PASC affects at least 10–30% of people who test positive for COVID-19 [4]. Thus, understanding the prevalence, predisposing factors, and susceptibility factors for PASC can help with the early identification of, intervention for, and support of COVID-19 patients who are at higher risk of developing PASC.

On the one hand, SARS-CoV-2 may drive chronic symptoms in PASC patients by persisting in certain tissues and organs after acute infection. On the other hand, even when SARS-CoV-2 is completely cleared from the blood, tissues, and nerves of patients, the virus may dysregulate the host immune response during the acute phase of COVID-19. This dysregulation can reactivate pathogens previously carried by the host, allow infection of new body sites, and trigger chronic symptoms. For example, one study reported that a critically ill COVID-19 patient experienced reactivation of latent VZV and HSV-1 infections, which was correlated with the onset of septic shock [22].

Previous cohort studies of PASC had limited sample sizes drawn from specific regions; their measures of symptom severity were based on subjective patient self-reports and were therefore subject to recall and response biases [23]. Previous RNA-seq analyses have also failed to systematically investigate the relationship between PASC and host immune status. In our previous research, we found that machine learning algorithms can identify latent biomarkers and accurately characterize the features of different COVID-19 cohorts, allowing the severity of viral infections to be predicted more effectively. This capability facilitates the optimization of treatment protocols, which improves patient outcomes [24,25,26,27,28]. In this study, we analyzed whole blood gene expression in hospitalized COVID-19 patients and explored the molecular mechanism of COVID-19 sequelae through a group of machine learning algorithms. Essential genes highly related to PASC were obtained, and special expression patterns associated with different PASC levels were extracted. We also constructed efficient classifiers to predict quality of life after COVID-19 infection, which can serve as a useful and convenient tool for predicting the severity of the sequelae of SARS-CoV-2 infection. This work can provide guidance for the development of corresponding liquid biopsy methods and a reference for improving the prognosis of patients infected with SARS-CoV-2.

2 Materials and methods

In this study, we attempted to discover essential genes and patterns that can indicate PASC levels. To this end, a machine learning-based approach was designed and applied to an existing whole blood RNA-seq dataset from COVID-19 patients reported in the GEO database. Figure 1 shows the workflow of the approach. The investigated dataset was grouped based on the quality of life of patients after the acute phase of COVID-19. Our approach first assessed the importance of genes using six feature-ranking methods, yielding six feature lists. Subsequently, each list was fed into a combination of the incremental feature selection (IFS) [29] method and four classification algorithms to further filter the key genes that best distinguish the different classes (changes in quality of life before and after the acute phase of COVID-19) and to determine their special expression patterns.

Fig. 1
figure 1

Flowchart of the entire machine learning-based analysis process. Whole blood transcriptomic data from COVID-19 patients hospitalized during acute infection were analyzed. Samples in this dataset were divided into “Better,” “The Same,” and “Worse” groups by comparing the quality of life of patients after the acute phase of COVID-19 with that before the disease. Each patient was represented by the expression levels of 58,929 genes. Gene expression features were analyzed by six feature ranking algorithms, namely, LASSO, LightGBM, MCFS, RF-based, CATBoost, and XGBoost. The obtained feature lists were fed into an IFS method that combined DT, KNN, RF, and SVM to extract key genes, construct efficient classifiers, and derive classification rules

2.1 Data

Whole blood RNA-seq data from COVID-19 patients hospitalized during acute infection were obtained from the GEO database [30] under the accession number GSE215865. Based on subsequent sequelae follow-up, we compared the quality of life of patients after the acute phase of COVID-19 with that before the disease, as reported by study participants using a self-reported checklist that assesses the emergence of PASC or health changes after COVID-19. Changes in quality of life since the onset of COVID-19 were coded as three text values, namely, “Better,” “The Same,” or “Worse.” A total of 81, 234, and 187 patient samples (502 in total) were labeled with these three values, constituting three groups. Each patient was represented by the expression levels of 58,929 genes.

2.2 Problem description

According to the data described in Sect. 2.1, COVID-19 patient samples were classified into Better, The Same, and Worse. In the classification problem, these can be deemed three labels, denoted by \({L}_{Better}\), \({L}_{Same}\), and \({L}_{Worse}\). The expression levels of 58,929 genes were used to represent each patient. These genes were denoted by \({g}_{1},{g}_{2},\cdots ,{g}_{58929}\). The gene expression levels of the i-th patient sample comprised a vector, denoted by \({Y}_{i}={\left[{y}_{i,1},{y}_{i,2},\cdots ,{y}_{i,58929}\right]}^{T}\), where \({y}_{i,j}\) represents the expression level of gene \({g}_{j}\). Accordingly, the whole blood RNA-seq data mentioned in Sect. 2.1 can be represented by \({\left\{({Y}_{i},{L}_{i})\right\}}_{i=1}^{502}\), where \({L}_{i}\in \left\{{L}_{Better},{L}_{Same},{L}_{Worse}\right\}\) is the label of the i-th patient sample. The analysis of the whole blood RNA-seq data can thus be converted into the investigation of a classification problem involving three labels (\({L}_{Better}\), \({L}_{Same}\), and \({L}_{Worse}\)). Evidently, some genes provide key contributions to distinguishing patients with one label from those with other labels. Thus, the main purpose of this study was to extract these essential genes, such as \({g}_{{i}_{1}},{g}_{{i}_{2}},\cdots ,{g}_{{i}_{k}}\), which can provide high identification ability for a classification algorithm. Furthermore, special patterns, expressed by a group of essential genes along with thresholds on their expression levels, can be inferred for each label. Finally, we also wanted to establish efficient classifiers for identifying the PASC level.

2.3 Feature ranking algorithms

The investigated classification problem contained excessive features (58,929 genes). Evidently, not all features provided positive contributions to classifying samples into the three classes; that is, the essential features were limited. This study mainly aimed to identify such essential features. In the field of machine learning, several methods have been proposed to assess the importance of features, and these methods were suitable for dealing with this dataset. However, selecting a single suitable method was challenging. In our experience, any one method can screen out only a portion of the essential features because each method has its limitations. The essential features identified by one algorithm may be important supplements to those yielded by another algorithm. To discover essential features as completely as possible, this study employed six feature ranking algorithms: least absolute shrinkage and selection operator (LASSO) [31], light gradient boosting machine (LightGBM) [32], Monte Carlo feature selection (MCFS) [33], the random forest (RF)-based method [34], categorical boosting (CATBoost) [35], and extreme gradient boosting (XGBoost) [36]. These algorithms were designed following different rules; hence, they view the dataset from different perspectives, and their combination provides a more complete picture of the essential features. These algorithms have also been used in numerous publications [27, 37,38,39,40,41,42,43]. Brief descriptions of these algorithms are given in the remainder of this section.

2.3.1 Least absolute shrinkage and selection operator

LASSO is a linear regression method introduced by Robert Tibshirani in 1996. It combines regularization and feature selection by adding a penalty term to the objective function, resulting in sparse solutions with reduced model complexity [31]. LASSO has gained substantial attention in machine learning and statistics because of its ability to produce interpretable models and effectively handle high-dimensional data. It employs L1 regularization, which adds the sum of the absolute values of the coefficients to the loss function. This form of regularization helps prevent overfitting by encouraging smaller coefficient values and reducing model complexity. In LASSO, the magnitude of the nonzero coefficients is an indication of feature importance. Larger coefficients imply a greater influence on the target variable, while coefficients close to or equal to zero indicate that the corresponding features have little or no effect on the target variable. By penalizing large coefficients and promoting sparsity, LASSO highlights the most relevant features and filters out the less important ones. Accordingly, all features can be sorted in a list in decreasing order of the absolute values of their coefficients.
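As a hedged illustration, the following minimal sketch ranks genes by the absolute values of their LASSO coefficients; the expression matrix `X`, the encoded labels `y`, and the `alpha` value are assumptions, since the exact preprocessing and hyperparameters are not specified here.

```python
# Hypothetical sketch of LASSO-based feature ranking; X (samples x genes) and
# y (numerically encoded labels) are assumed to be loaded elsewhere.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(X)   # scale so coefficients are comparable
lasso = Lasso(alpha=0.01)                   # alpha is an illustrative choice
lasso.fit(X_std, y)
ranking = np.argsort(-np.abs(lasso.coef_))  # gene indices, most important first
```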

2.3.2 Light gradient boosting machine

LightGBM is a gradient boosting framework developed by Microsoft that has gained considerable popularity among machine learning practitioners [32]. It is designed to be more efficient and scalable than other gradient boosting frameworks, such as XGBoost and CATBoost, making it suitable for large-scale and high-dimensional data. By utilizing novel techniques, such as gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB), LightGBM remarkably reduces the computational burden while maintaining high predictive accuracy. LightGBM provides two common methods for calculating feature importance: (1) split-based importance: this method calculates feature importance based on the number of times a feature is used to split the data across all trees in the model; the more frequently a feature is used to make splits, the higher its importance. (2) Gain-based importance: this method calculates feature importance based on the total gain in accuracy or reduction in the objective function achieved by using a feature for splitting across all trees. This study adopted the former method; that is, features were sorted in a list in decreasing order of their split counts.
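A minimal sketch of split-based importance ranking with LightGBM is given below; `X`, `y`, and the number of boosting rounds are assumptions.

```python
# Split-based feature importance with LightGBM; X and y are assumed loaded.
import numpy as np
import lightgbm as lgb

clf = lgb.LGBMClassifier(importance_type='split', n_estimators=200)
clf.fit(X, y)
# feature_importances_[j] = number of times gene j was used for a split
ranking = np.argsort(-clf.feature_importances_)
```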

2.3.3 Monte Carlo feature selection

MCFS is a probabilistic method for identifying important features in high-dimensional datasets [33]. By using Monte Carlo sampling techniques, MCFS efficiently explores the feature space and evaluates feature importance based on their contributions to model performance. This method has gained attention in various domains, including bioinformatics, text classification, and image recognition, because of its ability to handle large datasets and provide reliable estimates of feature importance. MCFS works as follows: first, s subsets of features are randomly selected, and for each feature subset, t trees are constructed based on t training sets that are randomly selected. Therefore, s × t trees are constructed. The relative importance of a feature can be estimated by considering the weighted accuracy of the prediction, the information gain, and the number of samples affected by a split in these trees. Accordingly, features are ranked in a list in terms of their relative importance.
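The sketch below illustrates the core MCFS idea only, under stated simplifications: it credits features via accuracy-weighted tree importances over random feature subsets, whereas the published method uses a more elaborate relative-importance formula; `X`, `y`, and the parameters `s`, `t`, and `m` are assumptions.

```python
# Simplified illustration of Monte Carlo feature selection; X, y assumed loaded.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_genes = X.shape[1]
importance = np.zeros(n_genes)
s, t, m = 100, 5, 500          # s feature subsets, t trees each, m genes per subset
for _ in range(s):
    subset = rng.choice(n_genes, size=m, replace=False)
    for _ in range(t):
        X_tr, X_te, y_tr, y_te = train_test_split(X[:, subset], y, test_size=0.3)
        tree = DecisionTreeClassifier().fit(X_tr, y_tr)
        acc = tree.score(X_te, y_te)
        importance[subset] += acc * tree.feature_importances_  # weight by accuracy
ranking = np.argsort(-importance)
```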

2.3.4 Random forest-based method

RFs provide a natural way to compute feature importance because they are built from a collection of decision trees [34]. In general, feature importance in the RF-based method is calculated based on the average contribution of a feature to the improvement in the predictive performance of the trees in the ensemble. Feature importance can be calculated using two common methods: (1) Mean Decrease in Impurity (MDI) or Gini Importance: this method calculates feature importance based on the average decrease in impurity, such as the Gini index or entropy, when a feature is used to split the data across all trees in the forest; the more a feature contributes to reducing impurity, the higher its importance. MDI is the default method in many machine learning libraries, such as scikit-learn. (2) Mean Decrease in Accuracy (MDA) or Permutation Importance: this method calculates feature importance by randomly permuting the values of a feature and measuring the change in the model’s predictive performance, such as accuracy or mean squared error; a larger decrease in performance indicates a more important feature. MDA is model-agnostic, meaning it can also be applied to other models. This study adopted the RF-based method using MDA to measure feature importance; features were sorted in the list in decreasing order of MDA.
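A minimal sketch of the MDA scheme via scikit-learn’s permutation importance follows; `X`, `y`, and the forest size are assumptions, and in practice the full 58,929-gene matrix would make this computation expensive.

```python
# Permutation importance (MDA) with a random forest; X and y are assumed loaded.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rf = RandomForestClassifier(n_estimators=300).fit(X, y)
result = permutation_importance(rf, X, y, n_repeats=5, random_state=0)
ranking = np.argsort(-result.importances_mean)  # larger accuracy drop = more important
```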

2.3.5 Categorical boosting

CATBoost is a gradient boosting framework developed by Yandex to handle categorical features efficiently [35]. Gradient boosting is an ensemble learning method that builds a strong model by iteratively adding weak learners (typically decision trees) to minimize a loss function. CATBoost offers competitive performance and ease of use compared with other popular gradient boosting frameworks, such as XGBoost and LightGBM. Its unique selling point is its ability to handle categorical features effectively using an innovative approach called ordered target statistics, which, together with carefully designed regularization techniques, also makes it less prone to overfitting. In CATBoost, feature importance is calculated using a technique called permutation importance. The basic idea behind this method is to measure the effect of a feature on the model’s performance by randomly shuffling the values of that feature and observing the change in model performance; the larger the decrease in performance after shuffling a feature, the more important that feature is considered to be. Accordingly, features are ranked in a list in decreasing order of their permutation importance.
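The sketch below shows one way to obtain a permutation-style ranking from CatBoost, using its loss-function-change importance as a stand-in; `X`, `y`, and the hyperparameters are assumptions.

```python
# Permutation-style importance with CatBoost; X and y are assumed loaded.
import numpy as np
from catboost import CatBoostClassifier, Pool

model = CatBoostClassifier(iterations=300, verbose=False).fit(X, y)
pool = Pool(X, y)
# LossFunctionChange estimates how much the loss worsens when a feature is perturbed
imp = model.get_feature_importance(data=pool, type='LossFunctionChange')
ranking = np.argsort(-np.asarray(imp))
```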

2.3.6 Extreme gradient boosting

XGBoost is an open-source library that provides an efficient and scalable implementation of the gradient boosting algorithm [36]. Developed by Tianqi Chen and Carlos Guestrin, XGBoost has gained popularity among machine learning practitioners because of its ability to deliver state-of-the-art performance on various tasks, such as classification and regression. By utilizing a number of optimizations and parallelization techniques, XGBoost has become a go-to library for many data scientists and machine learning engineers. XGBoost provides built-in methods to compute feature importance based on the average contribution of a feature to the improvement in the predictive performance of the trees in the ensemble. Three measures quantify the relative importance of a feature in XGBoost: (1) Split: the number of times a feature is used to split the data across all trees in the ensemble; a higher value indicates that the feature is used more frequently and is thus considered more important. (2) Gain: the average improvement in the splitting criterion (e.g., Gini index or information gain) brought about by a feature when it is used to split the data across all trees; a higher value indicates that the feature contributes more to the improvement of the model’s performance. (3) Coverage: the average coverage of a feature when it is used to split the data across all trees, where coverage refers to the number of samples affected by the split; a higher value indicates that the feature affects a larger portion of the dataset. The current study used XGBoost with the first measure (Split) to rank features.
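A minimal sketch of split-count ranking with XGBoost follows; `X`, `y`, and the model settings are assumptions.

```python
# Split-count ("weight") importance with XGBoost; X and y are assumed loaded.
import numpy as np
import xgboost as xgb

clf = xgb.XGBClassifier(n_estimators=200).fit(X, y)
# get_score returns {'f0': count, ...}; unused features are simply absent
scores = clf.get_booster().get_score(importance_type='weight')
importance = np.zeros(X.shape[1])
for name, count in scores.items():
    importance[int(name[1:])] = count   # 'f12' -> index 12
ranking = np.argsort(-importance)
```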

The six aforementioned feature ranking algorithms were applied to the whole blood RNA-seq data described in Sect. 2.1. Each algorithm generated one feature list. For ease of description, these lists are called the LASSO, LightGBM, MCFS, RF, CATBoost, and XGBoost feature lists.

2.4 Incremental feature selection

Although the six aforementioned algorithms measure the importance of features by ranking them in lists, they cannot answer which features should be selected as essential. In view of this, we employed the IFS [29] method, a heuristic search-based feature selection technique that aims to find the optimal subset of features for a given classification algorithm. The main idea behind IFS is to incrementally add features from a given list to the current subset, evaluate the performance of the resulting feature subset, and select the best performing subset. The IFS procedure can be summarized as follows: (1) initialize an empty feature subset; (2) add the next feature in the given feature list to the current subset and evaluate the performance of the resulting feature subset using a predefined criterion (e.g., accuracy, F1 score, or AUC-ROC) and a specific classification algorithm (e.g., logistic regression, decision tree, or SVM); this evaluation can be performed using techniques such as cross-validation to estimate the generalization performance of the model; (3) after all the desired features are added and evaluated, the feature subset that performs best is selected as the optimal feature subset, and the corresponding classifier is deemed the optimal classifier. When the feature list is very large, IFS can consider only the top features according to the traits of the investigated problem, and Step 2 can be modified by adding a fixed number of features (e.g., five or ten) to the current subset each time. This modification reduces the computation time sharply.
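A minimal IFS sketch is shown below; the step size of five and the top-3000 cutoff follow the text, while `ranked_genes`, `X`, `y`, and the choice of SVM as the base classifier are assumptions.

```python
# IFS sketch: grow the feature subset five genes at a time and keep the best.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

best_f1, best_k = -1.0, 0
for k in range(5, 3001, 5):
    cols = ranked_genes[:k]                       # top-k genes from one feature list
    f1 = cross_val_score(SVC(), X[:, cols], y,
                         cv=10, scoring='f1_weighted').mean()
    if f1 > best_f1:
        best_f1, best_k = f1, k
print(f'optimal subset: top {best_k} genes, weighted F1 = {best_f1:.3f}')
```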

2.5 Synthetic minority oversampling technique

An examination of the number of patient samples in the three classes showed that the class The Same was nearly three times as large as the class Better (234 vs. 81 samples), indicating that the dataset was imbalanced. A classifier built directly on this dataset may be biased, so this problem had to be addressed. Synthetic minority over-sampling technique (SMOTE) [44] is a widely used oversampling method designed to address the class imbalance problem in machine learning datasets. The SMOTE procedure can be summarized as follows: (1) select a random instance from the minority class, (2) find its k-nearest neighbors within the minority class (k is a user-defined parameter), (3) randomly select one of the k-nearest neighbors, and (4) interpolate a new synthetic instance between the selected instance and its chosen neighbor. Steps 1–4 are repeated until the numbers of samples in the classes are balanced.
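A minimal sketch with the imbalanced-learn implementation follows; `X_train` and `y_train` are assumptions, and oversampling should be applied only to training folds, never to test data.

```python
# Balancing the training data with SMOTE; X_train and y_train are assumed loaded.
from imblearn.over_sampling import SMOTE

smote = SMOTE(k_neighbors=5, random_state=0)         # k is the user-defined parameter
X_bal, y_bal = smote.fit_resample(X_train, y_train)  # classes now equally sized
```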

2.6 Classification algorithm

To implement the IFS method, we need to use supervised classification algorithms as the foundation. Therefore, we selected four efficient algorithms, namely, decision tree (DT) [45], k-nearest neighbor (KNN) [46], random forest (RF) [34], and support vector machine (SVM) [47].
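For illustration, the four base classifiers can be instantiated as follows; the hyperparameters shown are sketch defaults, not the values tuned in this study.

```python
# Illustrative instantiation of the four base classifiers used inside IFS.
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

classifiers = {
    'DT':  DecisionTreeClassifier(),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'RF':  RandomForestClassifier(n_estimators=300),
    'SVM': SVC(kernel='rbf', C=1.0),
}
```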

2.6.1 Decision tree

DT is a machine learning algorithm used for classification and regression tasks [45]. It is a tree-like structure that represents a series of decisions based on specific features, resulting in a predicted outcome or value. The main components of a decision tree are nodes, branches, and leaves. Nodes represent decision points, where the tree splits based on a feature’s value. The root node is the starting point of the tree, whereas the internal nodes represent subsequent decisions. Branches connect nodes and denote the possible outcomes of a decision. Leaves are the terminal nodes of the tree, which contain the final predicted outcome or class label. A DT is constructed by recursively partitioning the data using a set of feature-value rules. It is built top-down, starting with the root node, and selects the best feature to split on at each step. Various metrics, such as Gini impurity or information gain, are used to determine the best split.

2.6.2 K-nearest neighbor

KNN is a simple, yet powerful supervised learning algorithm used for classification and regression tasks [46]. It is a lazy learning algorithm; that is, it does not explicitly build a model during the training phase. Instead, it stores the training instances and only performs computations during the prediction phase. KNN is based on the concept of similarity or distance metrics, assuming that similar data points are likely to share the same class label or target value. The KNN algorithm works as follows: (1) select the number of neighbors (k). The choice of k depends on the problem and can greatly influence the algorithm’s performance; (2) calculate distances: for a new, unseen instance, KNN computes the distance between this instance and each training instance using a distance metric, such as Euclidean, Manhattan, or Minkowski distance; (3) identify k-nearest neighbors: the algorithm selects the k training instances with the smallest distances to the new instance; and (4) make predictions: for classification tasks, the most common class label among the k-nearest neighbors is assigned as the prediction for the new instance.

2.6.3 Random forest

RF is an ensemble learning method that combines multiple DTs to improve prediction accuracy and robustness [34]. It operates by constructing a multitude of DTs during training and outputting the class by majority vote (classification) or the mean prediction (regression) of the individual trees during testing. RF works through a process called bagging (i.e., bootstrap aggregating) combined with feature randomness. For each tree, a random sample of the training data is selected with replacement. This process creates diverse subsets of data for each tree, thereby reducing the likelihood of overfitting. Each tree is grown using the selected subset of data. During the construction process, a random subset of features is selected as candidates for splitting at each node, rather than considering all features. This introduces additional randomness and diversity among the trees. Once all the trees are constructed, the predictions of the individual trees are combined to reach the final decision.

2.6.4 Support vector machine

SVM is a supervised machine learning algorithm widely used for classification and regression tasks [47]. SVM is particularly effective for high-dimensional and linearly separable data; however, it can also handle nonlinear relationships using the kernel trick. The main idea behind SVM is to find the optimal hyperplane that maximizes the margin between different classes while minimizing classification errors. An overview of how the SVM algorithm works for classification tasks is as follows: (1) identify the optimal hyperplane: SVM aims to find the hyperplane that best separates the data points of different classes. The margin is the distance between the hyperplane and the closest data points from each class, known as support vectors. SVM seeks to maximize this margin to ensure the best separation between classes and improve generalization performance. (2) Soft-margin classification: in cases where the data are not linearly separable, SVM introduces a regularization parameter (C) to allow some misclassifications. This parameter controls the trade-off between maximizing the margin and minimizing classification errors. (3) Kernel trick: to handle nonlinear relationships, SVM employs the kernel trick, which transforms the input data into a higher-dimensional space. Popular kernels include the linear, polynomial, radial basis function (RBF), and sigmoid kernels. The choice of kernel and its parameters greatly influences the model’s performance.

2.7 Performance evaluation

The F1 score is an essential measurement in binary classification [48,49,50,51,52]. It has two forms for multi-class classification, namely, macro F1 and weighted F1. To compute them, the F1 score for each class should be computed in advance. When computing the F1 score for the i-th class, samples in this class are regarded as positive samples, whereas all others are regarded as negative samples. Tenfold cross-validation [53,54,55] results can be counted as four entries: true positive (TPi), false positive (FPi), false negative (FNi), and true negative (TNi). Accordingly, the F1 score can be computed by:

$${Precision}_{i}=\frac{{TP}_{i}}{{TP}_{i}+{FP}_{i}}$$
(1)
$${Recall}_{i}=\frac{{TP}_{i}}{{TP}_{i}+{FN}_{i}}$$
(2)
$${F1\;score}_{i}=\frac{2\bullet {Precision}_{i}\bullet {Recall}_{i}}{{Precision}_{i}+{Recall}_{i}}$$
(3)

After that, macro F1 and weighted F1 are defined as:

$$Macro\;F1=\frac{1}{L}{\sum }_{i=1}^{L}{F1\;score}_{i}$$
(4)
$$Weighted\;F1={\sum }_{i=1}^{L}{{w}_{i}\times F1\;score}_{i}$$
(5)

where \(L\) represents the number of classes and \({w}_{i}\) stands for the proportion of samples in the i-th class among all samples. Because weighted F1 further accounts for the distribution of samples across classes, it was selected as the key measurement in this study.

In addition, we employed two other classic measurements, namely, prediction accuracy (ACC) and the Matthews correlation coefficient (MCC) [56]. ACC is defined as the proportion of correctly predicted samples, whereas the definition of MCC is more complex. First, two matrices are constructed, namely, X and Y, where X stores the real classes of samples and Y collects the predicted classes of samples. Then, MCC can be calculated by:

$$MCC=\frac{cov(X,Y)}{\sqrt{cov(X,X)\bullet cov(Y,Y)}}$$
(6)

where \(cov(X,Y)\) denotes the covariance of the two matrices.
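All four measurements can be computed directly with scikit-learn, as sketched below; `y_true` and `y_pred` are assumed to be collected from the tenfold cross-validation.

```python
# Computing ACC, macro F1, weighted F1, and MCC; y_true and y_pred assumed given.
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average='macro')
weighted_f1 = f1_score(y_true, y_pred, average='weighted')  # key measurement here
mcc = matthews_corrcoef(y_true, y_pred)                     # multiclass MCC
```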

3 Results

This study investigated transcriptomic data from COVID-19 patients to predict their quality of life after the acute phase. The data contained 502 patient samples that were divided into three classes based on changes in their quality of life, and each patient was represented by the expression levels of 58,929 genes. To deeply mine essential genes and special patterns for each class, a machine learning-based method was designed, as illustrated in Fig. 1. Following this method, six feature ranking algorithms were applied to the transcriptomic data, generating six feature lists (i.e., the LASSO, LightGBM, MCFS, RF, CATBoost, and XGBoost feature lists). These lists are provided in Supplementary Table S1. Generally, features (genes) with high ranks in a list implied greater contributions to predicting the level of patient sequelae.

3.1 Performance of the classification algorithms on different feature lists

After obtaining the feature lists, we applied the IFS method to each feature list using the four classification algorithms (i.e., DT, KNN, RF, and SVM). Considering that each feature list was very large, testing all possible feature subsets would require a considerable amount of time. Thus, we considered only the top 3000 features in each list because of the limited number of essential features related to PASC. Furthermore, to accelerate the IFS procedure, we added five features to the current subset each time. Accordingly, 600 feature subsets were constructed from each feature list. A classifier was built on each feature subset with a given classification algorithm and evaluated by tenfold cross-validation. The cross-validation results were counted as ACC, MCC, macro F1, and weighted F1, which are available in Supplementary Table S2. To illustrate the performance of each classification algorithm across all feature subsets derived from one feature list, IFS curves were plotted with the weighted F1 on the Y-axis and the number of features in the subset on the X-axis, as displayed in Figs. 2 and 3.

Fig. 2
figure 2

IFS curves for evaluating the performance of four classification algorithms under different top features in three feature lists. A IFS curves based on LASSO feature list. B IFS curves based on LightGBM feature list. C IFS curves based on MCFS feature list. The highest weighted F1 for each IFS curve was marked, along with the number of used features. For the IFS curve containing the highest weighted F1, a relatively high weighted F1 was also marked

Fig. 3
figure 3

IFS curves for evaluating the performance of four classification algorithms under different top features in three feature lists. A IFS curves based on RF feature list. B IFS curves based on CATBoost feature list. C IFS curves based on XGBoost feature list. The highest weighted F1 for each IFS curve was marked, along with the number of used features. For the IFS curve containing the highest weighted F1, a relatively high weighted F1 was also marked

The IFS curves of the four classification algorithms for the LASSO feature list are shown in Fig. 2A. DT, KNN, RF, and SVM provided their highest weighted F1 (0.617, 0.763, 0.749, and 0.773, respectively) when the top 25, 110, 85, and 145 features in the LASSO feature list were used, respectively. Accordingly, the optimal DT, KNN, RF, and SVM classifiers can be obtained using the aforementioned features. Their detailed performance is listed in Table 1. The optimal SVM classifier yielded the best performance, with the highest ACC, MCC, macro F1, and weighted F1.

Table 1 Performance of the optimal classifiers on different feature lists

The IFS curves for the LightGBM feature list are displayed in Fig. 2B. The highest performance of DT, KNN, RF, and SVM was obtained using the top 325, 55, 165, and 45 features in this list, respectively, with weighted F1 values of 0.643, 0.714, 0.855, and 0.762. Likewise, the optimal DT, KNN, RF, and SVM classifiers can be built. The detailed performance of these optimal classifiers is listed in Table 1. Evidently, the optimal RF classifier outperformed the three other optimal classifiers.

For the four remaining feature lists (i.e., the MCFS, RF, CATBoost, and XGBoost feature lists), the IFS curves are shown in Figs. 2C and 3. With similar arguments, the optimal DT, KNN, RF, and SVM classifiers on each feature list can be determined. Their performance is also listed in Table 1. The optimal RF classifier always yielded the best performance on each of the four lists. These classifiers adopted the top 1015 (MCFS feature list), 220 (RF feature list), 90 (CATBoost feature list), and 1020 (XGBoost feature list) features and yielded weighted F1 values of 0.795, 0.795, 0.869, and 0.825, respectively.

Among the optimal SVM classifier on the LASSO feature list and the optimal RF classifiers on the five other feature lists, the optimal RF classifier on the CATBoost feature list was the best, giving the highest weighted F1. Thus, this classifier can serve as a useful tool to predict the quality of life of COVID-19 patients after the acute phase.

3.2 Intersection of key genes from six lists

From Sect. 3.1, the best classifier was found on each feature list. Generally, the features used in these classifiers were essential for predicting the quality of life of COVID-19 patients after the acute phase. However, these features were excessive for most classifiers. For example, the optimal SVM classifier on the LASSO feature list adopted 145 features, whereas the optimal RF classifier on the MCFS feature list used 1015 features. The most essential genes cannot be screened out in this case. Thus, we needed to further extract the most essential genes from each feature subset used to build an efficient classifier. By checking the IFS results with SVM on the LASSO feature list and the IFS results with RF on the five other lists, we can determine another, smaller feature subset on which SVM or RF exhibited performance similar to that of the optimal SVM or RF classifier. Notably, we did not find such a feature subset for RF on the CATBoost feature list because the optimal RF classifier on this list needed only 90 features. The sizes of the feature subsets on the five other feature lists are listed in Table 2, along with the performance of the SVM or RF classifiers on these subsets. At a glance, such performance was only slightly lower than that of the optimal SVM or RF classifiers, a fact clearly confirmed by Fig. 4, while the number of used features declined sharply. We call the classifiers using these features the sub-optimal classifiers. For convenience, the optimal RF classifier on the CATBoost feature list was also termed a sub-optimal classifier. Evidently, the features used in the sub-optimal classifiers were more important than the remaining features used in the optimal classifiers, because adding the latter enhanced performance only marginally. For example, the sub-optimal SVM classifier on the LASSO feature list used the top 80 features in this list and yielded a weighted F1 of 0.748, whereas the optimal SVM classifier on the same list used the top 145 features and generated a weighted F1 of 0.773. The additional 65 (145 − 80) features used in the optimal SVM classifier improved the weighted F1 by only 0.025 (0.773 − 0.748), which was considered a limited improvement. Thus, these 65 features were less important than the 80 features used in the sub-optimal SVM classifier.

Table 2 Performance of sub-optimal classifiers on different feature lists
Fig. 4
figure 4

Performance gap between the optimal and sub-optimal classifiers on different feature lists. A Radar plot based on LASSO feature list. B Radar plot based on LightGBM feature list. C Radar plot based on MCFS feature list. D Radar plot based on RF feature list. E Radar plot based on XGBoost feature list. The performance of sub-optimal classifier was quite similar to that of the optimal classifier

With this analysis, a sub-optimal classifier was obtained on each feature list. These classifiers used the top 80 (LASSO feature list), 70 (LightGBM feature list), 215 (MCFS feature list), 105 (RF feature list), 90 (CATBoost feature list), and 185 (XGBoost feature list) features in the corresponding lists. The union of these features (genes), comprising 675 genes, may provide a full picture of the essential differences among the three classes. An upset graph was plotted to show the intersections of these feature subsets, as illustrated in Fig. 5. Some genes were identified as essential by three or more algorithms. The genes identified by one to six algorithms are detailed in Supplementary Table S3. Generally, genes identified by multiple algorithms should be given more attention. However, those identified by a single algorithm may still be important; thus, we also listed them in Supplementary Table S3 for later investigation.

Fig. 5
figure 5

Upset graph to show the essential genes identified by LASSO, LightGBM, MCFS, RF-based, CATBoost, and XGBoost. Some genes were identified by multiple algorithms

3.3 Classification rules

According to the performance of the optimal classifiers on the six lists (Table 1), the optimal DT classifiers always yielded the lowest performance; their weighted F1 values were all below 0.7. However, DT has a special merit not shared by the three other classification algorithms: it is a white-box algorithm whose classification procedures are completely open and can hence provide more medical insights. From the optimal DT classifier on each feature list, we extracted a rule group. Each rule corresponds to a path from the root to one leaf and contains a number of genes with specific thresholds, indicating a special expression pattern for the predicted result (one of the three classes). The six rule groups are provided in Supplementary Table S4, and the number of rules in each group is listed in Table 3. The LASSO feature list produced the most rules, followed by the MCFS, RF, LightGBM, CATBoost, and XGBoost feature lists. In each group, rules were built for each class. The distribution of rules in each group across the three classes is illustrated in Fig. 6. The rules for the class Worse were always the most numerous in each group. As for Better and The Same, more rules were generated for The Same than for Better on the LASSO and LightGBM feature lists, whereas the opposite held for the MCFS, RF, and XGBoost feature lists. Representative rules are analyzed in Sect. 4.
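As a hedged sketch of how such root-to-leaf rules can be read out of a fitted tree, scikit-learn’s export_text prints one thresholded path per leaf; `optimal_cols` and `gene_names` are hypothetical placeholders for the selected feature indices and their gene symbols.

```python
# Extracting human-readable rules from a fitted decision tree.
from sklearn.tree import DecisionTreeClassifier, export_text

dt = DecisionTreeClassifier().fit(X[:, optimal_cols], y)
rules = export_text(dt, feature_names=[gene_names[i] for i in optimal_cols])
print(rules)   # each root-to-leaf path is one rule with expression thresholds
```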

Table 3 Number of rules in six rule groups
Fig. 6
figure 6

Number of rules for three classes in six rule groups that were extracted from the optimal DT classifiers on six feature lists

4 Discussion

Using whole blood transcriptomic data from patients, as described above, we identified a number of potential biomarkers that could reveal the differences among patients with different qualities of life following SARS-CoV-2 infection. Research on the sequelae of SARS-CoV-2 remains very limited. However, the identified biomarkers and associated decision rules show relevance to SARS-CoV-2-related pathogenesis. They could also contribute to the development of appropriate drugs to improve the prognosis of SARS-CoV-2-infected patients, especially those who may suffer from severe sequelae.

4.1 Genes associated with differential quality of life after COVID-19 recovery

Based on the machine learning-based approach, we identified six groups of genes that were used in the six sub-optimal classifiers. These genes were deemed important in helping to identify the severity of different SARS-CoV-2 sequelae. Here, we analyzed the essential genes that were identified by three or more feature ranking algorithms. According to recent publications, some of these genes are involved in biological processes such as viral RNA editing and immune regulation.

CCDC18 (ENSG00000122483.17; identified by 5/6 methods) is a potential RcRE gene with sequence and function similar to those of the viral regulatory protein Rev of human immunodeficiency virus, and it may play a complex regulatory role in the retroviral infection of cells [57, 58]. SARS-CoV-2 RNA can be reverse transcribed and integrated into the human genome, although the integrated sequences cannot produce infectious or transmissible virus [59]. Although the role of CCDC18 in SARS-CoV-2 infection has not been reported, our study shows that CCDC18 is associated with sequelae and quality of life after SARS-CoV-2 infection.

ADARB2 (ENSG00000185736.16) is another important signature gene, ranked highly by most (4/6) of the algorithms. ADARB2 is highly expressed in patients with severe SARS-CoV-2 infection and severe influenza; it may induce alterations in inflammation-related mediators and lead to changes in SARS-CoV-2 morbidity and mortality through reduced adenosine-to-inosine (A-to-I) editing of endogenous dsRNA [60]. Other studies have also confirmed the association between high ADARB2 expression and severe SARS-CoV-2 infection [61]. Given that RNA editing can be a relevant mechanism for controlling viral evolutionary dynamics, ADARB2 may influence virulence, pathogenicity, and the host response, leading to different qualities of life after SARS-CoV-2 infection [62].

The protein encoded by CXCL8 (ENSG00000169429.11) is a member of the CXC chemokine family and is a major mediator of the inflammatory response, primarily chemotactic to neutrophils [63]. High plasma levels of CXCL8 are often associated with inflammatory diseases. Some studies have reported remarkably higher CXCL8 expression in patients with severe COVID-19 but not in patients with milder cases [64]. Following SARS-CoV-2 infection, the relative expression of CXCL8 is remarkably lower in the infected population than in the uninfected population for most of the post-infection period; however, CXCL8 plasma concentrations are significantly higher on the first day after infection, and lower baseline CXCL8 expression is associated with worse prognosis and more severe clinical response [65]. CXCL8 was investigated in COVID-19, and higher plasma concentrations of CXCL8 were associated with greater mortality [66].

IL4R (ENSG00000077238.14) and IGHM (ENSG00000211899.10) are B cell molecular markers. IgM-type neutralizing antibodies have been shown to help prevent virus transmission and promote cytosolic virus uptake, and their peak after SARS-CoV-2 infection corresponds to a T cell peak and the disappearance of the virus, suggesting that IGHM may be involved in the recovery process after SARS-CoV-2 infection [67, 68]. A remarkable increase in IGHM subtypes in B cells was found in a population vaccinated with the Kexing Biopharm vaccine [69]; in addition, substantial downregulation of IL-4R was observed in patients with mild SARS-CoV-2 infection [69]. Although no correlation between IL4R or IGHM and patient prognosis after SARS-CoV-2 infection has been reported, our study shows that both genes are strongly associated with PASC and patients’ quality of life.

CXCL8, IL4R, and IGHM are all immune-related genes and showed substantial predictive roles across the algorithms (3/6). In addition, some genes that are not directly associated with the immune response are equally important, such as UGT2A3 (ENSG00000135220.11) [70] and CYP26B1 (ENSG00000003137.8) [71]. Several studies have reported that these genes correlate with the severity of SARS-CoV-2 infection. This finding further demonstrates the reliability of our results and shows that these key features can provide guidance for predicting quality of life after SARS-CoV-2 infection. Meanwhile, further studies may facilitate the development of targets and drugs to improve the prognosis of SARS-CoV-2 infection.

We also compared our key feature genes with those reported by Thompson et al. [30], the study from which our data were sourced. That study identified various etiologies of post-acute sequelae of SARS-CoV-2 infection, directly linking these sequelae to the host’s acute response to the virus and providing early insights into their development. Genes such as CCDC18, IL4R, and IGHM, which were also highlighted as critical differentially expressed genes in that study, play a crucial role in PASC. These findings further underscore the reliability of our results and demonstrate that these key features can offer guidance for predicting quality of life after SARS-CoV-2 infection.

4.2 Decision rules related to differential quality of life after COVID-19 recovery

As mentioned above, we identified several sets of valid biomarkers using different feature ranking methods. We further established decision rules from the optimal DT classifiers and selected representative rules derived from the different feature lists for detailed discussion.

In the rule group derived from the CATBoost feature list, many rules for patients with good quality of life (Better) require low CCDC18 expression and high CPED1 (ENSG00000106034.18) expression. As previously mentioned, CCDC18 may be involved in regulating viral infection of cells, and its low expression may be associated with better post-infection outcomes. Reduced expression of CPED1 in peripheral blood has been shown to be associated with poor outcome in SARS-CoV-2 infection [72]. The few studies related to CPED1 have shown that it plays a potential tumor-suppressive role in lung adenocarcinoma, with low expression associated with shorter survival [73]. For patients with deteriorating quality of life (Worse), a subset of rules requires a higher ITGA2 expression level; ITGA2 mediates the adhesion of platelets and other cell types to the extracellular matrix and is involved in coagulation-related biological processes. Recent publications have found that abnormal ITGA2 expression correlates with the prognosis of SARS-CoV-2 infection [74], and a higher degree of mutation in ITGA2 is observed in patients with extremely severe SARS-CoV-2 infection [75].

In the rule group derived from the LightGBM feature list, a large proportion of the rules for patients with poor quality of life (Worse) require low expression of CDC16 (ENSG00000130177.16), which encodes a ubiquitin ligase component involved in cell cycle regulation. Studies have shown that CDC16 protein levels have potential for COVID-19 diagnosis [76]. Given that CDC16 is involved in the cell cycle, its low expression may correlate with the level of immune cell activity and indirectly contribute to poorer quality of life after SARS-CoV-2 infection. High expression of another gene, TMEM176B (ENSG00000106565.18), is also associated with poorer quality of life after SARS-CoV-2 infection. TMEM176B is a possible marker of dendritic cell maturation or differentiation and is upregulated in dendritic cells from SARS-CoV-2-infected patients [77]. Another study found that TMEM176B is overexpressed in monocytes from SARS-CoV-2-infected patients and may be involved in the regulation of inflammasome-dependent T cell dysfunction [78]. Thus, high TMEM176B expression may be associated with immune dysfunction in patients with poorer quality of life. This rule group also shows findings similar to those from the CATBoost feature list; specifically, reduced CPED1 expression correlates with severe consequences of SARS-CoV-2 infection.

In summary, our study identified key decision rules related to differential quality of life after COVID-19 recovery using various computational methods. These rules are based on the expression levels of specific genes, such as CCDC18, CPED1, ITGA2, CDC16, and TMEM176B, each of which plays a unique role in post-infection outcomes. For instance, lower expression of CCDC18 and higher expression of CPED1 are associated with improved quality of life, possibly because of their roles in viral infection regulation and tumor suppression. Higher expression of ITGA2 and TMEM176B and lower expression of CDC16 are linked to poorer quality of life, suggesting potential involvement of coagulation, dendritic cell maturation, inflammasome-dependent T cell dysfunction, and reduced immune cell activity. Our findings align with recent publications and emphasize the importance of these genes in determining the post-COVID-19 recovery experience. Overall, these insights can contribute to a deeper understanding of the molecular mechanisms underlying COVID-19 sequelae and guide future clinical research and interventions aimed at improving the outcomes of patients recovering from SARS-CoV-2 infection.

5 Conclusions

We grouped patients into three classes by comparing their quality of life after the acute phase of COVID-19 with that before the disease. Whole blood transcriptomic data from all patients were analyzed using a series of machine learning algorithms. The analysis enabled the screening of genetic markers that represent different sequelae severities, which tend to be associated with viral immunity and inflammatory triggers. We also constructed high-performance classifiers for predicting sequelae levels, together with classification rules that indicate special expression patterns at different sequelae levels. Furthermore, our study provides a foundation for identifying potential risk factors associated with PASC. Although we have uncovered crucial genetic signatures, further research is still needed to fully elucidate the complex interplay between these markers and the development of PASC. This knowledge will pave the way for the development of targeted therapeutic strategies aimed at improving the quality of life of individuals affected by PASC.