1 Introduction

COVID-19 is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection, and the WHO declared a global pandemic of COVID-19 on March 11, 2020 [1]. The spike subunit of SARS-CoV-2 infects host cells by binding to the human angiotensin-converting enzyme 2 receptor, and the main pathogenic mechanisms include direct viral toxicity, endothelial and microvascular damage, a hypercoagulable state leading to thrombosis and micro-thrombosis, dysregulation of the immune system, and induction of a hyperinflammatory state [2]. After infection with SARS-CoV-2, patients show marked hematological changes, including elevated white blood cell counts, elevated neutrophil counts, thrombocytopenia, decreased CD8+ T lymphocyte counts, and elevated levels of certain biomarkers (e.g., C-reactive protein, procalcitonin, the cytokine IL-6, ferritin, and serum amyloid A).

Many patients experience lingering sequelae as they recover from the virus. Post-acute sequelae of COVID-19 (PASC) is a developing complication of SARS-CoV-2 infection that causes persistent symptoms in patients who have not fully recovered after the acute phase, defined as extending from 4 weeks to no more than 12 weeks after the initial onset of symptoms; it can also lead to multi-organ system sequelae after the acute phase [2,3,4]. PASC is usually nonspecific, with the most common symptoms being fatigue (15–87%), dyspnea (10–71%), chest pain or tightness (12–44%), cough (17–34%), and sleep disturbance (24–26%) [5,6,7,8,9]. The most common gastrointestinal symptoms of PASC include nausea, vomiting, diarrhea, anorexia, and loss of appetite [10,11,12]. Studies have shown that 79.7% of patients with acute COVID-19 develop olfactory loss, especially patients with coexisting fever, and the loss of smell persists in up to 56% of these patients after their recovery from acute infection [13, 14]. PASC severity may be related to multi-omics signatures, including SARS-CoV-2 mRNA detected in blood and distinct phenotypes of CD8+ T and CD4+ T cells [15]. PASC is more prevalent in women than in men [5]. However, the symptoms are more severe and the risk of death is higher in men than in women [16]. A total of 54% of women experienced fatigue 10 weeks after the initial symptoms of COVID-19 [17]. Some studies have also confirmed that older age is associated with PASC [18, 19]. This association may be due to the higher prevalence of comorbidities and severe disease in older age groups. Severe disease in these cases may trigger a highly inflammatory immune response, which leads to a prolonged recovery period from COVID-19. Lower household income is also correlated with the occurrence of PASC, probably because low-income people cannot work from home, lack adequate personal protective equipment, and have overcrowded living conditions that expose them to higher viral doses [20, 21]. PASC affects at least 10–30% of people who test positive for COVID-19 [4]. Thus, understanding the prevalence, predisposing factors, and susceptibility factors for PASC can help with the early identification of, intervention for, and support of COVID-19 patients who are at higher risk of developing PASC.

On the one hand, SARS-CoV-2 may drive chronic symptoms in PASC patients by persisting in certain tissues and organs after acute infection. On the other hand, even when SARS-CoV-2 is completely cleared from the blood, tissues, and nerves of patients, the virus may dysregulate the host immune response during the acute phase of COVID-19. This dysregulation can reactivate pathogens previously carried by the host, allow infection of new body sites, and trigger chronic symptoms. For example, one study reported that a critically ill COVID-19 patient experienced reactivation of latent VZV and HSV-1 infections, which was correlated with the onset of septic shock [22].

Previous cohort studies of PASC had limited sample sizes drawn from specific regions; their measures of symptom severity were based on subjective patient self-reports and were therefore subject to recall and response biases [23]. Previous RNA-seq analyses have also failed to systematically investigate the relationship between PASC and host immune status. In our previous research, we found that machine learning algorithms can identify latent biomarkers and accurately characterize the features of different COVID-19 cohorts, allowing the severity of viral infections to be predicted more effectively. This capability facilitates the optimization of treatment protocols, which improves patient outcomes [24,25,26,27,28]. In this study, we analyzed whole blood gene expression in hospitalized COVID-19 patients and explored the molecular mechanism of COVID-19 sequelae through a group of machine learning algorithms. Essential genes highly related to PASC were obtained, and special expression patterns associated with different PASC levels were extracted. We also constructed efficient classifiers to predict quality of life after COVID-19 infection, which can serve as a useful and convenient tool for predicting the severity of the sequelae of SARS-CoV-2 infection. This work can provide guidance for the development of corresponding liquid biopsy methods and a reference for improving the prognosis of patients infected with SARS-CoV-2.

2 Materials and methods

In this study, we attempted to discover essential genes and patterns that can indicate PASC levels. To this end, a machine learning-based approach was designed and applied to an existing whole blood RNA-seq dataset from COVID-19 patients reported in the GEO database. Figure 1 shows the workflow of the approach. The investigated dataset was grouped based on the quality of life of patients after the acute phase of COVID-19. Our approach first assessed the importance of genes using six feature-ranking methods, yielding six feature lists. Subsequently, each list was fed into a combination of the incremental feature selection (IFS) [29] method and four classification algorithms to further filter the key genes that best distinguish the different classes (changes in quality of life before and after the acute phase of COVID-19) and to determine their special expression patterns.

Fig. 1
figure 1

Flowchart of the entire machine learning-based analysis process. Whole blood transcriptomic data from COVID-19 patients hospitalized during acute infection were analyzed. Samples in this dataset were divided into “Better,” “The Same,” and “Worse” groups by comparing the quality of life of patients after the acute phase of COVID-19 with that before the disease. Each patient was represented by the expression levels of 58,929 genes. Gene expression features were analyzed by six feature ranking algorithms, namely, LASSO, LightGBM, MCFS, RF-based, CATBoost, and XGBoost. The obtained feature lists were fed into an IFS method that combined DT, KNN, RF, and SVM to extract key genes, construct efficient classifiers, and derive classification rules

2.1 Data

Whole blood RNA-seq data from COVID-19 patients hospitalized during acute infection were obtained from the GEO database [30] under the accession number GSE215865. Based on subsequent sequelae follow-up, we compared the quality of life of patients after the acute phase of COVID-19 with that before the disease, as reported by study participants using a self-reported checklist that assesses the emergence of PASC or health changes after COVID-19. Changes in quality of life since the onset of COVID-19 were coded as three text values, namely, “Better,” “The Same,” or “Worse.” A total of 81, 234, and 187 patient samples (502 in total) were labeled with these three values, constituting three groups. Each patient was represented by the expression levels of 58,929 genes.

2.2 Problem description

According to the data described in Sect. 2.1, COVID-19 patient samples were classified into Better, The Same, and Worse. In the classification problem, these can be deemed three labels, denoted by \({L}_{Better}\), \({L}_{Same}\), and \({L}_{Worse}\). The expression levels of 58,929 genes were used to represent each patient. These genes were denoted by \({g}_{1},{g}_{2},\cdots ,{g}_{58929}\). The gene expression levels of the i-th patient sample comprised a vector, denoted by \({Y}_{i}={\left[{y}_{i,1},{y}_{i,2},\cdots ,{y}_{i,58929}\right]}^{T}\), where \({y}_{i,j}\) represents the expression level of gene \({g}_{j}\). Accordingly, the whole blood RNA-seq data mentioned in Sect. 2.1 can be represented by \({\left\{({Y}_{i},{L}_{i})\right\}}_{i=1}^{502}\), where \({L}_{i}\in \left\{{L}_{Better},{L}_{Same},{L}_{Worse}\right\}\) is the label of the i-th patient sample. The analysis of the whole blood RNA-seq data can thus be converted into the investigation of a classification problem involving three labels (\({L}_{Better}\), \({L}_{Same}\), and \({L}_{Worse}\)). Evidently, some genes provide key contributions to distinguishing patients with one label from those with other labels. Thus, the main purpose of this study was to extract these essential genes, such as \({g}_{{i}_{1}},{g}_{{i}_{2}},\cdots ,{g}_{{i}_{k}}\), which can provide high identification ability for a classification algorithm. Furthermore, special patterns, expressed by a group of essential genes along with thresholds on their expression levels, can be inferred for each label. Finally, we also wanted to establish efficient classifiers for identifying the PASC level.

2.3 Feature ranking algorithms

The investigated classification problem contained excessive features (58,929 genes). Evidently, not all features provided positive contributions to classifying samples into the three classes; that is, the essential features were limited. This study mainly aimed to identify such essential features. In the field of machine learning, several methods have been proposed to assess the importance of features, and these methods were suitable for dealing with this dataset. However, selecting a single suitable method was challenging. In our experience, any one method can screen out only a portion of the essential features because each method has its limitations. The essential features identified by one algorithm may be important supplements to those yielded by another algorithm. To discover essential features as completely as possible, this study employed six feature ranking algorithms: least absolute shrinkage and selection operator (LASSO) [31], light gradient boosting machine (LightGBM) [32], Monte Carlo feature selection (MCFS) [33], the random forest (RF)-based method [34], categorical boosting (CATBoost) [35], and extreme gradient boosting (XGBoost) [36]. These algorithms were designed following different rules; hence, they view the dataset from different perspectives, and their combination provides a more complete picture of the essential features. These algorithms have also been used in numerous publications [27, 37,38,39,40,41,42,43]. Brief descriptions of these algorithms are given in the remainder of this section.

2.3.1 Least absolute shrinkage and selection operator

LASSO is a linear regression method introduced by Robert Tibshirani in 1996. It combines regularization and feature selection by adding a penalty term to the objective function, resulting in sparse solutions with reduced model complexity [31]. LASSO has gained substantial attention in machine learning and statistics because of its ability to produce interpretable models and effectively handle high-dimensional data. It employs L1 regularization, which adds the sum of the absolute values of the coefficients to the loss function. This form of regularization helps prevent overfitting by encouraging smaller coefficient values and reducing model complexity. In LASSO, the magnitude of the nonzero coefficients is an indication of feature importance. Larger coefficients imply a greater influence on the target variable, while coefficients close to or equal to zero indicate that the corresponding features have little or no effect on the target variable. By penalizing large coefficients and promoting sparsity, LASSO highlights the most relevant features and filters out the less important ones. Accordingly, all features can be sorted in a list in decreasing order of the absolute values of their coefficients.
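As a hedged illustration, the following minimal sketch ranks genes by the absolute values of their LASSO coefficients; the expression matrix `X`, the encoded labels `y`, and the `alpha` value are assumptions, since the exact preprocessing and hyperparameters are not specified here.

```python
# Hypothetical sketch of LASSO-based feature ranking; X (samples x genes) and
# y (numerically encoded labels) are assumed to be loaded elsewhere.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(X)   # scale so coefficients are comparable
lasso = Lasso(alpha=0.01)                   # alpha is an illustrative choice
lasso.fit(X_std, y)
ranking = np.argsort(-np.abs(lasso.coef_))  # gene indices, most important first
```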

2.3.2 Light gradient boosting machine

LightGBM is a gradient boosting framework developed by Microsoft that has gained considerable popularity among machine learning practitioners [32]. It is designed to be more efficient and scalable than other gradient boosting frameworks, such as XGBoost and CATBoost, making it suitable for large-scale and high-dimensional data. By utilizing novel techniques, such as gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB), LightGBM remarkably reduces the computational burden while maintaining high predictive accuracy. LightGBM provides two common methods for calculating feature importance: (1) split-based importance: this method calculates feature importance based on the number of times a feature is used to split the data across all trees in the model; the more frequently a feature is used to make splits, the higher its importance. (2) Gain-based importance: this method calculates feature importance based on the total gain in accuracy or reduction in the objective function achieved by using a feature for splitting across all trees. This study adopted the former method; that is, features were sorted in a list in decreasing order of their split counts.
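A minimal sketch of split-based importance ranking with LightGBM is given below; `X`, `y`, and the number of boosting rounds are assumptions.

```python
# Split-based feature importance with LightGBM; X and y are assumed loaded.
import numpy as np
import lightgbm as lgb

clf = lgb.LGBMClassifier(importance_type='split', n_estimators=200)
clf.fit(X, y)
# feature_importances_[j] = number of times gene j was used for a split
ranking = np.argsort(-clf.feature_importances_)
```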

2.3.3 Monte Carlo feature selection

MCFS is a probabilistic method for identifying important features in high-dimensional datasets [33]. By using Monte Carlo sampling techniques, MCFS efficiently explores the feature space and evaluates feature importance based on their contributions to model performance. This method has gained attention in various domains, including bioinformatics, text classification, and image recognition, because of its ability to handle large datasets and provide reliable estimates of feature importance. MCFS works as follows: first, s subsets of features are randomly selected, and for each feature subset, t trees are constructed based on t training sets that are randomly selected. Therefore, s × t trees are constructed. The relative importance of a feature can be estimated by considering the weighted accuracy of the prediction, the information gain, and the number of samples affected by a split in these trees. Accordingly, features are ranked in a list in terms of their relative importance.
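The sketch below illustrates the core MCFS idea only, under stated simplifications: it credits features via accuracy-weighted tree importances over random feature subsets, whereas the published method uses a more elaborate relative-importance formula; `X`, `y`, and the parameters `s`, `t`, and `m` are assumptions.

```python
# Simplified illustration of Monte Carlo feature selection; X, y assumed loaded.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_genes = X.shape[1]
importance = np.zeros(n_genes)
s, t, m = 100, 5, 500          # s feature subsets, t trees each, m genes per subset
for _ in range(s):
    subset = rng.choice(n_genes, size=m, replace=False)
    for _ in range(t):
        X_tr, X_te, y_tr, y_te = train_test_split(X[:, subset], y, test_size=0.3)
        tree = DecisionTreeClassifier().fit(X_tr, y_tr)
        acc = tree.score(X_te, y_te)
        importance[subset] += acc * tree.feature_importances_  # weight by accuracy
ranking = np.argsort(-importance)
```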

2.3.4 Random forest-based method

RFs provide a natural way to compute feature importance because they are built from a collection of decision trees [34]. In general, feature importance in the RF-based method is calculated based on the average contribution of a feature to the improvement in the predictive performance of the trees in the ensemble. Feature importance can be calculated using two common methods: (1) Mean Decrease in Impurity (MDI) or Gini Importance: this method calculates feature importance based on the average decrease in impurity, such as the Gini index or entropy, when a feature is used to split the data across all trees in the forest; the more a feature contributes to reducing impurity, the higher its importance. MDI is the default method in many machine learning libraries, such as scikit-learn. (2) Mean Decrease in Accuracy (MDA) or Permutation Importance: this method calculates feature importance by randomly permuting the values of a feature and measuring the change in the model’s predictive performance, such as accuracy or mean squared error; a larger decrease in performance indicates a more important feature. MDA is model-agnostic, meaning it can also be applied to other models. This study adopted the RF-based method using MDA to measure feature importance; features were sorted in the list in decreasing order of MDA.
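A minimal sketch of the MDA scheme via scikit-learn’s permutation importance follows; `X`, `y`, and the forest size are assumptions, and in practice the full 58,929-gene matrix would make this computation expensive.

```python
# Permutation importance (MDA) with a random forest; X and y are assumed loaded.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rf = RandomForestClassifier(n_estimators=300).fit(X, y)
result = permutation_importance(rf, X, y, n_repeats=5, random_state=0)
ranking = np.argsort(-result.importances_mean)  # larger accuracy drop = more important
```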

2.3.5 Categorical boosting

CATBoost is a gradient boosting framework developed by Yandex to handle categorical features efficiently [35]. Gradient boosting is an ensemble learning method that builds a strong model by iteratively adding weak learners (typically decision trees) to minimize a loss function. CATBoost offers competitive performance and ease of use compared with other popular gradient boosting frameworks, such as XGBoost and LightGBM. Its unique selling point is its ability to handle categorical features effectively using an innovative approach called ordered target statistics, which, together with carefully designed regularization techniques, also makes it less prone to overfitting. In CATBoost, feature importance is calculated using a technique called permutation importance. The basic idea behind this method is to measure the effect of a feature on the model’s performance by randomly shuffling the values of that feature and observing the change in model performance; the larger the decrease in performance after shuffling a feature, the more important that feature is considered to be. Accordingly, features are ranked in a list in decreasing order of their permutation importance.
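The sketch below shows one way to obtain a permutation-style ranking from CatBoost, using its loss-function-change importance as a stand-in; `X`, `y`, and the hyperparameters are assumptions.

```python
# Permutation-style importance with CatBoost; X and y are assumed loaded.
import numpy as np
from catboost import CatBoostClassifier, Pool

model = CatBoostClassifier(iterations=300, verbose=False).fit(X, y)
pool = Pool(X, y)
# LossFunctionChange estimates how much the loss worsens when a feature is perturbed
imp = model.get_feature_importance(data=pool, type='LossFunctionChange')
ranking = np.argsort(-np.asarray(imp))
```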

2.3.6 Extreme gradient boosting

XGBoost is an open-source library that provides an efficient and scalable implementation of the gradient boosting algorithm [36]. Developed by Tianqi Chen and Carlos Guestrin, XGBoost has gained popularity among machine learning practitioners because of its ability to deliver state-of-the-art performance on various tasks, such as classification and regression. By utilizing a number of optimizations and parallelization techniques, XGBoost has become a go-to library for many data scientists and machine learning engineers. XGBoost provides built-in methods to compute feature importance based on the average contribution of a feature to the improvement in the predictive performance of the trees in the ensemble. Three measures quantify the relative importance of a feature in XGBoost: (1) Split: the number of times a feature is used to split the data across all trees in the ensemble; a higher value indicates that the feature is used more frequently and is thus considered more important. (2) Gain: the average improvement in the splitting criterion (e.g., Gini index or information gain) brought about by a feature when it is used to split the data across all trees; a higher value indicates that the feature contributes more to the improvement of the model’s performance. (3) Coverage: the average coverage of a feature when it is used to split the data across all trees, where coverage refers to the number of samples affected by the split; a higher value indicates that the feature affects a larger portion of the dataset. The current study used XGBoost with the first measure (Split) to rank features.
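A minimal sketch of split-count ranking with XGBoost follows; `X`, `y`, and the model settings are assumptions.

```python
# Split-count ("weight") importance with XGBoost; X and y are assumed loaded.
import numpy as np
import xgboost as xgb

clf = xgb.XGBClassifier(n_estimators=200).fit(X, y)
# get_score returns {'f0': count, ...}; unused features are simply absent
scores = clf.get_booster().get_score(importance_type='weight')
importance = np.zeros(X.shape[1])
for name, count in scores.items():
    importance[int(name[1:])] = count   # 'f12' -> index 12
ranking = np.argsort(-importance)
```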

The six aforementioned feature ranking algorithms were applied to the whole blood RNA-seq data described in Sect. 2.1. Each algorithm generated one feature list. For ease of description, these lists are called the LASSO, LightGBM, MCFS, RF, CATBoost, and XGBoost feature lists.

2.4 Incremental feature selection

Although the six aforementioned algorithms measure the importance of features by ranking them in lists, they cannot answer which features should be selected as essential. In view of this, we employed the IFS [29] method, a heuristic search-based feature selection technique that aims to find the optimal subset of features for a given classification algorithm. The main idea behind IFS is to incrementally add features from a given list to the current subset, evaluate the performance of the resulting feature subset, and select the best performing subset. The IFS procedure can be summarized as follows: (1) initialize an empty feature subset; (2) add the next feature in the given feature list to the current subset and evaluate the performance of the resulting feature subset using a predefined criterion (e.g., accuracy, F1 score, or AUC-ROC) and a specific classification algorithm (e.g., logistic regression, decision tree, or SVM); this evaluation can be performed using techniques such as cross-validation to estimate the generalization performance of the model; (3) after all the desired features are added and evaluated, the feature subset that performs best is selected as the optimal feature subset, and the corresponding classifier is deemed the optimal classifier. When the feature list is very large, IFS can consider only the top features according to the traits of the investigated problem, and Step 2 can be modified by adding a fixed number of features (e.g., five or ten) to the current subset each time. This modification reduces the computation time sharply.
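A minimal IFS sketch is shown below; the step size of five and the top-3000 cutoff follow the text, while `ranked_genes`, `X`, `y`, and the choice of SVM as the base classifier are assumptions.

```python
# IFS sketch: grow the feature subset five genes at a time and keep the best.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

best_f1, best_k = -1.0, 0
for k in range(5, 3001, 5):
    cols = ranked_genes[:k]                       # top-k genes from one feature list
    f1 = cross_val_score(SVC(), X[:, cols], y,
                         cv=10, scoring='f1_weighted').mean()
    if f1 > best_f1:
        best_f1, best_k = f1, k
print(f'optimal subset: top {best_k} genes, weighted F1 = {best_f1:.3f}')
```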

2.5 Synthetic minority oversampling technique

An examination of the number of patient samples in the three classes showed that the class The Same was nearly three times as large as the class Better (234 vs. 81 samples), indicating that the dataset was imbalanced. A classifier built directly on this dataset may be biased, so this problem had to be addressed. Synthetic minority over-sampling technique (SMOTE) [44] is a widely used oversampling method designed to address the class imbalance problem in machine learning datasets. The SMOTE procedure can be summarized as follows: (1) select a random instance from the minority class, (2) find its k-nearest neighbors within the minority class (k is a user-defined parameter), (3) randomly select one of the k-nearest neighbors, and (4) interpolate a new synthetic instance between the selected instance and its chosen neighbor. Steps 1–4 are repeated until the numbers of samples in the classes are balanced.
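A minimal sketch with the imbalanced-learn implementation follows; `X_train` and `y_train` are assumptions, and oversampling should be applied only to training folds, never to test data.

```python
# Balancing the training data with SMOTE; X_train and y_train are assumed loaded.
from imblearn.over_sampling import SMOTE

smote = SMOTE(k_neighbors=5, random_state=0)         # k is the user-defined parameter
X_bal, y_bal = smote.fit_resample(X_train, y_train)  # classes now equally sized
```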

2.6 Classification algorithm

To implement the IFS method, we need to use supervised classification algorithms as the foundation. Therefore, we selected four efficient algorithms, namely, decision tree (DT) [45], k-nearest neighbor (KNN) [46], random forest (RF) [34], and support vector machine (SVM) [47].
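For illustration, the four base classifiers can be instantiated as follows; the hyperparameters shown are sketch defaults, not the values tuned in this study.

```python
# Illustrative instantiation of the four base classifiers used inside IFS.
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

classifiers = {
    'DT':  DecisionTreeClassifier(),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'RF':  RandomForestClassifier(n_estimators=300),
    'SVM': SVC(kernel='rbf', C=1.0),
}
```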

2.6.1 Decision tree

DT is a machine learning algorithm used for classification and regression tasks [45]. It is a tree-like structure that represents a series of decisions based on specific features, resulting in a predicted outcome or value. The main components of a decision tree are nodes, branches, and leaves. Nodes represent decision points, where the tree splits based on a feature’s value. The root node is the starting point of the tree, whereas the internal nodes represent subsequent decisions. Branches connect nodes and denote the possible outcomes of a decision. Leaves are the terminal nodes of the tree, which contain the final predicted outcome or class label. A DT is constructed by recursively partitioning the data using a set of feature-value rules. It is built top-down, starting with the root node, and selects the best feature to split on at each step. Various metrics, such as Gini impurity or information gain, are used to determine the best split.

2.6.2 K-nearest neighbor

KNN is a simple, yet powerful supervised learning algorithm used for classification and regression tasks [46]. It is a lazy learning algorithm; that is, it does not explicitly build a model during the training phase. Instead, it stores the training instances and only performs computations during the prediction phase. KNN is based on the concept of similarity or distance metrics, assuming that similar data points are likely to share the same class label or target value. The KNN algorithm works as follows: (1) select the number of neighbors (k). The choice of k depends on the problem and can greatly influence the algorithm’s performance; (2) calculate distances: for a new, unseen instance, KNN computes the distance between this instance and each training instance using a distance metric, such as Euclidean, Manhattan, or Minkowski distance; (3) identify k-nearest neighbors: the algorithm selects the k training instances with the smallest distances to the new instance; and (4) make predictions: for classification tasks, the most common class label among the k-nearest neighbors is assigned as the prediction for the new instance.

2.6.3 Random forest

RF is an ensemble learning method that combines multiple DTs to improve prediction accuracy and robustness [34]. It operates by constructing a multitude of DTs during training and outputting the class by majority vote (classification) or the mean prediction (regression) of the individual trees during testing. RF works through a process called bagging (i.e., bootstrap aggregating) combined with feature randomness. For each tree, a random sample of the training data is selected with replacement. This process creates diverse subsets of data for each tree, thereby reducing the likelihood of overfitting. Each tree is grown using the selected subset of data. During the construction process, a random subset of features is selected as candidates for splitting at each node, rather than considering all features. This introduces additional randomness and diversity among the trees. Once all the trees are constructed, the predictions of the individual trees are combined to reach the final decision.

2.6.4 Support vector machine

SVM is a supervised machine learning algorithm widely used for classification and regression tasks [47]. SVM is particularly effective for high-dimensional and linearly separable data; however, it can also handle nonlinear relationships using the kernel trick. The main idea behind SVM is to find the optimal hyperplane that maximizes the margin between different classes while minimizing classification errors. An overview of how the SVM algorithm works for classification tasks is as follows: (1) identify the optimal hyperplane: SVM aims to find the hyperplane that best separates the data points of different classes. The margin is the distance between the hyperplane and the closest data points from each class, known as support vectors. SVM seeks to maximize this margin to ensure the best separation between classes and improve generalization performance. (2) Soft-margin classification: in cases where the data are not linearly separable, SVM introduces a regularization parameter (C) to allow some misclassifications. This parameter controls the trade-off between maximizing the margin and minimizing classification errors. (3) Kernel trick: to handle nonlinear relationships, SVM employs the kernel trick, which transforms the input data into a higher-dimensional space. Popular kernels include the linear, polynomial, radial basis function (RBF), and sigmoid kernels. The choice of kernel and its parameters greatly influences the model’s performance.

2.7 Performance evaluation

The F1 score is an essential measurement in binary classification [48,49,50,51,52]. It has two forms for multi-class classification, namely, macro F1 and weighted F1. To compute them, the F1 score for each class should be computed in advance. When computing the F1 score for the i-th class, samples in this class are regarded as positive samples, whereas all others are regarded as negative samples. Tenfold cross-validation [53,54,55] results can be counted as four entries: true positive (TPi), false positive (FPi), false negative (FNi), and true negative (TNi). Accordingly, the F1 score can be computed by:

$${Precision}_{i}=\frac{{TP}_{i}}{{TP}_{i}+{FP}_{i}}$$
(1)
$${Recall}_{i}=\frac{{TP}_{i}}{{TP}_{i}+{FN}_{i}}$$
(2)
$${F1\;score}_{i}=\frac{2\bullet {Precision}_{i}\bullet {Recall}_{i}}{{Precision}_{i}+{Recall}_{i}}$$
(3)

After that, macro F1 and weighted F1 are defined as:

$$Macro\;F1=\frac{1}{L}{\sum }_{i=1}^{L}{F1\;score}_{i}$$
(4)
$$Weighted\;F1={\sum }_{i=1}^{L}{{w}_{i}\times F1\;score}_{i}$$
(5)

where \(L\) represents the number of classes and \({w}_{i}\) stands for the proportion of samples in the i-th class among all samples. Because weighted F1 further accounts for the distribution of samples across classes, it was selected as the key measurement in this study.

In addition, we employed two other classic measurements, namely, prediction accuracy (ACC) and the Matthews correlation coefficient (MCC) [56]. ACC is defined as the proportion of correctly predicted samples, whereas the definition of MCC is more complex. First, two matrices are constructed, namely, X and Y, where X stores the real classes of samples and Y collects the predicted classes of samples. Then, MCC can be calculated by:

$$MCC=\frac{cov(X,Y)}{\sqrt{cov(X,X)\bullet cov(Y,Y)}}$$
(6)

where \(cov(X,Y)\) denotes the covariance of the two matrices.
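All four measurements can be computed directly with scikit-learn, as sketched below; `y_true` and `y_pred` are assumed to be collected from the tenfold cross-validation.

```python
# Computing ACC, macro F1, weighted F1, and MCC; y_true and y_pred assumed given.
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average='macro')
weighted_f1 = f1_score(y_true, y_pred, average='weighted')  # key measurement here
mcc = matthews_corrcoef(y_true, y_pred)                     # multiclass MCC
```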

3 Results

This study investigated transcriptomic data from COVID-19 patients to predict their quality of life after the acute phase. The data contained 502 patient samples that were divided into three classes based on changes in their quality of life, and each patient was represented by the expression levels of 58,929 genes. To deeply mine essential genes and special patterns for each class, a machine learning-based method was designed, as illustrated in Fig. 1. Following this method, six feature ranking algorithms were applied to the transcriptomic data, generating six feature lists (i.e., the LASSO, LightGBM, MCFS, RF, CATBoost, and XGBoost feature lists). These lists are provided in Supplementary Table S1. Generally, features (genes) with high ranks in a list implied greater contributions to predicting the level of patient sequelae.

3.1 Performance of the classification algorithms on different feature lists

After obtaining the feature lists, we applied the IFS method to each feature list using the four classification algorithms (i.e., DT, KNN, RF, and SVM). Considering that each feature list was very large, testing all possible feature subsets would require a considerable amount of time. Thus, we considered only the top 3000 features in each list because of the limited number of essential features related to PASC. Furthermore, to accelerate the IFS procedure, we added five features to the current subset each time. Accordingly, 600 feature subsets were constructed from each feature list. A classifier was built on each feature subset with a given classification algorithm and evaluated by tenfold cross-validation. The cross-validation results were counted as ACC, MCC, macro F1, and weighted F1, which are available in Supplementary Table S2. To illustrate the performance of each classification algorithm across all feature subsets derived from one feature list, IFS curves were plotted with the weighted F1 on the Y-axis and the number of features in the subset on the X-axis, as displayed in Figs. 2 and 3.

Fig. 2
figure 2

IFS curves for evaluating the performance of four classification algorithms under different top features in three feature lists. A IFS curves based on LASSO feature list. B IFS curves based on LightGBM feature list. C IFS curves based on MCFS feature list. The highest weighted F1 for each IFS curve was marked, along with the number of used features. For the IFS curve containing the highest weighted F1, a relatively high weighted F1 was also marked

Fig. 3
figure 3

IFS curves for evaluating the performance of four classification algorithms under different top features in three feature lists. A IFS curves based on RF feature list. B IFS curves based on CATBoost feature list. C IFS curves based on XGBoost feature list. The highest weighted F1 for each IFS curve was marked, along with the number of used features. For the IFS curve containing the highest weighted F1, a relatively high weighted F1 was also marked

The IFS curves of the four classification algorithms for the LASSO feature list are shown in Fig. 2A. DT, KNN, RF, and SVM provided their highest weighted F1 (0.617, 0.763, 0.749, and 0.773, respectively) when the top 25, 110, 85, and 145 features in the LASSO feature list were used, respectively. Accordingly, the optimal DT, KNN, RF, and SVM classifiers can be obtained using the aforementioned features. Their detailed performance is listed in Table 1. The optimal SVM classifier yielded the best performance, with the highest ACC, MCC, macro F1, and weighted F1.

Table 1 Performance of the optimal classifiers on different feature lists

The IFS curves for the LightGBM feature list are displayed in Fig. 2B. The highest performance of DT, KNN, RF, and SVM was obtained using the top 325, 55, 165, and 45 features in this list, respectively, with weighted F1 values of 0.643, 0.714, 0.855, and 0.762. Likewise, the optimal DT, KNN, RF, and SVM classifiers can be built. The detailed performance of these optimal classifiers is listed in Table 1. Evidently, the optimal RF classifier outperformed the three other optimal classifiers.

For the four remaining feature lists (i.e., the MCFS, RF, CATBoost, and XGBoost feature lists), the IFS curves are shown in Figs. 2C and 3. With similar arguments, the optimal DT, KNN, RF, and SVM classifiers on each feature list can be determined. Their performance is also listed in Table 1. The optimal RF classifier always yielded the best performance on each of the four lists. These classifiers adopted the top 1015 (MCFS feature list), 220 (RF feature list), 90 (CATBoost feature list), and 1020 (XGBoost feature list) features and yielded weighted F1 values of 0.795, 0.795, 0.869, and 0.825, respectively.

Among the optimal SVM classifier on the LASSO feature list and the optimal RF classifiers on the five other feature lists, the optimal RF classifier on the CATBoost feature list was the best, giving the highest weighted F1. Thus, this classifier can serve as a useful tool to predict the quality of life of COVID-19 patients after the acute phase.

3.2 Intersection of key genes from six lists

From Sect. 3.1, the best classifier was found on each feature list. Generally, the features used in these classifiers were essential for predicting the quality of life of COVID-19 patients after the acute phase. However, these features were excessive for most classifiers. For example, the optimal SVM classifier on the LASSO feature list adopted 145 features, whereas the optimal RF classifier on the MCFS feature list used 1015 features. The most essential genes cannot be screened out in this case. Thus, we needed to further extract the most essential genes from each feature subset used to build an efficient classifier. By checking the IFS results with SVM on the LASSO feature list and the IFS results with RF on the five other lists, we can determine another, smaller feature subset on which SVM or RF exhibited performance similar to that of the optimal SVM or RF classifier. Notably, we did not find such a feature subset for RF on the CATBoost feature list because the optimal RF classifier on this list needed only 90 features. The sizes of the feature subsets on the five other feature lists are listed in Table 2, along with the performance of the SVM or RF classifiers on these subsets. At a glance, such performance was only slightly lower than that of the optimal SVM or RF classifiers, a fact clearly confirmed by Fig. 4, while the number of used features declined sharply. We call the classifiers using these features the sub-optimal classifiers. For convenience, the optimal RF classifier on the CATBoost feature list was also termed a sub-optimal classifier. Evidently, the features used in the sub-optimal classifiers were more important than the remaining features used in the optimal classifiers, because adding the latter enhanced performance only marginally. For example, the sub-optimal SVM classifier on the LASSO feature list used the top 80 features in this list and yielded a weighted F1 of 0.748, whereas the optimal SVM classifier on the same list used the top 145 features and generated a weighted F1 of 0.773. The additional 65 (145 − 80) features used in the optimal SVM classifier improved the weighted F1 by only 0.025 (0.773 − 0.748), which was considered a limited improvement. Thus, these 65 features were less important than the 80 features used in the sub-optimal SVM classifier.

Table 2 Performance of sub-optimal classifiers on different feature lists
Fig. 4
figure 4

Performance gap between the optimal and sub-optimal classifiers on different feature lists. A Radar plot based on LASSO feature list. B Radar plot based on LightGBM feature list. C Radar plot based on MCFS feature list. D Radar plot based on RF feature list. E Radar plot based on XGBoost feature list. The performance of sub-optimal classifier was quite similar to that of the optimal classifier

With this analysis, a sub-optimal classifier was obtained on each feature list. These classifiers used the top 80 (LASSO feature list), 70 (LightGBM feature list), 215 (MCFS feature list), 105 (RF feature list), 90 (CATBoost feature list), and 185 (XGBoost feature list) features in the corresponding lists. The union of these features (genes), comprising 675 genes, may provide a full picture of the essential differences among the three classes. An upset graph was plotted to show the intersections of these feature subsets, as illustrated in Fig. 5. Some genes were identified as essential by three or more algorithms. The genes identified by one to six algorithms are detailed in Supplementary Table S3. Generally, genes identified by multiple algorithms should be given more attention. However, those identified by a single algorithm may still be important; thus, we also listed them in Supplementary Table S3 for later investigation.

Fig. 5
figure 5

Upset graph to show the essential genes identified by LASSO, LightGBM, MCFS, RF-based, CATBoost, and XGBoost. Some genes were identified by multiple algorithms

3.3 Classification rules

According to the performance of the optimal classifiers on the six lists (Table 1), the optimal DT classifiers always yielded the lowest performance; their weighted F1 values were all below 0.7. However, DT has a special merit not shared by the three other classification algorithms: it is a white-box algorithm whose classification procedures are completely open and can hence provide more medical insights. From the optimal DT classifier on each feature list, we extracted a rule group. Each rule corresponds to a path from the root to one leaf and contains a number of genes with specific thresholds, indicating a special expression pattern for the predicted result (one of the three classes). The six rule groups are provided in Supplementary Table S4, and the number of rules in each group is listed in Table 3. The LASSO feature list produced the most rules, followed by the MCFS, RF, LightGBM, CATBoost, and XGBoost feature lists. In each group, rules were built for each class. The distribution of rules in each group across the three classes is illustrated in Fig. 6. The rules for the class Worse were always the most numerous in each group. As for Better and The Same, more rules were generated for The Same than for Better on the LASSO and LightGBM feature lists, whereas the opposite held for the MCFS, RF, and XGBoost feature lists. Representative rules are analyzed in Sect. 4.
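As a hedged sketch of how such root-to-leaf rules can be read out of a fitted tree, scikit-learn’s export_text prints one thresholded path per leaf; `optimal_cols` and `gene_names` are hypothetical placeholders for the selected feature indices and their gene symbols.

```python
# Extracting human-readable rules from a fitted decision tree.
from sklearn.tree import DecisionTreeClassifier, export_text

dt = DecisionTreeClassifier().fit(X[:, optimal_cols], y)
rules = export_text(dt, feature_names=[gene_names[i] for i in optimal_cols])
print(rules)   # each root-to-leaf path is one rule with expression thresholds
```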

Table 3 Number of rules in six rule groups
Fig. 6
figure 6

Number of rules for three classes in six rule groups that were extracted from the optimal DT classifiers on six feature lists

4 Discussion

Using whole blood transcriptomic data from patients, as described above, we identified a number of potential biomarkers that could reveal the differences among patients with different qualities of life following SARS-CoV-2 infection. Research on the sequelae of SARS-CoV-2 remains very limited. However, the identified biomarkers and associated decision rules show relevance to SARS-CoV-2-related pathogenesis. They could also contribute to the development of appropriate drugs to improve the prognosis of SARS-CoV-2-infected patients, especially those who may suffer from severe sequelae.

4.1 Genes associated with differential quality of life after COVID-19 recovery

Based on the machine learning-based approach, we identified six groups of genes that were used in the six sub-optimal classifiers. These genes were deemed important in helping to identify the severity of different SARS-CoV-2 sequelae. Here, we analyzed the essential genes that were identified by three or more feature ranking algorithms. According to recent publications, some of these genes are involved in biological processes such as viral RNA editing and immune regulation.

CCDC18 (ENSG00000122483.17; identified by 5/6 methods) is a potential RcRE gene with sequence and function similar to those of the viral regulatory protein Rev of human immunodeficiency virus, and it may play a complex regulatory role in the retroviral infection of cells [57, 58]. SARS-CoV-2 RNA can be reverse transcribed and integrated into the human genome, although the integrated sequences cannot produce infectious or transmissible virus [59]. Although the role of CCDC18 in SARS-CoV-2 infection has not been reported, our study shows that CCDC18 is associated with sequelae and quality of life after SARS-CoV-2 infection.

ADARB2 (ENSG00000185736.16) is another important signature gene, ranked highly by most (4/6) of the algorithms. ADARB2 is highly expressed in patients with severe SARS-CoV-2 infection and severe influenza; it may induce alterations in inflammation-related mediators and lead to changes in SARS-CoV-2 morbidity and mortality through reduced adenosine-to-inosine (A-to-I) editing of endogenous dsRNA [60]. Other studies have also confirmed the association between high ADARB2 expression and severe SARS-CoV-2 infection [61]. Given that RNA editing can be a relevant mechanism for controlling viral evolutionary dynamics, ADARB2 may influence virulence, pathogenicity, and the host response, leading to different qualities of life after SARS-CoV-2 infection [62].

The protein encoded by CXCL8 (ENSG00000169429.11) is a member of the CXC chemokine family and is a major mediator of the inflammatory response, primarily chemotactic to neutrophils [63]. High plasma levels of CXCL8 are often associated with inflammatory diseases. Some studies have reported remarkably higher CXCL8 expression in patients with severe COVID-19 but not in patients with milder cases [64]. Following SARS-CoV-2 infection, the relative expression of CXCL8 is remarkably lower in the infected population than in the uninfected population for most of the post-infection period; however, CXCL8 plasma concentrations are significantly higher on the first day after infection, and lower baseline CXCL8 expression is associated with worse prognosis and more severe clinical response [65]. CXCL8 was investigated in COVID-19, and higher plasma concentrations of CXCL8 were associated with greater mortality [66].

IL4R (ENSG00000077238.14) and IGHM (ENSG00000211899.10) are B cell molecular markers. IgM-type neutralizing antibodies have been shown to help prevent virus transmission and promote cytosolic virus uptake, and their peak after SARS-CoV-2 infection corresponds to a T cell peak and the disappearance of the virus, suggesting that IGHM may be involved in the recovery process after SARS-CoV-2 infection [67, 68]. A remarkable increase in IGHM subtypes in B cells was found in a population vaccinated with the Kexing Biopharm vaccine [69]; in addition, substantial downregulation of IL-4R was observed in patients with mild SARS-CoV-2 infection [69]. Although no correlation between IL4R or IGHM and patient prognosis after SARS-CoV-2 infection has been reported, our study shows that both genes are strongly associated with PASC and patients’ quality of life.

CXCL8, IL4R, and IGHM are all immune-related genes and showed substantial predictive roles across the algorithms (3/6). In addition, some genes that are not directly associated with the immune response are equally important, such as UGT2A3 (ENSG00000135220.11) [70] and CYP26B1 (ENSG00000003137.8) [71]. Several studies have reported that these genes correlate with the severity of SARS-CoV-2 infection. This finding further demonstrates the reliability of our results and shows that these key features can provide guidance for predicting quality of life after SARS-CoV-2 infection. Meanwhile, further studies may facilitate the development of targets and drugs to improve the prognosis of SARS-CoV-2 infection.

We also compared our key feature genes with those reported by Thompson et al. [30], the study from which our data were sourced. That study identified various etiologies of post-acute sequelae of SARS-CoV-2 infection, directly linking these sequelae to the host’s acute response to the virus and providing early insights into their development. Genes such as CCDC18, IL4R, and IGHM, which were also highlighted as critical differentially expressed genes in that study, play a crucial role in PASC. These findings further underscore the reliability of our results and demonstrate that these key features can offer guidance for predicting quality of life after SARS-CoV-2 infection.

4.2 Decision rules related to differential quality of life after COVID-19 recovery

As mentioned above, we identified several sets of valid biomarkers using different feature ranking methods. We further established decision rules from the optimal DT classifiers and selected representative rules derived from the different feature lists for detailed discussion.

In the rule group derived from the CATBoost feature list, many rules for patients with good quality of life (Better) require low CCDC18 expression and high CPED1 (ENSG00000106034.18) expression. As previously mentioned, CCDC18 may be involved in regulating viral infection of cells, and its low expression may be associated with better post-infection outcomes. Reduced expression of CPED1 in peripheral blood has been shown to be associated with poor outcome in SARS-CoV-2 infection [72]. The few studies related to CPED1 have shown that it plays a potential tumor-suppressive role in lung adenocarcinoma, with low expression associated with shorter survival [73]. For patients with deteriorating quality of life (Worse), a subset of rules requires a higher ITGA2 expression level; ITGA2 mediates the adhesion of platelets and other cell types to the extracellular matrix and is involved in coagulation-related biological processes. Recent publications have found that abnormal ITGA2 expression correlates with the prognosis of SARS-CoV-2 infection [74], and a higher degree of mutation in ITGA2 is observed in patients with extremely severe SARS-CoV-2 infection [75].

In the rule group derived from the LightGBM feature list, a large proportion of the rules for patients with poor quality of life (Worse) require low expression of CDC16 (ENSG00000130177.16), which encodes a ubiquitin ligase component involved in cell cycle regulation. Studies have shown that CDC16 protein levels have potential for COVID-19 diagnosis [76]. Given that CDC16 is involved in the cell cycle, its low expression may correlate with the level of immune cell activity and indirectly contribute to poorer quality of life after SARS-CoV-2 infection. High expression of another gene, TMEM176B (ENSG00000106565.18), is also associated with poorer quality of life after SARS-CoV-2 infection. TMEM176B is a possible marker of dendritic cell maturation or differentiation and is upregulated in dendritic cells from SARS-CoV-2-infected patients [77]. Another study found that TMEM176B is overexpressed in monocytes from SARS-CoV-2-infected patients and may be involved in the regulation of inflammasome-dependent T cell dysfunction [78]. Thus, high TMEM176B expression may be associated with immune dysfunction in patients with poorer quality of life. This rule group also shows findings similar to those from the CATBoost feature list; specifically, reduced CPED1 expression correlates with severe consequences of SARS-CoV-2 infection.

In summary, our study identified key decision rules related to differential quality of life after COVID-19 recovery using various computational methods. These rules are based on the expression levels of specific genes, such as CCDC18, CPED1, ITGA2, CDC16, and TMEM176B, each of which plays a unique role in post-infection outcomes. For instance, lower expression of CCDC18 and higher expression of CPED1 are associated with improved quality of life, possibly because of their roles in viral infection regulation and tumor suppression. Higher expression of ITGA2 and TMEM176B and lower expression of CDC16 are linked to poorer quality of life, suggesting potential involvement of coagulation, dendritic cell maturation, inflammasome-dependent T cell dysfunction, and reduced immune cell activity. Our findings align with recent publications and emphasize the importance of these genes in determining the post-COVID-19 recovery experience. Overall, these insights can contribute to a deeper understanding of the molecular mechanisms underlying COVID-19 sequelae and guide future clinical research and interventions aimed at improving the outcomes of patients recovering from SARS-CoV-2 infection.

5 Conclusions

We grouped patients into three classes by comparing their quality of life after the acute phase of COVID-19 with that before the disease. Whole blood transcriptomic data from all patients were analyzed using a series of machine learning algorithms. The analysis enabled the screening of genetic markers that represent different sequelae severities, which tend to be associated with viral immunity and inflammatory triggers. We also constructed high-performance classifiers for predicting sequelae levels, together with classification rules that indicate special expression patterns at different sequelae levels. Furthermore, our study provides a foundation for identifying potential risk factors associated with PASC. Although we have uncovered crucial genetic signatures, further research is still needed to fully elucidate the complex interplay between these markers and the development of PASC. This knowledge will pave the way for the development of targeted therapeutic strategies aimed at improving the quality of life of individuals affected by PASC.