Introduction

Although lung cancer is the second most common cancer globally, it is the leading cause of cancer deaths, accounting for an estimated two million new cases and 1.76 million deaths annually1,2. Lung cancer can be broadly grouped into small-cell lung cancer (SCLC, 15%) and non-small-cell lung cancer (NSCLC, 85%), and lung adenocarcinoma (LUAD) is the most common subtype of NSCLC2,3,4. The treatment of NSCLC has changed dramatically over the past decade, primarily due to advances in biomarkers that allow targeted and immune-based therapies to be used with significant success in specific patients5. However, the vast majority of advanced NSCLC cases become resistant to current treatments and eventually progress6. Therefore, new predictors that can help predict and improve the prognosis of LUAD are urgently needed.

Reprogramming of cell metabolism is an essential feature of malignancy, as shown by abnormal uptake of glucose and amino acids and dysregulation of glycolysis7,8. Glycolysis is a characteristic metabolic pattern of tumor cells that meets their requirements for ATP, among other needs9. The mitochondrial pyruvate carrier (MPC), which consists of MPC1 and MPC2, is responsible for the import of pyruvate from the cytoplasmic matrix into the mitochondrial matrix and can therefore affect glycolysis; impaired MPC function may give rise to tumors with strong capabilities for proliferation, migration, and invasion10,11. Pyruvate is converted to acetyl coenzyme A, which is further converted to citric acid, a precursor required for lipogenesis, including the synthesis of cholesterol12. There is growing evidence for a close relationship between cholesterol metabolism and some types of cancer, involving allosteric interactions in the tumor microenvironment, cancer cell spread and metastasis formation, and lipid metabolism in tumor-initiating cells (TICs)13,14.

In recent years, some studies have investigated potential prognostic signatures of LUAD using only bulk RNA-seq data, which principally reflect the average expression across all cells in a sample15,16. Single-cell sequencing is a powerful instrument for dissecting the cellular and molecular landscape at single-cell resolution, revolutionizing our comprehension of the biological features and dynamics within cancer pathologies17. Single-cell RNA-seq technology can comprehensively characterize the heterogeneity of the tumor microenvironment (TME) and help dissect its complex cell type composition and expression heterogeneity, and the Tumor Immune Single Cell Hub (TISCH) provides a simple way to carry out such an analysis18.

The analysis of large volumes of complex biomedical data through computer algorithms, driven by the ongoing development of computer hardware and enormous amounts of data, offers substantial advantages for advancing biology and accurately estimating patient conditions19,20. Machine learning (ML) is a scientific discipline focusing on how computers learn from data and build predictive models. It is becoming an embedded part of modern research in biology, but its “black box” nature is an additional challenge21,22. Many algorithms are widely used to analyze complex biomedical data, such as extreme gradient boosting (XGBoost)23, Random Forest Classifier (RFC)24, Logistic Regression (LR)25, Support Vector Machine (SVM)26, and K-Nearest Neighbors (KNN)27. Model interpretation is an important branch of method development, with Shapley additive explanation (SHAP) being a widely used approach28. The SHAP, which explains the model outcome by computing the contribution of each input feature for all samples, was applied to study the effects of different variables7C).

Figure 7

(A) Comparison of risk scores among samples grouped by gender, stage, and subtype. (B) Prognosis comparison of various age groups according to risk scores. (C) Waterfall plot displaying details of mutations in each gene for each sample in the high and low risk groups.

Analysis of the correlation between prognostic genes and TME

Given the role of the TME in tumor development and its impact on prognosis, we used the NSCLC_GSE117570 dataset from the TISCH database to analyze the expression of some prognostic genes in TME-associated cells. The cells in this dataset are categorized into ten cell types. Figure 8A shows the number of cells of each cell type and presents the distribution of each type of TME-associated cells. In this dataset, malignant cells were the most abundant (n = 2721). We found that COL4A3, FOLR1, KRT6A, and SLC22A3 had higher expression in malignant cells than in other types of TME-associated cells (Fig. 8B). These results support the association of the prognostic genes with lung cancer.

Figure 8

(A) Number of cells of each cell type in the NSCLC_GSE117570 dataset, with a description of the distribution of TME-associated cells of each type. (B) Distribution of COL4A3, FOLR1, KRT6A, and SLC22A3 in TME-associated cell types.
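
The per-cell-type comparison described above can be illustrated with a short sketch. It is not the TISCH implementation; it only assumes that each cell carries a cell-type annotation and a gene expression value, and it averages one gene within each annotated type. The function name, the input layout, the gene name `gene_a`, and all values are hypothetical.

```python
from collections import defaultdict


def mean_expression_by_cell_type(cells, gene):
    """Average one gene's expression within each annotated cell type.

    `cells` is a list of (cell_type, {gene: expression}) pairs; genes
    missing from a cell are treated as zero expression (an assumption).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for cell_type, expression in cells:
        totals[cell_type] += expression.get(gene, 0.0)
        counts[cell_type] += 1
    return {cell_type: totals[cell_type] / counts[cell_type] for cell_type in totals}


# Hypothetical toy cells; the values are made up for illustration only.
toy_cells = [
    ("Malignant", {"gene_a": 4.0}),
    ("Malignant", {"gene_a": 2.0}),
    ("CD8T", {"gene_a": 1.0}),
    ("CD8T", {}),
]
means = mean_expression_by_cell_type(toy_cells, "gene_a")
highest = max(means, key=means.get)  # cell type with the highest mean
```

In a real analysis the averages would be computed on normalized expression values, and differences between cell types would be assessed statistically rather than by comparing means alone.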

Discussion

Lung cancer is one of the deadliest malignancies in humans, and most patients with advanced lung cancer experience recurrence and treatment resistance. The abnormal metabolism of cancer cells, characterized by high glycolysis that occurs even in the presence of abundant oxygen and known as the Warburg effect or aerobic glycolysis, has been recognized as a new hallmark of cancer34. Inhibition of glycolysis is considered a therapeutic option for aggressive cancers, including lung cancer, and related genes can be used as potential targets for metabolic therapy against cancer cells, such as ARID1A and circ-ENO135,36,37. Altered metabolism is not limited to cellular energy pathways but also includes alterations in lipid biosynthesis and other pathways (e.g., polyamine processing) in lung cancer and can affect its surrounding microenvironment38. It has been shown that lung cancer tissues demonstrate elevated cholesterol levels because the proliferation of cancer cells depends heavily on cholesterol availability. Strategies to reduce cholesterol synthesis or inhibit cholesterol uptake have been proposed as potential antineoplastic therapies39,40. Therefore, it is essential to clarify the metabolic pathways of lung cancer for its prevention and treatment.

In this study, starting from 93 glycolysis and cholesterol synthesis genes, we used consensus clustering to reduce the gene set to its most representative members, yielding 7 cholesterogenesis and glycolysis co-expressed genes, respectively. Based on these genes, the samples were classified into four subtypes: glycolytic, cholesterogenic, quiescent, and mixed. Although cholesterol plays a crucial role in tumors, survival analysis showed that the cholesterogenic subtype had a better prognosis than the other subtypes, and randomized controlled trials could not support a survival benefit of lipid lowering in lung cancer patients; the reasons for these findings deserve further investigation41.
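
The four-way typing described above can be sketched as follows. This is a minimal illustration, assuming that each sample has already been summarized by one score for the glycolytic gene set and one for the cholesterogenic gene set (for example, a median z-score) and that zero separates high from low. The function name, the threshold, and the example scores are assumptions for illustration and are not taken from the study.

```python
def metabolic_subtype(glycolysis_z, cholesterol_z, threshold=0.0):
    """Assign one of four metabolic subtypes from two gene-set scores.

    Assumption: scores are per-sample summary z-scores and `threshold`
    (0 here) separates "high" from "low"; the study may use other rules.
    """
    high_gly = glycolysis_z > threshold
    high_chol = cholesterol_z > threshold
    if high_gly and high_chol:
        return "mixed"
    if high_gly:
        return "glycolytic"
    if high_chol:
        return "cholesterogenic"
    return "quiescent"


# Hypothetical (glycolysis, cholesterogenesis) scores for four samples.
example_scores = {
    "sample_1": (1.2, 0.8),
    "sample_2": (0.9, -0.4),
    "sample_3": (-0.7, 1.1),
    "sample_4": (-0.3, -0.5),
}
subtypes = {s: metabolic_subtype(g, c) for s, (g, c) in example_scores.items()}
```

Under this rule, a sample is "mixed" only when both scores are high and "quiescent" only when both are low, which is why the mixed subtype corresponds to samples enriched in both gene sets.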

Pyruvate is central to carbohydrate, fat, and amino acid metabolism. Pyruvate is appealing as a therapeutic target against cancer because it promotes respiratory reserve capacity and mitochondrial oxygen consumption, which may contribute to the aggressive disease phenotype42. The mitochondrial pyruvate carrier (MPC) is a critical complex responsible for pyruvate transport and oxidation11. Low or absent MPC1 and MPC2 levels lead to metabolic disorders and alterations in tumor metabolism, and their restored expression inhibits tumor growth, invasiveness, metastasis, and stemness43. Our analysis showed that MPC1/2 expression differed significantly among the metabolic subtypes, suggesting that the MPC complex affects metabolic pathways and thus participates in the malignant progression of lung cancer by regulating the amount of pyruvate entering the mitochondria.

We identified DEGs between the best and worst prognosis subtypes and performed a functional enrichment analysis. The results showed significant enrichment of DEGs between the mixed and cholesterogenic subtypes in p53 signaling pathways, microRNAs in cancer, and the cell cycle. We then built the model using ML, with DEGs as features and outcome events as labels. XGBoost performed best on the test set, consistent with its strong ability to handle complex classification problems. SHAP was then used to select the nine most important features, which were used to construct a prognostic model with multivariate Cox regression. This approach makes it possible to combine the excellent classification power of ML with the interpretability of the prognostic model.
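
The idea behind SHAP-based feature selection can be illustrated with a small, self-contained sketch. It does not use XGBoost or the `shap` library; instead it computes exact Shapley values by brute force for a hypothetical three-feature toy model and then ranks features by mean absolute contribution, which is a common way to pick top features. The model, the zero baseline, and all names are assumptions for illustration only.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at `x` relative to `baseline`.

    Features absent from a coalition take their baseline value. The cost
    grows factorially with the number of features, so this only suits toys.
    """
    contributions = [0.0] * len(x)
    orderings = list(permutations(range(len(x))))
    for order in orderings:
        current = list(baseline)
        previous_output = model(current)
        for i in order:
            current[i] = x[i]  # add feature i to the coalition
            output = model(current)
            contributions[i] += output - previous_output
            previous_output = output
    return [c / len(orderings) for c in contributions]


def top_features(model, samples, baseline, names, k):
    """Rank features by mean absolute Shapley value across samples."""
    totals = [0.0] * len(names)
    for x in samples:
        for i, value in enumerate(shapley_values(model, x, baseline)):
            totals[i] += abs(value) / len(samples)
    ranked = sorted(zip(names, totals), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def toy_model(features):
    # Hypothetical risk function with one interaction term.
    a, b, c = features
    return 2.0 * a + 1.0 * b + 0.5 * a * c


values = shapley_values(toy_model, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
selected = top_features(
    toy_model,
    samples=[[1.0, 1.0, 1.0], [2.0, 0.0, 1.0]],
    baseline=[0.0, 0.0, 0.0],
    names=["gene_a", "gene_b", "gene_c"],
    k=2,
)
```

The contributions for one sample sum to the difference between the model output for that sample and the output for the baseline, which is the property that makes the per-feature values interpretable. For tree ensembles, dedicated algorithms avoid the factorial cost of this brute-force enumeration.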

The nine prognostic genes comprised four non-coding RNA genes (RP11.87E22.2, RP11.789C1.1, AL589743.1, and MIR31HG) and five protein-coding genes (COL4A3, SLC22A3, FOLR1, ESPL1, and KRT6A). After reviewing the literature, we found that many of these prognostic genes have been studied in lung cancer and shown to affect tumorigenesis and progression. Specifically, ESPL1 expression was positively correlated with SHAP values, and high expression of ESPL1 has been previously shown to be associated with poor prognosis in lung cancer by Zhao et al.44. Similarly, KRT6A, MIR31HG, and FOLR1 have been found to enhance lung cancer proliferation and may be potential therapeutic targets45,46,47. Analyzing the association between these genes and the TME with a single-cell sequencing database, we found that some prognostic genes were highly expressed in malignant cells, which contributed to our construction of a better prognostic model. Subsequently, all samples were classified into high and low risk groups, and the clinical characteristics of the different risk groups were analyzed. Results in the test set were consistent with those in the training set. The model was robust on the training and test datasets and showed good predictive performance.

There are also some limitations to this study. Firstly, the performance of our model has not been tested externally, and its applicability to large-scale use remains uncertain. Secondly, the biological associations between the selected prognostic genes remain to be investigated, and the biological explanations for their prognostic profiles are yet to be explored. Future experimental verification is needed. Finally, the use of the median risk score as the cutoff value for classifying high and low risk needs to be optimized.
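
The median-cutoff step mentioned above can be sketched as follows. A Cox-type risk score is taken here as a weighted sum of gene expression values, and samples are labeled high or low risk relative to the cohort median. The coefficients, gene names, and expression values are hypothetical and are not the fitted values from this study.

```python
from statistics import median


def risk_score(expression, coefficients):
    """Linear predictor of a Cox-type model: sum of coefficient * expression."""
    return sum(coefficients[gene] * expression[gene] for gene in coefficients)


def split_by_median(scores):
    """Label samples 'high' if their score exceeds the cohort median."""
    cutoff = median(scores.values())
    return {s: ("high" if v > cutoff else "low") for s, v in scores.items()}


# Hypothetical coefficients and expression values, for illustration only.
coefficients = {"gene_a": 0.5, "gene_b": 0.25, "gene_c": -0.5}
cohort = {
    "patient_1": {"gene_a": 2.0, "gene_b": 4.0, "gene_c": 1.0},
    "patient_2": {"gene_a": 1.0, "gene_b": 0.0, "gene_c": 3.0},
    "patient_3": {"gene_a": 3.0, "gene_b": 2.0, "gene_c": 0.0},
    "patient_4": {"gene_a": 0.0, "gene_b": 2.0, "gene_c": 2.0},
}
scores = {p: risk_score(e, coefficients) for p, e in cohort.items()}
groups = split_by_median(scores)
```

Because the median depends on the cohort, the same patient could fall into a different group in another dataset, which is one reason a fixed or optimized cutoff may be preferable.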

Conclusions

Patients with LUAD were effectively typed by glycolytic and cholesterogenic genes, and the subtype enriched in both glycolytic and cholesterogenic genes was identified as having the worst prognosis. Prognostic genes selected by the XGBoost algorithm and SHAP analysis can be used to analyze patient prognosis. The prognostic model can provide an important basis for clinicians to predict clinical outcomes for patients. Our model also faces some challenges. In clinical practice, a large proportion of patients with LUAD do not undergo genetic testing. We hope to design a more widely applicable model in the future.