Background

Bladder cancer (BC) is the tenth most commonly diagnosed carcinoma, with an estimated 549,000 new cases and 200,000 deaths reported globally in 2018, and it ranks first among urinary malignant neoplasms in males [1]. Therefore, it is crucial to develop accurate prognostic tools for predicting clinical outcomes to help clinicians make decisions about treatment, drug therapy, and conservation options [2].

Conventional signatures used to predict overall survival (OS) range from tumor clinical parameters and tumor pathology to specific mutated genes. For instance, the tumor node metastasis (TNM) classification system is the most frequently used tool for predicting the prognosis of cancer patients [3, 4]. Zhang et al. constructed a prediction tool based on clinical parameters to predict the survival of patients with BC [5]. The main advantage of TNM staging is its simplicity, but its inherent drawback is that it does not provide an individualized prediction for each patient [6]. In addition, an increasing number of single-gene signatures have been explored to predict the OS of BC patients, such as OIP5 [

Table 1 Clinical information of the training set and testing set

Establishment of multiple-gene prognostic signature

We used LASSO regression with tenfold cross-validation to narrow down the OCGs with the R package “glmnet” [25]. A gene-based prognostic signature was then constructed via stepwise multivariate Cox regression. For each TCGA-BLCA patient, a risk score based on this signature was calculated as the sum of the expression level of each signature gene multiplied by its regression coefficient from the stepwise multivariate Cox regression.
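The risk score described above is a weighted sum and can be illustrated with a minimal Python sketch. This is not the authors' R code; the gene names, coefficients, and expression values are hypothetical placeholders chosen only to make the arithmetic easy to follow.

```python
def risk_score(expression, coefficients):
    """Sum of (Cox regression coefficient * expression level) over the signature genes."""
    missing = [gene for gene in coefficients if gene not in expression]
    if missing:
        raise KeyError(f"missing expression values for: {missing}")
    return sum(coefficients[gene] * expression[gene] for gene in coefficients)


# Hypothetical coefficients from a fitted multivariate Cox model (assumption).
coefficients = {"GENE_A": 0.5, "GENE_B": -0.25, "GENE_C": 0.125}

# Hypothetical expression values for two patients (assumption).
patients = {
    "P1": {"GENE_A": 2.0, "GENE_B": 4.0, "GENE_C": 8.0},
    "P2": {"GENE_A": 1.0, "GENE_B": 0.0, "GENE_C": 0.0},
}

scores = {pid: risk_score(expr, coefficients) for pid, expr in patients.items()}
```

A positive coefficient raises the score as expression increases, while a negative coefficient lowers it, so the sign of each coefficient indicates whether the gene behaves as a risk or a protective factor in the model.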

Estimation and validation of the multi-gene model

The testing set (n = 162) and the whole set (n = 405) were used to assess the predictive validity of the multi-gene prognostic signature. In each validation set, the risk score of every patient was calculated with the coefficients of the candidate genes obtained above. The patients were then stratified into high-risk and low-risk groups, with the median risk score as the cutoff. Kaplan-Meier (KM) survival analysis with the log-rank test and time-dependent receiver operating characteristic (ROC) analysis were applied to validate the gene-based prognostic signature. Furthermore, the mutation types of the selected genes were explored in cBioPortal (https://www.cbioportal.org/) [26, 27].
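The median-based stratification and the KM estimate can be sketched as follows. This is an illustrative Python example on hypothetical data, not the authors' R workflow. Assigning patients whose score equals the median to the low-risk group is an assumption, since the text does not state how ties were handled.

```python
from statistics import median


def stratify_by_median(scores):
    """Label each patient 'high' or 'low' risk using the median score as cutoff."""
    cutoff = median(scores.values())
    # Assumption: scores equal to the cutoff are placed in the low-risk group.
    groups = {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}
    return groups, cutoff


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events: 1 = death observed, 0 = censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect all patients whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


# Hypothetical risk scores and follow-up data (assumptions).
groups, cutoff = stratify_by_median({"P1": 0.2, "P2": 0.9, "P3": 0.5, "P4": 1.4})
curve = kaplan_meier(times=[1, 2, 3, 4], events=[1, 0, 1, 1])
```

In practice one KM curve would be estimated per risk group and the two curves compared with the log-rank test, which is not implemented in this sketch.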

Construction and validation of the prognostic nomogram

Based on the risk score and selected clinical parameters, a nomogram was established to predict the probability of 1-year, 3-year, and 5-year OS using the R package “rms” [28]. The nomogram score of each patient was calculated with the R package “nomogramFormula” [29]. Decision curve analysis (DCA) of the survival outcome was performed with the source code provided on the MSKCC website (https://www.mskcc.org/departments/epidemiology-biostatistics/biostatistics/decision-curve-analysis) [6]. Calibration curve analysis was conducted with the “calibrate” function of the “rms” R package [28]. Time-dependent ROC analysis of the nomogram score was performed with the R package “timeROC” [30].
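The quantity plotted in a decision curve is the net benefit at a given threshold probability. The Python sketch below illustrates it under a simplifying assumption: the outcome is treated as a fully observed binary event at a fixed time point, whereas the analysis described above concerns survival outcomes with censoring. The predicted probabilities and outcomes are hypothetical.

```python
def net_benefit(probabilities, outcomes, threshold):
    """Net benefit = TP/n - FP/n * threshold / (1 - threshold).

    A patient is classified as positive when the predicted probability
    is at least the threshold. outcomes: 1 = event, 0 = no event.
    """
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(outcomes)
    true_pos = sum(1 for p, y in zip(probabilities, outcomes) if p >= threshold and y == 1)
    false_pos = sum(1 for p, y in zip(probabilities, outcomes) if p >= threshold and y == 0)
    return true_pos / n - false_pos / n * threshold / (1 - threshold)


def net_benefit_treat_all(outcomes, threshold):
    """Reference strategy: every patient is classified as positive."""
    return net_benefit([1.0] * len(outcomes), outcomes, threshold)


# Hypothetical predicted event probabilities and observed outcomes (assumptions).
probabilities = [0.9, 0.8, 0.3, 0.1]
outcomes = [1, 0, 1, 0]

nb_model = net_benefit(probabilities, outcomes, threshold=0.2)
nb_all = net_benefit_treat_all(outcomes, threshold=0.2)
```

A model is considered useful at a given threshold when its net benefit exceeds both the treat-all reference and the treat-none reference, the latter being zero by definition.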

Functional analysis and correlation analysis of genes in model

Gene ontology (GO) enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses of the candidate genes were performed with the R package “clusterProfiler” [31, 32]. Functional annotations with P < 0.05 were considered significantly enriched. The bioinformatic methods “guilt by association” (GBA) [17] and GSEA were applied to explore potential functions. GSEA was conducted in R with the package “clusterProfiler” [32]. GBA was performed using Spearman correlation.
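The correlation step of GBA can be illustrated with a small Python sketch of the Spearman coefficient, computed as the Pearson correlation of average ranks. The gene names and expression values are hypothetical, and ranking the remaining genes by their correlation with one target gene is an assumption about how such a result might be organized, not a description of the authors' pipeline.

```python
def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical expression of a target gene and two other genes (assumptions).
target = [1.0, 2.0, 3.0, 4.0, 5.0]
others = {
    "GENE_X": [2.0, 4.0, 8.0, 16.0, 32.0],  # increases monotonically with target
    "GENE_Y": [9.0, 7.0, 5.0, 3.0, 1.0],    # decreases monotonically with target
}
correlations = {gene: spearman(target, expr) for gene, expr in others.items()}
ranked_genes = sorted(correlations, key=correlations.get, reverse=True)
```

Because the coefficient depends only on ranks, GENE_X correlates perfectly with the target even though the relationship is not linear.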

Statistical analysis

The TCGA samples were randomly divided into a training set and a testing set with the “sample” function of R. Heatmaps of the DEGs obtained in the three datasets were plotted with the R package “pheatmap” [33]. Differences between the two groups shown in boxplots were assessed with the Wilcoxon test. Clinical parameters were compared between the training set and the testing set with the χ² test or Fisher's exact test. For KM survival analysis, P-values and hazard ratios (HRs) were generated via log-rank tests and univariate Cox proportional hazards regression. All analyses were performed in R version 3.6.3 (The R Foundation for Statistical Computing, 2020). All statistical tests were two-sided, and P < 0.05 was regarded as statistically significant.
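The random division of samples can be sketched in Python as below. The set sizes follow the text (405 samples in total, 162 in the testing set, hence 243 in the training set). The sample identifiers and the seed value are hypothetical, and the sketch is a Python analogue of the procedure rather than the authors' R call.

```python
import random


def split_samples(sample_ids, n_test, seed):
    """Randomly draw n_test samples for testing; the remainder form the training set."""
    rng = random.Random(seed)  # fixed seed so that the split is reproducible
    test = set(rng.sample(sample_ids, n_test))
    train = [s for s in sample_ids if s not in test]
    return train, sorted(test)


# Hypothetical sample identifiers (assumption); 405 samples as in the whole set.
sample_ids = [f"SAMPLE_{i:03d}" for i in range(1, 406)]

# The seed value is an arbitrary assumption.
train_set, test_set = split_samples(sample_ids, n_test=162, seed=2020)
```

Recording the seed allows the same partition to be regenerated later, which matters when the comparison of clinical parameters between the two sets has to be reproduced.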