Introduction

The classification of gastric cancer (GC) is essential for addressing the inter-tumoral heterogeneity, predicting the clinical outcome, and customizing treatments for different patients. Methods based on histological or molecular characteristics of GC have been proposed. Lauren classification, which is based on tumor histology, divides GC into diffuse or intestinal subtypes. Diffuse-type GC is generally more aggressive [1], yet the association between Lauren classification and prognosis is debated in different studies. Molecular subtype systems were proposed by The Cancer Genome Atlas (TCGA) [2, 3], dividing GC into Epstein-Barr-virus-positive (EBV), microsatellite-instable (MSI), genomically-stable (GS), chromosomal-instable (CIN), and hypermutated-single-nucleotide-variants (HM-SNV) subtypes. However, similar survival outcomes were observed in different subtypes. Alternative molecular classification systems (systematically reviewed by Serra et al. [4]) have identified expression (GS expression signatures [5] and mesenchymal phenotype [6, 7]), EBV status, and mutational signatures [

Results

Identification of oncogenic lncRNAs in GC using the FLORA workflow

Examining reads mapped to different genomic regions, we identified on average 2.9% of reads are properly mapped and derived from intergenic regions that could potentially encode lncRNAs (Fig. S1A, B). Motivated by the observation, we developed FLORA to efficiently reconstruct the noncoding transcriptome and prioritize cancer-relevant lncRNAs (Fig. S1C). FLORA started by pre-processing the alignment results and eliminating reads mapped to coding regions and other RNA species, which greatly reduce the data size and accelerate transcript assembly. After transcriptome assembly and merging, a transcript will be reported as a prospective lncRNA if it passes coding potential assessment, has a desirable length (over 200 bases), and at least two exons. The function of each lncRNA is then predicted via the construction of co-expression networks, followed by gene ontology (GO) enrichment analysis.

FLORA identified 28,507 loci that potentially encode lncRNAs from TCGA GC and normal samples (Table S1, Supplementary Materials and Methods). Among all the lncRNA-encoding genes reported by FLORA, 13,675 loci overlapped with annotated lncRNAs, such as H19 and PVT1, while 10,356 loci were not present in GENCODE [12], Ensembl [16] or RefSeq [17] annotation and were thus considered novel (Fig. S1D). Finally, we kept 4700 lncRNAs with mean expression level over 0.1 FPKM for downstream analysis, including 1547 novel and 3153 known lncRNAs in GENCODE (Table S2). Interestingly, the fraction of reads from known and potential lncRNAs was significantly higher in tumors compared to their paired normal samples, as well as across all samples (Fig. S1E, F). In contrast, the expression from coding regions in tumor and normal tissue was less divergent (Fig. S1G), suggesting lncRNAs exhibit tumor-specific expression patterns. Compared to normal gastric tissues, 1235 lncRNAs were significantly upregulated and 689 were downregulated in GC samples (Table S3), including several well-characterized oncogenic lncRNAs such as H19 [18], HOTAIR [19], and DUXAP8 [

Fig. 1: Molecular classification based on tumor-specific lncRNA expression defined novel GC subtypes with prognostic value.
figure 1

A Heatmap of differentially expressed lncRNAs in three GC subtypes. Each column represented one GC sample, and each row represented a lncRNA. The AJCC stage, histology, molecular subtype, and methylation-based subtype for each sample were marked on the top panel, with a white grid representing non-available (NA) information. B, C Kaplan–Meier curve illustrating the overall survival for GC patients of the three subtypes from the TCGA (B) and ACRG (C) cohort. P-values were calculated by log-rank test.

While the stage of GC (determined by the American Joint Committee on Cancer (AJCC)) and the age at diagnosis were also significantly associated with survival, the lncRNA-based subtypes were independent of the two variables (Fig. S3C, D). Multivariable Cox regression analysis revealed that the L3 subtype, as well as tumor stage and age at initial diagnosis, was a predictor of worse prognosis in GC (Fig. 2A). The combination of tumor stage and lncRNA-based subtype yielded greater predictive power (P = 4.78 × 10−5 by multivariate log-rank test) compared to considering individual factors alone (P = 2.3 × 10−4 and 1.4 × 10−2, for stage and lncRNA-based subtype, respectively) (Fig. S3E). Notably, among the stage I/II tumor cases, the L3 subtype demonstrated significantly worse survival than L1 and L2 subtype (P = 0.016 by log-rank test), suggesting lncRNA-based subtype can identify high-risk GC cases for early intervention. In concordance with the analysis of the TCGA dataset, the L3 subtype was identified as an independent prognosis factor in independent cohorts (Figs. S3F–H, S4A, B), demonstrating the robustness of lncRNA-based subtype in predicting survival outcome.

Fig. 2: LncRNA-based subtype L3 represents a subset of intestinal GC with a poor prognosis.
figure 2

A Value and 95%-confidence interval (95% CI) of the hazard ratio (HR) of different factors considered in the univariate and multivariate Cox regression analysis of the TCGA cohort. AJCC the American Joint Committee on Cancer. B Distribution of intestinal and diffuse tumors in lncRNA subtypes in the 375 TCGA GC samples. C, D Survival probability of TCGA (C) and ACRG (D) patients segregated by the combination of histology and lncRNA-based subtypes. The P-values were calculated by multivariate log-rank test.

Associations between lncRNA-based subtypes and histology

The histological subtypes (or Lauren classification) of GC has long been considered as a prognostic marker, with the diffuse subtype associated with poor survival and the intestinal subtype with better outcome [1]. Although consistent conclusions can be drawn from the cases in the ACRG cohort, no significant difference in patient survival was observed in patients from TCGA or Singapore cohort (Fig. S4C–E). Therefore, the prediction of clinical outcomes solely based on the Lauren classification may yield poor accuracy. Comparing Lauren classification and lncRNA-based subtypes, we observed enrichment of the diffuse-type GC in L2, while the intestinal-type is enriched in L1 and L3 (Figs. 2B and S4F, G). However, the intestinal-type GC demonstrated a vast difference in survival, with the intestinal-L3 subgroup demonstrating a worse prognosis than the intestinal-non-L3 subgroup (Fig. 2C, D). Collectively, our analysis demonstrated that the lncRNA-based L3 subtype further distinguished a subset of intestinal histology with adverse survival outcomes across cohorts, which were overlooked in previous studies.

Genomic and epigenomic characteristics of lncRNA-based GC subtypes

To characterize the genomic features for each lncRNA-based subtype, we integrated mutation and CNV profiles of GC in the TCGA and ACRG cohorts. Among the most common mutations in GC, TP53 mutations were significantly enriched in the L3 subtype from both the TCGA and ACRG cohort (Figs. 3A and S5A), while ARID1A, PIK3CA, KMT2B, KRAS, and FBXW7 mutations were more frequent in the L1 subtype. Mutations in CDH1 were more common in the L2 subtype, which was consistent with previous studies that the diffuse-type GC frequently harbors dysfunctional CDH1 [27]. In addition, higher mutation load was observed in the L1 subtype, while chromosomes of L3 subtype samples were more unstable and carried significantly more CNV than other subtypes (Figs. 3B, C and S5B–E). Compared to L1 and L2 subtypes, 19q12(CCNE1), 17q12(ERBB2), 7p11.2(EGFR), and 20q13.2(ZNF217) were more frequently amplified in the L3 subtype, while 6p25.3(FOXC1) was frequently deleted (Fig. S5F–I). Among the frequently amplified oncogenes in L3, CCNE1 exhibited a significantly higher expression level in L3 (Fig. S5J).

Fig. 3: Characterization of mutations, copy number variations, and epigenetic alterations in lncRNA-based GC subtypes.
figure 3

A Frequency of somatic mutations that were significantly enriched in at least one lncRNA-based subtype (Fisher’s exact test, P < 0.001). Each dot represented one frequently mutated gene in GC, with size proportional to its overall mutation frequency. B Somatic mutation load in different lncRNA-based subtypes. ***P < 0.001 by unpaired two-tailed t-test. C The fraction of genome with copy number alterations in different lncRNA-based subtypes. ***P < 0.001 by unpaired two-tailed t-test. D Distribution of TCGA molecular subtypes in the three lncRNA-based subtypes. E Percentage of L1, L2, and L3 subtype samples in the EBV-CIMP, CIMP-H, CIMP-L, and non-CIMP categories. *Represented enrichment with P < 0.05 by Fisher’s exact test.

While these molecular features were utilized for GC classification, limited predictive power has been provided by molecular subtypes proposed by the TCGA study [2, 3] (Fig. S5K, L). We found that L3 was mostly comprised of CIN tumor (96%), L2 contained a higher proportion of EBV and GS tumor, while L1 was enriched with MSI tumor (Figs. 3D and S5M). Although both contained a large fraction of CIN-subtype GC, L1-CIN and L3-CIN subgroups demonstrated a vast difference in patient survival (Fig. S5N). Therefore, substantial heterogeneities were observed in previous subty** methods, hindering the prediction of clinical outcomes.

Apart from the genomic alterations, the L3 subtype also exhibitied a distinct DNA methylation profile. Based on the level of CpG island methylator phenotype (CIMP), gastrointestinal tumors were classified into EBV-CIMP (EBV-associated-CIMP), CIMP-H (high-CIMP), CIMP-L (low-CIMP), and non-CIMP [3]. The non-CIMP category was highly enriched in the L3 subtype (93%) (Fig. 3E). The expression levels of TET1, TET3, and TDG, all of which are involved in the DNA demethylation pathways, were significantly upregulated in the L3 subtype compared with the non-L3 samples (Fig. S6A, B), which may partially explain the non-CIMP methylation phenotype in L3 subtype. Moreover, epigenomic regulations may play important roles in generating the unique lncRNA expression profile of the L3 subtype.

Interplay of TP53 mutations and DNA demethylation drives oncogenic lncRNA expression and aggressive phenotype in L3 subtype

Out of the L3-enriched lncRNAs, 137 demonstrated hypomethylation in the L3 subtype, including H19, HOXC13-AS, NOVA1-AS1, and DUXAP8 (Fig. S6C–F and Table S3). The methylation level of H19 was negatively correlated with H19 expression, as well as the level of TET1, TET3, and TDG (Fig. S6G), suggesting epigenetic regulations contribute to the lncRNA over-expression phenomenon in the L3 subtype. Moreover, as a guardian of the genome, p53 is involved in the regulation of the expression of various lncRNAs [28], including suppression of H19 promoter activity [29]. In the TCGA dataset, H19 expression and methylation were both significantly altered in GC samples with TP53 mutation (Fig. S6H, I). Therefore, we propose that mutations in TP53 and genome-wide hypomethylation in the L3 subtype may contribute to the hyper-expression of several L3-enriched lncRNA species (Table S3).

Out of the lncRNAs with elevated expression and hypomethylation in the L3 subtype, H19 is a molecular marker for GC diagnosis [18, 30] and plays important functions in promoting GC cell proliferation, metastasis [5C–E). While MSI and diffuse histology were also enriched in the LINC01614-high L1 subgroup in the ACRG cohort, the chance of metastasis was not significantly different in LINC01614-high versus LINC01614-low L1 subgroup (Fig. S9A–C). Nevertheless, LINC01614 expression was an independent prognostic factor in the L1 subtype of both the TCGA and ACRG cohort (Fig. 5F, G). L1 subtype was consisted of around 40% of GC cases and was relatively heterogeneous, and LINC01614 can robustly segregate a subgroup of L1 patients with worse survival.

Fig. 5: LINC01614 stratified lncRNA-based subtype L1 into patient clusters with different survival outcomes.
figure 5

A, B High expression of LINC01614 in L1 subtype was associated with unfavorable prognosis in TCGA (A) and ACRG (B) cohorts. The cut-off for defining the high expression group and low expression group were selected to maximize the survival difference (with the minimum P-value by log-rank test (Plog-rank) of all cut-off values tested). Pcox was calculated by the Cox regression test. CE Distribution of TCGA molecular subtypes (C), histological subtype (D), and metastasis status (E) in L1 patients with high and low expression of LINC01614. *Enrichment with P < 0.05 by Fisher’s exact test. F, G Hazard ratio (HR) of different factors considered in the univariate and multivariate cox regression analysis of TCGA cohort (F) and ACRG cohort (G). The significant factors were highlighted in bold.

Therefore, LINC01614 robustly predicted survival outcome and potentially contribute to GC metastasis. We applied the functional prediction module and revealed over 100 genes that were significantly associated with LINC01614 (Fig. S9D). Moreover, a large fraction of genes in the co-expression network of LINC01614 were involved in ECM organization (Fig. S9E), suggesting LINC01614 is functionally involved in regulating ECM remodeling and promoting cell migration.

Experimental validation of LINC01614 in promoting GC

Using semi-quantitative PCR, we validated that LINC01614 is highly expressed in GC cell lines and low in normal cells (GSE1) (Fig. S10A). Ectopic expression of LINC01614 in MKN28, MKN1, GES1, and MGC803 cell lines resulted in accelerated cell proliferation, colony formation, and migration (Figs. 6A–D and S10B–M). Furthermore, we performed CRISPR-Cas9 knockout of LINC01614 in MKN28, MKN1, and GES1 cells and observed attenuated cell proliferation, defects in colony formation, and significantly reduced migration in wound healing assay (Figs. 6E–H and S11), supporting the oncogenic role of LINC01614 in promoting GC cell growth and migration. Recent studies have shown that LINC01614 is implicated in lung cancer [37, 38] and oesophageal squamous cell carcinoma [39], and our study further revealed the oncogenic functions of LINC01614 in GC and suggested potentials of LINC01614 as a biomarker of GC.

Fig. 6: Validation of oncogenic roles of LINC01614 in gastric cancer.
figure 6

A Colony formation capability of MKN28 cell line with and without LINC01614 overexpression. B, C Proliferation rate (B) and migration capability (C) under LIN01614 overexpression (red) and control (black) in MKN28 cell line. A570: absorbance at 570 nm. D Representative images of control or LINC01614-overexpressed MKN28 cells migrating into the wounded areas at 0 h, 48 h, and 72 h. The scale bar indicates 20 µm. E Colony formation capability of MKN28 cell line transfected with CRISPR-Cas9 vector for LINC01614 knockout and with control vector. F, G Proliferation rate (F) and migration capability (G) in MKN28 cells with LINC01614 knockout (red) and control (black). H Representative images of control or LINC01614-knockout MKN28 cells migrating into the wounded areas at 0 h, 48 h, and 72 h. The scale bar indicates 20 µm. In all experiments, three biological replicates were performed for each group. Data are represented as mean ± SD. *P < 0.05; **P < 0.01; ***P < 0.001; ****P < 0.0001 by two-tailed unpaired t-test.

To examine the downstream effectors of LINC01614, we generated RNA-seq data of the GC cell lines after LINC01614 CRISPR-Cas9 knockout or over-expression. Over-expression of LINC01614 in GES1 and MGC803 cell lines resulted in the upregulation of 234 genes, including SERPINE1, LAMC2, and STC2. A significant portion (63.2%) of the up-regulated genes was also identified by the co-expression network of LINC01614 generated using the TCGA dataset (Fig. S12A, B). In contrast, after LINC01614 knockout, 227 genes were downregulated in both the MKN1 and GES1 cell lines, among which 158 genes (69.6%) overlapped with genes in the co-expression network of LINC01614 (Fig. S12C, D). Pre-ranked gene-set enrichment analysis also revealed pathways that were strongly correlated with LINC01614 manipulation, including epithelial-mesenchymal transition, MYC targets, unfolded protein response, IL2/STAT5 signaling, KRAS signaling, and TNFA signaling pathways (Fig. S12E and Table S8), which is consistent with the functions of LINC01614 in promoting cell proliferation and migration. Combining the co-expression network of LINC01614 and transcriptome data collected before and after LINC01614 manipulation, 520 and 28 genes were identified to have a positive and negative association with LINC01614, respectively (Table S9). Altogether, our study revealed the oncogenic functions of LINC01614 in GC and identified the downstream genes that are potentially regulated by LINC01614 through direct or indirect mechanisms.

Discussion

The recent explosive growth of whole transcriptome sequencing data has provided valuable resources for the discovery of a comprehensive set of disease-associated lncRNAs. To build an ideal tool for the big data, we constructed the FLORA pipeline, which by removing reads mapped to the coding regions early in the workflow, can accelerate the process of transcript assembly and lncRNA identification. Considering that most of the whole transcriptomic data are not strand-specific, the accurate reconstruction of lncRNA species that overlap with known coding genes is challenging. Thus, although the prefiltering step excludes reads derived from potential antisense and intronic lncRNA species, it also reduces the chance of false discovery and aids the identification of novel intergenic lncRNAs. In addition, as lncRNA with high expression often yields better reconstruction, we further filtered the list of novel lncRNAs by expression level to select candidates with high confidence. Applying the pipeline to 407 samples collected by TCGA, we discovered 1924 lncRNAs with dysregulated expression in GC, among which 547 have not been previously annotated. In addition to the previously reported oncogenic lncRNAs, several novel lncRNAs also exhibited tumor-specific expression patterns and were associated with poor survival, such as FLORA.GC7366 and FLORA.GC1247, suggesting they may exert oncogenic functions in GC. We believe that FLORA would be a useful tool in lncRNA analysis, especially with the increasing amount of data collected by large international consortiums such as TCGA.

The capability of lncRNAs to reflect subtle changes in cellular states inspired us to identify subtypes of GC by lncRNA expression. Based on the differentially expressed lncRNAs, three GC subtypes were identified, and the low expression levels of lncRNAs did not hinder the robustness of the subtypes. The power of the L3 subtype in identifying high-risk GC cases was robust across tumor stages, cohorts, and expression profiling platforms. While other histological and molecular features, including the Lauren classification [1] and EBV status [

Materials and methods

To comprehensively characterize the noncoding transcriptome of GC, we reanalyzed the whole transcriptome sequencing data of 407 TCGA samples, including 375 GC and 32 tumor-adjacent samples from 380 patients (Table S1) using the FLORA pipeline. The detailed description of FLORA and experiments were included in the Supplementary Materials and Methods.