Introduction

Tumors are defined as new creatures formed by the abnormal proliferation of somatic cells with disordered growth regulation under the influence of tumorigenic factors1. Around the world, tumors have been reported to be the second killer of human health, ranked only behind cardiovascular disease. However, it is still not clear how tumor tissues initiate and invade during the precancerous lesion stage2. Specific genetic alterations have been detected in tumor cells of different types. Some well-known genes, such as p53, K-Ras, etc., have been reported in various tumor types, which have been regarded as genomic markers for the given tumors and may be the original mutation related to tumor initiation and progression3,4.

In 2012, the theory of “cancer drivers” was first presented at the RAOF (Round Asia Oncology Forum), which connects tumor initiation with several specific mutations in the so-called cancer driver genes5. Such theory attributes tumor initiation to several original specific genomic alterations, which sequentially induce metabolic and functional disorders in somatic cells. As we know, based on the differentiation of four basic levels between the tumor and adjacent normal tissues, we can divide cancer driver genes into four clusters: (1) Methylated CpG site genes, (2) microRNA target genes, (3) somatic mutation genes and (4) mRNA genes.

First, the level of methylation of driver genes may change during tumor initiation. Generally, the methylation and demethylation of peculiar regions in chromosomes reflects the regulation of gene expression on the transcriptional level6. The demethylation of oncogenes and/or the methylation of tumor suppressors may induce the proliferation and genomic instability of tumor cells7. Such processes may be the driving procedures of tumor initiation and progression. In addition to methylation and demethylation, another level cancer driver genes contribute to is associated with microRNA expression8. microRNA is a small non-coding RNA molecule that contributes to post-transcriptional regulation by RNA silencing9. During tumor initiation and progression, the specific microRNA expression level changes and may further regulate its functional target genes44. Accordingly, we obtained six sets of the shortest paths. For each set, we extracted genes that occurred in at least one path as the candidate genes. Furthermore to distinguish them, a measurement, namely betweenness45, was conducted for each candidate gene, which is defined as the number of paths containing the gene. Betweenness is a measure of centrality of a vertex within a graph which counts the number of times a node acts as a bridge in the shortest path between two other nodes, which in this study can be used to judge whether the candidate genes can drive tumor initiation on two levels46. For convenience, the set consisting of candidate genes for Gi and Gj is denoted by .

Permutation test

For Gi and Gj (i ≠ j), we can obtain a set of candidate genes, making up the gene set , by the method described in Section “SP method for searching new candidate cancer drivers”. However, not all of them have the potential to become novel driver genes. False positives are inevitable. Among them, some are produced by the construction of the network. Taking this into consideration, we randomly produced two groups of gene sets; each group contained 1,000 gene sets, denoted by and . is the same size as Gi , while is the same size as Gj . For and (k = 1,2, …, 1000), all the shortest paths connecting one gene in and one gene in were searched in the constructed network. We counted the betweenness of the candidate genes in based on these paths. Thus, each candidate gene had one betweenness on Gi and Gj and 1,000 betweenness on and (k = 1,2, …,1000). Another measurement, namely permutation FDR, was computed for each candidate gene, which is defined as the ratio of the 1,000 betweenness on and (k = 1,2, …,1000) that are larger than the betweenness on Gi and Gj. It can be seen that a candidate gene with a high permutation FDR is more likely to be a false positive produced by the construction of the network and should be excluded. Thus, candidate genes with a permutation FDR larger than or equal to 0.05 were discarded. The remaining candidate genes, making up the gene set , were further evaluated by the method in the following section.

Further selection using betweenness and PPI

For Gi and Gj (i ≠ j), some candidate genes remained after executing the permutation test. However, some of them may have strong associations with cancer, while others have weak associations. To reflect this fact, we proposed some rules in this section and selected the important candidate genes based on these rules.

It has been elaborated that the betweenness of a candidate gene is the number of paths connecting genes in Gi and in Gj including the candidate gene. Clearly, a candidate gene with high betweenness suggests it has strong associations with genes in Gi and Gj, thereby having a high likelihood of being a novel cancer driver gene. To build a uniform rule, we must also consider the sizes of Gi and Gj because a candidate gene with a small betweenness for small Gi and Gj is not always less important than another candidate gene with a large betweenness for large Gi and Gj. Thus, we set a betweenness ratio R(g) to measure the importance of a candidate gene g based on its betweenness and sizes of Gi and Gj, which is defined as

It can be seen that a high betweenness ratio of a candidate gene means that the majority of the shortest paths connecting genes in Gi and in Gj contains the candidate gene. We set a threshold of 0.01 for the betweenness ratio to select important candidate genes.

Another rule was built based on the PPIs and their interaction scores. It has been reported that two proteins in a PPI with a high score are more likely to share similar functions33,47,48. Thus, for a candidate gene g, if g and genes in Gi and Gj can comprise PPIs with high interaction scores, g has strong associations with genes in Gi and Gj. Thus, we computed the following value, namely the min-max interaction score:

Similarly, a candidate gene with a high min-max interaction score implies that it has strong associations with at least one gene in Gi and at least one gene in Gj, indicating that it has a high linkage with cancer. Similar to the proportion mentioned in the paragraph above, we can also set a threshold of 400 for the min-max interaction score to select important candidate genes.

For the gene sets Gi and Gj, genes in were filtered by the above two rules. The remaining genes constituted the set , which were deemed to be significant for two levels.

Results

In this study, we proposed a computational method to identify candidate cancer driver genes that can drive tumor initiation on multiple levels. The flowchart of our method is shown in Fig. 1. The results of our method are described in the following sections.

Figure 1
figure 1

Flowchart of our method.

(A) Four gene sets consisting of dysfunctional genes on four levels; (B) SP method to search candidates in a network. Yellow nodes represent dysfunctional genes on different levels and the dashed lines represent the shortest path connecting a and i, e and h are selected; (C) Six candidate gene sets obtained by the SP method; (D) Permutation test to filter some false positives. Two randomly produced sets {d, f} and {c, g} were shown in the network (highlighted in red and green), in detail, red nodes d and c replace yellow node a, while green nodes g and f replace yellow node i, dotted lines represent the shortest path connecting d and f, dashed-dotted lines represent the shortest path connecting c and g and e (highlighted in pink) is removed by the permutation test; (E) Six candidate gene sets filtered by the permutation test; (F) Six candidate gene sets filtered by further selection using betweenness and PPI.

Results of the SP method

As mentioned in Section “Identification of the differentially expressed mRNA genes, microRNAs, methylated CpG sites and somatic mutation genes”, we employed four gene sets G1, G2, G3 and G4 with different expression levels for cancer. For each pair, e.g., Gi and Gj (i ≠ j), we searched all the shortest paths connecting any gene in Gi with any gene in Gj in a network constructed in Section “Network construction” and extracted genes on these paths. The obtained six sets of candidate genes, , are provided in the Supplementary Material II. The numbers of genes in these sets are listed in column 2 of Table 1. It can be seen that many candidate genes were included in each set, meaning that further filtering was necessary. Furthermore, the betweenness of each candidate gene in was calculated and also provided in the Supplementary Material II.

Table 1 Number of candidate genes obtained by the SP method and filtered by the permutation test and further selection.

Results of the permutation test

To control for the false positives produced by the construction of the network in each candidate set , a permutation test was adopted. A permutation FDR was calculated for each candidate gene in , which is also listed in the Supplementary Material II. By setting a threshold of 0.05 for the permutation FDR, we extracted a candidate gene subset from . The detailed genes in the six gene sets are provided in the Supplementary Material III and the numbers of genes in these sets are listed in column 3 of Table 1, from which we can see that the number of candidate genes decreased significantly and became close to the reality.

Results of further selection

To select the core genes in , we calculated the betweenness ratio (cf. Equation 1) and the min-max interaction score (cf. Equation 2) for each candidate gene using a threshold of 0.01 for the betweenness ratio and a threshold of 400 for the min-max interaction score. These two measurements of each candidate gene are provided in the Supplementary Material III and the remaining genes are listed in the Supplementary Material IV. The number of genes in the gene sets is listed in column 4 of Table 1. Compared to the number of candidate genes listed in column 3 of Table 1, the candidate genes were again decreased significantly. It is believed that these candidate genes have few false positives and have strong associations with cancer. Some of them are discussed in the following section.

Discussion

Analysis of candidate genes of two levels

Cancer driver genes as we have mentioned above have been widely reported to be the driving force for the tumorigenesis. According to our method, we obtained various genes that contribute to the initiation and progression of lung adenocarcinoma on at least two levels of the four basic driver levels (methylation, mutations, microRNA expression and expression/mRNA). The important candidates involving any two driver levels are listed in Tables 2, 3, 4, 5, 6, 7. Here, we only provide the brief analyses, the detailed analyses are provided in Supplementary Material V.

Table 2 Important candidate genes in (based on methylated CpG site genes and microRNA target genes) identified by our method.
Table 3 Important candidate genes in (based on methylated CpG site genes and somatic mutation genes) identified by our method.
Table 4 Important candidate genes in (based on methylated CpG site genes and mRNA genes) identified by our method.
Table 5 Important candidate genes in (based on microRNA target genes and somatic mutation genes) identified by our method.
Table 6 Important candidate genes in (based on microRNA target genes and mRNA genes) identified by our method.
Table 7 Important candidate genes in (based on somatic mutation genes and mRNA genes) identified by our method.

For the levels of methylation diversity and microRNA expression abundance, 27 genes were predicted to driver the lung adenocarcinoma in such two levels. Among them, six of them have the evidences to support the claim and are listed in Table 2. Interacting with specific microRNAs such as microRNA-142-3p, functional genes like TCF3, MEN1, MLL, EFNA4, PBX1 and SHH have all been confirmed to contribute to tumorigenesis via methylation alteration and microRNA regulation (The detailed analysis of the important candidates can be seen in Supplementary Material V). Take TCF3 as an example. The methylation alteration of TCF3 has been confirmed to contribute to the proliferation of A549 cells, a typical lung adenocarcinoma cell line in vitro experiments, implying that such gene may also contribute to the initiation and progression of lung adenocarcinoma on such level49. As for the microRNA level, the interactions between TCF3 and a group of microRNAs (miR-590, miR-17 and miR-18) has been confirmed, validating that TCF3 may also contribute to the initiation and progression of lung adenocarcinoma on such level50. For the methylation diversity and mutation differentiation, there are still several genes (such as TCF3, MLL, MEN1, SHH, CTNNB1 and FZD1, listed in Table 3) that have been reported to participate in the tumor associated pathways (The detailed analysis of the important candidates can be seen in Supplementary Material V). The methylation and mutation status of such genes have been confirmed to be abnormal during the progression of lung adenocarcinoma and similar solid tumors. Take FZD1 as an example, FZD1 is a functional gene on our list, which has been widely reported to be related to various tumor subtypes51,52. The methylation status of this gene has been associated with prostate cancer and age-associated diseases53,54. There are only a few reports of FZD1-associated mutations. However, it has been proven that the mutations of FZD-1 may be crucial for specific diseases including tumors, validating our prediction55,56. Considering that the gene expression is regulated by specific methylation process in the genome, this level, corresponding to the mRNA level diversity between the tumor tissue and the adjacent normal tissue, is associated with the first level (methylation diversity). Genes like MEN1, TCF3 and SHH, listed in Table 4, are all crucial genes that contribute to the initiation and progression of tumors on multiple levels (The detailed analysis of the important candidates can be seen in Supplementary Material V). TCF3 as we have mentioned above turns out to be confirmed to contribute to lung adenocarcinoma on methylation alteration49. What’s more, considering the expression regulation function of microRNAs, such identified microRNA-associated cancer driver may also contribute to tumor genesis on mRNA level. Some of the candidate genes have also been predicted to be related to both microRNA expression differentiation and mutation diversity of malignant and somatic cells. Interacting with microRNA-365 and microRNA-27b, the most important candidates for these two levels are listed in Table 5. COL1A2, SHC1, FKBP1A, TTN and NGF are all crucial cancer driver genes that contribute to lung adenocarcinoma in their respective ways (The detailed analysis of the important candidates can be seen in Supplementary Material V). The fifth group of genes (including PTCH1, SHH, ITGA2, ITGA5, GRB2, EP300 and SMC3, listed in Table 6) contribute to the progression of lung adenocarcinoma on at least the microRNA regulation and mRNA expression level (The detailed analysis of the important candidates can be seen in Supplementary Material V). As for the last set of genes, such genes ITGA2, ITGA5, NOTCH1, PXN and DYNLL1, listed in Table 7, contributes to tumors on both the mutation and mRNA levels (The detailed analysis of the important candidates can be seen in Supplementary Material V). All of our predicted genes that contribute to at least two levels have been confirmed to be real driver genes by recent publications

Analysis of candidate genes of high frequencies

In Section “Results”, six sets of candidate genes were obtained that were deemed to induce tumor initiation and progression on two levels. We took the union operation of these six sets and obtained 110 candidate genes. Among them, some genes occurred many times, meaning that they may drive tumor initiation and progression on multiple levels. Thus, the frequencies of 110 candidate genes were counted and listed in the Supplementary Material VI. Because there were totally six sets of candidate genes, six is the maximum value of frequencies for each candidate gene. This section gives a detailed discussion of the genes with frequencies greater than three (half of the maximum frequency), which are listed in Table 8. These candidate genes occurred in more than half of the candidate gene sets and have been reported and confirmed to participate in and contribute to the process of tumorigenesis. Here, their brief discussion is provided. Readers can found the detailed analyses in Supplementary Material VII.

Table 8 Frequencies of some core candidate genes.

Genes occurred in more than three gene sets have been analyzed. Three genes, PTCH1, CTNNB1 and FYN, have been predicted to contribute to the initiation and progression of lung adenocarcinoma in all the six gene sets. Such three genes have all been regarded as core functional cancer driver genes. Associated with proliferation and adhesion associated pathways such a PI-3K cascade, such three genes not only participate in the initiation and proliferation of the tumor , but regulate the metastasis processes as well57 (The detailed analysis of such genes can be seen in the Supplementary Material VII). As for the five genes (BCAR1, SHH, NGF, VEGFA and GCG) that can be identified to be shared in five gene sets, they are also confirmed to be significant driver genes. Take BCAR1 as an example, such gene involves in crucial regulatory pathways like tyrosine kinase signaling pathways and further contribute to the survival, proliferation and invasion processes during tumorigenesis58 (The detailed analysis of such genes can be seen in the Supplementary Material VII). Quite more genes have been clustered in the group with the regulatory level frequency of four. Such genes regulate the abnormal pathways during the tumorigenesis processes such as the cell-cell adhesion regulation (CDH1), proliferation (STAT3, SRC) and chronic inflammatory reaction (LEP). All of such genes can be confirmed to be cancer driver genes by recent publications (The detailed analysis of such genes can be seen in the Supplementary Material VII).

As we have mentioned above, we identified a group of candidate cancer drivers that contribute to lung adenocarcinoma in multiple levels, which are all proved by recent literatures. Here, we may propose a new hypothesis for the initiation and progression of lung adenocarcinoma: the real core driver of lung adenocarcinoma (and maybe other cancers) may contribute to tumor genesis simultaneously on multiple levels. Considering the complicated regulatory system of human bodies, a single abnormal variation that contribute to the genomic alterations of a single level (e.g. mutations) may not be functional and significant enough to initiate the tumor genesis. The real core driver of cancer (lung adenocarcinoma) may contribute to tumor genesis on at least two levels to insure the initiation of malignant changes of normal cells. For example, the well-known typical core drivers for lung adenocarcinoma like EGFR all contribute to lung adenocarcinoma on multiple levels, though haven’t been identified and analyzed in the same publications19,59. All in all, based on our newly presented computational methods, we not only identified a group of novel cancer drivers for lung adenocarcinoma, but presented a new perspective for the underlying mechanisms of tumor genesis, providing a new sight into the initiation and progression of lung adenocarcinoma.

Conclusions

This contribution investigated the so-called cancer driver genes. A computational method was built to identify new potential candidate cancer driver genes. The analyses indicate that some of the obtained genes have the potential to drive tumorigenesis on multiple differentiation levels. It is hopeful that the findings presented in this study will promote the study of cancer driver genes and provide new insights into the investigation of tumor initiation. In this study, we used the protein information (protein-protein interaction) to investigate cancer driver genes. In future, we will consider adding some other information, such as microRNA related to cancer60, into our method, which may yield more useful information for the study of cancer driver gene.

Additional Information

How to cite this article: Chen, L. et al. Identification of novel candidate drivers connecting different dysfunctional levels for lung adenocarcinoma using protein-protein interactions and a shortest path approach. Sci. Rep. 6, 29849; doi: 10.1038/srep29849 (2016).