Background

MicroRNAs (miRNAs) are an abundant class of small noncoding RNAs (20-24 nts) that can affect gene expression by post-transcriptional regulation of mRNAs [1]. They typically base pair with sequences in the 3' UTR of mRNAs to inhibit mRNA translation or to promote their degradation. miRNAs have been shown to play important roles in various cellular and pathogenic processes, including development, cell death, immunological response, and carcinogenesis [2, 3]. Since the functional characterization of miRNAs depends heavily on identification of their specific target mRNAs, numerous bioinformatics methods have been developed for this task (see e.g., [411], for a review see [1214]).

In recent years, it has been shown that viruses also encode miRNAs [15, 16]. Although for most of the viral miRNAs no functions have yet been described, it is conjectured that these miRNAs also take part in the complex interactions between viruses and their hosts, and several scenarios have been proposed [17] (see Figure 1, where scenario numbers are written in rhombi). In the first scenario, viral miRNAs regulate viral gene expression for maintaining, e.g., replication, latency, or evading the host-immune system. For example, miR-BART2 of EBV exhibits perfect complementarity to the 3'UTR of BALF5, which encodes the viral DNA polymerase [18], and such an interaction might be essential for maintaining EBV latency. In the second scenario, host miRNAs interact with viral RNAs, thereby inhibiting virus replication, e.g., miR-32 can limit the replication of the retrovirus primate foamy virus (PFV) in cell culture through an interaction with PFV mRNAs [19]. In the third scenario, viral miRNAs regulate host genes to induce a more favorable environment for the virus. The regulated genes are most likely related to antiviral response, apoptosis, interferon system, signal transduction, or cellular proliferation [20, 21]. Stern-Ginossar et al. [22] showed that human cytomegalovirus (HCMV) miRNA, miR-UL112, inhibits the translation of a cellular gene MICB, which is normally activated when cells are subjected to severe stress, such as viral infection. MICB marks these cells for destruction by natural killer cells. The resulting absence of MICB protein protects HCMV-infected cells against lysis by natural killer cells.

Figure 1
figure 1

Four scenarios for miRNA involvement in host-virus interactions. (1) Viral miRNAs inhibit the expression of viral mRNAs. (2) Host miRNAs inhibit the expression of viral mRNAs. (3) Viral miRNAs inhibit the expression of host mRNAs. (4) Host miRNAs inhibit the expression of host mRNAs. In this case, viruses may regulate host miRNA expression and indirectly regulate host gene expression. (This figure is partially adapted from [81].)

Viruses may affect host miRNAs for their own advantage, by regulating up or down the expression of these miRNAs. This happens in the fourth scenario reported in [23]. In their research, host miR-17 and miR-20a were found to be downregulated following human immunodeficiency virus (HIV-1) infection. These miRNAs target the 3'UTR of host gene PCAF, which has been proposed to promote HIV-1 transcriptional elongation. Hence, this downregulation is needed for efficient replication of the virus.

In our work we will examine the combination of the last two scenarios, assuming that both human and viral miRNAs play an active role in regulating genes that either promote or inhibit viral replication or life cycle.

Cooperativity of miRNAs in gene regulation

Previous studies show that one miRNA may have several target genes. Furthermore, one mRNA can be targeted by multiple miRNAs, reflecting cooperative translational control [5, 24, 25]. Experimental evidence indicates that multiple target sites (of one or more miRNAs) in the same 3'UTR can potentially increase the degree of translational repression [9]. In addition, microarray analyses [26] reveal that most of the miRNAs only modestly affect their mRNA targets. Recent findings show that some virus encoded miRNAs have sequence homology to their host human/mouse miRNAs, mainly in the seed region (see review [21]). This is one of many mechanisms that are used by viruses to exploit their host [2730]. We were intrigued by an additional possibility in which viral and host encoded miRNAs target the same mRNA, possibly in different target sites, to promote gene silencing. It would make sense that viral miRNAs that share targets with human miRNAs may contribute to increasing the translational repression and tighten the regulation which already exists at a low level in the cell (by the host regulation machinery). This important question motivated us to study and predict the cooperativity of viral and host miRNAs on host genes.

In this paper, we present a new computational method for examining cooperative effects of viral and host miRNAs on the regulation of host genes. We find modules consisting of both viral and human miRNAs, and their common target genes, that are involved in similar biological processes (according to GO). The mathematical formulation of our problem is an extension of a related problem, where the sought modules consist of miRNAs and genes of the same species. Below we present some studies that worked to solve the latter problem. Yoon and De Micheli [31], the first to address that problem, represent the multiple relations between miRNAs and target genes by a weighted bipartite graph, and find bi-cliques in the graph that represent the miRNA-mRNA modules (see Figure 2). They predict modules using target prediction based on sequence information.

Figure 2
figure 2

Bipartite graph. Bi-cliques in the graph are circled.

Two other approaches combine several information sources to extract the modules, including predictions of miRNA target genes and their respective expression profiles. Joung et al. [32] apply a genetic algorithm, based on coevolutionary learning, to find an optimal miRNA-mRNA module. Tran et al. [33] use a rule-based learning method to identify the modules.

In general, finding bi-cliques in bi-partite graphs can be formulated as a bi-clustering problem (reviewed in [34]), which is known to be NP-complete. Therefore, most of the methods that address bi-clustering are based on heuristic approaches, which may miss good solutions. Alternatively, applying naive exhaustive enumeration to this problem is extremely time consuming to the point of being impractical. The problem becomes even more complicated when searching for cooperative human-viral miRNA modules. Directly applying one of the existing bi-clustering algorithms to a united set of both human and viral miRNAs may be insensitive to the contribution of viral miRNAs. This is due to the wide imbalance between the numbers of human miRNAs and viral miRNAs (hundreds of human miRNAs versus tens of viral miRNAs identified to date).

In order to be able to control the composition of the sought modules in terms of viral and human miRNA participants, a new type of clustering algorithm is needed. And indeed, in this paper we describe a bi-targeting algorithm which we developed (see Methods). We apply a branch and bound approach to develop and prune an enumeration tree of groups of human genes that are targeted by the same set of viral and human miRNAs. This algorithm is very efficient since we exploit the fact that the number of known viral miRNAs is relatively small, and that we demand a quorum on the number of miRNAs in the modules (a minimum number of viral and human miRNAs). This "pruning by quorum" allows us to apply enumeration in practical run time. We implemented our bi-targeting algorithm into a software system that combines a variety of biological data sources, such as: genomic sequences, miRNA expression measurements, GO annotations, and tissue context. The combination of several information sources in the task of module detection minimizes noise and errors accompanying each information source, resulting in more accurate results.

In addition, we developed a very flexible and efficient target prediction algorithm, which we briefly describe in the next section, and in more detail in the Appendix Section. We use it as a preprocessing step to the module search.

Using the bi-targeting system, we study the cooperative regulation of human and Epstein Barr Virus (EBV) miRNAs on human mRNAs in various lymphomas related to EBV. The significant modules are picked by a sampling procedure (described in the next section). We validate two of these modules by surveying information from the biological and biomedical literature. The rest of the significant modules can be found at http://www.cs.bgu.ac.il/~vaksler/BiTargeting.htm.

We believe that the results of this research may have far-reaching implications, as they may be applied to the diagnosis and treatment of viral infections, as well as for a wider understanding of viral-induced diseases and the role that miRNAs plays in them. Our study can contribute to the discovery of genes that regulate virus-host interactions, especially, but not only, in chronic infections, and can serve as targets for therapy [35, 36].

Methods

Our method consists of two stages: a target prediction stage and a module search stage. In the first stage, we perform all against all dynamic programming on two sets of sequences (miRNAs and mRNAs) in order to predict, for every miRNA in the set, its target mRNAs. In the second stage we look for modules that are enriched in some biological process, using the results from the first stage and information from Gene Ontology and the miRNA Atlas. We elaborate more on the second stage since this is the main goal of our paper. The interested reader may find a description of the first stage in the Appendix Section.

Stage 1 - Predicting microRNA targets

The purpose of this stage is to perform an efficient genome-wide target prediction of all the miRNAs against all the possible subsequences of a set of mRNAs. miRNA target prediction is a well-studied problem, with numerous tools and approaches available. Each of these tools accommodates a variety of principles and features of target recognition, for example typical base pairing patterns, thermodynamic analysis of the miRNA-target duplexes, conservation of the target site among related species, and accessibility of the target site. For a review see [13, 14].

The amount of the data we need to analyze is huge (thousands of genes and hundreds of miRNAs) and thus applying existing tools to this analysis would be too time consuming. Hence, we developed our own efficient tool for this task (which can serve as a plug-in filter to other target prediction tools). The method we propose extends the threshold all-against-all sequence alignment algorithm [37, 38].

We store the miRNA sequences in a prefix tree (trie) and the mRNA sequences in a separate prefix tree (for details see the Appendix Section). We then apply dynamic programming to compute hybridization error between a prefix from the mRNA tree and a prefix from the miRNA tree, for each such pair. If, in one comparison, the number of errors we encounter is larger than a given threshold, then we stop comparing this pair and its descendants in the respective trees (prune the subtrees). Those miRNA-target pairs that survived the error threshold can be further checked for duplex energy by the RNAduplex program [39]. This results in linking each miRNA in our data set to its predicted target genes.

This method is efficient since it simultaneously checks sets of mRNA sequences (and, respectively, miRNA sequences) that share a prefix in the mRNA (respectively, miRNA) prefix trees. Moreover, it goes over all the duplexes, and prunes those with a hybridization error above a threshold, allowing a very efficient filtration.

Comparison with existing approaches

We compare our target prediction method with three existing tools (those that provide downloadable code). The tools are miRanda [5], PITA [40] and RNAhybrid [10]. Our dataset consists of 873 experimentally validated miRNA-gene pairs out of 99 human miRNAs and 640 mRNAs (from miRecords [

Table 1 Constraint parameters for the target prediction algorithm.
Table 2 A comparison between our method and other target prediction tools.

Stage 2 - Finding modules - the enumeration algorithm

In this section we describe our enumeration algorithm for finding modules that are statistically enriched in biological processes (GO categories). An important objective in our study is to ensure that the modules found by our method contain miRNAs that are mutually expressed and thus can cooperatively regulate gene expression. In order to ensure this condition, we use the miRNA Atlas [43]. This Atlas provides miRNA expression profiles in different conditions (i.e., different organ systems and cell types, of normal and malignant tissues). We focus on a set of conditions where both human and viral miRNAs (of the virus of interest) are expressed. Based on these data sources, we address the following problem.

Given a GO category C, a miRNA atlas condition A, expression level threshold t, a p-value threshold p, the minimal number of human and viral miRNAs required in a sought module q1and q2(the quorums), respectively. Find all the modules composed of miRNAs whose expression in A is greater than t, which include at least q1and q2human and viral miRNAs, respectively, and whose intersection of the target-sets of human genes yields an enrichment p-value smaller than p in C.

The enrichment is measured using a hypergeometric p-value [44]. The hypergeometric p-value measures the statistical significance of the overlap between the target genes and the genes in the considered category (for more details see the Appendix Section).

For each GO category C and each Atlas condition A of interest, we perform the enumeration process as follows. Let H and V be the set of human and viral miRNAs. Let H A and V A be the subset of human and viral miRNAs expressed in A. Let T A be the set of their target genes (each gene in T A is targeted by at least q1 human and q2 viral miRNAs); and let TC A = T A C (see Figure 3).

Figure 3
figure 3

An illustration of the connection between Atlas condition and GO category. H is all the known human miRNAs and V is all known viral miRNAs. H A (V A ) are the human (viral) miRNAs expressed in Atlas category A. They target gene set T A . TC A consists of the targets of T A that belong to GO category C.

Our enumeration algorithm dynamically constructs and prunes an enumeration tree, which is built from the genes in TC A . A path in the tree represents a potential module. The module consists of the genes in the inner nodes of the path and a list of miRNAs (which target all these genes) in the leaf (see, e.g., Figures 4 and 5). At the end of the run of the algorithm, the enumeration tree stores in its leaves all the modules that satisfy the quorum constraints. The pseudocode of our algorithm is found in the Appendix Section. Below we supply an illustration of the algorithm applied on the data from Table 3, with the quorum constraints q1 = 2 and q2 = 1. We initialize the tree to consist of a root node and one dummy leaf node with all the human and viral miRNAs in A (see the leftmost leaf in Figure 4(a)). Genes are inserted to the tree one by one in order of increasing number of hits of human miRNAs on the gene (b, c, d, a). Each inserted gene, g, is connected by an edge to the root and becomes a root of newly generated copies of all its preceding siblings in the tree. The leaves of the new subtree contain intersections of g's miRNA sets (human and viral) with the corresponding sets in the siblings' leaves. When these intersections do not satisfy the quorum constraints we prune the unsatisfying edge from the tree. (See the node for gene b, and then for gene c, in Figure 4(a). The leaves of the subtree rooted at c contain the intersections of the set of miRNAs hitting c with the set of the dummy leaf, and of b. Notice that the respective intersections of the miRNA sets hitting b and c are empty. Thus the edge from c to b is pruned from the tree).

Table 3 miRNA target information.
Figure 4
figure 4

The process of building the enumeration tree for target information in Table 3. The rhombus represents the root of the tree. The circles represent the genes inserted into the tree. The squares represent the leaves, which store two sets of miRNAs - human and viral. These miRNAs are common "targeters" to all the genes in the path from the leaf to the root. Viral miRNAs are labeled with uppercase letters and the human miRNAs are labeled with numbers. The rectangles indicate modules that are pruned from the tree for one of two reasons: (1) modules that break the quorum are in solid rectangles (in this example the required quorum is 1 viral miRNA and 2 human miRNAs); (2) redundant modules (which are fully contained in another module) are in dotted rectangles. The arrow indicates the containment, starting from the redundant module and ending in the bigger one. The three sub-figures (a), (b), and (c) show the insertion of genes b and c, d, and a into the tree, respectively.

Figure 5
figure 5

The final enumeration tree. The modules found by the enumeration algorithm (encircled by ellipses).

In Figure 4(b) we show the tree after inserting d and making it the root of all its preceding siblings: the dummy node, b node, and c node. In the leaves of the subtree rooted at d are the intersections of the respective miRNA sets. Now two prunings occur in the enumeration tree. The original node b (in the dotted rectangle) and its targeting miRNAs, are now fully contained in the subtree rooted in d, thus b is now redundant and deleted from the enumeration tree. As in 4(a), the sets of miRNAs of c and of d are both empty and the edge from d to c is pruned (solid rectangle).

In Figure 4(c) we illustrate adding a, making it the root of the subtrees of its siblings. Intersecting a's miRNA sets with the siblings' miRNA sets, resulting in deleting c (dotted rectangle) and pruning d (solid rectangle).

At the end of the process, the enumeration tree contains all the modules (see Figure 5). The leaves of the tree are traversed and their p-value is computed. The modules are then filtered (see the sampling procedure below) and reported according to ascending p-value.

The sampling procedure

To set a cutoff p-value for significant modules, we estimate the distribution of the hyper-geometric p-value scores for every <A, C> pair as follows. We sample, from the full set of miRNAs, two subsets of human and viral miRNAs, of same size as H A and V A , respectively. Then we find their target genes that satisfy the quorums, and intersect them with C. Next, we perform the enumeration algorithm on the genes in the intersection and store the hypergeometric p-values of the obtained modules. We repeat this sampling process 10,000 times, and compute the distribution of the obtained p-values. This distribution allows us to pick a significant p-value for our reported modules. We pick the p-value that is in the 0.01 quantile of the distribution to be the p-value cutoff for our reported modules.