Background

With advancements in high-throughput technologies, the cost of analyzing multiple gene expression profiles in the same individual has dropped substantially. Many studies have described attempts to analyze RNA and microRNA (miRNA) profiles simultaneously. To date, only a few miRNAs and their target genes have been validated through biological experiments [1], and thus prediction algorithms have been widely used to identify potential miRNA-gene interaction pairs. The most popular method for predicting target genes of miRNAs is to match the 3’UTR of an mRNA to the “seed region” of an miRNA (a conserved sequence spanning nucleotide positions 2 to 8), which assumes perfect Watson-Crick complementarity between the mRNA 3’UTR and the miRNA seed. In addition, the seed region can be used to classify families and species of miRNAs. Consequently, the seed region has been demonstrated to be a key element of miRNA-target gene prediction [2]. However, other research has shown that pairing between miRNA and mRNA can occur anywhere along the entire mRNA, which suggests that algorithms focusing on the seed region can identify only a subset of all potential miRNA-target gene pairs [3]. To address this issue, other popular computational approaches, such as free energy minimization and machine learning, have been developed. Measuring the minimum free energy helps to assess the stability of binding sites between miRNAs and target genes. A predicted miRNA-target pair with lower free energy indicates that the binding is more stable and thus more likely to be a true result [2, 4]. With the rapid accumulation of massive amounts of data, machine learning has also been incorporated into many prediction algorithms through training processes containing the entire dataset of known miRNA targets [4]. Such approaches can reveal important features that facilitate the identification of possible target genes of miRNAs.
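The seed-matching idea described above can be illustrated with a minimal Python sketch. It is not the implementation of any published prediction tool: the function names `reverse_complement_rna` and `seed_sites` are hypothetical, the seed is assumed to span positions 2 to 8 from the 5’ end of the miRNA, and only perfect Watson-Crick complementarity is considered. The sequences are toy inputs chosen for illustration.

```python
def reverse_complement_rna(seq):
    """Return the reverse complement of an RNA sequence (A-U, G-C pairs only)."""
    pairs = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq))


def seed_sites(mirna, utr, start=2, end=8):
    """Find 0-based positions in `utr` that pair perfectly with the miRNA seed.

    Assumption: the seed covers 1-based positions `start`..`end` of the miRNA.
    DNA-style input (T) is converted to RNA (U) before matching.
    """
    mirna = mirna.upper().replace("T", "U")
    utr = utr.upper().replace("T", "U")
    seed = mirna[start - 1:end]
    site = reverse_complement_rna(seed)  # sequence expected on the mRNA
    positions = []
    i = utr.find(site)
    while i != -1:
        positions.append(i)
        i = utr.find(site, i + 1)  # allow overlapping matches
    return positions


# Toy example: the seed GAGGUAG pairs with CUACCUC on the 3'UTR.
toy_mirna = "UGAGGUAGUAGGUUGUAUAGUU"
toy_utr = "AAACUACCUCAGGGCUACCUCA"
print(seed_sites(toy_mirna, toy_utr))  # [3, 14]
```

Because the sketch reports only exact seed matches, it would miss pairing outside the seed region, which is the limitation noted in [3].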
Good performance has been reported for support vector machine [5] and generalized linear model [6] algorithms. Furthermore, two machine learning based algorithms were implemented to dissect the associations between miRNAs and diseases [7,

Fig. 2. The proposed workflows in the anamiR package. (a) The general workflow provides six steps including normalization (optional), differential expression analysis, conversion of miRNA names (optional), correlation analysis, database intersection, and functional analysis. miRNA-gene interaction pairs and correspondingly enriched pathways can be identified from the workflow using both mRNA and miRNA expression data. (b) The function-driven analysis workflow is performed to identify significantly dysregulated pathways and to obtain potential miRNA-gene interaction pairs using genome-wide expression profiles. Each box shown in blue represents a function in the anamiR package. The input data and output results are shown in green. Yellow cylinders indicate databases for query, and numbers of dataset collections are shown in parentheses.

In addition to the general workflow, the anamiR package provides another workflow called function-driven analysis (Fig. 2b). Users can select the biological functions/pathways of interest on which to perform these integrated analyses. All genes in the corresponding functional pathways are analyzed, with no filtering based on their statistical P-values. Such an approach can not only reduce the number of tests required to obtain an answer, but also take moderate gene expression changes into consideration, even if they do not reach statistical significance.

Statistical approaches

The anamiR package provides four statistical tests and/or algorithms for identifying differentially expressed genes and/or miRNAs: limma [24], t-test, Wilcoxon rank sum test, and DESeq2 [25]. The default P-value threshold for the statistical tests is 0.05, and the default threshold for the expression ratio between two phenotypes is 0.5. Both parameters are user-adjustable. The Benjamini-Hochberg method is applied to address the multiple testing issue. Three measures of the trend in expression levels between a gene and a miRNA are provided: Pearson correlation coefficient, Kendall rank correlation coefficient, and Spearman’s rank correlation coefficient. Since the expression level of a miRNA is expected to be negatively correlated with that of its target gene, the default correlation cutoff (also user-adjustable) is −0.5, which corresponds to moderate correlation (Additional file 1: Table S4 and Table S5). For genes showing negative correlation with their regulatory miRNA, a hypergeometric test is performed to identify significantly enriched pathways. Lastly, to address the possibility of identifying a significant pathway through miRNA enrichment analysis by random chance [26], a permutation test is embedded in the anamiR package. The chance of identifying a significant pathway at random is estimated by randomly selecting the same number of genes as in the enriched pathway and calculating the enrichment P-value. This procedure is repeated 5000 times to generate a null distribution, and the empirical P-value is determined by comparing the enrichment P-value obtained from the real data with the P-values from the null distribution. For the function-driven analysis workflow, we employed the gage package [27] to perform genome-wide functional enrichment analysis, and users can manually set the number of functional terms of interest.
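To make the statistical steps above concrete, the following Python sketch illustrates a negative-correlation filter, a hypergeometric enrichment test, and a permutation-based empirical P-value. It is an illustration under stated assumptions, not the anamiR code: all function names and the toy data are hypothetical, only the Pearson coefficient is shown, and the permutation step follows one reading of the description, in which random gene sets of the same size as the pathway are drawn from the background. The empirical P-value uses the common (count + 1) / (permutations + 1) estimator, which is also an assumption.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def hypergeom_pvalue(overlap, n_selected, n_pathway, n_background):
    """Upper-tail P(X >= overlap) of the hypergeometric distribution."""
    total = math.comb(n_background, n_selected)
    upper = min(n_selected, n_pathway)
    tail = sum(
        math.comb(n_pathway, i)
        * math.comb(n_background - n_pathway, n_selected - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


def empirical_pvalue(selected, pathway, background, n_perm=5000, seed=0):
    """Return (observed enrichment P-value, empirical P-value).

    Assumption: the null is built from random gene sets that have the
    same size as the pathway and are drawn from the background.
    """
    rng = random.Random(seed)
    background = sorted(set(background))
    selected = set(selected) & set(background)
    pathway = set(pathway) & set(background)
    sizes = (len(selected), len(pathway), len(background))
    observed = hypergeom_pvalue(len(selected & pathway), *sizes)
    as_extreme = 0
    for _ in range(n_perm):
        pseudo_pathway = rng.sample(background, len(pathway))
        hits = len(selected.intersection(pseudo_pathway))
        if hypergeom_pvalue(hits, *sizes) <= observed:
            as_extreme += 1
    return observed, (as_extreme + 1) / (n_perm + 1)


# Toy data: one miRNA and two genes measured in five samples.
mirna = [1.0, 2.0, 3.0, 4.0, 5.0]
genes = {
    "geneA": [9.1, 8.0, 6.2, 4.9, 3.0],
    "geneB": [2.0, 2.1, 1.9, 2.2, 2.0],
}
# Keep genes whose correlation with the miRNA is at or below the -0.5 cutoff.
anticorrelated = [g for g, e in genes.items() if pearson(mirna, e) <= -0.5]

# Toy enrichment: 50 background genes, a 10-gene pathway, 6 selected genes.
background = [f"g{i}" for i in range(50)]
pathway = background[:10]
selected = ["g0", "g1", "g2", "g3", "g20", "g21"]
observed_p, empirical_p = empirical_pvalue(selected, pathway, background, n_perm=1000)
```

In this toy setting only `geneA` passes the correlation filter, and the empirical P-value indicates how often a random gene set of the pathway's size yields an enrichment P-value at least as small as the observed one.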