Abstract
Background
With advancements in high-throughput technologies, the cost of obtaining expression profiles of both mRNA and microRNA in the same individual has substantially decreased. Integrated analysis of these profiles can help to elucidate the functional effects of RNA expression in complex diseases, such as cancer. However, fundamental discrepancies are observed in the results from microRNA-mRNA target gene prediction algorithms, and few packages can be used to analyze microRNA and mRNA expression levels simultaneously.
Results
To address these issues, an R package, anamiR, was developed. A total of 10 experimental/prediction databases were integrated. Two analytical functions are provided in anamiR, including the single marker test and functional gene set enrichment analysis, and several parameters can be changed by users. Here we demonstrate the potential application of the anamiR package to 2 publicly available microarray datasets.
Conclusion
The anamiR package is effective for an integrated analysis of both RNA and microRNA profiles. By characterizing biological functions and signaling pathways, this package helps identify dysregulated genes/miRNAs from biological and medical experiments. The source code and manual of the anamiR package are freely available at https://bioconductor.org/packages/release/bioc/html/anamiR.html.
Similar content being viewed by others
Background
With advancements in high-throughput technologies, the cost of analyzing multiple gene expression profiles in the same individual has dropped substantially. Many studies have described attempts to analyze RNA and microRNA (miRNA) profiles simultaneously. To date, only a few miRNAs and their target genes have been validated through biological experiments [1], and thus prediction algorithms have been widely used to identify potential miRNA-gene interaction pairs. The most popular method for predicting target genes of miRNAs is that of matching the 3’UTR of an mRNA to the “seed region” (a conserved sequence of 2 to 8 nucleotides) of an miRNA, which assumes perfect Watson-Crick complementarity between the mRNA 3’UTR and the miRNA. In addition, the seed region can be used to classify families and species of miRNAs. Consequently, the seed region has been demonstrated to be a key element of miRNA-target gene prediction [2]. However, other research has shown that the pairing mechanism between miRNA and mRNA can occur anywhere along the entire mRNA, which suggests that algorithms focusing on the seed regions can only identify a subset of all potential miRNA-target gene pairs [3]. To address this issue, other popular computational approaches, such as free energy minimization and machine learning, have been developed. Measuring the minimum free energy can help to assess the stability of binding sites between miRNA and target genes. A predicted miRNA-target pair with lower free energy indicates that the binding is more stable and thus more likely to be a true result [2, 4]. With the rapid accumulation of massive amounts of data, machine learning algorithms have also been implemented in many prediction algorithms through training processes containing the entire dataset of known miRNA targets [4]. In such approaches, important features that can facilitate identification of possible target genes from miRNAs can be revealed. Good performance has been reported for support vector machine [5] and generalized linear model [6] algorithms. Furthermore, two machine learning based algorithms were implemented to dissect the associations between miRNAs and diseases [7, The proposed workflows in the anamiR package. (a) The general workflow provides six steps including normalization (optional), differential expression analysis, conversion of miRNA names (optional), correlation analysis, database intersection, and functional analysis. miRNA-gene interaction pairs and correspondingly enriched pathways can be identified from the workflow using both mRNA and miRNA expression data. (b) The function-driven analysis workflow is performed to identify significantly dysregulated pathways and to obtain potential miRNA-gene interaction pairs using genome-wide expression profiles. Each box shown in blue represents a function in the anamiR package. The input data and output results are shown in green. Yellow cylinders indicate databases for query and numbers of dataset collections are shown in parentheses
In addition to the general workflow, the anamiR package provides another workflow called function-driven analysis (Fig. 2b). Users can select the biological functions/pathways of interest on which to perform these integrated analyses. All genes in the corresponding functional pathways are analyzed, with no filtering based on their statistical P-values. Such an approach can not only reduce the number of tests required to obtain an answer, but also take moderate gene expression changes into consideration, even if they do not reach statistical significance.
Statistical approaches
In the anamiR package, we have provided four statistical tests and/or algorithms by which to identify differentially expressed genes and/or miRNAs: limma [24], t-test, Wilcoxon rank sum test, and DESeq2 [25]. The default P-value threshold for the statistical tests is defined as 0.05 and the default threshold for the expression ratio between two phenotypes is set at 0.5. Both parameters are user-adjustable. The Benjamini-Hochberg method is performed to address the multiple testing issue. Three measurements of the trend in the expression levels between gene and miRNA are provided: Pearson correlation coefficient, Kendall rank correlation coefficient, and Spearman’s rank correlation coefficient. Since the expression level of a miRNA is negatively correlated to the expression level of its target gene, the default cutoff of the correlation (also user-adjustable) is defined as − 0.5, which corresponds to moderate correlation (Additional file 1: Table S4 and Table S5). For those genes showing negative correlation with their regulatory miRNA, a hypergeometric test is performed to identify significantly enriched pathways. Lastly, to address the possibility of identifying a significant pathway through miRNA enrichment analysis by random chance [26], a permutation test is embedded in the anamiR package. The random chance of identifying a significant pathway is obtained by randomly selecting the same number of genes in the enriched pathway and calculating the enriched P-value. This procedure is repeated 5000 times to generate a null distribution, and the empirical P-value is determined by comparing the enrichment P-value obtained from the real data with the P-value from the null distribution. For the function-driven analysis workflow, we employed the gage package [27] to perform genome-wide functional enrichment analysis, and users can manually set the number of functional terms of interest.
Results and discussion
Dataset description
To demonstrate the potential applications of the anamiR package, 2 examples from real microarray datasets are provided below. The first example (GSE16558) has 60 patients with multiple myeloma and 5 normal controls [28], whereas the second example (GSE60371) contains 48 prostate cancer patients and 6 normal controls [29]. Both datasets are composed of paired miRNA and mRNA expression data.
Example 1: multiple myeloma (GSE16558)
We analyzed the GSE16558 dataset by the general workflow algorithm (Fig. 2a), using limma with the default parameters shown in the supplementary information (Additional file 1: Table S6). The 5 interaction pairs with the highest correlation coefficients between miRNA and gene are shown in Table 1. A previous study has identified miR-485-5p as having dysregulated expression levels in multiple myeloma [30], and its target gene, JMY, is reported as a mediating and regulatory protein of TP53 [31]. Intriguingly, 3 out of the top 5 enriched pathways were related to TP53 (Table 2), suggesting the interaction between miR-485-5p and JMY deserves further investigation in multiple myeloma patients.
Alternatively, the top 5 enriched pathways from the function-driven analysis (Fig. 2b) are shown in Table 3. Notably, the most significant pathway was related to ribosome function, and 3 of the 5 interaction pairs with the largest negative correlations include miR-485-5p, in agreement with the results from the general workflow (Additional file 1: Table S7). Therefore, these results demonstrate that anamiR is able to identify important interaction pairs of miRNA and target genes in a specific disease.
Example 2: prostate cancer (GSE60371)
Similar to the previous example, we utilized the general workflow algorithm with the default parameters in GSE60371 (Additional file 1: Table S6). Notably, 2 miRNAs from the miR-320 family were identified (Table 4), and their down-regulation has been reported in prostate cancer [37]. Intriguingly, the overexpression of UBE2C was reported in prostate cancer [38] and its miRNA regulator is miR-320d. Taken together, the dysregulation of the miR-320 family and UBE2C deserve further investigation in prostate cancer.
As shown in Tables 5 and 6, the results showed that the proportion of miRNA-gene pairs showing negative Pearson correlation coefficients increase along with the number of analyzed algorithms, suggesting better prediction performances can be achieved by the integration of multiple algorithms.
Conclusions
The anamiR package provides an integrated approach for identifying paired mRNA and miRNA expression profiles. The general workflow is utilized to predict the target genes and their associated functional pathways for miRNA simultaneously. Within gene sets and pathways of interest, the function-driven analysis workflow is applied to identify miRNA-gene interaction pairs from among the significant gene sets and pathways. We believe that approaches considering the associations between mRNAs and miRNAs, as well as regulation of genes and pathways, can provide insight into dysfunction in cancers.
Availability and requirements
-
Project name: anamiR
-
Project home page: https://bioconductor.org/packages/release/bioc/html/anamiR.html
-
Operating system(s): Platform independent
-
Programming language: R
-
Other requirements: R (> = 3.3.3), SummarizedExperiment (> = 1.1.6), Bioconductor (> = 3.4), stats, DBI, limma, lumi, agricolae, RMySQL, DESeq2, SummarizedExperiment, gplots, gage, S4Vectors
-
License: GNU GPLv2
-
Any restrictions to use by non-academics: None