Background

MicroRNAs (miRNAs) are a class of non-coding single-stranded RNA molecules, usually 18–24 nucleotides long. Rather than encoding proteins, miRNAs participate in post-transcriptional regulation of gene expression in eukaryotes and viruses [1]. Although the first miRNA, lin-4, was discovered in 1993 [2], the diversity and prevalence of these genes have been revealed only in recent years. To date, 38,589 miRNAs have been found in animals, plants and viruses [3]. At the same time, miRNAs have been found to play important roles in cell proliferation [4], differentiation [5], senescence [6] and apoptosis [7], among other processes. One study indicated that more than one third of human genes are regulated by miRNAs [8]. Dysregulation of miRNAs could therefore have severe effects on human health.

Evidence shows that an increasing number of miRNAs are closely associated with diseases [9]. Since the first discovery of miR15 and miR16 deficiency in B cell chronic lymphocytic leukemia (B-CLL) [10], new miRNA-disease associations have been reported regularly. For example, the expression of miR-25 and miR-223 is significantly higher in patients with esophageal squamous cell carcinoma than in healthy individuals, while the expression of miR-375 is significantly lower [11]. Studies show that miR-26a may be a regulatory factor that inhibits the progression and metastasis of c-Myc/EZH2 double high advanced hepatocellular carcinoma (HCC) [12]. In addition, miR-340 has been suggested as a biomarker for cancer metastasis and prognosis [13]. Research on miRNAs and diseases is now becoming more extensive, and researchers have developed a number of databases to store miRNA and disease data, such as dbDEMC [14], HMDD v3.0 [15] and miR2Disease [16]. Unfortunately, the known association data are incomplete. Moreover, traditional methods for identifying new miRNA-disease associations are time-consuming and laborious.

With the improvement of information technology and the development of a large number of miRNA data sets, many effective methods for predicting miRNA-disease associations have been proposed [

$$TPR = \frac{TP}{{TP + FN}},$$
(1)
$$FPR = \frac{FP}{{TN + FP}},$$
(2)

where \(TP\) is the number of samples that are actually positive and are also predicted to be positive, and \(TN\) is the number of samples that are actually negative and are also predicted to be negative. \(FP\) and \(FN\) count the samples whose predictions disagree with their true labels: \(FP\) is the number of negative samples predicted to be positive, and \(FN\) is the number of positive samples predicted to be negative.
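As a minimal sketch of Eqs. (1) and (2), the following Python snippet counts the four outcomes at a single score threshold and derives TPR and FPR from them. The function names, the toy labels and scores, and the threshold of 0.5 are illustrative assumptions and are not taken from the original experiments.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if y == 1 and predicted_positive:
            tp += 1  # actually positive, predicted positive
        elif y == 0 and predicted_positive:
            fp += 1  # actually negative, predicted positive
        elif y == 0:
            tn += 1  # actually negative, predicted negative
        else:
            fn += 1  # actually positive, predicted negative
    return tp, fp, tn, fn


def tpr_fpr(tp, fp, tn, fn):
    """Eq. (1): TPR = TP / (TP + FN); Eq. (2): FPR = FP / (TN + FP)."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (tn + fp) if (tn + fp) else 0.0
    return tpr, fpr


# Toy data (hypothetical): 1 = known association, 0 = no known association.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.6]

tp, fp, tn, fn = confusion_counts(labels, scores, threshold=0.5)
tpr, fpr = tpr_fpr(tp, fp, tn, fn)
print(tp, fp, tn, fn, tpr, fpr)
```

Repeating this calculation while sweeping the threshold over all observed scores yields the points of the ROC curve.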

In order to make the performance evaluation more comprehensive, we also use other evaluation indicators, including the accuracy, precision, recall and f-measure. Their calculation formulas are defined as follows:

$$accuracy = \frac{TP + TN}{{TP + TN + FP + FN}},$$
(3)
$$precision = \frac{TP}{{TP + FP}},$$
(4)
$$recall = \frac{TP}{{TP + FN}},$$
(5)
$$f\text{-}measure = \frac{2 \times precision \times recall}{{precision + recall}}.$$
(6)
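The four indicators can be computed directly from the confusion counts, as in the sketch below. The f-measure is taken as the harmonic mean of precision and recall. The function name and the example counts are hypothetical and serve only to illustrate the arithmetic.

```python
def evaluation_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and f-measure from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = evaluation_metrics(tp=8, fp=2, tn=85, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the f-measure equals 2TP / (2TP + FP + FN) = 16/23, which is a convenient check on the implementation.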

Comparison with other methods

The AUC value lies between 0 and 1, and a higher AUC indicates a better prediction result. MCCMF obtains an AUC value of 0.9563 in the fivefold cross validation. MCCMF is compared with four advanced methods, WBNPMD [43], RLSMDA [44], GRNMF [21] and CMF [45], and achieves the best performance among them. The ROC curves are drawn in Fig. 1, and the comparison results are listed in Table 1. The results of the other methods in Table 1 are taken directly from the literature.
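To illustrate how an AUC value is obtained from predicted scores, the sketch below builds the ROC points by lowering the threshold one distinct score at a time and integrates the curve with the trapezoidal rule. This is a generic calculation and not necessarily the exact procedure used to produce the reported values; the function names and the toy data are assumptions.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points, assuming both classes are present."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        # Consume all samples sharing this score before emitting a point,
        # so that tied scores are handled consistently.
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data (hypothetical): three positive and three negative samples.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
auc_value = auc(points)
print(round(auc_value, 4))
```

In this toy example, 8 of the 9 positive-negative pairs are ranked correctly, so the AUC is 8/9.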

Fig. 1 The ROC curves for each method in the fivefold cross validation experiment

Table 1 AUC results of cross validation experiments

In Table 1, the highest value is highlighted in italic, with the standard deviation in parentheses. In the fivefold cross validation experiment, WBNPMD, RLSMDA, GRNMF, CMF and MCCMF obtain AUCs of 0.9173, 0.8389, 0.869, 0.8697 and 0.9569, respectively. MCCMF therefore achieves the highest AUC of the five methods.
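The sketch below shows one common way to organize a fivefold cross validation and to summarize the per-fold AUCs as a mean with a standard deviation. How the folds are actually drawn for MCCMF is not specified in this section, so the random partition, the seed, the use of the sample standard deviation and the per-fold values are all placeholders.

```python
import random
import statistics


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[k::n_folds] for k in range(n_folds)]


# Each fold is held out once as the test set while the rest is used for training.
folds = five_fold_indices(23)
for k, test_fold in enumerate(folds):
    train = [i for j, fold in enumerate(folds) if j != k for i in fold]
    print(f"fold {k}: {len(train)} training samples, {len(test_fold)} test samples")

# Placeholder per-fold AUCs (not real results), used only to show the summary step.
fold_aucs = [0.5, 0.6, 0.7, 0.8, 0.9]
mean_auc = statistics.mean(fold_aucs)
std_auc = statistics.stdev(fold_aucs)  # sample standard deviation (assumption)
print(f"{mean_auc:.4f} ({std_auc:.4f})")
```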

WBNPMD, which has the highest AUC among the four competing methods, is selected for further comparison with MCCMF. The accuracy, precision, recall and f-measure of the two methods are presented as a bar graph in Fig. 2. MCCMF again performs better than WBNPMD.

Fig. 2 Comparison of the accuracy, precision, recall and f-measure with WBNPMD

Case studies

Finally, we carry out a simulation experiment to analyze specific diseases. First, the disease of interest is selected and the miRNAs are ranked by their predicted association scores for it. Then, the miRNAs with the highest predicted scores are taken as candidates. Next, the candidates are compared with the original miRNA-disease association matrix to determine whether each high-scoring association is already known. Finally, the unknown associations are verified by searching existing data sets. Here, we choose three diseases for analysis: Gastrointestinal Neoplasms, Retinoblastoma and Hepatoblastoma. Three popular data sets, dbDEMC [14], HMDD v3.0 [15] and miRCancer [46], are used for validation. These data sets store miRNA-disease associations that have been experimentally confirmed over the years.
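The case-study procedure can be sketched as follows: rank the miRNAs by predicted score for one disease, mark those already present in the known association matrix, and look up the rest in the validation data sets. The function name, the `top_k` cutoff and the placeholder miRNA names and database contents are hypothetical; real entries would come from the association matrix and from dbDEMC, HMDD v3.0 and miRCancer.

```python
def case_study(disease_scores, known_mirnas, validation_sets, top_k):
    """Rank miRNAs for one disease and label each top-ranked prediction."""
    ranked = sorted(disease_scores.items(), key=lambda item: item[1], reverse=True)
    report = []
    for mirna, score in ranked[:top_k]:
        if mirna in known_mirnas:
            evidence = "known"
        else:
            # Collect every validation data set that lists this miRNA.
            sources = [name for name, entries in validation_sets.items() if mirna in entries]
            evidence = ", ".join(sources) if sources else "unconfirmed"
        report.append((mirna, score, evidence))
    return report


# Placeholder inputs (not real associations).
disease_scores = {"miR-A": 0.91, "miR-B": 0.85, "miR-C": 0.40, "miR-D": 0.77, "miR-E": 0.63}
known_mirnas = {"miR-A", "miR-D"}
validation_sets = {
    "dbDEMC": {"miR-B"},
    "HMDD v3.0": {"miR-B"},
    "miRCancer": set(),
}

report = case_study(disease_scores, known_mirnas, validation_sets, top_k=4)
for mirna, score, evidence in report:
    print(f"{mirna}\t{score:.2f}\t{evidence}")
```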

Gastrointestinal Neoplasms are a very common type of gastrointestinal disease with a high incidence. However, they show no obvious symptoms in the early growth stage, which makes them very dangerous. We successfully predict 31 known associations and 9 new associations, 7 of which are confirmed by HMDD v3.0 and miRCancer. For example, Tazawa et al. [47] discovered the potential role of oncogenic miR-21 in Gastrointestinal Neoplasms. The other confirmed miRNAs are reported in the relevant data sets and are not listed here. The two remaining unconfirmed associations need further research. Table 2 describes the simulation results, where known associations are shown in italic, confirmed new predictions are labeled with the database that confirms them, and unconfirmed ones are marked “unconfirmed”. The entries in Table 2 are ranked by predicted score, which reflects the strength of the association between the miRNA and the disease. A threshold is used to determine whether a prediction is accurate. Compared with the known information and the other databases, the predictions of our method are generally accurate, and the two unconfirmed associations could still provide some insights for researchers.

Table 2 Predicted miRNAs for Gastrointestinal Neoplasms

Retinoblastoma is a malignant tumor that occurs in children under 3 years old and has a familial predisposition. The known association data set contains 38 associations between this disease and miRNAs, 37 of which are successfully predicted by our method. In addition, 23 new associations are predicted, 7 of which are confirmed while the others remain unconfirmed. Montoya et al. [48] found that the expression of miR-31 in Retinoblastoma is significantly reduced, a finding that promotes the development of targeted therapy for Retinoblastoma. Table 3 shows the detailed results. The predictions in Table 3 are ranked in the same way as in Table 2.

Table 3 Predicted miRNAs for Retinoblastoma

Hepatoblastoma is the most common intra-abdominal malignant tumor in childhood after neuroblastoma and nephroblastoma. The existing miRNA-disease association data set contains 8 known associations for this disease, all of which are predicted. In addition, we predict 12 new associations, 7 of which are confirmed and 5 are not. We also find literature confirming that miR-143 is a factor affecting Hepatoblastoma: the study of Zhang et al. [49] showed that blocking miR-143 could significantly inhibit local liver metastasis. The Hepatoblastoma prediction results are shown in Table 4, ranked in the same way as in Tables 2 and 3.

Table 4 Predicted miRNAs for Hepatoblastoma

As the simulation results above show, most known miRNA-disease associations are successfully predicted, and a small number of previously unknown associations are found in the HMDD v3.0, miRCancer and dbDEMC data sets. Although a few have not been confirmed, they can serve as a reference for researchers. In addition, we use Cytoscape software to map the prediction network of these three diseases (Fig. 3). In the network, ellipses represent miRNAs and the remaining shapes represent diseases. Associations are drawn as line segments with arrows, and some miRNAs are shared between diseases. The color depth of each ellipse is set according to the predicted score: the darker the ellipse, the stronger the association between the miRNA and the disease.
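One way to prepare such a network for visualization is to write the predicted associations as a delimited edge table, with the predicted score as an edge attribute that can drive the coloring. The sketch below assumes this workflow; the column names, the edge direction from miRNA to disease, and the placeholder disease and miRNA names are illustrative and are not taken from the original study.

```python
import csv
import io


def build_edge_table(predictions):
    """Flatten {disease: {miRNA: score}} into one row per miRNA-disease edge."""
    rows = []
    for disease, mirna_scores in predictions.items():
        for mirna, score in mirna_scores.items():
            rows.append({"source": mirna, "target": disease, "score": score})
    return rows


def shared_mirnas(predictions):
    """Return the miRNAs that are linked to more than one disease."""
    seen = {}
    for disease, mirna_scores in predictions.items():
        for mirna in mirna_scores:
            seen.setdefault(mirna, set()).add(disease)
    return {mirna: diseases for mirna, diseases in seen.items() if len(diseases) > 1}


def to_csv(rows):
    """Serialize the edge table as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["source", "target", "score"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder predictions (not real associations).
predictions = {
    "Disease-1": {"miR-A": 0.9, "miR-B": 0.7},
    "Disease-2": {"miR-B": 0.8, "miR-C": 0.6},
}

rows = build_edge_table(predictions)
shared = shared_mirnas(predictions)
csv_text = to_csv(rows)
print(csv_text)
```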

Fig. 3 The association network between disease and miRNA