Abstract
Background
MicroRNAs (miRNAs) are non-coding RNAs with regulatory functions. Many studies have shown that miRNAs are closely associated with human diseases. Traditional methods for exploring the relationship between miRNAs and diseases are time-consuming, and their accuracy needs to be improved. To address the shortcomings of previous models, a method called collaborative matrix factorization based on matrix completion (MCCMF) is proposed to predict unknown miRNA-disease associations.
Results
The complete miRNA and disease similarity matrices are obtained by matrix completion. The Gaussian Interaction Profile kernel is then added to the miRNA functional similarity matrix and the disease semantic similarity matrix. Next, the Weighted K Nearest Known Neighbors method is used to preprocess the association matrix, so that the model is closer to reality. Finally, the collaborative matrix factorization method is applied to obtain the prediction results. MCCMF obtains a satisfactory result in the fivefold cross-validation, with an AUC of 0.9569 (0.0005).
Conclusions
The AUC value of MCCMF is higher than other advanced methods in the fivefold cross validation experiment. In order to comprehensively evaluate the performance of MCCMF, accuracy, precision, recall and f-measure are also added. The final experimental results demonstrate that MCCMF outperforms other methods in predicting miRNA-disease associations. In the end, the effectiveness and practicability of MCCMF are further verified by researching three specific diseases.
Background
MicroRNAs (miRNAs) are a class of non-coding single-stranded RNA molecules. Their lengths are usually 18–24 nucleotides. Instead of synthesizing proteins, miRNAs participate in post-transcriptional regulation of gene expression in eukaryotes and viruses [1]. Although the first miRNA, lin-4, was discovered in 1993 [2], the diversity and prevalence of these genes were revealed only in recent years. To date, 38,589 miRNAs have been found in animals, plants and viruses [3]. At the same time, miRNAs have been found to play an important role in cell proliferation [4], differentiation [5], senescence [6], apoptosis [7], and so on. A study indicated that more than one third of human genes are regulated by miRNAs [8]. Consequently, miRNA dysregulation could have severe impacts on human health.
Evidence shows that an increasing number of miRNAs are closely associated with diseases [9]. Since the first discovery of miR15 and miR16 deficiency in B cell chronic lymphocytic leukemia (B-CLL) [10], miRNA-disease associations have been reported frequently. For example, the expression of miR-25 and miR-223 is significantly higher in patients with esophageal squamous cell carcinoma than in healthy individuals, while the expression of miR-375 is significantly lower [11]. Studies show that miR-26a may be a regulatory factor that inhibits the progression and metastasis of c-Myc/EZH2 double-high advanced HCC [12]. In addition, miR-340 has been suggested as a biomarker for cancer metastasis and prognosis [13]. At present, research on miRNAs and diseases is becoming more extensive. Researchers have also developed a number of databases to store miRNA and disease data, such as dbDEMC [14], HMDD v3.0 [15] and miR2Disease [16]. Unfortunately, the known association data are incomplete. Moreover, traditional methods to identify new miRNA-disease associations are time-consuming and laborious.
With the improvement of information technology and the development of a large number of miRNA data sets, many effective methods for predicting miRNA-disease associations have been proposed [
where \(TP\) is the number of samples that are actually positive and are also predicted to be positive, and \(TN\) is the number of samples that are actually negative and are also predicted to be negative. In contrast, \(FN\) and \(FP\) are the numbers of samples whose predicted labels disagree with the actual labels: \(FN\) counts positive samples predicted to be negative, and \(FP\) counts negative samples predicted to be positive.
In order to make the performance evaluation more comprehensive, we also use other evaluation indicators, including the accuracy, precision, recall and f-measure. Their calculation formulas are defined as follows:
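As a minimal sketch, the four indicators can be computed from the confusion counts using their standard textbook definitions. The function name `classification_metrics` and the example counts are illustrative assumptions, not values from the experiments.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics (hypothetical helper, not from the paper)."""
    def ratio(num, den):
        # Guard against empty denominators on degenerate inputs.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # identical to the true positive rate
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f_measure": ratio(2 * precision * recall, precision + recall),
        "fpr": ratio(fp, fp + tn),  # false positive rate used for the ROC curve
    }

# Illustrative counts only.
m = classification_metrics(tp=8, fp=2, tn=85, fn=5)
```

The f-measure is the harmonic mean of precision and recall, so it penalizes a model that achieves high accuracy only because negative samples dominate.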
Comparison with other methods
The AUC value lies between 0 and 1. The higher the AUC value is, the better the prediction result is. MCCMF obtains an AUC value of 0.9569 in the fivefold cross validation. MCCMF is compared with four advanced methods, WBNPMD [43], RLSMDA [44], GRNMF [21] and CMF [45], and the comparison shows the superior performance of our method. The ROC curves are drawn in Fig. 1, and the comparison results are listed in Table 1. The results of the other methods in Table 1 are taken directly from the literature.
In Table 1, the highest value is highlighted in italic, with the standard deviation in parentheses. In the fivefold cross validation experiment, WBNPMD, RLSMDA, GRNMF, CMF and MCCMF obtain AUCs of 0.9173, 0.8389, 0.869, 0.8697 and 0.9569, respectively. Therefore, our method is superior to the other four methods.
WBNPMD, which has the highest AUC among the four compared methods, is selected for further comparison with MCCMF, and accuracy, precision, recall and f-measure are presented as a bar graph in Fig. 2. MCCMF also outperforms WBNPMD on these measures.
Case studies
Finally, we carry out a simulation experiment to analyze specific diseases. First, the disease of interest is selected and the predicted scores of all miRNAs for this disease are ranked. Then, based on the ranking, the miRNAs most strongly associated with the disease are identified. By comparing them with the original miRNA-disease association matrix, we determine whether each high-scoring association is already known. Finally, the unknown associations are verified by searching existing data sets. Here, we choose three diseases, Gastrointestinal Neoplasms, Retinoblastoma and Hepatoblastoma, for analysis. Three popular data sets, dbDEMC [14], HMDD v3.0 [15] and miRCancer [46], are used for validation. These data sets store miRNA-disease associations that have been experimentally confirmed over the years.
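The ranking step of this procedure can be sketched as follows. This is only an illustration: the function `rank_candidates`, the miRNA names and the scores are hypothetical, and the subsequent database lookup is not modeled.

```python
import numpy as np

def rank_candidates(scores, known, mirna_names, top_n=3):
    """Rank miRNAs for one disease and flag each as known or novel.

    scores: predicted scores for one disease (one entry per miRNA)
    known:  matching 0/1 entries from the original association matrix
    """
    order = np.argsort(-scores, kind="stable")[:top_n]
    return [
        (mirna_names[i], float(scores[i]), "known" if known[i] else "novel")
        for i in order
    ]

# Toy data, purely illustrative.
scores = np.array([0.91, 0.15, 0.78, 0.66])
known = np.array([1, 0, 0, 1])
names = ["miR-a", "miR-b", "miR-c", "miR-d"]
ranking = rank_candidates(scores, known, names)
```

Only the entries flagged "novel" would then need to be checked against external data sets.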
Gastrointestinal Neoplasms is a very common gastrointestinal disease with a high incidence. However, there are no obvious symptoms in the early growth stage of the neoplasms, which makes the disease very dangerous. We successfully predict 31 known associations and 9 new associations, 7 of which are confirmed by HMDD v3.0 and miRCancer. For example, Tazawa et al. [47] discovered the potential role of oncogenic miR-21 in Gastrointestinal Neoplasms. Other confirmed miRNAs have been reported in relevant data sets, and they are not listed here. There are still two unconfirmed ones that need further research. Table 2 describes the simulation results, where known associations are shown in italic, confirmed new predictions are labeled with the database that confirms them, and unconfirmed ones are shown as “unconfirmed”. The predicted scores in Table 2 are ranked according to the strength of the association between the miRNA and the disease. A threshold is used to determine whether a prediction is counted as accurate. Compared with known information and other databases, the prediction results of our method are generally accurate. Although two predictions remain unconfirmed, they could provide some insights for researchers.
Retinoblastoma is a malignant tumor that occurs in children under 3 years old and has a familial predisposition. The known association data set contains 38 associations between this disease and miRNAs, and 37 of them are successfully predicted. At the same time, 23 new associations are predicted, seven of which are confirmed while the others remain unconfirmed. Montoya et al. [48] found that the expression of miR-31 in Retinoblastoma is significantly reduced, a finding that promotes the development of targeted therapy for Retinoblastoma. Table 3 lists the detailed results. The predictions in Table 3 are sorted in the same way as in Table 2.
Hepatoblastoma is the most common intraabdominal malignant tumor after neuroblastoma and nephroblastoma in childhood. In the existing miRNA-disease association data set, there are 8 known associations for this disease, and all of them are successfully predicted. In addition, we predict 12 new associations, seven of which are confirmed and five are not. We also find literature confirming that miR-143 is a factor affecting Hepatoblastoma. The study of Zhang et al. [49] showed that blocking miR-143 could significantly inhibit local liver metastasis. The Hepatoblastoma prediction results are shown in Table 4. The predictions in Table 4 are sorted in the same way as in Tables 2 and 3.
As can be seen from the simulation results above, most known associations are successfully predicted, and a number of the previously unknown associations are confirmed in the HMDD v3.0, miRCancer and dbDEMC data sets. Although a few have not been confirmed, they can be used as a reference for researchers. In addition, we used Cytoscape software to map the prediction network of these three diseases (Fig. 3). In the network, ellipses represent miRNAs, and the remaining shapes represent diseases. Associations are drawn as line segments with arrows, and some miRNAs are shared between diseases. The color depth of each ellipse is set according to the predicted score: the darker the ellipse, the stronger the association between the miRNA and the disease.
Discussion
The above experimental results indicate that our method outperforms the compared state-of-the-art methods. The excellent prediction performance of MCCMF can be attributed to several significant factors. Firstly, the data are preprocessed by the Weighted K Nearest Known Neighbors method and the matrix completion method to improve the prediction accuracy. Secondly, a collaborative matrix factorization model, a promising approach among the many collaborative filtering techniques, is applied to predict miRNA-disease associations. In bioinformatics, matrix factorization contributes to identifying hidden links among genes. However, the performance of our method needs to be further improved. For instance, there may be a better way to integrate the data than simply adding them together. In the future, we will improve the method and use the latest version of the data set, such as HMDD v3.0.
Conclusions
In this paper, a collaborative matrix factorization method based on matrix completion (MCCMF) is developed for predicting miRNA-disease associations. Because the miRNA and disease similarity matrices are sparse and incomplete, we use the matrix completion method to complete them. The completed matrices are then integrated with the GIP kernel similarity to enrich the data and reduce the influence of noise. In addition, WKNKN is introduced to preprocess the existing association matrix of miRNAs and diseases, so that our method is suitable for practical problems. Finally, the idea of CMF is adopted to construct the objective function and obtain the predicted results. The AUC value (0.9569) of MCCMF is higher than that of other advanced methods in the fivefold cross validation experiment. To evaluate MCCMF comprehensively, accuracy, precision, recall and f-measure are also measured, and the results are 0.992, 0.779, 0.918 and 0.830, respectively. Compared with the other four methods, our method has the best performance. The analysis of Gastrointestinal Neoplasms, Retinoblastoma and Hepatoblastoma further verifies the effectiveness of MCCMF. Since most associations are unknown in reality, MCCMF can also be used for prediction in this situation.
Methods
We develop a novel method for predicting miRNA-disease associations with MCCMF. MCCMF is divided into four main steps: Firstly, we use the matrix completion algorithm to complete the miRNA similarity matrix and the disease similarity matrix to generate a new completion similarity matrix. Secondly, the new completion similarity matrix is integrated with existing miRNA and disease similarity information. Thirdly, the WKNKN is used to convert the binary values of the miRNA-disease association matrix into the interaction likelihood values [41]. Finally, the Collaborative Matrix Factorization is used to predict the association of miRNA-disease. Figure 4 shows the complete process for MCCMF.
Human miRNA-disease associations
The initial miRNA-disease association data is downloaded from HMDD v2.0 [50]. HMDD v2.0 is an experimental data set supporting human miRNA-disease associations, and storing 5430 experimentally verified miRNA-disease associations between 495 miRNAs and 383 diseases. In this paper, the adjacency matrix \({\mathbf{MD}}\) is used to represent the miRNA-disease association network. The adjacency matrix \({\mathbf{MD}}\) is a sparse matrix composed of 0 and 1. If \({\mathbf{MD}}\left( {m_{i} ,d_{j} } \right)\) is 1, disease \(d_{j}\) is correlated with miRNA \(m_{i}\); otherwise irrelevant.
MiRNA function similarity
According to the hypothesis that functionally similar miRNAs are more likely to be associated with phenotypically similar diseases, a method for calculating the functional similarity of miRNAs (MISIM) was proposed by Wang et al. [51]. Firstly, we need to define the semantic similarity between one disease and one group of diseases. The calculation formula is as follows:
Here \(d\) represents one disease and \({\mathbf{D}}\) represents one disease group. Then, we define the similarity of \(d\) and \({\mathbf{D}}\), \(S(d,{\mathbf{D}})\), as the maximum similarity.
Functional similarity of the two miRNAs is defined as
where \(M_{1}\) and \(M_{2}\) represent the related miRNAs of \({\mathbf{D}}_{1}\) and \({\mathbf{D}}_{2}\), respectively. \({\mathbf{D}}_{1}\) contains \(m\) diseases, and \({\mathbf{D}}_{2}\) contains \(n\) diseases.
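The two definitions above can be sketched in a few lines, assuming the usual MISIM form in which the disease-to-group similarities are summed in both directions and divided by \(m + n\). The function names and the toy disease similarity matrix `ds` are illustrative assumptions.

```python
import numpy as np

def disease_to_group(ds, d, group):
    """S(d, D): maximum similarity between disease d and any disease in the group."""
    return max(ds[d, g] for g in group)

def misim(ds, group1, group2):
    """Functional similarity of two miRNAs from their associated disease groups."""
    total = sum(disease_to_group(ds, d, group2) for d in group1)
    total += sum(disease_to_group(ds, d, group1) for d in group2)
    return total / (len(group1) + len(group2))

# Toy disease semantic similarity matrix (illustrative values).
ds = np.array([[1.0, 0.5, 0.2],
               [0.5, 1.0, 0.4],
               [0.2, 0.4, 1.0]])
sim = misim(ds, [0, 1], [1, 2])
```

Two miRNAs with identical disease groups obtain a similarity of 1, which matches the unit diagonal of the functional similarity matrix.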
In this paper, we download the miRNA function similarity from https://www.cuilab.cn/files/images/cuilab/misim.zip. And the matrix \({\mathbf{MF}}\) is used to represent the functional similarity network of the miRNA, in which the element \({\mathbf{MF}}(i,j)\) represents the similarity between miRNA \(m_{i}\) and miRNA \(m_{j}\). The self-similarity of each miRNA is 1, so the diagonal elements of the matrix \({\mathbf{MF}}\) are 1.
Due to incomplete miRNA data supported by the experiment, the similarity values calculated by MISIM may be biased. Some subsequent treatment of the matrix may be improved [52].
Disease semantic similarity
The relationship between different diseases is obtained from the MeSH database (https://www.ncbi.nlm.nih.gov/). Based on the previous literature [51], we represent the disease \(D\) as a Directed Acyclic Graph, \(DAG(D) = (D,T(D),E(D))\), where \(T(D)\) is the set consisting of node \(D\) and its ancestor nodes, and \(E(D)\) is the set of edges pointing from ancestor nodes to node \(D\). For an ancestor node \(t\) in \(DAG(A)\), its contribution to the semantic value of disease \(A\) is computed as follows:
In the above formula, \(\Delta\) is a semantic contribution factor. Based on the method of Wang et al., the value of \(\Delta\) is set to 0.5. For the disease \(A\), the contribution of itself to the disease \(A\) is 1, while the contribution of ancestor node \(t\) is decreasing with the increase of its layers.
Based on the contribution of ancestor diseases and disease \(A\) itself, the semantic value of disease \(A\) can be expressed as follows:
Under the hypothesis that the larger the shared part of the \(DAGs\) of two diseases is, the higher their similarity is, the semantic similarity between disease \(A\) and disease \(B\) is calculated as:
However, the above model has a shortcoming: because \(\Delta\) is fixed, all diseases in the same layer make the same semantic contribution. Obviously, the incidence of various diseases is different, and the contribution of diseases with high incidence should be less than that of diseases with low incidence. To improve the above model, we adopt the method of Xuan et al. [53] to define a second semantic similarity calculation. In this method, the contribution of ancestor node \(t\) in \(DAG(A)\) to the semantic value of disease \(A\) is as follows:
The semantic value of disease \(A\), and the semantic similarity between the disease \(A\) and the disease \(B\) are calculated as:
Finally, in order to calculate the semantic similarity more comprehensively and rationally, we combine the two models to get Eq. (15).
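The first of the two models (fixed \(\Delta = 0.5\)) can be sketched as below, assuming the standard formulation in which a node's contribution is the maximum over all paths of \(\Delta\) raised to the path length, and the similarity is the summed contribution of shared nodes divided by the sum of the two semantic values. The toy hierarchy and the function names are hypothetical; the second model and the combination step are not shown.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Map each node of DAG(disease) to its contribution to `disease`."""
    contrib = {disease: 1.0}  # a disease contributes 1 to itself
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            value = delta * contrib[node]
            # Keep the largest contribution over all paths to this ancestor.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib

def semantic_similarity(a, b, parents, delta=0.5):
    ca = semantic_contributions(a, parents, delta)
    cb = semantic_contributions(b, parents, delta)
    shared = set(ca) & set(cb)
    shared_value = sum(ca[t] + cb[t] for t in shared)
    return shared_value / (sum(ca.values()) + sum(cb.values()))

# Toy hierarchy: A and B share parent P, whose parent is the root.
parents = {"A": ["P"], "B": ["P"], "P": ["root"]}
s_ab = semantic_similarity("A", "B", parents)
```

In this toy case each disease has semantic value 1.75 and the shared ancestors contribute 1.5 in total, so the similarity is 1.5 / 3.5.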
Gaussian interaction profile kernel similarity for diseases and miRNAs
On the basis of the hypothesis that functionally similar miRNAs may be associated with similar diseases, and vice versa, the known miRNA-disease association network is used to construct the GIP kernel similarity for diseases and miRNAs [54]. GIP kernel similarity adds topological information derived from the known associations. Since \({\mathbf{MD}}\) is indexed as \({\mathbf{MD}}\left( {m_{i} ,d_{j} } \right)\), the interaction profile of miRNA \(m(i)\) is represented by the binary vector \(M(i)\) given by the i-th row of the adjacency matrix \({\mathbf{MD}}\). Similarly, the binary vector \(D(i)\) given by the i-th column of the adjacency matrix \({\mathbf{MD}}\) denotes the interaction profile of disease \(d(i)\). Hence, we can define the GIP kernel similarity for miRNAs and diseases as follows:
Here, \(\gamma_{m}\) and \(\gamma_{d}\) are parameters to control the kernel bandwidth and obtained by the following formulas:
where \(\delta_{m}\) and \(\delta_{d}\) are also bandwidth parameters, both set to 1 according to the previous study [55], and \(nm\) and \(nd\) denote the numbers of miRNAs and diseases, respectively.
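A minimal sketch of the GIP kernel follows, assuming the usual form \(\exp ( - \gamma ||P(i) - P(j)||^{2} )\) with \(\gamma\) equal to the bandwidth parameter divided by the mean squared norm of the profiles, and assuming that rows of \({\mathbf{MD}}\) index miRNAs and columns index diseases. The function name `gip_kernel` and the toy matrix are illustrative.

```python
import numpy as np

def gip_kernel(profiles, delta=1.0):
    """Gaussian interaction profile kernel; one interaction profile per row.

    Assumes at least one non-zero profile, otherwise gamma is undefined.
    """
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    gamma = delta / sq_norms.mean()  # normalized kernel bandwidth
    # Pairwise squared Euclidean distances between profiles.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

# Toy association matrix: 3 miRNAs (rows) x 3 diseases (columns).
md = np.array([[1, 0, 1],
               [1, 0, 0],
               [0, 1, 0]])
km = gip_kernel(md)    # miRNA kernel, built from rows
kd = gip_kernel(md.T)  # disease kernel, built from columns
```

In the toy example the first two miRNAs differ in one disease and the mean squared profile norm is 4/3, so their kernel similarity is \(\exp ( - 0.75)\).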
Matrix completion
The miRNA functional similarity matrix and disease semantic similarity matrix calculated by the above operations are still sparse and incomplete, and they contain some redundant associations (i.e. inherent noise). We therefore use the matrix completion method to address this problem [56]. Suppose the incomplete matrix is \({\mathbf{D}}\), which can be represented as a linear combination of \({\mathbf{DR}}\) and the noise matrix \({\mathbf{N}}\). The formula is as follows:
where \({\mathbf{DR}}\) is a low-rank matrix, and specifically, it is a more refined or informative similarity matrix after removing noise from the existing similarity matrix.
In order to make \({\mathbf{R}}\) low-rank, a nuclear norm on \({\mathbf{R}}\) is imposed. At the same time, the \(L_{2,1}\)-norm of the error term \({\mathbf{N}}\) is used to make the noise matrix \({\mathbf{N}}\) sparse. Once the final low-rank matrix \({\mathbf{DR}}^{*}\) and sparse matrix \({\mathbf{N}}^{*}\) are calculated, \({\mathbf{DR}}^{*}\) or \({\mathbf{D}} - {\mathbf{N}}^{*}\) is used as the completed matrix. The resulting convex optimization problem can be defined as follows:
Here, \(|| \cdot ||_{*}\) represents the nuclear norm, \(\omega \in (0,1)\) is the positive weighting parameter and \(|| \cdot ||_{2,1}\) is the noise regularization term.
The augmented Lagrange multiplier (ALM) method is effective for solving optimization problems under equality constraints [38]. Therefore, according to ALM, Eq. (21) can be rewritten as:
Eq. (22) is then converted into an unconstrained problem by forming the Lagrange function. The formula is as follows:
where \(\beta > 0\) is the penalty parameter, and \(\beta\) is updated by \(\beta = \min (\rho \beta ,\beta_{\max } )\). \(Y_{1}\) and \(Y_{2}\) are the Lagrange multipliers.
The ADM method is used to solve the Eq. (23) [39]. The ADM is a simple method to solve the decomposable convex optimization problem, especially in solving large-scale problems. The update iterations for ADM are as follows:
Based on the singular value shrinkage operator [40], \({\mathbf{X}}^{k + 1}\) and \({\mathbf{N}}^{k + 1}\) are represented as follows:
The minimization with respect to \({\mathbf{R}}\) is a least squares problem, and its normal equation is as follows:
where \({\mathbf{I}} = {\mathbf{DD}}^{T}\) is widely used in matrix completion.
Then \({\mathbf{X}}\), \({\mathbf{R}}\) and \({\mathbf{N}}\) are updated by changing the Lagrange multipliers \(Y_{1}\) and \(Y_{2}\). Moreover, \(Y_{1}\) and \(Y_{2}\) can be obtained by the following formulas:
Finally, the iterations stop once the convergence conditions \(||{\mathbf{D}} - {\mathbf{DR}} - {\mathbf{N}}||_{\infty } < \varepsilon\) and \(||{\mathbf{R}} - {\mathbf{X}}||_{\infty } < \varepsilon\) are satisfied, which yields the final low-rank matrix \({\mathbf{R}}^{*}\) and sparse matrix \({\mathbf{N}}^{*}\). Here, \(\varepsilon\) is a very small tolerance (set as \(1 \times 10^{ - 8}\) in this paper). As mentioned above, the refined matrix \({\mathbf{R}}^{*}\) and noise matrix \({\mathbf{N}}^{*}\) can be used to describe a completed matrix in the form of \({\mathbf{D}} \times {\mathbf{R}}^{*}\) or \({\mathbf{D}} - {\mathbf{N}}^{*}\). The specific process of matrix completion is shown in Fig. 5.
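The iteration described above can be sketched as follows. This is not Algorithm 1 itself. It assumes the problem has the standard low-rank representation form (minimize \(||{\mathbf{X}}||_{*} + \omega ||{\mathbf{N}}||_{2,1}\) subject to \({\mathbf{D}} = {\mathbf{DR}} + {\mathbf{N}}\) and \({\mathbf{R}} = {\mathbf{X}}\)) and that \({\mathbf{I}}\) in the normal equation is the identity matrix. The update order, the default values of `omega`, `beta`, `rho` and `beta_max`, and all function names are assumptions.

```python
import numpy as np

def svt(mat, tau):
    """Singular value shrinkage: soft-threshold the singular values by tau."""
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    return (u * np.maximum(s - tau, 0.0)) @ vt

def l21_shrink(mat, tau):
    """Column-wise shrinkage, the proximal operator of tau * ||.||_{2,1}."""
    norms = np.linalg.norm(mat, axis=0)
    scale = np.maximum(norms - tau, 0.0) / np.where(norms > 0, norms, 1.0)
    return mat * scale

def complete_matrix(d, omega=0.1, beta=1e-4, rho=1.1, beta_max=1e10,
                    eps=1e-8, max_iter=1000):
    """Return (R*, N*) for a square similarity matrix d."""
    n = d.shape[1]
    r = np.zeros((n, n))
    noise = np.zeros_like(d)
    y1 = np.zeros_like(d)   # multiplier for D = DR + N
    y2 = np.zeros((n, n))   # multiplier for R = X
    gram = np.eye(n) + d.T @ d
    for _ in range(max_iter):
        # X-step: nuclear-norm proximal step.
        x = svt(r + y2 / beta, 1.0 / beta)
        # R-step: solve the least squares normal equation.
        rhs = d.T @ (d - noise) + x + (d.T @ y1 - y2) / beta
        r = np.linalg.solve(gram, rhs)
        # N-step: L2,1 proximal step on the residual.
        noise = l21_shrink(d - d @ r + y1 / beta, omega / beta)
        res1 = d - d @ r - noise
        res2 = r - x
        # Multiplier and penalty updates.
        y1 = y1 + beta * res1
        y2 = y2 + beta * res2
        beta = min(rho * beta, beta_max)
        if max(np.abs(res1).max(), np.abs(res2).max()) < eps:
            break
    return r, noise

# Toy rank-2 symmetric matrix standing in for a similarity matrix.
rng = np.random.default_rng(0)
basis = rng.random((6, 2))
d = basis @ basis.T
r_star, n_star = complete_matrix(d)
completed = d @ r_star  # equivalently d - n_star once the constraint holds
```

Running the same routine with the disease and miRNA similarity matrices as input would give the two refined matrices used in the later steps.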
Based on the above matrix completion method, the disease semantic similarity matrix \({\mathbf{DS}}\) and miRNA functional similarity matrix \({\mathbf{MF}}\) are used as input matrices to replace matrix \({\mathbf{D}}\), so that we can obtain two refined similarity matrices \({\mathbf{CD}}\) and \({\mathbf{CM}}\), respectively.
The algorithm of Matrix completion is summarized in Algorithm 1.
![figure a](http://media.springernature.com/lw685/springer-static/image/art%3A10.1186%2Fs12859-020-03799-6/MediaObjects/12859_2020_3799_Figa_HTML.png)
Similarity information integrations
Subsequent work is to integrate the completed matrix with existing similarity matrices. Since similarity information integrations of diseases and miRNAs are similar, Fig. 6 only shows the process for integration of miRNA similarity.
The specific integration formulas are as follows:
WKNKN
WKNKN can be thought of as a voting or integration method: some potential classifiers (nearest neighbors) are aggregated by a (weighted) majority vote, the results of which are used for prediction [41].
In this paper, \({\mathbf{MD}}\) denotes the miRNA-disease association matrix, which only records the associations between miRNAs and diseases that have been experimentally verified so far. We simply stipulate that if the miRNA is associated with the disease, \({\mathbf{MD}}\left( {m_{i} ,d_{j} } \right)\) is set to 1. However, many miRNAs and diseases are still unknown, and whether they can act as a bridge between existing miRNAs and diseases is also unknown. Existing miRNAs may be associated with existing diseases through these unknown miRNAs, so treating every unverified entry of \({\mathbf{MD}}\) as a true 0 is inappropriate.
Therefore, by estimating these unknown conditions through the correlation of its known neighbors, the WKNKN method preprocesses the matrix \({\mathbf{MD}}\) to get the pre-processed matrix of \({\mathbf{MD}}\) (\({\mathbf{PMD}}\)). If \({\mathbf{MD}}\left( {m_{i} ,d_{j} } \right) = 0\), WKNKN will give \({\mathbf{MD}}\left( {m_{i} ,d_{j} } \right)\) a value from 0 to 1 according to the corresponding similar information of miRNAs and diseases. The specific process of WKNKN is shown in Fig. 7.
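A minimal sketch of this preprocessing is given below, following the general WKNKN scheme: each row is estimated from its \(K\) nearest known neighbors with geometrically decaying weights, the same is done for columns, the two estimates are averaged, and known 1 entries are kept. The values `k=2` and `eta=0.7`, the exclusion of a row from its own neighbor list, and the function names are assumptions for illustration; similarities are assumed to be non-negative.

```python
import numpy as np

def _neighbor_estimate(assoc, sim, k, eta):
    """Estimate each row of assoc from its k most similar known rows."""
    est = np.zeros_like(assoc, dtype=float)
    has_known = assoc.sum(axis=1) > 0  # rows with at least one known association
    for q in range(assoc.shape[0]):
        ranked = np.argsort(-sim[q], kind="stable")
        order = [i for i in ranked if i != q and has_known[i]][:k]
        if not order:
            continue
        sims = sim[q, order]
        weights = (eta ** np.arange(len(order))) * sims  # decay with neighbor rank
        norm = sims.sum()
        if norm > 0:
            est[q] = weights @ assoc[order] / norm
    return est

def wknkn(md, sim_m, sim_d, k=2, eta=0.7):
    """Return the preprocessed association matrix PMD."""
    md = np.asarray(md, dtype=float)
    from_mirnas = _neighbor_estimate(md, sim_m, k, eta)
    from_diseases = _neighbor_estimate(md.T, sim_d, k, eta).T
    # Known associations keep the value 1; zeros become likelihoods.
    return np.maximum(md, (from_mirnas + from_diseases) / 2.0)

# Toy data: 3 miRNAs x 2 diseases; the second miRNA has no known association.
md = np.array([[1, 0], [0, 0], [0, 1]])
sim_m = np.array([[1.0, 0.8, 0.1], [0.8, 1.0, 0.3], [0.1, 0.3, 1.0]])
sim_d = np.array([[1.0, 0.5], [0.5, 1.0]])
pmd = wknkn(md, sim_m, sim_d)
```

In the toy data the second miRNA receives a non-zero likelihood for the first disease because its most similar neighbor is known to be associated with that disease.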
MCCMF for MiRNA-disease association prediction
The CMF method proposed by Shen et al. [45] can effectively predict the potential interactions between miRNAs and diseases. In this study, the idea of the CMF method is used to predict miRNA-disease associations. The specific steps of CMF are as follows: firstly, the input miRNA-disease association matrix \({\mathbf{PMD}}\) is decomposed into two low-rank matrices \({\mathbf{A}}\) and \({\mathbf{B}}\) by using the singular value decomposition.
where \({\mathbf{U}}\) and \({\mathbf{V}}\) are unitary matrices, and \({\mathbf{S}}\) is a non-negative real diagonal matrix with k singular values on the diagonal.
Secondly, we write the objective function of MCCMF according to the idea of CMF, as follows:
Here, \(|| \cdot ||_{F}\) is the Frobenius norm, and the similarity terms ensure that the feature vectors of similar miRNAs and similar diseases are similar. \(\lambda_{l}\), \(\lambda_{m}\) and \(\lambda_{d}\) are positive parameters, which are determined by the fivefold cross validation, and \(\lambda_{l} \in \left\{ {2^{ - 2} ,2^{ - 1} ,2^{0} ,2^{1} } \right\}\), \(\lambda_{m} /\lambda_{d} \in \left\{ {2^{ - 3} ,2^{ - 2} ,2^{ - 1} ,2^{0} ,2^{1} ,2^{2} ,2^{3} ,2^{4} ,2^{5} } \right\}\).
Thirdly, we use \(L\) to denote the objective in Eq. (33), and derive two alternating update rules by setting \(\partial L/\partial {\mathbf{A}} = 0\) and \(\partial L/\partial {\mathbf{B}} = 0\).
where \({\mathbf{I}}_{k}\) is the \(k \times k\) identity matrix.
Finally, we update \({\mathbf{A}}\) and \({\mathbf{B}}\) iteratively until they converge. The product \({\mathbf{A}}{\mathbf{B}}^{T}\) then gives the prediction matrix for miRNA-disease associations. The detailed process of MCCMF can be seen in Fig. 8.
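The factorization steps can be sketched as below. The sketch is not Algorithm 2. It assumes the objective has the form commonly given for CMF, \(||{\mathbf{PMD}} - {\mathbf{AB}}^{T}||_{F}^{2}\) plus \(\lambda_{l}\)-weighted norms of \({\mathbf{A}}\) and \({\mathbf{B}}\) and \(\lambda_{m}\)- and \(\lambda_{d}\)-weighted similarity terms, together with the alternating updates usually stated for that objective. A fixed iteration count replaces a convergence test, and the function name, parameter values and toy matrices are hypothetical.

```python
import numpy as np

def cmf(pmd, sim_m, sim_d, k=2, lam_l=0.5, lam_m=0.25, lam_d=0.25, n_iter=50):
    """Alternating updates for collaborative matrix factorization (sketch)."""
    # SVD initialization: A = U_k S_k^(1/2), B = V_k S_k^(1/2).
    u, s, vt = np.linalg.svd(pmd, full_matrices=False)
    root = np.sqrt(s[:k])
    a = u[:, :k] * root
    b = vt[:k].T * root
    eye = np.eye(k)  # the k x k identity matrix I_k
    for _ in range(n_iter):
        a = (pmd @ b + lam_m * sim_m @ a) @ np.linalg.inv(
            b.T @ b + lam_l * eye + lam_m * a.T @ a)
        b = (pmd.T @ a + lam_d * sim_d @ b) @ np.linalg.inv(
            a.T @ a + lam_l * eye + lam_d * b.T @ b)
    return a, b

# Toy preprocessed association matrix: 4 miRNAs x 3 diseases.
pmd = np.array([[1.0, 0.0, 0.3],
                [0.4, 0.0, 1.0],
                [0.0, 1.0, 0.0],
                [1.0, 0.2, 0.0]])
sim_m = 0.5 * np.eye(4) + 0.5  # toy miRNA similarity, 1 on the diagonal
sim_d = 0.5 * np.eye(3) + 0.5  # toy disease similarity
a, b = cmf(pmd, sim_m, sim_d)
scores = a @ b.T               # predicted association scores
```

Each column of `scores` can then be ranked to obtain candidate miRNAs for the corresponding disease.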
The algorithm of CMF is summarized in Algorithm 2.
![figure b](http://media.springernature.com/lw685/springer-static/image/art%3A10.1186%2Fs12859-020-03799-6/MediaObjects/12859_2020_3799_Figb_HTML.png)