Background

MicroRNAs (miRNAs) are a class of non-coding RNAs that function as gene expression regulators across various animal model systems [1]. Typically, miRNAs interact with the 3’ untranslated region of target mRNAs, leading to mRNA degradation and the downregulation of gene expression [2]. Dysregulated miRNA expression is closely associated with a spectrum of diseases, including cancer [3,4,5], cardiovascular diseases [6, 7], and inflammatory diseases [8, 9].

Ablation models

We used two ablation models to assess the effectiveness of the proposed architecture. In ablation model 1 (AM1), the output from the first module was passed directly to the second module without adding the pre-miRNA secondary structure embeddings. In ablation model 2 (AM2), all CU1 components were removed, meaning that the original input information for each module was not enhanced prior to the convolution process. Both AM1 and AM2 followed the same training procedure as the proposed model, with minor adjustments to the layer parameters.

Performance evaluation

In the context of binary classification, we computed several metrics to assess the model’s performance on the independent test sets. These metrics included accuracy (ACC), specificity (Spe), sensitivity (Sen), F1 score, and Matthews’ correlation coefficient (MCC) [45]. The definitions of these binary classification metrics are provided below:

$$Acc=\frac{TP+TN}{TP+TN+FP+FN},$$
$$Spe=\frac{TN}{TN+FP},$$
$$Sen=\frac{TP}{TP+FN},$$
$$F1= \frac{2\times TP}{2\times TP+FP+FN},$$
$$MCC= \frac{TP\times TN-FP\times FN}{\sqrt{\left(TP+FP\right)\times \left(TP+FN\right)\times \left(TN+FP\right)\times (TN+FN)}},$$

where \(TP\), \(TN\), \(FP\), and \(FN\) denote the counts of true positives, true negatives, false positives, and false negatives, respectively.

For the evaluation of the multi-class classification models, we adopted a macro-averaging approach for \(Acc\), \(Spe\), and \(Sen\) to calculate the average of each metric across all classes. In addition, we utilized an extension of the binary MCC to the multi-class scenario for evaluation [46]. MCC for the multi-class classification is defined as below:

$${MCC}_{multi}=\frac{c\times s-{\sum }_{k}^{K}{p}_{k}\times {t}_{k}}{\sqrt{\left({s}^{2}-{\sum }_{k}^{K}{p}_{k}^{2}\right)\times ({s}^{2}-{\sum }_{k}^{K}{t}_{k}^{2})}},$$

where \({t}_{k}\) is the number of occurrences for class \(k\), \({p}_{k}\) signifies the number of predictions for class \(k\), \(c\) denotes the total number of samples correctly predicted, and \(s\) represents the total number of samples.
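All four quantities in the multi-class MCC can be read off a confusion matrix. A sketch under the convention that `conf[i][j]` counts samples of true class \(i\) predicted as class \(j\) (the function name and matrix layout are illustrative, not taken from the DiCleave code):

```python
import math

def multiclass_mcc(conf):
    """Multi-class MCC from a K x K confusion matrix.

    conf[i][j] = number of samples with true class i predicted as class j.
    """
    K = len(conf)
    s = sum(sum(row) for row in conf)                 # total samples
    c = sum(conf[k][k] for k in range(K))             # correctly predicted
    t = [sum(conf[k]) for k in range(K)]              # occurrences per class
    p = [sum(conf[i][k] for i in range(K)) for k in range(K)]  # predictions per class
    num = c * s - sum(p[k] * t[k] for k in range(K))
    den = math.sqrt((s * s - sum(x * x for x in p)) *
                    (s * s - sum(x * x for x in t)))
    return num / den if den else 0.0
```

For K = 2 this reduces to the binary MCC above, which is a quick sanity check on an implementation.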

Results

Performance of binary classification models

The performance of the best models among replications is listed in Table 2. The test set was balanced, containing an equal number of positive and negative patterns (see Sect. "Methods"). There were no substantial differences between the performances of the models for 5’ and 3’ patterns; both models achieved accuracy, specificity, sensitivity, and F1 scores of 0.90 or higher, and MCC reached 0.82. Receiver operating characteristic (ROC) curves and precision-recall (PR) curves for DiCleave in the binary classification task are shown in Fig. 5. Although the area under the ROC curve (AUC) score of the 3’ pattern model was marginally higher than that of the 5’ pattern model, both models demonstrated nearly identical performance. On an unbalanced test set, created by randomly selecting 50 positive patterns and all negative patterns from the original test set, performance decreased slightly compared with the balanced dataset, although the F1 score and MCC remained above 0.80 and 0.75, respectively (Additional file 1: Tables S1 and S2).
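The unbalanced test set described here can be reproduced with a simple subsampling step. A sketch, assuming the patterns are held in plain Python lists and a fixed seed is used for reproducibility (names and the seed are our choices, not from the paper):

```python
import random

def make_unbalanced(positives, negatives, n_pos=50, seed=0):
    """Keep every negative pattern but only n_pos randomly chosen positives,
    mirroring the unbalanced test set described in the text."""
    rng = random.Random(seed)              # fixed seed for reproducibility
    kept = rng.sample(positives, n_pos)    # sample without replacement
    samples = [(x, 1) for x in kept] + [(x, 0) for x in negatives]
    rng.shuffle(samples)                   # avoid label-ordered batches
    return samples
```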

Table 2 Performance of different models
Fig. 5
figure 5

Receiver-operating characteristic curves (a) and precision-recall curves (b) for DiCleave in the binary classification task. Blue and green lines indicate the performance of the 5’ classifier and the 3’ classifier, respectively

The proposed model outperformed the ablation models in all classification tasks (Table 2). AM1, which lacked pre-miRNA secondary structure embeddings, exhibited the lowest accuracy in all classification tasks. Although AM1 achieved the highest specificities in the binary classification task, its sensitivities were the lowest. Conversely, the inclusion of secondary structure embeddings notably enhanced the overall performance of AM2, yielding significantly improved sensitivities at the acceptable cost of decreased specificity. However, the performance of AM2 still lagged behind that of DiCleave. This can be attributed to the absence of CU1, which hindered the preservation of information within the input.

To assess the effectiveness of our proposed models, we compared them with ReCGBM. We computed the average performance of DiCleave and ReCGBM across 10 replications with different initial conditions to provide a comprehensive view of their performance. As shown in Fig. 6 and Additional file 1: Tables S3 and S4, the performance of our models surpassed the average performance of ReCGBM in 5’ pattern prediction. In addition, our models demonstrated similar performance in 5’ and 3’ pattern predictions, whereas ReCGBM exhibited unequal performance between the two (i.e., inferior performance in 5’ pattern prediction compared with 3’ pattern prediction). Figure 6 also includes a performance comparison with PHDcleav [28] and LBSizeCleav [29]. The PHDcleav model utilized binary input features and a window size of 14 [28], whereas the LBSizeCleav model employed parameter k = 1 and a window size of 14 [29]. We selected these two models because their input features closely resembled those of our architecture. As shown in Fig. 6, both SVM-based models struggled to compete with ReCGBM and DiCleave.

Fig. 6
figure 6

Performance comparison between DiCleave and other existing methods for binary classification tasks for both a the 5’ cleavage site classification task and b the 3’ cleavage site classification task. DiCleave and ReCGBM show average performances of 10 replications. The performances of PHDcleav and LBSizeCleav (indicated by an asterisk) are sourced from their original articles [28, 29]

Performance of the multi-class classification model

Our model can perform multi-class classification tasks by adapting its output layer. When presented with a cleavage pattern sequence, our model was able to predict whether the sequence represented a 5’ pattern, a 3’ pattern, or a negative pattern. The performance of the multi-class classification model is summarized in Table 2. It achieved an accuracy, sensitivity, and F1 score of 0.89, and a specificity of 0.94. Furthermore, on an unbalanced dataset for the multi-class classification tasks, the F1 score and MCC exceeded 0.85 and 0.75, respectively (Additional file 1: Tables S1 and S2).

The confusion matrix for the multi-class classification model is shown in Fig. 7. Only one 5’ pattern sequence was misclassified as a 3’ pattern, and no 3’ patterns were misclassified as 5’ patterns. This outcome highlights our model’s capability to differentiate positive cleavage patterns from different arms. However, there is substantial room for improvement when distinguishing between positive and negative patterns, particularly in the differentiation of 3’ patterns from non-cleavage patterns.

Fig. 7
figure 7

Confusion matrix for the multi-class classification model. Rows and columns represent the counts of true and predicted labels, respectively. The values along the diagonal represent the counts of correctly predicted labels by the model

The ROC and PR curves for the multi-class classification model using a one-vs-all strategy are shown in Fig. 8. The AUC scores for both 5’ and 3’ patterns reached 0.97, surpassing that of the negative pattern (0.93). In the PR curves, the precision for 3’ patterns decreased more rapidly as recall increased, compared with that for 5’ and non-cleavage patterns.

Fig. 8
figure 8

Receiver-operating characteristic curves (a) and precision-recall curves (b) for the multi-class classification model. In this experiment, the model was transformed into a binary classifier by designating one class as the positive class and the other two classes as the negative classes
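The one-vs-all transformation used for Fig. 8 amounts to relabeling one class as positive and pooling the remaining classes as negative. A sketch, paired with a rank-based (Mann-Whitney) AUC for illustration; the function names and score format are our assumptions, not DiCleave's API:

```python
def one_vs_all(labels, scores, positive_class):
    """Collapse multi-class predictions into a binary problem for one class.

    labels: true class per sample; scores: per-sample dict of class scores.
    """
    y_true = [1 if y == positive_class else 0 for y in labels]
    y_score = [s[positive_class] for s in scores]
    return y_true, y_score

def auc_score(y_true, y_score):
    """AUC via the Mann-Whitney statistic; ties count as half a win."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```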

Discussion

We introduced a DNN model to predict the presence of a human Dicer cleavage site within a short sequence segment, defined as a cleavage pattern. Our model’s input consisted of a combination of one-hot encodings for sequences, including pattern sequences and their complementary sequences, and secondary structure information. Rather than merging these inputs into twelve types of dimers that include combinations of the four bases (“A”, “C”, “G”, “U”) and the three secondary structure indicators (“(”, “.”, “)”) [47], we opted to stack them. This approach was chosen to avoid creating a sparse input space that could yield an unnatural convolution output. The combination of sequence patterns and the secondary structure embeddings extracted by the autoencoder significantly improved the discriminative capacity of the DNN model, resulting in accuracies of 0.91 and 0.89 for the binary and multi-class classification tasks, respectively. This result demonstrated that incorporating pre-miRNA secondary structure embeddings and leveraging the shortcut structure of CU1 significantly enhanced the performance of the model. It is worth noting that the multi-class classification model was trained with an unbalanced dataset in terms of the number of patterns for each class. Although no further measures were taken to address this imbalance apart from adjusting the weight of the negative patterns in the loss calculation, our model still yielded satisfactory results. This success could be attributed to the valuable information provided by the secondary structures, which proved effective in distinguishing between cleavage and non-cleavage patterns.
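The stacking strategy described above can be illustrated with a toy encoder. The 4 + 4 + 3 = 11 channel ordering below is our assumption for illustration and may not match the actual DiCleave implementation:

```python
BASES = "ACGU"     # nucleotide alphabet
STRUCT = "(.)"     # dot-bracket secondary structure symbols

def one_hot(symbols, alphabet):
    """One row (channel) per alphabet symbol, one column per position."""
    return [[1 if s == a else 0 for s in symbols] for a in alphabet]

def stacked_input(pattern, complement, structure):
    """Stack the three one-hot encodings along the channel axis
    (4 + 4 + 3 = 11 rows) instead of merging base and structure symbols
    into sparse base-structure dimer tokens."""
    return (one_hot(pattern, BASES)
            + one_hot(complement, BASES)
            + one_hot(structure, STRUCT))
```

Stacking keeps each channel densely populated, whereas a 12-symbol dimer vocabulary would spread the same information over a much sparser input space.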

The computational experiments conducted in this study involved a random selection of training and test datasets from the main dataset. However, even when sequences with 80% or higher similarity were pre-filtered from the main dataset using CD-HIT-EST software [48, 49], DiCleave still demonstrated high performance for predicting cleavage patterns. Specifically, the best model among 10 replications achieved an F1 score of 0.85 or higher for both binary and multi-class classification tasks in the balanced case, and 0.80 or higher in the unbalanced case (Additional file 1: Table S5).

One limitation of this study was the relatively small size of the human pre-miRNA dataset used for training the deep learning model, which led to overfitting during training. To mitigate this, we employed a small batch size of 20 samples per mini-batch, as preliminary experiments indicated that a larger batch size caused the models to become stuck in local minima. We set the learning rate to 0.005, which is larger than the default value in PyTorch’s Adam implementation. Low learning rates (e.g., 1e-4 or 5e-4) were found to slow down the training process and hinder model convergence, whereas high learning rates led to overfitting. One potential solution for the small and unbalanced dataset is data augmentation, which can increase the number of training samples. However, effective data augmentation for nucleotide sequences requires expert knowledge in biology. Another limitation pertains to the definition of the cleavage pattern within pre-miRNA sequences. In this study, a cleavage pattern was defined as a 14-nt-long sequence with a Dicer cleavage site at the center. However, in real scenarios, the cleavage site could occur at any position within a given sequence, which could significantly increase the dataset size if each possible position were considered.
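The 14-nt cleavage-pattern definition above can be sketched as a simple windowing step; how the center is placed for an even window width, and the function name itself, are our assumptions rather than details taken from the paper:

```python
def cleavage_pattern(sequence, cut_index, width=14):
    """Return the width-nt window centered on the cut site, where the cut
    falls between positions cut_index - 1 and cut_index; return None if the
    window would run off either end of the sequence."""
    half = width // 2
    start, end = cut_index - half, cut_index + half
    if start < 0 or end > len(sequence):
        return None
    return sequence[start:end]
```

Sliding this window over every position of a pre-miRNA, rather than only over annotated cut sites, is what would inflate the dataset in the scenario the text describes.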

Finally, as identified by Liu and colleagues, the features near the pre-miRNA center were observed to have great significance [30], suggesting an intrinsic interplay among the bases within pre-miRNA. Consequently, our future endeavors include the integration of a Transformer-based model, which has the potential to harness these intrinsic features through the attention mechanism [47]. We also aim to create an end-to-end generative model to directly generate mature miRNA sequences from pre-miRNAs based on our cleavage prediction method in forthcoming research.

Conclusions

We have demonstrated the effectiveness of our deep learning models in predicting the presence of a human Dicer cleavage site within a given pre-miRNA sequence using both its sequence and secondary structure information. Our binary classification model exhibited superior or comparable performance compared with existing models. Furthermore, our model’s ability to function as a multi-class classifier is highly advantageous and practical. This versatility allows our model to make predictions without requiring prior information for any sequence segment, ensuring accessibility to a broad range of data in the miRBase database even when the available information is incomplete.