Background

Feature extraction from protein sequences plays an important role in protein classification [1,2,3,4] in many areas, such as identification of plant pentatricopeptide repeat coding protein [5], prediction of bacterial type IV secreted effectors [

Variable selection

Variable selection is accomplished at the sixth step. For each dimension, the established ensemble classifier is applied to the testing samples, and the accuracy (Acc), expressed in Eq. (2), and the area under the curve (AUC) of the receiver operating characteristic (ROC) are calculated. Accordingly, a line chart is obtained whose horizontal axis gives the variable indices in their descending order and whose vertical axis gives the corresponding Accs and AUCs at each dimension. A dimension threshold can be set at the point where the Accs and AUCs remain almost the same as the dimension increases. Thus, the variables that really help to recognize proteins with specific functions are selected from the encoded feature.
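The threshold choice described above can be sketched in code. The following is a minimal illustration, not the procedure used in this work: the function name `plateau_dimension`, the tolerance `tol`, the look-ahead `patience`, and the example Acc and AUC values are all hypothetical. It returns the smallest dimension after which a score curve changes by no more than `tol` over the next `patience` dimensions.

```python
def plateau_dimension(scores, tol=0.005, patience=3):
    """Return the smallest 1-based dimension after which the score stays flat.

    scores[i] is the Acc (or AUC) obtained with the top i + 1 ranked variables.
    A dimension d qualifies when each of the next `patience` scores differs
    from scores[d - 1] by at most `tol`. Both parameters are illustrative
    assumptions, not values taken from the text.
    """
    n = len(scores)
    for d in range(1, n - patience + 1):
        base = scores[d - 1]
        window = scores[d:d + patience]
        if all(abs(s - base) <= tol for s in window):
            return d
    return n  # no plateau found: keep every dimension


# Made-up curves for illustration only.
accs = [0.60, 0.72, 0.81, 0.86, 0.88, 0.881, 0.879, 0.882, 0.880]
aucs = [0.65, 0.78, 0.86, 0.90, 0.915, 0.93, 0.931, 0.929, 0.930]

# Require both curves to have flattened before cutting off.
threshold = max(plateau_dimension(accs), plateau_dimension(aucs))
print(threshold)
```

Taking the larger of the two plateau points is one possible convention; it keeps a variable as long as either Acc or AUC is still improving.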

Measure

Evaluation metrics are used to estimate the effectiveness of the selected variables at the seventh step. The classification error rate is expressed as follows,

$$\begin{aligned} Err={{FN+FP} \over {TP+FN+TN+FP}}, \end{aligned}$$
(1)

where TP, TN, FP and FN represent the numbers of true positives, true negatives, false positives and false negatives, respectively. Conversely, Acc is the complement of Err, that is, \(Acc = 1 - Err\), and is given by

$$\begin{aligned} Acc={{TN+TP} \over {TP+FN+TN+FP}}. \end{aligned}$$
(2)

In addition, we choose four widely used quantitative measurements. The confusion matrix presents TP, TN, FP and FN together. Precision and Recall are computed as follows,

$$\begin{aligned} Precision= & {} {{TP} \over {TP+FP}}, \end{aligned}$$
(3)
$$\begin{aligned} Recall= & {} {{TP} \over {TP+FN}}. \end{aligned}$$
(4)

In addition, the \(F1\text{-}measure\) is the harmonic mean of Precision and Recall, which is expressed as

$$\begin{aligned} F1\text{-}measure = {{2 \times Precision \times Recall} \over {Precision+Recall}}. \end{aligned}$$
(5)
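As a worked illustration of Eqs. (1) to (5), the sketch below computes all five quantities from the four confusion-matrix counts. The function name `classification_metrics` and the example counts are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention that the text does not specify.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute Err, Acc, Precision, Recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    err = (fn + fp) / total                      # Eq. (1)
    acc = (tn + tp) / total                      # Eq. (2)
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # Eq. (3)
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # Eq. (4)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # Eq. (5)
    return {"Err": err, "Acc": acc, "Precision": precision,
            "Recall": recall, "F1": f1}


# Made-up counts for illustration: 100 test samples in total.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, Err and Acc sum to one, and the F1-measure lies between Precision and Recall, closer to the smaller of the two, as expected of a harmonic mean.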

Moreover, the ROC curve is provided here as a qualitative measurement, together with its AUC.
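For reference, the AUC equals the probability that a randomly chosen positive sample receives a higher classifier score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by direct pairwise comparison. It is a minimal illustration under that definition, not the implementation used in this work; the function name `roc_auc` and the example labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 class labels (1 = positive).
    scores: classifier scores, higher meaning more likely positive.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and scores for illustration.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.4f}")
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a small test set; a rank-based computation would be preferable for large ones.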