Introduction

Phosphorylation is one of the most important protein post-translational modifications (PTMs) in eukaryotes. It plays essential roles in the majority of biological pathways, regulating cellular processes such as metabolism, proliferation, differentiation and apoptosis1. More than 30% of all eukaryotic proteins are estimated to undergo reversible phosphorylation2. Biochemically, phosphorylation transfers a phosphate moiety from adenosine triphosphate (ATP) to the acceptor residue, generating adenosine diphosphate (ADP) and the phosphorylated residue. Protein phosphorylation usually involves distinct short peptide motifs, or patterns that include the substrate phosphorylation sites, being recognized by different protein kinases, which then typically attach a phosphate moiety to a Serine (Ser), Threonine (Thr) or Tyrosine (Tyr) residue.

Conventional experimental identifications and recent advances in high throughput Mass Spectrometry (MS) techniques have generated a large number of phosphorylated substrates with confirmed phosphorylation sites. In parallel, a series of algorithms have also been developed to predict phosphorylation sites from amino acid sequence. These range from simple motif or pattern searches to more complex machine learning methods like Artificial Neural Networks (ANN) and Support Vector Machines (SVM). Examples of such predictive algorithms include NetPhos3, NetPhosK4, KinasePhos5, DISPHOS6, Scansite7, PPSP8, GPS9, PredPhospho10 and Musite11.

Most computational phosphorylation site predictors are not organism-specific. However, as the number of experimentally verified protein phosphorylation sites for different organisms increases, a growing goal is to develop organism-specific phosphorylation predictors, as has occurred for yeast12, Arabidopsis13 and rice1. The yeast-specific predictor, NetPhosYeast, outperforms existing generic predictors in the identification of phosphorylation sites in yeast12. PhosPhAt, which predicts phosphorylated serine sites for Arabidopsis, performs better on Arabidopsis sequences than generic predictors13. Furthermore, a protein family-specific phosphorylation site predictor, PhosTryp, was developed specifically for the Trypanosomatidae family of parasitic protozoa14.

We have focused our efforts on rice (Oryza sativa L.). Rice is considered a model plant species for the monocots; it has a sequenced genome15 and serves as a cornerstone for the study of functional genomics in cereal plants16. Phosphorylated proteins have been identified in rice treated with various hormones17 and under different environmental conditions, including high salinity18, drought19 and high temperature20. Many phosphorylation sites in rice were identified by Nakagami et al.21 However, current predictors perform poorly when individually used to predict phosphorylation sites in rice phosphoproteins1. We therefore previously established a meta-predictor for rice-specific phosphorylation sites1. However, this rice-specific predictor was not trained directly on rice phosphorylation site data, but was developed by integrating six existing prediction programs: NetPhosK, NetPhos 2.0, KinasePhos, PredPhospho 1.0, Scansite and DISPHOS. This paper augments that earlier work by building a Support Vector Machine (SVM) prediction model trained directly on experimentally identified rice phosphorylation sites.

Results

Performance of the 6 encoding schemes

The performance of the three sole encoding schemes was measured using datasets of different sizes, with SVM as the classifier. CKSAAP performed best among the three sole encoding schemes (Fig. 1). However, as the dataset size increased, the performance of SVM with CKSAAP decreased, that of SVM with AF fluctuated, and that of SVM with KNN increased (Table 1). The same trends (CKSAAP decreasing, AF fluctuating and KNN increasing) were observed when the ratio of (+) sites to (−) sites increased (Table 1).

Table 1 Performance of the 3 sole encoding schemes on datasets of different sizes.
Figure 1

ROC curves of predicting performance of SVM with 3 different sole encoding schemes.

*In the diagrams, a larger area under the ROC curve indicates better classification performance. The same applies to the figures below.

The performance of AF combined with CKSAAP (AF-CKSAAP) was better than that of either sole encoding scheme, AF or CKSAAP (Fig. 2A). The same was true for AF combined with KNN (AF-KNN) (Fig. 2B). However, CKSAAP combined with KNN (CKSAAP-KNN) outperformed KNN, but did not outperform CKSAAP (Fig. 2C). In a preliminary experiment, we found that the combination of all three encoding schemes did not significantly outperform CKSAAP (data not shown) but increased the feature dimension. These results imply that AF, KNN and CKSAAP might be complementary to each other to some extent, especially AF and CKSAAP.

Figure 2

ROC curves of predicting performance of SVM with the combined encoding schemes.

*A. ROC curves of SVM with AF-CKSAAP, AF and CKSAAP. B. ROC curves of SVM with AF-KNN, AF and KNN. C. ROC curves of SVM with CKSAAP-KNN, KNN and CKSAAP.

Performance of 4 different classifiers

The performance of each classifier with the six different encoding schemes was compared first. The best results for the DT classifier were obtained with CKSAAP, with an ACC of 71.14% and MCC of 0.314 (Fig. 3). The best results for the KNN classifier were obtained with CKSAAP-KNN, with an ACC of 73.71% and MCC of 0.402 (Fig. 3). The best results for the RF classifier were obtained with AF-KNN, with an ACC of 75.1% and MCC of 0.458 (Fig. 3). The best results for the SVM classifier were obtained with AF-CKSAAP, with an ACC of 80.90% and MCC of 0.617 (Fig. 3).

Figure 3

MCC of predicting performance of different classification algorithms with different encoding schemes.

We then compared the classification ability of the 4 classifiers on phosphorylation sites of rice proteins. As shown in Fig. 3, SVM performed best among the 4 classifiers with any of the 6 encoding schemes. SVM with AF-CKSAAP, CKSAAP and CKSAAP-KNN were the top 3 prediction models.

Performance of the top three predicting models on different phospho-amino acids

We used phospho-amino acid datasets with different ratios of (+) sites to (−) sites to assess the performance of the top three prediction models (SVM with AF-CKSAAP, SVM with CKSAAP and SVM with CKSAAP-KNN). For phospho-serine, phospho-threonine and phospho-tyrosine sites, SVM with AF-CKSAAP outperformed SVM with CKSAAP or CKSAAP-KNN, even when the ratio of (+) sites to (−) sites in the dataset was unbalanced (Table 2).

Table 2 Predicting performance of SVM with 3 different encoding schemes on different phospho-amino acid sites.

Blom et al. (2004) suggested that a real non-phosphorylation site had to be solvent-inaccessible4. After discarding the predicted solvent-accessible non-phosphorylation sites from Table S2 to compose a new negative dataset (Table S3), we extracted a new training dataset from Table S3, balanced it with positive data, and re-trained the top three prediction models. Table 3 indicates that the overall performance of SVM with AF-CKSAAP was again better than that of SVM with CKSAAP or CKSAAP-KNN.

Table 3 Predicting performance of the three top SVM models trained by the negative dataset in Table S3 and its balancing positive dataset.

Assessment of the predictor, Rice_Phospho 1.0, against existing predictors

We used SVM with AF-CKSAAP to develop a new rice-specific predictor, Rice_Phospho 1.0. We applied the independent test dataset to compare the prediction performance of Rice_Phospho 1.0 with that of existing predictors, including Scansite, Musite, PlantPhos and PhosphoRice. The MCC values of Rice_Phospho 1.0, Scansite, Musite and PhosphoRice are shown in Table 4. Rice_Phospho 1.0 had a higher MCC value than the existing predictors, indicating that its performance was significantly better than that of Scansite, Musite and PhosphoRice. The Area Under the ROC Curve (AUC) of Rice_Phospho 1.0 was higher than that of PlantPhos (Fig. 4), implying that Rice_Phospho 1.0 also outperformed PlantPhos.

Table 4 Predicting performance of SVM models and newly developed predictors.
Figure 4

ROC curves of predicting performance of Rice_Phospho 1.0 and PlantPhos.

Construction of the online predictor, Rice_Phospho 1.0

We constructed the online tool, Rice_Phospho 1.0, an SVM predictor specific to protein phosphorylation sites in rice (Oryza sativa L.). The potential phosphorylation sites are retrieved after the user enters a protein sequence in FASTA format into the text area and selects one of the encoding schemes (Fig. 5). Rice_Phospho 1.0 is accessible via http://bioinformatics.fafu.edu.cn/rice_phospho1.0.

Figure 5

Interface of the online predictor, Rice_Phospho 1.0, on rice protein phosphorylation sites.

Discussion

Our analysis indicates that CKSAAP encoding captures the sequence characteristics around phosphorylation sites more effectively than AF and KNN. Generally, the AF and KNN methods select position-specific features of a sequence fragment, while CKSAAP encoding focuses on the co-occurrence of amino acid pairs at different positions surrounding phosphorylation sites22. The sequence characteristics extracted by CKSAAP can also reflect the composition of short linear motifs, which have been widely reported to be involved in many biological processes, such as mediating protein-protein interactions23. The results in this paper imply that short linear motifs may be more important than position-specific patterns in recognizing protein phosphorylation substrates. CKSAAP encoding has been reported to predict the structural properties of a sequence fragment24 and PTM sites, including ubiquitylated sites22 and mucin-type O-glycosylated sites25. Therefore, we might also expect good performance of CKSAAP encoding in the prediction of protein phosphorylation sites.

More importantly, the reasonably good performance of SVM with AF-CKSAAP shows that the combined encoding scheme, AF with CKSAAP, can effectively capture the information of enriched or depleted residue pairs around phosphorylation sites in rice. The AF encoding scheme clearly characterizes amino acids in different positions surrounding a potential phosphorylation site, but it is weak in reflecting the coupling effect of amino acid pairs at different positions. On the other hand, CKSAAP can detect the relationship between amino acid pairs at different positions, but it cannot capture position-specific amino acid information22. Therefore, the complementarity of AF and CKSAAP results in better performance for AF-CKSAAP in extracting the sequence characteristics surrounding a potential phosphorylation site than for either individual encoding scheme. Meanwhile, across the different phospho-amino acid sites, SVM with AF-CKSAAP performed better than the others, even though the ratio of (+) sites to (−) sites in the test dataset was unbalanced. This is relevant because the accuracy of predictors may be overestimated when the ratio of (+) sites to (−) sites in the training dataset is optimized1.

We used the SVM model with AF-CKSAAP to develop an online tool, Rice_Phospho 1.0, a predictor specific to phosphorylation sites in rice. To verify the performance of Rice_Phospho 1.0, an experimentally identified rice phosphoprotein that did not appear in the training dataset was used as a query sequence. An α-tubulin isoform (LOC_Os03g51600.1) was experimentally shown to be phosphorylated at Thr349 by a comprehensive mutagenesis method26. Rice_Phospho 1.0 successfully predicted the experimentally identified pThr at position 349. Moreover, three more sites, Ser216, Tyr432 and Ser439, were predicted as novel phosphorylation sites.

In summary, we have benchmarked combinations of several encoding schemes and classification routines to establish their relative effectiveness. This led to the choice of an SVM with the AF-CKSAAP encoding scheme, which we have incorporated into an effective rice-specific protein phosphorylation site predictor, Rice_Phospho 1.0 (http://bioinformatics.fafu.edu.cn/rice_phospho1.0). Rice_Phospho 1.0 provides state-of-the-art reliability in predicting protein phosphorylation sites in rice and will be a useful tool for the community.

Methods

Preprocessing of dataset

We collected rice phosphorylation sites from the recent work of Nakagami et al.21. We also used the feature table of the Swiss-Prot database, from which records annotated as ‘predicted’ or ‘similarity’ were excluded. After removing redundant phosphorylation sites, the numbers of serine (S), threonine (T) and tyrosine (Y) sites were 4220, 605 and 141, respectively, and these phosphorylation sites were found in 2162 proteins1.

The 25-mer sequences (−12 to +12) surrounding the phosphorylation sites were extracted from the protein sequences1. Because all of these phosphorylation sites were experimentally verified, they were regarded as (+) sites and compiled within a positive dataset (Table S1). The Ser, Thr and Tyr residues that were not annotated as phosphorylation sites within the dataset were regarded as (−) sites (i.e., non-phosphorylation sites) and the 25-mer sequences surrounding them were extracted and compiled within a negative dataset (Table S2). We extracted one-third of the data from each of these two datasets to compose an independent test dataset. We used the remainder of the data in the two datasets to construct a training dataset. Phosphorylation and non-phosphorylation sites were randomly chosen from the training dataset to compile datasets with different ratios of (+) sites to (−) sites during the cross-validation processes.
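The window extraction described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `extract_windows` is hypothetical, and padding with `X` for sites closer than 12 residues to a terminus is an assumption, since the text does not say how such sites were handled.

```python
from typing import List, Set, Tuple

WINDOW = 12   # residues on each side, giving a 25-mer (-12 to +12)
PAD = "X"     # assumption: padding symbol for sites near a terminus


def extract_windows(seq: str, phospho_positions: Set[int]) -> Tuple[List[str], List[str]]:
    """Return (positive, negative) 25-mers centred on every S/T/Y residue.

    `phospho_positions` holds 1-based positions of verified phosphosites.
    """
    padded = PAD * WINDOW + seq + PAD * WINDOW
    positives: List[str] = []
    negatives: List[str] = []
    for i, residue in enumerate(seq):
        if residue not in "STY":
            continue
        # In the padded string the candidate residue sits at index i + WINDOW,
        # so this slice places it at the centre (index WINDOW) of the fragment.
        fragment = padded[i:i + 2 * WINDOW + 1]
        if (i + 1) in phospho_positions:
            positives.append(fragment)
        else:
            negatives.append(fragment)
    return positives, negatives
```

Calling `extract_windows(sequence, {349})` on a protein with one annotated site would put the fragment centred on position 349 in the positive list and every other Ser, Thr or Tyr fragment in the negative list.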

Because residues buried in the core of a protein would not be accessible to any kinase4, the NetSurfP program27 was used to predict the surface accessibility of each non-phosphorylated site in Table S2. The solvent-inaccessible non-phosphorylation sites were compiled in Table S3. We randomly selected one third of the data to compose another independent test dataset and re-trained a Support Vector Machine (SVM) using the Composition of K-Spaced Amino Acid Pairs (CKSAAP), Amino acid occurrence frequency combined with CKSAAP (AF-CKSAAP), and CKSAAP combined with K-Nearest Neighbor (CKSAAP-KNN) as encoding schemes.

A standard 10-fold cross-validation was used to train the classifiers. The dataset was randomly partitioned into 10 subsets, one used for testing and the other nine for training. Each predictor was trained by shifting the test subset stepwise so that all data were used for both training and testing. We calculated the Sensitivity (Sn), Specificity (Sp), Accuracy (ACC) and Matthews Correlation Coefficient (MCC) of each predictor1.
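The fold rotation can be illustrated with a short sketch. The function name `ten_fold_splits` and the fixed random seed are illustrative choices, not details taken from the paper.

```python
import random
from typing import Iterator, List, Tuple


def ten_fold_splits(n_samples: int, n_folds: int = 10, seed: int = 0
                    ) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) once per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)                 # random partition
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]                                  # one subset for testing
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test                                # the other nine for training
```

Each sample appears in exactly one test fold, so averaging the per-fold Sn, Sp, ACC and MCC uses every sample for testing exactly once.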

Encoding schemes and feature selection

K-Nearest Neighbor (KNN)

The KNN method is used to classify the samples based on their distances. For each sequence, the distances to all the sequences of the positive datasets or negative datasets were measured using the formula introduced by Gao et al. (2009)28:

$$\mathrm{Dist}(s_1, s_2) = 1 - \frac{\sum_{i=-w}^{w} \mathrm{sim}\big(s_1(i), s_2(i)\big)}{2w + 1}$$

where s1 = {s1(−w), s1(−w + 1),…, s1(w)} and s2 = {s2(−w), s2(−w + 1),…, s2(w)} are two protein sequence fragments, w = 12, and sim, the amino acid similarity matrix, is derived from the normalized BLOSUM62.

The K nearest distances were selected and the average distance among them was calculated. This process was repeated for different values of K (0.1%, 0.2%, 0.5%, 1%, 2%, 5% and 10% of the positive datasets or negative datasets). The ratios of the average distances between positive datasets and negative datasets were extracted as the feature values28.
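A minimal sketch of this feature follows, under stated assumptions: the distance is taken as one minus the mean per-position similarity, `identity_sim` is a placeholder for the normalized BLOSUM62 matrix, and K is rounded up to at least one neighbour. None of these details, nor the function names, are confirmed by the text.

```python
import math
from typing import Callable, List, Sequence

# K as a fraction of each reference dataset, as listed in the text
K_FRACTIONS = (0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.10)


def identity_sim(a: str, b: str) -> float:
    """Placeholder similarity in [0, 1]; the paper uses normalized BLOSUM62."""
    return 1.0 if a == b else 0.0


def distance(s1: str, s2: str, sim: Callable[[str, str], float] = identity_sim) -> float:
    """Assumed form: one minus the mean per-position similarity."""
    return 1.0 - sum(sim(a, b) for a, b in zip(s1, s2)) / len(s1)


def mean_knn_distance(query: str, reference: Sequence[str], fraction: float) -> float:
    """Average distance from `query` to its K nearest reference fragments."""
    k = max(1, math.ceil(fraction * len(reference)))   # assumption: round up
    nearest = sorted(distance(query, r) for r in reference)[:k]
    return sum(nearest) / k


def knn_features(query: str, positives: Sequence[str], negatives: Sequence[str]) -> List[float]:
    """One positive-to-negative distance ratio per K value (7 features)."""
    features = []
    for fraction in K_FRACTIONS:
        d_pos = mean_knn_distance(query, positives, fraction)
        d_neg = mean_knn_distance(query, negatives, fraction)
        features.append(d_pos / d_neg if d_neg > 0 else float("inf"))
    return features
```

A ratio below 1 means the fragment lies closer to known phosphorylation sites than to non-phosphorylation sites. When the query is itself a training fragment it should be excluded from the reference set first, which this sketch leaves to the caller.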

Composition of K-Spaced Amino Acid Pairs (CKSAAP)

CKSAAP has been successfully used to represent sequence fragments22. A sequence fragment may contain 400 types (AxA, AxC, AxD, …, OxO) of K-spaced amino acid pairs (i.e. pairs separated by K other amino acids). The flowchart and the calculation used for the CKSAAP encoding are shown in Fig. S1. The value of NAA is the count of the corresponding amino acid pair in the sequence fragment, while Ntotal represents the total number of amino acid pairs in the sequence fragment. For instance, if there are n AxA pairs in the sequence fragment, the value of the corresponding component NAA is n (Fig. S1).

When the value of K is increased, the prediction accuracy and sensitivity increase, but so do the computational complexity and the time required for training the models22,29. In this paper, we considered the CKSAAP encoding scheme with K = 0, 1, 2, 3, 4 and 5, so the total dimension of the feature vector over the 6 spacings is 2400 (6 × 400).
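The encoding can be sketched as below. It assumes the 20 standard amino acids define the 400 pair types and that each count is divided by the number of K-spaced pairs in the fragment; the exact normalization is given in Fig. S1, which is not reproduced here, so this is one plausible reading rather than the authors' implementation.

```python
from itertools import product
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"                          # 20 standard residues
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs
PAIR_INDEX: Dict[str, int] = {pair: i for i, pair in enumerate(PAIRS)}


def cksaap(fragment: str, max_k: int = 5) -> List[float]:
    """Concatenate the 400 pair compositions for k = 0..max_k (2400 values)."""
    features: List[float] = []
    for k in range(max_k + 1):
        counts = [0] * len(PAIRS)
        total = 0
        for i in range(len(fragment) - k - 1):
            pair = fragment[i] + fragment[i + k + 1]   # residues separated by k others
            if pair in PAIR_INDEX:                     # skip non-standard symbols
                counts[PAIR_INDEX[pair]] += 1
            total += 1                                 # every k-spaced pair counts toward N_total
        features.extend(c / total if total else 0.0 for c in counts)
    return features
```

For a 25-mer there are 24 pairs at K = 0 and 19 pairs at K = 5, so each block of 400 values sums to 1 when the fragment contains only standard residues.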

Amino acid occurrence frequency (AF)

The frequency of each amino acid in a sequence fragment was calculated by the following equation:

$$v_i = \frac{c_i}{\mathrm{len}(seq)}$$

where ci and len(seq) denote the number of instances of amino acid i in the sequence fragment and the length of the sequence fragment, respectively, and vi is the frequency of amino acid i in the sequence fragment.
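As a small illustration, the AF vector is each amino acid's count divided by the fragment length. The ordering of the 20 residues in `AMINO_ACIDS` is an arbitrary choice made for this sketch.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard residues, alphabetical by code


def amino_acid_frequency(fragment: str) -> List[float]:
    """Return v_i = c_i / len(seq) for each of the 20 amino acids."""
    return [fragment.count(aa) / len(fragment) for aa in AMINO_ACIDS]
```

The result is a 20-dimensional vector whose entries sum to 1 for a fragment made only of standard residues.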

Combined encoding schemes

We combined the three sole encoding schemes (KNN, CKSAAP and AF) to construct three bi-encoding schemes. Because of the high dimensionality of CKSAAP, Relief-F was used to decrease the total dimension of the combined methods. Relief-F is an extension of the original Relief algorithm that can deal with noisy data and multi-class problems rather than only two-class problems30. Relief-F was run in the Waikato Environment for Knowledge Analysis (WEKA) to decrease the dimension of the combined encoding schemes31, leading to total dimensions for AF-KNN, AF-CKSAAP and CKSAAP-KNN of 27, 952 and 939, respectively.
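The reported dimensions are consistent with a simple concatenation in which AF contributes 20 values, KNN contributes 7 values (one per K), and 932 CKSAAP features survive Relief-F. The figure of 932 is an inference from the totals (952 − 20 = 939 − 7), not a number stated in the text, and `combine` is a hypothetical helper that takes already-ranked column indices instead of running Relief-F.

```python
from typing import List, Sequence

AF_DIM = 20         # one frequency per standard amino acid
KNN_DIM = 7         # one ratio per K value listed in the KNN section
CKSAAP_KEPT = 932   # inferred: 952 - 20 = 939 - 7 = 932 CKSAAP features kept


def combine(first: Sequence[float], second: Sequence[float],
            keep_second: Sequence[int] = ()) -> List[float]:
    """Concatenate two encodings, optionally keeping only selected
    indices of the second (e.g. CKSAAP columns ranked by Relief-F)."""
    kept = [second[i] for i in keep_second] if keep_second else list(second)
    return list(first) + kept
```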

Classification Algorithms

Support Vector Machine (SVM)

An SVM is a supervised learning algorithm for two-group classification problems, whose goal is to find a rule that best maps each member of the training set to the correct class25. Briefly, an SVM constructs a hyperplane that separates two different groups of feature vectors in the training set with a maximum margin. The position of a test sample relative to the hyperplane gives the predicted score, from which the predicted class can be derived32. Because of their solid mathematical foundation in statistical theory and their ability to resist over-fitting, SVMs are popular and have been used to predict protein PTM sites28, protein localization33 and protein-protein interactions34. In this paper, LibSVM in WEKA was used with the Radial Basis Function (RBF) kernel, $K(x_i, y_i) = \exp(-\gamma \lVert x_i - y_i \rVert^2)$29.
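The kernel itself is simple to evaluate. The sketch below computes it for two feature vectors; the value of γ is left to the caller because the text does not report the one used.

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))   # squared Euclidean distance
    return math.exp(-gamma * sq_dist)
```

Identical vectors give a kernel value of 1, and the value decays toward 0 as the vectors move apart, faster for larger γ.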

Random Forest (RF)

The RF is an ensemble of unpruned decision trees35 that has already been used to predict protein-protein interactions36,37 and long disordered regions in proteins38. In RF, the number of trees in the forest is adjustable and each tree is grown to full depth using a subset of the training dataset. To classify an instance with an unknown class label, each tree casts a unit classification vote, and the forest selects the class with the most votes over all trees. There are therefore two key parameters in RF: the number of trees, M, and the number of randomly selected features, m. In this paper, we selected M = 100 and determined m based on the result of a preliminary evaluation.
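The voting step can be sketched independently of how the trees are grown. Here each tree is represented as any callable that maps a feature vector to a class label; `forest_predict` is a hypothetical name, and tree construction is deliberately left out.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def forest_predict(trees: Sequence[Callable[[Sequence[float]], Hashable]],
                   x: Sequence[float]) -> Hashable:
    """Return the class receiving the most unit votes across all trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]
```

With M = 100 trees, the predicted class is whichever label more trees vote for.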

Decision Tree (DT)

Decision trees are an attractive predictive modeling procedure because they are easy for non-statisticians to interpret. In DT algorithms, classification tree analysis generates groups of samples on the basis of a selected criterion, the Gini index, splitting each group into two to maximize the probability of a single outcome39. The recursive partitioning of the data continues until the Gini index indicates that the tree fits the information contained in the dataset without overfitting. It can provide a practical model for dichotomous outcomes if the validity of the obtained model is shown to be sufficient. Missing values were replaced with values minimizing the impurity of the nodes, median values for continuous variables and most frequent categories for categorical variables, or distribution-based estimates. The decision trees and random forests were implemented on the R platform with the package e1071.
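For reference, the Gini index of a group and the weighted impurity of a binary split can be computed as follows. This is a generic sketch of the criterion and not the R implementation used in the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def gini(labels: Sequence[Hashable]) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(left: Sequence[Hashable], right: Sequence[Hashable]) -> float:
    """Size-weighted impurity of a binary split; lower means purer children."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```

A split that separates (+) from (−) sites perfectly has a weighted impurity of 0, while a split that leaves both children evenly mixed stays at 0.5.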

Performance Assessment

Predictors comparison

The current rice-specific phosphorylation site predictor, PhosphoRice, two non-organism-specific predictors, Musite and Scansite, and one plant-specific predictor, PlantPhos, were compared with the new predictor in this paper. In the Musite online prediction, General phospho-serine/threonine and General phospho-tyrosine (Green plants) were selected and 25-mers were input to the predictor with default settings. In Scansite, the medium stringency level was selected, giving the Scansite_medium predictor. In PlantPhos, sites with a score over −3.0 (HMMER bit score) were predicted as phosphorylation sites.

Evaluation

Sn, Sp, ACC and MCC were employed to evaluate the performance of the different predictors:

$$\mathrm{Sn} = \frac{TP}{TP + FN}$$

$$\mathrm{Sp} = \frac{TN}{TN + FP}$$

$$\mathrm{ACC} = \frac{TP + TN}{TP + FP + TN + FN}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, FP, FN and TN denote true positives, false positives, false negatives and true negatives. Sn and Sp give the proportions of correctly predicted positive and negative samples, respectively. Because MCC is much less sensitive to the ratio of positive to negative samples in the dataset, it is the most widely used prediction measure for two-class prediction programs1.
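The four measures follow the standard confusion-matrix definitions and can be computed as in this sketch. Returning 0 when a denominator is zero is a convention chosen here, not something the paper specifies.

```python
import math
from typing import Dict


def evaluate(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute Sn, Sp, ACC and MCC from confusion-matrix counts."""
    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0   # assumption: report 0 when undefined

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sn": ratio(tp, tp + fn),
        "Sp": ratio(tn, tn + fp),
        "ACC": ratio(tp + tn, tp + fp + fn + tn),
        "MCC": ratio(tp * tn - fp * fn, mcc_den),
    }
```

For example, 40 true positives, 10 false positives, 10 false negatives and 40 true negatives give Sn = Sp = ACC = 0.8 and MCC = 0.6.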

Statistics

We used SPSS 16.0 to create receiver operating characteristic (ROC) curves to measure the performance of different predictors. For each possible threshold, the sensitivity and specificity were evaluated and the ROC curves [sensitivity versus (1-specificity) curve] were used to compare the predictive performance of different classifiers with different encoding schemes.
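The same threshold sweep can be reproduced outside SPSS. The sketch below builds the ROC points from prediction scores and integrates them with the trapezoidal rule; it assumes labels of 1 for (+) sites and 0 for (−) sites, with at least one of each present.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """(1 - specificity, sensitivity) at every distinct score threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

A predictor that ranks every (+) site above every (−) site reaches an AUC of 1, while random scoring gives about 0.5.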

Additional Information

How to cite this article: Lin, S. et al. Rice_Phospho 1.0: a new rice-specific SVM predictor for protein phosphorylation sites. Sci. Rep. 5, 11940; doi: 10.1038/srep11940 (2015).