A comprehensive resource for integrating and displaying protein post-translational modifications

Lee, Tzong-Yi; Hsu, Justin Bo-Kai; Chang, Wen-Chi; Wang, Ting-Yuan; Hsu, Po-Chiang; Huang, Hsien-Da

doi:10.1186/1756-0500-2-111


Data Note
Open access
Published: 23 June 2009

BMC Research Notes, Volume 2, article number 111 (2009)

Tzong-Yi Lee^1,5, Justin Bo-Kai Hsu^1, Wen-Chi Chang^1,6, Ting-Yuan Wang^1, Po-Chiang Hsu^2 & Hsien-Da Huang^1,3,4


Abstract

Background

Protein Post-Translational Modification (PTM) plays an essential role in cellular control mechanisms that adjust protein physical and chemical properties, folding, conformation, stability and activity, thus also altering protein function.

Findings

dbPTM (version 1.0), which was developed previously, aimed at a comprehensive collection of protein post-translational modifications. In this updated version (dbPTM2.0), we developed the PTM database towards an expert system of protein post-translational modifications. The database comprehensively collects experimental and predicted protein PTM sites. In addition, dbPTM2.0 was extended to a knowledge base comprising the modified sites, solvent accessibility of substrates, protein secondary and tertiary structures, protein domains, protein intrinsic disorder regions, and protein variations. Moreover, this work compiles a benchmark for constructing evaluation datasets for computational studies that identify PTM sites, such as phosphorylated sites, glycosylated sites, acetylated sites and methylated sites.

Conclusion

The current release not only provides sequence-based information, but also annotates structure-based information for protein post-translational modifications. The interface is also designed to facilitate access to the resource. The database is freely accessible at http://dbPTM.mbc.nctu.edu.tw/.


Background

Protein Post-Translational Modification (PTM) plays a critical role in cellular control mechanisms, including phosphorylation for signal transduction; attachment of fatty acids for membrane anchoring and association; glycosylation for changing protein half-life, targeting substrates, and promoting cell-cell and cell-matrix interactions; and acetylation and methylation of histones for gene regulation [1]. With the use of high-throughput mass spectrometry in proteomics, several databases collecting information about protein modifications have been established. UniProtKB/Swiss-Prot [2] collects a large amount of protein modification information with annotation and structure. Phospho.ELM [3], PhosphoSite [4] and Phosphorylation Site Database [5] were developed for accumulating experimentally verified phosphorylation sites. PHOSIDA [6] integrates thousands of high-confidence in vivo phosphorylation sites identified by mass spectrometry-based proteomics in various species. Phospho3D [7] is a database of 3D structures of phosphorylation sites, which stores information retrieved from the Phospho.ELM database and is enriched with structural information and annotations at the residue level. O-GLYCBASE [8] is a database of glycoproteins, most of which include experimentally verified O-linked glycosylation sites. UbiProt [9] stores experimentally identified ubiquitylated proteins and ubiquitylation sites, which are implicated in protein degradation through an intracellular ATP-dependent proteolytic system. Moreover, the RESID protein modification database is a comprehensive collection of annotations and structures for protein modifications and cross-links, including pre-, co-, and post-translational modifications [10].

dbPTM [11] was developed previously to integrate several databases of known protein modifications, as well as putative protein modifications predicted by a series of accurate computational tools [12, 13]. This updated version of dbPTM was enhanced to become a knowledge base for protein post-translational modifications, which comprises a variety of new features including the modified sites, solvent accessibility of substrates, protein secondary and tertiary structures, protein domains and protein variations. We also collected literature related to PTMs, protein conservation and substrate site specificity. For protein phosphorylation in particular, the site-specific interactions between catalytic kinases and substrates are provided. Furthermore, a variety of prediction tools have been developed for more than ten PTM types [14], such as phosphorylation, glycosylation, acetylation, methylation, sulfation and sumoylation. This work constructed a benchmark data set for computational studies of protein post-translational modification. The benchmark data set can provide a standard for measuring the performance of prediction tools that have been presented for identifying post-translational modification sites of proteins. The web interface of dbPTM has also been redesigned and enhanced to facilitate access to the proposed resource.

Data construction and content

As shown in Figure 1, the system architecture of the dbPTM2.0 database comprises three major components: the integration of external PTM databases, the computational identification of PTMs, and the structural and functional annotations of PTMs. We integrated five PTM databases, including UniProtKB/Swiss-Prot (release 55.0) [1], Phospho.ELM (version 7.0) [15], O-GLYCBASE (version 6.0) [8], UbiProt (version 1.0) [9] and PHOSIDA (version 1.0) [6], to obtain experimental protein modifications. The description and data statistics of these databases are briefly given in Table S1 (see Additional file 1 – Table S1). Additionally, the Human Protein Reference Database (HPRD) [16], which compiles invaluable information relevant to functions and PTMs of human proteins in health and disease, was also integrated.

For the computational identification of PTMs, the KinasePhos-like method [11–13, 17] was applied to identify 20 types of PTM, each of which has at least 30 experimentally verified PTM sites. The detailed processing flow of the KinasePhos-like method is displayed in Figure S1 (see Additional file 1 – Figure S1). The learned models were evaluated using k-fold cross-validation. Table S2 (see Additional file 1 – Table S2) lists the predictive performance of these models. To reduce the number of false positive predictions, the predictive parameters were set to ensure maximal predictive specificity.
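The evaluation idea described above can be illustrated with a minimal sketch, assuming each model emits a numeric score per candidate site and that a score cutoff is the parameter being tuned. The function names, the interleaved fold assignment, and the tie-break on sensitivity are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def k_fold_splits(n_items: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    # Interleaved assignment: item i goes to fold i % k (an assumption).
    folds = [list(range(i, n_items, k)) for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        splits.append((train, test))
    return splits


def specificity(labels: Sequence[bool], preds: Sequence[bool]) -> float:
    """TN / (TN + FP): fraction of true negatives that are correctly rejected."""
    tn = sum(1 for y, p in zip(labels, preds) if not y and not p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    return tn / (tn + fp) if (tn + fp) else 0.0


def sensitivity(labels: Sequence[bool], preds: Sequence[bool]) -> float:
    """TP / (TP + FN): fraction of true sites that are recovered."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    return tp / (tp + fn) if (tp + fn) else 0.0


def pick_threshold(scores: Sequence[float], labels: Sequence[bool],
                   candidates: Sequence[float]) -> float:
    """Pick the cutoff with the highest specificity; break ties by sensitivity."""
    if not candidates:
        raise ValueError("at least one candidate threshold is required")
    best_key, best_t = None, None
    for t in candidates:
        preds = [s >= t for s in scores]
        key = (specificity(labels, preds), sensitivity(labels, preds))
        if best_key is None or key > best_key:
            best_key, best_t = key, t
    return best_t
```

In practice the cutoff would be chosen on the training folds and then assessed on the held-out fold, so that the reported specificity is not optimistic.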

The statistics of the experimental PTM sites and putative PTM sites in this integrated PTM database are given in Table 1. After redundant PTM sites among the six databases were removed, this updated version contains 45,833 experimental PTM sites in total. All experimental PTM sites are further categorized by PTM type. For instance, there are 31,363 experimental phosphorylation sites and 2,080 experimental acetylation sites in the database. In addition to the experimental PTM sites, UniProtKB/Swiss-Prot provides putative PTM sites by using sequence similarity or evolutionary potential. Moreover, KinasePhos-like methods [11–13, 17] were adopted to construct profile hidden Markov models (HMMs) for twenty types of PTMs. These models were applied to identify potential PTM sites in protein sequences obtained from UniProtKB/Swiss-Prot. As given in Table 1, 2,560,047 sites for all PTM types were identified. The structural and functional annotations of protein modifications were obtained from UniProtKB/Swiss-Prot [18], InterPro [19] and the Protein Data Bank [14]. To compare the previously developed prediction tools, it is crucial to have a common standard for evaluating their predictive performance. Therefore, we constructed a benchmark, which comprises the experimental substrate sequences for each PTM type.
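The removal of redundant sites across source databases can be sketched as follows, assuming a site is identified by its protein accession, residue position and PTM type. The record layout, the accessions and the source labels in the usage example are hypothetical and serve only as an illustration.

```python
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class SiteRecord(NamedTuple):
    accession: str   # protein accession (hypothetical values below)
    position: int    # 1-based residue position of the modified site
    ptm_type: str    # e.g. "phosphorylation"
    source: str      # database the record came from


SiteKey = Tuple[str, int, str]


def merge_sites(records: Iterable[SiteRecord]) -> Dict[SiteKey, Set[str]]:
    """Collapse records describing the same site, remembering every source."""
    merged: Dict[SiteKey, Set[str]] = {}
    for rec in records:
        # Assumption: PTM type names are compared case-insensitively.
        key = (rec.accession, rec.position, rec.ptm_type.lower())
        merged.setdefault(key, set()).add(rec.source)
    return merged


records = [
    SiteRecord("P00001", 15, "Phosphorylation", "Swiss-Prot"),
    SiteRecord("P00001", 15, "phosphorylation", "Phospho.ELM"),
    SiteRecord("P00001", 120, "acetylation", "Swiss-Prot"),
]
unique_sites = merge_sites(records)
```

Keeping the set of sources per site preserves provenance, so a site reported by several databases is counted once but can still be traced back to each of them.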

The process used to compile the evaluation sets is described in Figure S3 (see Additional file 1 – Figure S3), based on criteria developed by Chen et al. [30]. To remove redundancy, the protein sequences containing the same type of PTM sites are grouped by BLASTCLUST [31] with a threshold of 30% identity. If the identity of two protein sequences is greater than 30%, we re-aligned the fragment sequences of the substrates by BL2SEQ. If the fragment sequences of two substrates at the same location are identical, only one of the substrates was included in the benchmark data set. In this way, twenty PTM types containing more than 30 experimental sites were compiled in the benchmark data set.
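The fragment-level filtering step can be sketched as below. This is a simplification under stated assumptions: the clusters are taken as already computed (for example by BLASTCLUST), fragments are compared as plain strings rather than re-aligned with BL2SEQ, and the flank length of 6 residues is an arbitrary illustrative choice.

```python
from typing import Dict, List, Tuple


def site_window(sequence: str, position: int, flank: int = 6) -> str:
    """Return the fragment centred on a 1-based site, padded with '-' at termini."""
    idx = position - 1
    if not 0 <= idx < len(sequence):
        raise ValueError("position is outside the sequence")
    left = sequence[max(0, idx - flank):idx]
    right = sequence[idx + 1:idx + 1 + flank]
    return left.rjust(flank, "-") + sequence[idx] + right.ljust(flank, "-")


def nonredundant_sites(clusters: List[List[str]],
                       sequences: Dict[str, str],
                       sites: Dict[str, List[int]],
                       flank: int = 6) -> List[Tuple[str, int]]:
    """Within each cluster, keep one site per distinct surrounding fragment."""
    kept: List[Tuple[str, int]] = []
    for cluster in clusters:
        seen = set()  # fragments already represented in this cluster
        for protein_id in cluster:
            for pos in sites.get(protein_id, []):
                fragment = site_window(sequences[protein_id], pos, flank)
                if fragment not in seen:
                    seen.add(fragment)
                    kept.append((protein_id, pos))
    return kept


# Hypothetical toy input: proteins A and B are identical, C is unrelated.
sequences = {"A": "MKTAYSPKRLL", "B": "MKTAYSPKRLL", "C": "GGGGSGGGG"}
clusters = [["A", "B"], ["C"]]
sites = {"A": [6], "B": [6], "C": [5]}
benchmark = nonredundant_sites(clusters, sequences, sites, flank=3)
```

Restricting the comparison to proteins within the same cluster keeps identical short fragments from unrelated proteins, which matches the idea that only homologous substrates are treated as redundant.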

Enhanced web interface

A user-friendly web interface is provided for simple searching, browsing, and downloading of protein PTM data. In addition to database queries by protein name, gene name, UniProtKB/Swiss-Prot ID or accession, it accepts protein sequences as input for similarity searches against UniProtKB/Swiss-Prot protein sequences (see Additional file 1 – Figure S4). To provide an overview of PTM types and their modified residues, a summary table is provided for browsing the information and annotations about the post-translational modification types, which refer to the UniProtKB/Swiss-Prot PTM list http://www.expasy.org/cgi-bin/lists?ptmlist.txt and RESID [10].

Figure 3 shows an example in which users choose the acetylation of lysine (K) to obtain more detailed information, such as the position of the modified amino acid, the location of the modification in the protein sequence, the modified chemical formula, the mass difference, and the substrate site specificity, which is the preference of amino acids surrounding the modification sites. Furthermore, structural information, such as the solvent accessibility and secondary structure surrounding the modified sites, is provided. All the experimental PTM sites and putative PTM sites can be downloaded from the web interface.
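Substrate site specificity, described above as the preference of amino acids surrounding the modification sites, can be summarised as per-position residue frequencies over aligned site windows. The following is a minimal sketch of that idea; the function name and the three-residue toy windows are illustrative assumptions and are not taken from the database.

```python
from collections import Counter
from typing import Dict, List, Sequence


def position_frequencies(windows: Sequence[str]) -> List[Dict[str, float]]:
    """Return, for each window column, the relative frequency of each residue."""
    if not windows:
        return []
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    frequencies = []
    for col in range(length):
        counts = Counter(w[col] for w in windows)
        total = sum(counts.values())
        frequencies.append({aa: n / total for aa, n in counts.items()})
    return frequencies


# Toy windows centred on a modified lysine (K) at the middle column.
toy_windows = ["GKA", "GKS", "AKA", "GKT"]
profile = position_frequencies(toy_windows)
```

A profile of this kind is what a sequence logo visualises: columns dominated by one residue indicate a strong preference at that position relative to the modified site.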

Conclusion

The proposed server enables both wet-lab biologists and bioinformatics researchers to easily explore information about protein post-translational modifications. This study not only accumulates the experimentally verified PTM sites with relevant literature references, but also computationally annotates twenty types of PTM sites on UniProtKB/Swiss-Prot proteins. As given in Table 2, the proposed knowledge base provides useful information about protein PTMs, including sequence conservation, subcellular localization and substrate specificity, the average solvent accessibility and the secondary structure surrounding the modified site. Moreover, we constructed a PTM benchmark data set that can be adopted in computational studies to evaluate the predictive performance of various tools for determining PTM sites. Previous investigations have indicated that many protein modifications create binding sites for specific protein-protein interactions that regulate cellular behavior [32]. All the experimental PTM sites and putative PTM sites are available for download from the web interface. Future work on dbPTM will integrate protein-protein interaction data.

Availability and requirements

Project name: dbPTM 2.0: A Knowledge Base for Protein Post-Translational Modifications

Project home page: http://dbPTM.mbc.nctu.edu.tw/

Operating system(s): Platform-independent

Programming Language: PHP, Perl

Other requirements: a modern web browser (with CSS and JavaScript support)

Restrictions to use by non-academics: None

Abbreviations

PTM: Post-Translational Modification
HMMs: hidden Markov models
PDB: Protein Data Bank
SNP: single nucleotide polymorphism

References

1. Farriol-Mathis N, Garavelli JS, Boeckmann B, Duvaud S, Gasteiger E, Gateau A, Veuthey AL, Bairoch A: Annotation of post-translational modifications in the Swiss-Prot knowledge base. Proteomics. 2004, 4 (6): 1537-1550. 10.1002/pmic.200300764.
2. Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, et al: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003, 31 (1): 365-370. 10.1093/nar/gkg095.
3. Diella F, Gould CM, Chica C, Via A, Gibson TJ: Phospho.ELM: a database of phosphorylation sites–update 2008. Nucleic Acids Res. 2008, 36 (Database issue): D240-244.
4. Hornbeck PV, Chabra I, Kornhauser JM, Skrzypek E, Zhang B: PhosphoSite: A bioinformatics resource dedicated to physiological protein phosphorylation. Proteomics. 2004, 4 (6): 1551-1561. 10.1002/pmic.200300772.
5. Wurgler-Murphy SM, King DM, Kennelly PJ: The Phosphorylation Site Database: A guide to the serine-, threonine-, and/or tyrosine-phosphorylated proteins in prokaryotic organisms. Proteomics. 2004, 4 (6): 1562-1570. 10.1002/pmic.200300711.
6. Gnad F, Ren S, Cox J, Olsen JV, Macek B, Oroshi M, Mann M: PHOSIDA (phosphorylation site database): management, structural and evolutionary investigation, and prediction of phosphosites. Genome Biol. 2007, 8 (11): R250. 10.1186/gb-2007-8-11-r250.
7. Zanzoni A, Ausiello G, Via A, Gherardini PF, Helmer-Citterich M: Phospho3D: a database of three-dimensional structures of protein phosphorylation sites. Nucleic Acids Res. 2007, 35: D229-231. 10.1093/nar/gkl922.
8. Gupta R, Birch H, Rapacki K, Brunak S, Hansen JE: O-GLYCBASE version 4.0: a revised database of O-glycosylated proteins. Nucleic Acids Res. 1999, 27 (1): 370-372. 10.1093/nar/27.1.370.
9. Chernorudskiy AL, Garcia A, Eremin EV, Shorina AS, Kondratieva EV, Gainullin MR: UbiProt: a database of ubiquitylated proteins. BMC Bioinformatics. 2007, 8: 126. 10.1186/1471-2105-8-126.
10. Garavelli JS: The RESID Database of Protein Modifications as a resource and annotation tool. Proteomics. 2004, 4 (6): 1527-1533. 10.1002/pmic.200300777.
11. Lee TY, Huang HD, Hung JH, Huang HY, Yang YS, Wang TH: dbPTM: an information repository of protein post-translational modification. Nucleic Acids Res. 2006, 34: D622-627. 10.1093/nar/gkj083.
12. Huang HD, Lee TY, Tzeng SW, Wu LC, Horng JT, Tsou AP, Huang KT: Incorporating hidden Markov models for identifying protein kinase-specific phosphorylation sites. J Comput Chem. 2005, 26 (10): 1032-1041. 10.1002/jcc.20235.
13. Huang HD, Lee TY, Tzeng SW, Horng JT: KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites. Nucleic Acids Res. 2005, 33: W226-229. 10.1093/nar/gki471.
14. Zhou F, Xue Y, Yao X, Xu Y: A general user interface for prediction servers of proteins' post-translational modification sites. Nat Protoc. 2006, 1 (3): 1318-1321. 10.1038/nprot.2006.209.
15. Diella F, Cameron S, Gemund C, Linding R, Via A, Kuster B, Sicheritz-Ponten T, Blom N, Gibson TJ: Phospho.ELM: a database of experimentally verified phosphorylation sites in eukaryotic proteins. BMC Bioinformatics. 2004, 5 (1): 79. 10.1186/1471-2105-5-79.
16. Mishra GR, Suresh M, Kumaran K, Kannabiran N, Suresh S, Bala P, Shivakumar K, Anuradha N, Reddy R, Raghavan TM, et al: Human protein reference database–2006 update. Nucleic Acids Res. 2006, 34: D411-414. 10.1093/nar/gkj141.
17. Wong YH, Lee TY, Liang HK, Huang CM, Wang TY, Yang YH, Chu CH, Huang HD, Ko MT, Hwang JK: KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns. Nucleic Acids Res. 2007, 35: W588-594. 10.1093/nar/gkm322.
18. Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A: The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat. 2004, 23 (5): 464-470. 10.1002/humu.20021.
19. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Biswas M, Bradley P, Bork P, Bucher P, et al: InterPro: an integrated documentation resource for protein families, domains and functional sites. Brief Bioinform. 2002, 3 (3): 225-235. 10.1093/bib/3.3.225.
20. Deshpande N, Addess KJ, Bluhm WF, Merino-Ott JC, Townsend-Merino W, Zhang Q, Knezevich C, Xie L, Chen L, Feng Z, et al: The RCSB Protein Data Bank: a redesigned query system and relational database based on the mmCIF schema. Nucleic Acids Res. 2005, 33: D233-237. 10.1093/nar/gki057.
21. Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983, 22 (12): 2577-2637. 10.1002/bip.360221211.
22. Ahmad S, Gromiha MM, Sarai A: RVP-net: online prediction of real valued accessible surface area of proteins from single sequences. Bioinformatics. 2003, 19 (14): 1849-1851. 10.1093/bioinformatics/btg249.
23. McGuffin LJ, Bryson K, Jones DT: The PSIPRED protein structure prediction server. Bioinformatics. 2000, 16 (4): 404-405. 10.1093/bioinformatics/16.4.404.
24. Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT: Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol. 2004, 337 (3): 635-645. 10.1016/j.jmb.2004.02.002.
25. Gustafson TA, He W, Craparo A, Schaub CD, O'Neill TJ: Phosphotyrosine-dependent interaction of SHC and insulin receptor substrate 1 with the NPEY motif of the insulin receptor via a novel non-SH2 domain. Mol Cell Biol. 1995, 15 (5): 2500-2508.
26. Hers I, Bell CJ, Poole AW, Jiang D, Denton RM, Schaefer E, Tavare JM: Reciprocal feedback regulation of insulin receptor and insulin receptor substrate tyrosine phosphorylation by phosphoinositide 3-kinase in primary adipocytes. Biochem J. 2002, 368: 875-884. 10.1042/BJ20020903.
27. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001, 29 (1): 308-311. 10.1093/nar/29.1.308.
28. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, et al: The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003, 4: 41. 10.1186/1471-2105-4-41.
29. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22 (22): 4673-4680. 10.1093/nar/22.22.4673.
30. Chen H, Xue Y, Huang N, Yao X, Sun Z: MeMo: a web tool for prediction of protein methylation modifications. Nucleic Acids Res. 2006, 34: W249-253. 10.1093/nar/gkl233.
31. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
32. Seet BT, Dikic I, Zhou MM, Pawson T: Reading protein modifications with interaction domains. Nat Rev Mol Cell Biol. 2006, 7 (7): 473-483. 10.1038/nrm1960.


Acknowledgements

The authors would like to thank the National Science Council of the Republic of China for financially supporting this research under contract No. NSC 95-2311-B-009-004-MY3 and NSC 97-2627-B-009-007. Special thanks for financial support from the National Research Program for Genomic Medicine (NRPGM), Taiwan. This work was also partially supported by MOE ATU. Funding to pay the Open Access publication charges for this article was provided by the National Science Council of the Republic of China and MOE ATU.

Author information

Authors and Affiliations

1. Department of Biological Science and Technology, Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsin-Chu, 300, Taiwan
Tzong-Yi Lee, Justin Bo-Kai Hsu, Wen-Chi Chang, Ting-Yuan Wang & Hsien-Da Huang
2. Department of Biological Science and Technology, Institute of Biochemical Engineering, National Chiao Tung University, Hsin-Chu, 300, Taiwan
Po-Chiang Hsu
3. Department of Biological Science and Technology, National Chiao Tung University, Hsin-Chu, 300, Taiwan
Hsien-Da Huang
4. Core Facility for Structural Bioinformatics, National Chiao Tung University, Hsin-Chu, 300, Taiwan
Hsien-Da Huang
5. Department of Computer Science and Engineering, Yuan Ze University, Taoyuan, 320, Taiwan
Tzong-Yi Lee
6. Institute of Tropical Plant Science, National Cheng Kung University, Tainan, 701, Taiwan
Wen-Chi Chang


Corresponding author

Correspondence to Hsien-Da Huang.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

HDH conceptualized the project. TYL and HDH designed and built the database. TYL, PCH and WCC performed data analysis. TYL and JBKH designed and built the interfaces. TYL, JBKH and TYW compiled a previous version of the database. HDH, TYL and WCC wrote the draft. All authors tested the database and interfaces. All authors read and approved the final manuscript.

Electronic supplementary material

13104_2008_249_MOESM1_ESM.doc

Additional file 1: Supplementary figures (S1, S2, S3, and S4) and tables (S1, S2, and S3). The file provides 4 figures and 3 tables, which are described below. Figure S1. The detailed processing flow of KinasePhos-like methods. Figure S2. The multiple sequence alignment of orthologous conserved regions. Figure S3. The flowchart for removing data redundancy. Figure S4. Example of search web pages. Table S1. Data statistics of the integrated resources. Table S2. The parameters and predictive performance of the trained models with best accuracy for each PTM type. Table S3. The list of integrated databases and programs. (DOC 1024 KB)


Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Lee, TY., Hsu, J.BK., Chang, WC. et al. A comprehensive resource for integrating and displaying protein post-translational modifications. BMC Res Notes 2, 111 (2009). https://doi.org/10.1186/1756-0500-2-111


Received: 18 November 2008
Accepted: 23 June 2009
Published: 23 June 2009
DOI: https://doi.org/10.1186/1756-0500-2-111
