Background

Peptides have emerged as important affinity ligands for diagnostic and therapeutic medical uses as well as materials for a host of applications in biotechnology. Many excellent databases exist that provide protein sequence data [13] and protein interaction data [14]. In this paper, we use the term "peptides" as a common synonym for oligopeptides, which are defined as having "fewer than about 10–20 residues" [14]. We thus currently use an IUPAC-IUB length cut-off of 20 amino acid residues or fewer. Many of the peptides used as pharmaceutical and diagnostic agents fall within this cut-off.

Naturally occurring peptides function as hormones, transmitters, and modulators of numerous biological processes [15]. Both naturally occurring and synthetic peptides are used in therapeutic applications [15], for example somatostatin analogs in tumor radiotherapy [16, 17] and oxytocin to induce labor [18]. Examples of diagnostic uses include membrane-translocating agents [19], receptor targeting agents [20], and enzyme substrates [21]. Driven by the great interest in the diverse applications of peptides, the new peptidomics field is rapidly emerging [22]. The functions of peptides, including their interacting partners, are determined by their sequence and, as with longer proteins, can be predicted based on sequence similarity.

Prior knowledge can be used to predict or shorten the list of possible binding partners of a given peptide of interest, provided the peptide shares significant sequence similarity with other peptides or proteins whose binding partners are known [20, 23]. One can also use a sequence similarity search to remove peptides with similarity to other peptides with known, undesirable properties such as non-specific binding [24] or toxicity. Computational predictions are relatively fast and inexpensive, but require a peptide sequence database with links to peptide data, for use with sequence similarity search methods such as the basic local alignment search tool (BLAST) [25, 26] or Smith-Waterman search [27, 28]. The non-sequence (text) data in such a peptide database can be queried with text search tools for biological, therapeutic or diagnostic applications, for example to find peptides that are enzyme inhibitors and whose sequences are available.

We searched through the existing bioinformatics sources, and found no single source that fully suited our needs. With the exception of the Receptor Ligand Contacts (RELIC) database and web-server [10] and the Artificially Selected Proteins/Peptides Database (ASPD) [11], most large protein sequence and interaction databases that allow both sequence similarity and text annotation searches have two major drawbacks. First, most of their sequences are of biological origin, while many phage display [29, 30] or combinatorial screens yield non-biological sequence hits. There is no large repository of chemically generated unnatural sequences, similar to what PubChem [2] or ChemBank [31] are for compounds. Second, there is less data on short peptides than on longer proteins, and usually there is no facile way to restrict the search to short sequences only. This is important because performing an unrestricted sequence similarity search often results in a large proportion of false positives due to hits to proteins in which the peptide sequence is buried and not accessible for binding, or is in a conformation different from that in a shorter peptide. The same sequence may have different binding properties when displayed on a phage versus when presented as part of the native protein [32]. Sequence similarity based predictions are further hampered for conformationally constrained peptides, designed specifically to have properties different from the same sequence in linear form [33]. The ASPD [11] and RELIC [10] databases do not have these drawbacks and are well curated, but they are relatively small compared with the large amount of sequence data in the MEDLINE abstracts. For example, the ASPD database has 1,717 entries with sequences of 20 amino acids or shorter. RELIC (a server with many useful peptide sequence analysis tools) has 3,632 peptide sequences that result from phage display selections, but only 7 distinct targets to which they bind.
Other peptide databases have different purposes and are more specialized by design, for example antimicrobial (the Antimicrobial Peptide Database (APD) [13], and others [12,

Step 3. Clean-up

Words that matched peptide sequence patterns were cleaned in a series of steps and converted to 1 letter amino acid symbols, as follows. The terminal marks and modifications, such as 'H(2)N-' or '-CO-Ph', were removed. Numbers representing amino acid positions were removed. Other modifications, such as phosphate in 'pY' were removed. Motifs such as '(L/I)' or 'L/I' were resolved. Amino acids that do not have a 1 letter IUPAC symbol were replaced with X. As a result, a large variety of different sequence formats were resolved, including 'N-acetyl-l-aspartyl-l-glutamyl-l-valyl-l-aspartyl-7-amino-4-methylcoumarin' to 'DEVD', 'Gly1-Val2-Thr3-Ser4' to 'GVTS', '(Arg-Glu(EDANS)-Ser-Gln)' to 'RESQ', 'TRDI-pY-ETD-pY-pY-RK' to 'TRDIYETDYYRK', and others.
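The clean-up rules described above can be illustrated with a minimal Python sketch. It is not the Peptide::Pubmed implementation: the function names, the short list of terminal marks, and the order of the steps are assumptions made for illustration, and motif resolution such as '(L/I)' is left out because the text does not say how motifs were resolved.

```python
import re

# Three-letter to one-letter symbols for the 20 standard amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Hypothetical, deliberately short list of terminal marks to strip.
TERMINAL_MARKS = [r"^H\(2\)N-", r"-CO-Ph$"]


def clean_three_letter(word: str) -> str:
    """Convert a hyphen-separated 3-letter sequence to 1-letter symbols."""
    # Drop one pair of enclosing parentheses, as in '(Arg-...-Gln)'.
    if word.startswith("(") and word.endswith(")"):
        word = word[1:-1]
    # Drop parenthesised labels attached to a residue, such as '(EDANS)'.
    word = re.sub(r"\([^()]*\)", "", word)
    residues = []
    for token in word.split("-"):
        token = re.sub(r"\d+", "", token)  # position numbers, as in 'Gly1'
        if not token:
            continue
        # Residues without a 1-letter symbol become 'X'.
        residues.append(THREE_TO_ONE.get(token.capitalize(), "X"))
    return "".join(residues)


def clean_one_letter(word: str) -> str:
    """Strip terminal marks, separators and modification prefixes."""
    for pattern in TERMINAL_MARKS:
        word = re.sub(pattern, "", word)
    # Keep uppercase letters only: this removes hyphens, digits and
    # lowercase modification prefixes such as the 'p' in 'pY'.
    return re.sub(r"[^A-Z]", "", word)


print(clean_three_letter("Gly1-Val2-Thr3-Ser4"))
print(clean_three_letter("(Arg-Glu(EDANS)-Ser-Gln)"))
print(clean_one_letter("TRDI-pY-ETD-pY-pY-RK"))
```

Full chemical names such as 'N-acetyl-l-aspartyl-...' would need an additional name-to-symbol table and are not handled by this sketch.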

To estimate the precision of text mining, 50 sequences with the combined score above the threshold for inclusion in PepBank were selected at random from the text mining output. Each of these positive predictions was manually checked against two criteria: whether the word contained a peptide sequence (40 out of 50 correct, precision = 0.8), and whether the word contained a peptide sequence AND the sequence was parsed 100% correctly (35 out of 50 correct, precision = 0.7). If the identified sequence was a partial protein sequence, rather than a peptide or a phage display sequence, it was considered an error: such sequences are typically entered in protein databases and do not need to be mined from text (most of the errors in precision were of this type). One or more incorrect amino acids in a sequence were also considered an error.

For estimating recall, we created a separate test set of 50 sequences by searching in PubMed for recent review articles using as a query "peptide OR peptides" alone or in combination with "sequence OR sequences", and followed the PubMed abstract links for the references cited in the reviews. Peptide sequences were manually extracted from the abstracts without any automated pattern matching. The text mining output with the combined score above the threshold for inclusion in PepBank was matched against these positive real cases. Again, for each case we manually verified whether or not the algorithm found the word, which contained this peptide sequence (12 out of 50 correct, recall = 0.24), and whether or not the algorithm found the word AND the sequence was parsed 100% correctly (10 out of 50 correct, recall = 0.2). Most of the errors in recall were due to blanks (often typos) inside peptide sequences or due to unrecognized amino acid modifications.
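The precision and recall figures reported above follow directly from the manual counts. The short Python sketch below reproduces that arithmetic; the variable and function names are illustrative only.

```python
def fraction(correct: int, total: int) -> float:
    """Fraction of manually verified cases judged correct."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


SAMPLE_SIZE = 50  # both evaluation sets contained 50 sequences

# Precision: positive predictions sampled from the text mining output.
precision_found = fraction(40, SAMPLE_SIZE)   # word contains a peptide
precision_parsed = fraction(35, SAMPLE_SIZE)  # ...and was parsed correctly

# Recall: peptides manually extracted from abstracts cited in reviews.
recall_found = fraction(12, SAMPLE_SIZE)      # word was found
recall_parsed = fraction(10, SAMPLE_SIZE)     # ...and was parsed correctly

for name, value in [
    ("precision (found)", precision_found),
    ("precision (parsed)", precision_parsed),
    ("recall (found)", recall_found),
    ("recall (parsed)", recall_parsed),
]:
    print(f"{name}: {value:.2f}")
```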

The pioneering method to identify DNA and protein sequences in text, based on Markov models, was described by Wren and co-workers [57]. Our text mining method, while similar in spirit, has different goals and thus uses a different sequence identification strategy. One of our main goals was to rapidly identify peptides with potential therapeutic and diagnostic utility (including those derived from phage display peptides), rather than identifying peptide epitopes and providing an aid to their manual curation. We also use extensive context information from the abstract, and collect peptide motifs in addition to sequences. We clean the sequences and provide access to the data for biologists through a simple web-based interface for text and sequence similarity searches. We do not place a minimum length restriction on sequences, such as 6 amino acids, because many therapeutic peptides are relatively short, for example the well-known RGD motif and many others found in phage display. Given the substantial differences in goals and methods between our approach and that of others, it may be interesting to develop a hybrid method in the future that combines the strengths of both approaches.

Other sources

All peptide sequences with length 20 or below were extracted from ASPD [11] and UniProt [1], and fields that mapped to PepBank were parsed and stored (for example, interactor fields from ASPD, peptide fields from UniProt). The links from PepBank to the source databases were provided for all entries. Many of the peptides were stored in UniProt as part of the longer precursor proteins, producing peptides on cleavage. These peptide sequences were extracted using the UniProt feature table by selecting those with feature key "peptide" or "chain" and feature length under 20. Additional entries were manually curated, capturing the available interaction data, from the full text articles on phage display in PDF format. The articles were chosen to represent a small but diverse selection of reports within this field.
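The extraction of short peptides from precursor entries can be sketched as follows. This is a minimal illustration, not the code used to build PepBank: the feature table is modelled as a plain list of dictionaries with hypothetical `key`, `start` and `end` fields (1-based, inclusive), the precursor sequence is a toy example, and the `max_len` default of 20 is an assumption that can be lowered if a strict "under 20" cut-off is intended.

```python
from typing import Dict, List, Union

Feature = Dict[str, Union[str, int]]

WANTED_KEYS = {"peptide", "chain"}


def extract_short_peptides(
    sequence: str, features: List[Feature], max_len: int = 20
) -> List[str]:
    """Return subsequences for peptide/chain features up to max_len long."""
    peptides = []
    for feat in features:
        if str(feat["key"]).lower() not in WANTED_KEYS:
            continue
        start, end = int(feat["start"]), int(feat["end"])
        length = end - start + 1
        if 1 <= length <= max_len:
            # Convert 1-based inclusive coordinates to a Python slice.
            peptides.append(sequence[start - 1:end])
    return peptides


# Toy precursor: a 10-residue leader, a 9-residue peptide, a 6-residue tail.
precursor = "MKTLLLAVAG" + "CYIQNCPLG" + "GKRAAA"
feature_table = [
    {"key": "SIGNAL", "start": 1, "end": 10},
    {"key": "PEPTIDE", "start": 11, "end": 19},
    {"key": "CHAIN", "start": 1, "end": 25},  # too long, skipped
]

print(extract_short_peptides(precursor, feature_table))
```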

Utility and discussion

User interface

The web-based user interface to PepBank offers text search (both Quick and Advanced), as well as sequence similarity search (BLAST and Smith-Waterman algorithms). The Quick Search function offers a simple, Google-like search for biologists looking for peptide data in all fields. Advanced Search options include querying data by individual fields. Exact search, wildcard (*) and any single character (_) are supported in text search, which enables, for example, searching for a sequence pattern as a query. The results of the text search are displayed as a table sortable in the browser, with hyperlinks to the original sources (MEDLINE/PubMed, ASPD, UniProt) and to more detailed information.
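To make the wildcard semantics concrete, the sketch below translates a query containing `*` (any run of characters) and `_` (any single character) into a regular expression. How PepBank implements this internally is not described here, so the function name and the choice to match the whole field rather than a substring are assumptions.

```python
import re
from typing import Pattern


def wildcard_to_regex(query: str) -> Pattern[str]:
    """Translate '*' and '_' wildcards into a compiled regular expression."""
    parts = []
    for ch in query:
        if ch == "*":
            parts.append(".*")   # any run of characters, possibly empty
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))  # literal character
    return re.compile("".join(parts))


def matches(query: str, text: str) -> bool:
    """True if the whole text matches the wildcard query."""
    return wildcard_to_regex(query).fullmatch(text) is not None


print(matches("R_D", "RGD"))         # one residue between R and D
print(matches("GETRA*", "GETRAPL"))  # prefix search
print(matches("R_D", "RGGD"))        # '_' does not span two residues
```

Escaping every literal character keeps symbols such as parentheses in a query from being interpreted as regular expression syntax.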

Text search example: VEGFR related peptides

To illustrate the utility of PepBank, we use the example of identifying peptides with affinity to VEGFR1, an important therapeutic target [58]. The user can search for VEGFR using either Quick or Advanced Search, obtain a set of peptide sequences related to this target, and view details for the selected sequences. In the example shown in Figure 2, sequence 'WHSDMEWWYLLG' is identified [59]. Prompted by these results, the user of PepBank may be interested in testing this peptide sequence in novel forms (for example, dendrimers, or conjugated to nanoparticles), or for novel biomedical applications (imaging different tumor types, atherosclerosis, or arthritis). There is currently no database where the user can easily obtain such information as it relates to molecular targets and peptide sequences. One can also query directly for a biological process (such as apoptosis or angiogenesis) or for the target cell line or tissue (such as BICR-H1 or U937).

Figure 2

Web-based user interface of PepBank. Illustration of a typical user workflow. The user enters the query with Quick or Advanced Search. The results are returned in a table sortable in the browser. The user selects the entry or entries of interest. The sequence in the example shown was obtained by text mining and was then manually curated. The score, between 0 and 1, reflects the degree of confidence in the interaction (a higher score indicates more confidence). Manually curated entries receive a higher score than entries from automated text mining.

To determine whether the database would yield target leads against known drug targets, we randomly chose a set of 20 defined drug targets from the data set of 547 approved drug targets in DrugBank [60]. The randomly chosen drug targets were not skewed towards peptide receptors and included: squalene epoxidase, RAF proto-oncogene serine/threonine-protein kinase, muscarinic acetylcholine receptor M4, opioid mu receptor (OP3), adenosine A1 receptor, GABA transaminase, amidophosphoribosyltransferase precursor, tryptophan 5-hydroxylase 1, apoptosis regulator Bcl-2, matrix protein M2, vascular endothelial growth factor receptor 2 precursor, amiloride-sensitive sodium channel gamma-subunit, ribonucleotide reductase, cAMP phosphodiesterase, coagulation factor VIII, high affinity immunoglobulin epsilon receptor alpha-subunit precursor, retinol-binding protein I, glycine alpha 2 receptor, cytochrome P450 51, GABA-A receptor subunit (C. elegans). Relevant peptides were defined as those interacting with the target or its ortholog, or modulating the function of the target, for example by acting as a competitor. Relevant peptides were identified in our database for approximately 25% of the above drug targets.

Sequence similarity search examples

As an illustrative example, we performed an all-against-all BLAST search of PepBank sequences. One of the surprises was the discovery of an exact match to sequence 'GETRAPL' from phage display selection for peptides that bind to secreted protein acidic and rich in cysteine (SPARC) [61]. The sequence had a BLAST hit with an E-value of 0.06 to an isolate from phage display selection of peptides that bind human saphenous vein smooth muscle cells [62]. Following the BLAST results, we then found that in addition to these 2 selections, the exact same sequence was isolated independently multiple times by different groups in selections with unrelated targets. GETRAPL was found in phage display selections of peptides that bind human immunodeficiency virus type 1 (HIV-1) accessory viral protein (Vpr) [63], chromatin high mobility group protein 1, box A (HMGB1) from rat [64], mouse skeletal muscle tissue in vivo [65], and mouse brain cells in vivo [66].

We suggest that one of the utilities for PepBank is to search the peptide sequences of interest to the user with BLAST or Smith-Waterman algorithms to find any important similarities to the known peptides collected in our database. In this example, the search can be used to remove a relatively nonspecific binder GETRAPL. Note that searching PepBank with these tools is a unique resource: an exact match may be easy to find, but using a partial match such as GETRA as a query finds GETRAPL only in PepBank, but not in PubMed [2] or on Google. Searching with BLAST [67] or with Smith-Waterman/SSEARCH methods [47] using GETRAPL as a query against nr database [2] gives no peptide hits cited above. A large interactions database IntAct [6] gives no hits for GETRAPL query at all.
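For readers unfamiliar with local alignment, the following Python sketch shows the core of a Smith-Waterman score calculation and uses it to rank a toy list of sequences taken from this paper. It is a simplified illustration: it uses a flat match/mismatch score and a linear gap penalty (all three values are assumptions), whereas production tools such as SSEARCH use amino acid substitution matrices and report statistical significance.

```python
from typing import List, Tuple


def smith_waterman_score(
    a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2
) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def rank_hits(query: str, database: List[str]) -> List[Tuple[str, int]]:
    """Score the query against every sequence, best hits first."""
    scored = [(seq, smith_waterman_score(query, seq)) for seq in database]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy database built from sequences mentioned in this paper.
toy_db = ["WHSDMEWWYLLG", "SVSVGMKPSPRP", "GETRAPL"]

for seq, score in rank_hits("GETRA", toy_db):
    print(seq, score)
```

With this scoring, the partial query GETRA aligns to GETRAPL over all five residues, so that sequence ranks first.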

Another surprise discovery in the all-against-all BLAST search of PepBank sequences was the multiple occurrence of the sequence SVSVGMKPSPRP. The sequence had several exact matches over its entire length of 12 amino acids, with an E-value of 1 × 10⁻⁶. It was isolated in phage display selection for peptides that bind to DNA [68]. In this selection SVSVGMKPSPRP was the only sequence studied due to its dominance (9 out of 10) in the selected pool. The exact same sequence was isolated in phage display selection for peptides binding to human monoclonal IgM [69], and to the mirror image of Alzheimer's disease amyloid peptide Abeta(1–42) [70]. The sources for these sequences were MEDLINE abstract text mining, ASPD database, and manually curated full text articles, respectively. In addition, SVSVGMKPSPRP occurs in several patents [71, 72]. Several groups note multiple isolation of this remarkable sequence in their own and other, unrelated, experiments [73, 74]. The sequence has also been identified in a recent excellent review [24] which covers the important topic of target-unrelated sequences in phage display. Interestingly, all of the studies with both GETRAPL and SVSVGMKPSPRP were done with the phage display libraries from the same manufacturer, thus suggesting a library- or methodology-specific phenomenon. Both sequences illustrate one of the suggested utilities for PepBank, namely that one can search it with a sequence query using BLAST or Smith-Waterman algorithms to find any important similarities to the known peptides.

Conclusion

A new text mining tool was developed and used to identify peptide sequences in MEDLINE abstracts. These data were combined with two of the public sources of peptide sequence data, ASPD and UniProt, as well as with manually curated peptide data. The database application was developed to query the data using text and sequence similarity search through a web-based user interface. The utility of PepBank was demonstrated using different examples of peptide sequences. The results show that the database has valuable biological and medical applications. In the future, we plan to add other public sources of peptide data, such as the peptide subset of the Molecular Interaction database (MINT) [5], and other sources for text mining, such as full-text journal articles. Also, in the future we will apply machine learning techniques to improve the accuracy of text mining to extract sequences. In the next release, we plan to add the functionalities to download the data in a standard format, such as PSI MI, and to search the database for peptide motifs.

Availability and requirements

The database is freely available at http://pepbank.mgh.harvard.edu/, and the text mining source code (Peptide::Pubmed) is freely available at the same address as well as on CPAN (http://www.cpan.org/).