Optimization of filtering criterion for SEQUEST database searching to improve proteome coverage in shotgun proteomics

Jiang, Xinning; Jiang, Xiaogang; Han, Guanghui; Ye, Mingliang; Zou, Hanfa

doi:10.1186/1471-2105-8-323

Research article
Open access
Published: 31 August 2007

Volume 8, article number 323, (2007)

BMC Bioinformatics

Abstract

Background

In proteomic analysis, MS/MS spectra acquired by a mass spectrometer are assigned to peptides by database searching algorithms such as SEQUEST. The assignments of peptides to MS/MS spectra by SEQUEST are characterized by several scores, including Xcorr, ΔCn, Sp, Rsp and matched ion count. A filtering criterion built from several of these scores is used to separate correct identifications from random assignments. However, the filtering criterion has not been well optimized so far.

Results

In this study, we implemented a machine learning approach known as predictive genetic algorithm (GA) to optimize filtering criteria so as to maximize the number of identified peptides at a fixed false-discovery rate (FDR) for SEQUEST database searching. Because the FDR was determined directly by a decoy database search scheme, the GA-based optimization did not require any prior knowledge of the characteristics of the data set, which is a significant advantage over statistical approaches such as PeptideProphet. Compared with PeptideProphet, the GA-based approach achieved similar performance in distinguishing true from false assignments with only 1/10 of the processing time. Moreover, the GA-based approach can be easily extended to process other database search results, as it does not rely on any assumption about the data.

Conclusion

Our results indicate that filtering criteria should be optimized individually for different samples. The newly developed GA-based software provides a convenient and fast way to create tailored optimal criteria for different proteome samples and thereby improve proteome coverage.

Background

Because of its high sensitivity, mass spectrometry has been widely used for protein identification and characterization in proteome research over the past decade[1, 2]. The shotgun proteomics approach, which is based on liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS), can be applied to analyze complex protein mixtures directly, without any prior purification step. Large-scale proteome profiling using multidimensional LC-MS/MS is increasingly applied to the analysis of many biological samples, including various mammalian tissues, cell lines, and serum/plasma [3–8]. In shotgun proteomics, complex protein mixtures are first digested by an enzyme (e.g. trypsin) to produce peptide mixtures. The peptide mixtures are then subjected to extensive separation, such as strong cation exchange chromatography (SCX) coupled on-line or off-line with reversed-phase capillary LC (RPLC). Peptides eluting from the reversed-phase capillary LC column are sprayed into a tandem mass spectrometer to produce MS/MS spectra. Finally, peptide sequences are assigned to the experimental MS/MS spectra by a database searching algorithm.

SEQUEST[9], Mascot[10] and other database searching algorithms match experimental spectra against theoretical spectra generated in silico from peptide sequences, and then calculate scores to evaluate how well they match. These scores help discriminate between correct and incorrect peptide assignments. One of the major issues in database searching for proteome analysis is determining the false-discovery rate (FDR) of the identifications. The FDR is the rate at which significant identifications are actually null[11]. A variety of methods have been developed to determine the FDR for peptide identifications. Some efforts have focused on statistical analysis methods [11–17] that estimate the probability that an identification is correct, e.g. PeptideProphet[12]. These methods often need complicated statistical algorithms. A simpler way to evaluate the FDR is the decoy database approach introduced by Peng et al[18]. In this method, the FDR is determined by searching a composite database that includes the original protein database and its reversed version. Statistically, the probability that a peptide is identified incorrectly from the reversed database is expected to be the same as the probability that it is identified incorrectly from the original protein database, because the two databases are the same size [19–21]. Therefore, the FDR can be calculated using the following equation:

FDR = 2*n(rev)/(n(rev)+n(forw)), (1)

where n(forw) and n(rev) are the numbers of peptides identified in proteins with forward (original) and reversed sequences, respectively[18, 22]. The factor of 2 reflects the expectation that every reversed hit is accompanied by one equally likely incorrect forward hit. The database searching strategy using a composite database is also known as the reversed database searching strategy. Because it is simple to use, it has been widely applied to the evaluation of proteomic search results[18, 22–26], including research on post-translational modifications (PTM)[19, 27, 28].
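
As an illustration of Equation (1), the calculation can be sketched in a few lines of Python. This is a minimal sketch and not part of the published software; the function name `decoy_fdr` and the convention of returning 0 when no identification passes the filter are assumptions made for the example.

```python
def decoy_fdr(n_forw: int, n_rev: int) -> float:
    """Estimate the FDR of Equation (1) from forward and reversed hit counts."""
    total = n_forw + n_rev
    if total == 0:
        # Assumption for this sketch: no identifications means an FDR of 0.
        return 0.0
    return 2.0 * n_rev / total


# Hypothetical counts: 990 forward hits and 10 reversed hits.
print(decoy_fdr(990, 10))  # 0.02
```

With these hypothetical counts, 10 reversed hits imply roughly 10 incorrect forward hits, so about 20 of the 1,000 identifications are expected to be wrong.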

SEQUEST[9] is one of the most commonly used database searching algorithms. It first counts the peaks common to the experimental and theoretical spectra and computes a preliminary score (Sp). It then selects a proportion of the top candidate peptides, based on the rank of the preliminary score (Rsp), for cross-correlation analysis. Thus, for each candidate peptide identification, several scores and rankings are determined. To distinguish correct from incorrect identifications, filters using a set of database searching scores are applied, including two commonly used scores, Xcorr and ΔCn. To evaluate the FDR of the identifications, reversed database searching can be performed and the FDR determined by Equation (1). To control the FDR, many research groups use fixed Xcorr values and manually increase ΔCn to obtain peptide identifications with a specific FDR[36–38].

The corresponding FDRs were determined as 0.75% and 1.13% for these two datasets by employing the reversed database searching strategy. PeptideProphet identified 29,951 and 14,101 peptides for the liver tissue sample and the plasma sample, respectively. The numbers of peptides identified by SFOER for the two human proteome samples were nearly the same (29,934 vs 29,951 for the liver tissue sample and 14,218 vs 14,101 for the plasma sample). There was a 91.2% overlap between the peptide identifications of PeptideProphet and SFOER, which means that the majority of the identified peptides were the same for both approaches (Figure 4). A detailed comparison of the performance of the conventional criteria, PeptideProphet and SFOER on human liver lysate is shown in Table 1, which also gives the total numbers of identified proteins. Because of the increase in peptide identifications, the number of protein identifications also increased markedly when SFOER was used.

Compared with the conventional approach, the number of identified peptides increased significantly when the filtering criteria optimized by SFOER were applied. One concern is whether the additional peptide identifications are true identifications. For the dataset from the human liver tissue sample, 5,588 extra peptide identifications were obtained when the filtering criteria optimized by SFOER were applied. It is impractical to validate all of these peptide identifications manually. A practical alternative is to randomly select a small portion of the additional identifications and manually check their spectra. Thus 300 of the 5,588 extra peptide identifications were randomly selected. Each of these spectra was assessed for an acceptable signal-to-noise ratio and the presence of at least three consecutive b or y ion fragments[39]. In the end, 98.3% (295 out of 300) of these peptides were judged to be true positives, so the error rate in this sample (1.7%) was very close to the overall predicted FDR. It was also found that 84% (4,693 out of 5,588) of the additional peptides were detected by PeptideProphet at a probability cutoff of 0.9, for which the empirical error rate was 1.1%. These results clearly demonstrate that the additional peptide identifications obtained by SFOER are reliable. (MS/MS spectra of the additional peptide identifications obtained with our optimized criteria can be downloaded from our website[40].)

The classification performance of SFOER was further validated with a standard protein mixture. A tryptic digest of seven standard proteins was used as the sample. The acquired MS/MS spectra were searched against a composite database containing the forward and reversed sequences of all control proteins (including trypsin) as well as the forward and reversed protein sequences from yeast, chosen for its low homology with readily available control proteins. Because the proteins present in the sample were known, correct and incorrect peptide assignments could be distinguished simply by whether or not they came from the known standard proteins. Thus the actual FDR, i.e. the observed FDR, could be determined as the percentage of peptide identifications not from standard proteins among all peptide identifications, while the predicted FDR was determined by Equation (1). Unless otherwise stated, FDR refers to the predicted FDR. The classification performance of SFOER could be evaluated by comparing the actual and predicted FDR.
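
The comparison between observed and predicted FDR described above can be sketched as follows. This is only an illustrative sketch: the source labels `"standard"`, `"yeast_forward"` and `"reversed"` and the function name are hypothetical, and the sketch assumes that reversed hits are counted among "all peptide identifications" when the observed FDR is computed, which the text does not spell out.

```python
from collections import Counter


def observed_and_predicted_fdr(sources):
    """Return (observed FDR, predicted FDR) for a list of identification sources.

    Each element of `sources` is one of the hypothetical labels
    "standard", "yeast_forward" or "reversed".
    """
    counts = Counter(sources)
    n_std = counts["standard"]
    n_yeast = counts["yeast_forward"]
    n_rev = counts["reversed"]
    total = n_std + n_yeast + n_rev
    if total == 0:
        return 0.0, 0.0
    # Observed FDR: identifications that are not from the known standard proteins.
    observed = (n_yeast + n_rev) / total
    # Predicted FDR, Equation (1): forward hits are standard plus yeast forward hits.
    predicted = 2.0 * n_rev / total
    return observed, predicted


# Hypothetical filtered result: 95 standard, 3 yeast forward and 2 reversed hits.
example = ["standard"] * 95 + ["yeast_forward"] * 3 + ["reversed"] * 2
print(observed_and_predicted_fdr(example))  # (0.05, 0.04)
```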

LC-MS/MS analyses of the digest of the 7 standard protein mixture resulted in a collection of 105,000 spectra. The performance of SFOER was compared with that of PeptideProphet using this standard protein dataset. A series of filtering criteria was optimized by SFOER with the FDR increasing from 0.005 to 0.32. Peptide identifications with different confidence levels were then generated by applying these optimized criteria. For PeptideProphet, the probability threshold was adjusted manually to generate peptide identifications with different FDRs. The number of correct peptide identifications (peptides from standard proteins) and the number of incorrect peptide identifications (peptides from forward protein sequences in the yeast database) are shown in Figure 5A. As the FDR increased, SFOER showed nearly the same performance as PeptideProphet, apart from a slight improvement in the number of correct peptide identifications, while PeptideProphet showed a small advantage in excluding incorrect peptide identifications. Figure 5B plots the observed FDR as a function of the predicted FDR. The observed and predicted FDR matched very well for both SFOER and PeptideProphet, although the observed FDR was slightly higher in both cases. This is probably because our evaluation method did not take common contaminants such as keratins into account. On the basis of these results, the reversed database searching strategy provided a reasonable estimate of the actual error. The optimization by SFOER based on the reversed database strategy was therefore reasonable, and the FDR of peptide identifications evaluated by the reversed database strategy essentially reflects the actual FDR.

GA is a very efficient algorithm and is widely used to search for optimal or near-optimal solutions, and SFOER, which employs GA, should inherit this advantage. Approximately 277,000 spectra (12 LC-MS/MS runs) were processed separately by PeptideProphet and SFOER on a Pentium 4 (3.0 GHz) computer. The optimization procedure using SFOER took less than 4 min (10 s for 1+, 100 s for 2+ and 99 s for 3+), while the probability calculation by PeptideProphet took about 38 min. The IO procedures took about 40 min and 28 min for PeptideProphet and SFOER, respectively. For PeptideProphet, they consisted of assembling peptides from out files into html files and converting the files from html to xml format; for SFOER, they only included assembling peptides from out files into plain text files. Evidently, SFOER was much faster than PeptideProphet: it needed only about 1/10 of the time to search for the optimal criteria (IO procedures excluded).

For a model-based algorithm like PeptideProphet, accuracy relies on how well the empirical model fits the obtained datasets. If the model accurately reflects the physical processes by which the data are generated, it can work well even with a small amount of training data. On the other hand, if the data diverge significantly from the model, classification errors proportional to the degree of divergence result. SFOER is less risky in this respect because it does not rely on a model: it requires neither prior knowledge of the properties of the dataset nor assumptions about it. Therefore, this approach is equally applicable to many datasets with different characteristics. There is, however, one requirement for applying SFOER. Because the FDR of the peptide identifications is needed during the optimization, SFOER can only process database search results obtained with a decoy database.

SFOER can also be easily extended to some special applications with slight revision. Currently, SFOER takes only several SEQUEST scores, such as Xcorr, ΔCn, Sp and Rsp, as its weights. It has been reported that some peptide properties obtained from proteome analysis experiments can be used to increase the confidence of peptide identifications. These properties include the pI values obtained from isoelectric focusing (IEF)[41], the hydrophobicity or elution times obtained from reversed-phase LC separation (NET)[24], and the highly accurate masses obtained with an FT mass spectrometer[42]. In principle, these properties can be optimized simultaneously with the SEQUEST scores as filtering criteria by this software suite, and a significant improvement in proteome coverage is expected. Although SFOER was developed to optimize filtering criteria for SEQUEST database searches, after slight revision it should also be readily applicable to other database search engines such as Mascot, as long as the decoy database search strategy is applied.

Conclusion

A software suite named SFOER was developed using a predictive genetic algorithm (GA) to optimize the filtering criterion for SEQUEST database searching. The optimization is based on the reversed database search, from which the FDR can be easily determined. It was demonstrated that SFOER was able to maximize the number of identified peptides without increasing the FDR. Compared with the statistical approach PeptideProphet, SFOER has nearly the same classification performance but requires much less processing time. Moreover, as it does not rely on possibly unfounded assumptions about the data, SFOER can create tailored criteria for datasets obtained from different samples, generated by different mass spectrometers, or even searched with different database searching algorithms (in which case the weights need to be altered).

Methods

Materials and reagents

Magic C18AQ (5 μm, 100 Å pore size) was purchased from Michrom BioResources (Auburn, CA, USA), and Polysulfoethyl Aspartamide (5 μm, 200 Å pore size) was from PolyLC Inc (Columbia, MD, USA). PEEK tubing, sleeves, microtee and microcross were obtained from Upchurch Scientific (Oak Harbor, WA, USA). Fused-silica capillaries (50, 75 and 100 μm I.D.) were purchased from Polymicro Technologies (Phoenix, AZ, USA). All the water used in the experiments was purified using a Milli-Q system (Millipore, Bedford, MA, USA). Dithiothreitol (DTT) and iodoacetamide were purchased from Sino-American Biotechnology Corporation (Beijing, China). Urea, ammonium acetate, ammonium bicarbonate and acetic acid were obtained from Sigma (St. Louis, MO, USA). Trypsin was from Promega (Madison, WI, USA). Tris was from Amresco (Solon, Ohio, USA). Formic acid was obtained from Fluka (Buchs, Switzerland). Acetonitrile (ACN, HPLC grade) was from Merck (Darmstadt, Germany). Protease inhibitor cocktail tablets (Complete Mini) were purchased from Roche.

Sample preparation

Human blood plasma was obtained from one healthy male donor (age 37, blood type O) and was provided by Zhuanghe Blood Center (Dalian, China). An initial protein concentration of ~95 mg/mL was determined in plasma using the Bradford method. Human liver tissue was homogenized in lysis buffer (40 mM Tris, 6 M guanidine HCl, 65 mM DTT, 310 mM NaF, 3.45 mM NaVO₃, protease inhibitor cocktail) and then sonicated for 180 s, followed by centrifugation at 25,000 g for 1 h. The supernatant was collected as the protein sample and its concentration was determined by the Bradford assay.

The human plasma sample and human liver tissue lysate were reduced by DTT and alkylated by iodoacetamide. The solutions were then diluted to 1 M guanidine-HCl and the pH was adjusted to 8.1. Finally, trypsin was added (trypsin:protein, 1:50) and the protein samples were incubated at 37°C for 20 h. Tryptic digests were desalted with a C18 solid-phase cartridge.

Tryptic digests of standard proteins were prepared by digesting 500 pmol of reduced, iodoacetamide-alkylated bovine serum albumin, horse myoglobin, horse cytochrome c, chicken ovalbumin, human hemoglobin, bovine β-casein and bovine α-casein. Bovine serum albumin was purchased from Roche and all other standard proteins were from Sigma-Aldrich. These digests were pooled to prepare the seven-protein digest mixture. The final concentrations of these proteins ranged from 16 to 300 fmol per microliter.

LC-MS/MS analysis and database search

The configurations for 1D and 2D LC-MS/MS analysis were set up as reported previously[34]. A Finnigan LTQ linear ion trap mass spectrometer (Thermo, San Jose, CA) was coupled with capillary reversed-phase LC for the collection of MS/MS spectra. The tryptic digest of the 7 standard proteins was analyzed by 1D LC-MS/MS with 7 replicate runs, and the human sample digests were analyzed by 2D LC-MS/MS.

The acquired MS/MS spectra were searched using Turbo SEQUEST in the BioWorks 3.2 software suite (Thermo Finnigan, San Jose, CA). For the 7 standard proteins, the database was a composite of the protein sequences from yeast (9,492 entries) in forward and reversed orientation together with the forward and reversed sequences of all control proteins, trypsin and α-s2-casein (to account for the impurity in α-casein). The database used for the two human proteome samples was a composite of the normal IPI human database (v3.04, 49,078 entries) from the European Bioinformatics Institute with the reversed version of the same database appended at the end. MS/MS spectra were searched using fully tryptic cleavage constraints, and up to two missed cleavage sites were allowed. A static modification of +57.0215 Da was set on cysteine residues and a variable modification of +15.9949 Da on methionine residues. Mass tolerances were 2 Da for peptides and 1 Da for fragments. FDR was determined by Equation (1).
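
The construction of such a composite database, with the reversed version appended after the forward entries, can be sketched as below. This is a minimal sketch under stated assumptions and not the procedure used by BioWorks or by the authors' scripts: the function name, the in-memory dictionary input and the `REV_` identifier prefix are hypothetical, and FASTA parsing and writing are left out.

```python
def build_composite_database(proteins, decoy_prefix="REV_"):
    """Return forward entries followed by their reversed (decoy) counterparts.

    `proteins` maps a protein identifier to its amino acid sequence.
    The result is a list of (identifier, sequence) pairs.
    """
    forward = list(proteins.items())
    # Each decoy is the forward sequence read backwards, tagged by a prefix
    # (the prefix is an assumption of this sketch) so hits can be told apart.
    decoys = [(decoy_prefix + pid, seq[::-1]) for pid, seq in forward]
    return forward + decoys


# Hypothetical two-protein database.
toy = {"P1": "MKTAYIAK", "P2": "GSHMLEDR"}
for identifier, sequence in build_composite_database(toy):
    print(identifier, sequence)
```

Because every forward entry has exactly one reversed counterpart of the same length, the two halves of the composite database have the same size, which is the property that Equation (1) relies on.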

Development of software suite SFOER using GA

A Java software suite named SFOER was developed to optimize filtering criteria using GA[29]. In GA, genes (here, the SEQUEST score cutoffs of a criterion) are generally encoded as binary strings consisting only of 0 and 1. A chromosome is a single binary string in which the encoded genes are concatenated one after another. Each chromosome in a generation is called an individual. In our GA, four cutoff values, for Xcorr, ΔCn, Sp and Rsp, were each encoded as binary strings, and the chromosome representing a filtering criterion was encoded as a 30-bit string. Details are shown in Table 4.

Table 4 Parameter settings for the genetic algorithm
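
The binary encoding can be illustrated with a short Python sketch. The bit widths and value ranges below are hypothetical placeholders chosen only so that the four genes add up to 30 bits; they are not the settings of Table 4, and the linear scaling used for decoding is likewise an assumption of this sketch.

```python
# (gene name, number of bits, lowest cutoff, highest cutoff)
# Hypothetical layout: 8 + 7 + 9 + 6 = 30 bits. Not the values of Table 4.
GENES = [
    ("xcorr", 8, 0.0, 5.0),
    ("delta_cn", 7, 0.0, 0.5),
    ("sp", 9, 0.0, 1000.0),
    ("rsp", 6, 1.0, 64.0),
]
CHROMOSOME_BITS = sum(bits for _, bits, _, _ in GENES)


def decode(chromosome: str) -> dict:
    """Decode a 30-character string of 0s and 1s into four cutoff values."""
    if len(chromosome) != CHROMOSOME_BITS or set(chromosome) - {"0", "1"}:
        raise ValueError("chromosome must be a 30-character string of 0s and 1s")
    criterion = {}
    position = 0
    for name, bits, low, high in GENES:
        raw = int(chromosome[position:position + bits], 2)
        # Map the integer 0 .. 2**bits - 1 linearly onto [low, high].
        criterion[name] = low + (high - low) * raw / (2 ** bits - 1)
        position += bits
    return criterion


print(decode("1" * 30))
```

A finer or coarser grid of candidate cutoffs is obtained simply by assigning more or fewer bits to a gene.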

Defining a fitness function for evaluating the individual members of a population is perhaps the most crucial step in designing a genetic algorithm. The goal in this study was to derive optimized filtering criteria that achieve maximal separation between correct and incorrect peptide identifications and maximal sensitivity for true positive peptide identifications at a specified confidence level (e.g. >99%). However, in most proteome research the total number of true positive peptides is unknown. We therefore used the following fitness function:

F(p) = n(p), (2)

where F(p) is the fitness value of a given filtering criterion p, which consists of several cutoff values for different scores, and n(p) is the overall number of positive peptide identifications that pass this filtering criterion. When the FDR of the peptide identifications filtered by a criterion was higher than the specified value, the fitness of that criterion was set to zero. This function thus reflects the sensitivity of a specific criterion.
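
A fitness evaluation along the lines of Equation (2) might look like the sketch below. Several details are assumptions made for illustration and are not taken from SFOER itself: the dictionary field names, the direction of each cutoff (scores at or above the cutoff pass, while the rank Rsp must be at or below its cutoff), the choice to count every passing identification, forward or reversed, in n(p), and the default FDR limit of 0.01.

```python
def passes(psm: dict, criterion: dict) -> bool:
    """Check one peptide identification against the four cutoffs."""
    return (
        psm["xcorr"] >= criterion["xcorr"]
        and psm["delta_cn"] >= criterion["delta_cn"]
        and psm["sp"] >= criterion["sp"]
        and psm["rsp"] <= criterion["rsp"]  # Rsp is a rank, so lower is better
    )


def fitness(psms: list, criterion: dict, max_fdr: float = 0.01) -> int:
    """F(p) = n(p), set to zero when the FDR of Equation (1) exceeds max_fdr."""
    kept = [psm for psm in psms if passes(psm, criterion)]
    if not kept:
        return 0
    n_rev = sum(1 for psm in kept if psm["is_decoy"])
    fdr = 2.0 * n_rev / len(kept)  # len(kept) = n(rev) + n(forw)
    return 0 if fdr > max_fdr else len(kept)


# Hypothetical search results: three forward hits and one reversed hit.
results = [
    {"xcorr": 3.1, "delta_cn": 0.30, "sp": 600.0, "rsp": 1, "is_decoy": False},
    {"xcorr": 2.8, "delta_cn": 0.25, "sp": 550.0, "rsp": 1, "is_decoy": False},
    {"xcorr": 2.5, "delta_cn": 0.20, "sp": 500.0, "rsp": 2, "is_decoy": False},
    {"xcorr": 1.2, "delta_cn": 0.05, "sp": 200.0, "rsp": 9, "is_decoy": True},
]
strict = {"xcorr": 2.0, "delta_cn": 0.1, "sp": 300.0, "rsp": 5}
loose = {"xcorr": 1.0, "delta_cn": 0.0, "sp": 100.0, "rsp": 10}
print(fitness(results, strict))  # 3
print(fitness(results, loose))   # 0
```

The loose criterion lets the reversed hit through, which pushes the estimated FDR to 0.5 and therefore zeroes its fitness even though it keeps more identifications.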

The genetic algorithm optimizes within a cycle of several stages: creation of a population of individuals (criteria), evaluation of these individuals, selection of individuals, and breeding aided by genetic manipulation to create the offspring population (schematic shown in Figure 6):

1. Creation of the starting population: The initial population, the starting point of the genetic algorithm, was randomly generated. Each complete chromosome was assembled from the cutoffs of a fixed number of different SEQUEST scores, and the population size was set to 100.
2. Selection: Roulette wheel selection was used to determine each individual's probability of reproduction and breeding, following the policy that the fitter a parent chromosome is, the more descendants carrying its chromosome are produced. When the fitness of an individual became zero, the individual was marked as dead and replaced by a newly initialized individual.
3. Genetic manipulation: Two new chromosomes were then bred by single-point cross-over, in which the parents exchange genes at one randomly chosen point along the length of the chromosome, analogous to natural cross-over. The cross-over rate was set to 0.2. A point mutation, in which a binary character is changed from 1 to 0 or vice versa, was subsequently applied at a rate of 0.01.

Steps 2 and 3 were repeated until the optimization terminated. A stop criterion was not pre-defined, owing to the limited knowledge of the search space. In this study, optimizations were terminated after a specific number of generations, which can be set manually.
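
The cycle described in steps 1 to 3 can be sketched in Python as follows. This is an illustrative sketch rather than the SFOER implementation: the function names, the fixed random seed, the default of 50 generations, the per-bit application of the mutation rate and the uniform fallback when every individual has zero fitness are all assumptions. The population size (100), cross-over rate (0.2), mutation rate (0.01) and chromosome length (30 bits) follow the text.

```python
import random


def random_chromosome(rng, n_bits):
    return "".join(rng.choice("01") for _ in range(n_bits))


def roulette_select(rng, population, scores):
    """Pick one individual with probability proportional to its fitness."""
    if sum(scores) == 0:
        return rng.choice(population)  # fallback assumed for this sketch
    return rng.choices(population, weights=scores, k=1)[0]


def crossover(rng, a, b, rate):
    """Single-point cross-over: swap the tails of two parents at one point."""
    if rng.random() < rate:
        point = rng.randrange(1, len(a))
        return a[:point] + b[point:], b[:point] + a[point:]
    return a, b


def mutate(rng, chromosome, rate):
    """Flip each bit independently with probability `rate`."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in chromosome
    )


def run_ga(fitness, n_bits=30, pop_size=100, generations=50,
           crossover_rate=0.2, mutation_rate=0.01, seed=0):
    rng = random.Random(seed)
    population = [random_chromosome(rng, n_bits) for _ in range(pop_size)]
    best, best_score = population[0], fitness(population[0])
    for _ in range(generations):
        # Individuals with zero fitness are "dead" and replaced by new random ones.
        population = [c if fitness(c) > 0 else random_chromosome(rng, n_bits)
                      for c in population]
        scores = [fitness(c) for c in population]
        for c, s in zip(population, scores):
            if s > best_score:
                best, best_score = c, s
        offspring = []
        while len(offspring) < pop_size:
            a = roulette_select(rng, population, scores)
            b = roulette_select(rng, population, scores)
            a, b = crossover(rng, a, b, crossover_rate)
            offspring += [mutate(rng, a, mutation_rate), mutate(rng, b, mutation_rate)]
        population = offspring[:pop_size]
    for c in population:  # score the final generation as well
        s = fitness(c)
        if s > best_score:
            best, best_score = c, s
    return best, best_score
```

In use, `fitness` would decode a chromosome into cutoff values, filter the search results and return the number of passing identifications, or zero when the FDR limit is exceeded; any callable that maps a bit string to a non-negative number can be plugged in to try the loop.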

All database search results were processed by SFOER to generate optimized criteria at different confidence levels, and peptide identifications were then filtered by these sets of criteria. PeptideProphet, which was downloaded as part of the Trans-Proteomic Pipeline (TPP)[43] from the Seattle Proteome Center, was also used to process these datasets. All peptides assigned by database searching were parsed by PeptideProphet to generate PeptideProphet probabilities using default parameters. The peptide probability threshold was adjusted manually to generate peptide identifications with different confidence levels.

Availability and requirements

SFOER was developed using the Java 2 Platform Standard Edition (J2SE) Development Kit 5.0 (Sun Microsystems, Inc) and is platform independent. Java Runtime Environment 1.5.0 or higher is required. It is distributed under the GNU General Public License (GPL) and is available at http://bioanalysis.dicp.ac.cn/proteomics/software/SFOER.html.

References

Aebersold R, Mann M: Mass spectrometry-based proteomics. Nature. 2003, 422 (6928): 198-207. 10.1038/nature01511.
Yates JR: Mass spectral analysis in proteomics. Annu Rev Biophys Biomolec Struct. 2004, 33: 297-316. 10.1146/annurev.biophys.33.111502.082538.
Koller A, Washburn MP, Lange BM, Andon NL, Deciu C, Haynes PA, Hays L, Schieltz D, Ulaszek R, Wei J, Wolters D, Yates JR: Proteomic survey of metabolic pathways in rice. Proc Natl Acad Sci U S A. 2002, 99 (18): 11969-11974. 10.1073/pnas.172183199.
Wu CC, MacCoss MJ, Howell KE, Yates JR: A method for the comprehensive proteomic analysis of membrane proteins. Nat Biotechnol. 2003, 21 (5): 532-538. 10.1038/nbt819.
Florens L, Washburn MP, Raine JD, Anthony RM, Grainger M, Haynes JD, Moch JK, Muster N, Sacci JB, Tabb DL, Witney AA, Wolters D, Wu YM, Gardner MJ, Holder AA, Sinden RE, Yates JR, Carucci DJ: A proteomic view of the Plasmodium falciparum life cycle. Nature. 2002, 419 (6906): 520-526. 10.1038/nature01107.
Jessani N, Niessen S, Wei BQQ, Nicolau M, Humphrey M, Ji YR, Han WS, Noh DY, Yates JR, Jeffrey SS, Cravatt BF: A streamlined platform for high-content functional proteomics of primary human specimens. Nat Methods. 2005, 2 (9): 691-697. 10.1038/nmeth778.
Chen EI, Hewel J, Felding-Habermann B, Yates JR: Large scale protein profiling by combination of protein fractionation and multidimensional protein identification technology (MudPIT). Mol Cell Proteomics. 2006, 5 (1): 53-56. 10.1074/mcp.T500013-MCP200.
Washburn MP, Wolters D, Yates JR: Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat Biotechnol. 2001, 19 (3): 242-247. 10.1038/85686.
Eng JK, McCormack AL, Yates JR III: An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom. 1994, 5 (11): 976-989. 10.1016/1044-0305(94)80016-2.
Perkins DN, Pappin DJC, Creasy DM, Cottrell JS: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999, 20 (18): 3551-3567. 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2.
Weatherly DB, Atwood JA, Minning TA, Cavola C, Tarleton RL, Orlando R: A heuristic method for assigning a false-discovery rate for protein identifications from mascot database search results. Mol Cell Proteomics. 2005, 4 (6): 762-772. 10.1074/mcp.M400215-MCP200.
Keller A, Nesvizhskii AI, Kolker E, Aebersold R: Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal Chem. 2002, 74 (20): 5383-5392. 10.1021/ac025747h.
Nesvizhskii AI, Keller A, Kolker E, Aebersold R: A statistical model for identifying proteins by tandem mass spectrometry. Anal Chem. 2003, 75 (17): 4646-4658. 10.1021/ac0341261.
Sadygov RG, Liu H, Yates JR: Statistical models for protein validation using tandem mass spectral data and protein amino acid sequence databases. Anal Chem. 2004, 76 (6): 1664-1671. 10.1021/ac035112y.
Moore RE, Young MK, Lee TD: Qscore: An algorithm for evaluating SEQUEST database search results. J Am Soc Mass Spectrom. 2002, 13 (4): 378-386. 10.1016/S1044-0305(02)00352-5.
Baczek T, Bucinski A, Ivanov AR, Kaliszan R: Artificial neural network analysis for evaluation of peptide MS/MS spectra in proteomics. Anal Chem. 2004, 76 (6): 1726-1732. 10.1021/ac030297u.
Ulintz PJ, Zhu J, Qin ZHS, Andrews PC: Improved classification of mass spectrometry database search results using newer machine learning approaches. Mol Cell Proteomics. 2006, 5 (3): 497-509. 10.1074/mcp.M500233-MCP200.
Peng JM, Elias JE, Thoreen CC, Licklider LJ, Gygi SP: Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: The yeast proteome. J Proteome Res. 2003, 2 (1): 43-50. 10.1021/pr025556v.
Beausoleil SA, Villen J, Gerber SA, Rush J, Gygi SP: A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nat Biotechnol. 2006, 24 (10): 1285-1292. 10.1038/nbt1240.
Elias JE, Gygi SP: Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Methods. 2007, 4 (3): 207-214. 10.1038/nmeth1019.
Higdon R, Kolker E: A predictive model for identifying proteins by a single peptide match. Bioinformatics. 2007, 23 (3): 277-280. 10.1093/bioinformatics/btl595.
Elias JE, Haas W, Faherty BK, Gygi SP: Comparative evaluation of mass spectrometry platforms used in large-scale proteomics investigations. Nat Methods. 2005, 2 (9): 667-675. 10.1038/nmeth785.
Park GW, Kwon KH, Kim JY, Lee JH, Yun SH, Kim SI, Park YM, Ch SY, Paik YK, Yoo JS: Human plasma proteome analysis by reversed sequence database search and molecular weight correlation based on a bacterial proteome analysis. Proteomics. 2006, 6 (4): 1121-1132. 10.1002/pmic.200500318.
Qian WJ, Liu T, Monroe ME, Strittmatter EF, Jacobs JM, Kangas LJ, Petritis K, Camp DG II, Smith RD: Probability-based evaluation of peptide and protein identifications from tandem mass spectrometry and SEQUEST analysis: The human proteome. J Proteome Res. 2005, 4 (1): 53-62. 10.1021/pr0498638.
Xie HW, Griffin TJ: Trade-off between high sensitivity and increased potential for false positive peptide sequence matches using a two-dimensional linear ion trap for tandem mass spectrometry-based proteomics. J Proteome Res. 2006, 5 (4): 1003-1009. 10.1021/pr050472i.
Kislinger T, Cox B, Kannan A, Chung C, Hu PZ, Ignatchenko A, Scott MS, Gramolini AO, Morris Q, Hallett MT, Rossant J, Hughes TR, Frey B, Emili A: Global survey of organ and organelle protein expression in mouse: Combined proteomic and transcriptomic profiling. Cell. 2006, 125 (1): 173-186. 10.1016/j.cell.2006.01.044.
Lu BW, Ruse C, Xu T, Park SK, Yates J: Automatic validation of phosphopeptide identifications from tandem mass spectra. Anal Chem. 2007, 79 (4): 1301-1310. 10.1021/ac061334v.
Olsen JV, Blagoev B, Gnad F, Macek B, Kumar C, Mortensen P, Mann M: Global, in vivo, and site-specific phosphorylation dynamics in signaling networks. Cell. 2006, 127 (3): 635-648. 10.1016/j.cell.2006.09.026.
Goldberg DE: Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley: New York. 1989.
Li LH, Tang H, Wu ZB, Gong JL, Gruidl M, Zou J, Tockman M, Clark RA: Data mining techniques for cancer detection using serum proteomic profiling. Artif Intell Med. 2004, 32 (2): 71-83. 10.1016/j.artmed.2004.03.006.
Heredia-Langner A, Cannon WR, Jarman KD, Jarman KH: Sequence optimization as an alternative to de novo analysis of tandem mass spectrometry data. Bioinformatics. 2004, 20 (14): 2296-2304. 10.1093/bioinformatics/bth242.
Jeffries NO: Performance of a genetic algorithm for mass spectrometry proteomics. BMC Bioinformatics. 2004, 5: 180. 10.1186/1471-2105-5-180.
Wilmarth PA, Riviere MA, Rustvold DL, Lauten JD, Madden TE, David LL: Two-dimensional liquid chromatography study of the human whole saliva proteome. J Proteome Res. 2004, 3 (5): 1017-1023. 10.1021/pr049911o.
Jiang XG, Feng S, Tian RJ, Han GH, Jiang XN, Ye ML, Zou HF: Automation of nanoflow liquid chromatography-tandem mass spectrometry for proteome analysis by using a strong cation exchange trap column. Proteomics. 2007, 7 (4): 528-539. 10.1002/pmic.200600661.
Qian WJ, Jacobs JM, Camp DG, Monroe ME, Moore RJ, Gritsenko MA, Calvano SE, Lowry SF, Xiao WZ, Moldawer LL, Davis RW, Tompkins RG, Smith RD: Comparative proteome analyses of human plasma following in vivo lipopolysaccharide administration using multidimensional separations coupled with tandem mass spectrometry. Proteomics. 2005, 5 (2): 572-584. 10.1002/pmic.200400942.
Bodenmiller B, Mueller LN, Mueller M, Domon B, Aebersold R: Reproducible isolation of distinct, overlapping segments of the phosphoproteome. Nat Methods. 2007, 4 (3): 231-237. 10.1038/nmeth1005.
Na SJ, Paek E: Quality assessment of tandem mass spectra based on cumulative intensity normalization. J Proteome Res. 2006, 5 (12): 3241-3248. 10.1021/pr0603248.
Tao WA, Wollscheid B, O'Brien R, Eng JK, Li XJ, Bodenmiller B, Watts JD, Hood L, Aebersold R: Quantitative phosphoproteome analysis using a dendrimer conjugation chemistry and tandem mass spectrometry. Nat Methods. 2005, 2 (8): 591-598. 10.1038/nmeth776.
Link AJ, Eng J, Schieltz DM, Carmack E, Mize GJ, Morris DR, Garvik BM, Yates JR: Direct analysis of protein complexes using mass spectrometry. Nat Biotechnol. 1999, 17 (7): 676-682. 10.1038/10890.
DTA files. [http://bioanalysis.dicp.ac.cn/proteomics/software/SFOER.dta.rar]
Krijgsveld J, Gauci S, Dormeyer W, Heck AJR: In-gel isoelectric focusing of peptides as a tool for improved protein identification. J Proteome Res. 2006, 5 (7): 1721-1730. 10.1021/pr0601180.
Everley PA, Bakalarski CE, Elias JE, Waghorne CG, Beausoleil SA, Gerber SA, Faherty BK, Zetter BR, Gygi SP: Enhanced analysis of metastatic prostate cancer using stable isotopes and high mass accuracy instrumentation. J Proteome Res. 2006, 5 (5): 1224-1231. 10.1021/pr0504891.
TPP project. [http://tools.proteomecenter.org/TPP.php]

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 20675081), the China State Key Basic Research Program Grant (2005CB522701, 2007CB914104), the China High Technology Research Program Grant (2006AA02A309), the Knowledge Innovation Program of CAS (KJCX2.YW.HO9), and the Knowledge Innovation Program of DICP to H.Z.

Author information

Authors and Affiliations

National Chromatographic R&A Center, Dalian Institute of Chemical Physics, The Chinese Academy of Sciences, Dalian, 116023, China
Xinning Jiang, Xiaogang Jiang, Guanghui Han, Mingliang Ye & Hanfa Zou

Corresponding authors

Correspondence to Mingliang Ye or Hanfa Zou.

Additional information

Authors' contributions

X.N. Jiang carried out the study and developed the software implementing GA. H.F. Zou and M.L. Ye designed the whole project and helped to interpret data analysis results. X.G. Jiang and G.H. Han contributed to the sample preparation and analysis. All authors read and approved the final manuscript.

Electronic supplementary material

12859_2006_1695_MOESM1_ESM.pdf

Additional file 1: Distribution of peptides identified from human liver tissue lysate by SEQUEST. The file gives the detailed Xcorr and ΔCn distributions of peptides identified from human liver tissue lysate by SEQUEST. (PDF 83 KB)

Rights and permissions

Open Access. This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Jiang, X., Jiang, X., Han, G. et al. Optimization of filtering criterion for SEQUEST database searching to improve proteome coverage in shotgun proteomics. BMC Bioinformatics 8, 323 (2007). https://doi.org/10.1186/1471-2105-8-323

Received: 14 November 2006
Accepted: 31 August 2007
Published: 31 August 2007
DOI: https://doi.org/10.1186/1471-2105-8-323