Introduction

A biomarker, as defined by the National Cancer Institute, is "a biological molecule found in blood, other body fluids, or tissues that is a sign of a normal or abnormal process, or of a condition or disease" [1]. It is a characteristic that is objectively measured and evaluated as an indicator of normal biological processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention [2]. The field of biomarkers has grown extensively over the past decade in areas such as medicine, cell biology, genetics, geology, astrobiology, and ecotoxicology, and biomarkers are currently being studied in many academic centers and in industry.

In recent years, functional genomics studies using DNA microarrays have proven effective in identifying markers that differentiate breast cancer tissues from normal tissues by measuring thousands of differentially expressed genes simultaneously [3-5].

However, early detection and treatment of breast cancer remain challenging. One reason is that obtaining tissue samples for microarray analysis can still be difficult. Another is that genes do not directly carry out physical functions; proteins, which make up the proteome, are the real functional molecules and the key to understanding the development of cancer. Moreover, breast cancer is not a single homogeneous disease but consists of multiple disease states, each arising from a distinct molecular mechanism and following a distinct clinical progression path [6], which makes the disease difficult to detect early.

Alternative splicing isoforms represent a new class of diagnostic biomarkers [7]. The chance of success with alternative splicing isoforms is expected to be higher than with the conventional approach [8, 9]. Alternative splicing occurs in 95% of human genes and works by selecting specific exons, and sometimes even intronic regions, of the gene for inclusion in mature mRNAs [10]. A protein isoform is any of several different forms of the same protein, and isoforms arise from three sources: alternative splicing, SNPs, and posttranslational modification (PTM). Alternative splicing accounts for approximately 8% of all protein isoforms.

Recent studies have shown that diseased cells may produce many types of splicing variants of common regulatory proteins, e.g., protein kinase C, 14-3-3, p53, and VGFR, which could provide novel insights into the diagnosis and management of complex diseases, particularly cancers [11-14]. Alternative mRNA splicing is an important source of molecular functional diversity. It is often regulated in a temporal or tissue-specific fashion, giving rise to different protein isoforms in different tissues or developmental states, mediated by extracellular signaling mechanisms [15, 16]. Splicing regulation is a key mechanism for tuning gene expression to a variety of conditions, and its dysfunction may often underlie the onset of genetic disease and cancer [9]. Many examples of alternative splicing isoforms have been reported in cancer [17-21]. For example, Julian et al. used a high-throughput reverse transcription-PCR-based system for splicing annotation to monitor the alternative splicing profiles of 600 cancer-associated genes in a panel of 21 normal and 26 cancerous breast tissues. They found that 41 alternative splicing events differed significantly between breast tumors and normal breast tissues, and that most cancer-specific splicing changes that disrupt known protein domains support an increase in cell proliferation or survival, consistent with a functional role for alternative splicing in cancer. Compared with normal mRNA splicing events, alternative splicing mechanisms and patterns in complex diseases such as cancer can be quite complex. Finding alternative splicing isoforms, or patterns of their development, is therefore a promising way to help develop high-quality biomarkers and targets for disease management [22].

The discovery of disease biomarkers in human plasma has been met with both enthusiasm and criticism in recent years. On one hand, it is expected that disease conditions such as cancer may be diagnosed early by analyzing complex protein mixtures in easily accessible human blood (serum or plasma), in which proteins induced by cancer may be differentially cleaved, secreted, or leaked out, and therefore detected differently than under normal healthy conditions. On the other hand, the bulk of proteins circulating in human blood vary across individuals and across conditions unrelated to cancer, and changes in protein expression are often diluted in the blood, which makes biological interpretation of changes in protein quantification extremely challenging [23, 24]. The use of alternative splicing isoforms as potential biomarkers therefore offers a new opportunity: to rely on the detectability, rather than the quantification, of biomarker peptides, for peptides that map to unique alternative splicing isoforms specific to cancer.

However, systematic, comprehensive, proteome-scale experimental and computational characterization of protein isoforms directly at the protein/peptide level and with exclusive focus on alternative splicing has never been reported. Compared with the "indirect" transcriptome-level characterizations such as EST sequencing[

Step 2: identification and characterization of alternative splicing isoform using proteomics

The three most popular types of search algorithms are: 1) algorithms that correlate the acquired MS/MS spectrum with a theoretical spectrum and count the number of peaks in common, such as SEQUEST [37] and X!Tandem [38]; 2) algorithms that model the extent of peptide fragmentation and then estimate the probability that an assignment is incorrect due specifically to a random match, such as Mascot [39] and OMSSA [40]; and 3) de novo sequencing algorithms, such as Lutefisk [41] and PEAKS [42]. We first run OMSSA against SASD to identify peptides from the MS data. We then perform a preliminary data analysis. Finally, we extract information about alternative splicing.
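The idea behind the first class of algorithms, counting peaks shared between an observed and a theoretical spectrum, can be illustrated with a minimal sketch. This is a simplification for illustration only: SEQUEST and X!Tandem use more elaborate scoring than a raw count, and the function name, the input lists of m/z values, and the 0.5 tolerance are assumptions not taken from the original work.

```python
from bisect import bisect_left


def shared_peak_count(observed_mz, theoretical_mz, tolerance=0.5):
    """Count theoretical peaks that have an observed peak within +/- tolerance.

    Both inputs are plain lists of m/z values; the tolerance is a
    hypothetical default chosen only for illustration.
    """
    observed = sorted(observed_mz)
    shared = 0
    for mz in theoretical_mz:
        # Index of the first observed peak at or above the lower bound.
        i = bisect_left(observed, mz - tolerance)
        if i < len(observed) and observed[i] <= mz + tolerance:
            shared += 1
    return shared


if __name__ == "__main__":
    observed = [100.1, 200.0, 350.4]
    theoretical = [100.0, 200.3, 300.0, 350.0]
    print(shared_peak_count(observed, theoretical))
```

A candidate peptide whose theoretical spectrum shares more peaks with the observed spectrum would rank higher under this simplified scheme.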

OMSSA reports hits ranked by E-value. The E-value of a hit is the expected number of random hits from a search library to a given spectrum that score as well as or better than that hit. For example, a hit with an E-value of 1.0 implies that one hit with an equal or better score would be expected at random from a sequence library search [40]. The lower the E-value, the more significant the score of the peptide identified by searching the SASD database.
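Because a lower E-value indicates a more significant hit, filtering on E-value amounts to keeping hits at or below a cut-off and ranking them in ascending order. The sketch below uses the 0.01 cut-off reported in the Results. The `Hit` record and its field names are hypothetical and do not reflect the actual OMSSA output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical search hit; the field names are illustrative only."""
    spectrum_id: str
    peptide: str
    e_value: float


def filter_hits(hits, max_e_value=0.01):
    """Drop hits with E-value above the cut-off; return the rest, best first."""
    kept = [h for h in hits if h.e_value <= max_e_value]
    return sorted(kept, key=lambda h: h.e_value)


if __name__ == "__main__":
    hits = [
        Hit("s1", "PEPTIDEK", 0.5),
        Hit("s2", "SAMPLER", 0.004),
        Hit("s3", "ANOTHERK", 0.0001),
    ]
    for hit in filter_hits(hits):
        print(hit.spectrum_id, hit.peptide, hit.e_value)
```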

A one-sided Wilcoxon signed-rank test is used for the preliminary statistical analysis, in order to identify peptides whose occurrence differs significantly between the healthy and breast cancer samples.
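For readers unfamiliar with the test, the following is a minimal sketch of a one-sided Wilcoxon signed-rank test using the normal approximation with a tie correction and no continuity correction. It is not the authors' implementation. How samples were paired is not stated in the text, so the paired inputs `x` and `y` are assumptions, and the function name is hypothetical.

```python
import math


def wilcoxon_signed_rank_greater(x, y):
    """Test H1: the median of the paired differences (x - y) is greater than zero.

    Returns (W+, p-value). Zero differences are dropped, tied absolute
    differences receive average ranks, and the p-value comes from the
    normal approximation.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0

    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(variance)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability
    return w_plus, p_value


if __name__ == "__main__":
    cancer = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # made-up values
    healthy = [0] * 10                          # made-up values
    print(wilcoxon_signed_rank_greater(cancer, healthy))
```

In practice a library routine such as `scipy.stats.wilcoxon` with `alternative="greater"` would normally be preferred, since it also offers an exact method for small samples.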

Step 3: validation of alternative splicing isoform

We use two methods to validate isoforms in proteomics data: 1) literature curation of alternative splicing isoforms and 2) cross-validation across multiple studies. First, we perform an extensive literature curation to determine the constituents of each alternative splicing isoform. Then, we validate the results using independent proteomics datasets derived from another study. We believe that such an integrative systems approach is essential to the development and validation of a panel of alternative splicing isoforms that can withstand rigorous testing in future steps.

Pathway analysis

Pathway analysis is performed using the Integrated Pathway Analysis Database (IPAD) (http://bioinfo.hsc.unt.edu/ipad/) [43].

Results

The 80 plasma samples, 40 from women diagnosed with breast cancer and 40 from healthy volunteer women serving as controls, were searched with OMSSA [40] against the SASD database. After the OMSSA search, the preliminary statistical analysis, and the extraction of alternative splicing information, we identified eight alternative splicing isoform biomarkers using the peptidomics approach to searching for novel alternative splicing isoforms in proteomics data (Table 1). Peptides with an E-value greater than 0.01 were filtered out. The P-value was calculated with a one-sided Wilcoxon signed-rank test, which examines the probability that the median difference between the two groups of samples is greater than zero. The numbers of samples in which each peptide was found are listed separately for healthy (h) and cancer (c) samples in the table. A small P-value indicates that the tested peptide is more likely to occur in cancer samples than in healthy samples. Bold text is the left part of the junction and italic text is the right part. The splicing site is marked by ^ or (); '()' means that the splicing site is shared by the left and right regions. For example, the second peptide, QTPKHISESLGAEVDPDMSWSSSLATPPTLSSTVLI(G)LLHSSVK, is a synthetic product of transcript ENST00000380152 of the gene BRCA2 when its eighth, ninth, and tenth exons are skipped and its seventh exon is joined to its eleventh exon. The glycine (G) is the splicing site shared between the seventh and eleventh exons.
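The junction notation described above can be read mechanically. The following minimal sketch splits an annotated peptide into its left region, shared residue(s), and right region, assuming the annotation is written exactly as `LEFT^RIGHT` or `LEFT(SHARED)RIGHT` with uppercase one-letter amino acid codes. The function names are hypothetical.

```python
import re

SHARED_SITE = re.compile(r"([A-Z]+)\(([A-Z]+)\)([A-Z]+)")
PLAIN_SITE = re.compile(r"([A-Z]+)\^([A-Z]+)")


def parse_junction(annotated):
    """Return (left, shared, right) for an annotated junction peptide.

    'shared' is the empty string when the site is marked with '^'.
    """
    match = SHARED_SITE.fullmatch(annotated)
    if match:
        return match.group(1), match.group(2), match.group(3)
    match = PLAIN_SITE.fullmatch(annotated)
    if match:
        return match.group(1), "", match.group(2)
    raise ValueError(f"unrecognized junction annotation: {annotated!r}")


def plain_sequence(annotated):
    """Return the peptide sequence with the junction marker removed."""
    left, shared, right = parse_junction(annotated)
    return left + shared + right


if __name__ == "__main__":
    peptide = "QTPKHISESLGAEVDPDMSWSSSLATPPTLSSTVLI(G)LLHSSVK"
    print(parse_junction(peptide))
    print(plain_sequence(peptide))
```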

Table 1 Novel alternative splicing isoform candidate biomarkers for breast cancer in plasma

None of the eight peptides has been reported in the PeptideAtlas database, which contains a comprehensive catalogue of peptides derived from published proteomics experiments. They are not found under the normal splicing mechanism (Table 1).

The first sequence results from a single-intron alternative splicing event. It was observed in 20 patient samples and 4 healthy samples. The triple play mode of the annotated Thermo-Finnigan LCQ-DECA ion-trap MS/MS spectrum is shown in Figure 2 [44]. The triple play mode includes a) the primary mass spectrum; b) the zoom scan mass spectrum; c) the MS/MS mass spectrum; and d) protein identification from MS/MS. Owing to space limits, the spectra of the other seven alternative splicing sequences are omitted.

Figure 2 Triple play mode spectrum of SWGGRPQRMGAVPGGVWSAVLMGGAR in patient C_024.

The eight peptides identified by OMSSA differ significantly in the number of hit samples between healthy women and breast cancer patients (P-value < 0.05, Table 1). A screenshot from the UCSC genome browser [45] of the region containing these peptides is shown in Figure 3. It shows that these peptide sequences are not found in the EST sequences and mRNAs from GenBank, nor in one RefSeq gene (Figure 3).

Figure 3 UCSC genome browser screen shot of genomic region for the novel peptide.

Table 1 shows that our peptidomics approach has significant potential for enabling the discovery of new types of high-quality alternative splicing isoform biomarkers. A further literature search found no published reports of the eight peptides. Moreover, cross-validation found that the identification of the eight peptides is supported by an independent study, which contains 40 samples collected from women diagnosed with breast cancer and 40 from healthy volunteer women who served as controls.

Pathway analysis shows that the pathways linked with the eight alternative splicing isoforms relate to transcription factors, signaling, cancer, and synthesis (Table 2).

Table 2 Pathway analysis for the eight alternative splicing isoforms.

Discussions

We have described a peptidomics approach to searching for novel alternative splicing isoforms in proteomics data, especially artificial alternative splicing and SNPs. It can be used to identify two common types of alternative splicing events: exon skipping and intron retention. Exon skipping is an alternative splicing mechanism in which one or more exons are included in or excluded from the final gene transcript, leading to extended or shortened mRNA variants. Intron retention is an event in which an intron is retained in the final transcript. Other types of alternative splicing events, such as alternative 3' and 5' splice sites, are not included in our method but can be derived indirectly from the two basic types, exon skipping and intron retention.
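The two basic event types can be enumerated directly from an ordered list of exons and introns. The sketch below is an illustration under stated assumptions, not the procedure used to build SASD: sequences are plain strings, indices are 0-based, and the function names and the `max_skipped` parameter are hypothetical.

```python
def exon_skipping_junctions(exons, max_skipped=None):
    """Yield (i, j, sequence) where exon i is joined directly to exon j.

    All exons strictly between i and j are skipped, so j >= i + 2.
    max_skipped optionally limits how many exons may be skipped at once.
    """
    n = len(exons)
    for i in range(n - 1):
        for j in range(i + 2, n):
            if max_skipped is not None and j - i - 1 > max_skipped:
                break
            yield i, j, exons[i] + exons[j]


def intron_retention_events(exons, introns):
    """Yield (i, sequence) where intron i is retained between exons i and i+1."""
    if len(introns) != len(exons) - 1:
        raise ValueError("expected exactly one intron between each exon pair")
    for i, intron in enumerate(introns):
        yield i, exons[i] + intron + exons[i + 1]


if __name__ == "__main__":
    exons = ["AAA", "CCC", "GGG", "TTT"]      # toy sequences
    introns = ["gt1ag", "gt2ag", "gt3ag"]     # toy sequences
    print(list(exon_skipping_junctions(exons)))
    print(list(intron_retention_events(exons, introns)))
```

In these 0-based terms, the BRCA2 example in the Results, where the seventh exon is joined to the eleventh, corresponds to the pair (6, 10) with three skipped exons.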

The current protein sequence databases used by tandem mass spectra search engines, for example IPI, UniProt, and NCBI nr, are designed to be as useful as possible to as many researchers as possible. As such, they are a less than ideal substrate for tandem mass spectra search. Protein sequence databases typically represent only "full-length" protein sequences and attempt to collapse protein variants into a single "consensus" entry. Tandem mass spectra search engines, however, cut the protein sequence into peptides by in-silico enzymatic digestion (for example with trypsin), so full-length proteins are not needed to identify experimentally observed peptides. In addition, currently available search engines require the sequences of the experimental peptides to be explicitly present in the sequence database in order to identify them, so explicit sequence variants are very important. The SASD is in fact a complete peptide sequence database that includes the majority of alternative splicing occurrences. It also provides alternative splicing annotation for each peptide, such as splicing mode, splicing type, splicing site, starting position, ending position, and peptide sequence.
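In-silico tryptic digestion can be sketched in a few lines using the commonly applied rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P). This is a generic illustration rather than the digestion routine of any particular search engine; the function names and the `missed_cleavages` parameter are assumptions.

```python
def cleavage_pieces(protein):
    """Split a protein after every K or R that is not followed by P."""
    pieces = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])
    return pieces


def tryptic_digest(protein, missed_cleavages=0):
    """Return peptides allowing up to the given number of missed cleavages."""
    pieces = cleavage_pieces(protein)
    peptides = []
    for i in range(len(pieces)):
        for extra in range(missed_cleavages + 1):
            if i + extra < len(pieces):
                peptides.append("".join(pieces[i:i + extra + 1]))
    return peptides


if __name__ == "__main__":
    print(tryptic_digest("AAKPCCRDDKEE"))
    print(tryptic_digest("AAKPCCRDDKEE", missed_cleavages=1))
```

Because a search engine can only match peptides produced from the database entries, a peptide spanning a splice junction is identifiable only if the spliced sequence itself is present in the database, which is the motivation for an explicit alternative splicing peptide database.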

Moreover, the current protein sequence databases and some alternative splicing databases, such as ASTD and EID, are neither ideal nor sufficient for identifying alternative splicing isoforms from tandem mass spectrometry data, because they contain little or no isoform information. For example, ASTD includes only 9,757 occurrences of intron isoforms and 5,214 occurrences of exon isoforms. Even for cassette exon events, the number of occurrences is only 12,470 [15]. In contrast, the SASD database in our method includes 11,919,779 alternative splicing peptides covering about 56,630 genes (Ensembl gene IDs), 95,260 transcripts (Ensembl transcript IDs), 1,956 pathways, 6,704 diseases, 5,615 drugs, and 52 organs.

Its comprehensive coverage gives better sensitivity than PEPPI in identifying novel alternative splicing isoforms, and its exclusive focus on alternative splicing should increase the specificity of alternative splicing identification.

Alternative splicing isoform biomarkers are important and can serve as an alternative to traditional biomarkers. Quantitative information such as the P-value can be used to determine the significance of a marker, and qualitative information such as splicing type, splicing mode, and peptide sequence can be used to further analyze the mechanism of the alternative splicing event. We think that combining traditional biomarkers with alternative splicing isoform biomarkers will help us better understand the diagnosis, treatment, and prognosis of cancer.