Background

Lung cancer is the leading cause of cancer mortality in the United States. Estimates for 2014 indicate that 224,210 individuals will be diagnosed with lung cancer and 159,260 will die from the disease [1]. The average 5-year survival is about 17 %, with 79 % of cases being diagnosed as regional or distant disease. If lung cancer is detected when localized, survival increases to over 50 % [1].

Support for early lung cancer detection has emerged from the landmark NLST, where LDCT screening was shown to confer a 20 % reduction in lung cancer mortality in a high-risk population [2]. Despite concerns associated with the low specificity (73.4 %) of CT screening [3] and the resulting large number of false-positive findings for lung cancer (96.4 %), the USPSTF recently recommended annual LDCT screening for lung cancer in high-risk individuals [4, 5]. In their recommendation statement, the USPSTF stressed the need for more research into the use of biomarkers to complement LDCT screening. Two key clinical opportunities exist. First, the use of biomarkers for early detection of lung cancer could define a new high-risk population or refine the screening criteria recommended by USPSTF (age: 55 to 80 years, smoking history: ≥30 pack-years). Such biomarkers would serve as a pre-imaging filter, reducing the overall cost of screening and lowering the number of false-positive findings and unnecessary follow-up procedures. The second opportunity lies in improving the accuracy of lung cancer diagnosis. Given the high frequency of positive findings (pulmonary nodules) with CT screening [2], new means of accurately determining malignant risk are urgently required. In the NLST, 24 % of surgically resected nodules were found to be benign [2]. By improving the accuracy with which malignant risk is determined, biomarkers could potentially enhance diagnostic management by reducing unnecessary surgical intervention, minimizing the use of costly PET-CT and lowering radiation exposure associated with CT monitoring, while enabling detection of lung cancer at an early, more curable stage.

A wide variety of approaches have been utilized to discover new blood-based lung cancer protein biomarkers [6]. These range from splice variant analysis and the isolation of tumor-enriched transcripts [7], to the development of novel proteomic platforms with the capacity to resolve candidate markers in a highly multiplexed fashion [8]. Advances in mass spectrometry (MS)-based technologies have also enabled discovery of new lung cancer biomarker candidates directly in serum or plasma [9–13]. While the identification of biomarkers directly in blood-based matrices can be problematic due to their complexity and the presence of multiple highly abundant factors [14], some of these challenges can be minimized through extensive fractionation [15]. Differentially expressed candidate markers have also been successfully identified through comparison of blood draining from the tumor vascular bed matched with systemic arterial blood from the same patient [

Fig. 1 Venn diagram showing distribution of 179 candidate lung cancer biomarkers across 3 discovery platforms

The PANTHER classification system was used to categorize markers based on Protein Class and Pathway [22, 23]. Protein classes were defined for 141/179 (79 %) of the candidate markers evaluated. The most common classes reported were: receptors (14 %), cell adhesion molecules (14 %), hydrolases (13 %), defense/immunity proteins (10 %), proteases (9 %), enzyme modulators (8 %) and signaling molecules (8 %; Additional file 4: Figure S1). Further protein class analysis revealed similar profiles for biomarkers identified in the two cell-surface-based discovery programs, resected tissue and cultured lung cancer cell lines (Additional file 5: Figure S2). PANTHER classification resolved protein categories for 91/113 (81 %) of the markers identified in tissues and 70/86 (81 %) of those found in cell lines. While some differences clearly exist, the most abundant protein classes (cell adhesion, defense/immunity, enzyme modulator, extracellular matrix, hydrolase, protease, receptor, signaling, transfer/carrier and transporter) were resolved in both tissues and cell lines. PANTHER-based pathway analysis also revealed many similarities between the two discovery platforms. Pathways commonly identified in resected tissues (integrin signaling, inflammation, gonadotropin-releasing hormone receptor, Alzheimer disease-presenilin and plasminogen activating cascade) were also frequently found in the cell lines studied. Some differences were resolved between the two sources, including enrichment of blood-coagulation-related proteins in the tissue-based discovery system (22 %) relative to cell line studies (9 %; Additional file 6: Figure S3).

Serum-based biomarker verification

ELISA analysis was undertaken to investigate whether the differential expression profiles observed in lung cancer tissues, cell lines and conditioned medium would also be detected in the bloodstream of subjects with lung cancer. A small number of candidates were selected for serological characterization: CEA, MDK, MMP2, SLPI, TFPI and TIMP1 (Table 1). These biomarkers were selected in part because of reagent availability, but also, with the exception of CEA, because they represented some of the more novel lung cancer markers identified, with few studies indicating elevated expression in the circulation of patients with early-stage disease [24]. While all six markers had been shown to be present in plasma [25], they had not been resolved in other proteomic studies aimed at identifying differentially expressed lung cancer markers using alternative biological fluids (bronchial lavage [26, 27], sputum [28] or pleural fluid [29, 30]), or in profiling experiments aimed at identifying markers associated with other common lung disorders (COPD [27], asthma [31] or tuberculosis [32]).

Table 1 Candidate lung cancer biomarkers identified through MS discovery that were selected for serological characterization

With the goal of identifying markers to be used to screen for early-stage disease, or to guide diagnosis following CT-based detection, expression levels were determined in subjects with stage I NSCLC (n = 94), relative to normal smoker controls (n = 189; Table 2). Differential expression profiles can arise from serum collection procedures specific to a single clinical study site. To minimize selection of markers associated with such pre-analytical variability, subjects from two independent clinical studies were combined into a single testing set. The first study, collected at CRCCC (Clinical Research Center of Cape Cod; West Yarmouth, MA), comprised patients with stage I NSCLC (n = 30) and healthy smoker controls (n = 99). The second cohort, collected at New York University (NYU) School of Medicine/Langone Medical Center, was selected from a high-risk population with a history of heavy tobacco usage. Serum samples were collected from patients with stage I NSCLC (n = 64) and healthy controls (n = 90).

Table 2 Demographic and clinical profiles of subjects tested with lung cancer biomarker candidates

Levels of five of the six candidate biomarkers tested (CEA, MDK, MMP2, SLPI, TIMP1, TFPI) were significantly higher in serum from subjects with NSCLC than in controls (Table 3, Additional file 7: Figure S4), supporting this indirect discovery approach. Three extensively characterized markers (CYFRA 21–1, SCC and OPN) were also evaluated. These markers served as a reference in evaluating the clinical accuracy of the MS-identified markers.

Table 3 Expression levels of biomarker candidates in serum collected from patients with NSCLC (n = 94) and healthy volunteer controls (n = 189)

Multi-marker model development and testing

The identification of multiple differentially expressed markers prompted the development of a multi-marker panel. Elastic net modeling [33] started with all 9 candidate markers (Table 3). At the optimal value of the regularization parameter, as determined by bootstrap resampling, the parameter estimate for SLPI was reduced to zero, while the remaining 8 markers (TFPI, MDK, OPN, MMP2, TIMP1, CEA, CYFRA 21–1 and SCC) retained non-zero coefficients and were selected for the final model. In the training dataset (Table 2), this 8-marker model resolved lung cancer patients from smoker controls with 75 % sensitivity at 90 % specificity (AUC = 0.913). A bootstrap validation procedure confirmed the clinical performance of the model (AUC = 0.903).
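To illustrate how an elastic net penalty can reduce a coefficient to exactly zero, as reported for SLPI, the following is a minimal sketch of elastic-net-penalized logistic regression fitted by proximal gradient descent. It is not the software or procedure used in this study: the function names, the toy data, the two hypothetical markers and the fixed values of `lam` and `alpha` are all assumptions made for illustration, and the bootstrap selection of the regularization parameter is not reproduced.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net_logistic(X, y, lam=0.2, alpha=0.5, step=0.5, n_iter=500):
    """Minimize mean log-loss + lam * (alpha * |b|_1 + (1 - alpha) / 2 * |b|_2^2).

    lam, alpha, step and n_iter are illustrative defaults (assumptions),
    not values taken from the study. The intercept is not penalized.
    """
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        grad0, grad = 0.0, [0.0] * p
        for row, label in zip(X, y):
            z = intercept + sum(b * x for b, x in zip(beta, row))
            err = sigmoid(z) - label  # derivative of log-loss w.r.t. z
            grad0 += err / n
            for j in range(p):
                grad[j] += err * row[j] / n
        intercept -= step * grad0
        for j in range(p):
            # Gradient step on the smooth part (log-loss + ridge term) ...
            smooth = beta[j] - step * (grad[j] + lam * (1.0 - alpha) * beta[j])
            # ... followed by the L1 proximal step, which can yield exact zeros.
            beta[j] = soft_threshold(smooth, step * lam * alpha)
    return intercept, beta


# Toy data (hypothetical): column 0 is a strongly informative marker,
# column 1 is only weakly associated with case status.
X, y = [], []
for level, label in [(2.0, 1), (1.0, 1), (-0.5, 1), (-2.0, 0), (-1.0, 0), (0.5, 0)]:
    weak_values = (1.0, -0.8) if label == 1 else (0.8, -1.0)
    for weak in weak_values:
        X.append([level, weak])
        y.append(label)

intercept, beta = fit_elastic_net_logistic(X, y)
print("informative marker coefficient:", beta[0])
print("weak marker coefficient:", beta[1])
```

In this toy setting the weakly associated marker is dropped because its gradient never exceeds the L1 threshold `lam * alpha`, while the informative marker keeps a positive coefficient. In practice the penalty strength would be tuned by resampling, as described above, rather than fixed in advance.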

The accuracy of the 8-marker model was tested in an independent study (Mayo Clinic). Controls (n = 50) were selected from the high-risk control population evaluated in the Mayo CT-Screening Trial [34] and included subjects with pulmonary nodules (n = 22). Lung cancer cases were pre-operative surgical referrals (n = 50). Malignant lesions were significantly larger than screen-detected benign nodules. Cases and controls were matched on age, gender and smoking history (Table 2). EDTA plasma samples were utilized in this study. Levels of all markers included in the model had been shown to be highly correlated in serum and EDTA plasma (Additional file 8: Table S4). The 8-marker model distinguished patients with malignant lesions from all smoker controls with an AUC of 0.775 (Fig. 2), and performed similarly against control subjects with (AUC = 0.745) or without (AUC = 0.799) pulmonary nodules.
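The performance measures reported for the model (AUC, and sensitivity at a fixed specificity) can be computed from model scores as in the sketch below. This is an illustrative calculation only: the function names and the score values are hypothetical, not study data, and the threshold convention (cases scoring strictly above the cutoff are called positive) is an assumption.

```python
import math


def auc(case_scores, control_scores):
    """Area under the ROC curve as the probability that a randomly chosen
    case scores higher than a randomly chosen control (ties count 0.5)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def sensitivity_at_specificity(case_scores, control_scores, specificity=0.90):
    """Choose the lowest cutoff at which at least `specificity` of controls
    score at or below it, then return (sensitivity, cutoff)."""
    ordered = sorted(control_scores)
    # Number of controls that must be classified negative; the small
    # epsilon guards against floating-point round-up in the product.
    needed = math.ceil(specificity * len(ordered) - 1e-9)
    cutoff = ordered[needed - 1]
    positives = sum(1 for score in case_scores if score > cutoff)
    return positives / len(case_scores), cutoff


# Hypothetical model scores, for illustration only.
controls = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.70]
cases = [0.30, 0.50, 0.60, 0.80, 0.90]

model_auc = auc(cases, controls)
sensitivity, cutoff = sensitivity_at_specificity(cases, controls, 0.90)
print(f"AUC = {model_auc:.3f}")
print(f"Sensitivity at 90 % specificity = {sensitivity:.2f} (cutoff {cutoff})")
```

The same two functions could be applied separately to subgroups of controls (for example, those with and without pulmonary nodules) to obtain subgroup-specific AUC values.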

Fig. 2 Multi-marker model resolves lung cancer cases from smoker controls. Receiver operating characteristic (ROC) curves are plotted for all controls, nodule controls and no-nodule controls

While the 8-marker model score was substantially correlated with nodule size (r = 0.739; p < 0.0001), it was not associated with any of the other clinicopathological variables tested: age, sex or smoking history (unpublished data). Model scores were higher in tumors with squamous cell histology than in adenocarcinomas (p = 0.019), driven in part by higher levels of CYFRA 21–1 (p < 0.0001) and OPN (p = 0.013) in squamous cell carcinomas (unpublished data).