Introduction

Early diagnosis improves the survival rates of many types of cancer. For lung adenocarcinoma (LA), which accounts for almost half of all lung cancers and has a mortality rate of up to 80%, early diagnosis can increase the 5-year survival rate to 52% and reduce the cost of disease management1. However, conventional diagnostics using proteomic/genomic biomarkers or in vivo imaging are limited in detection throughput, diagnostic accuracy, analysis speed, and sampling invasiveness, particularly for early-stage LA2,3.

Serum analysis holds promise for early diagnosis of LA4 and is superior to traditional biopsy and computed tomography (CT) methods5, because serum analysis is non-invasive and low-cost for point-of-care testing (POCT)6,7 and has the desirable adaptability for universal applications. Most current serum analysis for the diagnosis of LA relies on selected genomic8,9 or proteomic10 biomarkers with limited sensitivity and specificity.

Metabolic serum analysis is more distal than genomic and proteomic approaches for precision diagnostics11,12,13, but it has rarely been reported for complex diseases such as LA, owing to the lack of efficient metabolite detection tools and systematically designed patient sub-groups. Changes in metabolism are associated with diverse diseases including LA6,14. Specifically, malignant transformations are associated with altered metabolic pathways for biosynthetic and bioenergetic processes, which are reflected as changes in the blood metabolome. Serum metabolite-guided approaches have been applied to detect blood metabolic fingerprints and to identify biomarkers in various diseases, including pancreatic adenocarcinoma15, acute myeloid leukaemia16, and hepatic steatosis17. These changes can be used for diagnostic purposes, hence the intense interest in extracting and deciphering serum metabolic information. An advanced analytical tool for the metabolic screening of early-stage diseases, including LA, is therefore urgently needed.

Spectrometry methods, including nuclear magnetic resonance (NMR)18 and mass spectrometry (MS), particularly laser desorption/ionization (LDI) MS, enable high-throughput extraction and measurement of metabolomic information, while tandem MS allows accurate identification of metabolites19. However, the metabolite abundance and sample complexity affect MS analysis, and rigorous pre-treatment procedures are required for enrichment and separation of metabolites from complex bio-mixtures.

Substrates decide the efficacy of LDI MS. The tailoring of material interfaces optimizes designed interactions between molecules and substrate materials for analytical use20,21. For LDI MS, there have been global efforts, including ours, to engineer substrate materials22,23,24. An ideal substrate material for LDI MS-based metabolic analysis should have the following properties: (1) nanoscale surface roughness with stability for the selective LDI of metabolites25; (2) favourable surface charge for ion formation and conductivity for electron transfer26; and (3) easy preparation with low costs for mass production aimed at large-scale clinic use. The current materials being used, including noble metals27,28, silicon26, carbon29, metal oxides23, and their hybrids, only have some of these properties, so novel material-based platforms combining all of the above merits are a pressing need for the practical use of LDI MS in clinics.

A further challenge is the processing of MS big data in serum samples to obtain the necessary accuracy. Machine learning of imaging and omic information has enjoyed huge success for diagnostic use in clinics30. Compared with in vivo imaging and biopsy methods31,32 that require expensive and invasive equipment, in vitro omics diagnostic methods are advantageous, although they require big data. As one of the major tools for omic information collection, MS techniques33,34 (such as MasSpec Pen for cancer tissues) have afforded big data for processing and interpretation by machine learning. Notably, the selection and optimization of algorithms are required to apply machine learning in disease diagnostics.

Given the biological significance of small metabolites (molecular weight (MW) <1000 Da) as end products of pathways and the limited performance of LDI MS in complex bio-mixtures, tackling the major problems in sample treatment, substrate materials, and data analysis for MS will lead to insights into metabolic pathways and help identify effective diagnostic metabolic biomarkers. Here, we optimize the LDI MS approach by improving the substrate, so that a large range of metabolites (including biologically relevant metabolites) can be analysed as metabolic patterns from serum samples without pretreatment. Encoded further by a machine-learning algorithm, the serum metabolic patterns achieve diagnosis of early-stage LA with high specificity and sensitivity and enable large-scale, low-cost rollout in clinics. Our approach contributes to the design of advanced metabolic analysis protocols for precision medicine, and will lead to the development of personalized diagnostic tools for diverse diseases including but not limited to LA in the near future.

Results

Optimization of substrate material for selective LDI MS

To enable efficient extraction of serum metabolic patterns by LDI MS, we first prepared ferric particles using a modified low-cost solvo-thermal method, yielding ~0.5 g of product from a single experiment (Fig. 1 and Supplementary Fig. 1a). The ferric particles consisted of nanocrystals (~5 nm diameter), as shown by transmission electron microscopy (TEM) (Fig. 1a). High-resolution TEM (HR-TEM) (Supplementary Fig. 1b) demonstrated the polycrystalline structure of the ferric particles, as did the diffraction pattern obtained by selected area electron diffraction (SAED, inset of Fig. 1a). By scanning electron microscopy (SEM), we observed a raspberry-like morphology of the ferric particles, which were of uniform size (~300 nm diameter, polydispersity index (PDI) of 0.155) and had a rough surface (Fig. 1b and inset), in agreement with the TEM and dynamic light scattering (DLS) results (Supplementary Fig. 1c). These particles exhibited a large surface area of 154 m² g⁻¹ (Supplementary Fig. 1d), validating the existence of crevices on the rough surface that selectively accommodate metabolites rather than proteins, and could be separated simply and quickly (~45 s) with a magnet owing to their superparamagnetic property (Supplementary Fig. 1e). We investigated the laser absorption properties of the particles and found strong absorption across the ultraviolet–visible region of 270–1100 nm (Supplementary Fig. 1f). We concluded that these ferric particles with designer structure might be ideal as a matrix for LDI MS.

Fig. 1: Substrate material characteristics and schematics of extraction and machine-learning workflow.

a Transmission electron microscopy (TEM) image of ferric particles (n ≥ 3 randomly selected) and selected area electron diffraction (SAED) pattern (inset) showing polycrystalline structure. Scale bar = 100 nm. b Scanning electron microscopy (SEM) images (n ≥ 3 randomly selected) of ferric particles showing nanoscale surface roughness and large-scale uniformity (inset). Scale bars = 100 nm in b and 1 μm in the inset of b. c Schematic workflow for the extraction of serum metabolic patterns by ferric particle-assisted laser desorption/ionization mass spectrometry (LDI MS). Fifty nanolitres of native serum was consumed for direct analysis without pre-treatment procedures. Only Na+-adducted and K+-adducted metabolites can be selectively detected with the coexistence of high concentration of peptides and proteins. d Schematic outline for the sparse regression machine learning of serum metabolic patterns (X). The sparse regression method was used to build calculation models with sparsely constrained \(\bar \beta\) towards the diagnosis of early-stage LA (\(\overrightarrow {\mathbf{y}}\)). Each square and its colour in X corresponded to one m/z feature and its signal intensity in serum metabolic patterns.

Optimizing the surface charge of substrate particles is critical for the LDI MS process of extracting serum metabolic patterns to allow ion formation and conductivity for electron transfer (Fig. 1c). We controlled the surface charge of the ferric particles during synthesis (Supplementary Fig. 2a), demonstrating that negatively charged particles with a zeta potential of –11.5 ± 2.65 mV produced by 0.4 g trisodium citrate afforded the optimized serum metabolite profile in LDI MS (Supplementary Fig. 2b) due to the enhanced formation of a positive metal ion layer on the surface to produce cation-adducted species. From 0 to 0.4 g of trisodium citrate, the metabolite signals with a signal-to-noise ratio (S/N) > 3 increased in number. Further increasing the amount of trisodium citrate resulted in no improvement in the number of metabolite signals. In addition, the ferric particles we produced had a specific band gap of <3 eV, with specific ultraviolet absorption that could be easily excited (from ground state E0 to excitation state E1) by a 355 nm laser for facile electron transfer during ionization (Fig. 1c).

We also compared the LDI MS results with those obtained using a conventional organic matrix (α-cyano-4-hydroxycinnamic acid, CHCA), inorganic matrices (silica and carbon nanoparticles), and blank controls without any matrix. These alternatives showed either strong interference in the low mass range or limited sensitivity/selectivity in the analysis of bio-samples (Supplementary Fig. 3). Specifically, in the control experiments, we observed no signals by LDI MS without any matrix owing to the low LDI efficiency (Supplementary Fig. 3a). We obtained overwhelming background noise with few peaks from small metabolites using the organic matrix (CHCA) and carbon particles (Supplementary Fig. 3b, c) and could only recognize the glucose signal using silica nanoparticles (Supplementary Fig. 3d), all of which demonstrated the advantages of ferric particles over current matrices. Notably, the rough surface of the particles offered abundant cavities for the selective and sensitive LDI of small metabolites in the presence of salts and proteins (Supplementary Fig. 4a–c), while the stable crystalline structure prevented unwanted fragmentation under laser irradiation. These designed features of the ferric particles promised efficient extraction of metabolic patterns from complex fluids (e.g. serum) based on selective LDI, enabling subsequent data analysis (Fig. 1d).

We selected ferric particles as the substrate for four main reasons: photo-thermal properties, preparation process, structural stability, and experimental cost. For photo-thermal properties, ferric particles show strong laser absorption (absorption coefficient at 355 nm of ~3.6 × 10⁵ cm⁻¹) and low thermal conductivity (heat capacity of 653 J (kg K)⁻¹). Thus, ferric particles can be heated to a high temperature by laser irradiation, which favours efficient molecular desorption35,36. For the preparation process, the solvo-thermal synthesis of the ferric particles is facile, and the yield of ~0.5 g of product can be used to detect ~10⁶ samples for large-scale clinical use. For comparison, the preparation of various types of silicon substrates requires complicated devices and procedures, such as micro-electro-mechanical systems (MEMS)37. For structural stability, ferric particles with a stable polycrystalline structure prevented unwanted fragmentation under laser irradiation, in contrast to carbon nanomaterials (Supplementary Fig. 3c), which produced unavoidable carbon cluster peaks in the low MW region at high laser fluence38,39. For experimental cost, the ferric particles (~£0.05 g⁻¹) are much cheaper than noble metals (~£36.36 g⁻¹ for gold), silicon (~£3.59 g⁻¹), and carbon (~£0.30–43.72 g⁻¹).

Extraction of serum metabolic patterns

Having optimized the substrate, we tested the ability of ferric particle-assisted LDI MS to extract serum metabolic patterns from patients. A total of 481 serum samples from 200 patients with early-stage LA, 200 healthy controls, 36 patients with other lung cancers, and 45 patients with benign lung diseases were included. The blood was drawn at initial diagnosis, without surgery or anaesthesia, and blood collection followed the same protocol for every subject enrolled in this project. We also performed power analysis (a universal method to derive the optimal sample size by estimating statistical power in a hypothesis test) on a dataset from a pilot study of 12 samples (6/6, LA/control) to compute the minimum sample number required for meaningful machine learning (Supplementary Fig. 5). Based on the power analysis, the minimum number of samples was 200 (100/100, LA/control) with a predicted power of ~0.8 at a false discovery rate (FDR) of 0.1, which provides sufficient confidence to draw statistically meaningful conclusions according to previous refs. 40,45,46. The discriminant performance in the double-blind test (AUC of 0.915) was consistent with that obtained by cross-validation during classifier building (AUC of 0.921). Notably, the double-blind test cohort was independently enrolled, decreasing the risk of model overfitting and guarding against overly optimistic results. The consistency between the double-blind test and cross-validation further supported a robust model without overfitting, according to previous reports46,47. Recently reported proteomic and genomic approaches (with AUC of ~0.6–0.9) require time-consuming (~hours) reactions (e.g. immunoassay and polymerase chain reaction) that are not ideal for routine clinical use4,48. For comparison, our metabolic approach provided desirable analytical performance (speed of ~seconds) and diagnostic performance (AUC of ~0.9) for early-stage LA detection in serum, demonstrating that computer-aided diagnosis based on serum metabolic patterns can detect early-stage LA.

Construction of the metabolic biomarker panel

We further set out to find metabolic biomarkers (also potential therapeutic targets) within the patterns and to characterize the relevant pathways. We identified a biomarker panel of seven metabolites (<400 Da) based on accurate mass measurement (for both Na+- and K+-adducted signals) and tandem MS (Fig. 4a, Supplementary Figs. 9–15, Supplementary Table 5), accounting for an AUC of 0.894 (Supplementary Fig. 16a). The panel consisted of uracil (Ura), histamine (His), cysteine (Cys), 3-hydroxypicolinic acid (HPA), uric acid (UA), indoleacrylic acid (IA), and fatty acid (FA) (18:2). Notably, a strong Pearson correlation between Na+-adducted and K+-adducted signals (>0.5) for the seven metabolites validated the presence and role of these metabolites as biomarkers (Fig. 4b, Supplementary Fig. 17). Specifically, we computed the odds ratios of the metabolic biomarkers in a logistic regression model (referred to as the basic model) and adjusted for age and sex, according to previous reports49. Age and sex were not significant covariates for any metabolic biomarker, and the seven metabolites retained significant odds ratios (≠ 1) when adjusted for age and sex (Supplementary Table 6). The localized mass spectra and scatter plots for serum metabolic patterns showed significant differences (p < 0.05, Supplementary Figs. 18 and 19) between early-stage LA and healthy controls for each biomarker.

Fig. 4: Construction of metabolic biomarker panel.

a Venn diagram of 161 m/z features from 810 metabolite peaks in serum, seven of which were selected as potential biomarkers with both model selection frequency >90% and p < 0.05 (<400 Da). b Correlation network plot elucidating strong Pearson correlation (>0.5) between Na+-adducted and K+-adducted signals (along diagonal line) for all seven selected metabolites in serum. Binding affinity of cations on the exposed surface [1,1,1] of ferric particles. Density functional theory (DFT) calculation results of c [ferric particles+Na+], d [ferric particles+K+], and e [ferric particles+H+] system with an anionic cluster model in the minimum-energy structure. f Fold change of five up-regulated metabolites (blue) and two down-regulated metabolites (orange) in early-stage LA patients compared with healthy controls. g Potential pathways differentially regulated in early-stage LA patients and healthy controls. The seven selected metabolites were tested to identify altered pathways. The colour and size of each circle were correlated to the p value and pathway impact value. A total of six pathways were differentially regulated: (1) fatty acid metabolism, (2) sulfur metabolism, (3) histidine metabolism, (4) cysteine and methionine metabolism, (5) pyrimidine metabolism, and (6) purine metabolism. Pathways with impact values >0 were considered to be differentially altered between early-stage LA patients and healthy controls. Source data are provided as a Source Data file.

The breadth of detectable metabolites depends on both chemical (molecular structure) and physical (molecular size) properties. For molecular structure, metabolites containing polar functional groups (such as the hydroxyl group) can be cationized on the surface of ferric particles through dipole–dipole interactions50,51. Therefore, our approach exploits the ability to produce cation (Na+, K+)-adducted metabolite species for polar compounds (e.g. amino acids, polyamines, carbohydrates, organic acids, and nucleosides). For molecular size, only small metabolites (MW < 1000 Da) can be selectively accommodated and trapped by the nano-crevices (~nm) of ferric particles, owing to the size-exclusion effect demonstrated in the literature22,52. Therefore, the surrounding alkali metal ions in the nano-crevices may facilitate efficient LDI of small metabolites, typically with MW < 1000 Da. Notably, we did not observe H+-adducts using ferric particle-assisted LDI MS, which was validated by the detection of standard molecules (Supplementary Fig. 20) and is consistent with previous reports35,53. Importantly, to further investigate the ion adduction process and characterize the competing adduction of H+/Na+/K+, we performed quantum simulations with density functional theory (DFT) calculations on the exposed surface [1,1,1] of ferric particles (Supplementary Fig. 21). The binding affinity of H+ on the surface of ferric particles is −13.6 eV (Fig. 4c), much stronger than those of Na+ (−4.7 eV, Fig. 4d) and K+ (−4.0 eV, Fig. 4e), hindering its transfer to analytes and the coupled cationization.

Notably, we found that uracil (3.36-fold increase) and UA (2.95-fold increase) were the most strongly up-regulated species, while HPA was the most strongly down-regulated species (Fig. 4f). Principal component analysis (PCA) of these seven metabolites (Supplementary Fig. 16b) displayed enhanced clustering between early-stage LA and healthy controls, compared with that of all 161 m/z features (Supplementary Fig. 16c). No single biomarker was sufficient to discriminate disease from control samples: univariate receiver operating characteristic (ROC) curve analysis yielded only poor AUC values (<0.7) for each individual biomarker (Supplementary Table 5). In contrast, the combination of the seven biomarkers achieved an enhanced AUC of 0.894 by multivariate ROC curve analysis in differentiating early-stage LA from healthy controls (Supplementary Fig. 16a). We therefore concluded that the panel of seven biomarkers was useful in discriminating disease from control samples. This success can be attributed to the superiority of multivariate analysis with combined biomarkers over univariate analysis with a single biomarker, which is well established in the literature4,54. The construction of the biomarker panel facilitated the simple analysis and large-scale use of our approach in clinics.

We also performed further data analysis to demonstrate the metabolic differences and similarities between early-stage LA and other lung cancers/benign diseases (Supplementary Table 1). For metabolic differences, we identified two further panels of metabolites based on the metabolic patterns to differentiate early-stage LA from other lung cancers/benign diseases. Notably, the two panels showed superior diagnostic performance owing to the metabolic differences related to disease phenotypes (Supplementary Fig. 22, Supplementary Tables 7 and 8). For metabolic similarities, we identified the overlapping metabolites that were differentially expressed in early-stage LA and other lung cancers/benign diseases. Specifically, in the differentiation of early-stage LA and other lung cancers from healthy controls, Ura and IA were the overlapping metabolites. In parallel, in the differentiation of early-stage LA and other lung diseases from healthy controls, IA was the overlapping metabolite. Owing to the pathological process of lung diseases and the altered metabolic pathways, the metabolic similarities reflected the systematic response to diseases.

In-silico interrogation of potentially altered metabolic pathways (Fig. 4g, Supplementary Table 9) was performed by pathway topology analysis in MetaboAnalyst (http://www.metaboanalyst.ca/), revealing the major metabolic contributions from nucleotides (Ura and UA), FA, organic acids (Cys, HPA, and IA), and an active amine (His). Specifically, the differential expression of Ura and UA (intermediate metabolites of nucleotide metabolism) reflected metabolic adaptation to the increased transcriptional activity and differential regulation of purine and pyrimidine metabolism due to cancer cell proliferation11,18. The abnormal expression of FA fitted the current theory that FA degradation is reduced in tumour cells12,34; this was the pathway with the most significant impact (0.656). Among the organic acids correlated with protein and energy metabolism disorders, the changes in Cys, HPA, and IA suggested differential regulation of cysteine and methionine metabolism and of sulfur metabolism, caused by the greatly increased biosynthesis of proteins and abnormal activation of degradation enzymes during tumour growth12,55. Finally, the active amine (His) is involved in allergy and inflammation, which are involved in the cancer initiation process56,57. Moreover, we found that six metabolic pathways were shared between early-stage LA and other lung cancers, including (1) beta-alanine metabolism, (2) pyrimidine metabolism, (3) pantothenate and CoA biosynthesis, (4) glycine, serine, and threonine metabolism, (5) taurine and hypotaurine metabolism, and (6) histidine metabolism (Supplementary Fig. 23a). Similarly, we found that (1) histidine metabolism and (2) pyrimidine metabolism were shared between early-stage LA and benign lung diseases (Supplementary Fig. 23b). Together, we concluded that commonly altered metabolic pathways were observed in lung diseases, as also demonstrated in the literature58,59.

Pathway topology analysis has been widely applied in biomedical research and depends on the metabolite importance and metabolite number. For metabolite importance, the importance of one compound is estimated by its centrality measure (node or edge) in a given metabolic network, according to the literature40,68. Chromatography was performed on a Waters Acquity UPLC system. Mass spectrometric detection was carried out using a Waters Xevo G2-XS QTOF mass spectrometer equipped with an ESI source.

Preparation of clinical samples

A total of 481 subjects were consecutively recruited from 2014 to 2019 at Shanghai Chest Hospital, including 200 patients with early-stage LA, 200 healthy controls undergoing routine health care maintenance, 36 patients with other lung cancers (including squamous cell carcinoma and small cell carcinoma), and 45 patients with benign lung diseases (including pneumonia, hamartoma, pulmonary tuberculosis, granuloma, and others). All patients were diagnosed jointly by a panel of pathologists and the tumours were staged according to the international standards for TNM staging of lung cancer. The pathologists were blinded to all information acquired from MS analysis. Patients were excluded from the study if they had evidence of autoimmune syndromes or drugs. The blood was drawn at initial diagnosis without surgery or anaesthesia. All blood samples were drawn by venepuncture and clotted at room temperature within 40 min16. Serum samples were obtained by centrifuging at 5100×g and 4 °C for 10 min. After centrifugation, the precipitate was discarded and the supernatant serum was stored at −80 °C immediately (within 15 min). The elapsed time between blood draw, centrifugation, and ultimate storage at −80 °C was within 1 h69.

To validate the classification of early-stage LA and healthy controls, we recruited an independent double-blind test cohort from Shanghai Chest Hospital, with serum samples from 58 subjects (23/35, early-stage LA/healthy controls). Blood was drawn under the same conditions for all subjects.

All the investigation protocols in this study were approved by the institutional ethics committees of the Shanghai Chest Hospital and School of Biomedical Engineering, SJTU (KS1736). All subjects provided written informed consent to participate in the study and approved the use of their biological samples for analysis, according to the Helsinki Declaration.

Machine learning and computer-aided diagnosis

Considering the large size of the MS data, a sparse learning and regression model was employed for the diagnosis of subjects. The resulting models are simpler to interpret because they are “sparse” (involving only a subset of the features). Given a set of training subjects, we defined the matrix X = {⋯, xi,⋯}, where each row recorded the serum metabolic patterns (mass spectra) of the corresponding subject. The disease labels (i.e., ‘1’ for early-stage LA, ‘0’ for healthy control) of the training subjects were known already and were vectorized into the column vector \(\overrightarrow {\mathbf{y}} = \left( { \cdots ,\overrightarrow {{\mathbf{y}}_{\mathbf{i}}} , \cdots } \right)\prime\) accordingly. The l1-norm (and the squared l2-norm) regularized logistic regression model could thus be acquired by solving the following:

$$\min _{\overrightarrow \beta ,c}\mathop {\sum }\limits_{i = 1}^m \ln \left( {1 + {\mathrm{{e}}}^{ - \overrightarrow {\mathbf{y}} _i\left( {x_i\overrightarrow {\beta} + c} \right)}} \right) + \frac{{\lambda _1}}{2}\left\| {\overrightarrow {\beta} } \right\|_{l_1} + \lambda _2\left\| {\overrightarrow {\beta} } \right\|_{l_2}^2$$
(1)

where λ1 was the l1-norm regularization parameter enforcing the sparsity constraint, and λ2 was the regularization parameter for the squared l2-norm. The model chose a limited number of m/z features by adjusting l1-norm to attenuate the coefficients of the less significant features to 0, and fit the disease labels of the training subjects according to the selected m/z features. A mathematical weight for each statistically informative feature was calculated depending on the importance of the mass spectral feature in differentiating early-stage LA versus healthy control. The regression model was applicable to infer the disease label of a new test subject and provided a prediction score for each pattern of a test sample. Specifically, we detected xtest and computed \(\overrightarrow {\mathbf{y}} _{{\mathrm{test}}} = {\mathbf{x}}_{{\mathrm{test}}}^\prime \cdot \vec \beta + c\). The outcome was thresholded and converted to a diagnosis.
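As a rough illustration of how such a sparse model can be fitted, the following minimal sketch applies proximal gradient descent to the objective of Eq. (1), with the squared l2-norm described in the text. It is not the authors' implementation (their pre-processing was done in MATLAB and the solver is not named here). The function names, the step size, the toy data, and the recoding of the disease labels as −1/+1 (which the logistic loss in this form requires) are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def soft_threshold(value, amount):
    """Proximal operator of the l1 penalty: shrink a coefficient towards zero."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0

def fit_sparse_logistic(X, y, lam1, lam2, step=0.05, n_iter=2000):
    """Proximal gradient descent for
    sum_i ln(1 + exp(-y_i (x_i . beta + c))) + lam1/2 * ||beta||_1 + lam2 * ||beta||_2^2,
    with labels y_i coded as -1 / +1 (an assumption of this sketch)."""
    beta = [0.0] * len(X[0])
    c = 0.0
    for _ in range(n_iter):
        grad_beta = [2.0 * lam2 * b for b in beta]  # gradient of the squared l2 term
        grad_c = 0.0
        for x_i, y_i in zip(X, y):
            margin = y_i * (sum(x * b for x, b in zip(x_i, beta)) + c)
            weight = -y_i * sigmoid(-margin)        # derivative of the logistic loss
            for k, x in enumerate(x_i):
                grad_beta[k] += weight * x
            grad_c += weight
        # Gradient step on the smooth part, then soft-thresholding for the l1 part.
        beta = [soft_threshold(b - step * g, step * lam1 / 2.0)
                for b, g in zip(beta, grad_beta)]
        c -= step * grad_c                          # the intercept is not penalised
    return beta, c

def predict_score(x_test, beta, c):
    """Linear score x_test . beta + c; thresholding it gives the predicted label."""
    return sum(x * b for x, b in zip(x_test, beta)) + c

# Toy data (made up): feature 0 separates the classes, features 1 and 2 are noise.
X = [[2.0, 0.1, -0.2], [1.5, -0.3, 0.1], [1.8, 0.2, 0.3], [2.2, -0.1, -0.1],
     [-2.0, 0.2, 0.1], [-1.7, -0.2, -0.3], [-1.9, 0.1, 0.2], [-2.1, -0.1, -0.1]]
y = [1, 1, 1, 1, -1, -1, -1, -1]
beta, c = fit_sparse_logistic(X, y, lam1=1.0, lam2=0.1)
```

On this toy input the l1 term drives the coefficients of the two noise features to exactly zero while the informative feature keeps a positive weight, which is the feature-selection behaviour described above.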

For a typical machine-learning-based diagnosis, five mass spectra obtained for each sample were used to build molecular databases. Pre-processing of the raw mass spectra data, including baseline correction, peak detection, extraction, alignment, normalization, and standardization, was carried out by MATLAB (R2016a, The MathWorks, Natick, MA) prior to pattern recognition analysis. The total number of metabolite signals for each mass spectrum was detected, and then, m/z features were selected based on the Otsu algorithm and utilized in the subsequent analysis.
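The text does not specify how the Otsu algorithm was applied to the peak list. One plausible reading, sketched below purely as an assumption, is that Otsu's criterion (maximizing the between-class variance) sets an intensity cut-off that separates signal peaks from background. The function name and the example intensities are hypothetical.

```python
def otsu_threshold(values):
    """Return the cut-off that maximises the between-class variance
    when the 1-D values are split into a low and a high group."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    best_cut, best_var = ordered[0], -1.0
    running = 0.0
    for k in range(1, n):
        running += ordered[k - 1]
        if ordered[k] == ordered[k - 1]:
            continue  # a cut can only be placed between distinct values
        w_low = k / n
        w_high = 1.0 - w_low
        mean_low = running / k
        mean_high = (total - running) / (n - k)
        between_var = w_low * w_high * (mean_low - mean_high) ** 2
        if between_var > best_var:
            best_var = between_var
            best_cut = (ordered[k - 1] + ordered[k]) / 2.0
    return best_cut

# Hypothetical normalised peak intensities: four background and three signal peaks.
intensities = [1.0, 2.0, 1.0, 2.0, 10.0, 11.0, 12.0]
cutoff = otsu_threshold(intensities)
selected = [v for v in intensities if v > cutoff]
```

Here the cut-off falls midway between the two groups, so only the three high-intensity peaks are retained.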

To build the classifier model and evaluate its performance, five-fold cross-validation was used for both the inner loop and the outer loop (20 rounds for each fold, thus 100 models for the outer cross-validation in total). The performance of the classifiers was measured by the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, calculated as the proportion of concordant pairs among all pairs of observations, with 1 indicating perfect prediction accuracy.
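The two ingredients of this evaluation can be sketched as follows: a five-fold split of the sample indices, and the AUC computed as the proportion of concordant (positive, negative) pairs. This is a simplified illustration rather than the authors' code. The function names, the fixed random seed, and the convention of counting tied scores as half-concordant are assumptions.

```python
import random

def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for k shuffled, nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def auc_concordance(scores, labels):
    """AUC as the fraction of (positive, negative) pairs in which the positive
    sample receives the higher score; ties count as half a concordant pair."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    concordant = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                concordant += 1.0
            elif p == q:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))

# Three of the four (positive, negative) pairs are ranked correctly.
example_auc = auc_concordance([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

In a nested scheme, the same splitting function would be called once for the outer loop and again on each training portion for the inner loop that tunes the regularization parameters.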

To validate the discriminant performance of the built classifier on an external double-blind test cohort for differentiating early-stage LA from healthy controls, 58 samples (23/35: LA/healthy controls) were enrolled. The disease labels of the double-blind test cohort were unknown and predicted by the classifier. By comparing the predicted disease labels with the true disease status, we computed the sensitivity, specificity, and AUC. A step-by-step protocol describing the preparation of ferric particles, MS data acquisition, clinical sample preparation, and computer-aided diagnosis can be found at Nature Protocol Exchange70.
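For reference, the comparison of predicted and true labels reduces to counting the four outcomes of a confusion matrix. The sketch below assumes labels coded as 1 (early-stage LA) and 0 (healthy control) and a score threshold of 0. The threshold actually used is not stated in the text, and the function names and example labels are hypothetical.

```python
def labels_from_scores(scores, threshold=0.0):
    """Convert prediction scores to labels: 1 above the threshold, otherwise 0."""
    return [1 if s > threshold else 0 for s in scores]

def sensitivity_specificity(predicted, truth):
    """Return (sensitivity, specificity) for binary labels coded as 1 / 0."""
    tp = fn = tn = fp = 0
    for p, t in zip(predicted, truth):
        if t == 1:
            if p == 1:
                tp += 1   # diseased subject correctly detected
            else:
                fn += 1   # diseased subject missed
        else:
            if p == 0:
                tn += 1   # control correctly cleared
            else:
                fp += 1   # control wrongly flagged
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

# Made-up example: two of three patients and two of three controls are classified correctly.
sens, spec = sensitivity_specificity([1, 1, 0, 0, 1, 0], [1, 1, 1, 0, 0, 0])
```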

Potential biomarker identification

To identify the metabolic panel that contributed the most to diagnosis, two major aspects were considered for the 100 tuned models. First, we ranked the m/z features according to their model selection frequency and chose the top m/z features that occurred in over 90% of the 100 models. In parallel, we selected m/z features with a p-value < 0.05 according to a two-sided Student’s t-test. The metabolites that both occurred frequently and displayed a significant difference between early-stage LA and healthy controls were verified manually by m/z feature selection using the human metabolome database (HMDB, http://www.hmdb.ca/) and subsequently validated by tandem MS and accurate mass measurement (for both Na+-adducted and K+-adducted signals). Pearson correlations were computed between the Na+-adducted and K+-adducted signals of metabolites. The differential metabolomic profiles reflecting their respective biochemical pathways were analysed by MetaboAnalyst (http://www.metaboanalyst.ca/).
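The two selection criteria and the adduct correlation can be expressed compactly. The sketch below assumes that each tuned model is represented by the set of m/z features with non-zero coefficients and that the t-test p-values have already been computed elsewhere. The function names, m/z values, and p-values are placeholders, not data from the study.

```python
import math

def selection_frequency(models):
    """Fraction of models in which each m/z feature was selected.
    `models` is a list of sets of selected features."""
    counts = {}
    for selected in models:
        for feature in selected:
            counts[feature] = counts.get(feature, 0) + 1
    return {feature: n / len(models) for feature, n in counts.items()}

def candidate_features(models, p_values, min_freq=0.9, alpha=0.05):
    """Features selected in more than min_freq of the models and with p < alpha."""
    freq = selection_frequency(models)
    return sorted(f for f, fr in freq.items()
                  if fr > min_freq and p_values.get(f, 1.0) < alpha)

def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Placeholder input: ten models, three hypothetical m/z features.
models = [{135.0, 150.2, 203.1} for _ in range(9)] + [{135.0, 150.2}]
p_values = {135.0: 0.001, 150.2: 0.2, 203.1: 0.01}
panel = candidate_features(models, p_values)
```

In this placeholder example only one feature passes both filters: one of the others is selected in exactly 90% of models (not over 90%), and the other is not significant.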

Statistical analysis

Multivariate statistics were performed using the SIMCA software package (version 14.0, Umetrics, Umeå, Sweden). Before analysis, all mass spectra were Pareto (par) scaled, i.e. after centring, each variable was divided by the square root of its standard deviation. All covariates were tested, including age and sex. A logistic regression model was fitted to evaluate the association of metabolic biomarkers with the presence of early-stage LA. Odds ratios with 95% confidence intervals (CI) were calculated for the metabolic biomarkers (histamine, uracil, cysteine, HPA, UA, IA, and FA (18:2)). Before this analysis, all metabolites were centred and standardized to have a mean of 0 and a standard deviation of 1. Age and sex were added as covariates to the basic logistic regression model to calculate the adjusted odds ratios. An unsupervised principal component analysis (PCA) model was constructed from a number of principal components (PCs, orthogonal transformation of m/z features into linearly uncorrelated variables). The transformation was defined such that the first PC accounted for the largest variance (as much of the variability in the dataset as possible). From the PCA results, we obtained a PCA score plot by visualizing the first two PCs in a two-dimensional space. All the statistical models above were manually optimized. To quantify the reproducibility of clinical serum samples, the p value for the normal distribution test (Lilliefors (Kolmogorov–Smirnov) test) was acquired through the lillietest function in MATLAB, with the null hypothesis at the default 5% significance level.
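The two scalings mentioned here differ only in the divisor. A minimal sketch for a single variable (one m/z feature across samples) is given below. It assumes the sample standard deviation (n − 1 denominator), which the text does not specify, and the function names are illustrative rather than SIMCA's.

```python
import math
import statistics

def pareto_scale(column):
    """Centre a variable, then divide by the square root of its standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)  # sample SD (n - 1), an assumption of this sketch
    if sd == 0:
        return [0.0] * len(column)
    return [(v - mean) / math.sqrt(sd) for v in column]

def standardize(column):
    """Centre a variable, then divide by its standard deviation (mean 0, SD 1)."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)
    if sd == 0:
        return [0.0] * len(column)
    return [(v - mean) / sd for v in column]

intensities = [1.0, 2.0, 3.0, 4.0, 5.0]
pareto = pareto_scale(intensities)
zscores = standardize(intensities)
```

After Pareto scaling, the variance of a variable equals its original standard deviation, so intense peaks are damped without being forced to the unit variance that full standardization imposes.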

Power analysis was performed by uploading 12 samples (6/6: LA patients/healthy controls) as the pilot metabolomic data into MetaboAnalyst at an FDR of 0.1. The predicted power used to estimate the required sample size was set as 0.8 (refs. 40,41). To investigate the spectral similarity within one group, we computed the similarity scores for each group (both early-stage LA and healthy controls). Typically, one experimental spectrum obtained from a serum sample in each cohort was randomly selected and fixed as the reference spectrum. The other experimental spectra within the same cohort were compared with the reference spectrum, and spectral similarity scores were calculated. The similarity score between two mass spectra (i and j) was calculated by the cosine correlation method following a reported algorithm44, defined as

$${\mathrm{{cos}}} = \frac{{\overrightarrow {{\mathbf{y}}_{\boldsymbol{i}}} \cdot \overrightarrow {{\mathbf{y}}_{\boldsymbol{j}}} }}{{\left| {\overrightarrow {{\mathbf{Y}}_{\boldsymbol{i}}} } \right| \cdot \left| {\overrightarrow {{\mathbf{Y}}_{\boldsymbol{j}}} } \right|}} = \frac{{\mathop {\sum }\nolimits_{k = 1}^l y_{ik}y_{jk}}}{{\sqrt {\mathop {\sum }\nolimits_{t = 1}^{n_i} Y_{it}^2} \cdot \sqrt {\mathop {\sum }\nolimits_{t = 1}^{n_j} Y_{jt}^2} }}$$
(2)

where y was the normalized intensity of a peak appearing in both spectrum i and spectrum j (an identical peak), l was the number of identical peaks in the two spectra, Y was the normalized intensity of a peak appearing in a spectrum and n was the number of peaks in a spectrum.
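Eq. (2) translates directly into code when each spectrum is stored as a mapping from m/z to normalized intensity. In the sketch below, "identical peaks" are taken to be peaks with exactly the same m/z key. In practice peaks would be matched within a mass tolerance, so this exact matching, the function name, and the example spectra are simplifying assumptions.

```python
import math

def spectral_cosine(spec_i, spec_j):
    """Cosine similarity of two spectra given as {m/z: normalised intensity}.
    The numerator runs over peaks present in both spectra; each norm in the
    denominator runs over all peaks of the respective spectrum."""
    shared = spec_i.keys() & spec_j.keys()
    numerator = sum(spec_i[mz] * spec_j[mz] for mz in shared)
    norm_i = math.sqrt(sum(v * v for v in spec_i.values()))
    norm_j = math.sqrt(sum(v * v for v in spec_j.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return numerator / (norm_i * norm_j)

# Two made-up spectra sharing one of their two peaks.
reference = {100.0: 3.0, 200.0: 4.0}
other = {100.0: 3.0, 300.0: 4.0}
score = spectral_cosine(reference, other)  # 9 / (5 * 5) = 0.36
```

The score is 1 for spectra with identical peak lists and intensities and 0 when no peaks are shared.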

Other statistical analyses in this work were performed using SPSS software (version 19.0, SPSS Inc., USA) to calculate p values, including the two-sided Student’s t-test and one-way ANOVA. The significance level was set at 5% for all tests. Specifically, the comparison of means in one-way ANOVA was based on Bonferroni corrections.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.