Introduction

Acute lymphoblastic leukemia (ALL) is the most common childhood malignancy, representing 25% of all pediatric cancers. ALL exhibits clinical and biological heterogeneity, driven by recurrent genetic aberrations [1]. Treatment advancements have led to 5-year overall survival (OS) rates exceeding 90% [2]. However, progress has been slower for patients who relapse, with a mortality rate of approximately 45% in Nordic countries [3]. Additionally, ALL treatments carry risks of adverse outcomes, including an increased late incidence of secondary malignancies, as well as long-term neurological, cardiac, endocrine, and social/psychological disorders [4]. In this regard, the long-term organic complications associated with allogeneic stem cell transplantation (allo-HCT) during childhood are broad [5], and therefore optimizing patient selection is key to minimizing unwanted toxicity.

Upfront treatment is primarily based on combination chemotherapy. Prognostic factors have been used to estimate the risk of relapse and to adjust treatment intensity accordingly, which has reduced toxicity without adversely affecting the cure rate [6]. As the treatment intensity required for cure varies greatly between patients, a risk-adapted strategy is intended to reduce toxicity for those patients who are likely to be cured with low-dose chemotherapy, while more intense regimens are reserved for high-risk groups [7,8,9,10,11,12,13,14,15]. Prognostic factors for risk stratification include age, white blood cell (WBC) count, immunophenotype, minimal residual disease (MRD), cytogenetic aberrations, and central nervous system (CNS) involvement [2, 7, 9, 16]. Additional factors, such as IKZF1 deletion, may enhance future risk prediction [17, 18]. Intensive chemotherapy and cell therapy (allo-HCT and chimeric antigen receptor (CAR)-T cells) are used for relapsed and refractory disease [19,20,21].

Genomic techniques have the potential to improve risk stratification [22, 23], as traditional risk grouping approaches may not be applicable to all circumstances [24,25,26]. High-dimensional data can be used to cluster patients and to assess the relationship of these clusters with drug response and survival [27, 28]. However, the complex molecular determinants of leukemia hinder accurate grouping, resulting in misclassification. Framing the analysis as an optimization problem simplifies it by incorporating clinical outcomes and baseline prognostic information to derive a risk predictor for outcomes in new patients [29]. This approach has paved the way for the development of prognostic and predictive tools in various onco-hematological fields, including myelofibrosis, myelodysplastic neoplasms, and multiple myeloma [30,31,32].

While the study of genetic changes has led to a better understanding of tumor biology, epigenetics has emerged as a valuable avenue to explain tumor phenotypes. The epigenetic landscape is essential in defining tumor types and subtypes, allowing for high-resolution classification and insight into tumor-specific mechanisms [33]. In B-cell malignancies, epigenetic alterations involve distinct cellular processes and B-cell-specific aspects, enabling accurate detection of B-cell tumor-specific aberrations for improved prognostication [26, 34, 35]. ALL cells are known to exhibit CpG island hypermethylation [36] but only minimal global loss of methylation, a pattern that was particularly marked in T-cell ALL [37]. While the mechanisms underlying the transformation of progenitor B- and T-cells into leukemic cells are not fully understood, these studies cumulatively demonstrate the potential of DNA methylation as a biomarker for lineage and subtype classification, prognostication, and disease progression [38].

In the present study, we trained two machine learning (ML) models based on DNA methylation signatures obtained at ALL diagnosis, with the aim of refining risk grouping. Our results suggest that DNA methylation profiling at ALL diagnosis could aid in future refinement of risk assignment and may contribute to improved survival and long-term quality of life for pediatric ALL patients.

Materials & methods

Data origin and preprocessing

Pediatric ALL samples from three cohorts were processed as originally described by Nordlund et al. (2013) [26], Busche et al. [39] and Krali et al. [40]. The Nordlund et al. dataset was used to build the risk predictors and evaluate their performance internally. This dataset comprises the pre-treatment DNA methylation status of a filtered set of 435,941 CpG sites downloaded from GEO (GSE49031), assayed in 763 diagnostic ALL samples on Infinium Human Methylation 450K BeadChips (450k array) [26]. The following clinical covariates were available: age, sex, Down syndrome status, risk group and cytogenetic subtype. The risk group was defined according to age at diagnosis, WBC count, B- or T-lineage, and genetic aberrations according to the NOPHO-92 or NOPHO-2000 protocols [6, 16]. Patients were assigned to standard, intermediate or high-risk groups and treated accordingly. Relapse-free survival (RFS) was defined as the time from ALL diagnosis to the date of the first relapse. OS was defined as the time from ALL diagnosis to death from any cause.

For external validation, we identified and downloaded the following ALL datasets generated on the 450k array: Busche et al. GSE38235 (n = 42) [39], and Krali et al. (https://doi.org/10.17044/scilifelab.22303531) (n = 384) [40]. RFS and OS were defined as indicated above also for the external validation datasets. The dataset by Busche et al. included complete 450k array and clinical data from 42 Canadian BCP-ALL patients treated between 1999 and 2010 at the Sainte-Justine University Health Center (UHC; Montreal, QC, Canada). All patients underwent treatment with uniform Dana-Farber Cancer Institute ALL Consortium protocols DFCI 95–01, 2000–01 or 2005 [41,42,43]. The cohort by Krali et al. included patients treated with the NOPHO-2000 and NOPHO-2008 protocols [12, 40].

Variable selection and model development

The cohort by Nordlund et al. [26] was randomly divided into a training set (80% of the cohort, n = 573) and a test set (20% of the cohort, n = 190). Univariate Cox regression (survival package) [41] was used to evaluate the association of the DNA methylation beta-values of each CpG site with RFS and OS in the training set. CpGs with q-values < 0.01 or < 0.05 were selected for model construction, with or without an additional filter requiring a hazard ratio (HR) < 0.1 or > 10. Due to collinearity, a correlation filter was applied to the mortality risk predictor (MRP), in contrast to the relapse risk predictor (RRP), which did not require correlation filtering to reduce dimensionality due to its smaller size. This filter removed CpGs with a Pearson’s correlation > 0.7 with any other variable included in the regression. Multivariate models of survival were constructed using random forests (randomForestSRC package) [44]. The model outputs include a survival function and a cumulative hazard function, which represent patient risk predictions over time. Missing values were imputed in each dataset separately using a missing data algorithm developed by Ishwaran et al. [42]. Random forests were created with 1,000 trees. Hyperparameter optimization of the mtry and nnodes variables was performed using a grid search method. Variable importance was calculated with the permutation importance method (also known as the Breiman-Cutler method, implemented in the vimp function) and used to eliminate those CpGs with lower predictive value. In this method, the model's out-of-bag (OOB) error is first calculated. For each feature, its values are then randomly shuffled in the OOB dataset, and a new OOB error is computed using the perturbed data. The difference between the new and original OOB errors gives the VIMP score for that feature, with higher scores indicating greater importance. A graphical summary of the workflow is represented in Fig. 1.
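The permutation procedure described above can be illustrated with a minimal sketch. This is not the randomForestSRC implementation used in the study: it is written in plain Python, the "model" is a hypothetical fixed function (toy_predict), the error is a mean squared error standing in for the OOB error of a survival forest, and the shuffle is applied to the full toy dataset rather than to per-tree OOB samples. Averaging over several shuffles (n_repeats) is also an assumption made here for stability.

```python
import random


def mean_squared_error(y_true, y_pred):
    """Stand-in error measure; a survival forest would use 1 minus the c-index."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y, n_repeats=10, seed=0):
    """Return one VIMP score per feature: mean permuted error minus baseline error."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(r) for r in rows])
    scores = []
    for j in range(len(rows[0])):
        increase = 0.0
        for _ in range(n_repeats):
            # Shuffle the values of feature j across samples, leaving other features intact
            column = [r[j] for r in rows]
            rng.shuffle(column)
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            error = mean_squared_error(y, [predict(r) for r in permuted])
            increase += error - baseline
        scores.append(increase / n_repeats)
    return scores


def toy_predict(row):
    """Hypothetical model that depends on feature 0 only."""
    return 2.0 * row[0]


# Toy data: feature 0 drives the outcome, feature 1 is irrelevant to the model
rows = [[float(i), float((i * 7) % 5)] for i in range(20)]
y = [2.0 * r[0] for r in rows]

vimp = permutation_importance(toy_predict, rows, y)
print("VIMP per feature:", vimp)
```

In this toy setting, shuffling the informative feature increases the error, whereas shuffling the ignored feature leaves it unchanged, which is the behaviour that allows low-VIMP CpGs to be discarded.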

Fig. 1

Graphical representation of the study design. The models were trained with data from 763 ALL patients, all of whom had previously been characterized by genome-wide DNA methylation arrays. The dataset was partitioned into a training set (80% of the patients) and a test set (the remaining 20%). The training set was used to identify CpG sites with DNA methylation status associated with two key outcomes: relapse risk and mortality. The selected CpG sites were used to train Random Survival Forests models. Two models were generated: a Relapse Risk Predictor (RRP) and a Mortality Risk Predictor (MRP). The test set was utilized for internal validation. Finally, the models were further validated on two additional datasets

The discriminative capacity of the random forest models in the training set was evaluated with OOB error estimates of the concordance index (c-index) and with time-dependent areas under the receiver operating characteristic (ROC) curve (AUCs). The c-index is a metric for survival prediction that measures how well a model predicts the ordering of patients’ event times (e.g., death or relapse). A c-index of 0.5 corresponds to a random model, whereas a c-index of 1 corresponds to a perfect ranking between real and predicted outcomes. OOB estimation is based on sampling with replacement to create the training samples for the model to learn from. OOB error is the average prediction error on each training sample [...]

[...] more accurately predicted early relapses, but with a drop in performance after 20 months. The RRP, however, remained superior and more stable even after 20 months. The combination of the RRP with clinical risk grouping provided the best prognostic accuracy, outperforming either of the two separate strategies. Remarkably, the 20-month AUC was 81.4% and 82.5% in the training and test sets, respectively. To improve interpretability and enhance applicability, we identified the optimal cut-point of the RRP score in the training set (14.09 points) in order to divide patients into high- and low-RRP groups. The same cut-off (14.09) was also applied to the test set. The high-RRP group demonstrated a significant difference in relapse rates in comparison with the low-RRP group (p-value < 0.001, Fig. 3c–d, Additional file 2: Table S4).
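The concordance index used to report discrimination throughout this work can be made concrete with a short sketch. The following plain-Python function computes Harrell's c-index under simplifying assumptions that are ours rather than the study's: a pair of patients is treated as comparable only when the patient with the shorter follow-up had an observed event, ties in predicted risk count as half-concordant, and ties in follow-up time are ignored. The function name and the toy data are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's c-index: fraction of comparable pairs ranked correctly by predicted risk.

    times  : follow-up time per patient
    events : 1 if the event (e.g. relapse or death) was observed, 0 if censored
    risks  : predicted risk score per patient (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort: the third patient is censored at 15 months
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]

perfect = concordance_index(times, events, [4, 3, 2, 1])
reversed_ranking = concordance_index(times, events, [1, 2, 3, 4])
uninformative = concordance_index(times, events, [1, 1, 1, 1])
print(perfect, reversed_ranking, uninformative)
```

The three toy score vectors correspond to a perfect ranking, a fully inverted ranking and an uninformative predictor, matching the interpretation of the values 1 and 0.5 given in the text.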

Fig. 3

Time-Dependent Area Under the Curve and Kaplan-Meier Plots for Relapse-Free Survival Analysis. a–b Time-dependent Area Under the Curve (AUC) representing the accuracy of the Cox models in the prediction of relapse-free survival (RFS) in the training (a) and test (b) sets. The red line represents a Cox model based on standard-of-care risk grouping, the blue line represents the Cox model based on the relapse risk predictor (RRP), and the purple line represents the Cox model integrating both methods. c–d Kaplan-Meier plots depicting the RFS for the patients assigned to the two groups denoted by the relapse risk predictor: high-RRP (coral line) and low-RRP (blue line) in the training (c) and test (d) sets

Mortality risk prediction using random forests

The training and test sets contained 97 and 30 deaths, respectively. In the training set, 174 CpGs were univariately associated with OS (q-value < 0.05) accompanied by HRs < 0.1 or > 10. After collinearity filtering, the MRP signature consisted of 53 CpG sites (Table 3, Additional file 2: Table S5). The MRP achieved c-indexes of 0.751 and 0.754 in the training and test sets, respectively (Additional file 1: Figs. S3 and S4). Similarly to the RRP, addition of cytogenetic subtype (c-indexes, 0.753 and 0.751 for the training and test sets, respectively) or age at diagnosis (c-indexes, 0.752 and 0.753) as a covariate did not alter the performance of the MRP. However, the prognostic impact of cytogenetic classification alone in the entire cohort was low for OS (c-index, 0.597). Furthermore, we again observed that the MRP was also prognostic in the subgroup of patients with T-ALL (c-indexes, 0.702 and 0.597 in the training and test sets). Finally, we observed that the MRP signature could also be used to predict relapse (c-index of 0.694 and 0.643 in the training and test sets, respectively), but the RRP could not be used to predict OS.

Table 3 Annotation of the 450k array probes included in the mortality risk predictor (MRP)

Longitudinal assessment of the mortality risk predictor

The scores generated by the MRP were used in cross-validated Cox models to calculate time-dependent AUCs. We compared this model with the conventional clinical risk grouping. Once again, the MRP outperformed the clinical risk grouping strategy in terms of AUCs and bootstrapped c-indexes in both patient groups. In this case, the MRP outperformed the conventional risk grouping strategy at all evaluated time points (Fig. 4a–b). The combination with clinical risk grouping provided the best prognostic accuracy, outperforming either of the two individual strategies. The highest accuracy of the MRP was observed for risk stratification at 40 months post-diagnosis, with AUCs of 83.66% and 88.58% in the training and test sets, respectively. We calculated the optimal MRP cut-off (12.31 points) to split the patients into low- and high-MRP groups. The high-MRP group had significantly shorter OS than the low-MRP group in both the training and test datasets (Fig. 4c–d, Additional file 2: Table S4).
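The dichotomization and the Kaplan-Meier curves described above can be sketched as follows. This is an illustrative plain-Python example, not the analysis code of the study: only the cut-off value (12.31) comes from the text, while the patient scores, follow-up times and outcomes are hypothetical, and assigning scores strictly above the cut-off to the high group is an assumption. The statistical comparison between the curves is not shown.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time."""
    order = sorted(zip(times, events))
    at_risk = len(order)
    survival = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = order[i][0]
        n_events = 0
        n_leaving = 0
        # Collect every patient whose follow-up ends at time t
        while i < len(order) and order[i][0] == t:
            n_events += order[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving  # events and censored patients both leave the risk set
    return curve


def split_by_cutoff(scores, cutoff):
    """Label each score as 'high' (above the cut-off, by assumption) or 'low'."""
    return ["high" if s > cutoff else "low" for s in scores]


MRP_CUTOFF = 12.31  # cut-off reported in the text

# Hypothetical patients: MRP score, months of follow-up, death indicator
scores = [5.0, 8.2, 11.9, 12.5, 15.0, 20.1]
months = [60, 48, 30, 24, 12, 6]
died = [0, 0, 1, 1, 1, 1]

groups = split_by_cutoff(scores, MRP_CUTOFF)
for label in ("low", "high"):
    idx = [i for i, g in enumerate(groups) if g == label]
    curve = kaplan_meier([months[i] for i in idx], [died[i] for i in idx])
    print(label, curve)
```

Each curve drops only at observed deaths, while censored patients reduce the number at risk without lowering the estimated survival.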

Fig. 4

Time-Dependent AUCs and OS Kaplan-Meier Plots for Overall Survival Prediction. a–b Time-dependent AUCs representing the accuracy of the different classifiers (Cox regression) in the prediction of OS for the training (a) and test (b) sets. The red line represents the Cox model based on standard risk groups, the blue line represents the Cox model based on the mortality risk predictor (MRP) and the purple line represents the Cox model integrating both methods. c–d OS Kaplan-Meier plots for the high-MRP (coral line) and the low-MRP (blue line) groups, as determined by the surv_cutpoint MRP optimal cut-off, in the training (c) and test (d) sets

Validation

We identified two independent datasets of 450k array data generated from pediatric ALL cohorts, which we used to validate the predictors. In the Busche et al. dataset (n = 42), five patients relapsed during follow-up, and two deaths were recorded. Due to the small number of deaths, we used this dataset to validate only the prediction performance of the RRP, which achieved a c-index of 0.667. In the Krali et al. dataset (n = 384, Additional file 1: Fig. S5, Additional file 2: Table S6), 50 patients relapsed during follow-up and 45 patients died, of whom 19 died due to relapse and 20 died while in complete remission. The c-indexes of the RRP and MRP were 0.529 and 0.621, respectively, for the Krali et al. dataset. In this dataset, the RRP score was weakly associated with relapse risk (p-value 0.064; HR 1.028 for each risk unit increase, 95% CI: 0.9984–1.058), while the MRP score was strongly associated with OS (p-value 1.04 × 10⁻⁴; HR 1.073 for each risk unit increase, 95% CI: 1.036–1.112). We observed that the MRP provided its best prognostic accuracy within the standard and infant risk groups (Additional file 2: Table S7). Furthermore, when we applied the MRP to the RFS data, the c-index for predicting risk of relapse in the validation data (0.62 for Krali et al.) was similar to the cross-prediction performance observed in the training and test sets.
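Because the hazard ratios above are expressed per unit of risk score, their magnitude is easier to appreciate when rescaled to a larger score difference. Under the log-linear assumption of a Cox model with a continuous covariate, the HR for a difference of k units is the per-unit HR raised to the power k, and the same transformation applies to the confidence limits. The sketch below uses the MRP values reported for the Krali et al. dataset; the 10-point difference is a hypothetical choice made purely for illustration, and the function name is ours.

```python
def scale_hazard_ratio(hr_per_unit, units):
    """HR for a difference of `units` score points, given the per-unit HR of a Cox model."""
    return hr_per_unit ** units


# Per-unit MRP hazard ratio and 95% CI reported for the Krali et al. dataset
hr, ci_low, ci_high = 1.073, 1.036, 1.112

units = 10  # hypothetical score difference, chosen only for illustration
scaled = tuple(scale_hazard_ratio(v, units) for v in (hr, ci_low, ci_high))

print("HR for a %d-point difference: %.2f (95%% CI %.2f to %.2f)" % ((units,) + scaled))
```

With these inputs, a 10-point difference in MRP score corresponds to roughly a doubling of the hazard of death, which puts the seemingly small per-unit estimate into perspective.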

We applied the previously defined low- and high-MRP dichotomization to evaluate differences in OS within this cohort. Consistently, increased mortality was observed in the high-MRP group (p-value < 0.001, Fig. 5a, Additional file 2: Table S4). Along the same lines, we applied the low- and high-RRP dichotomization, but no significant difference in RFS was observed between the groups (p-value 0.14, Fig. 5b, Additional file 2: Table S4). Hence, we investigated whether the MRP dichotomization could be predictive of relapse, which revealed significant differences in RFS between the low-MRP and high-MRP groups (p-value < 0.001; Fig. 5c, Additional file 2: Table S4).

Fig. 5

Kaplan-Meier plots for overall and relapse-free survival in the independent dataset. a–c Kaplan-Meier plots for (a) overall survival (OS) and (b–c) relapse-free survival (RFS) in the independent dataset. a OS differences between the high-MRP (coral line) and low-MRP (blue line) groups. b RFS differences between patients assigned by the model to the high-RRP (coral line) and low-RRP (blue line) groups. c RFS differences between patients assigned to the high-MRP (coral line) and low-MRP (blue line) groups

Patient characteristics associated with epigenetic risk

Based on the current ICC system for ALL subtyping [47], we grouped the 1,147 ALL patients from the training, test, and validation sets using the latest molecular classification. The frequencies of the cytogenetic profiles were consistent with those described across other ALL cohorts [1, 48,49,50,51]. Using the dichotomized MRP cut-point, each sample set was visualized by MRP grouping (Fig. 6). Patients in the high-MRP group tended to display the known high-risk molecular subtypes (T-ALL, BCR::ABL1, KMT2A-r, hypodiploid, and MEF2D-r), while low-risk molecular subtypes (HeH, ETV6::RUNX1, and PAX5-alteration) were more frequent in the low-MRP group. Patients denoted as B-other were split between the high-MRP and low-MRP groups. Notably, in the independent cohort, some patients characterized by standard-risk cytogenetic aberrations, such as high hyperdiploidy and ETV6::RUNX1, were assigned to the high-MRP group.

Fig. 6

Clinical outcome and molecular subtypes of patients in the train (left, n = 573), test (center, n = 190) and independent datasets (right, n = 384). The patients were sorted by the mortality risk predictor (MRP) groups (high/low, x-axis). Clinical annotations, including relapse, mortality, clinical risk groups, NOPHO treatment protocol, sex and age are provided as annotation bars, color-coded according to the figure legend. Patient molecular subtypes (y-axis) are denoted as gray vertical lines on the heatmap plots

Discussion

This study provides evidence that DNA methylation signatures analyzed using ML algorithms offer a promising avenue for personalized risk stratification in pediatric ALL patients. Our model effectively predicted patient risk using a small set of CpG sites, demonstrating an improvement over conventional prognostic approaches. Integration of the ML predictors with the conventional clinical risk score resulted in enhanced overall performance. Notably, the mortality risk predictor (MRP) outperformed the relapse risk predictor (RRP), potentially indicating the superior predictive ability of DNA methylation patterns in assessing biological risk. Importantly, for patients initially treated with low or standard-risk protocols who later relapsed and received intensified therapy, a substantial fraction achieved complete response. Hence, risk-associated DNA methylation signatures could help identify highly refractory patients who are unlikely to respond adequately to salvage chemotherapeutics.

Significant research efforts have been devoted to improving ALL prognostication using relevant clinical annotation from large cohorts. For example, Enshaei et al. used data from four different trials involving thousands of ALL patients to develop a continuous risk model based on white cell count at diagnosis, cytogenetics and end-of-induction MRD [47]. Despite promising results, the main limitation of this approach lies in the inclusion of post-induction MRD status, which impedes its application at the moment of diagnosis. Newer therapeutic approaches may try to optimize treatment from the beginning, which might limit the probability of developing clonal diversity as a driver of chemorefractoriness [52]. In this regard, several previous reports have demonstrated the usefulness of DNA methylation signatures determined at diagnosis to classify ALL patients into different molecular [26, 40] and prognostic subgroups [35, 53]. The present results indicate that DNA methylation signatures hold prognostic value in pediatric ALL regardless of the use of risk-adapted protocols that include cytogenetics, immunophenotype and MRD assessment.

A pivotal achievement of our investigation is the successful determination of the optimal cut-point for the MRP score. This critical threshold proficiently delineates a poor-prognosis group across all analyzed cohorts, underscoring the robustness and universal applicability of the MRP in risk stratification. While partially aligned with prevailing cytogenetic and molecular classifications, the MRP algorithm reconfigures risk groups with enhanced efficiency. This reclassification not only corroborates the established risk factors but also refines them, thereby presenting a more nuanced and potentially more accurate landscape of risk stratification in pediatric ALL.

The main advantage of our approach relates to the large sample size and the long follow-up of the patients. One limitation of our study, however, is the lower predictive performance of the RRP in the Nordic [40] independent validation set. The training set originated from Nordic patients treated on either the NOPHO-92 or NOPHO-2000 protocols, in which MRD measurements were not used to guide the indication of allo-HCT [6, 16]. In contrast, MRD analysis was performed at days 29 and 79 post-induction to select candidates for allo-HCT in the NOPHO-2008 protocol [12]. Differences in treatment between the protocols may explain the lower reproducibility of the RRP, and further highlight the importance of MRD analysis in treatment stratification. Regardless, the prognostic value of our MRP was replicated in the cohort of patients treated on NOPHO-2008, indicating that our methylation-based MRP identifies patients who will succumb to their disease despite MRD-guided approaches. Future studies evaluating this methodology should pursue its potential enrichment with MRD data for risk stratification. Another relevant issue is the absence of a comparison with prognostic classifications based on integrative genomic profiling data. Such a comparison could be beneficial for evaluating different methods and refining the methodology further [54].

The translation of epigenetic biomarkers into clinical practice has been limited, with only a few successful examples in oncology [55]. DNA methylation is not currently performed in the clinical management for ALL, and consequently its implementation into the clinical routine will need a progressive adaptation [2]. Furthermore, optimizing epigenetic biomarkers for clinical use is not straightforward, and several factors need to be considered, such as genomic region selection, accurate DNA methylation measurements, confounding parameter identification, standardized data analysis, efficient turnaround time, and cost considerations [56]. However, the incorporation of DNA methylation signatures could offer a deeper layer of biological complexity, thereby facilitating more informed clinical decisions and potentially transforming patient care.

In conclusion, our research presents two innovative models utilizing DNA methylation data for predicting relapse and mortality risk (RRP and MRP) in pediatric ALL. These models surpass traditional cytogenetic and clinical prognostic methods in risk stratification. They also demonstrate potential synergies with diagnostic clinical data, enhancing their predictive performance. Our findings reveal that DNA methylation signatures, analyzed through ML, are reliable predictors of patient outcomes in pediatric ALL. Particularly, the MRP's capacity to extend beyond established markers exemplifies its transformative potential in clinical decision-making, suggesting more personalized and effective treatment approaches for pediatric ALL.