Background

Acute promyelocytic leukemia (APL) is a distinct subtype of acute myeloid leukemia (AML) characterized by a reciprocal, balanced translocation between the promyelocytic leukemia protein (PML) gene on chromosome 15 and the retinoic acid receptor α (RARα) gene on chromosome 17 [1, 2]. The t(15;17) results in the oncogenic fusion protein PML-RARα, which acts as a transcriptional repressor of RARα target genes and impairs the homeostatic function of PML, thereby promoting proliferation of myeloid progenitor cells and causing a maturation arrest at the promyelocytic stage [3,4,5]. APL was first described by the Norwegian hematologist Leif Hillestad in 1957 [6] and was long considered one of the most lethal leukemias [7], with population-based incidence rates varying between ethnicities [8,9,10]. The introduction of all-trans retinoic acid (ATRA) [11] and arsenic trioxide (ATO) [12] has revolutionized APL therapy, and outcomes nowadays show remarkable cure rates [13, 14]. Nevertheless, APL is considered a hematologic emergency and requires immediate causal and supportive treatment upon suspected diagnosis because of the risk of early death from bleeding [13]. Early death rates in APL (commonly defined as death within 30 days of presentation [15]) appear to be underestimated in the medical literature: while clinical trials frequently report early death rates below 10%, a substantial number of patients die before APL is even diagnosed, and patients with significant comorbidities or higher age are often excluded from trials, leading to bias [15, 16]. In patients ineligible for clinical trials, registry data as well as population-based analyses show an early death rate of approximately 20%, with even higher rates in elderly patients [15,16,17,18,19]. When diagnosed and treated promptly, APL is curable in the majority of patients. Therefore, fast and accurate diagnosis as well as immediate treatment upon suspicion is crucial [13].

Classical APL can be recognized by the distinct morphology of abnormal promyelocytes with a heavy granulation pattern and characteristic cells containing single Auer rods or bundles of Auer rods in the cytoplasm (‘faggot cells’) [20]. Cytomorphologic assessment by experienced hematopathologists is therefore essential for APL diagnosis, since it is fast, feasible, and can often reinforce a clinically suspected diagnosis. Routine diagnosis of APL encompasses cytomorphology [21, 22] as well as cytogenetics to confirm the suspected diagnosis [13]; however, genetic analyses require more time and resources before results become available. Furthermore, high-quality genetic testing may not be ubiquitously available.

Machine Learning (ML), especially in the form of Artificial Neural Nets (ANN), can handle large-scale data sets and is widely implemented in image recognition and computer vision, most notably as Convolutional Neural Nets (CNN) [23, 24], a form of Deep Learning (DL). DL models are massively parallel computing systems consisting of large numbers of interconnected processing units called artificial neurons [25, 26], which can be run efficiently on high-performance computing systems. CNNs contain multiple neural layers that provide the functionality for image recognition [24]. These capabilities can be utilized for cell segmentation, cell recognition and disease classification in hematological malignancies [27,28,29]. We here present a CNN-based, scalable approach that detects APL among healthy bone marrow donor and non-APL AML samples from bone marrow smear (BMS) images. The resulting models provide a reliable method for APL diagnosis when genetic data are still pending or an experienced hematopathologist is not immediately available, thereby reducing treatment delay. Furthermore, our DL model can be deployed remotely in areas without immediate access to high-quality genetic testing, thereby enabling the diagnosis of APL in non-industrialized countries, where APL is often more common [8, 9].

Methods

In this study, we trained a multi-stage DL platform to segment cells in BMS and distinguish between APL and non-APL AML as well as APL and healthy bone marrow donor samples using visual image data only.

Data set and molecular analysis

We retrospectively identified 58 APL patients who had been diagnosed and treated in the multicentric AIDA2000 (NCT00180128) [30] and NAPOLEON (national APL observational study, NCT02192619) [31] studies or in the German Study Alliance Leukemia (SAL) registry (NCT03188874). Eligibility criteria for the APL cohort were newly diagnosed APL according to WHO criteria [32] (FAB M3), defined by the presence of t(15;17) or the fusion transcript PML-RARA, age ≥ 18 years, and available biomaterial at diagnosis. The diagnosis of APL was confirmed using standard techniques for chromosome banding and fluorescence in situ hybridization (FISH). Seven samples were excluded because the BMS were inconclusive due to a dry tap and diagnosis had been established from peripheral blood; the remaining 51 APL BMS were analyzed for this study. The first control cohort comprised 236 bone marrow samples from healthy donors who underwent bone marrow donation at our center. The second control cohort consisted of 1048 BMS from patients with non-APL AML identified from the multicentric German SAL registry. Written informed consent was obtained from all patients and donors according to the Declaration of Helsinki. The studies were previously approved by the Institutional Review Board of the Technical University Dresden (EK 98032010). High-resolution pictures of representative areas of the BMS were taken using a Nikon ECLIPSE E600 microscope (50-fold magnification) with a mounted Nikon DSFi2 camera and Nikon Imaging Software Elements D4 for image processing. For each sample, one image of a representative area was taken for evaluation by the deep learning model. To account for imbalances in the data sets, image augmentation techniques were employed as described below.

Deep learning model

Pre-processing and cell segmentation

We developed a multi-step ML workflow with individual DL models for different tasks, as shown in Fig. 1. After digitization, BMS images were uploaded to an online segmentation and labeling platform that we developed for the purpose of this work. The platform's architecture was designed to receive BMS images of 2560 × 1920 pixels as input at the top (image) level, while individual cells in the subsequent cell-level classification tasks (see below) were processed at an input size of 299 × 299 pixels. The image input at the top level corresponded to an area of 171 × 128 µm. In the first step, initial cell segmentation was performed with a human-in-the-loop approach: hematologists annotated cells with the VGG Image Annotator [33] tool to train a Faster Region-based Convolutional Neural Net (FRCNN) [34]. Cell borders were initially drawn by hematologists; the FRCNN then learned by example and provided cell border proposals on unsegmented cells, which were manually corrected in an iterative fashion. The FRCNN thereby substantially improved its accuracy over successive iterations, enabling the final model to segment BMS images automatically without manual correction. The trained FRCNN was subsequently used to segment cells on all available BMS images. Hyperparameter optimization was performed automatically using the Optuna [35] framework with a predefined hyperparameter space.
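The search over a predefined hyperparameter space can be illustrated with a short sketch. The following is a minimal, illustrative example assuming PyTorch/torchvision for the Faster R-CNN detector and Optuna for the search; the data-loading helper (load_annotated_cells), the validation metric (evaluate_map) and the parameter ranges are hypothetical placeholders rather than the study's actual configuration.

```python
# Sketch of Optuna-driven hyperparameter optimization for a Faster R-CNN cell detector.
# Assumes PyTorch/torchvision; load_annotated_cells and evaluate_map are hypothetical helpers.
import optuna
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor


def build_frcnn(num_classes: int = 2):
    """Faster R-CNN with a ResNet-50 FPN backbone; one foreground class (cell) plus background."""
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model


def objective(trial: optuna.Trial) -> float:
    # Predefined hyperparameter space (illustrative ranges, not the study's exact values).
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    weight_decay = trial.suggest_float("weight_decay", 1e-6, 1e-3, log=True)
    batch_size = trial.suggest_categorical("batch_size", [2, 4, 8])

    model = build_frcnn()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                momentum=0.9, weight_decay=weight_decay)

    train_loader, val_loader = load_annotated_cells(batch_size)  # hypothetical data pipeline
    for epoch in range(5):
        model.train()
        for images, targets in train_loader:
            # In training mode, torchvision detection models return a dict of losses.
            loss_dict = model(images, targets)
            loss = sum(loss_dict.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    return evaluate_map(model, val_loader)  # hypothetical mAP evaluation on the validation set


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=25)
print(study.best_params)
```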

Fig. 1

Workflow of the multi-step deep learning model for APL recognition. We identified patients with APL, non-APL AML and healthy bone marrow donors by retrospective chart review. Representative images of bone marrow smears (BMS) were labeled according to diagnosis. After image preprocessing, transformation and augmentation, initial cell border proposals were given by the Faster Region-based Convolutional Neural Net (FRCNN) that were manually corrected on an online segmentation and annotation platform based on the VGG image annotator tool. The FRCNN was trained iteratively to improve cell border proposals. Segmented cells were manually labeled according to cell type (myeloblasts, promyelocytes) and Auer rods. Convolutional neural nets were then implemented on the automatically segmented cells for binary classification of individual cell types and features. Their output was used to train an ensemble neural net for the binary classification between APL and non-APL AML or APL and healthy bone marrow donor samples

Cell labeling and cell-level classification

In the second step, and analogous to training the model by example to detect cell borders, feature extraction was initially performed manually: hematologists attributed labels of cell type, lineage and distinct characteristics such as Auer rods to 8500 individual cells (Tab. S1 shows the numbers of individual labels). Cell size, volume and contrast were calculated automatically by computer vision algorithms. Given the rarity of the disease and the therefore limited number of cases included in our study, image augmentation techniques such as linear transformations, color shift and brightness adjustment were applied to increase sample size and to balance samples for the binary classifications, as imbalances may otherwise have introduced bias towards the predominant class. For image classification, the model architecture was based on the Xception CNN [36].
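As an illustration of such a cell-level classifier, the sketch below sets up a binary classifier on top of the Xception backbone with simple augmentation layers approximating the linear transformations, brightness and color adjustments mentioned above. It assumes TensorFlow/Keras; the augmentation parameters and the training pipeline (train_ds, val_ds) are illustrative assumptions rather than the study's configuration.

```python
# Minimal sketch of a 299x299 cell-level binary classifier on an Xception backbone (TensorFlow/Keras).
import tensorflow as tf
from tensorflow.keras import layers

augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),  # linear transformations
    layers.RandomRotation(0.2),
    layers.RandomBrightness(0.1),                  # brightness adjustment (requires TF >= 2.9)
    layers.RandomContrast(0.1),                    # simple color/contrast shift
])

backbone = tf.keras.applications.Xception(
    weights="imagenet", include_top=False, input_shape=(299, 299, 3))
backbone.trainable = False  # freeze for initial transfer learning; fine-tune later if desired

inputs = tf.keras.Input(shape=(299, 299, 3))
x = augment(inputs)
x = tf.keras.applications.xception.preprocess_input(x)
x = backbone(x, training=False)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)  # e.g. promyelocyte vs. non-promyelocyte

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auroc")])
# model.fit(train_ds, validation_data=val_ds, epochs=20)  # train_ds/val_ds: hypothetical tf.data pipelines
```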

Fig. 2

Examples of automated segmentation and occlusion sensitivity mapping. A Faster Region-based Convolutional Neural Net (FRCNN) was used for cell segmentation. It was first trained by human example and, after iterative learning, performed automated cell detection (A). Segmented cells show a yellow elliptic border. With respect to explainable artificial intelligence, we used occlusion sensitivity mapping to retrace the decision-making process of the convolutional neural nets in image-level recognition (B). In occlusion sensitivity mapping, parts of the image are iteratively blocked from evaluation by the neural network and performance is measured. If the blocked part of the image is highly important for correct classification, performance drops accordingly. This process is repeated iteratively across the entire image. The result can be visualized such that highly important image areas are highlighted (yellow/green) while less important or negligible areas are shaded (blue/purple)
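The occlusion procedure described in the caption can be sketched in a few lines: a patch is slid across the image, each occluded copy is re-scored by the trained classifier, and the drop in the predicted probability is recorded as the importance of the occluded region. The patch size, stride and fill value below are assumptions rather than the study's settings, and the model is assumed to be a Keras classifier with a single sigmoid output.

```python
# Illustrative occlusion sensitivity mapping for a single-output (sigmoid) Keras classifier.
import numpy as np


def occlusion_sensitivity(model, image, patch=32, stride=16, fill=0.5):
    """Slide an occluding patch over the image and record the drop in the predicted score."""
    h, w, _ = image.shape
    baseline = float(model.predict(image[None], verbose=0)[0, 0])  # score on the unoccluded image
    heatmap = np.zeros(((h - patch) // stride + 1, (w - patch) // stride + 1))

    for i, y in enumerate(range(0, h - patch + 1, stride)):
        for j, x in enumerate(range(0, w - patch + 1, stride)):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch, :] = fill     # block this region from evaluation
            score = float(model.predict(occluded[None], verbose=0)[0, 0])
            heatmap[i, j] = baseline - score                 # large drop -> important region
    return heatmap
```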

Previous efforts using a single CNN for upfront whole-image classification showed only moderate results. The stepwise approach of distributing different classification tasks over different CNNs and integrating their output into an ensemble neural net (ENN) substantially improved accuracy for APL prediction. Hence, we used the proportions of myeloblasts, promyelocytes, and Auer rods as a proxy to achieve image-level classification. To train the CNNs, 2992 myeloblasts, 1378 promyelocytes, and 130 cells with Auer rods were labeled manually (a full list of manually labeled cells is provided in Tab. S1). For the individual binary classifications (e.g., myeloblast vs. non-myeloblast, promyelocyte vs. non-promyelocyte), the CNNs' accuracy differed (Fig. 3). The individual CNNs for the detection of myeloblasts, promyelocytes, and Auer rods showed an AUROC of 0.8741, 0.9199, and 0.8363, respectively.
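The ensemble step can be illustrated as follows: the per-image proportions of cells predicted as myeloblasts, promyelocytes or Auer-rod-positive by the three cell-level CNNs are aggregated into a small feature vector and passed to a compact dense network that outputs the probability of APL. The feature set, threshold and layer sizes below are illustrative assumptions rather than the study's exact architecture.

```python
# Sketch of aggregating cell-level CNN outputs into image-level features for an ensemble net.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers


def image_features(cell_images, blast_cnn, promyelocyte_cnn, auer_cnn, threshold=0.5):
    """Per-image proportions of predicted myeloblasts, promyelocytes and Auer-rod-positive cells."""
    blasts = blast_cnn.predict(cell_images, verbose=0)[:, 0] > threshold
    promyelocytes = promyelocyte_cnn.predict(cell_images, verbose=0)[:, 0] > threshold
    auer = auer_cnn.predict(cell_images, verbose=0)[:, 0] > threshold
    return np.array([blasts.mean(), promyelocytes.mean(), auer.mean()])


ensemble = tf.keras.Sequential([
    layers.Input(shape=(3,)),                 # proportions of blasts, promyelocytes, Auer rods
    layers.Dense(16, activation="relu"),
    layers.Dense(8, activation="relu"),
    layers.Dense(1, activation="sigmoid"),    # P(APL) for the binary image-level decision
])
ensemble.compile(optimizer="adam", loss="binary_crossentropy",
                 metrics=[tf.keras.metrics.AUC(curve="PR", name="auprc")])
```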

Fig. 3

Performance of convolutional neural nets for binary cell type classifications. Since end-to-end image-level classification did not show satisfactory results in preliminary testing, we used cell-level recognition with convolutional neural nets as a proxy. Relevant cell types and features for the distinction between non-APL AML, APL and healthy bone marrow, i.e., myeloblasts, promyelocytes, and Auer rods, were labeled manually and CNNs were trained. The performance of individual CNNs for the detection of myeloblasts (A), promyelocytes (B), and Auer rods (C) on the respective testing sets (which were rigorously withheld from training) is displayed as the area under the receiver operating characteristic curve (AUROC) using threefold cross-validation (cv 0, 1, 2 illustrated in light blue, orange, and green). std. dev. – standard deviation of the mean; TPR – true positive rate; FPR – false positive rate

For the final classifications, our ENN model achieved a mean AUC of the precision-recall curve of 0.9671 (95% CI: 0.9441–0.9901; Fig. 4A) and a mean AUC of the ROC of 0.9585 (95% CI: 0.9327–0.9843; Fig. 4B) for the detection of APL samples among healthy bone marrow donor samples using threefold internal cross-validation. For the binary classification between APL and non-APL AML, our ENN model reached a mean AUC of the precision-recall curve of 0.8599 (95% CI: 0.7741–0.9457; Fig. 4C) and a mean AUC of the ROC of 0.8575 (95% CI: 0.7831–0.9319; Fig. 4D) with threefold internal cross-validation.
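For readers wishing to reproduce this type of evaluation, the sketch below computes the mean AUROC and the mean AUC of the precision-recall curve over a stratified threefold cross-validation using scikit-learn. The feature matrix X, the labels y and the logistic-regression stand-in for the ensemble net are placeholders, not the study's actual data or model.

```python
# Illustrative threefold cross-validated AUROC / precision-recall AUC with scikit-learn.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.linear_model import LogisticRegression


def cross_validated_auc(X, y, n_splits=3, seed=0):
    """Mean AUROC and mean PR-curve AUC over stratified k-fold cross-validation."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    rocs, prs = [], []
    for train_idx, test_idx in skf.split(X, y):
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])  # stand-in classifier
        scores = clf.predict_proba(X[test_idx])[:, 1]
        rocs.append(roc_auc_score(y[test_idx], scores))
        prs.append(average_precision_score(y[test_idx], scores))  # AUC of the precision-recall curve
    return np.mean(rocs), np.mean(prs)
```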

Fig. 4

Performance of the ensemble neural net for APL image-level recognition. Performance metrics for the binary classification of APL vs. healthy bone marrow donor samples (top row) and APL vs. non-APL AML samples (bottom row) were calculated as areas under the curve for precision-recall curves (A, C) and the receiver operating characteristic (B, D) using threefold cross-validation (cv 0, 1, 2 illustrated in light blue, orange, and green) and averaging results (Macro avg, dark blue). Calculations were performed in Python. std. dev. – standard deviation of the mean; TPR – true positive rate; FPR – false positive rate

Discussion

We here present a DL-based system for the diagnosis of APL from BMS. The resulting DL model automatically segments nucleated cells in BMS images with high accuracy. Accurate cell segmentation is a key initial step in CNN-based evaluation of leukemic cell morphology and is considerably harder in the bone marrow, as cells are often clumped in narrow spaces and artifacts are more frequent than in peripheral blood [37]. We tested APL recognition with our ENN model against both healthy bone marrow donor and non-APL AML samples and obtained an AUROC of 0.9585 and 0.8575, respectively. The time from upload of a BMS image to the model until output of a suggested diagnosis was 45.3 s. To the best of our knowledge, our model represents one of the first and most accurate DL approaches to recognize APL from bone marrow cytomorphology; the few recent studies have either focused on peripheral blood smears [38] or reported lower accuracy [39]. Given the considerable early death rates, especially in elderly patients [19], reliable methods for diagnosis are crucial to provide highly effective treatment. Since APL is a very rare disease, only experienced laboratory technicians and hematologists are typically able to raise the suspicion of APL, particularly in areas with low incidence rates, and patients suspected of having APL must be referred to specialized centers as soon as possible. ML has been reported to be able to flag possible cases from peripheral blood cell counts [40] and peripheral blood smear morphology [38]. Our DL model may serve as a proof-of-concept that DL can also be implemented as a robust tool for bone marrow evaluation in rare hematologic diseases such as APL. This can be advantageous when confirmation of t(15;17) by cytogenetics or of PML-RARα by FISH is still pending, or when such advanced tests are not available, e.g., in peripheral centers or in developing countries without ubiquitous access to high-quality laboratory procedures and genetic testing. In such cases, our DL model may enable the diagnosis of APL even in smaller centers in developing countries, as long as digitization of BMS and internet access are available. For specialized centers, our system may provide a mechanism for rapidly pre-scanning samples for APL and flagging them for immediate evaluation by hematopathologists.

Nevertheless, given the rarity of APL, our cohort comprises only 51 cases of available training data. To account for the small sample size, we used image augmentation techniques. To improve accuracy and further validate the model's generalizability, future studies will have to include larger data sets. Especially for rare entities such as APL, international collaboration is crucial, since ML models thrive on data. A larger APL data set may therefore improve the model's accuracy and allow it to be implemented as a diagnostic screening test in the clinical routine of specialized centers to pre-scan larger numbers of samples for possible cases of APL. From a technological perspective, we used cell-level detection of myeloblasts, promyelocytes and Auer rods as a proxy for the subsequent BMS image-level classification by the ENN. In preliminary experiments, we also tested direct end-to-end BMS image-level classification, albeit with unsatisfactory results. Conceivably, a larger data set may make it possible to train CNNs end-to-end for APL detection. Further, we attempted to include a fourth CNN for the detection of faggot cells; however, the scarcity of faggot cells in our sample did not allow for accurate faggot cell detection by the respective CNN. Again, a larger sample size may increase the accuracy of this individual classifier, and whether its subsequent incorporation into the model could boost overall performance remains to be tested. While we manually selected representative BMS areas for evaluation by the DL model, future applications of the model with whole slide imaging seem warranted. Novel techniques like DL-based automated focusing on whole slide images [41] can be used to further automate the process. However, the majority of ML studies in hematology, including our DL approach, are developed and tested on retrospective data. Prospective validation of our DL model is planned to confirm its accuracy and transferability to daily clinical practice; we believe such evaluation is necessary to assess the model's implications for routine diagnostics, and a larger training sample will be needed to further improve accuracy. In future iterations, the model could serve as a diagnostic tool that flags suspected cases of APL for expert evaluation, provide morphologic assessment where no such expertise or access to other diagnostic modalities is available, and serve as a proof-of-concept that deep learning can function even with sparse data in the medical domain. We consider this to be groundwork to build upon in future iterations of the model. Furthermore, it needs to be noted that our model only considers cytomorphology and is agnostic of clinical, genetic and laboratory data. Integration of these modalities is needed to improve diagnostic accuracy and provide an even stronger decision support system for clinicians.

Conclusion

We present a DL model to assist in the diagnosis of APL from bone marrow cytomorphology, using control cohorts of both healthy bone marrow donors and non-APL AML samples. Our ENN model achieved high AUROC values despite the limited sample size and serves as a proof-of-concept for the viability of DL in the diagnosis of rare cancer entities from image data. Since our DL platform uses visual image data only, it may potentially be used to support the diagnosis of APL in areas where molecular and cytogenetic profiling is not routinely available. Future work will therefore focus on increasing sample size, prospective validation, and the implementation of an online tool for easily accessible remote use of the model's APL prediction capabilities.