Myocardial scar and left ventricular ejection fraction classification for electrocardiography image using multi-task deep learning

Boribalburephan, Atirut; Treewaree, Sukrit; Tantisiriwat, Noppawat; Yindeengam, Ahthit; Achakulvisut, Titipat; Krittayaphong, Rungroj

doi:10.1038/s41598-024-58131-6


Article
Open access
Published: 29 March 2024

Volume 14, article number 7523, (2024)

Scientific Reports


Atirut Boribalburephan^1,2,
Sukrit Treewaree³,
Noppawat Tantisiriwat³,
Ahthit Yindeengam⁴,
Titipat Achakulvisut¹ &
…
Rungroj Krittayaphong³


Abstract

Myocardial scar (MS) and left ventricular ejection fraction (LVEF) are vital cardiovascular parameters, conventionally determined using cardiac magnetic resonance (CMR). However, given the high cost and limited availability of CMR in resource-constrained settings, electrocardiograms (ECGs) are a cost-effective alternative. We developed computer vision-based multi-task deep learning models to analyze 12-lead ECG 2D images, predicting MS and LVEF < 50%. Our dataset comprises 14,052 ECGs with clinical features, utilizing ground truth labels from CMR. Our top-performing model achieved AUC values of 0.838 (95% CI 0.812–0.862) for MS and 0.939 (95% CI 0.921–0.954) for LVEF < 50% classification, outperforming cardiologists. Moreover, MS predictions in a prevalence-specific test dataset recorded an AUC of 0.812 (95% CI 0.810–0.814). Extracted 1D signals from ECG images yielded inferior performance, compared to the 2D approach. In conclusion, our results demonstrate the potential of computer-based MS and LVEF < 50% classification from ECG scan images in clinical screening offering a cost-effective alternative to CMR.


Introduction

Coronary artery disease (CAD) is the leading cause of death and disability worldwide; its incidence is decreasing in developed countries but increasing in developing countries¹. CAD remains asymptomatic until the coronary stenosis becomes moderate or severe, resulting in chest pain, dyspnea, and syncope². However, approximately 30% of myocardial infarctions (MI) do not manifest with clear symptoms³. A missed MI diagnosis can lead to serious complications, such as left ventricular systolic dysfunction, heart failure (HF), and death.

HF was reported to affect 1–2% of the population^4,5, and that rate is expected to increase over the next decade, posing a significant healthcare burden⁶. Left ventricular ejection fraction (LVEF) is used to classify HF patients for appropriate treatment^7,8. Therefore, detecting MI and assessing LVEF at an earlier stage can help clinicians make better decisions about further investigation and appropriate treatment⁹.

Cardiac magnetic resonance (CMR) imaging can accurately and non-invasively assess the heart's functional and anatomical abnormalities, including characterizing myocardial scars (MS), which are common sequelae of MI^9,10. Importantly, MS was reported to be one of the critical determinants predicting the future development of heart failure¹¹. However, CMR is expensive and requires highly trained personnel to perform imaging and interpretation, limiting its availability in remote or underdeveloped settings.

Electrocardiogram (ECG) is frequently employed as an initial investigation for diagnosing cardiovascular diseases due to its accessibility, whereas CMR is typically reserved for more in-depth investigations¹². ECG scans contain patterns that can suggest cardiovascular diseases like CAD, MS, and left ventricular systolic dysfunction^13,14. Trained cardiologists can identify these patterns and diagnose the corresponding diseases. However, the availability of well-trained cardiologists is limited in remote or developing areas. Moreover, the interpretation of ECG scans is susceptible to both human error and interrater variability. Computer-based ECG interpretation could help mitigate these limitations. Previous studies reported that machine learning systems can detect cardiovascular diseases, such as MI, arrhythmias, and left ventricular systolic dysfunction, using a 12-lead ECG^14,15,16,17.

However, obtaining ECG data in machine-readable format (i.e., ECG tracings) can be challenging in resource-limited settings, including Thailand, since most ECG records are stored in paper or scanned format. Hence, computer image-based ECG classification for CAD detection may be advantageous in these circumstances.

In this paper, rather than using a separate classification model for each task, we introduce multi-task convolutional neural network (CNN) models that read 12-lead ECG scan images to identify both CAD scars and abnormal LVEF of less than 50% for clinical screening purposes. We provide our models as open-source software to improve CAD screening in resource-limited settings.

Results

Study population

A total of 13,707 patients and 14,826 ECGs were retrospectively enrolled in this study. To prevent cross-dataset contamination, 774 ECGs were excluded, resulting in a total of 14,052 ECGs. The population comprises two ECG formats, specifically the non-grid (old) format and the grid (new) format ECGs, collected using different machines. The baseline characteristics below are reported per ECG rather than per patient. The mean age across all ECGs was 72.29 ± 13.81 years, and 50.53% were acquired from male patients. The prevalence of MS and LVEF < 50% in the total ECGs was 27.11% and 18.72%, respectively, while 12.35% belonged to the LVEF < 40% group. A total of 10.04% of all ECGs had both subendocardial and transmural scarring. ECGs with only subendocardial scarring or only transmural scarring accounted for 9.45% and 7.61% of the population, respectively. Table 1 shows the baseline data of the overall ECG population and each dataset used in the study. Out of the total ECGs, 5,407 (38.48%) had no clinical feature data except for age and sex, while all other ECGs had complete clinical data. The missing data were imputed as described in the methods section. A summary workflow of the MS and LVEF classification system is shown in Fig. 1a.

Table 1 Baseline data of the overall study population, and of those included in each dataset.


Model performance

We trained eight deep learning models: (1) Multi-task both formats, (2) Multi-task old-format only, (3) Transferred multi-task model, (4) Single-task for MS classification (both formats), (5) Single-task for LVEF classification (both formats), (6) Multi-task with clinical features, (7) Single-task (MS) with clinical features, and (8) Single-task (LVEF) with clinical features.

Overall, our multi-task both formats model outperformed the transferred multi-task model and the multi-task old-format only model in both MS classification and LVEF classification, except for MS classification on new-format test dataset. The AUCs for the multi-task both formats model for MS classification were 0.838 (95% CI 0.812–0.862) and 0.811 (95% CI 0.788–0.832) for the old-format and new-format test datasets, respectively. For detecting LVEF < 50%, the AUCs of the multi-task both formats model were 0.939 (95% CI 0.921–0.954) and 0.931 (95% CI 0.915–0.944) for the old-format and new-format test datasets, respectively (Fig. 2). Model performance results compared to baseline prediction are shown in the Supplementary Data 1.
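To make the reported metric concrete, the sketch below computes an AUC and a 95% confidence interval for one binary task. This section does not state how the intervals were obtained, so the percentile bootstrap used here is an assumption chosen for illustration, not the authors' procedure. The function names `auc_score` and `bootstrap_auc_ci`, the number of resamples, and the seed are hypothetical.

```python
import numpy as np


def auc_score(y_true, y_score):
    """AUC from the Mann-Whitney U statistic; tied scores get average ranks."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    n_pos = int(y_true.sum())
    n_neg = y_true.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    order = np.argsort(y_score, kind="mergesort")
    sorted_scores = y_score[order]
    ranks = np.empty(y_score.size, dtype=float)
    i = 0
    while i < sorted_scores.size:
        j = i
        while j + 1 < sorted_scores.size and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        # Average 1-based rank of the tied block occupying sorted positions i..j.
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0
        i = j + 1
    u = ranks[y_true].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u / (n_pos * n_neg))


def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Point AUC plus a percentile-bootstrap interval (assumed CI method)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    point = auc_score(y_true, y_score)  # also validates that both classes exist
    rng = np.random.default_rng(seed)
    n = y_true.size
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, size=n)  # resample ECGs with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue  # skip resamples that contain a single class
        aucs.append(auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return point, float(lo), float(hi)
```

Calling `bootstrap_auc_ci(labels, probabilities)` once per task and per test dataset would give one point estimate and one interval for each cell of such a comparison.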

Regarding the ECG interpretation performance of cardiologists on new-format test data, the AUCs for MS classification by an experienced cardiologist and an in-training cardiologist were 0.683 (95% CI 0.659–0.707) and 0.657 (95% CI 0.632–0.681), respectively. Both cardiologists had similar sensitivity (44.1% vs. 44.50%, respectively); however, the experienced cardiologist had a higher specificity (92.60%) than the in-training cardiologist (86.40%) (Supplementary Fig. 1).

All of our developed models that were designed to evaluate ECG images as input features outperformed the XGBoost model, which was designed to evaluate only standard clinical features, by up to 50.40% (Fig. 2). The multi-task with clinical features model classified MS in the old-format test dataset with specificities of 66.92%, 44.61%, and 30.50% at 80.00%, 90.00%, and 95.00% sensitivity, respectively. For LVEF < 50% classification on the old-format test dataset, the model achieved specificities of 90.03%, 84.76%, and 66.90% at 80.00%, 90.00%, and 95.00% sensitivity, respectively (Fig. 3).
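Specificity at a fixed sensitivity is read off by choosing a decision threshold on the model's output score. The sketch below shows one way to do this: it scans candidate thresholds from high to low and returns the first one whose sensitivity reaches the target, which is also the one with the highest specificity. The function name `specificity_at_sensitivity` and the "score >= threshold means positive" convention are assumptions for illustration; the source does not describe its thresholding procedure.

```python
import numpy as np


def specificity_at_sensitivity(y_true, y_score, target_sensitivity):
    """Return (threshold, sensitivity, specificity) at the highest threshold
    whose sensitivity is at least the target. A case is called positive
    when its score is >= threshold (assumed convention)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true]
    neg = y_score[~y_true]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative")
    # Sensitivity only grows as the threshold drops, so the first hit
    # while scanning downward keeps specificity as high as possible.
    for thr in np.unique(y_score)[::-1]:
        sens = np.mean(pos >= thr)
        if sens >= target_sensitivity - 1e-12:
            spec = np.mean(neg < thr)
            return float(thr), float(sens), float(spec)
    raise ValueError("target sensitivity must be at most 1")
```

Evaluating this at targets of 0.80, 0.90, and 0.95 gives the kind of operating points listed above; the thresholds should be chosen on development data if the resulting specificity is to be reported on a test set.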

Incorporating clinical features

Incorporating clinical features into our multi-task models resulted in improved model performance in some cases (Fig. 2). For the single-task model, adding clinical features provided a performance boost only in MS classification in old-format test datasets. For the multi-task model, the performance boost was observed only in MS classification in new-format test datasets. The multi-task with clinical features model greatly outperformed the XGBoost model (which used only clinical features) with an AUC of 0.841 (95% CI 0.819–0.860) compared to 0.682 (95% CI 0.655–0.707) in MS classification using the new-format test dataset (Fig. 2).

Prevalence-specific analysis

When using the prevalence-specific test dataset, our multi-task both formats model achieved an AUC for MS classification of 0.812 (95% CI 0.810–0.814) and an F1-score of 0.931 (95% CI 0.931–0.932).

Performance in detecting LVEF < 40%

Our sensitivity analysis showed similar performance among models when comparing LVEF < 50% detection and LVEF < 40% detection. The multi-task model with clinical features achieved the highest AUC of 0.942 (95% CI 0.925–0.956) when using the old-format test dataset, while the multi-task model achieved the highest AUC of 0.939 (95% CI 0.924–0.951) when using the new-format test dataset (Supplementary Fig. 2).

Localization of models’ decision

We applied Grad-CAM++ to visualize the areas of ECG images that influenced the model decision. Figure 4 shows examples of heatmaps generated on top of the ECGs for the multi-task and multi-task with clinical features models. The heatmaps highlighted the areas containing ECG tracings, with greater emphasis on the areas associated with the models' decisions. In cases with MS and LVEF < 50%, we observed that the model focused on abnormal Q waves, QRS complexes, and T wave inversions. For cases with no MS and LVEF ≥ 50%, the multi-task model generally highlighted QRS complexes in leads I, II, V2, and V6, while the multi-task with clinical features model focused on lead I. Interestingly, the multi-task with clinical features model appears to focus on fewer leads than the multi-task model.

1D signal extraction and 1D model performance

We extracted one-dimensional (1D) signals from the ECG images and trained five 1D deep learning models: (1) Multi-task both formats, (2) Multi-task old-format only, (3) Transferred multi-task model, (4) Single-task for MS classification (both formats), (5) Single-task for LVEF classification (both formats).

The results consistently demonstrate the superior performance of our 2D CNN models over their 1D counterparts in both MS and LVEF range classification tasks (Figs. 2, 5, and Supplementary Data 2). Among the 1D models, our multi-task model stands out, maintaining the highest overall performance for both tasks.

Discussion

In this study, we demonstrated that a multi-task deep learning model could detect MS and classify the LVEF range using a 12-lead ECG scan image. Our top-performing models, the multi-task both formats model and multi-task model with clinical features, demonstrated high performance in detecting MS and LVEF < 50% in both old and new ECG formats. They consistently exhibited comparable or superior performance when compared to their single-task counterparts in the majority of scenarios. Additionally, our model also achieved a high AUC and F1-score in MS prediction from the prevalence-matched population. The F1 score achieved in this prevalence-matched population was notably higher than that in our test sets, where the prevalence exceeded 26%. This finding might suggest that the model performs better in populations with lower prevalence, which more closely resemble real-world populations.

Given the variation in ECG format among different machines, it becomes crucial for the ECG image classification model to acquire visual features that can be universally applied across these formats. Our findings reveal that amalgamating ECG scans from various formats into a unified training dataset resulted in improved performance compared to segregating datasets based on ECG format. Additionally, the multi-task model designed for the old ECG formats also exhibited commendable performance when tested with a new-format dataset, underscoring its efficacy in predicting ECGs with diverse formats. However, a preprocessing protocol is needed to automatically prepare the ECG image prior to interpretation by the model.

The multi-task model also has a computational advantage since it shares the same backbone for predicting MS and LVEF. Several studies demonstrated performance gains in using a multi-task model in medical image analysis¹⁸. In addition to the computational edge, the multi-task model may also have a clinical advantage. When determining whether a patient has MS, the ability to simultaneously predict LVEF range might indicate the impact of the scar on cardiac function. Likewise, when predicting LVEF < 50%, the model could suggest the etiology of impaired cardiac pumping by detecting the ischemic scar. Taken together, the advantages of our developed multi-task deep learning model suggest the strong potential of its use as a screening tool for MS and LVEF < 50% from ECG scan images in limited-resource settings.

Previous studies have demonstrated the potential of deep learning for MS detection, mainly focusing on raw ECG traces^19,20. A similar study achieved a comparable AUC to our model, with CMR as the gold standard²⁰. Another study, utilizing data from CMR-confirmed ECGs and a publicly available ECG dataset without CMR measurements, reported a model with superior performance to ours. However, their model combines vectorcardiography with ECG for prediction¹⁹. We hypothesize that vectorcardiography may provide more detailed heart conduction information than ECG alone.

The PhysioNet/Computing in Cardiology Challenge 2020 presented a large multi-institutional database of ECG signals with remarkable results, utilizing deep learning and CNN²¹. As we have no access to the raw ECG traces, we extracted the signals from the ECG images and trained 1D CNN models. The results show that the models trained on ECG images perform better than the ones trained on the extracted signals; these results align with previous research … study³⁵ was employed as the function estimator. We applied imputation only to the training dataset to maintain fairness during evaluation, leaving the data in the test sets unchanged.

1D signal extraction from images

We extracted a 1D signal from each 2D image by extracting the pixels for each lead. We excluded 5 training samples because their signals overlapped each other or overlapped with text. After extraction, several y-axis pixels (voltage axis) may correspond to the same x-axis pixel (time axis). We applied a simple preprocessing step that averages the voltage over duplicated time pixels. Additionally, we filled in missing extracted data using the next available values. We then aligned all extracted signals using Christov R-peak detection and cropped them to the same window of −0.5 to +1.5 s^36,37.
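The post-processing steps described above can be sketched with NumPy as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code. It assumes the traced pixels of one lead are already available as `(x, y)` coordinates and that the R-peak index comes from a separate detector (the Christov detector itself is not implemented here). The function names, the `samples_per_second` parameter, and the handling of a trailing gap (filled with the last valid value) are hypothetical choices. Converting pixel rows to millivolts is omitted.

```python
import numpy as np


def pixels_to_signal(x_pix, y_pix, width):
    """Average all y values that share a time column; unseen columns are NaN."""
    x_pix = np.asarray(x_pix, dtype=int)
    y_pix = np.asarray(y_pix, dtype=float)
    if x_pix.size and (x_pix.min() < 0 or x_pix.max() >= width):
        raise ValueError("x pixels must lie in [0, width)")
    sums = np.bincount(x_pix, weights=y_pix, minlength=width)
    counts = np.bincount(x_pix, minlength=width)
    signal = np.full(width, np.nan)
    seen = counts > 0
    signal[seen] = sums[seen] / counts[seen]
    return signal


def fill_with_next(signal):
    """Replace each NaN with the next available value (backfill)."""
    out = np.asarray(signal, dtype=float)
    valid = np.flatnonzero(~np.isnan(out))
    if valid.size == 0:
        raise ValueError("signal has no valid samples")
    nxt = np.searchsorted(valid, np.arange(out.size), side="left")
    # Assumption: a gap at the very end has no "next" value,
    # so it falls back to the last valid sample.
    nxt = np.minimum(nxt, valid.size - 1)
    return out[valid[nxt]]


def crop_around_r_peak(signal, r_peak, samples_per_second,
                       before_s=0.5, after_s=1.5):
    """Crop the window from -before_s to +after_s around an R-peak index."""
    start = r_peak - int(round(before_s * samples_per_second))
    stop = r_peak + int(round(after_s * samples_per_second))
    if start < 0 or stop > len(signal):
        raise ValueError("window extends beyond the signal")
    return np.asarray(signal)[start:stop]
```

Chaining the three functions per lead yields fixed-length windows that can be stacked into a 1D model input; every window then has the same number of samples, namely `(before_s + after_s) * samples_per_second`.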

Model development

We aimed to create a deep learning model that can classify MS and the LVEF range using an ECG scan image. We selected an LVEF cut point of < 50% since HF patients with LVEF < 50% are managed differently than patients with LVEF ≥ 50% (preserved ejection fraction)⁸.

We propose a multi-task ECG classification model that uses a neural network with a shared backbone for image feature extraction, and layers for multi-task MS and LVEF classification (Fig. 6). The model architecture is specified in the Supplementary Methods. We chose ResNet-like algorithms as the foundational framework for our model, drawing inspiration from their proven success in diverse image classification tasks, for example, ImageNet³⁸. Furthermore, existing research attests to the effectiveness of ResNet-like algorithms in 1D ECG classification tasks^39,40,41. Since we had both old and new ECG datasets, the model was trained under different training paradigms to compare the performance of the model between the two datasets. We propose three training paradigms and a baseline for comparison.

Multi-task MS and LVEF classification

In this paradigm, we combined the old- and new-format training and development datasets to train a multi-task model. In multi-task learning, we weighted the cross-entropy losses for MS and LVEF predictions equally (Fig. 6a). We also experimented with other MS-to-LVEF loss ratios, including 60:40 and 70:30. These ratios increase the loss contribution from scar predictions, so the model learns to prioritize MS prediction over LVEF (Supplementary Data 3).
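The loss weighting can be written as a convex combination of the two task losses. The NumPy sketch below is a minimal illustration, assuming each task is a binary classification with a predicted probability per ECG. The function names and the `ms_weight` parameter are hypothetical; a weight of 0.5 corresponds to the equal ratio, and 0.6 or 0.7 to the 60:40 and 70:30 ratios.

```python
import numpy as np


def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))


def multitask_loss(ms_true, ms_prob, lvef_true, lvef_prob, ms_weight=0.5):
    """Weighted sum of the MS loss and the LVEF < 50% loss.

    ms_weight is the share given to the MS task; the LVEF task
    receives the remaining 1 - ms_weight.
    """
    if not 0.0 <= ms_weight <= 1.0:
        raise ValueError("ms_weight must be between 0 and 1")
    loss_ms = binary_cross_entropy(ms_true, ms_prob)
    loss_lvef = binary_cross_entropy(lvef_true, lvef_prob)
    return ms_weight * loss_ms + (1.0 - ms_weight) * loss_lvef
```

With a larger `ms_weight`, the same MS error produces a larger share of the total loss and therefore of the gradient through the shared backbone, which is the sense in which the model is pushed to prioritize MS prediction.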

Transferred multi-task MS and LVEF classification

In terms of visual features, the old- and new-format ECG records could appear too distinct from each other for the model to learn from both formats simultaneously. Thus, this paradigm divides the training process into two steps. First, we pretrained the model using only the old-format data. We then fine-tuned the model using new-format ECGs (Fig. 6b).

Single-task MS and LVEF classification

Single-task models were trained to predict MS and LVEF separately (Fig. 6c). This paradigm helps verify whether multi-task training improves classification performance. We combined the data from the old and new formats for model training, with cross-entropy loss.

Baseline

The baseline prediction is the most frequent class for each task. For our data, the baseline predictions are no MS and LVEF ≥ 50%.
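A majority-class baseline of this kind can be sketched in a few lines. The function name `majority_class_baseline` and the choice of accuracy as the reported metric are assumptions for illustration; the source does not specify which metrics were computed for the baseline.

```python
import numpy as np


def majority_class_baseline(y_train, y_test):
    """Predict the most frequent training label for every test case.

    Returns the majority label and the resulting test accuracy.
    """
    values, counts = np.unique(y_train, return_counts=True)
    majority = values[np.argmax(counts)]
    y_pred = np.full(len(y_test), majority)
    accuracy = float(np.mean(y_pred == np.asarray(y_test)))
    return majority, accuracy
```

As a rough point of reference, with an MS prevalence of 27.11% across all ECGs, always predicting "no MS" would be correct for about 72.89% of ECGs while detecting no scars at all, which is why ranking metrics such as AUC are more informative than accuracy here.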