Introduction

Coronary artery disease (CAD) is the leading cause of death and disability worldwide; its incidence is decreasing in developed countries but increasing in developing countries1. CAD remains asymptomatic until the coronary stenosis becomes moderate or severe, resulting in chest pain, dyspnea, and syncope2. However, approximately 30% of myocardial infarctions (MI) do not manifest with clear symptoms3. A missed MI diagnosis can lead to serious complications, such as left ventricular systolic dysfunction, heart failure (HF), and death.

HF was reported to affect 1–2% of the population4,5, and that rate is expected to increase over the next decade, posing a significant healthcare burden6. Left ventricular ejection fraction (LVEF) is used to classify HF patients for appropriate treatment7,8. Therefore, detecting MI and reduced LVEF at an earlier stage can help clinicians make better decisions about further investigation and appropriate treatment9.

Cardiac magnetic resonance (CMR) imaging can accurately and non-invasively assess the heart's functional and anatomical abnormalities, including characterizing myocardial scars (MS), which are a common sequela of MI9,10. Importantly, MS was reported to be one of the critical determinants predicting the future development of heart failure11. However, CMR is expensive and requires highly trained personnel to perform imaging and interpretation, limiting its availability in remote or underdeveloped settings.

Electrocardiogram (ECG) is frequently employed as an initial investigation for diagnosing cardiovascular diseases due to its accessibility, whereas CMR is typically reserved for more in-depth investigations12. ECG scans contain patterns that can suggest cardiovascular diseases like CAD, MS, and left ventricular systolic dysfunction13,14. Trained cardiologists can identify these patterns and diagnose the corresponding diseases. However, the availability of well-trained cardiologists is limited in remote or developing areas. Moreover, the interpretation of ECG scans is susceptible to both human error and interrater variability. Computer-based ECG interpretation could help mitigate these limitations. Previous studies reported that machine learning systems can detect cardiovascular diseases, such as MI, arrhythmias, and left ventricular systolic dysfunction, using a 12-lead ECG14,15,16,17.

However, obtaining ECG data in machine-readable format (i.e., ECG tracings) can be challenging in resource-limited settings, including Thailand, since most ECG records are stored in paper or scanned format. Hence, computer image-based ECG classification for CAD detection may be advantageous in these circumstances.

In this paper, rather than using a separate classification model for each task, we introduce multi-task convolutional neural network (CNN) models that read 12-lead ECG scan images to identify both CAD scars and abnormal LVEF of less than 50% for clinical screening purposes. We provide our models as open-source software to improve CAD screening in resource-limited settings.

Results

Study population

A total of 13,707 patients and 14,826 ECGs were retrospectively enrolled in this study. To prevent cross-dataset contamination, 774 ECGs were excluded, leaving a total of 14,052 ECGs. The population comprises two ECG formats, the non-grid (old) format and the grid (new) format, collected using different machines. The following baseline characteristics are reported per ECG rather than per patient. The average patient age across all ECGs was 72.29 ± 13.81 years, and 50.53% of ECGs were acquired from male patients. The prevalence of MS and LVEF < 50% in the total ECGs was 27.11% and 18.72%, respectively, while 12.35% belonged to the LVEF < 40% group. A total of 10.04% of all ECGs had both subendocardial and transmural scarring. ECGs with only subendocardial scarring or only transmural scarring accounted for 9.45% and 7.61% of the population, respectively. Table 1 shows the baseline data of the overall ECG population and each dataset used in the study. Out of the total ECGs, 5,407 (38.48%) had no clinical feature data except for age and sex, while all other ECGs had complete clinical data. The missing data were imputed as described in the methods section. A summary workflow of the MS and LVEF classification system is shown in Fig. 1a.

Table 1 Baseline data of the overall study population, and of those included in each dataset.
Figure 1

Horizontal flow diagrams describing the ECG classification framework. (a) Summary workflow of MS and LVEF classification from 12-lead ECG scans using a multi-task deep neural network. ECG and CMR were performed on the same date. Both 2D ECG images and 1D extracted signals were input to CNNs to classify MS and LVEF. Evaluation was performed using test datasets in both formats, prevalence-specific analysis, Grad-CAM++, and comparison with cardiologists. (b) Data preprocessing: The ECG in PDF format is converted to an image, the gridlines are removed, and the leads are rearranged so that they are in the same order on every image. (c) Data splitting: The old-format and new-format datasets are both split by year into training, development, and test sets. CMR cardiac magnetic resonance, CNN convolutional neural network, ECG electrocardiogram, LVEF left ventricular ejection fraction, MS myocardial scar, PDF Portable Document Format.

Model performance

We trained eight deep learning models: (1) Multi-task both formats, (2) Multi-task old-format only, (3) Transferred multi-task model, (4) Single-task for MS classification (both formats), (5) Single-task for LVEF classification (both formats), (6) Multi-task with clinical features, (7) Single-task (MS) with clinical features, and (8) Single-task (LVEF) with clinical features.

Overall, our multi-task both formats model outperformed the transferred multi-task model and the multi-task old-format only model in both MS classification and LVEF classification, except for MS classification on the new-format test dataset. The AUCs for the multi-task both formats model for MS classification were 0.838 (95% CI 0.812–0.862) and 0.811 (95% CI 0.788–0.832) for the old-format and new-format test datasets, respectively. For detecting LVEF < 50%, the AUCs of the multi-task both formats model were 0.939 (95% CI 0.921–0.954) and 0.931 (95% CI 0.915–0.944) for the old-format and new-format test datasets, respectively (Fig. 2). Model performance results compared to baseline prediction are shown in Supplementary Data 1.

Figure 2

ROC curves of each evaluated 2D model. ROC curves showing the specificity and sensitivity of MS classification for each of the evaluated models using the old-format (a) and new-format (b) test sets; and of LVEF < 50% classification for each of the evaluated models using the old-format (c) and new-format (d) test sets. The AUC was quite similar between the old- and new-format test sets for all models for both MS and LVEF < 50%. AUC area under the ROC curve, LVEF left ventricular ejection fraction, MS myocardial scar, ROC receiver operating characteristic.

Regarding the ECG interpretation performance of cardiologists on new-format test data, the AUCs for MS classification by an experienced cardiologist and an in-training cardiologist were 0.683 (95% CI 0.659–0.707) and 0.657 (95% CI 0.632–0.681), respectively. Both cardiologists had similar sensitivity (44.1% vs. 44.50%, respectively); however, the experienced cardiologist had a higher specificity (92.60%) than the in-training cardiologist (86.40%) (Supplementary Fig. 1).

All of our developed models that were designed to evaluate ECG images as input features outperformed the XGBoost model, which was designed to evaluate only standard clinical features, by up to 50.40% (Fig. 2). The multi-task with clinical features model was able to classify MS in the old-format test dataset with specificities of 66.92%, 44.61%, and 30.50% at 80.00%, 90.00%, and 95.00% sensitivity, respectively. For LVEF < 50% classification using the old-format test dataset, the model achieved specificities of 90.03%, 84.76%, and 66.90% at 80.00%, 90.00%, and 95.00% sensitivity, respectively (Fig. 3).
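The specificity-at-fixed-sensitivity figures quoted above can be read off a list of predicted probabilities by scanning thresholds from strictest to most lenient. The following is a minimal sketch in plain Python, not the authors' implementation; the function name and the rule "predict positive when score >= threshold" are assumptions made for illustration.

```python
def specificity_at_sensitivity(labels, scores, target_sensitivity):
    """Return (threshold, sensitivity, specificity) for the highest threshold
    whose sensitivity reaches the target. A case is predicted positive when
    its score is >= threshold (assumed decision rule)."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    # Candidate thresholds: every distinct score, strictest first.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
        sensitivity = tp / positives
        if sensitivity >= target_sensitivity:
            return threshold, sensitivity, 1 - fp / negatives
    raise ValueError("target sensitivity cannot be reached")
```

Calling the function with targets of 0.80, 0.90, and 0.95 on a test set would give the three operating points of the kind reported for each task.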

Figure 3

Classification plots of the 2D multi-task with clinical features model. The plots show the TPR and FPR values at various probability thresholds with 95%CI of MS classification when using the old-format (a) and new-format (b) test sets; and of LVEF < 50% classification when using the old-format (c) and new-format (d) test sets. FPR false positive rate, LVEF left ventricular ejection fraction, MS myocardial scar, TPR true positive rate, 95%CI 95% confidence interval.

Incorporating clinical features

Incorporating clinical features into our multi-task models resulted in improved model performance in some cases (Fig. 2). For the single-task model, adding clinical features provided a performance boost only in MS classification in old-format test datasets. For the multi-task model, the performance boost was observed only in MS classification in new-format test datasets. The multi-task with clinical features model greatly outperformed the XGBoost model (which used only clinical features) with an AUC of 0.841 (95% CI 0.819–0.860) compared to 0.682 (95% CI 0.655–0.707) in MS classification using the new-format test dataset (Fig. 2).

Prevalence-specific analysis

When using the prevalence-specific test dataset, our multi-task both formats model achieved an AUC for MS classification of 0.812 (95% CI 0.810–0.814) and an F1-score of 0.931 (95% CI 0.931–0.932).

Performance in detecting LVEF < 40%

Our sensitivity analysis showed similar performance among models when comparing LVEF < 50% detection and LVEF < 40% detection. The multi-task model with clinical features achieved the highest AUC of 0.942 (95% CI 0.925–0.956) when using the old-format test dataset, while the multi-task model achieved the highest AUC of 0.939 (95% CI 0.924–0.951) when using the new-format test dataset (Supplementary Fig. 2).

Localization of models’ decision

We applied Grad-CAM++ to visualize the areas of the ECG images that influenced the models’ decisions. Figure 4 shows examples of heatmaps overlaid on the ECGs for the multi-task model and the multi-task with clinical features model. The heatmaps highlighted the areas containing ECG tracings, with greater emphasis on the areas associated with the models’ decisions. In cases with MS and LVEF < 50%, we observed that the model focused on abnormal Q waves, QRS complexes, and T wave inversions. For cases with no MS and LVEF ≥ 50%, the multi-task model generally highlighted QRS complexes in leads I, II, V2, and V6, while the multi-task with clinical features model focused on lead I. Interestingly, the multi-task with clinical features model appears to focus on fewer leads than the multi-task model.

Figure 4

Examples of the “Grad-CAM++” technique applied to ECG images. The “Grad-CAM++” technique for deep convolutional networks was applied to ECG images to highlight areas that contribute to the predictions of the multi-task both formats model and the multi-task with clinical features model in (a) 3 cases of MS and LVEF < 50% from the old-format test dataset, (b) 3 cases of no MS and LVEF ≥ 50% from the old-format test dataset, (c) 3 cases of MS and LVEF < 50% from the new-format test dataset, and (d) 3 cases of no MS and LVEF ≥ 50% from the new-format test dataset. The heatmaps were generated using the PyTorch library for CAM methods47. ECG electrocardiogram, LVEF left ventricular ejection fraction, MS myocardial scar.

1D signal extraction and 1D model performance

We extracted one-dimensional (1D) signals from the ECG images and trained five 1D deep learning models: (1) Multi-task both formats, (2) Multi-task old-format only, (3) Transferred multi-task model, (4) Single-task for MS classification (both formats), and (5) Single-task for LVEF classification (both formats).

The results consistently demonstrate the superior performance of our 2D CNN models over their 1D counterparts in both MS and LVEF range classification tasks (Figs. 2, 5, and Supplementary Data 2). Among the 1D models, our multi-task model maintained the highest overall performance for both tasks.

Figure 5

ROC curves of each evaluated 1D model. ROC curves showing the specificity and sensitivity of MS classification for each of the 1D models using the old-format (a) and new-format (b) test sets; and of LVEF < 50% classification for each of the evaluated models using the old-format (c) and new-format (d) test sets. AUC area under the ROC curve, LVEF left ventricular ejection fraction, MS myocardial scar, ROC receiver operating characteristic.

Discussion

In this study, we demonstrated that a multi-task deep learning model could detect MS and classify the LVEF range using a 12-lead ECG scan image. Our top-performing models, the multi-task both formats model and the multi-task model with clinical features, demonstrated high performance in detecting MS and LVEF < 50% in both old and new ECG formats. They exhibited comparable or superior performance to their single-task counterparts in the majority of scenarios. Our model also achieved a high AUC and F1-score in MS prediction from the prevalence-matched population. The F1 score achieved in this prevalence-matched population was notably higher than that in our test sets, where the prevalence exceeded 26%. This finding might suggest that the model performs better in populations with lower prevalence, which more closely resemble real-world populations.

Given the variation in ECG format among different machines, it becomes crucial for the ECG image classification model to acquire visual features that can be universally applied across these formats. Our findings reveal that amalgamating ECG scans from various formats into a unified training dataset resulted in improved performance compared to segregating datasets based on ECG format. Additionally, the multi-task model designed for the old ECG formats also exhibited commendable performance when tested with a new-format dataset, underscoring its efficacy in predicting ECGs with diverse formats. However, a preprocessing protocol is needed to automatically prepare the ECG image prior to interpretation by the model.

The multi-task model also has a computational advantage since it shares the same backbone for predicting MS and LVEF. Several studies demonstrated performance gains in using a multi-task model in medical image analysis18. In addition to the computational edge, the multi-task model may also have a clinical advantage. When determining whether a patient has MS, the ability to simultaneously predict LVEF range might indicate the impact of the scar on cardiac function. Likewise, when predicting LVEF < 50%, the model could suggest the etiology of impaired cardiac pumping by detecting the ischemic scar. Taken together, the advantages of our developed multi-task deep learning model suggest the strong potential of its use as a screening tool for MS and LVEF < 50% from ECG scan images in limited-resource settings.

Previous studies have demonstrated the potential of deep learning for MS detection, mainly focusing on raw ECG traces19,20. A similar study achieved a comparable AUC to our model, with CMR as the gold standard20. Another study, utilizing data from CMR-confirmed ECGs and a publicly available ECG dataset without CMR measurements, reported a model with superior performance to our model. However, their model combines vectorcardiography with ECG for prediction19. We hypothesize that vectorcardiography may provide more detailed heart conduction information than ECG alone.

The PhysioNet/Computing in Cardiology Challenge 2020 presented a large multi-institutional database of ECG signals with remarkable results, utilizing deep learning and CNN21. As we have no access to the raw ECG traces, we extracted the signals from the ECG images and trained 1D CNN models. The results show that the models trained on ECG images perform better than the ones trained on the extracted signals; these results align with previous research [...] study35 was employed as the function estimator. We applied imputation only to the training dataset to maintain fairness during evaluation, leaving the data in the test sets unchanged.

1D signal extraction from images

We performed 1D signal extraction from the 2D images by extracting the trace pixels for each lead. We excluded 5 training samples because their signals overlapped with each other or with text. After extraction, several y-axis pixels (voltage axis) may correspond to the same x-axis pixel (time axis). We used a simple preprocessing step that averages the voltage over such duplicated time pixels. Additionally, we filled in missing extracted data using the next available values. Then, we aligned all extracted signals using Christov R-peak detection and cropped them to the same window of −0.5 to +1.5 s36,37.
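The cleanup steps described above (averaging duplicated time pixels, filling gaps with the next available value, and cropping a fixed window around an R peak) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: all function names are hypothetical, the R-peak index is taken as given rather than computed with the Christov detector, the sampling rate in samples per second is a free parameter, and trailing gaps with no later value reuse the last observed value.

```python
from collections import defaultdict


def average_duplicate_columns(pixels):
    """pixels: iterable of (x, y) trace coordinates. Returns {x: mean y}."""
    columns = defaultdict(list)
    for x, y in pixels:
        columns[x].append(y)
    return {x: sum(ys) / len(ys) for x, ys in columns.items()}


def fill_with_next_value(column_means, width):
    """Build a dense signal of length `width`; a missing column takes the
    next available value to its right. Trailing gaps reuse the last
    observed value (assumption, the source does not specify this case)."""
    if not column_means:
        raise ValueError("no trace pixels")
    if min(column_means) < 0 or max(column_means) >= width:
        raise ValueError("column index outside the image width")
    signal = [None] * width
    next_value = None
    for x in range(width - 1, -1, -1):  # walk right to left
        if x in column_means:
            next_value = column_means[x]
        signal[x] = next_value
    last_seen = column_means[max(column_means)]
    return [last_seen if v is None else v for v in signal]


def crop_around_peak(signal, peak_index, samples_per_second,
                     before_s=0.5, after_s=1.5):
    """Crop from before_s seconds before the peak to after_s seconds after."""
    start = peak_index - round(before_s * samples_per_second)
    stop = peak_index + round(after_s * samples_per_second)
    if start < 0 or stop > len(signal):
        raise ValueError("window extends past the signal")
    return signal[start:stop]
```

Every cropped lead then has the same length, which is what allows the leads to be batched for a 1D network.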

Model development

We aimed to create a deep learning model that can classify MS and the LVEF range using an ECG scan image. We selected an LVEF cut point of < 50% since HF patients with LVEF < 50% are managed differently than patients with LVEF ≥ 50% (preserved ejection fraction)8.

We propose a multi-task ECG classification model that uses a neural network with a shared backbone for image feature extraction, and layers for multi-task MS and LVEF classification (Fig. 6). The model architecture is specified in the Supplementary Methods. We chose ResNet-like architectures as the foundation of our model because of their proven success in diverse image classification tasks, for example, ImageNet38. Furthermore, existing research attests to the effectiveness of ResNet-like architectures in 1D ECG classification tasks39,40,41. Since we had both old and new ECG datasets, the model was trained under different training paradigms to compare the performance of the model between the two datasets. We propose three training paradigms and a baseline for comparison.

Figure 6

Model architectures of the different training paradigms. (a) Multi-task model: This model combines old- and new-format datasets to train multi-task MS and LVEF range classification. (b) Transferred multi-task model: This model is pretrained using old-format ECG data and then fine-tuned using new-format ECG data. (c) Single-task model: In this model, the old and new ECG formats are combined to train the MS model and LVEF model separately. (d) Multi-task model with clinical features: This model combines clinical features with ECG image data to predict MS and LVEF range. (e) Single-task model with clinical features: This model combines clinical features with ECG image data and was trained separately to predict MS and LVEF. (f) 1D Single-task model: 1D ECG signals are extracted from the images to train the MS model and LVEF model separately. 1D one-dimensional, BRNN bidirectional recurrent neural network, ECG electrocardiogram, LVEF left ventricular ejection fraction, MS myocardial scar, RNN recurrent neural network.

Multi-task MS and LVEF classification

In this paradigm, we combined the old- and new-format training and development datasets to train a multi-task model. In multi-task learning, we weighted the cross-entropy losses of the MS and LVEF predictions equally (Fig. 6a). However, we also experimented with other MS-to-LVEF loss ratios, including 60:40 and 70:30. Doing so makes the loss contribution from scar predictions larger, so the model learns to prioritize MS prediction over LVEF (Supplementary Data 3).
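The loss weighting can be illustrated with a small sketch. It assumes that each task is a binary classification scored with binary cross-entropy and that the total loss is a convex combination of the two task losses; the function names and the `ms_weight` parameter are hypothetical, and the actual training code is described only in the Supplementary Methods.

```python
import math


def binary_cross_entropy(probabilities, labels, eps=1e-12):
    """Mean binary cross-entropy of predicted probabilities against 0/1 labels."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


def multitask_loss(ms_probs, ms_labels, lvef_probs, lvef_labels, ms_weight=0.5):
    """Weighted sum of the two task losses.
    ms_weight=0.5 is the equal weighting; 0.6 and 0.7 correspond to the
    60:40 and 70:30 ratios that favour the MS task."""
    if not 0.0 <= ms_weight <= 1.0:
        raise ValueError("ms_weight must lie in [0, 1]")
    ms_loss = binary_cross_entropy(ms_probs, ms_labels)
    lvef_loss = binary_cross_entropy(lvef_probs, lvef_labels)
    return ms_weight * ms_loss + (1 - ms_weight) * lvef_loss
```

With a larger `ms_weight`, errors on the MS task move the total loss more than equally sized errors on the LVEF task, which is the mechanism by which the optimizer is pushed toward MS accuracy.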

Transferred multi-task MS and LVEF classification

Regarding visual features, the old- and new-format ECG records could be too different from each other for the model to learn from both formats simultaneously. Thus, this paradigm divides the training process into two steps. First, we pretrained the model using only the old-format data. We then fine-tuned the model using new-format ECGs (Fig. 6b).

Single-task MS and LVEF classification

The single-task model was trained to predict MS and LVEF separately (Fig. 6c). This paradigm helps to verify if multi-task training helps improve classification performance. We combined the data from the old and new formats for model training, with cross-entropy loss.

Baseline

The baseline prediction is the most frequent class for each task. For our models, the baseline predictions are no MS and LVEF ≥ 50%.
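A majority-class baseline of this kind can be sketched in a few lines. The function names are hypothetical, and using accuracy as the comparison metric is an assumption made only for the illustration.

```python
from collections import Counter


def majority_class_baseline(train_labels):
    """Return the most frequent label in the training data."""
    if not train_labels:
        raise ValueError("no labels")
    return Counter(train_labels).most_common(1)[0][0]


def baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the training majority class."""
    prediction = majority_class_baseline(train_labels)
    return sum(1 for y in test_labels if y == prediction) / len(test_labels)
```

Because such a baseline always predicts the negative class here, it never detects a positive case, which is why threshold-free and imbalance-aware metrics are more informative than accuracy for these tasks.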

Incorporating clinical features

To validate whether the multi-task model can be enhanced via the addition of clinical features, we trained a model using both ECG scan images and patient clinical features (Fig. 6d,e). Comprehensive details of the model are in the Supplementary Methods.

We also trained an eXtreme Gradient Boosting (XGBoost)42 model with only the clinical features to compare the models. This model represents the predictive performance of the CAD screening approach without using ECG images.

1D models

For our 1D CNN, we used a ResNet architecture similar to that of our 2D CNN model. The structures are nearly identical, except for the first layer, where we replaced the 2D convolution with a 1D convolution (Fig. 6f). However, unlike the 2D model, the 1D version handles 12 lead signals instead of a single ECG image. We vectorized the lead signals and input them individually to the ResNet with shared weights. Finally, we averaged the embeddings from the 12 leads and passed them to the prediction head. Similar to our image-based experiments, we trained the model using three training paradigms: Multi-task MS and LVEF classification, Transferred multi-task MS and LVEF classification, and Single-task MS and LVEF classification.
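The "shared weights, then average" step can be shown without a deep learning framework. In the sketch below, `toy_encoder` is a hypothetical stand-in for the shared 1D ResNet (it only computes two summary statistics), and `average_lead_embeddings` is an illustrative name; the point is only that one encoder is applied to every lead and the per-lead embeddings are averaged element-wise.

```python
def toy_encoder(signal):
    """Hypothetical stand-in for the shared-weight 1D ResNet:
    maps a lead signal to a 2-element embedding (mean, peak-to-peak)."""
    return [sum(signal) / len(signal), max(signal) - min(signal)]


def average_lead_embeddings(lead_signals, encoder):
    """Apply the same encoder to each lead, then average the embeddings."""
    embeddings = [encoder(signal) for signal in lead_signals]
    size = len(embeddings[0])
    if any(len(e) != size for e in embeddings):
        raise ValueError("all embeddings must have the same length")
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(size)]
```

Since the same encoder processes every lead, the number of parameters does not grow with the number of leads, and the averaged embedding has a fixed size for the prediction head.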

Training and evaluation strategy

All models were trained on an NVIDIA RTX3080Ti Graphics Processing Unit (NVIDIA, Santa Clara, CA, USA). The training strategy is detailed in the Supplementary Methods. We evaluated the performance of all models using the old- and/or new-format test sets in MS and LVEF < 50% classification. We also compared the performance of the models trained on ECG images with that of the models trained on the extracted signals. We compared the areas under the receiver operating characteristic curves (AUCs) of the fully trained models on each test dataset against the baseline. To account for class imbalance, F1 scores were also evaluated. Since we aim to develop a model that could be deployed for screening purposes, we also assessed the true positive rate (TPR) and false positive rate (FPR) for each model at different probability threshold levels.
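For reference, the metrics named above have simple definitions that can be written out directly. The paper reports using scikit-learn for metric calculation; the sketch below is a plain-Python illustration of the same quantities rather than that code, and the function names and the "score >= threshold" decision rule are assumptions.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: the probability that a randomly chosen positive
    scores higher than a randomly chosen negative (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def threshold_metrics(labels, scores, threshold):
    """TPR, FPR, and F1 when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    return {"tpr": tpr, "fpr": fpr, "f1": f1}
```

The AUC does not depend on any threshold, whereas TPR, FPR, and F1 change with the chosen threshold, which is why the threshold-level plots complement the ROC curves for a screening use case.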

To compare the performance of our models with the cardiologists, we recruited two cardiologists—one experienced and one in training. We asked the two cardiologists to independently review and interpret ECGs from new-format test datasets. The details of the interpretation process are in the Supplementary Methods. AUC, sensitivity, specificity, and F1 score were the diagnostic performance parameters used as evaluation metrics.

Statistical analysis

All statistical analyses were performed using Python, version 3.8.13 (Python Software Foundation, Wilmington, DE, USA) and Medcalc, version 20.100 (MedCalc Software Ltd., Ostend, Belgium). The scikit-learn library (scikit-learn, Paris, France)43 was used for evaluation metrics calculation. A two-sided 95% confidence interval (95%CI) of AUC values was estimated using the binomial exact method44.
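The binomial exact (Clopper-Pearson) interval for a proportion can be computed by inverting the binomial distribution function. The sketch below shows that generic construction for `successes` out of `trials`; how the AUC is mapped onto a proportion and a sample size is not stated here, so this is an illustration of the interval itself, not of the exact procedure used in the study. It uses bisection and direct summation, which is adequate for small samples only; larger samples call for a statistical library routine.

```python
import math


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), by direct summation (small n only)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(successes, trials, confidence=0.95):
    """Two-sided exact binomial confidence interval for a proportion."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("require 0 <= successes <= trials and trials > 0")
    alpha = 1 - confidence

    def solve(k, target):
        # binomial_cdf(k, trials, p) decreases as p grows, so bisect on p.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if binomial_cdf(k, trials, mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: P(X >= successes | p) = alpha / 2.
    lower = 0.0 if successes == 0 else solve(successes - 1, 1 - alpha / 2)
    # Upper bound: P(X <= successes | p) = alpha / 2.
    upper = 1.0 if successes == trials else solve(successes, alpha / 2)
    return lower, upper
```

The interval is called exact because it is derived from the binomial distribution itself rather than from a normal approximation, at the cost of being somewhat conservative.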

Localization of models’ decision

To understand the rationale behind the model’s prediction, we employed “Grad-CAM++”, a technique to visualize the activation map for CNN45. This technique generated a heatmap that highlights the pertinent region in the ECG image that significantly contributes to the prediction of the model. Red areas indicated higher attention from the model, while blue areas indicated less contribution to the prediction.

Sensitivity analysis

Prevalence-specific analysis

The reported prevalence of MS was estimated to be 7.9% in the general US population aged 45–84 years with no clinical cardiovascular disease46. We evaluated the performance of our model using a new-format test set with a case–control ratio matching a prevalence of MS of 8.0%. We applied bootstrap sampling to simulate a screening program in a normal population. This was achieved by randomly selecting 460 cases and 40 controls from the test dataset 1000 times to estimate the AUC (Fig. 1a).
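The resampling scheme can be sketched as a stratified bootstrap that draws a fixed number of MS-positive and MS-negative ECGs in each round and recomputes the AUC. This is a hedged illustration, not the authors' code: the function names are hypothetical, the per-round counts `n_pos` and `n_neg` are left as parameters, and both sampling with replacement and the percentile interval are assumptions.

```python
import random


def roc_auc(labels, scores):
    """Rank-based AUC; ties between a positive and a negative count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def prevalence_matched_bootstrap(labels, scores, n_pos, n_neg,
                                 n_rounds=1000, seed=0):
    """Return (mean AUC, 2.5th percentile, 97.5th percentile) over rounds.
    Each round draws n_pos positives and n_neg negatives with replacement,
    so the simulated prevalence is n_pos / (n_pos + n_neg)."""
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    if not pos_idx or not neg_idx or n_pos < 1 or n_neg < 1:
        raise ValueError("need positives and negatives in data and sample")
    aucs = []
    for _ in range(n_rounds):
        sample = rng.choices(pos_idx, k=n_pos) + rng.choices(neg_idx, k=n_neg)
        aucs.append(roc_auc([labels[i] for i in sample],
                            [scores[i] for i in sample]))
    aucs.sort()
    mean_auc = sum(aucs) / len(aucs)
    lower = aucs[int(0.025 * (len(aucs) - 1))]
    upper = aucs[int(0.975 * (len(aucs) - 1))]
    return mean_auc, lower, upper
```

Fixing the class counts in every round keeps the simulated prevalence constant, so the spread of the resulting estimates reflects sampling variability rather than fluctuations in case mix.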

Performance in detecting LVEF < 40%

We evaluated the performance of the models trained to predict LVEF < 50% and then used them to predict LVEF < 40% in the test population. This 40% LVEF cut point was selected to evaluate the performance of the models for detecting patients with heart failure with reduced ejection fraction.

Conclusion

The results of this study demonstrate the potential of computer-based MS and LVEF classification using ECG scan images in a clinical screening context. Implementing this prediction model would reduce costs, decrease dependence on CMR, facilitate care in areas with a shortage of cardiologists, and accelerate treatment initiation or referral in identified cases. These results lay the groundwork for improved CAD screening in Thailand and in other limited-resource settings worldwide. Further study is needed to explore the model's applicability to other cardiovascular diseases.