Introduction

Coronavirus disease 2019 (COVID-19) is an illness caused by a novel coronavirus, now named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The first outbreak occurred in December 2019 in Wuhan City, China [1]. The World Health Organization (WHO) declared COVID-19 a global pandemic on March 11, 2020. As of June 24, 2021, the disease had escalated to 180 million cases, with 3.9 million deaths and 165 million recoveries [2]. Among the worst-hit nations are the USA, India, and Brazil.

Effective screening is essential to triage patients and treat them accordingly. COVID-19 is diagnosed by real-time reverse transcription-polymerase chain reaction (RT-PCR) of nasopharyngeal swabs [3]. Chest X-ray (CXR) imaging and computed tomography (CT) are essential supplementary diagnostic tools for investigating patients suspected of having COVID-19, and they are also vital for patient follow-up. However, triaging patients accurately requires an experienced and certified radiologist. CXR findings are often non-specific, making it challenging to determine whether they are caused by COVID-19 or by other conditions. Therefore, a computer-aided diagnosis system with automatic classification of lung abnormalities would be beneficial to assist radiologists in confirming their diagnoses and to speed up the process.

Recently, many researchers have used convolutional neural networks (CNNs), a class of deep learning algorithms, to assist in the diagnosis of COVID-19. Deep learning uses automatic feature extraction and pattern recognition to classify an image. A CNN is based on the shared-weight architecture of convolution kernels, or filters, which slide along the input features and produce feature maps. A CNN also contains fully connected layers, in which each neuron in one layer is connected to all neurons in the next layer. In each layer, the data are transformed into a higher and more abstract representation; the deeper the network, the more complex the information learned. CNNs are commonly used for image classification and segmentation.
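To make these components concrete, the following is a minimal sketch of a small CNN in MATLAB's Deep Learning Toolbox (a toy example of our own, not one of the 18 models studied here):

```matlab
% Minimal CNN sketch: convolution with shared-weight kernels, pooling,
% and a fully connected classifier, as described above.
layers = [
    imageInputLayer([224 224 3])                  % input image
    convolution2dLayer(3, 16, 'Padding', 'same')  % 16 shared-weight 3x3 kernels
    reluLayer                                     % non-linearity
    maxPooling2dLayer(2, 'Stride', 2)             % downsample the feature maps
    fullyConnectedLayer(2)                        % each neuron connects to all inputs
    softmaxLayer                                  % class probabilities
    classificationLayer];                         % cross-entropy output
```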

Wang et al. proposed the COVID-Net model, which combined human-driven principled network design prototyping with machine-driven design exploration to detect COVID-19 cases from CXR images [4]. They used residual architecture design principles in the first stage of human-driven principled network design. They then used generative synthesis to identify the optimal macro-architecture and micro-architecture designs for the COVID-Net model. They reported an accuracy of 92.6% on the test dataset, a sensitivity of 87.1% for COVID-19 cases, and a high positive predictive value (PPV) of 96.4% for COVID-19 cases. Mangal et al. used a pre-trained CheXNet [...].

The contributions of this study are as follows:

  • Quantitative analysis using six assessment metrics on 18 CNN models with transfer learning for diagnosing COVID-19 on CXR images. This is an objective assessment by computer.

• Visual identification of COVID-19 pneumonia-related lung changes on 50 CXR images by two certified radiologists. This serves as the ground truth of the diagnosis.

  • Qualitative analysis of the top four and bottom three CNN models using Grad-CAM heatmaps, performed by two certified radiologists in comparison with the ground truth. This is a subjective assessment by radiologists.

Material and Methods

    Overview of 18 CNN Architectures

VGG uses up to 19 weight layers and was considered a very deep convolutional network in its era for large-scale image classification. Its authors explored conventional Convolutional Networks (ConvNets) and increased the depth of the architecture with very small (3 × 3) convolution filters [19]. Our study used two versions of VGG, VGG-16 and VGG-19, where the number represents the number of layers. ResNet explicitly reformulates the layers as learning residual functions with reference to the layer inputs. Its baselines were inspired by the VGG nets, except that this model has fewer filters and lower complexity [20]. Our study used three versions of ResNet, ResNet-18, ResNet-50, and ResNet-101, where the number represents the number of layers. AlexNet comprises 5 convolution layers and 3 fully connected layers with a final 1000-way softmax layer. Its authors used the "dropout" regularization method to reduce overfitting and non-saturating neurons to make training faster [21]. SqueezeNet is a small CNN architecture with accuracy equivalent to AlexNet, although it has 50 times fewer parameters and is 510 times smaller than AlexNet. It replaces the 3 × 3 filters with 1 × 1 filters, decreases the number of input channels to the 3 × 3 filters, and downsamples late in the network so that the convolution layers have large activation maps [22].

Inception-v3 scales up the networks by factorizing convolutions and applying aggressive dimension reductions inside the neural network. Its authors demonstrated the training of high-quality networks on relatively modest-sized training sets by combining a lower parameter count with additional regularization from batch-normalized auxiliary classifiers and label smoothing. They showed high-quality results for a low receptive field resolution of 79 × 79, which could help detect relatively small objects [23]. GoogLeNet applies the Inception network, and its architecture is based on the Hebbian principle and the intuition of multi-scale processing. The main benefit is that it allows the depth and width of the network to increase without a huge increase in computational complexity [24]. Inception-ResNet-v2 combines the ideas of residual connections and the Inception architecture; the residual connections accelerate the training of the Inception networks and significantly improve recognition performance [25]. The Xception architecture was inspired by the Inception module, but it is entirely based on depth-wise separable convolutions with linear residual connections. It uses the same number of parameters as Inception-v3 but makes more efficient use of them [26].

DarkNet-19 uses 3 × 3 filters and doubles the number of channels after every pooling step. It uses global average pooling to make predictions and 1 × 1 filters to compress the feature representation between 3 × 3 convolutions [27]. DarkNet-53 is a variant of DarkNet-19 with 53 convolutional layers [28]. DenseNet-201 connects each layer to every other layer in a feed-forward fashion: the feature maps of all preceding layers are used as inputs to each layer, and its own feature maps are used as inputs to all subsequent layers. This design alleviates the vanishing gradient problem, improves feature propagation, encourages feature reuse, and reduces the number of parameters [6].

MobileNet-v2 is a mobile architecture based on an inverted residual structure and linear bottlenecks, with shortcut connections between the thin bottleneck layers. The intermediate expansion layer uses lightweight depth-wise convolutions to filter the features. Its architecture consists of an initial full convolution layer with 32 filters followed by 19 residual bottleneck layers [29]. ShuffleNet utilizes pointwise group convolution and channel shuffle, reducing computation cost while maintaining accuracy. Its computation is 13 times faster than AlexNet for comparable classification accuracy, and it was designed for mobile devices [30]. NasNet designs a new search space to find an architectural building block on a small dataset and then transfers the block to a larger dataset. Its authors used neural architecture search (NAS) as the primary search method, along with a new regularization technique called "Scheduled Drop Path" that improves generalization [31]. Our study used two versions of NasNet, NasNet-Large and NasNet-Mobile.

    Dataset Preparation

The CXR images in our study were obtained from the public and private domains. The dataset from the public domain is called COVIDx [32], which consists of CXR images from five sources: Actualmed COVID-19 Chest X-Ray Dataset Initiative (Actmed) [33], COVID-19 Image Data Collection: Prospective Predictions Are the Future (COHEN) [34], Fig. 1 COVID-19 Chest X-Ray Dataset Initiative (Fig1) [35], COVID-19 Radiography Database (SIRM) [36], and RSNA Pneumonia Detection Challenge (RSNA) [37]. The public-domain datasets are available on the websites listed in the references. The dataset from the private domain was provided by the Department of Biomedical Imaging, Faculty of Medicine, University of Malaya (UM), Malaysia. The private-domain dataset is not available to the public, following the ethics agreement that restricts its use to this study only. We obtained CXR images of both normal and COVID-19 subjects from both the public and private domains. We chose CXR images in the posteroanterior (PA) and anteroposterior (AP) views of the lung for this study. The number of images from each domain and source is recorded in Table 2. The sizes of the normal images range from 1024 × 1024 (smallest) to 2520 × 3032 (largest); the COVID-19 images range from 220 × 206 (smallest) to 4280 × 3520 (largest). There are no specific gray levels in the public-domain images since they were taken from various databases. The private-domain DICOM images have a 12-bit pixel depth, corresponding to 4096 gray levels in each CXR image. Figure 1 shows a COVID-19 CXR image and a normal lung CXR image provided by UM.

    Fig. 1
    figure 1

    CXR images for a a patient diagnosed with COVID-19 and b a normal lung

Table 2 The number of CXR images obtained from the public and private domains

The 18 CNN models were trained with a combined dataset consisting of 200 normal CXR images (100 from COVIDx and 100 from UM) and 200 COVID-19 CXR images (100 from COVIDx and 100 from UM). These images were split at a ratio of 7:3 for training and validation: for each class (normal and COVID-19), 140 images were used for training and 60 for validation. The remaining images were used for testing the CNN models to evaluate their performance. The testing dataset consists of 150 normal CXR images (100 from COVIDx and 50 from UM) and 150 COVID-19 CXR images (100 from COVIDx and 50 from UM). The dataset split for training, validation, and testing is recorded in Table 3, and a code sketch of the split follows Table 3.

    Table 3 The implementation details of the dataset split for training, validation, and testing
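For illustration, a minimal MATLAB sketch of such a per-class 7:3 split (the folder layout and variable names are our own assumptions, not the study's actual script):

```matlab
% Sketch of the 7:3 training/validation split per class; assumes the images
% are organized in subfolders named 'covid' and 'normal' (assumed layout).
imds = imageDatastore('dataset/train_val', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
[imdsTrain, imdsVal] = splitEachLabel(imds, 0.7, 'randomized');  % 140/60 per class
```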

    Hardware and Software

The training, validation, and testing of the CNN models were performed using an Intel(R) Core(TM) i5-10500 CPU @ 3.10 GHz with 8 GB RAM. The YAKAMI DICOM Tool [38] was used to convert the DICOM images to JPEG file format. Then, the Deep Network Designer Toolbox in MATLAB R2020b (The MathWorks, Inc.) was used for training and testing the 18 CNN models. The MATLAB Grad-CAM Library [39] was used to run Gradient-weighted Class Activation Mapping (Grad-CAM) to visualize the classification decisions.
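The same DICOM-to-JPEG conversion could also be scripted directly in MATLAB; a minimal sketch under that assumption (file names are placeholders):

```matlab
% Sketch of DICOM-to-JPEG conversion (the study used the YAKAMI DICOM
% Tool [38]); rescales the 12-bit pixel data to 8 bits for JPEG output.
img  = dicomread('chest.dcm');         % placeholder file name
img8 = uint8(255 * mat2gray(img));     % map the 4096 gray levels to 0-255
imwrite(img8, 'chest.jpg');
```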

    Transfer Learning

Our study applied transfer learning to the 18 CNN models available in MATLAB's Deep Network Designer. The 18 CNN models were previously trained on the ImageNet images [40]. Since we do not have a large dataset of CXR images to train a deep learning model from scratch, transfer learning was applied to the pre-trained CNN models. In this approach, the CNN models are used as feature extractors while keeping their initial architecture. Referring to Fig. 2, the lower layers forming the feature extractor portion are frozen. The original fully connected, softmax, and classification output layers are removed and replaced with a new set with an output size of 2 for the binary classification of the COVID-19 and normal classes. We did not attempt to optimize the CNN models or adjust their weights in the feature learning portions. Transfer learning is an efficient and common approach for a considerably small dataset, as it avoids training the CNN models from scratch.
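As an illustration, a hedged MATLAB sketch of this setup for VGG-16 (the layer names 'fc8' and 'output' follow MATLAB's pre-trained vgg16 network; the replacement layer names are our own):

```matlab
% Transfer-learning sketch for VGG-16: replace the final layers for the
% binary classification (COVID-19 or normal) while keeping the pre-trained
% feature extractor.
net = vgg16;                                       % ImageNet pre-trained model
lgraph = layerGraph(net.Layers);
lgraph = replaceLayer(lgraph, 'fc8', ...
    fullyConnectedLayer(2, 'Name', 'fc_covid'));   % new 2-class layer
lgraph = replaceLayer(lgraph, 'output', ...
    classificationLayer('Name', 'class_covid'));   % new output layer
% Earlier layers can be frozen by setting their WeightLearnRateFactor and
% BiasLearnRateFactor properties to 0 before assembling the layer graph.
```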

    Fig. 2
    figure 2

    A pre-trained CNN architecture is adapted with transfer learning to perform a binary classification (COVID-19 or normal)

This study used the recommended default hyperparameter settings provided by MathWorks' Deep Learning Guide. Figure 3 shows the training settings used for all the CNN models in this study (a code sketch of such a configuration follows Fig. 3). No hyperparameter tuning was performed, since tuning is not the main focus of this study.

    Fig. 3
    figure 3

    Training setting of the hyperparameters for all the CNN models
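For illustration, a training configuration of this kind might be expressed as follows in MATLAB; the hyperparameter values below are placeholders, not necessarily the defaults shown in Fig. 3:

```matlab
% Sketch of a training run; 'lgraph', 'imdsTrain', and 'imdsVal' come from
% the earlier sketches, and the option values are placeholders.
augTrain = augmentedImageDatastore([224 224], imdsTrain);  % resize to input size
augVal   = augmentedImageDatastore([224 224], imdsVal);
options = trainingOptions('sgdm', ...
    'InitialLearnRate', 1e-4, ...          % placeholder value
    'MaxEpochs', 10, ...                   % placeholder value
    'MiniBatchSize', 32, ...               % placeholder value
    'ValidationData', augVal, ...
    'Shuffle', 'every-epoch', ...
    'Plots', 'training-progress');
trainedNet = trainNetwork(augTrain, lgraph, options);
```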

Assessment Metrics

    There are four possible outcomes in a confusion matrix for binary classification: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). True positive (TP) refers to the number of cases correctly classified as positive where the disease is present. True negative (TN) refers to the number of cases correctly classified as negative where the disease is absent. False negative (FN) refers to the number of cases wrongly classified as negative where the disease is present. False positive (FP) refers to the number of cases wrongly classified as positive where the disease is absent.

    The TP, TN, FN, and FP are used to calculate the assessment metrics including specificity, sensitivity, precision, NPV, accuracy, and F1-score. These metrics are used to evaluate the performance of the 18 CNN models in this study. The formulas for the specificity, sensitivity (or recall), precision, NPV, accuracy, and F1-score are given in Eq. (1) to Eq. (6), respectively:

$$\text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}},$$
(1)
$$\text{Sensitivity/Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}},$$
(2)
$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}},$$
(3)
$$\text{Negative Predictive Value (NPV)} = \frac{\text{TN}}{\text{TN} + \text{FN}},$$
(4)
$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}},$$
(5)
$$\text{F1 score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \times \text{TP}}{(2 \times \text{TP}) + \text{FN} + \text{FP}}.$$
(6)
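For reference, a minimal MATLAB sketch of these six formulas (the function and field names are our own):

```matlab
% Compute the six assessment metrics from binary confusion-matrix counts.
function m = assessmentMetrics(tp, tn, fp, fn)
    m.specificity = tn / (tn + fp);                   % Eq. (1)
    m.sensitivity = tp / (tp + fn);                   % Eq. (2), also recall
    m.precision   = tp / (tp + fp);                   % Eq. (3)
    m.npv         = tn / (tn + fn);                   % Eq. (4)
    m.accuracy    = (tp + tn) / (tp + tn + fp + fn);  % Eq. (5)
    m.f1          = 2*tp / (2*tp + fn + fp);          % Eq. (6)
end
```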

    Majority Voting

Majority voting has been adopted with deep learning to improve COVID-19 detection accuracy [41, 42]. Our study used the hard approach of majority voting, which assigns each image the class label that receives the highest number of votes among all the CNN models. It was applied to the 18 CNN models and then repeated for the top 4 CNN models, i.e., those with an accuracy higher than 90%.
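A minimal sketch of hard majority voting in MATLAB (the variable name is our own; note that MATLAB's mode resolves ties by picking the first category):

```matlab
% Hard majority voting: 'predictions' is an N-by-M categorical array of
% predicted labels (N test images, M CNN models). Taking the mode along
% the second dimension assigns each image its most frequently voted class.
votedLabels = mode(predictions, 2);
```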

    Qualitative Analysis with Grad-CAM

The predictions made by the CNN models can be evaluated quantitatively using the assessment metrics described earlier. However, these metrics do not tell us which parts of the images were used as features in the decision-making. Therefore, it is equally important to display a "visual explanation" of the decision made by the CNN models. We used Grad-CAM for this purpose [39]. It uses the gradients of any target concept flowing into the final convolutional layer to produce a coarse localization map highlighting the regions of the image that were significant for the prediction, making it a useful tool for interpreting a model's decision. For each of the 18 CNN models, the feature map layer used to produce the Grad-CAM heatmap is specified in Table 4 (a code sketch follows the table).

Table 4 The selected feature map layer used to produce the Grad-CAM heatmap in each CNN model
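For illustration, newer MATLAB releases (R2021a and later) ship a built-in gradCAM function; the study itself used the separate Grad-CAM library [39] with R2020b. A hedged sketch:

```matlab
% Grad-CAM sketch using MATLAB's built-in gradCAM (R2021a+); 'net' is a
% trained classification network and 'img' a 224x224x3 CXR image.
label = classify(net, img);              % predicted class for the image
scoreMap = gradCAM(net, img, label);     % coarse localization map
imshow(img); hold on;
imagesc(scoreMap, 'AlphaData', 0.5);     % overlay the heatmap on the CXR
colormap jet; hold off;
```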

From the quantitative analysis, we chose the top four and bottom three CNN models for further qualitative analysis. We produced the Grad-CAM heatmaps of the COVID-19 testing dataset (50 CXR images from UM). Two certified radiologists, with more than 5 and 10 years of CXR interpretation experience respectively, independently evaluated these CXR images by drawing a contour over the infected region within the lung using the ITK-SNAP software [43]. For each CXR image, the radiologists were given seven Grad-CAM heatmaps (top four and bottom three models) and voted for the heatmap closest to their diagnosis, as indicated by the contour of the infected region. If more than one heatmap identified the correct region, each was given one vote; if all the heatmaps showed the wrong region, no vote was given for that image. This process was repeated for the 50 CXR images with seven heatmaps each. The bottom three CNN models were included in this vote to verify that they were indeed the least accurate models compared with the top four. The radiologists performed a blind analysis during the voting, without knowing the names of the CNN models. We aim to find the most suitable CNN models for COVID-19 detection by combining both quantitative and qualitative analysis.

    Results

    Quantitative Analysis

Table 5 records the depth of layers, total layers (convolution, dense, pooling, etc.), and the number of parameters (in millions) for the 18 CNN models, arranged from the highest to the lowest number of parameters. All the models used the same input image size of 224 × 224 × 3. Figure 4 shows the training time (left bars) and the validation and testing accuracy (right bars) for each model, arranged from the highest to the lowest number of parameters. Generally, a model with a larger number of parameters requires a longer training time.

    Table 5 The depth of layers, total layers, number of parameters, and image input size for the 18 CNN models are arranged from the highest to the lowest number of parameters
    Fig. 4
    figure 4

Training time (left bars) and validation and testing accuracy (right bars) for the 18 CNN models, in descending order of the number of parameters (given in brackets, in millions)

SqueezeNet used the smallest number of parameters (1.24 million) and the shortest training time (514 s = 8 min 34 s), yet achieved a relatively good validation accuracy of 92.5% and testing accuracy of 90.67%. VGG-16 had the highest validation accuracy of 96.67% and testing accuracy of 94.33%, but a relatively long training time (3942 s = 1 h 5 min 42 s), suggesting a tradeoff between training time and accuracy. Nevertheless, NasNet-Large used the longest training time yet had the lowest validation and testing accuracy. Therefore, the relation between the training time and the validation and testing accuracy is inconclusive among these 18 CNN models.

Table 6 records the classification results (TP, FP, FN, and TN) for the 18 CNN models, arranged from the highest to the lowest number of parameters, and for the majority voting with the 18 models and with the top 4 models. These values were used to calculate the assessment metrics specificity, sensitivity, precision, NPV, accuracy, and F1-score, as recorded in Table 7, where the 18 CNN models are arranged from the highest to the lowest accuracy (%). VGG-16 had the highest accuracy of 94.3%, the highest specificity of 93.5%, the highest precision of 93.3%, and the highest F1-score of 94.3%. VGG-19 demonstrated the highest sensitivity of 95.6% and the highest NPV of 96.0%; DarkNet-19 and GoogLeNet also achieved the highest NPV of 96.0%. The top 4 models, identified by an accuracy higher than 90%, are VGG-16, ResNet-101, VGG-19, and SqueezeNet. The majority voting with the 18 models produced an accuracy of 93.0%, lower than the 94.0% accuracy of the majority voting with the top 4 models.

    Table 6 Classification results (TP, FP, FN, TN) for the 18 CNN models, arranged in the descending order of the number of parameters; and for the majority voting with 18 models and the top 4 models
    Table 7 Assessment metric values for the 18 CNN models (arranged from the highest to the lowest accuracy) and for the majority voting with 18 and the top 4 models

The assessment metric results in Table 7 are plotted in Fig. 5 from the highest to the lowest number of parameters as the plot moves from left to right. DarkNet-53 demonstrated the most consistent values among the six assessment metrics, while Xception showed the largest variation. No clear trend of performance with the number of parameters used in each CNN model is observed. Our study focuses on the performance of different types of CNN models, rather than the number of parameters, for diagnosing COVID-19. The majority voting, with either the 18 models or the top 4 models, produced consistently higher values across all the assessment metrics. The confusion matrices of the top 4 models, the majority voting with the 18 models, and the majority voting with the top 4 models are plotted in Fig. 6.

    Fig. 5
    figure 5

    Assessment metrics for the 18 CNN models, arranged from the highest to the lowest number of parameters as the plot moves from the left to right, and for the majority voting with 18 models and the top 4 models

    Fig. 6
    figure 6

    Confusion matrices for the top 4 models (VGG-16, ResNet-101, VGG-19, SqueezeNet), majority voting with 18 models and the top 4 models

    Qualitative Analysis

From the quantitative results in Table 7 and Fig. 5, it remains inconclusive which CNN model is best at distinguishing COVID-19 from normal lung CXR images. Therefore, qualitative analysis is necessary to investigate the most suitable CNN model for diagnosing COVID-19. Figure 7a and b show the Grad-CAM heatmaps of the 18 CNN models for a correctly classified COVID-19 and normal CXR image, respectively. The red region is the most significant region from which the CNN models extracted "features" during the prediction process, while the blue region is the least significant region for decision-making. Some of the red regions used for decision-making are not within the thoracic cavity; hence, the predictions of some CNN models were based on features from the wrong region even though they produced true positive (TP) or true negative (TN) results. The ground truth of the infected lung area, identified by two radiologists, is shown in the bottom right corner of Fig. 7a. The majority voting method does not have a Grad-CAM heatmap because it is a different approach that assigns image labels based on the majority votes of the predictions from the individual CNN models.

    Fig. 7
    figure 7

The Grad-CAM heatmaps of the 18 CNN models for a correctly classified a COVID-19 CXR (116_1.Ser2.Img1.jpg), where the ground truth identified by two radiologists is shown in grayscale with a red contour indicating the affected area, and b normal CXR (102.Ser1.Img1_anon.jpg)

To identify which CNN models attended to the correct region within the lung during the classification process, qualitative analysis of these heatmaps is necessary with the assistance of radiologists. Only the top four models (VGG-16, ResNet-101, VGG-19, and SqueezeNet) and bottom three models (NasNet-Mobile, NasNet-Large, and Xception) from Table 7 were chosen to produce the Grad-CAM heatmaps for the 50 CXR images (from the UM dataset) in the qualitative analysis. Figure 8 shows another COVID-19 CXR image with the ground truth drawn by two radiologists and the seven Grad-CAM heatmaps of the top four and bottom three models. The radiologists voted for the best heatmap by comparing each one with the contour of the infected region they had drawn themselves. The results of their voting are recorded in Table 8. The total number of votes is unequal between the two radiologists because no score was given when no heatmap was correct. Referring to Table 8, SqueezeNet's Grad-CAM heatmaps received the highest score (printed in bold), i.e., they were the closest to the radiologists' diagnoses. The bottom three models received the lowest scores from both radiologists, confirming that the models that performed poorly in the quantitative analysis also performed poorly in the radiologists' qualitative analysis.

    Fig. 8
    figure 8

    COVID-19 CXR image (115_1.Ser1.Img1) with ground truth identified by two radiologists; Grad-Cam heatmaps of the top four and bottom three models

    Table 8 The voting results by two blinded radiologists on 50 Grad-CAM heatmaps of top 4 and bottom 3 CNN models

    Discussion

This study has demonstrated both quantitative and qualitative analysis of 18 CNN models with transfer learning to diagnose COVID-19 on CXR images. In our study, the state-of-the-art CNN models classified COVID-19 versus normal lung CXR images with accuracies between 74.3% and 94.3%, as recorded in Table 7. Six assessment metrics were calculated: specificity, sensitivity, precision, NPV, accuracy, and F1-score. Yet, it is difficult to conclude from the quantitative analysis alone which is the most suitable model, as most of the CNN models produced competitively good assessment metric values. Referring to Table 7, the top four CNN models with accuracy higher than 90% are VGG-16, ResNet-101, VGG-19, and SqueezeNet. The majority voting with the hard approach produced an accuracy of 94.0% when combining the top 4 models and 93.0% when combining all 18 models. The slightly lower accuracy when combining 18 models is due to the averaging effect of the poorer models.

    To date, the majority of the CNN studies for the detection of COVID-19 excluded qualitative analysis by radiologists. The new contribution from our study is the subjective qualitative analysis of the CNN models by certified radiologists alongside the quantitative analysis. Our study has combined both objective assessment (quantitative analysis by computer) and subjective assessment (qualitative analysis by radiologists) to enhance the evaluation of the CNN models. It gives us better confidence in our investigation of the best CNN model for diagnosing COVID-19 on CXR images.

Mangal et al. used RISE [44] to generate saliency maps to visualize their model's predictions [...].

    Conclusion

The main contribution of this study is the combination of objective quantitative and subjective qualitative analysis in evaluating the performance of CNN models with transfer learning to diagnose COVID-19. The quantitative analysis of 18 CNN models with transfer learning revealed that the top four models for diagnosing COVID-19 on CXR images are VGG-16, ResNet-101, VGG-19, and SqueezeNet. VGG-16 scored the highest accuracy of 94.3% and the highest F1-score of 94.3%. The majority voting with all 18 CNN models and with the top 4 models produced accuracies of 93.0% and 94.0%, respectively. The qualitative analysis using Grad-CAM heatmaps of the top four and bottom three models revealed that SqueezeNet is the model closest to the subjective diagnosis of the two certified radiologists. SqueezeNet demonstrated a competitively good accuracy of 90.7% and F1-score of 90.8% with the shortest training time of 8 min 34 s; it used 111 times fewer parameters than VGG-16, and its training was 7.7 times faster. Therefore, our study recommends both VGG-16 and SqueezeNet as additional tools for the diagnosis of COVID-19.