Introduction

Coronavirus disease (COVID-19) is a contagious disease caused by the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). The disease was first identified in Wuhan, Hubei Province, China, and has since spread around the world. On March 11, 2020, the World Health Organization (WHO) declared the coronavirus outbreak a pandemic [34]. As of July 12, 2020, there were more than 12,401,262 confirmed cases of COVID-19 and 559,047 confirmed deaths due to the disease [33]. It is therefore essential to develop effective screening strategies to identify COVID-19 cases and isolate the infected from others. Affected countries are trying to enhance the capacity of their entire health care systems through multifunctional testing and mass vaccination to curb the pandemic as early as possible, which is the ultimate goal in the fight against COVID-19. Although the reverse transcriptase-polymerase chain reaction (RT-PCR) test is the gold-standard diagnostic method, it has certain deficiencies: the detection of suspected patients is delayed by the strict laboratory conditions it requires [37], and its detection rate is low due to the variable characteristics of the test [31].

Meanwhile, the limitations of RT-PCR testing have prompted researchers to seek a rapid and definitive method for diagnosing COVID-19 infection. At the same time, the WHO published a rapid advice guide [35] on June 11, 2020, suggesting that, in addition to the assessment of clinical symptoms, chest imaging examinations may be part of the diagnostic workup for patients suspected or likely to have COVID-19, as well as for patients who have already recovered from it. Even though computed tomography (CT) imaging can prove useful, it also has several limitations, such as its financial cost; moreover, pregnant women and children are at greater risk of radiation exposure, since CT imaging requires high radiation doses during the screening process [6]. Chest X-ray, by contrast, is well suited to emergency diagnosis and treatment, because the system is fast and simple to operate, and radiologists can readily recognize abnormalities. Prior research has observed that COVID-19 patients exhibit abnormalities in chest X-ray images that are characteristic of the infection [18].

The objective of this research is to enhance COVID-19 detection accuracy from chest X-ray images. To this end, we consider a framework based on CNNs, because a CNN is a powerful feature extraction and classification methodology and therefore delivers excellent recognition performance in image classification. In medical image analysis, high diagnostic accuracy is a prime objective alongside the discovery of critical findings, and in recent years such findings in medical imaging have been led overwhelmingly by CNN-based frameworks, which motivates our choice. In this paper, we propose a modular CNN-based architecture, PDCOVIDNet, for detecting COVID-19 from chest X-rays using dilated convolution [36] alongside traditional convolution. The advantage of dilated convolution is that it captures more distinctive features by enlarging the receptive field. The benchmark data set [5] used in this study is publicly available; its authors assembled the data from three open-access repositories containing chest X-ray images [13, 20, 27]. The PDCOVIDNet pipeline starts with a data augmentation strategy, then optimizes and fine-tunes the settings to train parallel dilated CNN branches that generate dominant features in receptive fields at different scales. Next, the generated features are fused in the network to produce the final prediction. We also use two gradient-weighted class activation mapping methods (Grad-CAM [25] and Grad-CAM++ [4]) to aid our system; these maps provide predictive explanations and can identify important features related to COVID-19 infection. The experimental evaluation shows that the proposed method can identify important features related to COVID-19 disease, with a best accuracy of \(96.58\%\). The key contributions of this paper are as follows:

  • We propose and develop a novel CNN framework called PDCOVIDNet to detect COVID-19 from chest X-ray images. Our framework uses dilated convolution in a parallel stack of convolutional blocks that captures and propagates important features in parallel over the network, which significantly enhances detection accuracy.

  • We visualize the X-ray images to analyze COVID and non-COVID cases, and further investigate the incorrect classifications.

  • Finally, we empirically compare our approach against state-of-the-art approaches to highlight the effectiveness of PDCOVIDNet in detecting COVID-19.

The rest of the paper is organized as follows. “Related work” section reviews the state-of-the-art models used in detecting COVID-19 using chest X-ray images. The benchmark dataset and the augmentation strategy are described in “Data pre-processing” section. Next, “PDCOVIDNet architecture” section explains the main details of the proposed model and its adaptation to the detection of COVID-19 cases. In “Experimental evaluation” section, we provide the experimental results and show the comparison between PDCOVIDNet and other models. Observations on the visualization techniques and the incorrect classification results are presented in the “Visualization using Grad-CAM and Grad-CAM++” and “Investigation on the incorrect classification” sections, respectively. Finally, “Conclusion and future work” section concludes the paper with future research directions.

Related work

With the rapid spread of COVID-19 in many countries around the world, imaging technology that can quickly detect COVID-19 helps to control the spread of the disease. Chest X-ray is a promising imaging technology with a long history in image-based diagnosis. It can be fully exploited through various feature extraction methods, especially CNN-based approaches, and can thereby play an important role in the diagnosis of COVID-19.

Due to the need for faster interpretation of chest X-ray images, a CNN-based AI system was proposed in [23] for the automatic detection of COVID-19 from chest X-ray images. The average classification accuracy was 98.08% for binary classification (COVID-19 vs. No-Findings) and 87.02% for multi-class classification (COVID-19, No-Findings, and Pneumonia). Finally, the authors provided an intuitive explanation of the results, which was evaluated by expert radiologists.

One limitation we noticed in existing research is that all methods extract features from a fixed receptive field rather than a varying one. As our proposed architecture demonstrates, a varying receptive field can capture pixel relationships at different scales through dilated convolution, making the model more robust. To the best of our knowledge, no previous research has applied dilated convolution in a stack of parallel convolution blocks to detect COVID-19 in chest X-rays; prior work used conventional convolution blocks without a parallel framework. In our proposed method, dilated convolution in a stack of parallel convolution blocks turns out to be far more useful, since it covers a larger receptive field without loss of resolution. Moreover, the proposed parallel-stack architecture ensures that the branches are merged before the last convolutional layer. This branch-ensemble strategy limits the expansion of the feature size and reduces variance error, thereby improving the prediction performance of the proposed model. In the literature, several methods [1, 2, 8, 17] focus only on quantitative analysis, while others [16, 19, 22, 23, 31] combine quantitative and qualitative analysis using visualization and localization techniques to show that their analysis can be used for COVID-19 detection, which allows for human-interpretable explanations. Finally, due to the small number of COVID-19 cases, creating ample benchmarks remains a major challenge in COVID-19 detection. On a small data set, training a large, highly iterative CNN architecture may lead to overfitting, but a data augmentation strategy can partially compensate for this shortcoming.

Data pre-processing

First, we introduce the benchmark data set and the augmentation strategy used to aid the training of the proposed model. Next, we discuss in detail the design of the proposed PDCOVIDNet architecture and the training strategy, covering optimal parameter tuning. Finally, to make the detection of suspicious disease more convincing, we integrate visualization techniques that highlight key regions with visual markers.

Chest X-ray image dataset

The benchmark data set [5] used in our experimental evaluation consists of three categories (COVID-19, Normal, and Viral Pneumonia), comprising 219 COVID-19-positive, 1341 normal, and 1345 viral pneumonia chest X-ray images. To collect the COVID-19-positive images, the authors used two open-access repositories: the Italian Society of Medical and Interventional Radiology (SIRM) COVID-19 database [27] and the Novel Corona Virus 2019 dataset developed by Joseph et al. [13]. The normal and viral pneumonia images were taken from the chest X-ray (pneumonia) database created by P. Mooney [20]. The benchmark is public, and metadata is distributed to provide appropriate documentation and references for each image. Since resizing is an essential data preprocessing step, all images were resized to \(224\times 224\) pixels. Figure 1 shows sample images from the benchmark dataset for COVID-19, normal, and viral pneumonia. As shown in Table 1, we split the images into training, validation, and test sets in an appropriate ratio.

Fig. 1
figure 1

Sample images and their labels from the benchmark dataset [5]

Table 1 Image partition into training, validation and testing

Data augmentation

To train a CNN model properly, it is often useful to artificially increase the size of the data set using data augmentation that introduces variation while preserving the original image quality. This process is executed on the fly during training, so model performance can be improved by mitigating overfitting. For image augmentation, we can choose values at different scales for several options, including horizontal flip, height and width shift, rotation, shear, zoom, and fill mode. Each option represents the images in a different way, contributing useful features during the training phase and thus improving the model’s performance. Table 2 shows the image augmentation settings used in our experiment; a minimal sketch of such a pipeline is given below.

Table 2 Image augmentation settings
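The following is a minimal sketch of such an on-the-fly augmentation pipeline using Keras’ ImageDataGenerator, matching our Keras/TensorFlow setup. The numeric values and the directory layout are illustrative assumptions, not the exact settings of Table 2.

```python
# A minimal augmentation sketch using Keras' ImageDataGenerator.
# The numeric values below are illustrative stand-ins for Table 2.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,       # normalize pixel intensities to [0, 1]
    rotation_range=10,       # random rotation in degrees (assumed value)
    width_shift_range=0.1,   # horizontal offset (assumed value)
    height_shift_range=0.1,  # vertical offset (assumed value)
    shear_range=0.1,         # shearing transform (assumed value)
    zoom_range=0.1,          # random zoom (assumed value)
    horizontal_flip=True,    # mirror left/right
    fill_mode="nearest",     # fill pixels exposed by the transforms
)

# Augmentation runs on the fly, batch by batch, during training.
train_generator = train_datagen.flow_from_directory(
    "data/train",              # hypothetical directory layout
    target_size=(224, 224),    # images are resized to 224x224 (see text)
    batch_size=32,
    class_mode="categorical",  # COVID-19 / Normal / Viral Pneumonia
)
```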

PDCOVIDNet architecture

In this section, we briefly describe our proposed PDCOVIDNet architecture. The proposed model has three main components: feature extraction, detection, and visualization. First, PDCOVIDNet is a parallel stack of convolutional layers, activation layers, and max-pooling layers. We then merge the parallel branches at the feature level and perform another convolution with activation on the resulting feature maps. Afterward, the flattened features are fed into a two-layer Multi-Layer Perceptron (MLP), where the proportion of neurons to drop at each layer must be tuned to avoid overfitting. Finally, the last layer, with a softmax activation function, performs the classification task and then generates a class activation map, derived from the last convolution layer, which acts as an interpreter of the classification. Figure 2 shows the overall system architecture of the proposed PDCOVIDNet. We split the workflow into two parts: the feature extraction phase, and the classification and visualization phase. In the next section, we explain the feature extraction process.

Feature extraction

Fig. 2
figure 2

An overall system architecture of PDCOVIDNet

Fig. 3
figure 3

Dilated convolution and its activation in PDCOVIDNet. At the top, a dilated convolution with a dilation rate of 1 and a corresponding receptive field of \(3 \times 3\), which is equivalent to standard convolution. At the bottom, a dilation rate of 2 with a \(5 \times 5\) receptive field. All dilated convolutions have a kernel size of \(3 \times 3\) and n filters

To obtain a suitable network architecture, we experimented with different numbers of filters in each convolution layer, different filter sizes, different numbers of MLP layers, and different hyper-parameters. In the first stage, PDCOVIDNet consists of five dilated convolutional blocks in parallel-stack form, expressed semantically as \(\hbox {PD}_{\mathrm{r}=\mathrm{i}}\hbox {conv}(\hbox {n}=\hbox {X})\), that are alternately max-pooled. More specifically, in \(\hbox {PD}_{\mathrm{r}=\mathrm{i}}\hbox {conv}(\hbox {n}=\hbox {X})\), i represents the dilation rate and X the total number of filters. Each \(\hbox {PD}_{\mathrm{r}=\mathrm{i}}\hbox {conv}(\hbox {n}=\hbox {X})\) block performs convolution followed by activation, twice in sequence. In our architecture, all filters are of size \(3 \times 3\). Let us start with the input in Fig. 2. The input image is fed into the parallel branches, which differ only in dilation rate, namely rates 1 and 2. In the first block of the upper branch (\(\hbox {PD}_{\mathrm{r}=1}\hbox {conv}(\hbox {n}=64)\)), the dilation rate is 1, the total number of filters is 64, and the operations of the semantic block are performed. The remaining blocks of the upper and lower branches behave identically except for the dilation rate, which is 1 for the upper branch and 2 for the lower. Different filter sizes could also be used, for example \(3 \times 3\) in the upper branch and \(5 \times 5\) in the lower, but this would increase the parameter count, since more multiplications are required; we experimented with this configuration and obtained lower accuracy than with the proposed \(3 \times 3\) filters in both branches. Figure 3 illustrates how dilated convolution is incorporated into our proposed model. As shown in Fig. 3, the input image is fed into two \(\hbox {PD}_{\mathrm{r}=\mathrm{i}}\hbox {conv}(\hbox {n}=\hbox {X})\) blocks in parallel, varying only the dilation rate, \(d=1,\ldots ,N\). A convolution with a dilation rate of 1 is equivalent to standard convolution, while a dilation rate greater than 1 expands the receptive field, allowing the input to be processed at a higher resolution and capturing finer image details. The receptive field is the portion of the image from which a filter extracts features; dilation widens it without changing the filter size by sampling the input with a fixed gap, i.e., with a dilation rate of d, the filter skips \(d-1\) pixels between input elements. By this definition, considering that our input is a 2D image, a dilation rate of 1 is a standard convolution, and a dilation rate of 2 means that each input skips one pixel. To understand the relation between the dilation rate d and the receptive field size rf, it is useful to examine the effect of d on rf for a fixed kernel size. Equation (1) [7] gives the receptive field size when a kernel of size k is dilated by a factor of d.

$$\begin{aligned} \begin{aligned} {rf}&= d(k-1)+1. \end{aligned} \end{aligned}$$
(1)

From Eq. (1), the size of the output o for an \(m\times m\) input with dilation factor d, padding p, and stride s is given by

$$\begin{aligned} \begin{aligned} o&= \left\lfloor \dfrac{m+2p-rf}{s}\right\rfloor +1. \end{aligned} \end{aligned}$$
(2)
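To make Eqs. (1) and (2) concrete, the short sketch below computes the receptive field and output size for our setting of \(3 \times 3\) kernels and \(224\times 224\) inputs; the padding value is chosen for illustration only.

```python
# Helpers illustrating Eqs. (1) and (2).
def dilated_receptive_field(k: int, d: int) -> int:
    """Eq. (1): rf = d(k - 1) + 1."""
    return d * (k - 1) + 1

def output_size(m: int, k: int, d: int, p: int = 0, s: int = 1) -> int:
    """Eq. (2): o = floor((m + 2p - rf) / s) + 1."""
    rf = dilated_receptive_field(k, d)
    return (m + 2 * p - rf) // s + 1

print(dilated_receptive_field(3, 1))  # 3   -> equivalent to standard convolution
print(dilated_receptive_field(3, 2))  # 5   -> same 3x3 kernel, wider view
print(output_size(224, 3, 2, p=2))    # 224 -> resolution preserved with padding 2
```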

After applying two receptive fields of different sizes, the network captures important features of the observed region at different scales. We define a convolution layer with filters \(\mathbf {F} \in \mathbb {R}^{1\times n}\), given as

$$\begin{aligned} \mathbf {F} =\begin{bmatrix} f_1^{k\times k}&f_2^{k\times k}&\cdots&f_{n=X}^{k\times k} \end{bmatrix}, \end{aligned}$$
(3)

where \(k\times k\) is the filter size and X is the total number of filters. For a dilated convolution with a dilation rate of d on \(m\times m\) input feature maps at layer l, the convolution generates output feature maps \(\mathbf {X}_l^{o\times o}\) from the input \(\mathbf {X}_{l-1}^{m\times m}\), calculated by

$$\begin{aligned} \mathbf {X}_l^{o\times o} =\mathbf {X}_{{l-1},{d=i}}^{m\times m}*\mathbf {{F}}_l+\mathbf {{B}}_l, \end{aligned}$$
(4)

where \(\mathbf {{B}}_l\) is the bias of layer l, and \(\mathbf {{F}}_l\) is the layer-l filter of size \(k \times k\). The features of layer l are generated at dilation rate d from the feature map of layer \((l-1)\). In the \(\hbox {PD}_{\mathrm{r}=\mathrm{i}}\hbox {conv}(\hbox {n}=\hbox {X})\) block, after the convolution layer, we introduce a nonlinear layer with an activation function that uses the features generated at the earlier stage to create a new feature map as output. For activation, we prefer the rectified linear unit (ReLU) [21] because it combines the nonlinear layer and the rectification layer in a CNN. ReLU has several advantages; most importantly, it propagates gradients effectively, so if the initial weights take into account the unique characteristics of CNNs, the risk of vanishing gradients is reduced. Note that the activation function performs element-wise operations on the input feature map, so the output has the same size as the input. Assuming layer l is the activation layer for the n-th filter, it takes the input feature \(\mathbf {X}_{{(l-1)},{n}}^{m\times m}\) with feature map size \(m\times m\) from the previous convolution layer, and generates the same number of features, defined as:

$$\begin{aligned} \mathbf {X}_{l,n}^{m\times m} =max(\mathbf {X}_{{(l-1)},{n}}^{m\times m},0) \end{aligned}$$
(5)

where \(\mathbf {X}_{l,n}^{m\times m}\) maps negative values to zero.

After the \(\hbox {PD}_{\mathrm{r}=\mathrm{i}}\hbox {conv}(\hbox {n}=\hbox {X})\) block, we apply a MaxPool block, called \(\hbox {PD}_{\mathrm{r}=\mathrm{i}}\)MaxPool, which performs max-pooling to reduce the feature size. Max-pooling, which takes the maximum value in each window, is an efficient way to downscale the filtered image: with a \(2\times 2\) window and a stride of 2, three-fourths of the generated features are discarded at each layer, substantially reducing the computational complexity of the next layer. We used a \(2\times 2\) max-pooling window with a stride of 2, because, as reported in earlier studies [24], overlapping windows do not improve significantly over non-overlapping ones. As Fig. 2 shows, the features generated by the parallel branches are then concatenated and passed to the next convolution layer. The motivation behind this concatenation-convolution operation is that features generated at different layers of a CNN have different properties, so we concatenate the low-level features of the parallel branches to explore the feature relationships of the dilated convolutions, enabling the final convolution layer to detect dominant features for classification. In the last convolution layer, a total of 512 filters of size \(k\times k\) with a dilation rate of 1 are applied to create the final low-level features, followed by ReLU activation. After that, a flatten layer converts the square feature maps into a one-dimensional feature vector for the next phase, the classification task; a sketch of this feature extractor is given below. The classification and visualization phase is described in the next section.
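The sketch below reconstructs this feature extractor in Keras: two parallel branches that differ only in dilation rate, concatenated and passed through the final 512-filter convolution. The per-block filter progression is an assumption made for illustration (the exact layer details are listed in Table 3), so this should be read as a sketch of the design, not released code.

```python
# A minimal Keras sketch of the parallel dilated-convolution feature
# extractor. The filter progression (64, 64, 128, 128, 256) is assumed
# for illustration; the exact layer details are in Table 3.
from tensorflow.keras import layers, Input

def branch(x, dilation_rate):
    # Five PD_{r=i}conv(n=X) blocks: (conv -> ReLU) twice, then max-pool.
    for filters in (64, 64, 128, 128, 256):
        for _ in range(2):
            x = layers.Conv2D(filters, (3, 3), padding="same",
                              dilation_rate=dilation_rate,
                              activation="relu")(x)
        x = layers.MaxPooling2D((2, 2), strides=2)(x)
    return x

inputs = Input(shape=(224, 224, 3))
upper = branch(inputs, dilation_rate=1)  # standard convolution
lower = branch(inputs, dilation_rate=2)  # dilated: wider receptive field

# Fuse the branches, apply the final 512-filter convolution, and flatten.
fused = layers.Concatenate()([upper, lower])
features = layers.Conv2D(512, (3, 3), padding="same", activation="relu")(fused)
flat = layers.Flatten()(features)
```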

Classification and visualization

At this stage, a two-layer MLP (often called fully connected (FC) layers) receives the output of the flatten layer and passes it through two neural layers to perform the classification task, converting the activations from the FC layers into class scores. In addition, we include a Dropout [28] layer after each FC layer, which randomly discards some FC-layer weights during training to reduce overfitting. The number of randomly dropped weights is defined by the dropout rate, which ranges from 0 to \(100\%\). The best adjustment determines the optimal number of weights in each layer and, at the same time, the dropout ratio that avoids overfitting, making the network more robust. In this study, we chose a dropout rate of 0.3 and two FC layers of size 1024 each, and used the softmax activation function to classify the input chest X-ray images as COVID-19, normal, or viral pneumonia; a sketch of this head is given after Table 3. The layer details of the proposed model are shown in Table 3.

Table 3 PDCOVIDNet layer details
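Continuing the feature-extractor sketch above (which defined `inputs` and `flat`), the classification head below follows the description in the text: two FC layers of 1024 units each, a dropout rate of 0.3 after each, and a three-way softmax.

```python
# Classification head of the sketch: 1024-1024 FC layers with dropout 0.3.
from tensorflow.keras import layers, Model

x = layers.Dense(1024, activation="relu")(flat)
x = layers.Dropout(0.3)(x)  # randomly drop 30% of units during training
x = layers.Dense(1024, activation="relu")(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(3, activation="softmax")(x)  # COVID-19 / Normal / Viral Pneumonia

model = Model(inputs, outputs)
```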

Although CNN models are powerful and produce impressive results, many questions remain about why and how they produce such good results. Owing to their black-box nature, it is sometimes challenging to adopt them in real-life applications (such as medical diagnosis systems) where an interpretable model is needed. However, earlier studies [4, 25, 38] focused on visualizing the behavior of CNN models, and various visualization methods emphasized class-discriminative importance, making the models interpretable. In our proposed model, we use Grad-CAM [25] and Grad-CAM++ [4] to highlight the important regions in class-discriminative saliency maps, using a gradient-based approach that computes the gradients of a target image class with respect to the feature maps of the final convolution layer of the CNN. For a given image, let \(A^{k}_{i,j}\) denote the activation map at spatial location (i, j) for the k-th filter. The class-discriminative saliency map \(L^{c}\) for the target image class c is then computed as [25]:

$$\begin{aligned} \begin{aligned} L^{c}_{i,j}&= ReLU\left( \sum _{k}w^{c}_{k}A^{k}_{i,j}\right) \end{aligned} \end{aligned}$$
(6)

In Eq. (6), the role of ReLU is to capture features that have a positive impact on the target class. In the case of Grad-CAM, the gradients flowing back to the final convolutional layer are globally averaged to calculate the target-class weights of the k-th filter, as described in Eq. (7), where Z is the total number of pixels in the activation map and \(Y^{c}\) is the probability that the target category is classified as c.

$$\begin{aligned} \begin{aligned} w^{c}_{k}&= \frac{1}{Z}\sum _{i}\sum _{j}\frac{\partial Y^{c}}{\partial A^{k}_{i,j}} \end{aligned} \end{aligned}$$
(7)

On the other hand, Grad-CAM++ uses a weighted average of pixel-level gradients rather than a global average, so that the weight of each pixel in a particular feature map contributes to the overall detection decision. Grad-CAM++ reformulates Eq. (7) so that the contribution of each pixel to the weighted average of the gradients is preserved without loss of generality, i.e.,

$$\begin{aligned} w^{c}_{k}= & {} \sum _{i}\sum _{j}\alpha ^{kc}_{i,j}.ReLU\left( \frac{\partial Y^{c}}{\partial A^{k}_{i,j}}\right) \end{aligned}$$
(8)
$$\begin{aligned} \alpha ^{kc}_{i,j}= & {} \frac{\frac{\partial ^{2} Y^{c}}{(\partial A^{k}_{i,j})^{2}}}{2\frac{\partial ^{2} Y^{c}}{(\partial A^{k}_{i,j})^{2}}+\sum _{a}\sum _{b} A^{k}_{a,b}{\frac{\partial ^{3} Y^{c}}{(\partial A^{k}_{i,j})^{3}}} } \end{aligned}$$
(9)

In Eq. (9), (i, j) and (a, b) are iterators over the same activation map \(A^{k}\) [4].
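For reference, the sketch below implements the Grad-CAM computation of Eqs. (6) and (7) with TensorFlow’s gradient tape; Grad-CAM++ differs only in how the weights of Eqs. (8)-(9) are formed. The layer name `final_conv` is a hypothetical stand-in for the model’s last convolutional layer.

```python
# A minimal Grad-CAM sketch following Eqs. (6) and (7).
import numpy as np
import tensorflow as tf

def grad_cam(model, image, class_index, conv_layer_name="final_conv"):
    # Map the input to (final conv activations A^k, predictions).
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])  # image: float array
        class_score = preds[:, class_index]                   # Y^c
    grads = tape.gradient(class_score, conv_out)              # dY^c / dA^k
    # Eq. (7): global-average the gradients to get the weights w^c_k.
    weights = tf.reduce_mean(grads, axis=(1, 2))
    # Eq. (6): ReLU over the weighted sum of the activation maps.
    cam = tf.nn.relu(tf.reduce_sum(weights[:, None, None, :] * conv_out, axis=-1))
    return (cam[0] / (tf.reduce_max(cam) + 1e-8)).numpy()     # normalize for overlay
```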

Experimental evaluation

In this section, we present the performance of our proposed model in classifying chest X-ray images into three classes: COVID-19, Normal, and Viral Pneumonia. The benchmark data set and the augmentation approach were described in “Data pre-processing” section. In our experiment, we set the training, validation, and test ratios to \(80\%\), \(10\%\), and \(10\%\), respectively. We compared our proposed PDCOVIDNet with VGG16, ResNet50, InceptionV3 [29], and DenseNet121, without using any pre-trained weights (such as ImageNet), since ImageNet weights come from images of general objects, not chest X-rays. All of our experiments were executed in Keras with the TensorFlow backend.

Hyper-parameters tuning

Hyper-parameters are critical because they directly control the behavior of the model, so fine-tuned hyper-parameters have a huge impact on performance. We used the Adam [14] optimizer to train each model for 50 epochs with a learning rate of \(1e-4\) and a batch size of 32. In addition, we used the categorical cross-entropy loss function for training, which measures the loss between the class probabilities predicted by the softmax activation function and the true class probabilities. A sketch of this configuration is given below.
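The sketch assumes the `model` and `train_generator` from the earlier sketches; `val_generator` is a hypothetical validation generator built the same way (the batch size of 32 is set on the generators).

```python
# Training configuration: Adam (lr = 1e-4), categorical cross-entropy, 50 epochs.
from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(train_generator,
                    validation_data=val_generator,  # hypothetical validation split
                    epochs=50)
```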

Performance evaluation metrics

For the experimental evaluation, we utilized several metrics, namely Accuracy, Precision, Recall, and F1 score, i.e.,

$$\begin{aligned} Accuracy= & {} \frac{TP+TN}{Total\,Samples} \end{aligned}$$
(10)
$$\begin{aligned} Precision= & {} \frac{TP}{TP+FP} \end{aligned}$$
(11)
$$\begin{aligned} Recall= & {} \frac{TP}{TP+FN} \end{aligned}$$
(12)
$$\begin{aligned} F1= & {} 2\times \frac{Precision \times Recall }{Precision + Recall } \end{aligned}$$
(13)

where TP stands for true positive, and TN, FP, and FN stand for true negative, false positive, and false negative, respectively. The F1 score may be a more reliable measure because the benchmark dataset is unbalanced (219 COVID-19 images vs. 2686 non-COVID images). We also used the Receiver Operating Characteristic (ROC) curve to display the results, and measured the area under the ROC curve (AUC) to quantify the effectiveness of the model; a sketch of these computations follows.
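As a sketch, the metrics of Eqs. (10)-(13) and the per-class AUC can be computed with scikit-learn as below, where `y_true` (one-hot test labels) and `y_prob` (softmax outputs on the test set) are assumed.

```python
# Per-class precision/recall/F1, weighted averages, and one-vs-rest AUC.
import numpy as np
from sklearn.metrics import classification_report, roc_auc_score

labels = ["COVID-19", "Normal", "Viral Pneumonia"]
y_pred = np.argmax(y_prob, axis=1)

# classification_report covers Eqs. (11)-(13) per class plus weighted averages,
# which matter here because the benchmark is unbalanced (219 vs. 2686 images).
print(classification_report(np.argmax(y_true, axis=1), y_pred, target_names=labels))

# One-vs-rest AUC per class, as plotted in the ROC curves.
for k, name in enumerate(labels):
    print(name, roc_auc_score(y_true[:, k], y_prob[:, k]))
```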

Evaluation of individual model

The overall results are shown in Tables 4 and 5, where Table 4 reports the class-wise classification results on different evaluation metrics, and Table 5 shows the weighted average results. From Table 4, we can see that almost all models tend to classify the majority categories (normal and viral pneumonia) better, because these have more training samples than the COVID-19 class. For COVID-19, the highest performance belongs to PDCOVIDNet, whose precision, recall, and F1 score are \(95.45\%\), \(91.30\%\), and \(93.33\%\), respectively. For the normal class, our model yields consistent precision, recall, and F1, each being \(97.04\%\); ResNet50’s recall of \(97.78\%\) is slightly higher than PDCOVIDNet’s. The precision and F1 score of DenseNet121 are \(95.52\%\) and \(95.17\%\), respectively, which is comparable to PDCOVIDNet. Next, for viral pneumonia, the precision, recall, and F1 score of PDCOVIDNet are \(96.32\%\), \(97.04\%\), and \(96.68\%\), respectively; here, ResNet50’s precision is only \(1\%\) higher than PDCOVIDNet’s. Table 4 also summarizes the accuracy of all evaluated models, showing that PDCOVIDNet is superior to the others. At the same time, it is evident that PDCOVIDNet resists class imbalance, since there are far fewer COVID-19 cases than normal or viral pneumonia cases. However, a model with more structured residual blocks does not necessarily classify better (e.g., ResNet50). As shown in Table 5, considering the weighted average of all performance metrics, the best results are obtained by PDCOVIDNet. In this comparison, PDCOVIDNet’s results are clearly better than those of the other models, which can be explained by the fact that the proposed model extracts feature maps at different scales from chest X-ray images. In particular, DenseNet121 trails PDCOVIDNet by about \(2\%\) on each metric. Although ResNet50 provides the best performance for normal and viral pneumonia on some metrics, it unexpectedly fails to be the most successful model overall.

Table 4 Class-wise classification results of individual model
Table 5 Weighted average results of each model

It is often hard to assess a model using precision, recall, and accuracy alone, so we also examine the ROC curve, which plots the true positive rate against the false positive rate. In Fig. 4, the ROC curves show the micro- and macro-average and class-wise AUC scores achieved by PDCOVIDNet; the AUC scores are consistent across all classes, indicating stable predictions by the proposed model. We obtained AUC scores of 0.9918, 0.9927, and 0.9897 for COVID-19, normal, and viral pneumonia, respectively. The areas under the curves of all classes are relatively similar, with the normal class slightly higher than the others.

Fig. 4
figure 4

Comparison of the ROC curve for COVID-19, Normal and Viral Pneumonia using PDCOVIDNet

Figure 5 shows the confusion matrices for all evaluated models. In Fig. 5, it is clear that, for PDCOVIDNet, of the 23 COVID-19 test images, two are misclassified (one as normal and one as viral pneumonia); of the 135 viral pneumonia images, only one is misclassified as COVID-19; and none of the normal images is classified as COVID-19. One reason may be that COVID-19 is a special case of viral pneumonia, so the two share common features that mislead the PDCOVIDNet model. The VGG16 model has the same recall for normal and viral pneumonia, although its positive predictions for COVID-19 decline significantly. ResNet50 detects normal images well and performs almost the same as VGG16 on COVID-19 detection, but performs inadequately when predicting viral pneumonia images. As shown in the confusion matrix (Fig. 5d), InceptionV3 correctly classifies more viral pneumonia cases than COVID-19 or normal cases. Next, DenseNet121 matches InceptionV3 in detecting COVID-19 and nearly matches VGG16 in detecting viral pneumonia and normal cases. Finally, we can claim that PDCOVIDNet is powerful in detecting COVID-19 from chest X-ray images; we believe the proposed model focuses on discriminative features that help distinguish COVID-19 from the other classes (normal and viral pneumonia).

Fig. 5
figure 5

Confusion matrices of all evaluated models on the test set. Classes 0, 1, and 2 represent COVID-19, Normal, and Viral Pneumonia, respectively

Visualization using Grad-CAM and Grad-CAM++

In our evaluation, we used the Grad-CAM and Grad-CAM++ visualization methods to represent the salient regions on which PDCOVIDNet relied in making its final classification decision on chest X-ray images. Accurate and decisive salient-region detection is important for interpreting the classification, and it also supports the reliability of the results. To this end, a two-dimensional heat map is generated from the feature weights, with brightness corresponding to feature importance, and overlaid on the input image to locate the salient region. Figure 6 shows the visualization results of Grad-CAM and Grad-CAM++ with PDCOVIDNet, localizing the class-discriminative regions in the lungs when the input image is classified as COVID-19, normal, or viral pneumonia. For COVID-19, Grad-CAM and Grad-CAM++ generate seemingly identical results, so the overlapping positions of the heat maps can be used to detect critical areas. For viral pneumonia, the salient regions detected by Grad-CAM and Grad-CAM++ are likewise indistinguishable, while for the normal class there are differences, and the methods seem to fail to detect the salient regions, as the heat map highlights areas outside the X-ray rather than inside the lungs. To assist AI-based systems, it is certainly effective to provide some human-understandable numerical measures (such as probability), as shown in Fig. 6.

Fig. 6
figure 6

Input images and their Grad-CAM, Grad-CAM++, and human-understandable predictions with probability scores according to PDCOVIDNet. The first row provides a COVID-19 input, the two types of visualization, and the prediction results. The second and third rows show normal and viral pneumonia, respectively. Here, True means the actual class of the image, and Prediction means the predicted class

Investigation on the incorrect classification

Fig. 7
figure 7

Investigation of the incorrectly classified images along with the probability of the predicted class. Note that V.Pneumonia refers to Viral Pneumonia

In this section, we further investigate the incorrect classifications made by PDCOVIDNet. The total number of incorrectly classified images is 10, as shown in Fig. 7. Two COVID-19 images are classified as normal and viral pneumonia, and in both cases COVID-19 trails far behind in the prediction, its probability being very low compared with the others. No normal image is classified as COVID-19, but among the four misclassified normal images, one prediction is on the edge of viral pneumonia, while the other predictions are far apart. Correspondingly, among the four incorrectly classified viral pneumonia images, one is assigned to COVID-19 and the others to normal, and one of these predictions is very close.

For one incorrectly classified COVID-19 case, we observed that the X-ray quality was poor, and in the other, the posteroanterior (PA) view was barely captured, producing many dark areas that misled the system. One viral pneumonia image is classified as COVID-19; a likely reason is their overlapping infection characteristics, for example, both infections cause severe damage to the lungs. Indeed, this distinction is often confusing, so precise clinical findings should be reviewed. In a few cases, our system is equally confused between normal and viral pneumonia. One cause may be the progressive change in radiological manifestations: for example, a true viral pneumonia image is classified as normal because it is difficult to predict that it may show an early stage of viral infection. The same applies to normal cases in which a viral infection may be at an early stage, so some normal images are assigned to viral pneumonia.

Conclusion and future work

In this paper, we proposed a CNN-based method, called PDCOVIDNet, for detecting COVID-19 from chest X-ray images. As we have shown, PDCOVIDNet can effectively capture COVID-19 features through dilated convolution in a parallel stack of convolution blocks, giving it excellent classification performance compared with several well-known CNN architectures. The dataset used in the experiments has a limited number of COVID-19 images and is still developing, but data augmentation techniques were able to surmount this challenge, as CNN-based architectures need more data for effective training. Our experimental evaluation shows that PDCOVIDNet outperforms state-of-the-art models, with a precision of \(95.45\%\) and a recall of \(91.3\%\) for COVID-19 detection. PDCOVIDNet also demonstrates its potential on other performance metrics, such as the weighted averages of precision, recall, and F1 score, and the overall model accuracy. We applied the proposed model together with two visualization techniques to identify the class-discriminative regions that most influence the classification of an input chest X-ray image into its predicted class.

As future work, we will explore and integrate diverse data sets with more COVID-19 cases to make the proposed model more robust. We will also develop mobile applications with human-explainable functions for screening COVID-19 cases so that infected people can be diagnosed early, helping to stop the spread of this pandemic while providing a new way to prevent future ones. Furthermore, we will extend the model to analyze a patient’s short-term historical chest X-ray pattern to predict whether the infection will become life-threatening. Since studying the radiological markers of COVID-19 is an active research area, there is still much to uncover; thus, we will further refine the visualization techniques to interpret the unique features of COVID-19 more critically.