Background

COVID-19 is a widespread disease that causes thousands of deaths daily. Early diagnosis has proven to be one of the most effective ways of pruning the infection tree [1]. The large number of COVID-19 patients has overwhelmed health care systems in many countries. Hence, a trusted automated technique for identifying and quantifying the infected lung regions would be quite advantageous.

Radiologists have identified three types of irregularities related to COVID-19 in Computed Tomography (CT) lung images: (1) Ground Glass Opacification (GGO), (2) consolidation, and (3) pleural effusion [2, 3]. Developing a tool for semantically segmenting the lung images of COVID-19 patients would contribute to and assist in quantifying these three irregularities. It would help the front-liners of the pandemic better manage overloaded hospitals.

Deep learning (DL) has become a conventional method for constructing networks capable of modeling higher-order systems and achieving human-like performance. Tumors, for example, have been direct targets for DL-assisted segmentation of medical images [28].

SegNet architecture

SegNet is a deep neural network originally designed for scene segmentation tasks such as road-scene parsing. This task requires the network to converge on highly imbalanced datasets, since large areas of road images belong to classes such as road, sidewalk, and sky. In the dataset section, we demonstrated numerically how the dataset used in this work exhibits a similar disparity in class representation. As a consequence, SegNet was our first choice for this task.

SegNet is a DNN with an encoder-decoder depth of three. The encoder layers are identical to the convolutional layers of the VGG16 network. The decoder constructs the segmentation mask by reusing the pooling indices from the max-pooling layers of the corresponding encoder. The creators removed the fully connected layers to reduce complexity, which lowers the parameter count of the encoder section from \(1.34\times 10^{8}\) to \(1.47\times 10^{7}\); see [29] for details.
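As an illustration, encoder-decoder networks of this kind can be instantiated with MATLAB's Computer Vision and Deep Learning Toolboxes; the sketch below uses assumed values (3-channel input for the VGG16-based encoder, binary label set, encoder depth of three) and is not the exact configuration of our experiments, which is given in Table 2.

```matlab
% Sketch: building SegNet and U-NET semantic segmentation networks in MATLAB.
% Assumptions: 256x256 slices and a binary (not-infected / infected) label set;
% adjust imageSize, numClasses, and EncoderDepth to match Table 2.
imageSize  = [256 256 3];          % the VGG16-based encoder expects 3 channels
numClasses = 2;                    % {NotInfected, Infected}

% SegNet with a VGG16 encoder; the fully connected layers are omitted by design,
% which is what shrinks the encoder parameter count.
lgraphSegNet = segnetLayers(imageSize, numClasses, 'vgg16');

% U-NET counterpart with a comparable encoder-decoder depth.
lgraphUnet = unetLayers([256 256 1], numClasses, 'EncoderDepth', 3);
```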

Network training

The neural networks are trained with the ADAM stochastic optimizer because of its fast convergence rate compared to other optimizers [30]. The input images are resized to \(256\times 256\) to reduce training time and memory requirements. The one-hundred-image dataset is divided into three sets for training, validation, and testing, with proportions of 0.72, 0.10, and 0.18, respectively. To counter the class imbalance discussed earlier, class weights are calculated using median frequency balancing and handed to the pixel classification layer to formulate a weighted cross-entropy loss function [31]:

$$\begin{aligned} \gamma = - \frac{1}{K} \sum _{k=1}^{K}\sum _{n=1}^{N} w_n \cdot l_k^n \cdot \log (p_k^n) \end{aligned}$$
(1)

where K is the number of instances (pixels), N is the number of classes, \(l_k^n\) and \(p_k^n\) are the label and prediction for class n at instance k, and \(w_n\) is the weight of class n.
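A minimal sketch of how the median-frequency class weights can be derived and attached to the pixel classification layer in the Deep Learning Toolbox follows; the datastore name `pxdsTrain` and the output-layer name `'pixelLabels'` are assumptions for illustration.

```matlab
% Sketch: median frequency balancing for the weighted cross-entropy loss.
% pxdsTrain is assumed to be a pixelLabelDatastore over the training masks.
tbl = countEachLabel(pxdsTrain);                         % per-class pixel counts
imageFreq    = tbl.PixelCount ./ tbl.ImagePixelCount;    % frequency of each class
classWeights = median(imageFreq) ./ imageFreq;           % median frequency balancing

% Pixel classification layer implementing the weighted cross-entropy of Eq. (1).
pxLayer = pixelClassificationLayer( ...
    'Name', 'labels', ...
    'Classes', tbl.Name, ...
    'ClassWeights', classWeights);

% Swap in the weighted layer ('pixelLabels' is the assumed name of the default
% output layer; verify against lgraphSegNet.Layers).
lgraphSegNet = replaceLayer(lgraphSegNet, 'pixelLabels', pxLayer);
```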

Each network is trained nine times with different hyperparameters to find the best possible configuration. Table 2 lists these training hyperparameters. Training completed within 160 epochs for all experiments, and the training-time variation among networks was negligible, averaging 25 min. Figure 4 illustrates the training accuracy and loss for the best binary segmentors (U-NET #4 and SegNet #4) and the best multi-class segmentors (U-NET #4 and SegNet #7). The criteria used to select the best experiments are discussed in the results section.

Table 2 Hyperparameters used for training the DNNs

The training was carried out with the Deep Learning Toolbox version 14.0 in MATLAB R2020a (9.8.0.1323502) on a Windows 10 (version 10.0.18363.959) machine with an Intel Core i5-9400F CPU and an NVIDIA GTX 1050 Ti GPU (4 GB VRAM) using CUDA 10.0.130. Using the GPU reduced training times by a factor of 35 on average.
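The ADAM setup described above might look like the following sketch in the Deep Learning Toolbox; the learning rate, mini-batch size, and the datastore names `dsTrain` and `dsVal` are placeholders, with the actual per-experiment values given in Table 2.

```matlab
% Sketch: ADAM training configuration (hyperparameter values are illustrative;
% see Table 2 for the nine configurations actually used).
opts = trainingOptions('adam', ...
    'InitialLearnRate', 1e-3, ...        % placeholder value
    'MaxEpochs', 160, ...                % all experiments converged within 160 epochs
    'MiniBatchSize', 4, ...              % placeholder value
    'Shuffle', 'every-epoch', ...
    'ValidationData', dsVal, ...         % validation datastore (10% of the images)
    'ExecutionEnvironment', 'gpu', ...   % training on the GTX 1050 Ti
    'Plots', 'training-progress');

% dsTrain combines the CT images and their label masks (72% of the dataset).
net = trainNetwork(dsTrain, lgraphSegNet, opts);
```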

Evaluation criteria and procedure

To fully quantify the performance of our models, we use five well-known classification criteria: sensitivity, specificity, G-mean, Sorensen-Dice (also known as F1), and the F2 score. Equations (2)–(6) define these criteria:

$$\begin{aligned} {\text {Sensitivity}} = \frac{{\text {TP}}}{{\text {TP}} + {\text {FN}}} \end{aligned}$$
(2)
$$\begin{aligned} {\text {Specificity}} = \frac{{\text {TN}} }{{\text {TN}} + {\text {FP}}} \end{aligned}$$
(3)
$$\begin{aligned} {\text {Sorensen-Dice}} = \frac{2\times {\text {TP}}}{2\times {\text {TP}} + {\text {FP}} + {\text {FN}}} \end{aligned}$$
(4)
$$\begin{aligned} {\text {G-mean}} = \sqrt{{\text {sensitivity}} \times {\text {specificity}}} \end{aligned}$$
(5)
$$\begin{aligned} {\text {F2-score}} = \frac{5\times {\text {Precision}}\times {\text {Sensitivity}}}{4\times {\text {Precision}} + {\text {Sensitivity}} } \end{aligned}$$
(6)

These criteria were selected because of the imbalanced nature of the dataset discussed in the Materials and Methods section.

The evaluation was carried out as follows: the global accuracy of the classifier was calculated for each test image and averaged over all images. Using the mean global accuracies, the best experiment of each network was chosen for a class-level assessment. Then, the statistical scores (2)–(6) were calculated for each class and tabulated.
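The per-class scores follow directly from the pixel-level confusion counts; a minimal sketch (the count variables are placeholders accumulated over the test images for one class) is:

```matlab
% Sketch: per-class statistics (Eqs. 2-6) from pixel-level confusion counts.
% TP, TN, FP, FN are assumed to be accumulated over all test images for one class.
sensitivity = TP / (TP + FN);                            % Eq. (2)
specificity = TN / (TN + FP);                            % Eq. (3)
dice        = 2*TP / (2*TP + FP + FN);                   % Eq. (4), Sorensen-Dice / F1
gmean       = sqrt(sensitivity * specificity);           % Eq. (5)
precision   = TP / (TP + FP);
f2          = 5*precision*sensitivity / (4*precision + sensitivity);  % Eq. (6)
```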

Results

Binary segmentation

Test image results

Table 3 shows the results for both binary classification models after evaluating every experiment of each network. Our networks achieve accuracy values above 0.90 in all cases and 0.954 in the best case (experiment 4 of the SegNet network), with a standard deviation of 0.029. The second-best network is experiment 4 of the U-NET architecture, with an accuracy of 0.95 and a standard deviation of 0.043.

Table 3 Global accuracy metrics of the test images calculated for the nine experiments of the U-NET and SegNet networks as binary segmentors. The “plot” columns visualize the mean accuracy and the standard deviation of each experiment

The best experiment of each architecture is selected for further performance investigation on the class level.

Class level

Based on the criteria discussed in the “Methods” section, the two best networks identified in the previous section are evaluated. The SegNet network surpasses U-NET by noticeable margins for all metrics except sensitivity and G-mean, where both networks produce similar results; see Table 4.

Table 4 Statistical results for the binary segmentors

Multi-class segmentation

Test image results

Similarly, we obtain the best experiment for each multi-class segmentation network. The best experiment of the SegNet architecture is number 7, giving an accuracy of 0.907 with a standard deviation of 0.06. The overall best accuracy of 0.908 is achieved by the fourth experiment of the U-NET network, with a standard deviation of 0.065. All experiments achieve an accuracy higher than 0.8 except for the first three experiments of SegNet; refer to Table 5.

Table 5 Global accuracy metrics of the test images calculated for the nine experiments of the U-NET and SegNet networks as multi-class segmentors. The “plot” columns visualize the mean accuracy and the standard deviation of each experiment

Class level

In the same manner as the binary segmentation results, the best experiment of each architecture is evaluated, as presented in Table 6. Both networks struggled to recognize the C3 class; nevertheless, they achieve good results for C1 and C2. We also notice the high specificity rates across all classes. The U-NET architecture recorded higher values for all metrics except specificity.

Table 6 Statistical results for the multi-class segmentors

Discussion

Binary classification problem

It can be inferred from Table 4 that SegNet outperforms the U-NET architecture by a noticeable margin. Both networks have an exceptionally high true positive count for the “Not Infected” class. The results quantify how reliably the DNN models distinguish the non-infected from the infected class, i.e., the ill portions of the lungs; further experiments involving a larger dataset are likely to confirm this. The high sensitivity (0.956) and specificity (0.945) of the best network (SegNet) indicate its suitability for modeling a trained radiologist on the task at hand.

Regarding the standard deviations of the results shown in Table 3, the values range from 0.060 to 0.086. These low values indicate highly consistent accuracies across the test partition of the dataset.

The results of our SegNet show improvements over Inf-Net and Semi-Inf-Net, presented in [25], in terms of the Dice, specificity, and sensitivity metrics, while the U-NET outperforms them only in terms of sensitivity. Both works utilize the same dataset. As a binary segmentor, Inf-Net focuses on edge information and allocates a portion of its computations to highlighting it. This shifts focus away from the important internal texture and places more weight on the fractal-shaped edges, especially since there is no evidence of high contrast between the infection and the lung tissue. Secondly, the parallel partial decoder used by the network gives less weight to low-level features, which are considered key for texture highlighting. Another reason might be that the SegNet was trained on dataset images that contain only lung areas.

SegNet outperforms the Semi-Inf-Net network, an architecture that utilizes pseudo-labeling to generate additional training data, by a small margin. This might be because the pseudo-labeling technique generated 1600 labels from only 50 labeled images, which were then used to train the network.

SegNet also surpasses the COVID-SegNet architecture proposed in [22] in sensitivity and Dice metrics. This might be because, according to the authors, COVID-19 lesions were difficult to distinguish from the chest wall. COVID-SegNet was able to segment the lung region with close-to-perfect performance, yet was not able to match this accuracy in segmenting the infection regions that are close to the wall. A more detailed comparison, in which both architectures are trained and tested on the same dataset, might be necessary to further generalize this result.

It should be noted that increasing the mini-batch size has a negative effect on the networks' performance; further tests may lead to a generalized statement regarding this. A previous study investigated the role of the mini-batch size in VGG16 convergence and concluded that smaller mini-batch sizes coupled with a low learning rate yield a better training outcome [32]. Another study concluded that smaller mini-batch sizes tend to produce more stable training for ResNet networks by updating the gradient calculations more frequently [33].

Multi-class problem

Table 6 shows how well the U-NET segments Ground Glass Opacification and Consolidation. However, it produced only modest results in segmenting pleural effusion, with a Dice of 0.23 and an F2 score of 0.38, which undermines its role as a reliable tool for pleural effusion segmentation.

The C3 class, as discussed in the Dataset section, is the least represented class in the dataset. Therefore, such a result is expected from a multi-class segmentation model constructed using only 72 training images.

The standard deviation values of the multi-class segmentors were, on average, slightly higher than those of the binary segmentors. Nevertheless, they still indicate that the networks are solid performers in terms of accuracy. The high specificity rates clearly indicate that the models are reliable in identifying non-infected tissue (class C0).

Five-fold cross-validation

Because of the small number of images in the dataset, five-fold cross-validation was performed as an overall assessment. The dataset images were first shuffled to form a randomized dataset. Then, for each fold, the images were divided into three sets in a successive manner: 70% for training, 10% for validation, and 20% for testing. The validation set was used to monitor the network performance during training and to keep the training data count as close as possible to the procedure followed in the Network Training section.
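A minimal sketch of this fold-wise 70/10/20 split, assuming the 100-image dataset is addressed by index (the loop body where the datastores are built and the network is retrained is omitted):

```matlab
% Sketch: shuffling the 100 images and carving out 70/10/20 splits per fold.
% idx indexes the randomized dataset; the 20% test block slides across folds.
rng('shuffle');
idx = randperm(100);                       % scramble the dataset once
for fold = 1:5
    testIdx  = idx((fold-1)*20 + (1:20));  % 20% held out for testing
    restIdx  = setdiff(idx, testIdx, 'stable');
    trainIdx = restIdx(1:70);              % 70% for training
    valIdx   = restIdx(71:80);             % 10% for validation monitoring
    % ... build datastores from the index sets, train, and evaluate here
end
```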

Table 7 presents the statistical results using the criteria described in the Evaluation Criteria and Procedure section. We notice low standard deviation values for each score, except for the sensitivity of the \({\hbox {C}}_3\) class, with mean values close to those reported in Tables 4 and 6.

Table 7 Five-fold experiment results for the best network of each architecture

Network feature visualization

Deep Dream is a method for visualizing the features extracted by a network after training [34]. Since SegNet proved to be a reliable segmentor, given its high statistical scores, the generated Deep Dream image should lay out the key features distinguishing the two classes (non-infected, infected). The Deep Dream image is plotted in Fig. 5, in which a discernible pattern between the two classes can be observed.
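In the Deep Learning Toolbox such a visualization can be produced with `deepDreamImage`; in the sketch below, the layer name is a hypothetical placeholder for a late layer of the trained binary SegNet, and `net` is assumed to be the trained network.

```matlab
% Sketch: Deep Dream visualization of the trained binary SegNet.
% 'layerName' is a hypothetical placeholder; channels 1 and 2 correspond to the
% non-infected and infected classes of the final classification stage.
layerName = 'decoder1_conv1';              % assumed layer name; inspect net.Layers
channels  = 1:2;
dreamImg  = deepDreamImage(net, layerName, channels, ...
    'PyramidLevels', 2, 'NumIterations', 20);
montage(dreamImg)                          % non-infected (left), infected (right)
```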

Conclusions

In this paper, the performance of two deep learning networks (SegNet and U-NET) was compared in their ability to detect diseased areas in CT images of the lungs of COVID-19 patients. The results demonstrated the ability of the SegNet network to distinguish between infected and healthy tissues in these images. The two networks were also compared on a multi-class segmentation of the infected areas in the lung images, where the results showed the U-NET network's ability to distinguish between these areas. The results obtained in this paper represent promising prospects for using deep learning to assist in an objective diagnosis of COVID-19 through CT images of the lung.

Fig. 1

Dataset sample. CT scan (left), masked lungs (middle), and labeled classes (right), where black is class C0, dark gray is C1, light gray is C2, and white is C3.

Fig. 2

Accumulation of the dataset's labels. All the labels of the dataset were summed to form a graphic that illustrates the regions of the lungs most prone to infection

Fig. 3

The DNN architectures. SegNet (top), where the encoder and decoder of the network are illustrated using the gray and white bubbles, and U-NET (bottom), where the contractive and expansive layer patches are encapsulated in blue and yellow bubbles

Fig. 4

SegNet and U-NET binary and multi-class segmentors' training accuracy and loss. Four plots of training loss and accuracy for the best configuration of each segmentor

Fig. 5

SegNet binary segmentor Deep Dream image. Deep Dream image laying out the key features the network uses to segment the CT scans: infected tissue (right), non-infected (left)