Introduction

Coronavirus disease 2019 (COVID-19) is an ongoing global pandemic. The World Health Organization (WHO) declared COVID-19 a public health emergency of international concern on 30 January 2020 and a pandemic on 11 March 2020. As of 17 September 2020, COVID-19 had caused 29.87 million confirmed cases and 940.72 thousand deaths.

In practice, there are two main diagnostic methods. One is real-time reverse-transcription polymerase chain reaction (PCR), an RNA test of respiratory secretions sampled with a nasopharyngeal swab. The other is imaging, among which chest computed tomography (CT) performs better than chest X-ray. Studies have shown that chest CT is faster and more sensitive than PCR methods [1].

For chest CT, the main biomarker differentiating COVID-19 patients from healthy people is asymmetric peripheral ground-glass opacity (GGO) without pleural effusions. Manual interpretation by radiologists is tedious and easily influenced by fatigue, emotion, and other factors. A smart diagnosis system based on computer vision and artificial intelligence can benefit patients, radiologists, and hospitals.

Traditional artificial intelligence (AI) and modern deep learning (DL) methods have achieved excellent results in analyzing medical images; e.g., [2] proposed a radial-basis-function neural network (RBFNN) to detect pathological brains. [3] presented a kernel-based extreme learning classifier (K-ELM) to create a novel pathological brain detection system. Their method was robust and effective. [4] proposed a novel extreme learning machine trained by the bat algorithm (ELM-BA) approach. [5] used a six-layer convolutional neural network (6L-CNN) to recognize sign language fingerspelling. [6] presented the GoogleNet. [7] suggested the use of ResNet18 for mammogram abnormality detection. [8] presented a weakly labeled data augmentation method on COVID-19 chest X-ray images. [9] used a deep learning approach to characterize COVID-19 pneumonia in chest CT images. [10] combined support vector machine and convolutional neural network to detect COVID-19 from chest X-ray images.

The results of deep learning methods, particularly convolutional neural networks (CNNs), are significantly better than those of traditional computer-vision methods. However, two factors limit the wide usage of CNNs: (i) tuning the hyperparameters of a CNN is tedious and time-consuming, and (ii) small medical datasets may cause a CNN to overfit.

A solution that (i) accelerates the development of CNNs and (ii) helps avoid overfitting is “transfer learning (TL).” TL techniques can develop a deep neural network quickly, with performance comparable to or even better than that of recent deep learning methods. TL stores knowledge gained from solving one problem and applies that knowledge to a different but related problem.

This paper aims to apply a relatively new transfer learning framework, DenseNet, to the task of COVID-19 diagnosis. The contributions of this study are sixfold: (i) DenseNet was introduced as the backbone pre-trained model, and we adapted it to our task. (ii) A composite learning factor strategy was used for training DenseNet. (iii) Data augmentation was used to enhance the training set. (iv) An optimization of transfer learning setting (OTLS) strategy was proposed. (v) Precomputation was introduced to save memory. (vi) We compared our method with state-of-the-art COVID-19 diagnosis approaches and showed that it gives better performance.

Materials

We selected 142 patients with COVID-19 pneumonia from local hospitals. Another 142 subjects were randomly selected from healthy people undergoing medical checkups (tested negative). The COVID-19 patients formed the observation group: 95 males and 47 females. The healthy checkup subjects formed the control group: 88 males and 54 females. Inclusion criteria for confirmed COVID-19 were (1) a positive nucleic acid test and (2) complete CT image data. Table 1 lists the demographic statistics of all involved subjects.

Table 1 Demographic statistics of subjects

Image acquisition CT configuration and method: Philips Ingenity64 row spiral CT machine, KV: 120, MAS: 240, layer thickness 3 mm, layer spacing 3 mm, screw pitch 1.5; lung window (W: 1500 Hounsfield unit (HU), L: −500 HU), mediastinum window (W: 350 HU, L: 60 HU); thin-layer reconstruction according to the lesion display, with layer thickness and layer distance both 1 mm for the lung window image. The patients were placed in a supine position, held their breath after a deep inhalation, and were conventionally scanned from the lung apex to the costophrenic angle. All the acquired images have a resolution of \(1024 \times 1024\).

All images were transmitted to the medical image PACS for observation, and two junior radiologists \(({J}_{1},{J}_{2})\) with chest diagnostic experience jointly read the radiographs and recorded the distribution, size, and morphology of the CT manifestations of the lesions. About 1–4 slices were chosen per subject. The 142 COVID-19 subjects and 142 healthy people finally yielded \(320+320=640\) images.

When the two analyses differed, a senior doctor \(S\) was consulted, and a consensus was reached by majority voting.

$$\mathcal{L}=\left\{\begin{array}{cc}{J}_{1}& {J}_{1}={J}_{2}\\ \mathrm{MV}({J}_{1},{J}_{2},S)& {J}_{1}\ne {J}_{2}\end{array}\right.$$
(1)

where \({J}_{1}\), \({J}_{2}\), and \(S\) represent the opinions (COVID-19 or healthy) of the corresponding radiologists, MV represents majority voting, and \(\mathcal{L}\) is the resulting label.
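As an illustration of Eq. (1), the labeling rule can be sketched in a few lines of Python. The function name `consensus_label` and the string labels are hypothetical choices made for this sketch; they are not taken from the study's software.

```python
from collections import Counter
from typing import Optional


def consensus_label(j1: str, j2: str, senior: Optional[str] = None) -> str:
    """Label per Eq. (1): if J1 and J2 agree, use J1; otherwise take the
    majority vote over (J1, J2, S)."""
    if j1 == j2:
        return j1
    if senior is None:
        # Assumption of this sketch: disagreement without a senior opinion is an error.
        raise ValueError("junior readers disagree; a senior opinion is required")
    votes = Counter([j1, j2, senior])
    return votes.most_common(1)[0][0]


print(consensus_label("COVID-19", "COVID-19"))            # COVID-19
print(consensus_label("COVID-19", "healthy", "healthy"))  # healthy
```

With two possible labels, the senior opinion always matches one of the two junior opinions, so the majority vote is well defined.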

A slice-level selection method was used to extract slice images. For COVID-19 pneumonia patients, the slice showing the largest size and number of lesions was selected. For the healthy control group, any level of the image can be selected.

Methodology

For ease of understanding, Table 13 shows the abbreviations used in this paper. In what follows, we describe each step of our proposed method.

Improvement I: Preprocessing

The chest CT image set \({\mathbb{X}}\) is the concatenation of the COVID-19 subjects \({\mathbb{C}}\) and the healthy controls \({\mathbb{H}}\)

$${\mathbb{X}}=\mathrm{concat}({\mathbb{C}},{\mathbb{H}})$$
(2)

The images in \({\mathbb{X}}\) were collected from a variety of sources and different scanning machines. Suppose there are n images in the dataset; then we can write \({\mathbb{X}}=\left\{{x}_{1},{x}_{2},\dots ,{x}_{n}\right\}\).

The first step: A contrast normalization technique, histogram stretching (HS) [11], is chosen to preprocess all the images.

$${\mathbb{Y}}=\mathrm{HS}\left({\mathbb{X}}\right)=\left\{{y}_{1},{y}_{2},\dots ,{y}_{n}\right\}$$
(3)

Suppose x denotes the original chest CT image and y stands for the contrast-normalized image. Histogram stretching operation is defined below:

$$y\left(a,b\right)=\frac{\left[x\left(a,b\right)-{x}_{l}\right]}{\left({x}_{h}-{x}_{l}\right)}$$
(4)

where

$${x}_{l}=\underset{(a,b)}{\mathrm{min}}\left[x(a,b)\right]$$
(5)
$${x}_{h}=\underset{(a,b)}{\mathrm{max}}\left[x(a,b)\right]$$
(6)

\({x}_{l}\) and \({x}_{h}\) represent the lowest and highest gray-level intensities of the chest CT image x, and (a, b) stands for the pixel coordinates.
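A minimal sketch of Eqs. (4) to (6) in plain Python follows. The image is represented as a list of rows, and the handling of a constant image (where \({x}_{h}={x}_{l}\) and Eq. (4) is undefined) is an assumption of this sketch rather than something specified in the text.

```python
def histogram_stretch(image):
    """Apply Eq. (4): map intensities linearly so that x_l -> 0 and x_h -> 1."""
    flat = [v for row in image for v in row]
    x_l, x_h = min(flat), max(flat)  # Eqs. (5) and (6)
    if x_h == x_l:
        # Assumption: a constant image is mapped to all zeros.
        return [[0.0 for _ in row] for row in image]
    return [[(v - x_l) / (x_h - x_l) for v in row] for row in image]


toy = [[50, 100], [150, 250]]
stretched = histogram_stretch(toy)
print(stretched)  # [[0.0, 0.25], [0.5, 1.0]]
```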

The second step: We crop the image. The reasons are as follows: (i) there are rulers on the right side of the image, and (ii) there are texts and the checkup bed at the bottom of the image, as shown in Fig. 1. These would impair the subsequent classification performance, since the contents at the bottom and right sides are unrelated to our diagnosis. After cropping, we get the image set \({\mathbb{Z}}\).

Fig. 1 Why we need to crop the chest CT image (the right and bottom show unrelated contents, outlined by red boxes)

$${\mathbb{Z}}=\mathrm{Crop}\left({\mathbb{Y}}\right)=\left\{{z}_{1},{z}_{2},\dots ,{z}_{n}\right\}$$
(7)

The third step: We resize the image to [224, 224]. The reasons are the following: (i) the original size is too large and contains redundant spatial information, (ii) a reduced size accelerates deep neural network processing, and (iii) \(224\times 224\) is the standard input size of the DenseNet used later. After resizing, we get the image set \({\mathbb{R}}\) from \({\mathbb{Z}}\). Figure 2 illustrates the preprocessed images in the dataset used in this study.

Fig. 2 Illustration of preprocessed COVID-19 dataset

$${\mathbb{R}}=\mathrm{Resize}\left({\mathbb{Z}},\left[224, 224\right]\right)=\left\{{r}_{1},{r}_{2},\dots ,{r}_{n}\right\}$$
(8)

The resizing procedure saves storage significantly. Suppose we store each image in single-precision floating-point (SPFP) format, i.e., 32 bits = 4 bytes per value. The original size of each image is \(1024\times 1024\times 3\times 4=\mathrm{12,582,912}\) bytes. After resizing, the storage for each image is \(224\times 224\times 3\times 4=\mathrm{602,112}\) bytes. The resizing procedure therefore saves \(1-\frac{\mathrm{602,112}}{\mathrm{12,582,912}}=95.21\mathrm{\%}\) of the storage, so we can hold all the chest CT images in RAM and thereby speed up the proposed algorithms.
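The storage arithmetic above can be checked with a short script. The helper name `image_bytes` is hypothetical; the three-channel, 4-byte-per-value layout follows the figures quoted in the text.

```python
BYTES_PER_VALUE = 4  # single-precision floating point (32 bits)


def image_bytes(height, width, channels=3):
    """Storage of one image when every value is stored as a 4-byte float."""
    return height * width * channels * BYTES_PER_VALUE


original = image_bytes(1024, 1024)
resized = image_bytes(224, 224)
saving = 1 - resized / original

print(original)          # 12582912
print(resized)           # 602112
print(f"{saving:.2%}")   # 95.21%
```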

Improvement II: Data Augmentation on Training Set

To avoid overfitting, five different data augmentation (DA) techniques were utilized to expand the training set. Suppose the dataset \({\mathbb{R}}\) is divided into three parts, the training set \(X\), the validation set \(Y\), and the test set \(Z\), via the random hold-out (RHO) method

$${\mathbb{R}}\stackrel{\mathrm{RHO}}{\to }\left\{X,Y,Z\right\}$$
(9)

where

$$\begin{array}{c}X=\left\{x(i)\right\}\\ Y=\left\{y(i)\right\}\\ Z=\left\{z(i)\right\}\end{array}$$
(10)

and their sizes obey following equation

$$\left|X\right|+\left|Y\right|+|Z|=|{\mathbb{R}}|$$
(11)

where \(\left|.\right|\) means the cardinality of the set.

In total, we collected 640 lung window images from chest CT. Afterwards, we used the RHO method: 50% of the images were randomly assigned to training, 20% to validation, and the remaining 30% to testing. The summary of the dataset is listed in Table 2.
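A random hold-out split with these proportions could be sketched as follows. The function `random_hold_out` and its `seed` argument are hypothetical, and the sketch shuffles all images together; whether the original split was stratified by class is not stated here, so that aspect is left out.

```python
import random


def random_hold_out(items, train_frac=0.5, val_frac=0.2, seed=0):
    """Shuffle the items and cut them into train / validation / test parts."""
    items = list(items)
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    rng.shuffle(items)
    n_train = round(train_frac * len(items))
    n_val = round(val_frac * len(items))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder, 30% here
    return train, val, test


train, val, test = random_hold_out(range(640))
print(len(train), len(val), len(test))  # 320 128 192
```

The three parts are disjoint and together cover the whole dataset, which is the cardinality condition of Eq. (11).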

Table 2 RHO setting

For each image \(x(i)\), we perform the following five DA operations: (i) Scaling. Chest CT images are scaled with a scaling factor s from 0.7 to 1.3 in increments of 0.02, skipping s = 1.

$$\begin{array}{c}\overrightarrow{{x}^{\mathrm{scale}}(i)}=scale\left[x(i)\right]\\ =\left[{x}_{1}^{\mathrm{scale}}\left(i,{s}_{1}\right),\dots {x}_{30}^{\mathrm{scale}}\left(i,{s}_{30}\right)\right]\end{array}$$
(12)

where

$$\begin{array}{c}\begin{array}{c}{s}_{1}=0.7,{s}_{2}=0.72,\dots ,{s}_{15}=0.98\\ {s}_{16}=1.02,\dots ,{s}_{30}=1.3\end{array}\end{array}$$
(13)

(ii) Rotation. The angle θ ranges from −30° to 30° in increments of 2°, skipping \(\theta =0\).

$$\begin{array}{c}\overrightarrow{{x}^{\mathrm{rot}}(i)}=rotate\left[x(i)\right]\\ =\left[{x}_{1}^{\mathrm{rot}}\left(i,{\theta }_{1}\right),{x}_{2}^{\mathrm{rot}}\left(i,{\theta }_{2}\right),\dots {x}_{30}^{\mathrm{rot}}\left(i,{\theta }_{30}\right)\right]\end{array}$$
(14)

where

$$\begin{array}{c}{\theta }_{1}=-30^\circ ,{\theta }_{2}=-28^\circ ,\dots ,{\theta }_{15}=-2^\circ \\ {\theta }_{16}=2^\circ ,\dots ,{\theta }_{30}=30^\circ \end{array}$$
(15)

(iii) Noise injection. Gaussian noise with mean m and variance n [12] is added to the chest CT image \({x}_{i}\) to produce 30 new noisy images.

$$\begin{array}{c}\overrightarrow{{x}^{\mathrm{noise}}(i)}=\mathrm{noise}\_\mathrm{inject}\left[x(i)\right]\\ =\left[{x}_{1}^{\mathrm{noise}}\left(i,{m}_{1},{n}_{1}\right),\dots {x}_{30}^{\mathrm{noise}}\left(i,{m}_{30},{n}_{30}\right)\right]\end{array}$$
(16)

where

$${m}_{1}={m}_{2}=\cdots ={m}_{30}=0$$
(17)
$${n}_{1}={n}_{2}=\cdots ={n}_{30}=0.01$$
(18)

(iv) Random translation. The chest CT image \({x}_{i}\) is translated 30 times with random shift, of which the horizontal and vertical values t = [tx, ty] are in the range of [− 15, 15] pixels, and obey uniform distribution \(\mathcal{U}\).

$$\begin{array}{c}\overrightarrow{{x}^{\mathrm{translate}}(i)}=translate\left[x(i)\right]\\ =\left[{x}_{1}^{\mathrm{translate}}\left(i,{t}_{1}\right),\dots {x}_{30}^{\mathrm{translate}}\left(i,{t}_{30}\right)\right]\end{array}$$
(19)

where

$$\begin{array}{c}{t}_{x}\sim \mathcal{U}\left[-\mathrm{15,15}\right]\\ {t}_{y}\sim \mathcal{U}\left[-\mathrm{15,15}\right]\end{array}$$
(20)

(v) Gamma correction. The gamma correction factor r varies from 0.4 to 1.6 in increments of 0.04, skipping the value of 1.

$$\begin{array}{c}\overrightarrow{{x}^{\mathrm{Gamma}}(i)}=Gamma\_correct\left[x(i)\right]\\ =\left[{x}_{1}^{\mathrm{Gamma}}\left(i,{r}_{1}\right),\dots {x}_{30}^{\mathrm{Gamma}}\left(i,{r}_{30}\right)\right]\end{array}$$
(21)

where

$$\begin{array}{c}{r}_{1}=0.4,{r}_{2}=0.44,\dots ,{r}_{15}=0.96\\ {r}_{16}=1.04,\dots ,{r}_{30}=1.6\end{array}$$
(22)

Finally, horizontal mirroring [13] was applied to all 150 newly generated images.

$$\overrightarrow{{x}^{\mathrm{mirror}}(i)}=\mathrm{horizontal}\_\mathrm{mirror}\left(\overrightarrow{{x}^{DA5}(i)}\right)$$
(23)

where \(\overrightarrow{{x}^{DA5}}\) means the concatenation of five temporary DA results.

$$\overrightarrow{{x}^{DA5}(i)}=\mathrm{concat}\left[\begin{array}{c}\overrightarrow{{x}^{\mathrm{scale}}(i)}\\ \overrightarrow{{x}^{\mathrm{rot}}(i)}\\ \overrightarrow{{x}^{\mathrm{noise}}(i)}\\ \overrightarrow{{x}^{\mathrm{translate}}(i)}\\ \overrightarrow{{x}^{\mathrm{Gamma}}(i)}\end{array}\right]$$
(24)

Thus, one original image will lead to 301 images. Suppose size function is \(S\), we have \(S\left[x(i)\right]=1\), \(S\left[\overrightarrow{{x}^{DA5}\left(i\right)}\right]=5\times 30=150\), and \(S\left[\overrightarrow{{x}^{\mathrm{mirror}}(i)}\right]=150\). Thus, adding up we get \(1+150+150=301\).

$$x(i)\stackrel{\mathrm{DA}}{\to }\mathrm{concat}\left[x\left(i\right),\overrightarrow{{x}^{DA5}\left(i\right)},\overrightarrow{{x}^{\mathrm{mirror}}(i)}\right]$$
(25)
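The parameter grids of Eqs. (13), (15), and (22) and the resulting image count can be reproduced with the sketch below. It only enumerates parameters and counts images; it does not perform any image transformation. The helper `grid` is a hypothetical name, and the training-set size of 320 is taken from the 50% share of the 640 images.

```python
def grid(start, stop, step, skip):
    """Evenly spaced values from start to stop (inclusive), omitting `skip`."""
    n = round((stop - start) / step)
    values = [round(start + i * step, 2) for i in range(n + 1)]
    return [v for v in values if v != skip]


scales = grid(0.7, 1.3, 0.02, skip=1.0)   # Eq. (13)
angles = grid(-30, 30, 2, skip=0)         # Eq. (15), in degrees
gammas = grid(0.4, 1.6, 0.04, skip=1.0)   # Eq. (22)

NOISE_COPIES = 30      # 30 independent Gaussian-noise draws
TRANSLATE_COPIES = 30  # 30 random shifts

da5 = len(scales) + len(angles) + NOISE_COPIES + TRANSLATE_COPIES + len(gammas)
per_image = 1 + da5 + da5  # original + DA5 + mirrored DA5, Eq. (25)

print(len(scales), len(angles), len(gammas))  # 30 30 30
print(da5, per_image)                         # 150 301
print(320 * per_image)                        # 96320
```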

There are other DA techniques, such as color jittering and shear transform. In this study, we chose the above five DA techniques because they efficiently improve the performance of our deep learning model. Other DA techniques may improve the performance further, and we leave them to future studies.

Using Fig. 2a as an example, we can generate 150 new images in total and another 150 horizontally flipped images. Figure 3a–e shows the \(\overrightarrow{{x}^{\mathrm{DA}5}}\) results. Owing to the page limit, we do not display the horizontally flipped images. After data augmentation, the training set is \(301\times\) its original size; that is, we have \(320\times 301=\mathrm{96,320}\) training images in the augmented training set.

Fig. 3 Data-augmented training samples (mirror results are not shown)

Basics of Transfer Learning

The basic idea of transfer learning (abbreviated as TL) [14] is to take a complicated and successfully pre-trained model (DenseNet in this study), trained on a sizable amount of source data, viz., ImageNet, and then “transfer” the learnt knowledge to a relatively simple task (classifying COVID-19 versus healthy control, HC) with a tiny quantity of data. Mathematically, suppose the source data \({D}_{S}\) represents ImageNet and the source label \({L}_{S}\) represents the 1000-category labeling; the source triple is

$$\mathcal{S}=\left\{{D}_{S}, {L}_{S},{f}_{S}\right\}$$
(26)

where \({f}_{S}\) means the source objective-predictive function, i.e., DenseNet in this study. Now we have the target triple: the target data \({D}_{T}\) represents the augmented training set, \({L}_{T}\) represents the 2-class labeling (COVID-19 or healthy), and \({f}_{T}\) represents the classifier to be established.

$$\mathcal{T}=\left\{{D}_{T},{L}_{T},{f}_{T}\right\}$$
(27)

Using transfer learning, the classifier to be created can be written as \({f}_{T}\left({D}_{T},{L}_{T}|{D}_{S},{L}_{S},{f}_{S}\right)={f}_{T}\left({D}_{T},{L}_{T}|\mathcal{S}\right)\). Without transfer learning, the classifier is written as \({f}_{T}\left({D}_{T},{L}_{T}\right)\). \({f}_{T}\left({D}_{T},{L}_{T}|\mathcal{S}\right)\) is expected to be much closer to the ideal classifier \({f}_{T}^{\mathrm{Ideal}}\) than using only the target domain, viz. suppose there is a large number of samples \(X\), then

$$\begin{array}{c}\left|{f}_{T}\left({D}_{T},{L}_{T}|\mathcal{S}\right)\left(X\right)-{f}_{T}^{\mathrm{Ideal}}\left(X\right)\right|\\ <\left|{f}_{T}\left({D}_{T},{L}_{T}\right)\left(X\right)-{f}_{T}^{\mathrm{Ideal}}\left(X\right)\right|\end{array}$$
(28)

where | | is some error function over all the samples \(X\).

In practice, three elements are vital to help the transfer: (i) The success of the pre-trained model (PTM) relieves the user of hyper-parameter tuning. (ii) The initial layers in the PTM can be thought of as feature descriptors that extract low-level features, e.g., tints, edges, blobs, shades, and textures. (iii) The target model may only need to re-train the last several layers of the PTM, since we believe the last several layers carry out the complex identification tasks. The basic idea of transfer learning is shown in Fig. 4.

Fig. 4 Idea of transfer learning

Improvement III: Use DenseNet as Backbone

Pretrained models (PTMs) are useful tools that help AI practitioners swiftly develop a deep neural network suitable for a specific task. Common PTMs include AlexNet, VGGNet, ResNet, GoogleNet, and DenseNet. AlexNet was proposed by [15] and overwhelmed the competing models in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. The structure of AlexNet is similar to LeNet but with some new techniques such as local response normalization (LRN), max pooling, and ReLU nonlinearity. [16] presented VGGNet, a 19-layer deep neural network that won 2nd place in ILSVRC-2014. GoogleNet, proposed by [17], combined some new ideas including \(1\times 1\) convolutions, reducing the size of feature maps, and making the network deeper and wider. ResNet [18], the winner of ILSVRC-2015, presented a deeper network with 152 layers, which includes a new idea of a shortcut connection. Then in 2016, [19] developed a new idea, the so-called “DenseBlock (DB),” which introduced feature reuse into the whole network.

In this study, we choose DenseNet as the backbone for developing a smart COVID-19 diagnosis system, as DenseNet provides the best performance on the ImageNet classification task. In the traditional CNN shown in Fig. 5a, all layers are connected sequentially as

Fig. 5 Comparison of plain CNN block, ResNet block, and DenseNet block

$$\mathrm{CNN}:{x}_{l}={N}_{l}({x}_{l-1})$$
(29)

where l stands for the layer index and N means the non-linear operation. \({x}_{l}\) represents the feature map of the \(l\)-th layer. This plain structure makes it challenging for the network to go wider and deeper, as it may face the problem of exploding or vanishing gradients. ResNet then employed a shortcut connection that skips at least two layers.

$$\mathrm{ResNet}:{x}_{l}={N}_{l}({x}_{l-1})+{x}_{l-1}$$
(30)

Figure 5b shows its structure: the input is \({x}_{l-1}\), the output of the two conv layers \({N}_{l}({x}_{l-1})\) is added to the shortcut from the input \({x}_{l-1}\), and the sum is the output of the l-th layer.

Further, the DenseNet block shown in Fig. 5c revises the model by concatenating all the feature maps \([{x}_{0},{x}_{1},...,{x}_{l-1}]\) from the previous layers sequentially instead of summing them

$$\mathrm{DenseNet}:{x}_{l}={N}_{l}(\mathrm{concat}[{x}_{0},{x}_{1},...,{x}_{l-1}])$$
(31)

DenseNet concatenates the feature maps from all previous layers [20], which means that all the feature maps propagate to the later layers and are connected to the newly generated feature maps. DenseNet brings advantages such as feature reuse and alleviation of the exploding or vanishing gradient problem [21].

From Fig. 5c, we find that k feature maps are generated for each operation \({N}_{l}\).

$$\begin{array}{c}\begin{array}{c}{x}_{1}={N}_{1}\left({x}_{0}\right)\\ {x}_{2}={N}_{2}\left(\mathrm{concat}\left[{x}_{0},{x}_{1}\right]\right)\\ {x}_{3}={N}_{3}\left(\mathrm{concat}\left[{x}_{0},{x}_{1},{x}_{2}\right]\right)\end{array}\\ {x}_{4}={N}_{4}\left(\mathrm{concat}\left[{x}_{0},{x}_{1},{x}_{2},{x}_{3}\right]\right)\end{array}$$
(32)

As there are five layers in Fig. 5c, we finally get \({k}_{0}+4k\) feature maps, since the final feature map is \([{x}_{0},{x}_{1},{x}_{2},{x}_{3},{x}_{4}]\); \({k}_{0}\) stands for the number of feature maps in \({x}_{0}\) from the previous layer. The default value of k in this study is 32.
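The growth of the concatenated feature maps in Eq. (32) can be traced with a short counting sketch. The value \({k}_{0}=64\) used below is an assumption chosen only for illustration; k = 32 is the default stated above.

```python
def dense_block_channels(k0, k, num_ops):
    """Number of feature maps in concat[x_0, ..., x_l] after each operation N_l."""
    counts = [k0]  # only x_0 before the first operation
    for _ in range(num_ops):
        counts.append(counts[-1] + k)  # each N_l contributes k new feature maps
    return counts


# k0 = 64 is an illustrative assumption; k = 32 is the default growth rate.
print(dense_block_channels(k0=64, k=32, num_ops=4))  # [64, 96, 128, 160, 192]
```

The last entry equals \({k}_{0}+4k\), matching the count given for the five-layer block of Fig. 5c.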

Because each layer receives a large number of input feature maps, a bottleneck layer was introduced to DenseNet. It is implemented by a 1 × 1 convolution before the 3 × 3 convolution layer, which helps reduce the number of feature maps and saves computation cost [22].

Transition layers (TLs) were proposed between DenseBlocks. A TL has two advantages: (i) It can compress the number of feature maps: suppose k feature maps are generated by a DenseBlock and the compression factor is \(\theta \in (\mathrm{0,1}]\). Then the number of feature maps \({N}_{\mathrm{FM}}\) will be reduced to \(\lfloor \theta \times k\rfloor\).

$${N}_{\mathrm{FM}}=\left\{\begin{array}{cc}k& \text{before TL}\\ \lfloor \theta \times k\rfloor & \text{after TL}\end{array}\right.$$
(33)

If \(\theta\) = 1, the number of feature maps will be the same.

(ii) It can downsample the feature maps within the transition layers, i.e., a \(1\times 1\) conv followed by \(2\times 2\) pooling between two consecutive DenseBlocks. Feature map sizes are the same within each DenseBlock so that they can be concatenated together easily.

Figure 6 shows the operations in transition layers: batch normalization, convolution, and pooling operations.

Fig. 6 How DenseNet classifies chest CT images (TL transition layer, DB DenseBlock, GAP global average pooling, FCL fully connected layer)

Figure 6 illustrates the structure of DenseNet, which includes three DBs, input layer, TLs, and global average pooling (GAP) layer. The TLs consist of a batch normalization layer, a 1 × 1 conv layer and a 2 × 2 average pooling layer with stride of 2. Particularly, the GAP is similar to traditional maximum pooling (MP) and average pooling (AP) methods, but it undertakes a more extreme feature map reduction, that reduces the size of feature map \({S}_{\mathrm{FM}}\) from \(\omega \times \omega \times c\) to 1 × 1 × c.

$${S}_{\mathrm{FM}}=\left\{\begin{array}{cc}\omega \times \omega \times c& \text{before GAP}\\ 1\times 1\times c& \text{after GAP}\end{array}\right.$$
(34)

That means the GAP layer reduces each \(\omega \times \omega\) feature map (channel) to a single value.
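The reduction in Eq. (34) can be illustrated with a small plain-Python sketch, in which a feature map stack is a list of c channels and each channel is an \(\omega \times \omega\) list of rows. The function name and the toy values are hypothetical.

```python
def global_average_pool(feature_maps):
    """Reduce c channels of size w x w to c scalars (w x w x c -> 1 x 1 x c)."""
    pooled = []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))  # mean over all spatial positions
    return pooled


toy_maps = [
    [[1.0, 2.0], [3.0, 4.0]],  # channel 1
    [[0.0, 0.0], [0.0, 8.0]],  # channel 2
]
print(global_average_pool(toy_maps))  # [2.5, 2.0]
```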

We needed to modify the structure of DenseNet to make it feasible for our COVID-19 diagnosis task. The last dense layer, viz., the FCL, was modified, since the original last FCL was created to classify 1000 categories, of which 20 randomly chosen ones are listed below: recreational vehicle, printer, coho, milk can, Irish wolfhound, parallel bars, tree frog, dhole, Gila monster, toucan, spider web, organ, walking stick, broccoli, loggerhead, bassoon, colobus, racket, schooner, and Kerry blue terrier. We can observe that none of those 20 categories is related to chest CT or COVID-19, which is the main classification task in this study.

Because the size of output neurons in standard DenseNet (1000-way) does not equal the number of classes in this study (two-way as COVID-19 and healthy control), it is necessary to modify the last FCL and the classification layer.

The modification is presented in Table 3. In this TL environment, a new randomly initialized FCL with 2 output neurons and a new classification layer with 2 categories (COVID-19 and HC) were used to replace the previous corresponding layers. The parameters of the softmax layer were updated accordingly.

Table 3 Modification of layers of DenseNet

Improvement IV: Optimization of TL Setting

How should the DenseNet transfer learning setting be optimized? Many hyperparameters need to be determined, e.g., should we use DenseNet-121, DenseNet-169, or DenseNet-201? How should we divide the network into frozen layers, middle layers, and new layers? To solve these issues, we propose an optimization of transfer learning setting (OTLS) framework in this study.

Taking DenseNet-121 shown in Fig. 7a as an example, it contains 121 learnable layers: (6 + 12 + 24 + 16) × 2 = 116 layers in the dense blocks and 5 layers from the first conv layer, the last FCL, and the three TLs. The details of DenseNet-121 are shown in Table 4. Figure 7b shows the inner structure of blocks DB1 and TL1.

Fig. 7 Structure of DenseNet-121 (CP means the first block of conv layer and pooling layer, DB means the dense block, TL means the transition layer, GAP means global average pooling, FCL means fully connected layer, DL means dense layer)

Table 4 Detailed information of each layer/block in DenseNet-121

For DenseNet-201, traditional transfer learning uses a simple learning factor (SLF), which employs the same learning factor across all the layers within a neural network. This study employs an advanced strategy called composite learning factor (CLF) [23]: the early transferred layers are frozen with learning factor \({\mathrm{LF}}_{\mathrm{Frozen}}=0\), i.e., no update; the middle transferred layers are updated slowly with an LF of 1; and the newly added final layers learn fast with an LF of 10. Overall, there are four different composite learning factor settings (CLFSs) for DenseNet. Table 5 shows the detailed information of the CLFSs.

Table 5 Composite-learning factor setting

The LF of the final new layers is ten times that of the middle transferred layers, because the middle transferred layers start from pre-trained weights/biases whereas the new layers start from randomly initialized weights/biases. The structure is shown in Fig. 8a. Here CLF Setting III means that the layers from CP to DB3 are frozen. The layers from TL3 to DB4 are transferred directly with pre-trained weights and are updated with learning factor \({\mathrm{LF}}_{\mathrm{Middle}}=1\); note that the learning rate (LR) equals the global learning rate (GLR) times the learning factor (LF).

Fig. 8 CLF Setting

$$\mathrm{LR}=\mathrm{GLR}\times \mathrm{LF}$$
(35)

The final new layers of FCL are randomly initialized with \({\mathrm{LF}}_{\mathrm{New}}=10\). In all, LF is chosen as

$$\mathrm{LF}=\left\{\begin{array}{cc}0& \text{Frozen layers}\\ 1& \text{Middle layers}\\ 10& \text{New layers}\end{array}\right.$$
(36)

We can observe that LF takes three possible values, associated with the frozen layers, middle layers, and new layers. The configuration \({\mathrm{LF}}_{\mathrm{Frozen}}=0\), \({\mathrm{LF}}_{\mathrm{Middle}}=1\), and \({\mathrm{LF}}_{\mathrm{New}}=10\) is taken from Wang [23]. Testing other combinations of \(\left({\mathrm{LF}}_{\mathrm{Frozen}},{\mathrm{LF}}_{\mathrm{Middle}},{\mathrm{LF}}_{\mathrm{New}}\right)\) is one of our future research directions.
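Equations (35) and (36) together give the effective learning rate of each layer group. The sketch below evaluates them in plain Python, using the global learning rate of \({10}^{-4}\) that this paper reports for its experiments; the dictionary keys and function name are hypothetical. In a deep learning framework, the same effect would typically be obtained by assigning each layer group its own learning rate in the optimizer.

```python
GLR = 1e-4  # global learning rate used in this study

# Learning factors of Eq. (36); the group names are labels chosen for this sketch.
LEARNING_FACTOR = {"frozen": 0, "middle": 1, "new": 10}


def layer_learning_rate(group, glr=GLR):
    """Eq. (35): LR = GLR x LF for the given layer group."""
    return glr * LEARNING_FACTOR[group]


for group in ("frozen", "middle", "new"):
    print(group, layer_learning_rate(group))
```

The frozen group receives a learning rate of exactly zero, so its weights never change, while the new layers learn ten times faster than the middle layers.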

Besides optimizing the CLF setting, we also need to seek the optimal DenseNet structure. There are currently three stable DenseNet models: DenseNet-121, DenseNet-169, and DenseNet-201. The latter two have structures similar to DenseNet-121, except for the number of 1 × 1 and 3 × 3 layers. DenseNet-121 contains 24 and 16 subblocks in Dense Blocks 3 and 4, respectively. DenseNet-169, as shown in Fig. 8b, includes 32 subblocks in both Dense Blocks 3 and 4. For DenseNet-201, the numbers change to 48 and 32 in Dense Blocks 3 and 4, respectively. Figure 8 shows the CLF settings of DenseNet-121 and DenseNet-169 adapted to our COVID-19 diagnosis task.
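The layer-count formula given earlier for DenseNet-121 (two conv layers per subblock, plus five other learnable layers) can be applied to all three variants as a consistency check. The subblock counts of Dense Blocks 3 and 4 are those stated above; the counts 6 and 12 for Dense Blocks 1 and 2 are stated in this paper only for DenseNet-121 and are assumed here to be shared by the other two variants, as in the standard DenseNet configurations.

```python
# Subblocks per dense block (DB1, DB2, DB3, DB4).
# Assumption: DB1 = 6 and DB2 = 12 for all three variants.
DENSE_BLOCK_SUBBLOCKS = {
    "DenseNet-121": (6, 12, 24, 16),
    "DenseNet-169": (6, 12, 32, 32),
    "DenseNet-201": (6, 12, 48, 32),
}


def learnable_layers(subblocks):
    in_blocks = 2 * sum(subblocks)  # each subblock holds a 1x1 and a 3x3 conv
    other = 5                       # first conv layer + three TLs + last FCL
    return in_blocks + other


for name, subblocks in DENSE_BLOCK_SUBBLOCKS.items():
    print(name, learnable_layers(subblocks))
# DenseNet-121 121
# DenseNet-169 169
# DenseNet-201 201
```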

Precomputation Analysis

Precomputation is employed here. After freezing the layers by setting their corresponding learning factor to 0, we can calculate the activation maps at the last frozen layer for all the images in the dataset. Then we save the feature maps to hard drive storage. Those feature maps are used as inputs to train the trainable layers (middle layers and new layers in Table 5), which can be regarded as a smaller standalone neural network.
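The benefit of precomputation can be seen in a toy sketch: the frozen part is evaluated once per image, and every training epoch then reuses the cached activations. Here `frozen_part` is a hypothetical stand-in for the frozen DenseNet layers (any fixed function serves the purpose), and the inner loop body is a placeholder for a gradient step on the trainable layers.

```python
calls = {"frozen": 0}


def frozen_part(image):
    # Hypothetical stand-in for the frozen layers: a fixed, non-trainable mapping.
    calls["frozen"] += 1
    return [sum(image) / len(image), max(image)]


def precompute(images):
    """Run the frozen layers once per image and keep the activations."""
    return [frozen_part(img) for img in images]


images = [[0.1, 0.5, 0.9], [0.2, 0.2, 0.4], [0.7, 0.3, 0.5]]
features = precompute(images)  # one forward pass through the frozen part per image

EPOCHS = 10
steps = 0
for _ in range(EPOCHS):
    for feat in features:  # the trainable layers consume cached features
        steps += 1         # placeholder for one update of the trainable layers

print(calls["frozen"], steps)  # 3 30
```

Without the cache, the frozen part would be evaluated once per image per epoch, that is, 30 times instead of 3 in this toy setting.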

Using DenseNet-201 as an example, we check the RAM storage. Again, we assume parameters are stored in SPFP format (4 bytes). The storage comparison of all four CLF settings (considering weights, biases, offsets, scales, etc.) is given in Fig. 9, where the y-axis uses a log scale for ease of viewing. We can see that CLFS-IV requires only 6.98 million parameters and around 27.95 MB of memory, which are the smallest values of all CLF settings.

Fig. 9 Storage comparison of four transfer learning settings in DenseNet-201

Implementation

The experiment was run ten times. Each time, the training-validation hold-out division was reset at random. The test set was kept away from training from the very beginning, so its information would not leak into the training procedure. The training stopped when either it reached the given maximum epoch or the validation performance \({\mathcal{P}}_{\mathrm{val}}\) decreased over a preset number of training epochs. The stochastic gradient descent with momentum (SGDM) approach was selected as the training algorithm.

Table 6 Pseudocode of our proposed DenseNet-OTLS method

Table 6 shows the pseudocode of the proposed DenseNet-OTLS method. In Phase I, we use the preprocessing approaches described in “Improvement I: Preprocessing” to make the data tractable and fit our deep neural network model. In Phase II, we use the proposed OTLS framework to seek the optimal base model (BM) and composite learning factor setting (CLFS). The core function here is “TrainNetwork” with four arguments: (i) BM \(M\), (ii) CLFS \(S\), (iii) training data t, and (iv) validation data v.

$${\mathbb{M}}=\mathrm{TrainNetwork}(\mathrm{BM},\mathrm{CLFS},t,v)$$
(37)

Using the trained model and new data, we can get the prediction results \(\upgamma\) on particular data d by the model \({\mathbb{M}}\).

$$\upgamma =\mathrm{Predict}({\mathbb{M}},d)$$
(38)

With the ground truth labels \(\mathcal{L}\), we can calculate the performance \(\mathcal{P}\).

$$\mathcal{P}=\mathrm{Compare}(\upgamma , \mathcal{L})$$
(39)

The performance can be the training performance \({\mathcal{P}}_{\mathrm{train}}\), the validation performance \({\mathcal{P}}_{\mathrm{val}}\), or the test performance \({\mathcal{P}}_{\mathrm{test}}\), depending on the data d. The optimal BM \({M}^{*}\) and optimal CLFS \({S}^{*}\) can be obtained on the validation set by

$${[M}^{*},{S}^{*}]=\mathrm{argmax}\left[{\mathcal{P}}_{\mathrm{val}}(M,S)\right]$$
(40)
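The search in Eq. (40) is an exhaustive argmax over all (BM, CLFS) pairs. A minimal sketch is given below; `select_best` is a hypothetical name, and `val_performance` stands for the chain of Eqs. (37) to (39) evaluated on the validation set. The scores in the usage example are toy values invented purely for illustration and are not results from this study.

```python
from itertools import product


def select_best(models, settings, val_performance):
    """Eq. (40): return the (M, S) pair with the highest validation performance."""
    return max(product(models, settings), key=lambda ms: val_performance(*ms))


# Toy scores for illustration only (not experimental results).
toy_scores = {
    ("A", "I"): 0.90, ("A", "II"): 0.93,
    ("B", "I"): 0.95, ("B", "II"): 0.91,
}
best = select_best(["A", "B"], ["I", "II"], lambda m, s: toy_scores[(m, s)])
print(best)  # ('B', 'I')
```

In the actual pipeline, each call to `val_performance` would involve training one network, so the cost grows with the product of the number of base models and the number of CLF settings (3 × 4 = 12 configurations in this study).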

In Phase III, the “Predict” function was run on the test set to finally obtain the test performance \({\mathcal{P}}_{\mathrm{test}}\).

Indicators

The test performances across all 10 runs were recorded, and six indicators were assessed: sensitivity (SEN), specificity (SPC), accuracy (ACC), precision (PRC), F1 score, and Matthews correlation coefficient (MCC). We used those six indicators because they are widely reported in recent literature. There are some other indicators, but we did not use them, since we believed these six were sufficient to measure the performance of the proposed classifiers.

Assume positive means COVID-19, and negative means healthy control. The first four measures were defined as

$$\mathrm{SEN}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$$
(41)
$$\mathrm{SPC}=\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}}$$
(42)
$$\mathrm{PRC}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$$
(43)
$$\mathrm{ACC}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}}$$
(44)

where TP, FP, TN, and FN represent true positive, false positive, true negative, and false negative, respectively. F1 and MCC are defined as

$$\begin{array}{c}{F}_{1}={\left(\frac{{\mathrm{SEN}}^{-1}+{\mathrm{PRC}}^{-1}}{2}\right)}^{-1}\\ =\frac{2\times \mathrm{TP}}{2\times \mathrm{TP}+\mathrm{FP}+\mathrm{FN}}\end{array}$$
(45)
$$\mathrm{MCC}=\frac{\mathrm{TP}\times \mathrm{TN}-\mathrm{FP}\times \mathrm{FN}}{\sqrt{\left(\mathrm{TP}+\mathrm{FP}\right)\times \left(\mathrm{TP}+\mathrm{FN}\right)\times \left(\mathrm{TN}+\mathrm{FP}\right)\times (\mathrm{TN}+\mathrm{FN})}}$$
(46)
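
Equations (41) to (46) can be computed directly from the four confusion-matrix counts. The sketch below is a minimal illustration; the function names `indicators` and `safe_div`, the convention of returning 0 when a denominator is zero, and the example counts are assumptions made here, not part of the original method.

```python
import math


def safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero (an assumed convention)."""
    return num / den if den else 0.0


def indicators(tp, fp, tn, fn):
    """Compute the six indicators of Eqs. (41)-(46) from confusion-matrix counts."""
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "SEN": safe_div(tp, tp + fn),                    # Eq. (41)
        "SPC": safe_div(tn, tn + fp),                    # Eq. (42)
        "PRC": safe_div(tp, tp + fp),                    # Eq. (43)
        "ACC": safe_div(tp + tn, tp + tn + fp + fn),     # Eq. (44)
        "F1": safe_div(2 * tp, 2 * tp + fp + fn),        # Eq. (45)
        "MCC": safe_div(tp * tn - fp * fn, mcc_den),     # Eq. (46)
    }


# Hypothetical counts: 100 COVID-19 images and 100 healthy controls.
result = indicators(tp=90, fp=10, tn=90, fn=10)
print(result)
```

With these hypothetical counts, the first five indicators equal 0.9 while the MCC equals 0.8, which shows that the MCC is not simply a rescaled accuracy.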

The MCC has been used in machine learning as an indicator for binary classification since 1975. It is a correlation coefficient between observation and prediction, and its value lies between −1 and 1, i.e., \(- 1 \le {\text{MCC}}\le 1\). The meanings of the values 1, 0, and −1 are

$$\mathrm{MCC}=\left\{\begin{array}{cc}1& \text{perfect classification}\\ 0& \text{random prediction}\\ -1& \text{totally wrong classification}\end{array}\right.$$
(47)
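
The three cases of Eq. (47) can be checked numerically with small hypothetical confusion matrices. The sketch below is illustrative only; the counts are invented for the example, and returning 0 for a zero denominator is an assumed convention.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient, Eq. (46)."""
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den if den else 0.0


# Hypothetical balanced set with 50 positives and 50 negatives.
perfect = mcc(tp=50, fp=0, tn=50, fn=0)     # every image classified correctly
chance = mcc(tp=25, fp=25, tn=25, fn=25)    # predictions unrelated to the labels
inverted = mcc(tp=0, fp=50, tn=0, fn=50)    # every image classified wrongly

print(perfect, chance, inverted)
```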

The mean and standard deviation (SD) of the six indicators over the ten runs on the test set were analyzed and used for comparison. Two important global hyperparameters were set as follows: (i) the number of training epochs was set to 10, since the whole training procedure of transfer learning should be swift; (ii) the global learning rate (GLR) was set to a small value of \(10^{-4}\) to slow down learning, because transfer learning is carried out on a pre-trained DenseNet model.
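
The "mean ± SD" summary over the runs can be sketched as follows. This is a minimal illustration: the per-run values are hypothetical, the helper name `summarize` is introduced here, and the use of the sample SD (divisor \(n-1\)) is an assumption, since the text does not state which SD estimator was used.

```python
import statistics


def summarize(run_values):
    """Return (mean, sample SD) of one indicator over several runs."""
    mean = statistics.mean(run_values)
    sd = statistics.stdev(run_values)  # sample SD with divisor n - 1 (assumed)
    return mean, sd


# Hypothetical accuracies (in %) from ten runs; not values from the paper.
acc_runs = [96.0, 97.0, 95.0, 96.0, 98.0, 96.0, 97.0, 95.0, 96.0, 94.0]

mean, sd = summarize(acc_runs)
print(f"{mean:.2f} ± {sd:.2f}")
```

If the population SD were intended instead, `statistics.pstdev` would be the corresponding call.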

Results and Discussions

Optimization of Transfer Learning Setting

For the three DenseNet variants and four CLF settings (CLFSs), the comparison results are listed in Table 7. The best two are DenseNet-201-CLFS-III and DenseNet-201-CLFS-IV. The former obtains a sensitivity of 96.41 ± 1.86%, a specificity of 96.88 ± 1.21%, a precision of 96.90 ± 1.18%, an accuracy of 96.64 ± 1.21%, an F1 score of 96.63 ± 1.24%, and an MCC of 93.32 ± 2.42%. The latter produces a sensitivity of 96.88 ± 1.85%, a specificity of 96.72 ± 1.47%, a precision of 96.76 ± 1.39%, an accuracy of 96.80 ± 0.82%, an F1 score of 96.79 ± 0.84%, and an MCC of 93.65 ± 1.60%. Considering that CLFS IV uses less storage, we chose “201-IV” as our best model.

Table 7 Validation performance based on the best configuration (unit: %)

Figure 10 shows the error bar plot of \({\mathcal{P}}_{\mathrm{val}}\), which indicates that DenseNet-201 yields better overall performance than DenseNet-121 and DenseNet-169. A possible reason is that DenseNet-201 has the deepest structure of the three and can therefore map more complicated patterns, such as the ground-glass opacity (GGO) lesions in chest CT images of COVID-19 patients.

Fig. 10 Error bar of validation performances

The detailed results of the 10 runs of the best model “201-IV” are shown in Table 8. Each row shows the result of one run, and the last row shows the mean and standard deviation over all 10 runs. The last row shows that the standard deviations of accuracy and F1 are much smaller than those of the other four indicators.

Table 8 Validation performance \({\mathcal{P}}_{\mathrm{val}}\) (unit: %)

Performance on Test Set via “201-IV”

Since we have determined from the validation set that the optimal combination is “201-IV,” we run this proposed model on the test set for 10 new runs, with results of each run shown in Table 9.

Table 9 Test performance \({\mathcal{P}}_{\mathrm{test}}\) based on best model “201-IV” found by proposed DenseNet-OTLS (unit: %)

The comparison between Table 9 and Table 7 is shown in Fig. 11, from which we can see the following:

Fig. 11 Comparison of validation performance and test performance

(i) The performance on the test set \({\mathcal{P}}_{\mathrm{test}}\) is slightly lower than that on the validation set \({\mathcal{P}}_{\mathrm{val}}\) in terms of all six measures. The reason is that the test set is entirely new data to the models, whereas the validation set contributed to the choice of the trained model, so \({\mathcal{P}}_{\mathrm{val}}\) looks better than \({\mathcal{P}}_{\mathrm{test}}\). (ii) The SD on the test set is smaller than that on the validation set. The reason is that the test set is larger than the validation set, as can be seen in Table 3.
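
As a rough sketch of why a larger set tends to give a smaller SD, assume (hypothetically) that each of the \(n\) images in a set is classified correctly and independently with probability \(p\); the symbols \(n\) and \(p\) are introduced here only for illustration. The accuracy measured on that set then has the binomial standard deviation

$$\mathrm{SD}\left(\mathrm{ACC}\right)=\sqrt{\frac{p\left(1-p\right)}{n}}$$

which shrinks as \(n\) grows. This ignores the run-to-run variation caused by random training, so it only illustrates the sampling part of the observed SD.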

Comparison to State-of-the-Art Approaches

We compare our method “DenseNet-OTLS” with six other COVID-19 classification approaches: RBFNN [2], K-ELM [3], ELM-BA [4], 6L-CNN [5], GoogLeNet [6], and ResNet-18 [7]. The comparison results over 10 runs on the test set are shown in Table 10, where it is clear that the proposed approach performs better than all six state-of-the-art methods.

Table 10 Comparison with COVID-19 identification approaches (Unit: %)

In all, the comparison results of the seven methods are shown in Fig. 12, which indicates that the proposed DenseNet-OTLS approach achieves the highest performance among all the compared approaches.

Fig. 12 Comparison of our method with six state-of-the-art approaches

Composite Learning Rate Versus Simple Learning Rate

We compare the composite learning factor (CLF) strategy with the traditional simple learning factor (SLF) strategy, as described in “Improvement IV: Optimization of TL Setting.” Under SLF, the frozen layers are given a learning factor of 0 and all remaining layers a learning factor of 1. The comparison of CLF with SLF is shown in Table 11 and Fig. 13.

Table 11 Composite learning factor versus simple learning factor (Unit: %)

Fig. 13 Error bar plot of CLF versus SLF

The results indicate that CLF yields better performance than SLF. The reason is that CLF divides all the layers into three types: (i) frozen layers (FLs), which inherit both structure and weights from the pretrained models (PTMs); (ii) middle layers (MLs), which inherit the network structure and use the pretrained weights as initial values; and (iii) new layers (NLs), which have no relation to the PTMs.
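
The difference between the two strategies can be sketched as a per-layer learning rate. The sketch below assumes that the effective learning rate of a layer is the global learning rate multiplied by that layer's learning factor; the five-layer toy network, the function name `effective_learning_rates`, and the CLF factor values are hypothetical and chosen only for illustration. The SLF factors (0 for frozen layers, 1 for the rest) follow the description above.

```python
GLR = 1e-4  # global learning rate used in the experiments


def effective_learning_rates(layer_types, factors, glr=GLR):
    """Per-layer learning rate, assumed to be GLR times the layer's factor."""
    return [glr * factors[kind] for kind in layer_types]


# Toy network: two frozen, two middle, one new layer (illustrative only).
layer_types = ["FL", "FL", "ML", "ML", "NL"]

# SLF: frozen layers get 0, all remaining layers get 1.
slf_factors = {"FL": 0.0, "ML": 1.0, "NL": 1.0}

# CLF: three distinct factors; these particular values are hypothetical.
clf_factors = {"FL": 0.0, "ML": 1.0, "NL": 10.0}

slf_rates = effective_learning_rates(layer_types, slf_factors)
clf_rates = effective_learning_rates(layer_types, clf_factors)
print(slf_rates)
print(clf_rates)
```

Under this assumption, CLF lets the new layers learn faster than the middle layers, whereas SLF treats every non-frozen layer alike.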

Effect of Preprocessing

We verify the effectiveness of preprocessing in this experiment. Recall that our preprocessing consists of three steps: (i) HS, (ii) crop, and (iii) resize. Suppose we skip this three-step preprocessing and carry out only the resizing, which is required by transfer learning. The comparison is carried out over the test set, and the results are itemized in Table 12.
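
The three-step pipeline can be sketched on a small grayscale image stored as nested lists. This is a hedged illustration, not the implementation used in the experiments: it assumes HS denotes a linear histogram stretching to the range 0 to 255, and the crop margins, the nearest-neighbor resizing, the output size, and all function names are hypothetical choices.

```python
def histogram_stretch(img, lo=0, hi=255):
    """Linearly map the image's min..max gray levels onto lo..hi (assumed HS)."""
    flat = [v for row in img for v in row]
    vmin, vmax = min(flat), max(flat)
    if vmax == vmin:  # flat image: nothing to stretch
        return [[lo for _ in row] for row in img]
    scale = (hi - lo) / (vmax - vmin)
    return [[round(lo + (v - vmin) * scale) for v in row] for row in img]


def crop(img, top, bottom, left, right):
    """Remove the given number of pixels from each side."""
    h, w = len(img), len(img[0])
    return [row[left:w - right] for row in img[top:h - bottom]]


def resize_nearest(img, out_h, out_w):
    """Nearest-neighbor resize (a stand-in for the actual interpolation)."""
    h, w = len(img), len(img[0])
    return [[img[r * h // out_h][c * w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def preprocess(img, margins, size):
    """Apply HS, then crop, then resize, in that order."""
    return resize_nearest(crop(histogram_stretch(img), *margins), *size)


# Toy 4x4 image with a dark border around a brighter 2x2 region.
image = [
    [10, 10, 10, 10],
    [10, 20, 30, 10],
    [10, 40, 50, 10],
    [10, 10, 10, 10],
]
out = preprocess(image, margins=(1, 1, 1, 1), size=(4, 4))
print(out)
```

The resize-only baseline of this experiment corresponds to calling `resize_nearest` directly on the raw image.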

Table 12 Effectiveness of proposed three-step preprocessing

Table 12 indicates that without proper preprocessing (HS and crop), the performance of the system decreases to a sensitivity of 93.33 ± 2.21%, a specificity of 91.77 ± 2.75%, and an accuracy of 92.55 ± 2.07%. This comparison indicates the effectiveness of the proposed three-step preprocessing.

Conclusion

This paper proposed a novel COVID-19 diagnosis method based on DenseNet and an optimization of transfer learning setting (OTLS) framework. The OTLS includes optimization of the composite learning factor setting (CLFS) and optimization of the DenseNet structure. The experiments showed that our method DenseNet-OTLS is superior to six state-of-the-art approaches.

The shortcomings of this research are twofold: (i) We did not validate the optimal combination of DA techniques; we shall quantify the effect of each DA technique. (ii) We did not validate the optimal values of \(\left({\mathrm{LF}}_{\mathrm{Frozen}},{\mathrm{LF}}_{\mathrm{Middle}},{\mathrm{LF}}_{\mathrm{New}}\right)\); we will try to develop an automatic learning factor optimization method.

Furthermore, we shall try to further improve the diagnosis performance for COVID-19. One solution is to combine different transfer learning settings, creating an ensemble DenseNet deep neural network. Another research direction is to output the localization of COVID-19 lesions, which can assist chest radiologists in making more accurate diagnoses (see Table 13).

Highlights

  • DenseNet was introduced as the backbone pre-trained model, and we adapted it to this COVID-19 diagnosis task.

  • Composite learning factor strategy was used for training DenseNet.

  • Data augmentation was used to enhance the training set.

  • An optimization of transfer learning setting (OTLS) was proposed to search for the optimal transfer learning setting.

  • Precomputation was introduced to save memory.

  • We compared the proposed DenseNet-OTLS method with state-of-the-art COVID-19 diagnosis approaches.