1 Introduction

White matter hyperintensities (WMH), characterized by bilateral, mostly symmetrical lesions, are commonly seen on FLAIR magnetic resonance imaging (MRI) of clinically healthy elderly people; they have also been repeatedly associated with various neurological and geriatric disorders such as mood problems and cognitive decline [1]. The detection of such lesions on MRI has become a crucial criterion for diagnosis and for predicting prognosis in the early stages of disease.

Different from brain tumor segmentation [2] in MR images, where most abnormal regions are large and spatially continuous, in the task of WMH segmentation both large and small lesions with high discontinuity are commonly found, as shown in Fig. 1. Generally, a small abnormal region contains relatively little contextual information due to its poor spatial continuity. Furthermore, the feature representations of small lesions tend to be trivial when image features are extracted in a global manner. One solution to this issue is to use an ensemble or aggregation model [3] to learn different attributes, i.e., multiple levels of feature representation, from the training data.

Fig. 1. From left to right: axial slices 34 to 37 of one case from the MICCAI WMH Challenge public training set, showing the high discontinuity of white matter hyperintensities. The red pixels indicate the WMH annotated by a neuroradiologist. (Color figure online)

Although there exist various computer-aided diagnostic systems for automatic segmentation of white matter hyperintensities [4, 5], the reported results are largely incomparable due to different datasets and evaluation protocols. The MICCAI WMH Segmentation Challenge 2017 (http://wmh.isi.uu.nl) was the first competition held to compare state-of-the-art algorithms on this task. The winning method [6] of the challenge employed a modified U-Net [7] architecture and model ensembling (U-Net ensembles in short): three U-Net models with the same architecture were trained on shuffled data with different weight initializations.

In traditional fully convolutional networks [8], each convolutional layer is followed by a max-pooling operation that discards spatial information. In the task of WMH segmentation, this sub-sampling can be devastating, because small-volume hyperintensities of fewer than 10 voxels are commonly found. Instead of using a single convolutional layer before each sub-sampling layer, we hypothesize that a stack of multiple convolutional layers can extract rich local information, and that this becomes more effective when the learned feature maps are propagated to the high-resolution deconvolutional layers through skip connections, similar to the U-Net approach [7].

In this paper, we present a stacked architecture of fully convolutional networks, called Stack-Net, which aims to preserve the local spatial information of small lesions and propagate it to the deconvolutional layers. We further aggregate two Stack-Nets with different receptive fields to learn multi-scale spatial information from both large and small abnormal regions. Our method outperforms the state of the art in lesion recall by 4% on the MICCAI Challenge dataset of 60 cases. In addition, we test our pre-trained models on a private Multiple Sclerosis (MS) lesion dataset with 30 subjects; the results further demonstrate the effectiveness of the aggregation idea.

2 Method

2.1 Convolutional Stack and Multi-scale Convolutions

Formulation. Let \(f(\cdot)\) denote the nonlinear activation function. The \(k^{th}\) output feature map of the \(l^{th}\) layer, \(Y_{lk}\), is computed as \(Y_{lk} = f(W_{lk}^{r} * x)\), where \(x\) denotes the input image, \(W_{lk}^{r}\) the convolutional kernel of fixed size \(r\times r\) associated with the \(k^{th}\) feature map, and \(*\) the 2D convolution operator, which computes the inner product of the filter with each local patch of the input image. We now generalize this convolution structure from a single layer to a stack. Let a convolutional stack \(S\) contain \(L\) convolutional layers; the \(k^{th}\) output feature map \(Y_{Sk}\) of \(S\) is then computed as

$$\begin{aligned} Y_{Sk} = f(W_{Lk}^{r}\,*\,f(W_{(L-1)k}^{r}\,*\,\cdots f(W_{1k}^{r}\,*\,x)\cdots )) \end{aligned}$$
(1)

Obviously, using multiple connected convolutional layers in place of a single layer increases computational complexity. However, the local spatial information of small lesions can be largely lost after the first pooling layer (with a 2 \(\times \) 2 kernel or larger). We therefore replaced only the first two convolutional layers before the sub-sampling layers with convolutional stacks, as shown in Fig. 2.
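To make Eq. 1 concrete, the following is a minimal sketch of such a convolutional stack. The paper does not specify an implementation framework, so PyTorch is assumed here, and the channel count is illustrative.

```python
import torch
import torch.nn as nn

class ConvStack(nn.Module):
    """A stack of L same-resolution convolutions (Eq. 1), each followed by ReLU."""
    def __init__(self, in_channels, out_channels, kernel_size=3, depth=5):
        super().__init__()
        layers = []
        for i in range(depth):
            layers.append(nn.Conv2d(in_channels if i == 0 else out_channels,
                                    out_channels, kernel_size,
                                    padding=kernel_size // 2))  # 'same' padding
            layers.append(nn.ReLU(inplace=True))
        self.stack = nn.Sequential(*layers)

    def forward(self, x):
        return self.stack(x)

# The stack sits where a single convolutional layer would normally precede
# each of the first two max-pooling operations.
stack = ConvStack(in_channels=2, out_channels=64, kernel_size=3, depth=5)
features = stack(torch.randn(1, 2, 200, 200))  # FLAIR + T1 axial slice
```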

We further employed multi-scale convolutional kernels to learn different contextual information from both large and small abnormal regions. Specifically, we aggregate the two proposed Stack-Nets with different receptive fields: the Stack-Net with a small receptive field, i.e., 3\(\times \)3 kernels, is expected to learn the local spatial information of small-volume hyperintensities, while the Stack-Net with a large kernel is designed to learn the spatial continuity of large abnormal regions. The two models were trained and optimized independently and were aggregated with a voting strategy at test time. Let \(P_{i}\) be the 3D segmentation probability map predicted by a single model \(M_{i}\). The final segmentation probability map of the aggregation of n models is then \(P_{aggr} = \frac{1}{n}\sum _{i}P_{i}\). The threshold for generating the binary mask is set to 0.4.
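A minimal sketch of this voting strategy, assuming each model's output is already a 3D probability array:

```python
import numpy as np

def aggregate(prob_maps, threshold=0.4):
    """Average the per-model 3D probability maps (P_aggr = (1/n) * sum P_i)
    and binarize at the threshold used in the paper."""
    p_aggr = np.mean(np.stack(prob_maps, axis=0), axis=0)
    return (p_aggr >= threshold).astype(np.uint8)

# e.g. binary_mask = aggregate([p_stacknet_3x3, p_stacknet_5x5])
```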

Fig. 2. Overview of the multi-scale convolutional-stack aggregation model. We replaced the traditional single convolutional layer with a convolutional stack to extract and preserve the local information of small lesions. The depth of the convolutional stack is flexible and was set to 5 in our experiments. Two convolutional kernels, 3 \(\times \) 3 and 5 \(\times \) 5, were used in the two Stack-Nets to learn multi-scale context information. The detailed parameter settings and architecture are presented in Fig. 3.

Fig. 3. Detailed parameter settings of the deep networks. The number of stacked layers shown is 2.

Architecture. As shown in Fig. 2, we built two Stack-Nets with different convolutional kernels, which take as input the 2D axial slices of two modalities from the brain MR scans during both training and testing. Different from the winning U-Net ensembles [6] in the MICCAI challenge, we replaced the first two convolutional layers with convolutional stacks, using \(3\times 3\) and \(5\times 5\) kernel sizes in the two networks respectively. Each convolutional stack is followed by a rectified linear unit (ReLU) and a 2\(\times \)2 max-pooling operation with stride 2 for downsampling. The depth of each convolutional stack was set to L = 5. In total, the network contains 24 convolutional and deconvolutional layers.
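The sketch below assembles one Stack-Net branch, reusing the ConvStack module from the earlier sketch. The skip-connection layout follows the U-Net pattern described above; the channel widths and decoder configuration are illustrative assumptions (the exact settings are given in Fig. 3).

```python
import torch
import torch.nn as nn

class StackNet(nn.Module):
    """One Stack-Net branch: ConvStacks (from the earlier sketch) in the
    encoder, U-Net-style skip connections to the decoder."""
    def __init__(self, kernel_size=3, depth=5):
        super().__init__()
        self.enc1 = ConvStack(2, 64, kernel_size, depth)    # stack 1
        self.enc2 = ConvStack(64, 128, kernel_size, depth)  # stack 2
        self.pool = nn.MaxPool2d(2, stride=2)
        self.mid  = ConvStack(128, 256, kernel_size, 2)
        self.up2  = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = ConvStack(256, 128, kernel_size, 2)
        self.up1  = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = ConvStack(128, 64, kernel_size, 2)
        self.head = nn.Conv2d(64, 1, 1)  # per-pixel WMH probability

    def forward(self, x):
        s1 = self.enc1(x)              # preserved local detail, skip 1
        s2 = self.enc2(self.pool(s1))  # skip 2
        m  = self.mid(self.pool(s2))
        d2 = self.dec2(torch.cat([self.up2(m), s2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), s1], dim=1))
        return torch.sigmoid(self.head(d1))

# The two aggregated branches differ only in kernel size:
# StackNet(kernel_size=3) and StackNet(kernel_size=5)
```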

Training. For data preprocessing, each slice and its corresponding segmentation mask were cropped or padded to \(200 \times 200\) pixels to guarantee a uniform input size for the model. We then obtained the brain mask by simple thresholding and mask filling. Gaussian normalization was applied to each subject to rescale the intensities. The Dice loss [9] was employed during training. Data augmentation including rotation, shearing and zooming was used during batch training. The number of epochs was set to 50 by contrasting training and validation loss over epochs. The batch size was set to 30 and the learning rate to 0.0002 throughout all experiments.
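For reference, a common formulation of the Dice loss [9] on soft predictions is sketched below; the paper does not specify the exact variant, so this is an assumption.

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|P∩G| / (|P| + |G|), averaged over the batch."""
    pred = pred.reshape(pred.size(0), -1)
    target = target.reshape(target.size(0), -1)
    inter = (pred * target).sum(dim=1)
    dice = (2.0 * inter + eps) / (pred.sum(dim=1) + target.sum(dim=1) + eps)
    return 1.0 - dice.mean()
```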

3 Materials

3.1 Datasets and Experimental Setting

Two clinical datasets were employed in our experiments: the public MICCAI WMH dataset with 60 cases from 3 centers, and a private MS lesion dataset with 30 cases collected from a hospital in Munich. For each dataset, the FLAIR and T1 modalities of each subject were co-registered. Properties of the data are summarised in Table 1. In the experiments reported in Sects. 4.1 and 4.2, a five-fold cross-validation setting was used. Specifically, subject IDs were used to split the public training dataset into training and validation sets. In each split, slices from 16 subjects from each center were pooled into the training set, and slices from the remaining 4 subjects from each center were used for testing. This procedure was repeated until every subject had been used in the testing phase. The Dice score, lesion recall and lesion F1-score were then averaged over all testing subjects.
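A sketch of this subject-level split is shown below, assuming 20 training subjects per center (consistent with the 16/4 split described above); scikit-learn's KFold and the generic center names are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import KFold

subjects_per_center = {c: np.arange(20) for c in ("center1", "center2", "center3")}
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Split subject IDs (not slices): 16 subjects per center train, 4 test.
for fold, (tr, te) in enumerate(kf.split(subjects_per_center["center1"])):
    train_ids = {c: ids[tr] for c, ids in subjects_per_center.items()}
    test_ids = {c: ids[te] for c, ids in subjects_per_center.items()}
    # Slices are pooled only after this subject-level split, so no subject
    # contributes slices to both the training and the validation set.
```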

Table 1. Detailed information on the MICCAI WMH Challenge dataset from three centers and the private Multiple Sclerosis dataset from a hospital in Munich.

3.2 Evaluation Metrics

Three evaluation metrics from the MICCAI WMH Challenge were used to evaluate the segmentation performance of the algorithm in different aspects. Given a ground-truth segmentation map G and a segmentation map P generated by an algorithm, they are defined as follows. Dice score: DSC = 2\(|G\cap P|\)/(\(|G|+|P|\)), which measures the overlapping volume of G and P. Recall for individual lesions: let \(N_{G}\) be the number of individual lesions delineated in G, and \(N_{P}\) the number of correctly detected lesions after comparing P with G, where each individual lesion is defined as a 3D connected component; then Recall = \({N_P}/{N_G}\). F1-score for individual lesions: let \(N_{P}\) be the number of correctly detected lesions and \(N_{F}\) the number of wrongly detected lesions in P; then F1 = \({N_P}\)/(\(N_{P}+N_{F}\)).
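The lesion-level metrics can be computed with 3D connected-component labelling. The sketch below follows the definitions above, under the assumption that a lesion counts as correctly detected when it overlaps the other mask by at least one voxel; the challenge's exact matching rule may differ.

```python
import numpy as np
from scipy import ndimage

def lesion_metrics(G, P):
    """Dice score, lesion recall and lesion F1 for binary 3D masks G, P."""
    dsc = 2.0 * np.logical_and(G, P).sum() / (G.sum() + P.sum())

    g_lab, n_g = ndimage.label(G)  # individual lesions = 3D connected components
    p_lab, n_p = ndimage.label(P)

    # N_P: ground-truth lesions overlapped by at least one predicted voxel.
    detected = sum(1 for i in range(1, n_g + 1) if P[g_lab == i].any())
    recall = detected / n_g

    # F1 = N_P / (N_P + N_F): predicted components overlapping no true lesion
    # are the wrong detections N_F (this matches the formula in the text).
    true_det = sum(1 for j in range(1, n_p + 1) if G[p_lab == j].any())
    f1 = true_det / n_p
    return dsc, recall, f1
```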

4 Results

4.1 Comparison with the State-of-the-Art

We conducted experiments on the public MICCAI WMH Challenge dataset (3 subsets, 60 subjects) in a 5-fold cross-validation setting and compared the segmentation performance of the proposed Stack-Net and aggregation model with the winning method of the MICCAI WMH Challenge 2017. The Stack-Net with 3 \(\times \) 3 kernels slightly outperforms the U-Net ensembles on Dice score and lesion F1-score and achieves a comparable lesion recall. The aggregation model outperforms the U-Net ensembles by 4% on lesion recall, suggesting that the Stack-Net is capable of learning attributes of small-volume lesions. We conducted a paired Z-test over the 60 pairs, where each pair consists of the lesion recall values obtained on one validation scan by the proposed aggregation model and by the U-Net ensembles. The small p-value (p < 0.01) indicates that the improvement is statistically significant. Figure 4 shows a segmentation case in which our aggregation model is more effective in detecting small lesions. Furthermore, the proposed method claimed first place (team name: \(sysu\_media\_2\)) on the hidden set after independent evaluation by the challenge organizers; see http://wmh.isi.uu.nl/results/ for details.
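The paper does not spell out the test statistic; a standard paired Z-test on the per-scan differences would look like the following sketch.

```python
import numpy as np
from scipy import stats

def paired_z_test(metric_a, metric_b):
    """Paired Z-test over per-scan metric values (here, 60 pairs)."""
    d = np.asarray(metric_a) - np.asarray(metric_b)
    z = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
    p = 2.0 * stats.norm.sf(abs(z))  # two-sided; whether the paper used a
                                     # one- or two-sided test is not stated
    return z, p
```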

Fig. 4. Segmentation result of a testing case from Utrecht by the U-Net ensembles and by our model, respectively. The green area is the overlap between the segmentation result and the ground truth. The red pixels are the false negatives, and the black ones are the false positives. (Color figure online)

Fig. 5. (a) Distribution of small, medium and large lesions detected by the U-Net ensembles and by each component of our aggregation model; (b) overall Dice, lesion recall and lesion F1-score achieved by six Stack-Nets of different depths.

To better understand how each part of the proposed model works on volume-varied lesions, we grouped all lesions into three types, small, medium and large, by defining a volume range for each type: \(S_{small} = \{s\mid volume(s)<10\}\), \(S_{medium} = \{s\mid 10<volume(s)<20\}\) and \(S_{large} = \{s\mid volume(s)>20\}\). The number of detected lesions of each type was then calculated by comparing the predicted and ground-truth segmentation masks over all test subjects. Figure 5(a) shows the distribution of detected lesions with small, medium and large volumes. Our aggregation model detected 2008 small lesions while the U-Net ensembles detected 1851, an 8% improvement. We conducted a paired Z-test over the 60 pairs, where each pair consists of the small-lesion recall values obtained on one validation scan by the proposed aggregation model and by the U-Net ensembles; the small p-value (p < 0.01) indicates that the improvement is statistically significant. We also observed that the aggregation model achieved a comparable Dice score, which measures overlapping volume, demonstrating that it is effective on both large and small lesions. The Stack-Net with 5 \(\times \) 5 kernels slightly outperformed the U-Net ensembles in the detection of large lesions, demonstrating that large convolutional kernels are effective in learning contextual information from large abnormal regions with spatial continuity (Table 2).
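A sketch of this grouping with 3D connected components is given below. Lesions whose volume is exactly 10 or 20 voxels fall outside the strict inequalities above, so assigning them to the medium bin here is our assumption.

```python
import numpy as np
from scipy import ndimage

def group_lesions_by_size(mask):
    """Count small (<10), medium and large (>20 voxel) 3D connected components."""
    labels, n = ndimage.label(mask)
    volumes = ndimage.sum(mask, labels, index=np.arange(1, n + 1))
    small = int((volumes < 10).sum())
    large = int((volumes > 20).sum())
    medium = n - small - large  # boundary volumes (10, 20) assigned here
    return small, medium, large
```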

Table 2. Comparison with the winning method of the MICCAI WMH Challenge 2017. Values in bold indicate results outperforming the state of the art.
Table 3. Segmentation performance on the MS lesion dataset. Figures in bold indicate the best performance.

4.2 Analysis on the Stack-Net

To investigate the effect of depth in the Stack-Net, we evaluated six models with 3 \(\times \) 3 kernels and depths ranging from 1 to 6 on the MICCAI WMH dataset using 5-fold cross-validation. Using the same grouping criteria and calculation strategy as above, we calculated the average Dice, lesion recall and lesion F1-score over all test subjects after the 5 splits. As Fig. 5(b) shows, a thin convolutional stack, i.e., one or two convolutional layers, yields relatively poor segmentation performance on all three evaluation metrics. This is because spatial information is reduced drastically after the sub-sampling layer: a thin stack cannot preserve rich local information, and it is this reduced spatial information that gets propagated to the deconvolutional layers.

4.3 Cross-Center Evaluation on the MS Lesion Dataset

To further evaluate the idea of multi-scale spatial aggregation in a cross-center setting, we trained the models on the MICCAI WMH dataset and tested them on the MS lesion dataset from a hospital in Munich. MS lesions have a very similar appearance to WM lesions, but most of them are medium or large. Table 3 reports a comparison of the segmentation performance of the individual networks and the aggregation model. The aggregation model achieved a significantly better lesion F1-score than the individual networks, suggesting that combining multi-scale spatial information helps remove false positives. Interestingly, lesion recall did not improve after aggregating the individual Stack-Nets. This is because most MS lesions are of medium or large size, so the convolutional stack offers only limited improvement in lesion recall. It further suggests that aggregating models with multi-scale receptive fields is effective in learning multi-scale spatial information.

5 Conclusions

In this paper, we explored an architecture specifically designed for small-lesion segmentation, to learn the attributes of small regions. We found that the convolutional stack is effective in preserving the local information of small lesions and that this rich information is propagated to the high-resolution deconvolutional layers. By aggregating multi-scale Stack-Nets with different receptive fields, our method outperformed the state of the art on the MICCAI WMH Challenge dataset. We further showed that the multi-scale context aggregation model is effective for MS lesion segmentation in a cross-center evaluation.