
1 Introduction

Image segmentation is an essential and challenging task in medical image analysis. Its goal is to delineate object boundaries by assigning each pixel/voxel a label, where pixels/voxels with the same label share similar properties or belong to the same class. In the context of neuroimaging, robust and accurate image segmentation can effectively help neurosurgeons and clinicians, for example, to measure the size of brain lesions or to quantitatively evaluate volume changes of brain tissue throughout treatment or surgery. For instance, quantitative measurements of subcortical and cortical structures are critical for studies of several neurodegenerative diseases such as Alzheimer’s, Parkinson’s, and Huntington’s diseases. Automatic segmentation of multiple sclerosis (MS) lesions is essential for the quantitative analysis of disease progression. The delineation of acute ischemic stroke lesions is crucial for increasing the likelihood of good clinical outcomes for the patient. While manual delineation of object boundaries is a tedious and time-consuming task, automatic segmentation algorithms can significantly reduce the workload of clinicians and increase the objectivity and reproducibility of measurements. To be specific, the segmentation task in medical images usually refers to semantic segmentation: for paired brain structures (e.g., the left and right subcortical structures), instances of the same category are not distinguished, in contrast to instance and panoptic segmentation.

There are many neuroimaging modalities such as magnetic resonance imaging, computed tomography, transcranial Doppler, and positron emission tomography. Moreover, neuroimaging studies often contain multimodal and/or longitudinal data, which can help improve our understanding of the anatomical and functional properties of the brain by utilizing complementary physical and physiological sensitivities. In this chapter, we first present some background information to help readers get familiar with the fundamental elements used in deep learning-based segmentation frameworks. Next, we discuss the learning-based segmentation approaches in the context of different supervision settings, along with some real-world applications.

2 Methods

2.1 Fundamentals

2.1.1 Common Network Architectures for Segmentation Tasks

Convolutional neural networks (CNNs) have dominated the medical image segmentation field in recent years. CNNs leverage information from images to predict segmentations by hierarchically learning parameters with linear and nonlinear layers. We begin by discussing some popular models and their architectures: (1) U-Net [1], (2) V-Net [2], (3) attention U-Net [7], and (4) nnU-Net [6]. The U-Net [1] consists of a contracting (encoder) path that captures context through successive convolutions and downsampling, and an expanding (decoder) path that recovers the original resolution through upsampling, with skip connections copying feature maps from the encoder to the decoder (Fig. 1).

Fig. 1

U-Net architecture. Blue boxes are the feature maps. Channel numbers are denoted above each box, while the tensor sizes are denoted on the lower left. White boxes show the concatenations and arrows indicate various operations. Ⓒ2015 Springer Nature. Reprinted, with permission, from [1]

V-Net is another popular model for volumetric medical image segmentation. It is a fully convolutional neural network trained end-to-end that learns directly from volumetric images. Based upon the overall structure of the U-Net, the V-Net [2] uses residual blocks [8] in place of the regular convolutional blocks and enlarges the convolution kernel size to 5 × 5 × 5. A residual block can be formulated as follows: (1) the input of the block is processed by convolutional layers and nonlinearities, and (2) the input is added to the output of the last convolutional layer or nonlinearity of the block.
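To make the residual mechanism concrete, below is a minimal PyTorch sketch of a 3D residual block in the spirit of the description above; the class name, channel count, and use of PReLU are illustrative assumptions rather than the exact V-Net implementation.

```python
import torch
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    """Minimal 3D residual block: the input is added to the output of a
    short stack of convolutions, following steps (1) and (2) above."""

    def __init__(self, channels: int, kernel_size: int = 5):
        super().__init__()
        padding = kernel_size // 2  # keep the spatial size unchanged
        self.conv1 = nn.Conv3d(channels, channels, kernel_size, padding=padding)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size, padding=padding)
        self.act = nn.PReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.act(self.conv1(x))   # step (1): convolutions and nonlinearities
        out = self.conv2(out)
        return self.act(out + x)        # step (2): residual addition

# Example: a feature map with 16 channels keeps its shape through the block.
y = ResidualBlock3D(16)(torch.randn(1, 16, 32, 32, 32))
```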

Attention U-Net is a model based on U-Net with attention gates (AG) in the skip connections (Fig. 2). The attention gates can learn to focus on the segmentation target. The salient features are emphasized with larger weights from the CNN during the training. This leads the model to achieve higher accuracy on target structures with various shapes and sizes. In addition, AGs are easy to integrate into the existing popular CNN architectures. The details of the attention mechanism and attention gates are discussed in Subheading 2.1.2. More details on attention can also be found in Chap. 6.

Fig. 2

Attention U-Net architecture. Hi, Wi, and Di represent the height, width, and depth of the feature map at the ith layer of the U-Net structure. Fi indicates the number of feature map channels. Replicated from [4] (CC BY 4.0)

nnU-Net is a medical image segmentation pipeline that self-configures its network architecture for the dataset and task it is given, without any manual intervention. Depending on the dataset and task, nnU-Net generates one of (1) a 2D U-Net, (2) a 3D U-Net, or (3) a cascaded 3D U-Net as the segmentation network. For the cascaded 3D U-Net, the first network takes downsampled images as inputs, and the second network uses the images at full resolution as input to refine the segmentation. nnU-Net is often used as a baseline method in medical image segmentation challenges because of its robust performance across various target structures and image properties. The details of nnU-Net can be found in [6].

2.1.2 Attention Modules

Although the U-Net architecture described in Subheading 2.1.1 has achieved remarkable success in medical image segmentation, the downsampling steps included in the encoder path can induce poor segmentation accuracy for small-scale anatomical structures (e.g., tumors and lesions). To tackle this issue, attention modules are often applied so that salient features are enhanced with higher weights while less important features are suppressed. This subsection introduces two types of attention mechanisms: additive attention and multiplicative attention.

Additive Attention

As discussed in the previous section, U-Net is the most popular backbone for medical image analysis tasks. The downsampling enables it to work on features of different scales. Suppose we are working on a 3D segmentation problem. The output of the U-Net encoder at the lth level is then a tensor Xl of size [Fl, Hl, Wl, Dl], where Hl, Wl, Dl denote the height, width, and depth of the feature map, respectively, and Fl represents the length of the feature vectors. We regard the tensor as a set of feature vectors \( {\boldsymbol{x}}_i^l \):

$$ {\mathcal{X}}^l={\left\{{\boldsymbol{x}}_i^l\right\}}_{i=1}^n,\kern1em {\boldsymbol{x}}_i^l\in {\mathbb{R}}^{F_l} $$
(1)

where n = Hl × Wl × Dl. The attention gate assigns a weight αi to each vector xi so that the model can concentrate on salient features. Ideally, important features are assigned higher weights so that they do not vanish during downsampling. The output of the attention gate will be a collection of weighted feature vectors:

$$ {\hat{\mathcal{X}}}^l={\left\{{\alpha}_i^l\cdot {\boldsymbol{x}}_i^l\right\}}_{i=1}^n,\kern1em {\alpha}_i^l\in \mathbb{R} $$
(2)

These weights αi, also known as gating coefficients, are determined by an attention mechanism that delineates the correlation between the feature vector x and a gating signal g. As shown in Fig. 3, for all \( {\boldsymbol{x}}_i^l\in {\mathcal{X}}^l \), we compute an additive attention with regard to a corresponding gi by

$$ {s}_{att}^l={\boldsymbol{\psi}}^{\top}\left[{\sigma}_1\left({\boldsymbol{W}}_x^{\top }{\boldsymbol{x}}_i^l+{\boldsymbol{W}}_g^{\top }{\boldsymbol{g}}_i+{\boldsymbol{b}}_g\right)\right]+{b}_{\psi } $$
(3)

where bg and bψ represent the bias and Wx, Wg, ψ are linear transformations. The output dimension of the linear transformation is \( {\mathbb{R}}^{F_{int}} \) where Fint is a self-defined integer. Denote these learnable parameters by a set Θatt. The coefficients \( {s}_{att}^l \) are normalized to [0, 1] by a sigmoid function σ2:

$$ {\alpha}_i^l={\sigma}_2\left({s}_{att}^l\left({\boldsymbol{x}}_i^l,{\boldsymbol{g}}_i;{\Theta}_{att}\right)\right) $$
(4)

The attention coefficient is thus computed from a combination of the feature vector and the gating signal. In practical applications [3, 4, 9], the gating signal is chosen to be the coarser-scale feature map, as indicated in Fig. 2. In other words, for the input feature \( {\boldsymbol{x}}_i^l \), the corresponding gating signal is defined by

$$ {\boldsymbol{g}}_i={\boldsymbol{x}}_i^{l+1} $$
(5)

Note that an extra downsampling step should be applied to Xl so that it has the same shape as Xl+1. In experiments segmenting brain tumors on MRI datasets [9] and the pancreas on abdominal CT datasets [4], AGs were shown to improve the segmentation performance for diverse types of model backbones, including U-Net and Residual U-Net.

Fig. 3

The structure of the additive attention gate. \( {\boldsymbol{x}}_i^l \) is the ith feature vector at the lth level of the U-Net structure and gi is the corresponding gating signal. Wx and Wg are the linear transformation matrices applied to \( {\boldsymbol{x}}_i^l \) and gi, respectively. The sum of the resultant vectors will be activated by ReLU and then its dot product with a vector ψ is computed. The sigmoid function is used to normalize the resulting scalar to [0, 1] range, which is the gating coefficient αi. The weighted feature vector is denoted by \( \hat{{\boldsymbol{x}}_i^l} \). Adapted from [4] (CC BY 4.0)
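A minimal PyTorch sketch of this additive attention gate (Eqs. 2–4) is given below; the 1 × 1 × 1 convolutions implement Wx, Wg, and ψ, and, for simplicity, the feature map and gating signal are assumed to have already been resampled to the same spatial size (the class and argument names are illustrative).

```python
import torch
import torch.nn as nn

class AdditiveAttentionGate3D(nn.Module):
    """Additive attention gate: computes one coefficient alpha per voxel
    (Eqs. 3-4) and returns the weighted feature map (Eq. 2)."""

    def __init__(self, in_channels: int, gating_channels: int, inter_channels: int):
        super().__init__()
        self.w_x = nn.Conv3d(in_channels, inter_channels, kernel_size=1, bias=False)
        self.w_g = nn.Conv3d(gating_channels, inter_channels, kernel_size=1, bias=True)
        self.psi = nn.Conv3d(inter_channels, 1, kernel_size=1, bias=True)
        self.relu = nn.ReLU(inplace=True)   # sigma_1
        self.sigmoid = nn.Sigmoid()         # sigma_2

    def forward(self, x: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
        s = self.psi(self.relu(self.w_x(x) + self.w_g(g)))  # Eq. 3
        alpha = self.sigmoid(s)                              # Eq. 4
        return alpha * x                                     # weighted features (Eq. 2)

# Example: 64-channel features gated by a 128-channel coarser signal of the same spatial size.
out = AdditiveAttentionGate3D(64, 128, 32)(torch.randn(1, 64, 16, 16, 16),
                                           torch.randn(1, 128, 16, 16, 16))
```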

Multiplicative Attention

Similar to additive attention, the multiplicative mechanism can also be leveraged to compute the importance of feature vectors. The basic idea of multiplicative attention was first introduced in machine translation [11]. Evolving from that, Vaswani et al. proposed a groundbreaking transformer architecture [10] which has been widely implemented in image processing [12, 13]. In recent research, transformers have been incorporated with the U-Net structure [14, 15] to improve medical image segmentation performance.

The attention function is described by matching a query vector q with a set of key vectors {k1, k2, ..., kn} to obtain the weights of the corresponding values {v1, v2, ..., vn}. Figure 4a shows an example for n = 4. Suppose the vectors q, ki, and vi have the same dimension \( {\mathbb{R}}^d \). Then, the attention function is

$$ {s}_i=\frac{{\boldsymbol{q}}^{\top }{\boldsymbol{k}}_i}{\sqrt{d}} $$
(6)

We note that the dot product can have a large magnitude when d is large, which can push the softmax into regions with vanishing gradients; si is therefore scaled by \( \sqrt{d} \) to alleviate this. Equation 6 is a commonly used attention function in transformers. Other options include si = q⊤ki and si = q⊤Wki, where W is a learnable parameter. Generally, the attention value si is determined by the similarity between the query and the key. Similar to the additive attention gate, these attention values are normalized to [0, 1] by a softmax function σ3:

$$ {\alpha}_i={\sigma}_3\left({s}_1,...,{s}_n\right)=\frac{e^{s_i}}{\sum_{j=1}^n{e}^{s_j}} $$
(7)
Fig. 4

(a) The dot-product attention gate. ki are the keys and q is the query vector. si are the outputs of the attention function. By using the softmax σ3, the attention coefficients αi are normalized to [0, 1] range. The output will be the weighted sum of values vi. (b) The multi-head attention is implemented in transformers. The input values, keys, and query are linearly projected to different spaces. Then the dot-product attention is applied on each space. The resultant vectors are concatenated by channel and passed through another linear transformation. Image (b) is adapted from [10]. Permission to reuse was kindly granted by the authors

The output of the attention gate will be \( \hat{\boldsymbol{v}}={\sum}_{i=1}^n{\alpha}_i{\boldsymbol{v}}_i \). In the transformer application, the values, keys, and queries are usually linearly projected into several different spaces, and then the attention gate is applied in each space as illustrated in Fig. 4b. This approach is called multi-head attention; it enables the model to jointly attend to information from different subspaces.

In practice, the value vi is often defined by the same feature vector as the key ki, and the query is derived from the same features, which is why the module is also called multi-head self-attention (MSA). Chen et al. proposed TransUNet [15], which leverages this module in the bottleneck of a U-Net as shown in Fig. 5. They argue that such a combination of a U-Net and the transformer achieves superior performance in multi-organ segmentation tasks.
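The sketch below illustrates scaled dot-product self-attention (Eqs. 6–7) in PyTorch; the linear projections and tensor sizes are illustrative assumptions, and PyTorch's nn.MultiheadAttention provides the multi-head variant of Fig. 4b.

```python
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v):
    """q, k, v of shape (n, d): similarity scores (Eq. 6), softmax weights (Eq. 7),
    then the weighted sum of the values."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # Eq. 6
    alpha = torch.softmax(scores, dim=-1)            # Eq. 7
    return alpha @ v

# Self-attention: queries, keys, and values are projections of the same feature vectors,
# e.g., n = 512 bottleneck feature vectors of length d = 64 from a U-Net encoder.
x = torch.randn(512, 64)
w_q, w_k, w_v = nn.Linear(64, 64), nn.Linear(64, 64), nn.Linear(64, 64)
out = scaled_dot_product_attention(w_q(x), w_k(x), w_v(x))  # shape (512, 64)
```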

Fig. 5

The architecture of TransUNet. The transformer layer represented by the yellow box shows the application of multi-head self-attention (MSA). MLP represents the multilayer perceptron. In general, the feature vectors in the bottleneck of the U-Net are set as the input to the stack of n transformer layers. As these layers do not change the dimension of the features, they are easy to implement and do not affect other parts of the U-Net model. Replicated from [15] (CC BY 4.0)

2.1.3 Loss Functions for Segmentation Tasks

This section summarizes some of the most widely used loss functions for medical image segmentation (Fig. 6) and describes their usage in different scenarios. Complementary reading material with an extensive list of loss functions can be found in [16, 17]. In the following, the predicted probability by the segmentation model and the ground truth at the ith pixel/voxel are denoted as pi and gi, respectively, and N is the number of voxels in the image.

Fig. 6

Loss functions for medical image segmentation. WCE: weighted cross-entropy loss. DPCE: distance map penalized cross-entropy loss. ELL: exponential logarithmic loss. SS: sensitivity-specificity loss. GD: generalized Dice loss. pGD: penalty generalized Dice loss. Asym: asymmetric similarity loss. IoU: intersection over union loss. HD: Hausdorff distance loss. Ⓒ2021 Elsevier. Reprinted, with permission, from [16]

Cross-Entropy Loss

Cross-entropy (CE) is defined as a measure of the difference between two probability distributions for a given random variable or set of events. This loss function is used for pixel-wise classification in segmentation tasks:

$$ {\ell}_{CE}=-\sum \limits_i^N\sum \limits_k^K{y}_i^k\log \left({p}_i^k\right) $$
(8)

where N is the number of voxels, K is the number of classes, \( {y}_i^k \) is a binary indicator that shows whether k is the correct class, and \( {p}_i^k \) is the predicted probability for voxel i to be in kth class.

Weighted Cross-Entropy Loss

Weighted cross-entropy (WCE) loss is a variant of the cross-entropy loss to address the class imbalance issue. Specifically, class-specific coefficients are used to weigh each class differently, as follows:

$$ {\ell}_{WCE}=-\sum \limits_i^N\sum \limits_k^K{w}_{y_k}{y}_i^k\log \left({p}_i^k\right) $$
(9)

Here, \( {w}_{y_k} \) is the coefficient for the kth class. Suppose there are 5 positive samples and 12 negative samples in a binary classification training set. By setting w0 = 1 and w1 = 2, the loss would be as if there were ten positive samples.
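As an illustration of the example above, PyTorch's cross-entropy loss accepts per-class weights directly; the tensor values below mirror the w0 = 1, w1 = 2 choice, and the sample counts are only for illustration.

```python
import torch
import torch.nn as nn

# Weighted cross-entropy (Eq. 9): class 0 (negative) weighted 1, class 1 (positive) weighted 2,
# so the 5 positive samples contribute to the loss as if there were 10 of them.
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0]))

logits = torch.randn(17, 2)  # 17 samples (12 negative + 5 positive), 2 classes
labels = torch.cat([torch.zeros(12), torch.ones(5)]).long()
loss = criterion(logits, labels)
```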

Focal Loss

Focal loss was proposed to apply a modulating term to the CE loss to focus on hard negative samples. It is a dynamically scaled CE loss, where the scaling factor decays to zero as confidence in the correct class increases. Intuitively, this scaling factor can automatically down-weight the contribution of easy examples during training and rapidly focus the model on hard examples:

$$ {\ell}_{Focal}=-\sum \limits_i^N{\alpha}_i{\left(1-{p}_i\right)}^{\gamma}\log \left({p}_i\right) $$
(10)

Here, αi is the weighting factor to address the class imbalance and γ is a tunable focusing parameter (γ > 0).

Dice Loss

The Dice coefficient is a widely used metric in the computer vision community to calculate the similarity between two binary segmentations. In 2016, this metric was adapted as a loss function for 3D medical image segmentation [2]:

$$ {\ell}_{Dice}=1-\frac{2{\sum}_i^N{p}_i{g}_i+1}{\sum_i^N\left({p}_i+{g}_i\right)+1} $$
(11)

Generalized Dice Loss

Generalized Dice loss (GDL) [18] was proposed to reduce the well-known correlation between region size and Dice score:

$$ {\ell}_{GDL}=1-2\frac{\sum_{l=1}^2{w}_l{\sum}_i^N{p}_{li}{g}_{li}}{\sum_{l=1}^2{w}_l{\sum}_i^N\left({p}_{li}+{g}_{li}\right)} $$
(12)

Here \( {w}_l=\frac{1}{{\left({\sum}_i^N{g}_{li}\right)}^2} \) is used to provide invariance to different region sizes, i.e., the contribution of each region is corrected by the inverse of its volume.

Tversky Loss

The Tversky loss [19] is a generalization of the Dice loss by adding two weighting factors α and β to the FP (false positive) and FN (false negative) terms. The Tversky loss is defined as

$$ {\ell}_{Tversky}=1-\frac{\sum_i^N{p}_i{g}_i}{\sum_i^N{p}_i{g}_i+\alpha {\sum}_i^N\left(1-{g}_i\right){p}_i+\beta {\sum}_i^N\left(1-{p}_i\right){g}_i} $$
(13)

Recently, a comprehensive study [16] of loss functions for medical image segmentation tasks showed that using Dice-related compound loss functions, e.g., Dice loss + CE loss, is a good default choice for new segmentation tasks, although no single loss consistently achieves the best performance across multiple segmentation tasks. Therefore, for a new segmentation task, we recommend that readers start with Dice + CE loss, which is also the default loss function in one of the most popular medical image segmentation frameworks, nnU-Net [6].
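As a starting point, a compound Dice + CE loss for binary segmentation can be sketched as follows; the smoothing constant and function name are illustrative, and frameworks such as nnU-Net provide their own implementations.

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits: torch.Tensor, target: torch.Tensor, smooth: float = 1.0) -> torch.Tensor:
    """Compound loss: soft Dice (Eq. 11, with +1 smoothing) plus cross-entropy (Eq. 8)
    for binary segmentation. logits and target have shape (B, 1, H, W, D)."""
    p = torch.sigmoid(logits)
    intersection = (p * target).sum()
    dice = 1.0 - (2.0 * intersection + smooth) / (p.sum() + target.sum() + smooth)
    ce = F.binary_cross_entropy_with_logits(logits, target.float())
    return dice + ce
```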

Finally, note that other loss functions have also been proposed to introduce prior knowledge about size, topology, or shape, for instance [20].

2.1.4 Early Stopping

Given a loss function, a simple strategy is to stop the training process once a predetermined maximum number of iterations is reached. However, too few iterations would lead to an under-fitting problem, while over-fitting may occur with too many iterations. "Early stopping" is a common method to avoid such issues. When using early stopping, the training set is split into training and validation sets, and the stopping condition is based on the performance on the validation set. For example, if the validation performance (e.g., average Dice score) does not increase for a given number of iterations, the early stopping condition is triggered. In this situation, the best model, i.e., the one with the highest performance on the validation set, is saved and used for inference. Of course, one should not report the validation performance as the final evaluation of the model. Instead, one should use a separate test set, kept unseen during training, for an unbiased evaluation.
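A minimal sketch of an early stopping loop is shown below; `train_one_epoch`, `evaluate_dice`, `model`, the data loaders, and the patience value are placeholders for the user's own training code.

```python
import torch

max_epochs = 100                                # placeholder maximum number of epochs
best_score, patience, wait = -float("inf"), 20, 0

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader)        # placeholder: one pass over the training set
    score = evaluate_dice(model, val_loader)    # placeholder: average Dice on the validation set

    if score > best_score:                      # validation performance improved
        best_score, wait = score, 0
        torch.save(model.state_dict(), "best_model.pt")  # keep the best model for inference
    else:
        wait += 1
        if wait >= patience:                    # no improvement for `patience` epochs
            break                               # early stopping triggered
```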

2.1.5 Evaluation Metrics for Segmentation Tasks

Various metrics can quantitatively evaluate different aspects of a segmentation algorithm. In a binary segmentation task, a true positive (TP) indicates that a pixel in the target object is correctly predicted as target. Similarly, a true negative (TN) represents a background pixel that is correctly identified as background. On the other hand, a false positive (FP) and a false negative (FN) refer to a wrong prediction for pixels in the target and background, respectively. Most of the evaluation metrics are based upon the number of pixels in these four categories.

Sensitivity measures the completeness of positive predictions with regard to the positive ground truth (TP +  FN). It thus shows the model’s ability to identify target pixels. It is also referred to as recall or true-positive rate (TPR). It is defined as

$$ \mathrm{Sensitivity}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} $$
(14)

As the negative counterpart of sensitivity, specificity describes the proportion of negative pixels that are correctly predicted. It is also referred to as true-negative rate (TNR). It is defined as

$$ \mathrm{Specificity}=\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}} $$
(15)

Specificity can be difficult to interpret because TN is usually very large. It can even be misleading as TN can be made arbitrarily large by changing the field of view. This is due to the fact that the metric is computed over pixels and not over patients/controls like in classification tasks (the number of controls is fixed). In order to provide meaningful measures of specificity, it is preferable to define a background region that has an anatomical definition (for instance, the brain mask from which the target is subtracted) and does not include the full field of view of the image.

Positive predictive value (PPV), also known as precision, measures the correct rate among pixels that are predicted as positives:

$$ \mathrm{PPV}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} $$
(16)

For clinical interpretation of segmentation, it is often useful to have a more direct estimation of false discoveries (i.e., false positives). To that purpose, one can report the false discovery rate:

$$ \mathrm{FDR}=1-\mathrm{PPV}=\frac{\mathrm{FP}}{\mathrm{TP}+\mathrm{FP}} $$
(17)

which is redundant with PPV but may be more intuitive for clinicians in the context of segmentation.

Dice similarity coefficient (DSC) measures the proportion of spatial overlap between the ground truth (TP+FN) and the predicted positives (TP+FP). Dice similarity is the same as the F1 score, which computes the harmonic mean of sensitivity and PPV:

$$ \mathrm{DSC}=\frac{2\mathrm{TP}}{2\mathrm{TP}+\mathrm{FN}+\mathrm{FP}} $$
(18)

Accuracy is the ratio of correct predictions:

$$ \mathrm{Accuracy}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}} $$
(19)

As was the case for specificity, we note that there are many segmentation tasks where the target anatomical structure is very small (e.g., subcortical structures); hence, the foreground and background have an unbalanced number of pixels. In this case, accuracy can be misleading and display high values for poor segmentations. Moreover, as for specificity, one needs to define a background region so that TN, and thus accuracy, does not vary arbitrarily with the field of view.

The Jaccard index (JI), also known as the intersection over union (IoU), measures the percentage of overlap between the ground truth and positive prediction relative to the union of the two:

$$ \mathrm{JI}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}+\mathrm{FN}} $$
(20)

JI is closely related to the DSC. However, it is always lower than the DSC and tends to penalize poor segmentations more severely.
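The overlap-based metrics above (Eqs. 14–20) can all be computed from the four confusion counts; a minimal NumPy sketch is given below (the function name is illustrative, and the masks should be restricted to an anatomically defined region so that TN remains meaningful, as discussed above).

```python
import numpy as np

def overlap_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Compute overlap-based metrics (Eqs. 14-20) from two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    tn = np.sum(~pred & ~gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "dice": 2 * tp / (2 * tp + fn + fp),
        "jaccard": tp / (tp + fp + fn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }
```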

There are also distance measures of segmentation accuracy which are especially relevant when the accuracy of the boundary is critical. These include the average symmetric surface distance (ASSD) and the Hausdorff distance (HD). Suppose the surface of the ground truth and the predicted segmentation are \( \mathcal{S} \) and \( {\mathcal{S}}^{\prime } \), respectively. For any point \( \boldsymbol{p}\in \mathcal{S} \), the distance from p to surface \( {\mathcal{S}}^{\prime } \) is defined by the minimum Euclidean distance:

$$ d\left(\boldsymbol{p},{\mathcal{S}}^{\prime}\right)=\underset{\boldsymbol{p}^{\prime}\in \mathcal{S}^{\prime }}{\min}\parallel \boldsymbol{p}-{\boldsymbol{p}}^{\prime}\parallel {}_2 $$
(21)

Then the average distance between \( \mathcal{S} \) and \( {\mathcal{S}}^{\prime } \) is given by averaging over \( \mathcal{S} \):

$$ d\left(\mathcal{S},{\mathcal{S}}^{\prime}\right)=\frac{1}{N_S}\sum \limits_{i=1}^{N_S}d\left({\boldsymbol{p}}_i,{\mathcal{S}}^{\prime}\right) $$
(22)

Note that \( d\left(\mathcal{S},{\mathcal{S}}^{\prime}\right)\ne d\left({\mathcal{S}}^{\prime },\mathcal{S}\right) \). Therefore, both directions are included in ASSD so that the mean of the surface distance is symmetric:

$$ \mathrm{ASSD}=\frac{1}{N_S+{N}_{S\prime }}\left[\sum \limits_{i=1}^{N_S}d\left({\boldsymbol{p}}_i,{\mathcal{S}}^{\prime}\right)+\sum \limits_{j=1}^{N_{S\prime }}d\Big({\boldsymbol{p}}_j^{\prime },\mathcal{S}\Big)\right] $$
(23)

The ASSD tends to obscure localized errors when the segmentation is decent at most of the points on the boundary. The Hausdorff distance (HD) can better expose such errors by computing the maximum, rather than the average, distance to a surface. To that purpose, one defines

$$ h\left(\mathcal{S},{\mathcal{S}}^{\prime}\right)=\underset{\boldsymbol{p}\in \mathcal{S}}{\max }d\left(\boldsymbol{p},{\mathcal{S}}^{\prime}\right) $$
(24)

Note that, again, \( h\left(\mathcal{S},{\mathcal{S}}^{\prime}\right)\ne h\left({\mathcal{S}}^{\prime },\mathcal{S}\right) \). Therefore, both directions are included in HD so that the distance is symmetric:

$$ \mathrm{HD}=\max \left(h\left(\mathcal{S},{\mathcal{S}}^{\prime}\right),h\left({\mathcal{S}}^{\prime },\mathcal{S}\right)\right) $$
(25)

HD is more sensitive than ASSD to localized errors. However, it can be too sensitive to outliers. Hence, using the 95th percentile rather than the maximum value for computing \( h\left(\mathcal{S},{\mathcal{S}}^{\prime}\right) \) is a good option to alleviate the problem.
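Surface distances are conveniently computed with a Euclidean distance transform; the sketch below computes the ASSD (Eq. 23) and a 95th-percentile HD from two binary masks, assuming SciPy is available (the function names and voxel-spacing handling are illustrative).

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def surface_distances(a: np.ndarray, b: np.ndarray, spacing=(1.0, 1.0, 1.0)) -> np.ndarray:
    """Distances (Eq. 21) from every surface voxel of mask `a` to the surface of mask `b`."""
    surf_a = a ^ binary_erosion(a)   # boundary voxels of a
    surf_b = b ^ binary_erosion(b)
    dist_to_surf_b = distance_transform_edt(~surf_b, sampling=spacing)
    return dist_to_surf_b[surf_a]

def assd_and_hd95(pred: np.ndarray, gt: np.ndarray, spacing=(1.0, 1.0, 1.0)):
    pred, gt = pred.astype(bool), gt.astype(bool)
    d_pg = surface_distances(pred, gt, spacing)
    d_gp = surface_distances(gt, pred, spacing)
    assd = (d_pg.sum() + d_gp.sum()) / (len(d_pg) + len(d_gp))    # Eq. 23
    hd95 = max(np.percentile(d_pg, 95), np.percentile(d_gp, 95))  # 95th-percentile HD
    return assd, hd95
```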

Moreover, there are some volume-based measurements that focus on correctly estimating the volume of the target structure, which is essential for clinicians since the size of the tissue is an important marker in many diseases. Denote the ground truth volume as V and the predicted volume as V′. There are a few expressions for the volume difference: (1) the unsigned volume difference, |V′− V |; (2) the normalized unsigned difference, \( \frac{\mid {V}^{\prime }-V\mid }{V} \); (3) the normalized signed difference, \( \frac{V^{\prime }-V}{V} \); and (4) Pearson’s correlation coefficient between the ground truth volumes and the predicted volumes, \( \frac{\mathrm{Cov}\left(V,{V}^{\prime}\right)}{\sqrt{\mathrm{Var}(V)}\sqrt{\mathrm{Var}\left({V}^{\prime}\right)}} \). Nevertheless, note that, while they are useful, these volume-based metrics can be misleading when used in isolation (a segmentation could be wrongly placed while still providing a reasonable volume estimate). They thus need to be combined with overlap metrics such as the Dice coefficient.

Finally, some recent guidelines on validation of different image analysis tasks, including segmentation, were published in [21].

2.1.6 Pre-processing for Segmentation Tasks

Image pre-processing is a set of sequential steps taken to improve the data and prepare it for subsequent analysis. Appropriate image pre-processing steps often significantly improve the quality of feature extraction and the downstream image analysis. For deep learning methods, they can also help the training process converge faster and achieve better model performance. The following sections will discuss some of the most widely used image pre-processing techniques.

Skull Stripping

Many neuroimaging applications require preliminary processing to isolate the brain from extracranial or non-brain tissues in MRI scans, commonly referred to as skull stripping. Skull stripping helps reduce the variability in datasets and is a critical step prior to many other image processing algorithms such as registration, segmentation, or cortical surface reconstruction. In the literature, skull stripping methods are broadly classified into five categories: mathematical morphology-based methods [22], intensity-based methods [23], deformable surface-based methods [24], atlas-based methods [25], and hybrid methods [26]. Recently, deep learning-based skull stripping methods have been proposed [27,28,29,30,31,32] to improve accuracy and efficiency. A detailed discussion of the merits and limitations of various skull stripping techniques can be found in [33].

Bias Field Correction

The bias field refers to a low-frequency and very smooth signal that corrupts MR images [34]. These artifacts, often described as shading or bias, can be generated by imperfections in the field coils or by magnetic susceptibility changes at the boundaries between anatomical tissue and air. This bias field can significantly degrade the performance of image processing algorithms that use the image intensity values. Therefore, a pre-processing step is usually required to remove the bias field. The N4 bias field correction algorithm [35] is one of the most widely used methods for this purpose, as it assumes a simple parametric model and does not require tissue classification.

Data Harmonization

Another challenge of MRI data is that it suffers from significant intensity variability due to several factors such as variations in hardware, reconstruction algorithms, and acquisition settings. This is also due to the fact that most MR imaging sequences (e.g., T1-weighted, T2-weighted) are not quantitative (the voxel values can only be interpreted relative to each other). Such differences can often be pronounced in multisite studies, among others. This variability can be problematic because intensity-based models may not generalize well to such heterogeneous datasets, and the resulting analyses can suffer from significant biases caused by acquisition details rather than anatomical differences. It is thus desirable to have robust data harmonization methods to reduce unwanted variability across sites, scanners, and acquisition protocols. One popular MRI harmonization method is ComBat, a statistical approach originally developed to remove batch effects in genomics data. This method was shown to exhibit a good capacity to remove unwanted site biases while preserving the desired biological information [36]. Another popular method is a deep learning-based image-to-image translation model, CycleGAN [37]. CycleGAN and its variants do not require paired data, and thus the training process is unsupervised in the context of data harmonization.

Intensity Normalization

Intensity normalization is another important step to ensure comparability across images. In this section, we discuss common intensity normalization techniques. Readers can refer to [38], which explores the impact of different intensity normalization techniques on MR image synthesis.

Z-Score Normalization

The basic Z-score normalization on the entire image is also called the whole-brain normalization. Given the mean μ and standard deviation σ from all voxels in a brain mask B, Z-score normalization can be performed for all voxels in image I as follows:

$$ {I}_{z- score}(x)=\frac{I(x)-\mu }{\sigma } $$
(26)

While straightforward to implement, whole-brain normalization is known to be sensitive to outliers.
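A minimal NumPy sketch of whole-brain Z-score normalization (Eq. 26) is shown below; the statistics are computed inside a brain mask and then applied to the whole image (the function name is illustrative).

```python
import numpy as np

def zscore_normalize(image: np.ndarray, brain_mask: np.ndarray) -> np.ndarray:
    """Whole-brain Z-score normalization (Eq. 26): mean and standard deviation
    are estimated from the voxels inside the brain mask."""
    voxels = image[brain_mask > 0]
    mu, sigma = voxels.mean(), voxels.std()
    return (image - mu) / sigma
```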

White Stripe Normalization

White stripe normalization [39] is based on the parameters obtained from a sample of normal-appearing white matter (NAWM) and is thus robust to local intensity outliers such as lesions. The NAWM is obtained by smoothing the histogram of the image I and selecting the mode of the distribution. For T1-weighted MRI, the “white stripe” is defined as the 10% of intensity values around the mean of NAWM μ. Let F(x) be the CDF of the specific MR image I(x) inside the brain mask B, and τ = 5%. The white stripe Ωτ is defined as

$$ {\Omega}_{\tau }=\left\{I(x)\ |\ {F}^{-1}\left(F\left(\mu \right)-\tau \right)<I(x)<{F}^{-1}\left(F\left(\mu \right)+\tau \right)\right\} $$
(27)

Then let στ be the sample standard deviation associated with Ωτ. The white stripe normalized image is

$$ {I}_{ws}(x)=\frac{I(x)-\mu }{\sigma_{\tau }} $$
(28)

Compared to whole-brain normalization, white stripe normalization may perform better and be easier to interpret, especially for applications where intensity outliers such as lesions are expected.

Segmentation-Based Normalization

Segmentation-based normalization uses a segmentation of a specified tissue, such as the cerebrospinal fluid (CSF), gray matter (GM), or white matter (WM), to normalize the entire image to the mean of the tissue. Let T ⊂ B be the tissue mask for image I. The tissue mean can be calculated as \( \mu =\frac{1}{\mid T\mid }{\sum}_{t\in T}I(t) \) and the segmentation-based normalized image is expressed as

$$ {I}_{seg}(x)=\frac{cI(x)}{\mu } $$
(29)

where \( c\in {\mathbb{R}}^{+} \) is a constant.

Kernel Density Estimate Normalization

Kernel density estimate (KDE) normalization estimates the empirical probability density function of the intensities of the entire image I over the brain mask B via kernel density estimation. The KDE of the probability density function for the image intensities can be expressed as

$$ \hat{p}(x)=\frac{1}{\mathrm{HWD}\times \delta}\sum \limits_{i=1}^{\mathrm{HWD}}K\left(\frac{x-{x}_i}{\delta}\right) $$
(30)

where H, W, D are the dimensions of I, x is an intensity value, K is the kernel, and δ is the bandwidth parameter that scales the kernel. With KDE normalization, the WM mode can be selected more robustly from a smoothed version of the histogram, making it well suited for use within a segmentation-based normalization method.

Spatial Normalization

Spatial normalization aims to register a subject’s brain image to a common space (reference space) to allow comparisons across subjects. When the reference space is a standard space, such as the Montreal Neurological Institute (MNI) space [40] or the Talairach and Tournoux atlas (Talairach space), the registration also facilitates the sharing and interpretation of data across studies. It is also common practice to define a customized space from a dataset rather than using a standard space. For deep learning methods, it has been shown that training data with appropriate spatial normalization tend to yield better performances [41,42,43]. Rigid, affine, or deformable registration may be desirable for spatial normalization, depending on the application. Many registration methods are publicly available through software packages such as 3D Slicer, FreeSurfer [https://surfer.nmr.mgh.harvard.edu/], FMRIB Software Library (FSL) [https://fsl.fmrib.ox.ac.uk/fsl/fslwiki], and Advanced Normalization Tools (ANTs) [https://picsl.upenn.edu/software/ants/].

2.2 Supervision Settings

In the following three sections, we categorize the learning-based segmentation algorithms by their supervision setting. In decreasing order of the amount of annotation required, these are supervised, semi-supervised, and unsupervised methods (Fig. 7). For supervised methods, we mainly present training strategies and model architectures that help improve segmentation performance. For the other two types of approaches, we classify the mainstream ideas and then provide application examples proposed in recent research.

Fig. 7

Overview of the supervision settings for medical image segmentation. Best viewed in color

2.3 Supervised Methods

2.3.1 Background

In supervised learning, a model is presented with the given dataset \( \mathcal{D}={\left\{\left({x}^{(i)},{y}^{(i)}\right)\right\}}_{i=1}^n \) of inputs x and associated labels y. This y can take several forms, depending on the learning task. In particular, for fully convolutional neural network-based segmentation applications, y is a segmentation map. In supervised learning, the model can learn from labeled training data by minimizing the loss function and apply what it has learned to make a prediction/segmentation in testing data. Supervised training thus aims to find model parameters θ that best predict the data based on a loss function \( L\left(y,\hat{y}\right) \). Here, \( \hat{y} \) denotes the output of the model obtained by feeding a data point x to the function f(x;θ) that represents the model. Given sufficient training data, supervised methods can generally perform better than semi-supervised or unsupervised segmentation methods.

2.3.2 Data Representation

Data is an important part of supervised segmentation models, and the model performance relies on data representation. In addition to image pre-processing (Subheading 2.1.6), there are a few key steps for data preparation before being fed into the segmentation network.

Patch Formulation

The inputs of a CNN can be represented as image patches when the whole image is too large and would require too much GPU memory. The image patches can be 2D slices, 3D patches, or any format in between. The choice of patch type affects the performance of networks for a given dataset and task [44]. Compared to 3D patches, 2D slices have the advantage of a lighter computational load during training; however, contextual information along the third axis is missing. In contrast, 3D patches leverage data from all three axes but require more computational resources. As a compromise between 2D and 3D patches, “2.5D” approaches have been proposed, taking 2D slices in all three orthogonal views through the same voxel [45]. These 2D slices can be trained in a single CNN or in a separate CNN for each view. Furthermore, Zhang et al. [46] proposed 2.5D stacked slices to leverage the information from adjacent slices in each view.

Patch Extraction

Due to the imbalance between foreground and background, various patch extraction strategies have been designed to obtain robust segmentations. Kamnitsas et al. [47], Dolz et al. [48], and Li et al. [49] pick a voxel within the foreground or background with 50% probability at every training iteration and select the patch centered at that voxel (see the sketch after this paragraph). In [46], Zhang et al. extract 2.5D stacked patches if the central slice contains the foreground, even if only a single foreground voxel is present. In some models [50, 51], 3D patches containing the target structure are used as input instead of the whole image, which reduces the effect of the background when segmenting target structures of small volume.
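Below is a minimal sketch of the balanced patch extraction strategy described above, i.e., centering a patch on a randomly chosen foreground or background voxel with equal probability; the function name, patch size, and border handling are illustrative assumptions.

```python
import numpy as np

def sample_patch(image: np.ndarray, label: np.ndarray,
                 patch_size=(64, 64, 64), p_foreground=0.5):
    """Sample a 3D patch centered on a foreground voxel with probability
    p_foreground, otherwise on a background voxel."""
    want_fg = np.random.rand() < p_foreground
    candidates = np.argwhere(label > 0) if want_fg else np.argwhere(label == 0)
    center = candidates[np.random.randint(len(candidates))]

    slices = []
    for c, size, dim in zip(center, patch_size, image.shape):
        start = int(np.clip(c - size // 2, 0, dim - size))  # keep the patch inside the volume
        slices.append(slice(start, start + size))
    return image[tuple(slices)], label[tuple(slices)]
```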

Data Augmentation

To avoid the over-fitting problem and increase the generalizability of the model, data augmentation (DA) is widely used in medical image segmentation [52]. The common DA strategies could be classified into three categories: (1) spatial augmentation, (2) image appearance augmentation, and (3) image quality augmentation. For spatial augmentation, random image flip, rotation, scale, and deformation are often used [4, 45, 53,54,55]. Random gamma correction, intensity scale, and intensity shift are the common forms for image appearance augmentation [51, 54, 56, 57]. Image quality augmentation includes random Gaussian blur, random noise addition, and image sharpening [51, 56]. Note that while we only list a few commonly used methods here, many others have been explored. TorchIO [58] is a widely used software package for data augmentation.
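As an example, the three categories of augmentation above can be composed with TorchIO [58]; the specific transforms, probabilities, and parameter ranges below are illustrative choices rather than a recommended recipe.

```python
import torchio as tio

augment = tio.Compose([
    # spatial augmentation
    tio.RandomFlip(axes=(0, 1, 2), p=0.5),
    tio.RandomAffine(scales=(0.9, 1.1), degrees=10, p=0.5),
    tio.RandomElasticDeformation(p=0.2),
    # image appearance augmentation
    tio.RandomGamma(log_gamma=(-0.3, 0.3), p=0.3),
    # image quality augmentation
    tio.RandomBlur(p=0.2),
    tio.RandomNoise(p=0.2),
])

# Applied to a tio.Subject (or image), e.g., inside a tio.SubjectsDataset:
# augmented_subject = augment(subject)
```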

2.3.3 Network Architecture

Here, we classify the popular supervised segmentation networks into single/multipath networks and encoder-decoder networks.

Single/Multipath Networks

As discussed above, patches are often used as input instead of the entire image, resulting in a lack of global context. This can produce noisy segmentations, such as undesired islands of false-positive voxels that need to be removed in post-processing [48]. To compensate for the missing global context, Li et al. [49] used spatial coordinates as additional channels of the input patches. A multipath network is another feasible solution (Fig. 8). Multipath networks usually contain global and local paths [47, 59, 60] that extract features at different scales. The global path uses convolutions with a larger kernel size [60] or a larger receptive field [47] to learn global information. In contrast, local features are extracted in the local path. The global path thus extracts global features and tends to locate the position of the target structure, whereas the shape, size, texture, boundary, and other details of the target structure are identified by the local path. However, the performance of this type of network is easily affected by the size and design of the input patches: patches that are too small do not provide enough information, while patches that are too large are computationally prohibitive.

Fig. 8

Examples of single-path (top) and multipath (bottom) networks. In the multipath network, the inputs for the two pathways are centered at the same location. The top pathway is equivalent to the single-path network and takes the normal resolution image as input, while the bottom pathway takes a downsampled image with larger field of view as input. Replicated from [47] (CC BY 4.0)

U-Net and Its Variants

To tackle the limitations of single/multipath networks, many models use U-Net variants with encoder-decoder paths [1, 61], which establish end-to-end training from image to segmentation map. The encoder is similar to the single/multipath networks but with downsampling operations between the different scales of feature maps. The decoder leverages the features extracted by the encoder and produces a segmentation of the same size as the original image. Skip connections, which pass feature maps from the encoder directly to the decoder, contribute to the performance of the U-Net: the passed information helps recover details in the segmentation.

The most common modification of the U-Net is the introduction of other convolutional modules, such as residual blocks [62], dense blocks [63], and attention modules [3, 4]. These convolutional modules can replace regular convolution operations or be used in the skip connections of the U-Net. Residual blocks mitigate the gradient vanishing problem during training by adding the input of the module to its output, which also speeds up convergence [62] and allows deeper networks to be built. The works of [53, 59, 64,65,66] used residual connections or residual blocks instead of regular convolutions in their network architectures for robust segmentation of various brain structures. Dense blocks can strengthen feature propagation and encourage feature reuse to improve segmentation accuracy, but they require more computational resources during training. Zhang et al. [46, 56] employed the Tiramisu network [67], a densely connected U-shaped network, to produce superior multiple sclerosis (MS) lesion segmentation.

The attention module is another commonly used tool in segmentation to focus on salient features [4]. It can be categorized into spatial attention and channel attention modules. Li et al. [53] use spatial attention modules in the skip connections for extracting smaller subcortical structures. Similarly, attention modules are used between skip connections and in the decoder part in the work of [51, 68] for segmenting vestibular schwannoma and cochlea. In addition, Zhang et al. [69] proposed to use slice-wise attention networks in 3D CNNs for MS segmentation. Applying the slice-wise attention in three different orientations improves the computational efficiency compared to the regular attention module. Hou et al. [70] proposed the cross-attention block, which combines channel attention and spatial attention. Moreover, in [71], a skip attention unit is used for brain tumor segmentation. Zhou et al. [72] build fusion blocks based on the attention module. Attention modules have also been used for brain tumor segmentation [73].

Transformers

As discussed in Subheading 2.1.2, transformers have become popular in medical image segmentation [74,75,76]. Transformers leverage long-range dependencies and can better capture low-level details. In practice, they can replace CNNs [77], be combined with CNNs [78, 79], or be integrated into CNNs [80]. Some recent works [14, 15, 77] have shown that implementing transformers within a U-Net architecture can achieve superior performance in medical image segmentation compared to CNN-only counterparts.

2.3.4 Framework Configuration

A single network mainly focuses on one task during training and may ignore other potentially useful information. To improve segmentation accuracy, frameworks with multiple encoders and decoders have been proposed [53, 81, 82].

Multi-task Networks

As the name suggests, multi-task networks attempt to simultaneously tackle a main task as well as auxiliary tasks, rather than focusing on a single segmentation task. These networks usually contain a shared encoder and multiple decoders for the multiple tasks, which can also help deal with class imbalance (Fig. 9). Compared to a single-task network, the shared encoder learns from several same-domain tasks (one per decoder), which increases its learning ability and can improve segmentation performance. Simultaneously learning multiple tasks can also improve model generalizability. McKinley et al. [81] leverage the information of additional tissue types to increase the accuracy of MS lesion segmentation. Another common multi-task setting is to introduce an auxiliary reconstruction task [57].

Fig. 9

Example of multi-task framework. The model takes four 3D MRI sequences (T1w, T1c, T2w, and FLAIR) as input. The U-Net structure (the top pathway with skip connection) serves as the segmentation network, and the output contains the segmentation maps of the three subregions (whole tumor (WT), tumor core (TC), and enhancing tumor (ET)). An auxiliary VAE branch (the bottom decoder) that reconstructs the input images is applied in the training stage to regularize the shared encoder. Ⓒ2019 Springer Nature. Reprinted, with permission, from [57]

Cascaded Networks

A cascaded network is a series of connected networks such that the input of each downstream network is the output from an upstream network (Fig. 10). For example, a coarse-to-fine segmentation strategy can be used to reduce the high computational cost of training for 3D images [50, 53]. In this scenario, an upstream network could take downsampled images as input to roughly locate the target structures, allowing the images to be cropped to the region of interest for the downstream network. The downstream network could then produce high-quality segmentation in full resolution. Another advantage of this approach is to reduce the impact of volume imbalance between foreground and background classes. However, the upstream network would determine the performance of the whole framework, and some global information is missing in the downstream networks.

Fig. 10

Example of cascaded networks. WNet segments the whole tumor from the input multimodal 3D MRI. Then based upon the segmentation, a bounding box (yellow dash line) can be obtained and used to crop the input. The TNet takes the cropped image to segment the tumor core. Similarly, the ENet segments the enhancing tumor core by taking the cropped images determined by the segmentation from the previous stage. Ⓒ2018 Springer Nature. Reprinted, with permission, from [50]

Ensemble Networks

To obtain a robust segmentation, a popular approach is to aggregate the outputs from multiple independent networks (i.e., networks that share no weights/parameters). Kamnitsas et al. proposed the ensemble of multiple models and architectures (EMMA) [83] for brain tumor segmentation. Kao et al. [84] produce segmentations using an ensemble of 26 neural networks. Zhao et al. [85] proposed a framework for 3D segmentation with multiple 2D networks that take inputs from different views. Huo et al. [82]