
1 Introduction

Image segmentation is an essential and challenging task in medical image analysis. Its goal is to delineate object boundaries by assigning each pixel/voxel a label, where pixels/voxels with the same label share similar properties or belong to the same class. In the context of neuroimaging, robust and accurate image segmentation can effectively help neurosurgeons and clinicians, for example, to measure the size of brain lesions or to quantitatively evaluate volume changes of brain tissue throughout treatment or surgery. For instance, quantitative measurements of subcortical and cortical structures are critical for studies of several neurodegenerative diseases such as Alzheimer’s, Parkinson’s, and Huntington’s diseases. Automatic segmentation of multiple sclerosis (MS) lesions is essential for the quantitative analysis of disease progression. The delineation of acute ischemic stroke lesions is crucial for increasing the likelihood of good clinical outcomes for the patient. While manual delineation of object boundaries is a tedious and time-consuming task, automatic segmentation algorithms can significantly reduce the workload of clinicians and increase the objectivity and reproducibility of measurements. To be specific, the segmentation task in medical images usually refers to semantic segmentation: for paired brain structures (e.g., the left and right subcortical structures), instances of the same category are not distinguished, in contrast to instance and panoptic segmentation.

There are many neuroimaging modalities such as magnetic resonance imaging, computed tomography, transcranial Doppler, and positron emission tomography. Moreover, neuroimaging studies often contain multimodal and/or longitudinal data, which can help improve our understanding of the anatomical and functional properties of the brain by utilizing complementary physical and physiological sensitivities. In this chapter, we first present some background information to help readers get familiar with the fundamental elements used in deep learning-based segmentation frameworks. Next, we discuss the learning-based segmentation approaches in the context of different supervision settings, along with some real-world applications.

2 Methods

2.1 Fundamentals

2.1.1 Common Network Architectures for Segmentation Tasks

Convolutional neural networks (CNNs) have dominated the medical image segmentation field in recent years. CNNs leverage information from images to predict segmentations by hierarchically learning parameters with linear and nonlinear layers. We begin by discussing some popular models and their architectures: (1) U-Net [1], (2) V-Net [2], (3) attention U-Net [7], and (4) nnU-Net [6]. The U-Net [1] consists of a contracting (encoder) path that captures context through successive convolutions and downsampling, and an expanding (decoder) path that recovers the original resolution through upsampling, with skip connections copying feature maps from the encoder to the decoder (Fig. 1).

Fig. 1

U-Net architecture. Blue boxes are the feature maps. Channel numbers are denoted above each box, while the tensor sizes are denoted on the lower left. White boxes show the concatenations and arrows indicate various operations. Ⓒ2015 Springer Nature. Reprinted, with permission, from [1]

V-Net is another popular model for volumetric medical image segmentation. It is a fully convolutional neural network trained end-to-end that learns directly from volumetric images. Based upon the overall structure of the U-Net, the V-Net [2] uses residual blocks [8] in place of the regular convolutional blocks and enlarges the convolution kernel size to 5 × 5 × 5. A residual block can be formulated as follows: (1) the input of the block is processed by convolutional layers and nonlinearities, and (2) the input is added to the output of the last convolutional layer or nonlinearity of the block.
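To make the residual mechanism concrete, below is a minimal PyTorch sketch of a 3D residual block in the spirit of the description above; the class name, channel count, and use of PReLU are illustrative assumptions rather than the exact V-Net implementation.

```python
import torch
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    """Minimal 3D residual block: the input is added to the output of a
    short stack of convolutions, following steps (1) and (2) above."""

    def __init__(self, channels: int, kernel_size: int = 5):
        super().__init__()
        padding = kernel_size // 2  # keep the spatial size unchanged
        self.conv1 = nn.Conv3d(channels, channels, kernel_size, padding=padding)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size, padding=padding)
        self.act = nn.PReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.act(self.conv1(x))   # step (1): convolutions and nonlinearities
        out = self.conv2(out)
        return self.act(out + x)        # step (2): residual addition

# Example: a feature map with 16 channels keeps its shape through the block.
y = ResidualBlock3D(16)(torch.randn(1, 16, 32, 32, 32))
```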

Attention U-Net is a model based on U-Net with attention gates (AG) in the skip connections (Fig. 2). The attention gates can learn to focus on the segmentation target. The salient features are emphasized with larger weights from the CNN during the training. This leads the model to achieve higher accuracy on target structures with various shapes and sizes. In addition, AGs are easy to integrate into the existing popular CNN architectures. The details of the attention mechanism and attention gates are discussed in Subheading 2.1.2. More details on attention can also be found in Chap. 6.

Fig. 2

Attention U-Net architecture. Hi, Wi, and Di represent the height, width, and depth of the feature map at the ith layer of the U-Net structure. Fi indicates the number of feature map channels. Replicated from [4] (CC BY 4.0)

nnU-Net is a medical image segmentation pipeline that self-configures its network architecture for the dataset and task it is given, without any manual intervention. Depending on the dataset and task, nnU-Net generates one of (1) a 2D U-Net, (2) a 3D U-Net, or (3) a cascaded 3D U-Net as the segmentation network. For the cascaded 3D U-Net, the first network takes downsampled images as inputs, and the second network uses the images at full resolution as input to refine the segmentation. nnU-Net is often used as a baseline method in medical image segmentation challenges because of its robust performance across various target structures and image properties. The details of nnU-Net can be found in [6].

2.1.2 Attention Modules

Although the U-Net architecture described in Subheading 2.1.1 has achieved remarkable success in medical image segmentation, the downsampling steps included in the encoder path can induce poor segmentation accuracy for small-scale anatomical structures (e.g., tumors and lesions). To tackle this issue, attention modules are often applied so that salient features are enhanced with higher weights while less important features are suppressed. This subsection introduces two types of attention mechanisms: additive attention and multiplicative attention.

Additive Attention

As discussed in the previous section, U-Net is the most popular backbone for medical image analysis tasks. The downsampling enables it to work on features of different scales. Suppose we are working on a 3D segmentation problem. The output of the U-Net encoder at the lth level is then a tensor Xl of size [Fl, Hl, Wl, Dl], where Hl, Wl, Dl denote the height, width, and depth of the feature map, respectively, and Fl represents the length of the feature vectors. We regard the tensor as a set of feature vectors \( {\boldsymbol{x}}_i^l \):

$$ {\mathcal{X}}^l={\left\{{\boldsymbol{x}}_i^l\right\}}_{i=1}^n,\kern1em {\boldsymbol{x}}_i^l\in {\mathbb{R}}^{F_l} $$
(1)

where n = Hl × Wl × Dl. The attention gate assigns a weight αi to each vector xi so that the model can concentrate on salient features. Ideally, important features are assigned higher weights so that they do not vanish during downsampling. The output of the attention gate will be a collection of weighted feature vectors:

$$ {\hat{\mathcal{X}}}^l={\left\{{\alpha}_i^l\cdot {\boldsymbol{x}}_i^l\right\}}_{i=1}^n,\kern1em {\alpha}_i^l\in \mathbb{R} $$
(2)

These weights αi, also known as gating coefficients, are determined by an attention mechanism that delineates the correlation between the feature vector x and a gating signal g. As shown in Fig. 3, for all \( {\boldsymbol{x}}_i^l\in {\mathcal{X}}^l \), we compute an additive attention with regard to a corresponding gi by

$$ {s}_{att}^l={\boldsymbol{\psi}}^{\top}\left[{\sigma}_1\left({\boldsymbol{W}}_x^{\top }{\boldsymbol{x}}_i^l+{\boldsymbol{W}}_g^{\top }{\boldsymbol{g}}_i+{\boldsymbol{b}}_g\right)\right]+{b}_{\psi } $$
(3)

where bg and bψ represent the bias and Wx, Wg, ψ are linear transformations. The output dimension of the linear transformation is \( {\mathbb{R}}^{F_{int}} \) where Fint is a self-defined integer. Denote these learnable parameters by a set Θatt. The coefficients \( {s}_{att}^l \) are normalized to [0, 1] by a sigmoid function σ2:

$$ {\alpha}_i^l={\sigma}_2\left({s}_{att}^l\left({\boldsymbol{x}}_i^l,{\boldsymbol{g}}_i;{\Theta}_{att}\right)\right) $$
(4)

The attention coefficient is thus computed from a combination of the feature vector and the gating signal. In practical applications [3, 4, 9], the gating signal is chosen to be the coarser-scale feature map, as indicated in Fig. 2. In other words, for the input feature \( {\boldsymbol{x}}_i^l \), the corresponding gating signal is defined by

$$ {\boldsymbol{g}}_i={\boldsymbol{x}}_i^{l+1} $$
(5)

Note that an extra downsampling step should be applied to Xl so that it has the same shape as Xl+1. In experiments segmenting brain tumors on MRI datasets [9] and the pancreas on abdominal CT datasets [4], AGs were shown to improve the segmentation performance for diverse types of model backbones, including U-Net and Residual U-Net.

Fig. 3

The structure of the additive attention gate. \( {\boldsymbol{x}}_i^l \) is the ith feature vector at the lth level of the U-Net structure and gi is the corresponding gating signal. Wx and Wg are the linear transformation matrices applied to \( {\boldsymbol{x}}_i^l \) and gi, respectively. The sum of the resultant vectors will be activated by ReLU and then its dot product with a vector ψ is computed. The sigmoid function is used to normalize the resulting scalar to [0, 1] range, which is the gating coefficient αi. The weighted feature vector is denoted by \( \hat{{\boldsymbol{x}}_i^l} \). Adapted from [4] (CC BY 4.0)
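A minimal PyTorch sketch of this additive attention gate (Eqs. 2–4) is given below; the 1 × 1 × 1 convolutions implement Wx, Wg, and ψ, and, for simplicity, the feature map and gating signal are assumed to have already been resampled to the same spatial size (the class and argument names are illustrative).

```python
import torch
import torch.nn as nn

class AdditiveAttentionGate3D(nn.Module):
    """Additive attention gate: computes one coefficient alpha per voxel
    (Eqs. 3-4) and returns the weighted feature map (Eq. 2)."""

    def __init__(self, in_channels: int, gating_channels: int, inter_channels: int):
        super().__init__()
        self.w_x = nn.Conv3d(in_channels, inter_channels, kernel_size=1, bias=False)
        self.w_g = nn.Conv3d(gating_channels, inter_channels, kernel_size=1, bias=True)
        self.psi = nn.Conv3d(inter_channels, 1, kernel_size=1, bias=True)
        self.relu = nn.ReLU(inplace=True)   # sigma_1
        self.sigmoid = nn.Sigmoid()         # sigma_2

    def forward(self, x: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
        s = self.psi(self.relu(self.w_x(x) + self.w_g(g)))  # Eq. 3
        alpha = self.sigmoid(s)                              # Eq. 4
        return alpha * x                                     # weighted features (Eq. 2)

# Example: 64-channel features gated by a 128-channel coarser signal of the same spatial size.
out = AdditiveAttentionGate3D(64, 128, 32)(torch.randn(1, 64, 16, 16, 16),
                                           torch.randn(1, 128, 16, 16, 16))
```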

Multiplicative Attention

Similar to additive attention, the multiplicative mechanism can also be leveraged to compute the importance of feature vectors. The basic idea of multiplicative attention was first introduced in machine translation [11]. Evolving from that, Vaswani et al. proposed a groundbreaking transformer architecture [10] which has been widely implemented in image processing [12, 13]. In recent research, transformers have been incorporated with the U-Net structure [14, 15] to improve medical image segmentation performance.

The attention function is described by matching a query vector q with a set of key vectors {k1, k2, ..., kn} to obtain the weights of the corresponding values {v1, v2, ..., vn}. Figure 4a shows an example for n = 4. Suppose the vectors q, ki, and vi have the same dimension \( {\mathbb{R}}^d \). Then, the attention function is

$$ {s}_i=\frac{{\boldsymbol{q}}^{\top }{\boldsymbol{k}}_i}{\sqrt{d}} $$
(6)

We note that the dot product can have a large magnitude when d is large, which can push the softmax into regions with vanishing gradients; si is therefore scaled by \( \sqrt{d} \) to alleviate this. Equation 6 is a commonly used attention function in transformers. Other options include si = q⊤ki and si = q⊤Wki, where W is a learnable parameter. Generally, the attention value si is determined by the similarity between the query and the key. Similar to the additive attention gate, these attention values are normalized to [0, 1] by a softmax function σ3:

$$ {\alpha}_i={\sigma}_3\left({s}_1,...,{s}_n\right)=\frac{e^{s_i}}{\sum_{j=1}^n{e}^{s_j}} $$
(7)
Fig. 4

(a) The dot-product attention gate. ki are the keys and q is the query vector. si are the outputs of the attention function. By using the softmax σ3, the attention coefficients αi are normalized to [0, 1] range. The output will be the weighted sum of values vi. (b) The multi-head attention is implemented in transformers. The input values, keys, and query are linearly projected to different spaces. Then the dot-product attention is applied on each space. The resultant vectors are concatenated by channel and passed through another linear transformation. Image (b) is adapted from [10]. Permission to reuse was kindly granted by the authors

The output of the attention gate will be \( \hat{\boldsymbol{v}}={\sum}_{i=1}^n{\alpha}_i{\boldsymbol{v}}_i \). In the transformer application, the values, keys, and queries are usually linearly projected into several different spaces, and then the attention gate is applied in each space as illustrated in Fig. 4b. This approach is called multi-head attention; it enables the model to jointly attend to information from different subspaces.

In practice, the value vi is often defined by the same feature vector as the key ki, and the query is derived from the same features, which is why the module is also called multi-head self-attention (MSA). Chen et al. proposed TransUNet [15], which leverages this module in the bottleneck of a U-Net as shown in Fig. 5. They argue that such a combination of a U-Net and the transformer achieves superior performance in multi-organ segmentation tasks.
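The sketch below illustrates scaled dot-product self-attention (Eqs. 6–7) in PyTorch; the linear projections and tensor sizes are illustrative assumptions, and PyTorch's nn.MultiheadAttention provides the multi-head variant of Fig. 4b.

```python
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v):
    """q, k, v of shape (n, d): similarity scores (Eq. 6), softmax weights (Eq. 7),
    then the weighted sum of the values."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # Eq. 6
    alpha = torch.softmax(scores, dim=-1)            # Eq. 7
    return alpha @ v

# Self-attention: queries, keys, and values are projections of the same feature vectors,
# e.g., n = 512 bottleneck feature vectors of length d = 64 from a U-Net encoder.
x = torch.randn(512, 64)
w_q, w_k, w_v = nn.Linear(64, 64), nn.Linear(64, 64), nn.Linear(64, 64)
out = scaled_dot_product_attention(w_q(x), w_k(x), w_v(x))  # shape (512, 64)
```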

Fig. 5

The architecture of TransUNet. The transformer layer represented by the yellow box shows the application of multi-head self-attention (MSA). MLP represents the multilayer perceptron. In general, the feature vectors in the bottleneck of the U-Net are set as the input to the stack of n transformer layers. As these layers do not change the dimension of the features, they are easy to implement and do not affect other parts of the U-Net model. Replicated from [15] (CC BY 4.0)

2.1.3 Loss Functions for Segmentation Tasks

This section summarizes some of the most widely used loss functions for medical image segmentation (Fig. 6) and describes their usage in different scenarios. Complementary reading material with an extensive list of loss functions can be found in [16, 17]. In the following, the predicted probability by the segmentation model and the ground truth at the ith pixel/voxel are denoted as pi and gi, respectively, and N is the number of voxels in the image.

Fig. 6

Loss functions for medical image segmentation. WCE: weighted cross-entropy loss. DPCE: distance map penalized cross-entropy loss. ELL: exponential logarithmic loss. SS: sensitivity-specificity loss. GD: generalized Dice loss. pGD: penalty generalized Dice loss. Asym: asymmetric similarity loss. IoU: intersection over union loss. HD: Hausdorff distance loss. Ⓒ2021 Elsevier. Reprinted, with permission, from [16]

Cross-Entropy Loss

Cross-entropy (CE) is defined as a measure of the difference between two probability distributions for a given random variable or set of events. This loss function is used for pixel-wise classification in segmentation tasks:

$$ {\ell}_{CE}=-\sum \limits_i^N\sum \limits_k^K{y}_i^k\log \left({p}_i^k\right) $$
(8)

where N is the number of voxels, K is the number of classes, \( {y}_i^k \) is a binary indicator that shows whether k is the correct class, and \( {p}_i^k \) is the predicted probability for voxel i to be in kth class.

Weighted Cross-Entropy Loss

Weighted cross-entropy (WCE) loss is a variant of the cross-entropy loss to address the class imbalance issue. Specifically, class-specific coefficients are used to weigh each class differently, as follows:

$$ {\ell}_{WCE}=-\sum \limits_i^N\sum \limits_k^K{w}_{y_k}{y}_i^k\log \left({p}_i^k\right) $$
(9)

Here, \( {w}_{y_k} \) is the coefficient for the kth class. Suppose there are 5 positive samples and 12 negative samples in a binary classification training set. By setting w0 = 1 and w1 = 2, the loss would be as if there were ten positive samples.
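As an illustration of the example above, PyTorch's cross-entropy loss accepts per-class weights directly; the tensor values below mirror the w0 = 1, w1 = 2 choice, and the sample counts are only for illustration.

```python
import torch
import torch.nn as nn

# Weighted cross-entropy (Eq. 9): class 0 (negative) weighted 1, class 1 (positive) weighted 2,
# so the 5 positive samples contribute to the loss as if there were 10 of them.
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0]))

logits = torch.randn(17, 2)  # 17 samples (12 negative + 5 positive), 2 classes
labels = torch.cat([torch.zeros(12), torch.ones(5)]).long()
loss = criterion(logits, labels)
```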

Focal Loss

Focal loss was proposed to apply a modulating term to the CE loss to focus on hard negative samples. It is a dynamically scaled CE loss, where the scaling factor decays to zero as confidence in the correct class increases. Intuitively, this scaling factor can automatically down-weight the contribution of easy examples during training and rapidly focus the model on hard examples:

$$ {\ell}_{Focal}=-\sum \limits_i^N{\alpha}_i{\left(1-{p}_i\right)}^{\gamma}\log \left({p}_i\right) $$
(10)

Here, αi is the weighting factor to address the class imbalance and γ is a tunable focusing parameter (γ > 0).

Dice Loss

The Dice coefficient is a widely used metric in the computer vision community to calculate the similarity between two binary segmentations. In 2016, this metric was adapted as a loss function for 3D medical image segmentation [2]:

$$ {\ell}_{Dice}=1-\frac{2{\sum}_i^N{p}_i{g}_i+1}{\sum_i^N\left({p}_i+{g}_i\right)+1} $$
(11)

Generalized Dice Loss

Generalized Dice loss (GDL) [18] was proposed to reduce the well-known correlation between region size and Dice score:

$$ {\ell}_{GDL}=1-2\frac{\sum_{l=1}^2{w}_l{\sum}_i^N{p}_{li}{g}_{li}}{\sum_{l=1}^2{w}_l{\sum}_i^N\left({p}_{li}+{g}_{li}\right)} $$
(12)

Here \( {w}_l=\frac{1}{{\left({\sum}_i^N{g}_{li}\right)}^2} \) is used to provide invariance to different region sizes, i.e., the contribution of each region is corrected by the inverse of its volume.

Tversky Loss

The Tversky loss [19] is a generalization of the Dice loss by adding two weighting factors α and β to the FP (false positive) and FN (false negative) terms. The Tversky loss is defined as

$$ {\ell}_{Tversky}=1-\frac{\sum_i^N{p}_i{g}_i}{\sum_i^N{p}_i{g}_i+\alpha {\sum}_i^N\left(1-{g}_i\right){p}_i+\beta {\sum}_i^N\left(1-{p}_i\right){g}_i} $$
(13)

Recently, a comprehensive study [16] of loss functions for medical image segmentation tasks showed that using Dice-related compound loss functions, e.g., Dice loss + CE loss, is a good default choice for new segmentation tasks, although no single loss consistently achieves the best performance across multiple segmentation tasks. Therefore, for a new segmentation task, we recommend that readers start with Dice + CE loss, which is also the default loss function in one of the most popular medical image segmentation frameworks, nnU-Net [6].
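As a starting point, a compound Dice + CE loss for binary segmentation can be sketched as follows; the smoothing constant and function name are illustrative, and frameworks such as nnU-Net provide their own implementations.

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits: torch.Tensor, target: torch.Tensor, smooth: float = 1.0) -> torch.Tensor:
    """Compound loss: soft Dice (Eq. 11, with +1 smoothing) plus cross-entropy (Eq. 8)
    for binary segmentation. logits and target have shape (B, 1, H, W, D)."""
    p = torch.sigmoid(logits)
    intersection = (p * target).sum()
    dice = 1.0 - (2.0 * intersection + smooth) / (p.sum() + target.sum() + smooth)
    ce = F.binary_cross_entropy_with_logits(logits, target.float())
    return dice + ce
```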

Finally, note that other loss functions have also been proposed to introduce prior knowledge about size, topology, or shape, for instance [20].

2.1.4 Early Stopping

Given a loss function, a simple strategy is to stop the training process once a predetermined maximum number of iterations is reached. However, too few iterations would lead to an under-fitting problem, while over-fitting may occur with too many iterations. "Early stopping" is a common method to avoid such issues. When using early stopping, the training set is split into training and validation sets, and the stopping condition is based on the performance on the validation set. For example, if the validation performance (e.g., average Dice score) does not increase for a given number of iterations, the early stopping condition is triggered. In this situation, the best model, i.e., the one with the highest performance on the validation set, is saved and used for inference. Of course, one should not report the validation performance as the final evaluation of the model. Instead, one should use a separate test set, kept unseen during training, for an unbiased evaluation.
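A minimal sketch of an early stopping loop is shown below; `train_one_epoch`, `evaluate_dice`, `model`, the data loaders, and the patience value are placeholders for the user's own training code.

```python
import torch

max_epochs = 100                                # placeholder maximum number of epochs
best_score, patience, wait = -float("inf"), 20, 0

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader)        # placeholder: one pass over the training set
    score = evaluate_dice(model, val_loader)    # placeholder: average Dice on the validation set

    if score > best_score:                      # validation performance improved
        best_score, wait = score, 0
        torch.save(model.state_dict(), "best_model.pt")  # keep the best model for inference
    else:
        wait += 1
        if wait >= patience:                    # no improvement for `patience` epochs
            break                               # early stopping triggered
```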

2.1.5 Evaluation Metrics for Segmentation Tasks

Various metrics can quantitatively evaluate different aspects of a segmentation algorithm. In a binary segmentation task, a true positive (TP) indicates that a pixel in the target object is correctly predicted as target. Similarly, a true negative (TN) represents a background pixel that is correctly identified as background. On the other hand, a false positive (FP) and a false negative (FN) refer to a wrong prediction for pixels in the target and background, respectively. Most of the evaluation metrics are based upon the number of pixels in these four categories.

Sensitivity measures the completeness of positive predictions with regard to the positive ground truth (TP +  FN). It thus shows the model’s ability to identify target pixels. It is also referred to as recall or true-positive rate (TPR). It is defined as

$$ \mathrm{Sensitivity}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} $$
(14)

As the negative counterpart of sensitivity, specificity describes the proportion of negative pixels that are correctly predicted. It is also referred to as true-negative rate (TNR). It is defined as

$$ \mathrm{Specificity}=\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}} $$
(15)

Specificity can be difficult to interpret because TN is usually very large. It can even be misleading as TN can be made arbitrarily large by changing the field of view. This is due to the fact that the metric is computed over pixels and not over patients/controls like in classification tasks (the number of controls is fixed). In order to provide meaningful measures of specificity, it is preferable to define a background region that has an anatomical definition (for instance, the brain mask from which the target is subtracted) and does not include the full field of view of the image.

Positive predictive value (PPV), also known as precision, measures the correct rate among pixels that are predicted as positives:

$$ \mathrm{PPV}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} $$
(16)

For clinical interpretation of segmentation, it is often useful to have a more direct estimation of false discoveries (i.e., false positives). To that purpose, one can report the false discovery rate:

$$ \mathrm{FDR}=1-\mathrm{PPV}=\frac{\mathrm{FP}}{\mathrm{TP}+\mathrm{FP}} $$
(17)

which is redundant with PPV but may be more intuitive for clinicians in the context of segmentation.

Dice similarity coefficient (DSC) measures the proportion of spatial overlap between the ground truth (TP+FN) and the predicted positives (TP+FP). Dice similarity is the same as the F1 score, which computes the harmonic mean of sensitivity and PPV:

$$ \mathrm{DSC}=\frac{2\mathrm{TP}}{2\mathrm{TP}+\mathrm{FN}+\mathrm{FP}} $$
(18)

Accuracy is the ratio of correct predictions:

$$ \mathrm{Accuracy}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}} $$
(19)

As was the case for specificity, we note that there are many segmentation tasks where the target anatomical structure is very small (e.g., subcortical structures); hence, the foreground and background have an unbalanced number of pixels. In this case, accuracy can be misleading and display high values for poor segmentations. Moreover, as for specificity, one needs to define a background region so that TN, and thus accuracy, does not vary arbitrarily with the field of view.

The Jaccard index (JI), also known as the intersection over union (IoU), measures the percentage of overlap between the ground truth and positive prediction relative to the union of the two:

$$ \mathrm{JI}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}+\mathrm{FN}} $$
(20)

JI is closely related to the DSC. However, it is always lower than the DSC and tends to penalize poor segmentations more severely.
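The overlap-based metrics above (Eqs. 14–20) can all be computed from the four confusion counts; a minimal NumPy sketch is given below (the function name is illustrative, and the masks should be restricted to an anatomically defined region so that TN remains meaningful, as discussed above).

```python
import numpy as np

def overlap_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Compute overlap-based metrics (Eqs. 14-20) from two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    tn = np.sum(~pred & ~gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "dice": 2 * tp / (2 * tp + fn + fp),
        "jaccard": tp / (tp + fp + fn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }
```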

There are also distance measures of segmentation accuracy which are especially relevant when the accuracy of the boundary is critical. These include the average symmetric surface distance (ASSD) and the Hausdorff distance (HD). Suppose the surface of the ground truth and the predicted segmentation are \( \mathcal{S} \) and \( {\mathcal{S}}^{\prime } \), respectively. For any point \( \boldsymbol{p}\in \mathcal{S} \), the distance from p to surface \( {\mathcal{S}}^{\prime } \) is defined by the minimum Euclidean distance:

$$ d\left(\boldsymbol{p},{\mathcal{S}}^{\prime}\right)=\underset{\boldsymbol{p}^{\prime}\in \mathcal{S}^{\prime }}{\min}\parallel \boldsymbol{p}-{\boldsymbol{p}}^{\prime}\parallel {}_2 $$
(21)

Then the average distance between \( \mathcal{S} \) and \( {\mathcal{S}}^{\prime } \) is given by averaging over \( \mathcal{S} \):

$$ d\left(\mathcal{S},{\mathcal{S}}^{\prime}\right)=\frac{1}{N_S}\sum \limits_{i=1}^{N_S}d\left({\boldsymbol{p}}_i,{\mathcal{S}}^{\prime}\right) $$
(22)

Note that \( d\left(\mathcal{S},{\mathcal{S}}^{\prime}\right)\ne d\left({\mathcal{S}}^{\prime },\mathcal{S}\right) \). Therefore, both directions are included in ASSD so that the mean of the surface distance is symmetric:

$$ \mathrm{ASSD}=\frac{1}{N_S+{N}_{S\prime }}\left[\sum \limits_{i=1}^{N_S}d\left({\boldsymbol{p}}_i,{\mathcal{S}}^{\prime}\right)+\sum \limits_{j=1}^{N_{S\prime }}d\Big({\boldsymbol{p}}_j^{\prime },\mathcal{S}\Big)\right] $$
(23)

The ASSD tends to obscure localized errors when the segmentation is decent at most of the points on the boundary. The Hausdorff distance (HD) can better expose such errors by computing the maximum, rather than the average, distance to a surface. To that purpose, one defines

$$ h\left(\mathcal{S},{\mathcal{S}}^{\prime}\right)=\underset{\boldsymbol{p}\in \mathcal{S}}{\max }d\left(\boldsymbol{p},{\mathcal{S}}^{\prime}\right) $$
(24)

Note that, again, \( h\left(\mathcal{S},{\mathcal{S}}^{\prime}\right)\ne h\left({\mathcal{S}}^{\prime },\mathcal{S}\right) \). Therefore, both directions are included in HD so that the distance is symmetric:

$$ \mathrm{HD}=\max \left(h\left(\mathcal{S},{\mathcal{S}}^{\prime}\right),h\left({\mathcal{S}}^{\prime },\mathcal{S}\right)\right) $$
(25)

HD is more sensitive than ASSD to localized errors. However, it can be too sensitive to outliers. Hence, using the 95th percentile rather than the maximum value for computing \( h\left(\mathcal{S},{\mathcal{S}}^{\prime}\right) \) is a good option to alleviate the problem.
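Surface distances are conveniently computed with a Euclidean distance transform; the sketch below computes the ASSD (Eq. 23) and a 95th-percentile HD from two binary masks, assuming SciPy is available (the function names and voxel-spacing handling are illustrative).

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def surface_distances(a: np.ndarray, b: np.ndarray, spacing=(1.0, 1.0, 1.0)) -> np.ndarray:
    """Distances (Eq. 21) from every surface voxel of mask `a` to the surface of mask `b`."""
    surf_a = a ^ binary_erosion(a)   # boundary voxels of a
    surf_b = b ^ binary_erosion(b)
    dist_to_surf_b = distance_transform_edt(~surf_b, sampling=spacing)
    return dist_to_surf_b[surf_a]

def assd_and_hd95(pred: np.ndarray, gt: np.ndarray, spacing=(1.0, 1.0, 1.0)):
    pred, gt = pred.astype(bool), gt.astype(bool)
    d_pg = surface_distances(pred, gt, spacing)
    d_gp = surface_distances(gt, pred, spacing)
    assd = (d_pg.sum() + d_gp.sum()) / (len(d_pg) + len(d_gp))    # Eq. 23
    hd95 = max(np.percentile(d_pg, 95), np.percentile(d_gp, 95))  # 95th-percentile HD
    return assd, hd95
```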

Moreover, there are some volume-based measurements that focus on correctly estimating the volume of the target structure, which is essential for clinicians since the size of the tissue is an important marker in many diseases. Denote the ground truth volume as V and the predicted volume as V′. There are a few expressions for the volume difference: (1) the unsigned volume difference, |V′− V |; (2) the normalized unsigned difference, \( \frac{\mid {V}^{\prime }-V\mid }{V} \); (3) the normalized signed difference, \( \frac{V^{\prime }-V}{V} \); and (4) Pearson’s correlation coefficient between the ground truth volumes and the predicted volumes, \( \frac{\mathrm{Cov}\left(V,{V}^{\prime}\right)}{\sqrt{\mathrm{Var}(V)}\sqrt{\mathrm{Var}\left({V}^{\prime}\right)}} \). Nevertheless, note that, while they are useful, these volume-based metrics can be misleading when used in isolation (a segmentation could be wrongly placed while still providing a reasonable volume estimate). They thus need to be combined with overlap metrics such as the Dice coefficient.

Finally, some recent guidelines on validation of different image analysis tasks, including segmentation, were published in [21].

2.1.6 Pre-processing for Segmentation Tasks

Image pre-processing is a set of sequential steps taken to improve the data and prepare it for subsequent analysis. Appropriate image pre-processing steps often significantly improve the quality of feature extraction and the downstream image analysis. For deep learning methods, they can also help the training process converge faster and achieve better model performance. The following sections will discuss some of the most widely used image pre-processing techniques.

Skull Stripping

Many neuroimaging applications require preliminary processing to isolate the brain from extracranial or non-brain tissues in MRI scans, commonly referred to as skull stripping. Skull stripping helps reduce the variability in datasets and is a critical step prior to many other image processing algorithms such as registration, segmentation, or cortical surface reconstruction. In the literature, skull stripping methods are broadly classified into five categories: mathematical morphology-based methods [22], intensity-based methods [23], deformable surface-based methods [24], atlas-based methods [25], and hybrid methods [26]. Recently, deep learning-based skull stripping methods have been proposed [27,28,29,30,31,32] to improve accuracy and efficiency. A detailed discussion of the merits and limitations of various skull stripping techniques can be found in [33].

Bias Field Correction

The bias field refers to a low-frequency and very smooth signal that corrupts MR images [34]. These artifacts, often described as shading or bias, can be generated by imperfections in the field coils or by magnetic susceptibility changes at the boundaries between anatomical tissue and air. This bias field can significantly degrade the performance of image processing algorithms that use the image intensity values. Therefore, a pre-processing step is usually required to remove the bias field. The N4 bias field correction algorithm [35] is one of the most widely used methods for this purpose, as it assumes a simple parametric model and does not require tissue classification.

Data Harmonization

Another challenge of MRI data is that it suffers from significant intensity variability due to several factors such as variations in hardware, reconstruction algorithms, and acquisition settings. This is also due to the fact that most MR imaging sequences (e.g., T1-weighted, T2-weighted) are not quantitative (the voxel values can only be interpreted relative to each other). Such differences can often be pronounced in multisite studies, among others. This variability can be problematic because intensity-based models may not generalize well to such heterogeneous datasets, and the resulting analyses can suffer from significant biases caused by acquisition details rather than anatomical differences. It is thus desirable to have robust data harmonization methods to reduce unwanted variability across sites, scanners, and acquisition protocols. One popular MRI harmonization method is ComBat, a statistical approach originally developed to remove batch effects in genomics data. This method was shown to exhibit a good capacity to remove unwanted site biases while preserving the desired biological information [36]. Another popular method is a deep learning-based image-to-image translation model, CycleGAN [37]. CycleGAN and its variants do not require paired data, and thus the training process is unsupervised in the context of data harmonization.

Intensity Normalization

Intensity normalization is another important step to ensure comparability across images. In this section, we discuss common intensity normalization techniques. Readers can refer to [38], which explores the impact of different intensity normalization techniques on MR image synthesis.

Z-Score Normalization

The basic Z-score normalization on the entire image is also called the whole-brain normalization. Given the mean μ and standard deviation σ from all voxels in a brain mask B, Z-score normalization can be performed for all voxels in image I as follows:

$$ {I}_{z- score}(x)=\frac{I(x)-\mu }{\sigma } $$
(26)

While straightforward to implement, whole-brain normalization is known to be sensitive to outliers.
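A minimal NumPy sketch of whole-brain Z-score normalization (Eq. 26) is shown below; the statistics are computed inside a brain mask and then applied to the whole image (the function name is illustrative).

```python
import numpy as np

def zscore_normalize(image: np.ndarray, brain_mask: np.ndarray) -> np.ndarray:
    """Whole-brain Z-score normalization (Eq. 26): mean and standard deviation
    are estimated from the voxels inside the brain mask."""
    voxels = image[brain_mask > 0]
    mu, sigma = voxels.mean(), voxels.std()
    return (image - mu) / sigma
```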

White Stripe Normalization

White stripe normalization [39] is based on the parameters obtained from a sample of normal-appearing white matter (NAWM) and is thus robust to local intensity outliers such as lesions. The NAWM is obtained by smoothing the histogram of the image I and selecting the mode of the distribution. For T1-weighted MRI, the “white stripe” is defined as the 10% of intensity values around the mean of NAWM μ. Let F(x) be the CDF of the specific MR image I(x) inside the brain mask B, and τ = 5%. The white stripe Ωτ is defined as

$$ {\Omega}_{\tau }=\left\{I(x)\ |\ {F}^{-1}\left(F\left(\mu \right)-\tau \right)<I(x)<{F}^{-1}\left(F\left(\mu \right)+\tau \right)\right\} $$
(27)

Then let στ be the sample standard deviation associated with Ωτ. The white stripe normalized image is

$$ {I}_{ws}(x)=\frac{I(x)-\mu }{\sigma_{\tau }} $$
(28)

Compared to whole-brain normalization, white stripe normalization may perform better and be easier to interpret, especially for applications where intensity outliers such as lesions are expected.

Segmentation-Based Normalization

Segmentation-based normalization uses a segmentation of a specified tissue, such as the cerebrospinal fluid (CSF), gray matter (GM), or white matter (WM), to normalize the entire image to the mean of the tissue. Let T ⊂ B be the tissue mask for image I. The tissue mean can be calculated as \( \mu =\frac{1}{\mid T\mid }{\sum}_{t\in T}I(t) \) and the segmentation-based normalized image is expressed as

$$ {I}_{seg}(x)=\frac{cI(x)}{\mu } $$
(29)

where \( c\in {\mathbb{R}}^{+} \) is a constant.

Kernel Density Estimate Normalization

Kernel density estimate (KDE) normalization estimates the empirical probability density function of the intensities of the entire image I over the brain mask B via kernel density estimation. The KDE of the probability density function for the image intensities can be expressed as

$$ \hat{p}(x)=\frac{1}{\mathrm{HWD}\times \delta}\sum \limits_{i=1}^{\mathrm{HWD}}K\left(\frac{x-{x}_i}{\delta}\right) $$
(30)

where H, W, D are the dimensions of I, x is an intensity value, K is the kernel, and δ is the bandwidth parameter that scales the kernel. With KDE normalization, the WM mode can be selected more robustly from a smoothed version of the histogram, making it well suited for use within a segmentation-based normalization method.

Spatial Normalization

Spatial normalization aims to register a subject’s brain image to a common space (reference space) to allow comparisons across subjects. When the reference space is a standard space, such as the Montreal Neurological Institute (MNI) space [40] or the Talairach and Tournoux atlas (Talairach space), the registration also facilitates the sharing and interpretation of data across studies. It is also common practice to define a customized space from a dataset rather than using a standard space. For deep learning methods, it has been shown that training data with appropriate spatial normalization tend to yield better performances [41,42,43]. Rigid, affine, or deformable registration may be desirable for spatial normalization, depending on the application. Many registration methods are publicly available through software packages such as 3D Slicer, FreeSurfer [https://surfer.nmr.mgh.harvard.edu/], FMRIB Software Library (FSL) [https://fsl.fmrib.ox.ac.uk/fsl/fslwiki], and Advanced Normalization Tools (ANTs) [https://picsl.upenn.edu/software/ants/].

2.2 Supervision Settings

In the following three sections, we categorize the learning-based segmentation algorithms by their supervision setting. In decreasing order of the amount of annotation required, these are supervised, semi-supervised, and unsupervised methods (Fig. 7). For supervised methods, we mainly present training strategies and model architectures that help improve segmentation performance. For the other two types of approaches, we classify the mainstream ideas and then provide application examples proposed in recent research.

Fig. 7

Overview of the supervision settings for medical image segmentation. Best viewed in color

2.3 Supervised Methods

2.3.1 Background

In supervised learning, a model is presented with the given dataset \( \mathcal{D}={\left\{\left({x}^{(i)},{y}^{(i)}\right)\right\}}_{i=1}^n \) of inputs x and associated labels y. This y can take several forms, depending on the learning task. In particular, for fully convolutional neural network-based segmentation applications, y is a segmentation map. In supervised learning, the model can learn from labeled training data by minimizing the loss function and apply what it has learned to make a prediction/segmentation in testing data. Supervised training thus aims to find model parameters θ that best predict the data based on a loss function \( L\left(y,\hat{y}\right) \). Here, \( \hat{y} \) denotes the output of the model obtained by feeding a data point x to the function f(x;θ) that represents the model. Given sufficient training data, supervised methods can generally perform better than semi-supervised or unsupervised segmentation methods.

2.3.2 Data Representation

Data is an important part of supervised segmentation models, and the model performance relies on data representation. In addition to image pre-processing (Subheading 2.1.6), there are a few key steps for data preparation before being fed into the segmentation network.

Patch Formulation

The inputs of a CNN can be represented as image patches when the whole image is too large and would require too much GPU memory. The image patches can be 2D slices, 3D patches, or any format in between. The choice of patch type affects the performance of networks for a given dataset and task [44]. Compared to 3D patches, 2D slices have the advantage of a lighter computational load during training; however, contextual information along the third axis is missing. In contrast, 3D patches leverage data from all three axes but require more computational resources. As a compromise between 2D and 3D patches, “2.5D” approaches have been proposed, taking 2D slices in all three orthogonal views through the same voxel [45]. These 2D slices can be trained in a single CNN or in a separate CNN for each view. Furthermore, Zhang et al. [46] proposed 2.5D stacked slices to leverage the information from adjacent slices in each view.

Patch Extraction

Due to the imbalance between foreground and background, various patch extraction strategies have been designed to obtain robust segmentations. Kamnitsas et al. [47], Dolz et al. [48], and Li et al. [49] pick a voxel within the foreground or background with 50% probability at every training iteration and select the patch centered at that voxel (see the sketch after this paragraph). In [46], Zhang et al. extract 2.5D stacked patches if the central slice contains the foreground, even if only a single foreground voxel is present. In some models [50, 51], 3D patches containing the target structure are used as input instead of the whole image, which reduces the effect of the background when segmenting target structures of small volume.
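Below is a minimal sketch of the balanced patch extraction strategy described above, i.e., centering a patch on a randomly chosen foreground or background voxel with equal probability; the function name, patch size, and border handling are illustrative assumptions.

```python
import numpy as np

def sample_patch(image: np.ndarray, label: np.ndarray,
                 patch_size=(64, 64, 64), p_foreground=0.5):
    """Sample a 3D patch centered on a foreground voxel with probability
    p_foreground, otherwise on a background voxel."""
    want_fg = np.random.rand() < p_foreground
    candidates = np.argwhere(label > 0) if want_fg else np.argwhere(label == 0)
    center = candidates[np.random.randint(len(candidates))]

    slices = []
    for c, size, dim in zip(center, patch_size, image.shape):
        start = int(np.clip(c - size // 2, 0, dim - size))  # keep the patch inside the volume
        slices.append(slice(start, start + size))
    return image[tuple(slices)], label[tuple(slices)]
```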

Data Augmentation

To avoid the over-fitting problem and increase the generalizability of the model, data augmentation (DA) is widely used in medical image segmentation [52]. The common DA strategies could be classified into three categories: (1) spatial augmentation, (2) image appearance augmentation, and (3) image quality augmentation. For spatial augmentation, random image flip, rotation, scale, and deformation are often used [4, 45, 53,54,55]. Random gamma correction, intensity scale, and intensity shift are the common forms for image appearance augmentation [51, 54, 56, 57]. Image quality augmentation includes random Gaussian blur, random noise addition, and image sharpening [51, 56]. Note that while we only list a few commonly used methods here, many others have been explored. TorchIO [58] is a widely used software package for data augmentation.
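As an example, the three categories of augmentation above can be composed with TorchIO [58]; the specific transforms, probabilities, and parameter ranges below are illustrative choices rather than a recommended recipe.

```python
import torchio as tio

augment = tio.Compose([
    # spatial augmentation
    tio.RandomFlip(axes=(0, 1, 2), p=0.5),
    tio.RandomAffine(scales=(0.9, 1.1), degrees=10, p=0.5),
    tio.RandomElasticDeformation(p=0.2),
    # image appearance augmentation
    tio.RandomGamma(log_gamma=(-0.3, 0.3), p=0.3),
    # image quality augmentation
    tio.RandomBlur(p=0.2),
    tio.RandomNoise(p=0.2),
])

# Applied to a tio.Subject (or image), e.g., inside a tio.SubjectsDataset:
# augmented_subject = augment(subject)
```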

2.3.3 Network Architecture

Here, we classify the popular supervised segmentation networks into single/multipath networks and encoder-decoder networks.

Single/Multipath Networks

As discussed above, patches are often used as input instead of the entire image, resulting in a lack of global context. This can produce noisy segmentations, such as undesired islands of false-positive voxels that need to be removed in post-processing [48]. To compensate for the missing global context, Li et al. [49] used spatial coordinates as additional channels of the input patches. A multipath network is another feasible solution (Fig. 8). Multipath networks usually contain global and local paths [47, 59, 60] that extract features at different scales. The global path uses convolutions with a larger kernel size [60] or a larger receptive field [47] to learn global information. In contrast, local features are extracted in the local path. The global path thus extracts global features and tends to locate the position of the target structure, whereas the shape, size, texture, boundary, and other details of the target structure are identified by the local path. However, the performance of this type of network is easily affected by the size and design of the input patches: patches that are too small do not provide enough information, while patches that are too large are computationally prohibitive.

Fig. 8

Examples of single-path (top) and multipath (bottom) networks. In the multipath network, the inputs for the two pathways are centered at the same location. The top pathway is equivalent to the single-path network and takes the normal resolution image as input, while the bottom pathway takes a downsampled image with larger field of view as input. Replicated from [47] (CC BY 4.0)

U-Net and Its Variants

To tackle the limitations of single/multipath networks, many models use U-Net variants with encoder-decoder paths [1, 61], which establish end-to-end training from image to segmentation map. The encoder is similar to the single/multipath networks but with downsampling operations between the different scales of feature maps. The decoder leverages the features extracted by the encoder and produces a segmentation of the same size as the original image. Skip connections, which pass feature maps from the encoder directly to the decoder, contribute to the performance of the U-Net: the passed information helps recover details in the segmentation.

The most common modification of the U-Net is the introduction of other convolutional modules, such as residual blocks [62], dense blocks [63], and attention modules [3, 4]. These convolutional modules can replace regular convolution operations or be used in the skip connections of the U-Net. Residual blocks mitigate the gradient vanishing problem during training by adding the input of the module to its output, which also speeds up convergence [62] and allows deeper networks to be built. The works of [53, 59, 64,65,66] used residual connections or residual blocks instead of regular convolutions in their network architectures for robust segmentation of various brain structures. Dense blocks can strengthen feature propagation and encourage feature reuse to improve segmentation accuracy, but they require more computational resources during training. Zhang et al. [46, 56] employed the Tiramisu network [67], a densely connected U-shaped network, to produce superior multiple sclerosis (MS) lesion segmentation.

The attention module is another commonly used tool in segmentation to focus on salient features [4]. It can be categorized into spatial attention and channel attention modules. Li et al. [53] use spatial attention modules in the skip connections for extracting smaller subcortical structures. Similarly, attention modules are used between skip connections and in the decoder part in the work of [51, 68] for segmenting vestibular schwannoma and cochlea. In addition, Zhang et al. [69] proposed to use slice-wise attention networks in 3D CNNs for MS segmentation. Applying the slice-wise attention in three different orientations improves the computational efficiency compared to the regular attention module. Hou et al. [70] proposed the cross-attention block, which combines channel attention and spatial attention. Moreover, in [71], a skip attention unit is used for brain tumor segmentation. Zhou et al. [72] build fusion blocks based on the attention module. Attention modules have also been used for brain tumor segmentation [73].

Transformers

As discussed in Subheading 2.1.2, transformers have become popular in medical image segmentation [74,75,76]. Transformers leverage long-range dependencies and can better capture low-level details. In practice, they can replace CNNs [77], be combined with CNNs [78, 79], or be integrated into CNNs [80]. Some recent works [14, 15, 77] have shown that implementing transformers within a U-Net architecture can achieve superior performance in medical image segmentation compared to CNN-only counterparts.

2.3.4 Framework Configuration

A single network mainly focuses on one task during training and may ignore other potentially useful information. To improve segmentation accuracy, frameworks with multiple encoders and decoders have been proposed [53, 81, 82].

Multi-task Networks

As the name suggests, multi-task networks attempt to simultaneously tackle a main task as well as auxiliary tasks, rather than focusing on a single segmentation task. These networks usually contain a shared encoder and multiple decoders for the multiple tasks, which can also help deal with class imbalance (Fig. 9). Compared to a single-task network, the shared encoder learns from several same-domain tasks (one per decoder), which increases its learning ability and can improve segmentation performance. Simultaneously learning multiple tasks can also improve model generalizability. McKinley et al. [81] leverage the information of additional tissue types to increase the accuracy of MS lesion segmentation. Another common multi-task setting is to introduce an auxiliary reconstruction task [57].

Fig. 9

Example of multi-task framework. The model takes four 3D MRI sequences (T1w, T1c, T2w, and FLAIR) as input. The U-Net structure (the top pathway with skip connection) serves as the segmentation network, and the output contains the segmentation maps of the three subregions (whole tumor (WT), tumor core (TC), and enhancing tumor (ET)). An auxiliary VAE branch (the bottom decoder) that reconstructs the input images is applied in the training stage to regularize the shared encoder. Ⓒ2019 Springer Nature. Reprinted, with permission, from [57]

Cascaded Networks

A cascaded network is a series of connected networks such that the input of each downstream network is the output from an upstream network (Fig. 10). For example, a coarse-to-fine segmentation strategy can be used to reduce the high computational cost of training for 3D images [50, 53]. In this scenario, an upstream network could take downsampled images as input to roughly locate the target structures, allowing the images to be cropped to the region of interest for the downstream network. The downstream network could then produce high-quality segmentation in full resolution. Another advantage of this approach is to reduce the impact of volume imbalance between foreground and background classes. However, the upstream network would determine the performance of the whole framework, and some global information is missing in the downstream networks.

Fig. 10

Example of cascaded networks. WNet segments the whole tumor from the input multimodal 3D MRI. Then based upon the segmentation, a bounding box (yellow dash line) can be obtained and used to crop the input. The TNet takes the cropped image to segment the tumor core. Similarly, the ENet segments the enhancing tumor core by taking the cropped images determined by the segmentation from the previous stage. Ⓒ2018 Springer Nature. Reprinted, with permission, from [50]

Ensemble Networks

To obtain a robust segmentation, a popular approach is to aggregate the outputs from multiple independent networks (i.e., networks that share no weights/parameters). Kamnitsas et al. proposed the ensemble of multiple models and architectures (EMMA) [83] for brain tumor segmentation. Kao et al. [84] produce segmentations using an ensemble of 26 neural networks. Zhao et al. [85] proposed a framework for 3D segmentation with multiple 2D networks that take inputs from different views. Huo et al. [82]