1 Introduction

Deep learning has dominated the field of computer vision since 2012 [1], taking advantage of huge improvements in data storage and in the computing power of modern processors. Currently, most advanced computer vision methods are based on deep learning, and medical image analysis is an important research direction in this context. A key advantage of deep learning networks is their ability to extract features automatically [2], so researchers can describe medical images without constructing complex hand-crafted features. In deep learning-based medical image analysis, end-to-end network training shows significant advantages. Moreover, medical image analysis has huge practical demand and market space. It can therefore be reasonably predicted that deep learning-based medical image analysis has great research potential in the near future.

Most current artificial intelligence (AI) methods and applications belong to the category of supervised learning [3], which means that medical image data must be labeled. This is very difficult and costly to achieve in practice. On the one hand, each medical image in principle implies a patient behind it, so the amount of available medical image data is very limited. On the other hand, labeling medical images requires highly specialized medical staff and plenty of time. For example, to train a deep convolutional neural network (CNN) for tumor segmentation, a specialized physician must mark all tumor pixels in the training images. These problems greatly restrict the development of automated, intelligent medical image analysis tools. The generative adversarial network (GAN) has the potential to provide efficient solutions to them.

GAN was proposed in 2014 [4] with the original intention of imitating real data. A GAN consists of two subnetworks: a generator (\(G\)) and a discriminator (\(D\)). During training, \(G\) generates data with a given (expected) distribution, whereas \(D\) determines whether data are real or fake. The two are trained alternately and improve together [5]. Eventually, a \(G\) is obtained that can generate data close to the real data distribution, which is the ultimate goal of the method. Applied to medical imaging, GAN can thus expand datasets with insufficient amounts of medical image data, so that deep learning methods can be trained on the expanded datasets. Another very useful feature of GAN for medical image analysis is its adversarial training strategy, which can be applied to image segmentation, detection, or classification.

Compared with other medical image analysis techniques [6], GAN is still in its infancy and the number of related works in the literature is relatively small, but its potential is huge. The application of GAN to medical images began in 2016, when only one article on the topic was published [7]. Since 2017 there have been more relevant studies, so the articles about GAN in medical imaging published in the past five years are analyzed and summarized here in terms of application direction, methods, and other aspects. The rest of this article is organized as follows (Fig. 1). The second section describes GAN methods commonly applied in the medical image field, focusing on their technical characteristics. The third section addresses the main application of GAN in this context, namely medical image synthesis, and proposes a classification according to the different conditions of generation. The fourth section analyzes the application of GAN to medical image enhancement. The fifth section discusses GAN as a semi-supervised learning method, which mainly operates through feature and annotation sharing. The sixth section describes functions of GAN that can be extended to other medical tasks. The seventh section discusses technical and non-technical challenges and directions. Finally, the conclusions are summarized in the eighth section.

Fig. 1 Main content of this paper

2 GAN technology

This section starts with the original GAN and then covers the evolution of GAN for image generation. The methods considered are also frequently used in the specific field of medical images. The emphasis is on the overall architecture, data flow, and objective function of each GAN, not on the network details of specific generators or discriminators.

2.1 Original GAN

The operation of the original GAN is shown in Fig. 2a, where \(G\) and \(D\) denote neural networks. The input of \(G\) is a random noise vector \(z\) sampled from a distribution \(p(z)\); for consistency and convenience of training, \(p(z)\) is usually chosen as a Gaussian or a uniform distribution. It should be noted that \(z\) is a low-dimensional vector, whereas images in actual applications are high-dimensional data, so \(G\) learns a mapping from a low-dimensional noise space to the high-dimensional real data space. The inputs to \(D\) are the generated fake data \(G(z)\) and real sample data \(X\). \(D\) is a classifier whose purpose is to judge whether data are real or fake. The purpose of \(G\) is to produce data as close to the real ones as possible, confusing \(D\) so that it cannot distinguish which are real and which are fake. In this way, \(G\) and \(D\) take part in a dynamic game, improving each other during training. Data generated by \(G\) become more and more realistic, and the accuracy of \(D\) gradually decreases from an initial value equal (or close) to 1 (perfect discrimination of real and fake data) to, optimally, 0.5 (fake data cannot be distinguished from real ones). The optimization functions for \(D\) and \(G\) are as follows:

$$ \begin{aligned} L_{D} &= \max_{D} \, E_{x \sim p(x)} \left[ \log D(x) \right] + E_{z \sim p(z)} \left[ \log (1 - D(G(z))) \right], \\ L_{G} &= \min_{G} \, E_{z \sim p(z)} \left[ \log (1 - D(G(z))) \right] \end{aligned} $$
(1)
where \(L_{D}\) and \(L_{G}\) denote the loss functions of \(D\) and \(G\), respectively. Since \(X\) are real data, \(D(x)\) should be close to 1, whereas \(D(G(z))\) gradually decreases during training; the optimization process therefore consists in maximizing \(L_{D}\) and minimizing \(L_{G}\).

Fig. 2 Four common GAN structures
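To make the alternating optimization of Eq. (1) concrete, the following is a minimal PyTorch sketch of one training step. The MLP architectures, dimensions, and hyperparameters are illustrative assumptions, not taken from any cited work.

```python
# Minimal sketch of the alternating optimization in Eq. (1).
import torch
import torch.nn as nn

z_dim, x_dim = 64, 784
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                  nn.Linear(256, x_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(x_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(x_real):
    batch = x_real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # D step: maximize log D(x) + log(1 - D(G(z)))
    # (minimizing the BCE below is equivalent)
    x_fake = G(torch.randn(batch, z_dim)).detach()  # freeze G while updating D
    loss_D = bce(D(x_real), ones) + bce(D(x_fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # G step: Eq. (1) minimizes log(1 - D(G(z))); the common
    # "non-saturating" variant used here maximizes log D(G(z)) instead,
    # which gives stronger gradients early in training
    loss_G = bce(D(G(torch.randn(batch, z_dim))), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```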

2.2 DCGAN

The original GAN is not well suited to generating images, because its \(G\) and \(D\) are ordinary fully connected networks: image data distributions are very complex and high-dimensional, which such networks do not handle easily. CNNs are more image-friendly than fully connected networks, and the deep convolutional generative adversarial network (DCGAN) successfully combined CNNs with GAN, resulting in a more suitable solution for image generation [8]. DCGAN also adopts the structure shown in Fig. 2a, except that \(G\) and \(D\) are both CNNs. GAN suffers from mode collapse: the training process is not stable, and generated images may belong to only a few fixed categories, or some strange images may appear. DCGAN proposes a series of techniques to stabilize training. \(G\) and \(D\) are fully convolutional networks (FCNs, i.e., CNNs without fully connected layers), using strided convolutions instead of pooling layers for down-sampling. Batch normalization [9], a normalization layer that can be embedded in the network to accelerate learning and convergence, is applied throughout \(G\) and \(D\), except in the output layer of \(G\) and the input layer of \(D\). The activation functions of \(D\) are also changed with respect to GAN: GAN uses the ReLU activation function [10] (Fig. 3a) in both \(G\) and \(D\), whereas DCGAN uses ReLU in \(G\) and LeakyReLU [11] (Fig. 3b) in \(D\), to prevent gradient sparsity. In addition, the output layer of \(G\) uses the tanh activation function.
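The sketch below assembles these guidelines into DCGAN-style building blocks. The layer sizes (a 64x64, 3-channel output) are illustrative assumptions, not the exact configuration of [8].

```python
# DCGAN-style blocks: strided (transposed) convolutions instead of
# pooling, batch norm except on G's output and D's input layers,
# ReLU in G, LeakyReLU in D, tanh on G's output.
import torch.nn as nn

def G_block(cin, cout):
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, 2, 1, bias=False),
                         nn.BatchNorm2d(cout), nn.ReLU(True))

generator = nn.Sequential(
    nn.ConvTranspose2d(100, 512, 4, 1, 0, bias=False),  # z: (N,100,1,1) -> 4x4
    nn.BatchNorm2d(512), nn.ReLU(True),
    G_block(512, 256),                                  # 8x8
    G_block(256, 128),                                  # 16x16
    G_block(128, 64),                                   # 32x32
    nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh())      # 64x64, tanh output

def D_block(cin, cout, bn=True):
    layers = [nn.Conv2d(cin, cout, 4, 2, 1, bias=False)]
    if bn:
        layers.append(nn.BatchNorm2d(cout))
    layers.append(nn.LeakyReLU(0.2, True))
    return nn.Sequential(*layers)

discriminator = nn.Sequential(
    D_block(3, 64, bn=False),      # no batch norm on D's input layer
    D_block(64, 128),
    D_block(128, 256),
    D_block(256, 512),
    nn.Conv2d(512, 1, 4, 1, 0), nn.Flatten(), nn.Sigmoid())
```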

Fig. 3 ReLU and Leaky ReLU activation functions

2.3 CGAN

GAN uses a random noise vector of very low dimension to generate high-dimensional image data. This modeling approach has too many degrees of freedom: if the noise vector has only hundreds of dimensions but the generated image has thousands of pixels, controllability is very poor. The conditional generative adversarial network (CGAN, Fig. 2b) increases controllability by adding a constraint \(c\) to the data [12]; \(c\) is part of the input of both \(G\) and \(D\) and guides data generation. The objective function of CGAN is

$$ \begin{aligned} L_{D} &= \max_{D} \, E_{x \sim p(x)} \left[ \log D(x|c) \right] + E_{z \sim p(z)} \left[ \log (1 - D(G(z|c))) \right], \\ L_{G} &= \min_{G} \, E_{z \sim p(z)} \left[ \log (1 - D(G(z|c))) \right] \end{aligned} $$
(2)

where \(c\) can be a label, tags, data from a different modality, or even an image. For example, the prior condition (see Sect. 3.3) used by Pix2pix [13] is a segmentation image or a contour image, with which Pix2pix completes the transformation from image to image. When the prior condition is an image, a loss between the conditional and generated images is usually added, so that the generated image has higher authenticity [13]. InfoGAN [14] can also be viewed as a special kind of CGAN. Different from CGAN, it adds constraints to the random noise \(z\) and uses a regularization term based on mutual information. As the input of the network, \(z\) controls image generation; for instance, on the MNIST dataset [15], \(z\) controls the thickness, slope, and other characteristics of the generated digits.
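The simplest way the condition enters both networks is by concatenation, as in the following minimal sketch for a class-label condition; the embedding and all dimensions are illustrative assumptions.

```python
# CGAN conditioning by concatenation: c joins z at G's input and joins
# the (flattened) image at D's input, implementing G(z|c) and D(x|c).
import torch
import torch.nn as nn

n_classes, z_dim, x_dim = 10, 64, 784
embed = nn.Embedding(n_classes, n_classes)  # label -> dense condition vector

G = nn.Sequential(nn.Linear(z_dim + n_classes, 256), nn.ReLU(),
                  nn.Linear(256, x_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(x_dim + n_classes, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())

z = torch.randn(8, z_dim)
c = torch.randint(0, n_classes, (8,))
x_fake = G(torch.cat([z, embed(c)], dim=1))       # G(z|c)
score  = D(torch.cat([x_fake, embed(c)], dim=1))  # D(G(z|c))
```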

2.4 CycleGAN

Pix2pix requires paired images, one of them annotated, which takes a lot of time and implies a high cost. In contrast, CycleGAN [16] proposes a closed-loop network consisting of two generators and two discriminators (Fig. 2c), which performs the conversion between two image domains without the need for paired images. Because of the two generators and two discriminators, the overall structure and data flow are more complex than in the previous methods. \(G_{B}\) and \(G_{A}\) perform the transformations from domain A to domain B and from domain B to domain A, respectively, so they are equivalent to two reciprocal mappings. \(G_{B}\) generates images \(X_{fB}\) with domain B characteristics from images \(X_{A}\) of domain A, whereas \(G_{A}\) generates images \(X_{fA}\) with domain A characteristics from images \(X_{B}\) of domain B. The discriminators \(D_{A}\) and \(D_{B}\) identify images of domains A and B, respectively. The objective function of CycleGAN can be written as:

$$ \begin{aligned} L(G_{A}, G_{B}, D_{A}, D_{B}) ={}& L_{GAN}(G_{A}, G_{B}, X_{A}, X_{B}) \\ &+ L_{GAN}(G_{B}, G_{A}, X_{B}, X_{A}) + \lambda L_{cyc}(G_{A}, G_{B}) \end{aligned} $$
(3)

where \(L_{GAN}\) is the regular generator loss described by Eq. (1). Real data should return to their original domain after a loop, so \(L_{cyc}\) represents the loss between real data and their cyclic reconstructions, and \(\lambda\) is a coefficient that balances the generator loss and the cycle loss:

$$ \begin{aligned} L_{cyc}(G_{A}, G_{B}) ={}& E_{X_{A} \sim A} \left[ \left\| G_{A}(G_{B}(X_{A})) - X_{A} \right\|_{1} \right] \\ &+ E_{X_{B} \sim B} \left[ \left\| G_{B}(G_{A}(X_{B})) - X_{B} \right\|_{1} \right] \end{aligned} $$
(4)
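Equation (4) translates directly into code. In this minimal sketch the generator arguments are placeholders for trained networks, and the default weight `lam` is an illustrative assumption.

```python
# Cycle-consistency loss of Eq. (4): G_B maps A -> B, G_A maps B -> A,
# and each real image must survive a round trip through both generators.
import torch

def cycle_loss(G_A, G_B, x_A, x_B, lam=10.0):
    # ||G_A(G_B(x_A)) - x_A||_1 : A -> fake B -> reconstructed A
    loss_A = torch.mean(torch.abs(G_A(G_B(x_A)) - x_A))
    # ||G_B(G_A(x_B)) - x_B||_1 : B -> fake A -> reconstructed B
    loss_B = torch.mean(torch.abs(G_B(G_A(x_B)) - x_B))
    return lam * (loss_A + loss_B)   # the lambda-weighted term of Eq. (3)
```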

Since GAN training easily becomes unbalanced, the two generators and two discriminators in CycleGAN need to be carefully balanced during training. Using paired images is equivalent to a form of feature filtering, with which a GAN can easily learn which parts of the images need to be converted; without paired images, as in CycleGAN, the training process instead requires huge amounts of data.

2.5 LAPGAN

Humans usually paint a picture with multiple strokes, so machines may also create images in multiple steps. That is where the idea of LAPGAN [17] comes from: there is no need to complete the whole generation task at once; instead, a full image is generated in several steps. Figure 2d shows a three-stage LAPGAN, with red arrows representing down-sampling and blue arrows representing up-sampling. The three down-sampling processes can be regarded as a three-level Laplacian pyramid, and an independent conditional GAN model is trained at each level. Using the multi-scale structure of natural images, a series of generative models is constructed, each capturing the image structure at a specific scale of the pyramid. Training is carried out from left to right. The original image \(X_{r1}\) is transformed into \(X_{r1}^{\prime}\) through down-sampling, and \(X_{r1}^{\prime}\) becomes \(X_{r1}^{\prime\prime}\) through up-sampling. A residual image is then obtained by comparing \(X_{r1}\) with \(X_{r1}^{\prime\prime}\). \(G_{1}\) takes a noise signal \(z_{1}\) as input and \(X_{r1}^{\prime\prime}\) as the condition to generate this residual image. Training at the remaining levels is similar. The LAPGAN test process, shown in Fig. 2e, runs from right to left. It is important to note that the target of each \(G\) is a residual image, so there is a summation step. Serialization and the use of residual images are the two LAPGAN characteristics that effectively reduce the content and difficulty that each GAN needs to learn.
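The residual image that each LAPGAN generator must predict can be sketched in a few lines. Average-pool down-sampling and nearest-neighbor up-sampling are illustrative choices here; the actual filters in [17] may differ.

```python
# One level of the Laplacian pyramid used by LAPGAN: the residual
# between an image and its down-then-up-sampled version is what the
# conditional generator at that level learns to produce.
import torch
import torch.nn.functional as F

def pyramid_level(x):                        # x: (N, C, H, W), H and W even
    down = F.avg_pool2d(x, 2)                # X'  : down-sampled image
    up = F.interpolate(down, scale_factor=2, mode='nearest')  # X''
    residual = x - up                        # target of G at this level
    return down, up, residual

# At test time the flow is inverted: x ≈ up + G(z, condition=up),
# summing the generated residual back onto the up-sampled image.
```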

3 Medical image synthesis

The most successful application of GAN in medical image analysis to date is medical image synthesis, which can alleviate the problems of insufficient medical images or imbalanced data categories [18, 19]. Traditional data augmentation techniques include image cropping, flipping, and mirroring, among others. These techniques can only change data in direction or size; no new data are generated, whereas GAN can generate completely new data. In this section, unconditional synthesis, domain transformation, and other conditional synthesis methods are described according to the different generation conditions of medical images. Figure 4 shows some examples of these applications.

Fig. 4 Medical image synthesis examples: unconditional synthesis of brain magnetic resonance images (MRI) [20] and of skin lesions [51], and conditional synthesis in which the generating condition is a real image and the GAN adds disease features to healthy images to generate realistic chest X-ray (CXR) images with different diseases (Lejmer et al. [59])

4 Medical image enhancement

4.1 Super-resolution

Chen et al. [60] proposed a multistage densely connected network with GAN for 3D brain MRI super-resolution reconstruction, in which all the generator's convolutional layers are densely connected; its main advantage is high speed. Irina et al. [61] used a combination of a least-squares adversarial loss and an image gradient loss as the generator loss function for GAN-based 3D super-resolution reconstruction, improving the quality of the generated images.

4.2 Denoising

Noise in medical images seriously affects the diagnostic accuracy of doctors, a problem that can be alleviated by the denoising capabilities of GAN. In CT imaging, since high doses can harm the patient's health, the past decade has seen a trend towards dose reduction, at the expense of noise appearing in the low-dose images. Yang et al. [62] proposed a GAN with Wasserstein distance and perceptual similarity, which suppresses noise by comparing the perceptual features of the denoised output against those of the ground truth in a given feature space. Wolterink et al. [63] compared three training strategies, namely voxel loss, combined voxelwise and adversarial loss, and adversarial loss alone. Choi et al. [64] considered the statistical characteristics of CT images and introduced a loss function that incorporates the noise properties in the image domain derived from the noise statistics in the sinogram domain.
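The perceptual-similarity idea of [62] can be sketched as follows: compare the denoised output and the ground truth in a frozen pretrained feature space rather than in pixel space. Using a VGG16 slice as that feature space, and the channel-repetition trick for single-channel CT slices, are assumptions made here for illustration.

```python
# Perceptual loss sketch: distance between feature maps of a frozen
# pretrained network, rather than between raw pixels.
import torch
import torch.nn as nn
from torchvision.models import vgg16

feat = vgg16(weights='DEFAULT').features[:16].eval()  # frozen feature extractor
for p in feat.parameters():
    p.requires_grad_(False)

def perceptual_loss(denoised, clean):
    # CT slices are single-channel; repeat to the 3 channels VGG expects
    d3 = denoised.repeat(1, 3, 1, 1)
    c3 = clean.repeat(1, 3, 1, 1)
    return nn.functional.mse_loss(feat(d3), feat(c3))
```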

In addition to lower doses, the operating equipment (e.g., portable devices) may also introduce noise. Zhou et al. [65] constructed a two-stage GAN to improve the quality of ultrasound images and reduce noise. In the training process, a transfer learning method based on plane wave imaging (PWI) data was introduced to facilitate convergence and eliminate the influence of deformation caused by respiratory activity. Chen et al. [66] proposed an unsupervised learning framework for high-quality pixel-level smoke detection and removal, in which the detection network is regarded as prior knowledge and a loss function is used to support the training of the smoke removal network.

4.3 Reconstruction

MRI is a widely used clinical imaging method, but one of its main disadvantages is the long acquisition time. During MRI imaging, data samples are not collected directly in image space but in k-space, which contains spatial frequency information acquired row by row. Slow acquisition makes images prone to interference that may reduce quality, for instance from patient movements such as heart beats or breathing. Compressed sensing-based imaging accelerates MRI acquisition by reconstructing images from a small part of k-space: in theory, provided the original data are compressible, reconstruction can be performed through nonlinear optimization of randomly under-sampled data. GAN-based MRI reconstruction builds on this theory and can be summarized as follows. The generator consists of multiple end-to-end networks: the first converts a zero-filled reconstructed image into a complete reconstructed image, and the following refinement network improves the accuracy of the reconstructed image. A discriminator network then assesses whether or not the reconstruction is accurate. The works reported in [67, 68, 69] are all based on this framework, differing in the loss functions used. To improve the perceived quality of reconstruction, [67] designs a content loss for generator training that includes three parts: pixel mean squared error, frequency-domain mean squared error, and VGG loss. Feature matching loss and a penalty term are added in [68]. The work in [69] adds a cycle loss, which cyclically combines the under-sampled and the fully reconstructed images.
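The zero-filled reconstruction that serves as the generator's input in this framework can be simulated with a Fourier transform and a sampling mask, as in the NumPy sketch below. The random row-wise (1D) under-sampling mask is an illustrative choice; real acquisitions use various sampling patterns.

```python
# Zero-filled reconstruction from under-sampled k-space: mask out
# unsampled rows, inverse-transform, and obtain an aliased image that
# the GAN refinement networks then improve.
import numpy as np

def zero_fill_recon(image, keep_fraction=0.25, seed=0):
    rng = np.random.default_rng(seed)
    k = np.fft.fftshift(np.fft.fft2(image))             # full k-space
    mask = rng.random(image.shape[0]) < keep_fraction   # sample k-space rows
    k_under = k * mask[:, None]                         # zero unsampled rows
    return np.abs(np.fft.ifft2(np.fft.ifftshift(k_under)))

# The discriminator's job in this framework is to judge whether the
# refined output of such an aliased input looks like a fully sampled scan.
```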

4.4 Registration

To obtain accurate pathological information during medical diagnosis, a set of images is taken of the same body part, and it is usually necessary to quantitatively analyze several different images at the same time. These images need to be strictly aligned, which is called image registration: a spatial transformation is applied to the images so that corresponding points in them are spatially consistent. In [70], a constrained CNN replaced heuristic smoothness measures of displacement fields: the generator is the registration network, and the discriminator distinguishes the dense displacement field predicted by the generator from motion data simulated with the finite element method. During training, the registration network maximizes the similarity between anatomical labels and minimizes the difference between measured and simulated deformations. The generator in [71] generates the transformation parameters between the fixed and moving images; different from [70], the discriminator does not assess the transformation parameters but determines whether or not the transformed moving image has completed registration. The work in [72] used CGAN for multimodal registration; by adding appropriate terms to the image generation loss function, the generated output image retains the same features as the moving image.

5 GAN as a semi-supervised learning method

Christine et al. [85] proposed a GAN to synthesize 3D CT images from X-ray ones, and then used a multi-organ segmentation network for segmentation. Konstantinos et al. [86] proposed a domain-adaptive multi-connected adversarial network, in which different data types are treated as different domains, making the features learnt for segmentation independent of domain-specific factors; good adaptability was shown on two different brain MRI databases. In [87], also from the point of view of using domains to solve segmentation problems with inconsistent data, a network that transfers specific image styles was used: an unannotated color fundus image dataset was converted to the style of an annotated dataset, so that a segmentation network trained on the annotated dataset could be used to segment the unannotated images.

6 Function expansion of GAN

6.1 Extended generator and discriminator

The adversarial learning process of the generator and the discriminator produces a large number of high-level semantic features that can be transferred to other tasks: the generator is not limited to image synthesis, nor is the discriminator limited to classifying real and fake images.

Das et al. [88] proposed a generalizable classifier that uses adversarial learning between generator and discriminator to predict progressive retinal diseases such as age-related macular degeneration and diabetic macular edema. Gu et al. [89] proposed a transfer recurrent feature learning framework for probe-based confocal laser endomicroscopy (pCLE) video classification. In a first phase, discriminator features of pCLE images are learnt with a GAN; in a second phase, these discriminator features are fed to a recurrent neural network (RNN) that distinguishes between true and false data and predicts the lesion grade. In these works, the discriminator is essentially expanded into a multiclass classifier.
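The general pattern of reusing a trained discriminator as a feature extractor for a downstream classifier can be sketched as follows. The network layout, the `D_trunk` split point, and the four-class head are hypothetical choices for illustration, not the architecture of [88] or [89].

```python
# Reusing a GAN discriminator's trunk as a feature extractor:
# drop the real/fake head and attach a new multiclass head.
import torch.nn as nn

D = nn.Sequential(
    nn.Conv2d(1, 32, 4, 2, 1), nn.LeakyReLU(0.2),   # 64x64 -> 32x32
    nn.Conv2d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2),  # 32x32 -> 16x16
    nn.Flatten(),
    nn.Linear(64 * 16 * 16, 1))                     # real/fake head

# After adversarial training, keep everything except the final head
D_trunk = nn.Sequential(*list(D.children())[:-1])

# New head turns the learnt features into, e.g., 4 lesion grades
classifier = nn.Sequential(D_trunk, nn.Linear(64 * 16 * 16, 4))
```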

Some researchers suggested using generators to segment images [87].

7.2 Non-technical challenges and directions

1) Privacy. The collection of medical images for scientific research requires patient consent. It is not clear whether generated images, or datasets built from them, should be considered original data or new data, and therefore whether they should be subject to patient consent. The legal status of such new data is also uncertain. Some applications of GAN, such as domain transformation, may even expose more of a patient's personal privacy than the original images. Therefore, when applying a new technology, not only its feasibility but also ethics and law must be considered.

2) Image confidence. In the field of medical imaging, the interpretation of an image may affect the life of the patient, so many technologies that perform well in other areas for similar purposes may not be applicable in medicine. Sometimes even a normal medical image is not given enough trust by doctors, and multi-level checks are still needed. In this context, there is currently little reason for doctors to trust images generated by GAN. Cohen et al. [107] questioned GAN-generated medical images, which may lead to misjudging the medical condition of patients: they trained a CycleGAN to convert normal brain MRI images into brain MRI images with tumors, and the images generated by their network were visually realistic but contained no tumors. There are many possible reasons for this; for instance, the generalization performance of a well-trained model may be poor, or the transformation between some data domains cannot be carried out accurately. Attention should be paid to this issue, but it does not mean that all GANs will lead to misdiagnosis.

3) Datasets. Although many datasets are publicly available, most of them were created not for GAN but for other medical tasks. The quality of existing medical datasets is uneven, and some are old and scattered. For some tasks, such as the transformation between MRI and CT images, it is difficult to find relevant images at a sufficient scale, and most researchers collect them themselves through hospitals.

8 Conclusion

Focusing on GAN for medical imaging, this paper has summarized commonly used GAN methods, medical image synthesis, and the role of adversarial learning in other medical image tasks, reviewing the relevant papers published in the last five years. The challenges of datasets, training methods, reliability, and legality have been pointed out, and future directions involving unsupervised learning, breakthroughs in clinical needs, and GANs better suited to medical imaging have been discussed. In general, existing medical image synthesis technology has high reliability, and the combination of GAN with other medical image models also produces good results. It can be concluded that GAN has great potential and development prospects in medical imaging. Indeed, the overall trend of artificial intelligence is towards unsupervised (deep) learning.