
1 Introduction

Single image super-resolution (SISR) is an active topic in image restoration. It is an inverse problem that recovers a high-resolution (HR) image from a low-resolution (LR) image via super-resolution (SR) algorithms. Traditional SR algorithms are inferior to deep learning based SR algorithms in speed and in distortion measures such as peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM). In addition, SR algorithms based on deep learning can also achieve excellent visual quality [2,3,4,5,6,7,8].

Deep learning based SR algorithms can be divided into two categories. The first is built upon convolutional neural networks with the classic L1 or L2 loss in pixel space as the optimization objective; these methods attain higher PSNR but produce over-smoothed results because they lack high-frequency texture information. Representative approaches are SRResNet [5] and EDSR [7]. The second is based on generative adversarial networks (GANs), e.g., SRGAN [5] and EnhanceNet [9], which introduce a perceptual loss into the optimization objective. These algorithms can restore more details and improve visual quality at the expense of objective evaluation indices. Different quality assessment methods suit different application scenarios: medical imaging may emphasize objective evaluation metrics, while subjective visual perception may matter more for natural images. Therefore, we need to strike a balance between objective evaluation criteria and subjective visual effects.

Blau et al. [6] proposed a cascaded pyramid structure with two branches, one for feature extraction and the other for image reconstruction. Moreover, the Charbonnier loss was applied at multiple levels, so that a sub-band residual image can be generated at each level. Tong et al. [15] introduced dense blocks that combine low-level and high-level features to improve performance effectively. Lim et al. [7] removed the Batch Normalization layers in residual blocks (ResBlocks) and adopted a residual scaling factor to stabilize network training; they also proposed a multi-scale SR algorithm within a single network. However, when the scaling factor is equal to or larger than \(4\times \), the results obtained by the aforementioned methods mostly look smooth and lack high-frequency details. The reason is that their optimization targets are based on minimizing the L1 or L2 loss in pixel space without considering high-level features.

2.2 Image Super-Resolution Using Generative Adversarial Networks

Super-Resolution with Adversarial Training. Generative adversarial nets (GANs) [16] consist of a Generator and a Discriminator. In the task of super-resolution, e.g., SRGAN [5], the Generator produces SR images and the Discriminator distinguishes whether an image is real or forged. The goal of the Generator is to generate images realistic enough to fool the Discriminator, while the Discriminator aims to distinguish the ground truth from the generated SR images. Thus, the Generator and the Discriminator constitute an adversarial game. With adversarial training, the forged data eventually follow an image statistics distribution similar to that of the real data. Therefore, adversarial learning in SR is important for recovering image textural statistics.

Perceptual Loss for Deep Learning. To accord better with human perception, Johnson et al. [17] introduced a perceptual loss based on high-level features extracted from pre-trained networks, e.g., VGG16 and VGG19, for style transfer and SR. Ledig et al. [5] proposed SRGAN, which aims to make the SR images and the ground truth (GT) similar not only in low-level pixels but also in high-level features; therefore, SRGAN can generate realistic images. Sajjadi et al. proposed EnhanceNet [9], which applied a similar approach and introduced a local texture matching loss, reducing visually unpleasant artifacts. Zhang et al. [18] explained why a perceptual loss based on deep features fits human visual perception well. Mechrez et al. proposed the contextual loss [19, 20], based on the idea of natural image statistics, which currently achieves the best perceptual results among previously published works. Although these algorithms obtain better perceptual image quality and visual performance, they cannot achieve better results in terms of objective evaluation criteria.

2.3 Image Quality Evaluation

There are two ways to evaluate image quality: objective and subjective assessment criteria. Popular objective criteria include PSNR, SSIM, the multi-scale structural similarity index (MS-SSIM), the information fidelity criterion (IFC), the weighted peak signal-to-noise ratio (WPSNR), the noise quality measure (NQM) [21], and so on. Although IFC has the highest correlation with perceptual scores for SR evaluation [21], it is not the best criterion for assessing image quality. Subjective assessment is usually scored by human subjects in previous works [22, 23]. However, there is not yet a suitable objective evaluation that accords with human subjective perception. In the PIRM-SR 2018 challenge [10], a perceptual image quality assessment is proposed that combines the quality measures of Ma [24] and NIQE [25]. The perceptual index is defined as follows,

$$\begin{aligned} Perceptual\ index=\frac{1}{2}((10-Ma)+NIQE) \end{aligned}$$
(1)

Here, a lower perceptual index indicates better perceptual quality.
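To make the criterion concrete, the sketch below is a direct transcription of Eq. (1); it assumes the Ma [24] and NIQE [25] scores have already been computed with their respective reference implementations.

```python
def perceptual_index(ma_score: float, niqe_score: float) -> float:
    """Perceptual index of the PIRM-SR 2018 challenge, Eq. (1).

    Ma's score is higher-is-better and NIQE is lower-is-better,
    so a lower perceptual index means better perceptual quality.
    """
    return 0.5 * ((10.0 - ma_score) + niqe_score)

# Example with hypothetical scores: Ma = 8.2, NIQE = 3.1 -> 0.5 * (1.8 + 3.1) = 2.45
print(perceptual_index(8.2, 3.1))
```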

3 Proposed Methods

We first describe the overall structure of Bi-GANs-ST, then present the two networks MR-SRGAN and WP-SRGAN. The soft-thresholding method used for image fusion is presented in Sect. 3.4.

3.1 Basic Architecture of Bi-GANs-ST

As shown in Fig. 1, our Bi-GANs-ST mainly consists of three parts: (1) memory residual SRGAN (MR-SRGAN), (2) weight perception SRGAN (WP-SRGAN), (3) soft thresholding (ST). The two GANs are used for generating two complementary SR images, and ST fuses the two SR results for balancing the perceptual score and RMSE.

Fig. 1. The architecture of Bi-GANs-ST.

3.2 MR-SRGAN

Network Architecture. As illustrated in Fig. 2, MR-SRGAN is composed of a Generator and a Discriminator. In the Generator, LR images are fed into one Conv layer that extracts shallow features. Four memory residual (MR) blocks are then applied to improve image quality; like MemEDSR [26], they help to form a persistent memory and improve the feature selection ability of the model. Each MR block consists of four ResBlocks and a gate unit: the ResBlocks produce four groups of features, the gate unit extracts a certain amount of features from them, and the input features are added to the extracted features to form the output of the MR block. In the ResBlocks, all activation layers use the parametric rectified linear unit (PReLU), and all Batch Normalization (BN) layers are discarded in the generator network to reduce computational complexity. Finally, we restore the original image size by two upsampling operations. In Figs. 2 and 3, n denotes the number of feature maps and s denotes the stride of each convolutional layer. For the Discriminator, we use the same setting as SRGAN [5].

Fig. 2. The architecture of MR-SRGAN.
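To illustrate the structure described above, here is a minimal PyTorch sketch of a ResBlock and an MR block. The overall layout (four ResBlocks feeding a gate unit, with the block input added to the gated features) follows the text; the gate unit is assumed to be a 1×1 convolution over the concatenated features, in the spirit of MemEDSR [26], since the exact form is not spelled out.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block without BN, PReLU activation (as described for MR-SRGAN)."""
    def __init__(self, n_feats=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(n_feats, n_feats, 3, stride=1, padding=1),
            nn.PReLU(),
            nn.Conv2d(n_feats, n_feats, 3, stride=1, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class MRBlock(nn.Module):
    """Memory residual block: four ResBlocks plus a gate unit (assumed 1x1 conv)."""
    def __init__(self, n_feats=64, n_resblocks=4):
        super().__init__()
        self.resblocks = nn.ModuleList(ResBlock(n_feats) for _ in range(n_resblocks))
        self.gate = nn.Conv2d(n_feats * n_resblocks, n_feats, kernel_size=1)

    def forward(self, x):
        feats, out = [], x
        for rb in self.resblocks:
            out = rb(out)
            feats.append(out)                        # four groups of features
        gated = self.gate(torch.cat(feats, dim=1))   # gate unit selects features
        return x + gated                             # input added to the extracted features
```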

Loss Function. The total generator loss consists of three parts: a pixel-wise loss, an adversarial loss, and a perceptual loss. The formulas are as follows,

$$\begin{aligned} L_{total}=L_{pixel}+\lambda _1 L_{adv}+\lambda _2 L_{vgg} \end{aligned}$$
(2)
$$\begin{aligned} L_{pixel}=\frac{1}{N}\sum _{i=1}^{N}\left| x_{t}^{i}-G(x_{l}^{i}) \right| ^{2} \end{aligned}$$
(3)
$$\begin{aligned} L_{adv}=-(D(G(x_{l}))) \end{aligned}$$
(4)
$$\begin{aligned} L_{vgg}=\frac{1}{N}\sum _{i=1}^{N}\left| \phi (x_{t}^{i}) - \phi (G(x_{l}^{i}))\right| ^{2} \end{aligned}$$
(5)

where \(L_{pixel}\) is the pixel-wise MSE loss between the generated images and the ground truth, \(L_{vgg}\) is the perceptual loss, computed as the MSE between features extracted from the pre-trained VGG16 network, and \(L_{adv}\) is the adversarial loss for the Generator, in which we remove the logarithm. \(\lambda _1\) and \(\lambda _2\) are the weights of the adversarial loss and the perceptual loss. \(x_{t}\) and \(x_{l}\) denote the ground truth and the LR images, respectively. \(G(x_{l})\) denotes the SR images forged by the Generator, N is the number of training samples, and \(\phi \) denotes the features extracted from the pre-trained VGG16 network.
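As an illustration, a possible PyTorch sketch of the total generator loss of Eqs. (2)-(5) is given below. The loss weights default to the values reported in Sect. 4.1; the choice of VGG16 feature depth (the first 22 layers of `vgg16().features`) is an assumption, since the text only states that features come from a pre-trained VGG16.

```python
import torch.nn as nn
from torchvision.models import vgg16

class GeneratorLoss(nn.Module):
    """Total generator loss of Eq. (2): L_pixel + lambda1 * L_adv + lambda2 * L_vgg."""
    def __init__(self, lambda_adv=1e-3, lambda_vgg=6e-3, vgg_layer=22):
        super().__init__()
        features = vgg16(pretrained=True).features[:vgg_layer].eval()
        for p in features.parameters():
            p.requires_grad = False          # the feature extractor stays fixed
        self.vgg = features
        self.mse = nn.MSELoss()
        self.lambda_adv = lambda_adv
        self.lambda_vgg = lambda_vgg

    def forward(self, sr, hr, d_sr):
        # sr = G(x_l), hr = x_t, d_sr = D(G(x_l))
        l_pixel = self.mse(sr, hr)                      # Eq. (3)
        l_adv = -d_sr.mean()                            # Eq. (4), logarithm removed
        l_vgg = self.mse(self.vgg(sr), self.vgg(hr))    # Eq. (5)
        return l_pixel + self.lambda_adv * l_adv + self.lambda_vgg * l_vgg
```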

3.3 WP-SRGAN

Network Architecture. In WP-SRGAN, we use 16 ResBlocks in the generator network, which is depicted in Fig. 3. Each ResBlock consists of a convolutional layer, a PReLU activation layer, and another convolutional layer, and the Batch Normalization (BN) layers are removed from both the Generator and the Discriminator. The Discriminator of WP-SRGAN has the same architecture as that of MR-SRGAN except for the removed BN layers.

Fig. 3. The generator architecture of WP-SRGAN. The red box shows the generator loss used in the first training stage; the orange box shows the generator loss used in the second training stage.

Loss Function. As shown in Fig. 3, a two-stage biased adversarial training mechanism is adopted in WP-SRGAN by using different Generator losses. In the first stage (red box), we optimize a Generator loss composed of the pixel-wise loss and the adversarial loss to obtain better objective performance (i.e., a lower RMSE). In the second stage (orange box), we take the network parameters from the first stage as the pre-trained model and replace the aforementioned generator loss with the perceptual loss and the adversarial loss, in order to improve the subjective visual quality (i.e., lower the perceptual index). The two-stage losses are given in Eqs. (6) and (7).

$$\begin{aligned} L_{1}=L_{pixel}+\lambda _1 L_{adv} \end{aligned}$$
(6)
$$\begin{aligned} L_{2}=\lambda _1 L_{adv}+\lambda _2 L_{vgg} \end{aligned}$$
(7)

Here, the pixel-wise loss is defined as in Eq. (3), the perceptual loss is the MSE between features extracted from the pre-trained VGG19 network, and the adversarial loss is denoted as follows,

$$\begin{aligned} L_{adv}=-log(D(G(x_{l}))) \end{aligned}$$
(8)

By adopting the two-stage adversarial training mechanism, the generated SR images become similar to the corresponding ground truth in the high-level feature space.
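The stage switch can be written compactly. The sketch below is one way to express Eqs. (6)-(8); it assumes `vgg` is a frozen pre-trained VGG19 feature extractor and `d_sr` is the Discriminator's sigmoid output for the generated images.

```python
import torch
import torch.nn as nn

def wp_srgan_generator_loss(stage, sr, hr, d_sr, vgg,
                            lambda_adv=1e-3, lambda_vgg=6e-3):
    """Stage-dependent generator loss of WP-SRGAN.

    Stage 1 (Eq. 6): pixel-wise + adversarial loss, favouring a lower RMSE.
    Stage 2 (Eq. 7): perceptual + adversarial loss, favouring a lower perceptual index.
    """
    mse = nn.MSELoss()
    l_adv = -torch.log(d_sr + 1e-8).mean()            # Eq. (8), small epsilon for stability
    if stage == 1:
        return mse(sr, hr) + lambda_adv * l_adv       # Eq. (6)
    return lambda_adv * l_adv + lambda_vgg * mse(vgg(sr), vgg(hr))  # Eq. (7)
```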

3.4 Soft-Thresholding

We can obtain different SR results from the two GANs described above: MR-SRGAN emphasizes improving the objective performance, while WP-SRGAN produces results that favor better subjective perception. To balance the perceptual score and RMSE of the SR results, the soft-thresholding method proposed by Deng et al. [27] is adopted to fuse the two SR images (i.e., the outputs of MR-SRGAN and WP-SRGAN), which can be regarded as a form of pixel interpolation. The formulas are as follows,

$$\begin{aligned} I_{e}=I_{G}+soft(\Delta ,\xi ) \end{aligned}$$
(9)
$$\begin{aligned} soft(\Delta ,\xi )=sign(\Delta )\cdot max(\left| \Delta \right| -\xi ,0) \end{aligned}$$
(10)

where \(I_{e}\) is the fused image, \(\Delta =I_{G}-I_{g}\), \(I_{G}\) is the image generated by WP-SRGAN, whose perceptual score is lower, and \(I_{g}\) is the image generated by MR-SRGAN, whose RMSE value is lower. \(\xi \) is the adjustable threshold, which is discussed in Sect. 4.2.
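For concreteness, a minimal NumPy transcription of Eqs. (9) and (10) might look as follows. It assumes both SR results are float arrays on the same intensity scale; the scale is not stated here, so the images should be normalised consistently before fusing.

```python
import numpy as np

def soft(delta, xi):
    """Soft-thresholding operator of Eq. (10)."""
    return np.sign(delta) * np.maximum(np.abs(delta) - xi, 0.0)

def fuse(i_G, i_g, xi=0.73):
    """Fuse the two SR results following Eq. (9).

    i_G: output of WP-SRGAN (lower perceptual score),
    i_g: output of MR-SRGAN (lower RMSE).
    """
    delta = i_G - i_g               # Delta as defined after Eq. (10)
    return i_G + soft(delta, xi)    # Eq. (9)
```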

4 Experimental Results

In this section, we conduct extensive experiments for scaling factor \(4\times \) image SR on five publicly available benchmarks: Set5 [28], Set14 [29], BSD100 [30], Urban100 [31], and Manga109 [32]. The first three datasets, Set5, Set14, and BSD100, mainly contain natural images; Urban100 consists of 100 urban images; and Manga109 contains Japanese manga images with fewer texture features. We then compare the performance of our proposed Bi-GANs-ST algorithm with state-of-the-art SR algorithms in terms of objective criteria and subjective visual perception.

4.1 Implementation and Training Details

We train our networks using the RAISE dataset, which consists of 8156 HR RAW images. The HR images are downsampled by bicubic interpolation with scaling factor \(4\times \) to obtain the LR images. To analyze our models' capacity, we evaluate them on the PIRM-SR 2018 self validation dataset [10]. In MR-SRGAN, the convolutional filter size is \(3\times 3\). The learning rate is initialized to \(1e-4\) and the Adam optimizer with momentum 0.9 is used. The network is trained for 600 epochs, and we choose the best results according to the SSIM metric.
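For reference, generating the LR inputs by bicubic \(4\times \) downsampling, as described above, takes only a few lines with Pillow; the function name and the use of Pillow are our own choices, not part of the paper's code.

```python
from PIL import Image

def make_lr(hr_path: str, scale: int = 4) -> Image.Image:
    """Bicubic downsampling of an HR image to produce the LR input."""
    hr = Image.open(hr_path).convert("RGB")
    w, h = hr.size
    return hr.resize((w // scale, h // scale), Image.BICUBIC)
```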

In the Generator of WP-SRGAN, 16 ResBlocks are used with a filter size of \(3\times 3\); the first and last convolutional layers use a filter size of \(9\times 9\). All convolutional layers use a stride of one and a padding of one. The weights are initialized by the Xavier method, and all convolutional and upsampling layers are followed by the PReLU activation function. The learning rate is initialized to \(1e-4\) and decreased by a factor of 10 after \(2.5\times 10^{5}\) iterations; the total number of iterations is \(5\times 10^{5}\). We use the Adam optimizer with momentum 0.9. In the Discriminator, the filter size is \(3\times 3\), the number of features doubles from 64 to 512, and the stride alternates between one and two.
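A possible PyTorch sketch of the BN-free discriminator described here is shown below: \(3\times 3\) convolutions whose channel count doubles from 64 to 512 with strides alternating between one and two, following the SRGAN layout [5]. The pooling and dense head sizes are assumptions.

```python
import torch.nn as nn

def wp_discriminator(in_ch=3):
    """Discriminator sketch for WP-SRGAN: SRGAN-style layout with BN removed."""
    layers, ch = [], in_ch
    for out_ch in (64, 64, 128, 128, 256, 256, 512, 512):
        stride = 2 if out_ch == ch else 1   # stride 1 when channels double, 2 when they repeat
        layers += [nn.Conv2d(ch, out_ch, 3, stride=stride, padding=1),
                   nn.LeakyReLU(0.2, inplace=True)]
        ch = out_ch
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # head sizes are assumptions
               nn.Linear(512, 1024), nn.LeakyReLU(0.2, inplace=True),
               nn.Linear(1024, 1), nn.Sigmoid()]
    return nn.Sequential(*layers)
```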

The weights of the adversarial loss and the perceptual loss in both MR-SRGAN and WP-SRGAN (i.e., \(\lambda _1\) and \(\lambda _2\)) are set to \(1e-3\) and \(6e-3\), respectively, and the threshold for image fusion (i.e., \(\xi \)) is set to 0.73 in our experiments.

Table 1. The quantitative results of WP-SRGAN with two stages on the PIRM-SR 2018 self validation dataset for \(4\times \) enlargement.

4.2 Model Analysis

Training WP-SRGAN with Two Stages. We analyze the experimental results of the two-stage adversarial training mechanism in WP-SRGAN. The quantitative and qualitative results on PIRM-SR 2018 self validation dataset are shown in Table 1 and Fig. 4.

In Table 1, WP-SRGAN trained with two stages achieves a lower perceptual score than WP-SRGAN trained with only the first stage. As shown in Fig. 4, the two-stage model also recovers many more details than the first-stage model, and its generated images look more realistic. Therefore, we use WP-SRGAN with two stages in our model.

Soft Thresholding. In the challenge, three regions are defined by RMSE, with boundaries at 11.5, 12.5, and 16. Using the results fused by Eqs. (9) and (10) under different threshold settings, we draw the perceptual-distortion plane shown in Fig. 5; the points on the curve denote thresholds from 0 to 1 with an interval of 0.1. Experimental results show that we obtain an excellent perceptual score in Region 3 (RMSE between 12.5 and 16) when \(\xi \) is set to 0.73.

Model Capacity. To demonstrate the capability of our models, we analyze the SR results of MR-SRGAN, WP-SRGAN, and Bi-GANs-ST in terms of perceptual score and RMSE on the PIRM-SR 2018 self validation dataset. The quantitative and qualitative results are shown in Table 2 and Fig. 6. The experimental results show that Bi-GANs-ST keeps a balance between the perceptual score and RMSE.

Fig. 4. The visual results of WP-SRGAN with two stages on the PIRM-SR 2018 self validation dataset for \(4\times \) enlargement.

Table 2. The model capacity analysis of the SR results by MR-SRGAN, WP-SRGAN and Bi-GANs-ST for the metrics perceptual score and RMSE on the PIRM-SR 2018 self validation dataset.

4.3 Comparison with the State-of-the-arts

To verify the effectiveness of our Bi-GANs-ST, we conduct extensive experiments on five publicly available benchmarks and compare the results with other state-of-the-art SR algorithms, namely EDSR [7] and EnhanceNet [9], using their open-source implementations. We evaluate the SR images with image quality assessment indices (i.e., PSNR, SSIM, perceptual score, RMSE), where PSNR and SSIM are measured on the Y channel with 6 pixels cropped from the border.
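As an illustration of this evaluation protocol, the sketch below computes PSNR on the Y (luma) channel with a 6-pixel border crop; SSIM would be computed on the same cropped Y channels. The BT.601 conversion used here is the convention common in SR evaluation code and is an assumption, as the exact conversion is not stated above.

```python
import numpy as np

def rgb_to_y(img):
    """Y (luma) channel of an 8-bit RGB image, ITU-R BT.601 convention."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(sr, hr, border=6):
    """PSNR between two uint8 RGB images, measured on the Y channel
    with `border` pixels ignored on each side."""
    sr_y = rgb_to_y(sr.astype(np.float64))[border:-border, border:-border]
    hr_y = rgb_to_y(hr.astype(np.float64))[border:-border, border:-border]
    mse = np.mean((sr_y - hr_y) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```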

Fig. 5. The perceptual-distortion plane of our method. The points on the curve denote the different thresholds from 0 to 1 with an interval of 0.1.

Fig. 6. The visual results of the three models (MR-SRGAN, WP-SRGAN and Bi-GANs-ST) for scaling factor \(4\times \), with their perceptual scores and RMSE, on the PIRM-SR 2018 self validation dataset.

The quantitative results for PSNR and SSIM are shown in Table 3. The best algorithm in these terms is EDSR, which is on average 1.0 dB, 0.54 dB, 0.34 dB, 0.83 dB, and 1.13 dB higher than our MR-SRGAN on the five benchmarks, respectively. The PSNR values of our Bi-GANs-ST are higher than those of EnhanceNet on Set5, Urban100, and Manga109 by approximately 0.64 dB, 0.3 dB, and 0.13 dB, respectively, and the SSIM values of our Bi-GANs-ST are all higher than those of EnhanceNet. Table 4 shows the quantitative evaluation in terms of average perceptual score and RMSE. For the perceptual score, our WP-SRGAN achieves the best result and Bi-GANs-ST the second best on the five benchmarks except for Set5. For RMSE, EDSR performs the best and our MR-SRGAN the second best.

Fig. 7. The visual results on the five benchmark datasets for scaling factor \(4\times \). From left to right: Bicubic, EDSR, EnhanceNet, MR-SRGAN, WP-SRGAN, Bi-GANs-ST, and the ground truth.

The visual perception results of \(4\times \) enlargement by different algorithms on the five benchmarks are shown in Fig. 7, with Bicubic, EDSR, EnhanceNet, MR-SRGAN, WP-SRGAN, Bi-GANs-ST, and the ground truth from left to right. EDSR generates images that look clear and smooth but not realistic, and the SR images of our MR-SRGAN are similar to those of EDSR. EnhanceNet generates more realistic images but with unpleasant noise. The SR images of our WP-SRGAN recover details comparable to EnhanceNet with less noise and are closer to the ground truth, and our Bi-GANs-ST produces even less noise than WP-SRGAN.

Table 3. Quantitative evaluation of state-of-the-art SR algorithms on five publicly available benchmarks: average PSNR/SSIM for scaling factor \(4\times \) (Red text indicates the best and blue text indicates the second best performance).
Table 4. Quantitative evaluation of state-of-the-art SR algorithms on five publicly available benchmarks: average perceptual score/RMSE for scaling factor \(4\times \) (Red text indicates the best and blue text indicates the second best performance).

5 Conclusions

In this paper, we propose a new deep SR framework, Bi-GANs-ST, which integrates two complementary generative adversarial network (GAN) branches. To better balance the perceptual score and RMSE of the generated images, we redesign two GANs (i.e., MR-SRGAN and WP-SRGAN) based on SRGAN to generate two complementary SR results. Finally, we use a soft-thresholding method to fuse the two SR results, which trades off the perceptual score against the RMSE. Experimental results on five public benchmarks show that our proposed algorithm achieves better perceptual results than other SR algorithms for \(4\times \) enlargement.