Introduction

The digital image is an essential source of information in many fields, such as image surveillance, target tracking, and magnetic resonance imaging (MRI) [1, 2]. However, digital images are inevitably corrupted by various types of noise during capture and transmission, which degrades image quality. A noisy image is usually formulated as

$${\varvec{y}}={\varvec{x}}+{\varvec{v}}$$

where \({\varvec{y}}\) denotes the noisy image, \({\varvec{x}}\) the noise-free image, and \({\varvec{v}}\) additive noise. The noise \({\varvec{v}}\) is often assumed to follow some known distribution.

In past decades, numerous image denoising techniques have been proposed, such as non-local self-similarity methods, partial differential equation (PDE) algorithms, threshold algorithms, sparse representation algorithms [3], and hybrid methods [4]. With the development of deep learning, CNN-based methods have become the focus of image denoising research [5]. The feed-forward denoising convolutional neural network (DnCNN) proposed in [6] consists of a cascaded structure of convolution layers, rectified linear units (ReLU), and batch normalization (BN) layers, with residual learning (RL) introduced at the network output. Although most of the image denoisers mentioned above perform well on images polluted by additive white Gaussian noise (AWGN), their performance usually degrades dramatically when removing noise from real-world images captured by digital cameras, which introduce more sophisticated noise. In view of this problem, Wei et al. [7] aim to establish a more accurate simulation of image noise models in real-world scenarios, generating target data to improve the denoising capability of algorithms on images captured in real scenes. Guo et al. [8] proposed a more realistic noise model that considers signal-dependent noise and the influence of the image signal processing (ISP) pipeline on noise, together with a convolutional blind denoising network (CBDNet) that restores a clean image from a realistic noisy image by means of a noise estimation sub-network designed around this noise model. Chen et al. [9] contend that conventional training overfits to the noise in the training set and devised a masked training approach: a random and substantial mask is applied to the input image, compelling the model to learn to reconstruct the obscured image content and thus improving its generalization capability. However, because real camera noise is influenced by many factors, existing noise models still struggle to fully match the complexity of real-world noise. As a result, these methods have not significantly enhanced the generalization performance of denoising networks and still face challenges when dealing with mismatched noise distributions [10].

Moreover, the MSE loss used to train traditional denoising CNNs is also designed under the Gaussian noise assumption and with the aim of maximizing the peak signal-to-noise ratio (PSNR). However, it has been shown that PSNR does not effectively reflect human visual perception, so evaluation results often differ from perceived quality: even when the PSNR of an image improves, denoising may excessively smooth image details. Therefore, for images with non-Gaussian noise, when CNN-based denoisers use only MSE loss, the denoised image actually contains additional information introduced by the denoising method [11], resulting in artifacts, while the excessive smoothing causes loss of texture details.
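For reference, PSNR is a monotone function of the MSE alone, which is why a purely MSE-driven network also maximizes PSNR without regard to perceptual quality. A minimal sketch (NumPy, images assumed scaled to [0, 1]):

```python
import numpy as np

def psnr(x: np.ndarray, y: np.ndarray, peak: float = 1.0) -> float:
    """PSNR in dB between two images scaled to [0, peak].

    Because PSNR depends only on the MSE, it cannot distinguish an
    over-smoothed result from a perceptually sharper one that has the
    same pixel-wise error.
    """
    mse = np.mean((x - y) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```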

Multi-task learning (MTL) is a learning paradigm that exploits the knowledge contained in multiple related tasks to improve the generalization performance of each task [12, 13]. It leverages the information provided by different learning tasks more efficiently than single-task learning [14] and facilitates knowledge sharing between tasks, thereby reducing the risk of overfitting for each individual task and improving overall performance [15]. However, the performance of an MTL model relies heavily on the weight selection among tasks, and searching for optimal weights by manual adjustment is time-consuming and difficult [16, 17]. In this paper, we propose a method to enhance existing image denoising convolutional neural networks (DCNNs) within the MTL framework for non-Gaussian noise removal and design a data-driven method for self-learning the sub-task weights. Our main contributions are as follows:

  1. Through the MTL framework, different image quality assessment metrics and image features (including MSE, SSIM, and statistical characteristics of image residuals) are utilized as sub-tasks to train the DCNN collaboratively. This gradually transforms non-Gaussian image noise towards Gaussian noise, thereby enhancing denoising performance and improving the visual quality of denoised images.

  2. We design a network layer that, trained jointly with the DCNN, automatically and rapidly learns the weight for each sub-task.

  3. Two training strategies are investigated: one aims for optimal overall performance, and the other obtains the shared features best suited to multiple tasks.

Experiments demonstrate that our approach enhances the image denoising performance of DCNNs on two types of networks and four image datasets, under both Gaussian and non-Gaussian noise, in terms of both quantitative metrics and visual perception.

Related work

Image DCNN

The image DCNN has achieved great improvement in Gaussian noise removal. To deal with more complex noise, the fast and flexible denoising convolutional neural network (FFDNet) [18] extends DnCNN by introducing a noise level map as an additional network input. In view of the difficulty of obtaining noisy/noise-free image pairs, the Noise2Noise (N2N) [19] method trains the DCNN on pairs of independent noisy images of the same scene and reaches performance comparable to training with noisy/noise-free pairs. Its training strategy derives from the statistical observation that the loss function only requires the target to be “clean” in certain statistical values, not in every target signal. CBDNet [8] consists of two sub-networks: a noise estimation sub-network with a symmetric structure and total variation losses, which outputs a noise level map of the same size as the input image, and a non-blind denoising sub-network that takes the noise level map and the noisy image as input to recover the latent clean image. Synthetic and real-world noisy images are merged for CBDNet training to achieve robust performance even when the noise model differs slightly from real-world noise. Experiments demonstrated the crucial role of the image noise model in denoising real noisy images.

MTL

MTL aims at improving the performance of each task through inductive knowledge transfer, sharing domain information between tasks, and has been successfully applied in machine learning and deep learning. Tang et al. [20] designed a face recognition network with multi-task learning for better performance by jointly optimizing the face recognition loss and the face classification loss. Gao et al. [21] applied the MTL framework to integrate target recognition and image noise reduction in the defect recognition of railway insulator images, training the CNN in coordination by alternately freezing one task and optimizing the other. Considering that manual adjustment of the weight coefficient of each task is time-consuming and laborious, Kendall et al. [17] adopted the homoscedastic uncertainty of each task to weigh each loss and showed their method superior to individual models trained separately on each task in per-pixel depth regression and other problems. Ozan et al. [22] transformed the MTL problem into a multi-objective optimization (MOO) problem over a set of potentially conflicting objectives; the objective of MTL is thus converted into finding the Pareto optimal solutions of the corresponding MOO problem. They use the multiple-gradient descent algorithm (MGDA) to solve for the weight coefficients of the potentially conflicting targets and demonstrated that their method produces a solution that is either a Pareto stationary point or yields a descent direction that improves every task objective. This method has been successfully applied in scene understanding and multi-label classification.

Proposed method

MTL framework and auxiliary tasks

To leverage the MTL framework for DCNN training, we introduce loss functions based on different principles. The optimization of each loss function is treated as a sub-task within the MTL framework, thereby transforming MTL into the following MOO problem:

$$\begin{array}{c}\mathrm{Min}\ \mathbf{L}\left({\varvec{x}},{\varvec{y}}\right)=\left\{{L}_{1}\left({\varvec{x}},{\varvec{y}},{{\varvec{\theta}}}^{s},{{\varvec{\theta}}}^{1}\right),\cdots ,{L}_{N}\left({\varvec{x}},{\varvec{y}},{{\varvec{\theta}}}^{s},{{\varvec{\theta}}}^{N}\right)\right\}\\ n=1,2,\cdots ,N\end{array}$$
(1)

where \({\varvec{x}}\in {\mathbb{R}}^{d}\) denotes the input space, \({\varvec{y}}\in {\mathbb{R}}^{N}\) denotes the objective space, \(N\) is the total number of objectives, \({{\varvec{\theta}}}^{n}\) are objective-specific parameters, \({{\varvec{\theta}}}^{s}\) are shared parameters, and \({L}_{n}\left({\varvec{x}},{\varvec{y}},{{\varvec{\theta}}}^{s},{{\varvec{\theta}}}^{n}\right):{\varvec{x}}\to {{\varvec{y}}}^{n}\) is the nth sub-task, i.e., the nth loss function of the DCNN.

As mentioned in the Introduction, when the DCNN uses only the mean square error (MSE) as its loss function, the amplitude of the image residuals is suppressed, but their distribution is influenced by non-Gaussian noise and by the denoising algorithm, which reduces perceived visual quality [11]. Therefore, we introduce a distribution distance metric as an auxiliary task to make the image residuals \(\widehat{{\varvec{e}}}\) approximate Gaussian white noise. Simultaneously, the structural similarity index (SSIM) is introduced to enhance the structural similarity between the denoised image and the target image. Integrating these tasks under an MTL framework aims not only to suppress the amplitude of the image residuals but also to reduce redundant information, removing noise and improving denoising effectiveness.

Distribution distance loss

If the image residuals \(\widehat{{\varvec{e}}}\) are close to zero-mean Gaussian white noise, it indicates that only noise, rather than geometric structure or texture features, has been removed from the noisy image. Therefore, within the MTL framework of the DCNN, we introduce a sub-task aimed at making the image residuals approximate white Gaussian noise. This is intended to align with the traditional DCNN noise model, thereby improving the denoising performance of the DCNN under non-Gaussian noise conditions. There are several ways to evaluate how closely the residuals \(\widehat{{\varvec{e}}}\) approximate white Gaussian noise; one is the auto-correlation coefficient of the residuals, calculated by the following formula:

$$\rho =\frac{E\left[\left(X-{\mu }_{x}\right)\left(Y-{\mu }_{y}\right)\right]}{{\sigma }_{x}{\sigma }_{y}}$$
(2)

where \({\mu }_{x}\) and \({\mu }_{y}\) denote the means of \(x\) and \(y\), and \({\sigma }_{x}\) and \({\sigma }_{y}\) their standard deviations. The auto-correlation coefficient can then be examined through a randomness test. Another method is the Kullback–Leibler divergence (KLD), which directly measures the difference between two distributions. The formula of KLD is

$$KL\left(p\parallel q\right)=-\int p\left(x\right){\text{ln}}\left\{\frac{q\left(x\right)}{p\left(x\right)}\right\}dx$$
(3)

where \(q\left(x\right)\) is the distribution of \(\widehat{{\varvec{e}}}\), and \(p\left(x\right)\) is the density of white Gaussian noise \(N\left(0,1\right)\); we calculate the KLD between \(\widehat{{\varvec{e}}}\) and white noise as an auxiliary task.
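As a sketch of how these two checks can be realized in practice (the function names and the Gaussian-moment approximation of the KLD are our choices, not specified above): for white noise, the correlation of Eq. (2) between the residual and a shifted copy of itself should be near zero at every nonzero lag, and the KLD of Eq. (3) admits a closed form when the residual distribution is approximated by a Gaussian fitted to its sample moments, which keeps the loss differentiable during training.

```python
import numpy as np
import torch

def residual_autocorr(e: np.ndarray, max_lag: int = 5) -> np.ndarray:
    """Sample autocorrelation of the flattened residual at lags 1..max_lag
    (Eq. (2) applied to the residual and its shifted copy). For white
    noise every value should be near zero, e.g., within +/-1.96/sqrt(n)
    in a simple randomness test."""
    e = e.ravel() - e.mean()
    var = e.var()
    return np.array([(e[:-k] * e[k:]).mean() / var
                     for k in range(1, max_lag + 1)])

def kld_to_standard_normal(residual: torch.Tensor) -> torch.Tensor:
    """KL(N(mu, var) || N(0, 1)) for a Gaussian fitted to the residual:
    0.5 * (var + mu^2 - 1 - ln(var)). A differentiable surrogate for the
    KLD between the residual distribution and white Gaussian noise."""
    mu = residual.mean()
    var = residual.var()
    return 0.5 * (var + mu ** 2 - 1.0 - torch.log(var + 1e-12))
```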

SSIM loss

Denoised images reconstructed by DCNN-based methods that minimize MSE loss often lose important details, exhibiting over-smoothing artifacts in texture-rich regions [12] and degrading image quality. In contrast, SSIM measures the structural similarity between images by comparing luminance, contrast, and structure, and its evaluation results are considered more consistent with how humans judge the difference between two images. In [23], training a DCNN with a joint SSIM and L1 loss indeed achieved better denoising results. Therefore, we use the SSIM index as another sub-task in the MTL framework. Let \({C}_{l}\left(I,\widehat{I}\right)\) and \({C}_{c}\left(I,\widehat{I}\right)\) denote the differences between two images in luminance and contrast respectively, and \({\mu }_{I}\) and \({\sigma }_{I}\) the mean and standard deviation of an image; then, the formula is

$$\begin{array}{c}{C}_{l}\left(I,\widehat{I}\right)=\frac{2{\mu }_{I}{\mu }_{\widehat{I}}+{C}_{1}}{{\mu }_{I}^{2}+{\mu }_{\widehat{I}}^{2}+{C}_{1}}\\ {C}_{c}\left(I,\widehat{I}\right)=\frac{2{\sigma }_{I}{\sigma }_{\widehat{I}}+{C}_{2}}{{\sigma }_{I}^{2}+{\sigma }_{\widehat{I}}^{2}+{C}_{2}}\end{array}$$
(4)

where \({C}_{1}\) and \({C}_{2}\) are constants for stability. Let \({C}_{s}\left(I,\widehat{I}\right)\) denote the difference between two images in structure; then, the SSIM index is calculated as follows.

$$\begin{array}{c}{C}_{s}\left(I,\widehat{I}\right)=\frac{{\sigma }_{I\widehat{I}}+{C}_{3}}{{\sigma }_{I}{\sigma }_{\widehat{I}}+{C}_{3}}\\ SSIM\left(I,\widehat{I}\right)={\left[{C}_{l}\left(I,\widehat{I}\right)\right]}^{\alpha }{\left[{C}_{c}\left(I,\widehat{I}\right)\right]}^{\beta }{\left[{C}_{s}\left(I,\widehat{I}\right)\right]}^{\gamma }\end{array}$$
(5)

where α, β, and γ are adjustable parameters.
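A minimal global-statistics sketch of Eqs. (4) and (5) with α = β = γ = 1 and C3 = C2/2, the common simplification that collapses the three factors into two; practical SSIM implementations compute these statistics over a sliding window, which we omit here:

```python
import torch

def ssim_loss(img: torch.Tensor, ref: torch.Tensor,
              c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    """1 - SSIM computed from global image statistics, for intensities
    in [0, 1]. With alpha = beta = gamma = 1 and C3 = C2 / 2, the
    luminance/contrast/structure product of Eq. (5) reduces to the
    familiar two-factor form below."""
    mu_x, mu_y = img.mean(), ref.mean()
    var_x, var_y = img.var(), ref.var()
    cov = ((img - mu_x) * (ref - mu_y)).mean()
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return 1.0 - ssim  # minimize 1 - SSIM as the sub-task loss
```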

Non-Gaussian noise model

Since acquiring noisy/noise-free image pairs is difficult, DCNNs usually have to be trained on simulated noisy data, and the choice of noise model strongly influences the training effect and the denoising performance of the DCNN. Existing CNN denoisers [24], such as BM3D-Net [25] and DnCNN, generally suffer performance degradation on real-world noisy images because they adopt a simple AWGN model while real noise is usually non-Gaussian. In this section, the Poisson–Gaussian model [26] is introduced as the real noise distribution model. The Poisson–Gaussian model has zero mean, and its variance varies with the actual pixel value; it is signal-dependent and changes with different cameras and camera settings. The Poisson–Gaussian noise model can be further simplified to the heterogeneous Gaussian noise model, which is made up of a stationary noise component and a signal-dependent noise component. It has been shown that the heterogeneous Gaussian noise model is more suitable than AWGN for modeling noise in real-world images. In the heterogeneous Gaussian (HG) noise model, each observed sample y is regarded as a random variable with a signal-dependent variance, formulated as follows.

$$y\sim N\left(\mu =x,{\sigma }^{2}={\lambda }_{r}+{\lambda }_{s}x\right)$$
(6)

where x is the signal, and \({\lambda }_{r}\) and \({\lambda }_{s}\) are parameters that depend on the sensor’s gains. Moreover, other image processing procedures (such as color correction and tone mapping) are also considered in generating the simulated noisy images, which are synthesized by adding noise to raw sensor measurements and used as training data.
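A minimal sketch of synthesizing HG noise per Eq. (6); the parameter values below are illustrative only, and the full pipeline of [8] would additionally pass the result through ISP steps (demosaicing, color correction, tone mapping), which we omit here:

```python
import numpy as np

def add_hg_noise(x: np.ndarray, lam_r: float = 1e-4,
                 lam_s: float = 1e-3) -> np.ndarray:
    """Add heterogeneous Gaussian noise to a clean image x in [0, 1]:
    y ~ N(mu = x, sigma^2 = lam_r + lam_s * x), where lam_r models the
    stationary (read-noise) component and lam_s the signal-dependent
    (shot-noise) component."""
    sigma = np.sqrt(lam_r + lam_s * x)            # per-pixel std
    y = x + np.random.randn(*x.shape) * sigma     # zero-mean HG noise
    return np.clip(y, 0.0, 1.0)
```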

Network structure

Depending on the optimization process, we propose two strategies for training the DCNN with the MSE loss, SSIM loss, and distribution distance loss as tasks in the MTL framework. The network structure of the first strategy is illustrated in Fig. 1, where the DCNN contains all the shared parameters and each loss is computed as an objective-specific task. We then convert the MOO problem into the following single-objective optimization (SOO) problem by weighting all the losses.

Fig. 1 Network architecture of the first strategy

$$\underset{{\alpha }^{1},\cdots ,{\alpha }^{T},{{\varvec{\theta}}}^{s}}{{\text{Argmin}}}\sum_{t=1}^{T}{\alpha }^{t}{L}_{t}({{\varvec{\theta}}}^{s},{{\varvec{\theta}}}^{t})$$
(7)

The weights α in optimization problem (7) can be considered a group of hyper-parameters. For a small number of hyper-parameters, Bayesian optimization [27] can be employed for parameter search. In this paper, we design a linear output layer that performs a weighted sum of the different task losses, so that the weights α are optimized automatically through network training; the shared parameters of the DCNN are likewise optimized by gradient descent. The algorithm based on this training strategy is formulated in Algorithm 1.
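A minimal sketch of such a weight-learning layer (the softmax parameterization and all names are our assumptions; the text states only that a linear layer combines the task losses and that its weights are learned during training):

```python
import torch
import torch.nn as nn

class TaskWeightLayer(nn.Module):
    """Learns one weight per task and returns the weighted total loss.

    The softmax keeps every alpha_t positive and normalized, excluding
    the trivial all-zero solution; other parameterizations (e.g., the
    uncertainty weighting of [17]) would fit the same slot.
    """

    def __init__(self, num_tasks: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses: torch.Tensor) -> torch.Tensor:
        alpha = torch.softmax(self.logits, dim=0)   # learned weights
        return (alpha * task_losses).sum()          # SOO objective (7)

# usage: total = weight_layer(torch.stack([mse_l, ssim_l, kld_l]))
```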

Algorithm 1. Training according to the first strategy.


The second strategy employs a task-switching MTL framework to train the DCNN. Its network structure is illustrated in Fig. 2, where each output corresponds to the DCNN trained with a different loss. In this strategy, an alternating optimization method updates the network in turn according to each sub-task, so that the DCNN acquires shared features best suited to the multiple sub-tasks.
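A minimal sketch of the task-switching loop of this strategy; the per-batch round-robin schedule and all names are our assumptions, as the precise alternation granularity is specified in Algorithm 2:

```python
import itertools

def train_task_switching(model, loss_fns, loader, optimizer, epochs):
    """Alternately optimize the shared DCNN on each sub-task: every
    batch is trained against the next loss in a round-robin cycle, so
    the shared features must serve all tasks at once."""
    task_cycle = itertools.cycle(range(len(loss_fns)))
    for _ in range(epochs):
        for noisy, clean in loader:
            t = next(task_cycle)                  # switch sub-task
            optimizer.zero_grad()
            loss = loss_fns[t](model(noisy), clean)
            loss.backward()                       # updates shared features
            optimizer.step()
```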

Fig. 2 Network architecture of the second strategy, based on the feature-transformation MTL framework

Algorithm 2. Training according to the second strategy.


Results and discussion

Experimental data preparation

To evaluate our method, we selected two DCNN models: a denoising autoencoder (DAE) and CBDNet. We applied the training methods described in the “Network structure” section to study the improvement in denoising performance. Evaluation used the PSNR and SSIM metrics, and the algorithms were implemented in Python.

For the DAE, we crop images from the STL dataset [28] into 96 × 96 patches and add various levels of Gaussian or non-Gaussian noise to create sample pairs. Images from PolyU [29] and RENOIR [30] are used for testing; the testing pairs are generated in the same way as the training pairs.

For CBDNet, we kept its network architecture and training parameters unchanged. We extracted 1200 images from the DND dataset [31] and synthesized noisy training images using the heterogeneous Gaussian noise model and the image signal processing (ISP) pipeline proposed in [8]. Testing used images from the PolyU, BSDS500 [32], and RENOIR datasets.

Experiments on DAE

The DAE used in our experiments consists of two parts, an encoder \(z=f\left({\widetilde{x}}_{i}\right)\) and a decoder \(y=g\left(z\right)\), both CNNs (Table 1). \(z\) denotes the low-dimensional hidden feature vector extracted from the input. For training data generation, we use three noise models: Gaussian noise, heterogeneous Gaussian noise, and heterogeneous Gaussian noise with ISP. Test data are generated in the same way from the PolyU and RENOIR datasets. For training, we use a stochastic optimization algorithm with a learning rate of 3 × 10−4 and train for 100 epochs.
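For concreteness, a minimal encoder-decoder in the spirit of the tested DAE; the channel counts and depths here are illustrative assumptions, and the configuration actually tested is the one listed in Table 1:

```python
import torch.nn as nn

class DAE(nn.Module):
    """Convolutional denoising autoencoder: z = f(x_noisy), y = g(z)."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(              # f: 96x96 -> 24x24
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(True),
        )
        self.decoder = nn.Sequential(              # g: 24x24 -> 96x96
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(True),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x_noisy):
        z = self.encoder(x_noisy)   # low-dimensional latent features
        return self.decoder(z)      # reconstructed clean image
```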

Table 1 Network structure of the tested DAE

Figures 3 and 4 show the denoising results on a PolyU image under the AWGN model and the heterogeneous Gaussian noise model, respectively. Compared with the improved DAE, the traditional DAE generates more artifacts in the denoised images and loses some detail in the image structures. The improved DAE preserves detailed structures better and achieves gains in both PSNR and SSIM.

Fig. 3 Restored results of a PolyU image over AWGN (σ = 30). a Target image. b Noisy image, PSNR/SSIM: 19.16/0.33. c Traditional DAE, PSNR/SSIM: 30.17/0.94. d Improved DAE, PSNR/SSIM: 32.07/0.98

Fig. 4 Restored results of a PolyU image over HG noise. a Clean image. b Noisy image, PSNR/SSIM: 22.89/0.51. c Traditional DAE, PSNR/SSIM: 30.74/0.95. d Improved DAE, PSNR/SSIM: 32.51/0.97

Figure 5 shows the evolution of the MSE loss and the validation results for the traditional DAE and for the DAE improved by the two strategies, trained on the RENOIR dataset. As seen in Fig. 5, the improved DAES1 shows the fastest decline in both the MSE loss curve and the validation curve, and both improved DAEs outperform the traditional DAE in this respect, demonstrating an improvement in denoising effectiveness.

Fig. 5 The variation in MSE loss and validation results

Experiments on CBDNet

CBDNet has demonstrated effective noise reduction on real-world images. To enhance its denoising performance and improve its generalization to non-Gaussian noise, we applied the MTL framework to its training process. Training the improved CBDNet takes about 2 days on an Nvidia GeForce GTX 1060 GPU.

Figure 6 provides the denoised result on a PolyU image under the heterogeneous Gaussian noise model with ISP. The improved CBDNet shows gains over the traditional CBDNet in preserving image edges and achieves relative increases in PSNR and SSIM on the RENOIR dataset. Figure 7 shows the denoising results of different methods on a RENOIR image under the Gaussian noise model. The original DAE produces over-smoothed results, while the improved DAE restores more local detail and denoises better than DnCNN and FFDNet.

Fig. 6 Denoising results of a PolyU image over HG noise + ISP. a Clean image. b Noisy image, PSNR/SSIM: 26.52/0.79. c Traditional CBDNet, PSNR/SSIM: 30.40/0.88. d Improved CBDNet, PSNR/SSIM: 30.49/0.92

Fig. 7 Denoising results of a RENOIR image. a Clean image. b Noisy image. c DnCNN. d FFDNet. e DAE. f Improved DAES1. g Improved DAES2

Table 2 compares the denoising performance on the RENOIR dataset of the versions improved by the two strategies proposed in the “Network structure” section against the original DAE, DnCNN, FFDNet, and CBDNet. The improved DAE is trained with images generated by the AWGN model and the heterogeneous Gaussian noise model, respectively. Under the heterogeneous Gaussian noise model with ISP, the improved DAE shows significantly enhanced denoising performance; its PSNR/SSIM results outperform those of DnCNN and FFDNet, indicating that the proposed method can enhance the denoising performance of DCNNs on non-Gaussian noisy images. The improved CBDNetS2 achieves the highest PSNR/SSIM among all methods.

Table 2 The denoising results on RENOIR dataset

Figure 8 shows the denoising results on a BSDS500 image using CBDNet and its improved version under the heterogeneous Gaussian noise model with ISP. Compared to the original method, the improved approach preserves more local detail in the denoised image, yielding a better visual result.

Fig. 8 Denoising results of a BSDS500 image generated over the HG noise model + ISP. a Clean image. b Noisy image, PSNR/SSIM: 26.43/0.84. c CBDNet, PSNR/SSIM: 27.81/0.89. d Improved CBDNet, PSNR/SSIM: 28.89/0.91

Conclusion

Training DCNNs solely with the MSE loss cannot fully match non-Gaussian noise in images and may introduce additional information during denoising, reducing the visual quality of the denoised image. To address this, we explore image evaluation metrics that describe image characteristics from different angles, such as the statistical properties of residuals and image structural similarity, and employ them as loss functions to improve DCNN training. This approach enhances the generalization ability of DCNNs to non-Gaussian noise, improves the recovery of detail in denoised images, and reduces artifacts. Additionally, we introduced training strategies that automatically select appropriate weight coefficients for each task. These measures effectively enhanced the image denoising performance of the original DCNNs. Future research will explore more reasonable image evaluation metrics, such as no-reference image quality metrics, and will apply the MTL framework to new network architectures, such as Transformer-based denoising networks. We will also consider integrating these metrics with self-supervised denoising methods to reduce dependence on noise-free training data.