Introduction

Low-light image enhancement has been studied for many years and has important applications in areas such as night-time video surveillance and autonomous vehicles. Restoring low-light images to normal-light images with enhancement algorithms provides a solid foundation for subsequent high-level vision tasks, such as object detection, object tracking, and semantic segmentation. At the same time, low-light image enhancement technology is indispensable in fields such as military security and deep-sea exploration1.

Traditional methods2,3,4,5,6,7 for low-light image enhancement are typically based on histogram equalization and Retinex-based approaches. These methods have some effect in increasing the brightness of low-light images, but they often suffer from over-enhancement and detail loss, as well as excessive noise and color distortion due to the reduction of grayscale levels, scene complexity, and unstable prior knowledge extraction1.

With the improvement of computer hardware technology, the speed of data processing has increased greatly. Many deep learning-based methods8,9,10,11,12,13 have shown good performance in the field of low-light image enhancement. Currently, most low-light image enhancement methods are based on convolutional neural networks (CNNs), which learn the mapping relationship from low-light images to normal-light images from a large amount of data through carefully designed CNN structures. However, the limited receptive field of the convolution operation in CNNs cannot fully consider long-distance pixel relationships in the input image, which affects the image enhancement effect14. The self-attention mechanism15,16 in Transformers17,18,19,20,21 can solve this problem. The self-attention mechanism models long-range dependencies, which can better preserve image details and reduce the impact of noise, thereby improving image quality22. Transformer-based methods have made important progress in low-level vision tasks such as image super-resolution23,24, image denoising25, and image dehazing26. Related Transformer methods27,28 have also been applied to low-light image enhancement and have achieved good performance, as they can better model non-local information to achieve high-quality image reconstruction. However, these methods do not enhance the local features of the image well, which is what CNNs excel at. Therefore, recent research29,30,31 has attempted to combine CNN and Transformer networks to unite their advantages and improve the performance of the corresponding tasks. For low-light enhancement tasks, the network architecture needs to account for the fact that low-light images contain far more low-light features than high-light features. Moreover, for low-light enhancement in real scenes, where paired datasets are lacking, zero-shot learning32 methods are needed to better support high-level vision tasks. Specifically, zero-shot learning here means that no paired or unpaired data is needed during training.

The substantial contributions of this study are designed to combat the issue of uneven illumination. Transformers, armed with their global attention mechanism, can comprehensively process long-range pixel relations in an input image. However, the traditional self-attention mechanism demands substantial computational resources, and its multitude of parameters could lead to overfitting. On the other hand, CNN networks are well-regarded for enhancing local features and maintaining robustness, yet they struggle to capture global context information. Integrating the two networks without thoughtful design of the CNN part could lead to ineffective learning of the global features generated by the Transformer network.

Aiming to unite the advantages of CNN’s local feature extraction and the Transformer’s global modeling, the network introduced in this study comes with specific improvements. The complexity of the Transformer module increases linearly, rather than quadratically, with image resolution, facilitating efficient acquisition of contextual information. The CNN module adopts a Transformer-like structure designed to concentrate better on the features extracted by the Transformer, compensating for the difficulty of acquiring global information and thus enhancing the model’s efficiency33. Ablation experiments were conducted during development, and multiple combinations were tested before finalizing the network architecture presented in this paper.

Particularly, the channel attention mechanism of the auxiliary module and the Multi-Dconv head Sparse Attention (MDSA) module designed in this research address, to some extent, the issue of high time and space complexity inherent in traditional Transformers. The introduction of the sparse attention mechanism provides a deeper understanding and handling of the local features in the image. In low-light enhancement tasks, overly bright local features may hinder the model’s ability to capture other critical low-light features. To mitigate this problem, the MDSA module is adopted for a more precise depiction of local features and to boost their enhancement ability, marking the first application of the improved sparse attention mechanism in low-light enhancement tasks.

Figure 1 illustrates that in unevenly lit low-light environments, conventional self-attention mechanisms or ordinary sparse self-attention mechanisms tend to place the primary focus and weight on the highlight features, which is not ideal for low-light enhancement tasks. The sparse self-attention mechanism applied in this study properly biases the main weight towards low-light features while effectively reducing the weight of highlight features, significantly improving the model’s performance in low-light enhancement tasks. This approach, unexplored in prior methods, is one of the innovations of this work.

Figure 1

Handling strategies of different attention mechanisms in unevenly lit low-light environments. The traditional self-attention mechanism generally prefers to place its main focus on highlight features. Furthermore, the conventional sparse self-attention mechanism tends to concentrate a significant portion of the weights on the highlight features. Such an approach is not ideal for low-light enhancement tasks because it results in a tendency for overexposure in highlight areas while inhibiting sufficient enhancement of details in low-light areas. However, our proposed sparse self-attention mechanism breaks away from this norm. It is capable of appropriately shifting the majority of the weights towards low-light features while simultaneously effectively reducing the weights of highlight features. This facilitates a more balanced extraction and processing of features.

Among the two inputs in the Cross Gating Feedforward Network (CGFN) module, one is processed through the MDSA module, and the other bypasses it. The MDSA module implements the sparse attention mechanism on the channel dimension. Therefore, the proposed CGFN calculates weights in the spatial dimension, addressing the lack of spatial information after the feature passes through the MDSA module. Additionally, the presence of the gating mechanism can better suppress the further propagation of information features that are unfavorable to model convergence. In low-light enhancement tasks, the feature information in the highlight area can severely hamper the enhancement quality. The CGFN module can further alleviate this problem, introducing a method not previously seen in other methodologies.

Therefore, considering the characteristics of low-light images under uneven lighting, this article proposes a more effective zero-shot learning low-light enhancement network structure. The main contributions are summarized as follows:

  • A zero-shot learning low-light enhancement network named CUI-Net was designed. The entire network comprises enhancement modules and auxiliary modules. The enhancement module merges the global attention mechanism of the Transformer with the CNN network’s ability to process local features, offering high computational efficiency and powerful modeling capability. This structure enables better handling of uneven lighting, richer feature extraction, and image enhancement in low-light environments. The CNN network in the auxiliary module augments the convergence ability of the enhancement module and indirectly rectifies the influence of lighting.

  • A Multi-Dconv head Sparse Attention (MDSA) module was designed. The MDSA module constrains highlight features at the channel level and increases the weight of important local features. This design helps quell the interference of overly bright features, allowing the model to focus on and extract low-light features better, thereby enhancing the model’s performance in low-light enhancement tasks.

  • A novel Cross Gating Feedforward Network (CGFN) was proposed. CGFN not only effectively suppresses the further spread of feature information that is not conducive to model convergence but also compensates for information loss in the spatial dimension through information exchange, thereby further boosting the efficiency and effect of the model. In low-light enhancement tasks, feature information in highlight areas can seriously disrupt the enhancement quality; the CGFN module further mitigates this problem.

  • A multitude of experiments was conducted on nine challenging datasets. Most of the experimental results indicate that CUI-Net surpasses current state-of-the-art methods in terms of image quality enhancement effects and various evaluation indicators. More importantly, CUI-Net’s superior performance in high-level visual tasks (such as object detection, face detection, and semantic segmentation) in real-world low-light scenarios further validates its practical value and effectiveness.

Related work

Traditional enhancement methods

Traditional low-light enhancement methods can be primarily divided into two types: methods based on histogram equalization (HE) and methods based on the Retinex model. Methods based on HE2,3 redistribute pixel values based on the cumulative distribution function of the input image to expand the dynamic range. However, these methods are also prone to color fidelity loss and the generation of noise, resulting in image distortion4. The Retinex theory5 decomposes low-light images into a reflectance component and an illumination component based on prior knowledge or regularization, as in the Single Scale Retinex model (SSR)6 and the Multi-Scale Retinex model (MSR)7. MSR can be considered a weighted sum of several different SSR outputs. The output of these methods may change the relative proportions of the three enhanced color channels compared to the original image, which can lead to color distortion4. Fu et al.34 proposed a fusion method that combines the advantages of the sigmoid function and histogram equalization, which improves performance compared to2,3. Guo et al.35 initialized the illumination map of the image by finding the maximum value in the RGB channels and then optimized the initial illumination map by adding a structural prior to achieve image enhancement. These methods have some effect in increasing the brightness of low-light images; however, some algorithms ignore the correlation between the bright and dark parts, resulting in color distortion in images with significant brightness differences.

Deep learning-based methods

Most network-based low-light image enhancement algorithms are CNN-based. Among CNN-based methods, Retinex-based approaches usually enhance the illumination and reflection components separately through dedicated sub-networks32. Wei et al.36 introduced the Retinex-Net model, which aims to enhance low-light images. The model comprises two parts: a Decom-Net for decomposing images into illumination and reflection components, and an Enhance-Net for adjusting the illumination. Despite its purpose, Retinex-Net results in significant color distortion, leading to less natural-looking enhanced images8. EnlightenGAN9 uses a Generative Adversarial Network (GAN), which employs an attention-based U-Net10 as the generator and a global-local discriminator to obtain the enhancement results. ZeroDCE11 trains a lightweight network (DCE-NET) to fit the brightness mapping curve, which is then used to adjust the brightness distribution of the image. Retinex-inspired Unrolling with Architecture Search (RUAS)12 uses an unfolding architecture search to handle low-light image enhancement. Self-Calibrated Illumination (SCI)13 proposes a simplified network that fits physical principles to achieve low-light enhancement and introduces a calibration process in the training stage to improve the low-light enhancement model’s ability, thereby further improving the enhancement effect.

Methods combining CNN and transformer

CNN operations provide efficiency and universality, but their receptive fields are limited and cannot fully consider long-range pixel relationships in input images, which can affect image enhancement performance. In contrast, the self-attention mechanism in Transformers focuses on modeling long-range dependencies, enabling it to capture global information well. However, it lacks attention to the most relevant information37, and its complexity grows quadratically with spatial resolution14, leading to poor performance in some tasks. Thus, combining the two effectively to improve image enhancement quality is the focus of this paper. Conformer29 uses a CNN branch and a Transformer branch and combines them through Feature Coupling Units to fuse local convolution blocks, self-attention modules, and MLP units, adjusting feature resolution and channel numbers while continually eliminating semantic differences between the CNN and Transformer branches. HNCT30 integrates CNN and Transformer while using local and non-local priors to extract features beneficial for super-resolution, with an enhanced spatial attention module to further improve performance. ECFAN31 proposes a new hybrid super-resolution method, called ACT, that combines CNN and Vision Transformer19 to effectively aggregate local and non-local features and introduces cross-scale token attention modules to effectively utilize multi-scale token representations.

After careful consideration and experimental comparison, our method uses three Transformer blocks as the encoder to preserve the most useful self-attention values, avoiding the further propagation of aggregated highlight features, allowing useful global features to be fully utilized, and transmitting useful local features to ensure that the enhanced low-light images have sufficient details. Two CNN blocks serve as the decoder to further exploit the feature information obtained from the Transformer blocks, better enhancing the details and texture information of low-light images and leveraging the advantages of CNN networks.

Sparse attention

Images captured in real-world scenarios often suffer from uneven illumination32. For example, images taken at night may contain both dark and bright areas or overexposed regions, such as areas around light sources. Existing methods often enhance both the dark and bright regions of the image simultaneously, which can affect the visual quality of the enhancement results. However, current low-light image enhancement methods have not fully addressed this open problem. Zhao et al.38 proposed sparse Transformer to select the attention degree of the model. Fu et al.37 proposed a target focus network and sparse Transformer technique for visual object tracking. The target focus network focuses on the target of interest in the search region and highlights the features of the most relevant information for better estimating the states of the target. Inspired by SparseTT37, we adapt sparse Transformer to the low-light enhancement task. For low-light images with uneven illumination, the Transformer is susceptible to the influence of high-light features when computing self-attention, resulting in higher attention values. This naturally leads to a bias towards enhancing high-light features rather than low-light features with low attention values when modeling global feature dependencies. Therefore, we propose a sparse attention operation that differs from the usual one, choosing to set high-light features to lower values to effectively suppress high-light information and focus on the most relevant information in low-light enhancement tasks.

Proposed method

In this section, the framework of CUI-Net and its two main modules, the enhancement module and the auxiliary module, are introduced. Finally, we explain the unsupervised training losses used in our neural network model.

Overall procedure

The proposed CUI-Net is a cascaded two-stage image enhancement network (Fig. 2). In the first stage, a Transformer network is introduced to obtain global information, which can better enhance the details of low-light images. In the second stage, an auxiliary network based on multiple convolutional network blocks is constructed, and the original input image is used as a constraint to control the output detail features of the first stage. Unlike traditional methods, the training part of CUI-Net requires multiple enhancement modules and auxiliary modules, while the testing part contains only one enhancement module.

Figure 2

Overall framework of the CUI-Net. Only one enhancement module is used to obtain results during the testing phase.

Here, assume that the low-light input image is \(I\in {\mathbb {R}}^{H\times W\times C}\), where the height is H, the width is W, and the number of channels is C. For RGB images, C equals 3. According to the Retinex theory, the low-light image I can be obtained by performing the following operation on the clear image R and the illumination image L5:

$$\begin{aligned} I = R\otimes L \end{aligned}$$
(1)

Therefore, the enhanced image R can be obtained through input image I and the illumination map L.

During the training process, the entire framework can be divided into two parts, namely the Enhancement Module (EM) and the Auxiliary Module (AM):

$$\begin{aligned} \varepsilon _{t}= & {} E\!M_{t}({\mathscr {A}}_{t-1}+I;\vartheta )+{\mathscr {A}}_{t-1 } \end{aligned}$$
(2)
$$\begin{aligned} {\mathscr {A}}_{t}= & {} A\!M_{t}(I\oslash \varepsilon _t;\mu ),\quad {\mathscr {A}}_{0}=I \end{aligned}$$
(3)

where \(EM_{t}\) is the \(t\)-th image enhancement module network with learnable parameters \(\vartheta\), and \(AM_t\) is the \(t\)-th auxiliary module network with learnable parameters \(\mu\). When \(t=1\), i.e., in \(EM_1\), only the original low-light image I is used as input, i.e., \(EM_1 (I;\vartheta )\); no auxiliary output is added.

Unlike the training part, the auxiliary module is not needed in the testing part; only one enhancement module is used to obtain the clear image:

$$\begin{aligned} R = I\oslash [EM(I;\vartheta )+I] \end{aligned}$$
(4)
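To make the two-stage procedure concrete, the following PyTorch-style sketch traces Eqs. (2)–(4). The `EnhancementModule`/`AuxiliaryModule` internals, the number of stages, and the small epsilon added before division are assumptions for illustration only, not values from the paper.

```python
import torch
import torch.nn as nn


class CUINetCascade(nn.Module):
    """A minimal sketch of the cascaded scheme in Eqs. (2)-(4).

    `enhancement_module` (EM) and `auxiliary_module` (AM) are placeholders for
    the networks described later in the paper.
    """

    def __init__(self, enhancement_module: nn.Module, auxiliary_module: nn.Module, stages: int = 3):
        super().__init__()
        self.em = enhancement_module   # shared parameters theta across stages
        self.am = auxiliary_module     # shared parameters mu across stages
        self.stages = stages
        self.eps = 1e-6                # guard against division by zero (assumption)

    def forward(self, low_light: torch.Tensor):
        """Training pass: returns the per-stage (epsilon_t, A_t) pairs."""
        outputs = []
        aux = None
        for t in range(self.stages):
            if t == 0:
                illum = self.em(low_light)                  # EM_1(I): no auxiliary term yet
            else:
                illum = self.em(aux + low_light) + aux      # Eq. (2)
            aux = self.am(low_light / (illum + self.eps))   # Eq. (3)
            outputs.append((illum, aux))
        return outputs

    @torch.no_grad()
    def enhance(self, low_light: torch.Tensor) -> torch.Tensor:
        """Testing pass, Eq. (4): a single enhancement module suffices."""
        return low_light / (self.em(low_light) + low_light + self.eps)
```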

Image enhancement module

The image enhancement module consists of an efficient Transformer block and a CNN block, serving as the encoder and decoder, respectively. The Transformer block enhances low-light images by filtering out information from unevenly lit channels and local details, and then transferring the useful features to the next part of the network. The core of the Transformer block lies in the Multi-Dconv head Sparse Attention (MDSA) mechanism and the Cross Gating Feedforward Network (CGFN). MDSA can effectively reduce redundant features and increase the weights of important features, thus enhancing the network’s robustness and generalization ability. The cross-gating mechanism can compensate for the lack of information in the spatial dimension, allowing useful information to propagate further and enhancing the integrity of the entire feature representation. The CNN block replaces the attention block in the traditional Transformer network with depth-wise convolutions and the feed-forward layer with a simplified CNN structure, ensuring lightness. Meanwhile, a structure similar to the Transformer network can further process feature information while retaining the generality and efficiency advantages of a convolutional neural network.

In summary, the channel-wise sparse attention and cross-gated Transformer are used as the encoder in the image enhancement module. As the number of layers increases, the extracted features become increasingly abstract and semantically rich. The CNN block is used as the decoder to extract and enhance features at a higher level, making it more suitable for image enhancement tasks under uneven lighting conditions. Realizing pixel-level information transfer and context association through convolution further improves the performance and efficiency of the model.

The specific process of the image enhancement module is shown in Fig. 3. The network structure diagrams of the Transformer and CNN modules in the enhancement module are shown in Fig. 4. First, the input low-light image I undergoes a \(3\times 3\) convolution to extract low-level features and increase the number of channels. It then passes through three Transformer encoders and two CNN decoders; residual connections, together with upsampling and downsampling operations, are utilized to extract sufficient detail features. Finally, a \(3\times 3\) convolution restores the original number of channels, and the result is added to the input low-light image I to produce the final output image. C stands for the concatenation operation.
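A schematic PyTorch-style sketch of this data flow is given below. `transformer_block` and `cnn_block` are constructors for the blocks of Fig. 4; the channel widths, sampling layout, and skip connections are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn


class EnhancementModule(nn.Module):
    """Sketch of Fig. 3: 3x3 conv -> three Transformer encoders ->
    two CNN decoders with skip concatenation -> 3x3 conv -> residual add.
    Assumes H and W are divisible by 4 so that up/downsampling sizes match."""

    def __init__(self, transformer_block, cnn_block, c: int = 48):
        super().__init__()
        self.head = nn.Conv2d(3, c, 3, padding=1)
        # Encoder: Transformer blocks at successively lower resolutions.
        self.enc1, self.enc2, self.enc3 = transformer_block(c), transformer_block(2 * c), transformer_block(4 * c)
        self.down1 = nn.Conv2d(c, 2 * c, 3, stride=2, padding=1)
        self.down2 = nn.Conv2d(2 * c, 4 * c, 3, stride=2, padding=1)
        # Decoder: CNN blocks that fuse upsampled features with encoder skips.
        self.up2 = nn.ConvTranspose2d(4 * c, 2 * c, 2, stride=2)
        self.dec2 = cnn_block(4 * c)          # concat(up2, enc2) -> 4c channels
        self.up1 = nn.ConvTranspose2d(4 * c, c, 2, stride=2)
        self.dec1 = cnn_block(2 * c)          # concat(up1, enc1) -> 2c channels
        self.tail = nn.Conv2d(2 * c, 3, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = self.enc1(self.head(x))
        f2 = self.enc2(self.down1(f1))
        f3 = self.enc3(self.down2(f2))
        d2 = self.dec2(torch.cat([self.up2(f3), f2], dim=1))   # "C" = concatenation
        d1 = self.dec1(torch.cat([self.up1(d2), f1], dim=1))
        return self.tail(d1) + x                               # residual with the input image
```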

Figure 3

Network architecture of the enhancement module.

Figure 4

The network structure diagrams of the Transformer module used in the enhancement module and the CNN module used in both the enhancement and auxiliary modules.

Multi-Dconv head sparse attention

In traditional Transformer modules, multi-head self-attention mechanisms compute global information through self-attention mechanisms in the spatial dimension, resulting in a quadratic growth in complexity with increasing resolution. The main purpose of sparse attention mechanisms is to reduce the time and space complexity of traditional Transformers39. In this paper, the channel attention mechanism used in the MDSA module not only reduces model complexity and improves efficiency, but also helps the model better understand local features in the image. In low-light enhancement tasks, the appearance of too many high-brightness local features may interfere with the model’s ability to capture other low-light features. Therefore, this paper uses sparse attention mechanisms to assist the model in better representing local features and improving its enhancement ability.

The specific structure of MDSA is shown in Fig. 5. The input tensor is denoted as \(I\in {\mathbb {R}}^{{\hat{H}}\times {\hat{W}}\times 3}\). Q, K, and V represent query, key, and value. The \(1\times 1\) point-wise convolution is applied to aggregate pixel-level cross-channel context, followed by a \(3\times 3\) depth-wise convolution to encode channel-level spatial context. The operation \(\circledR\) in the figure stands for reshape. \(I\!s\!I\!n\!M\!ap\) is used to filter out the weights in the attention map matrix that are the same as the weights in the TopK matrix, and set the corresponding weights in the attention map to 0.01.

Figure 5

Network structure diagram of MDSA module.

Different from the Vision Transformer model19, MDSA uses self-attention mechanism to calculate the similarity between each channel, i.e., attention calculation is performed on the channel dimension rather than on the spatial dimension. This enables MDSA to better capture the relationships between feature channels, thereby improving the model’s representation ability and robustness.

Specifically, the TopK operation is performed on the attention map to select the top K attention values, followed by further operations. It should be noted that, unlike general sparse attention calculation, in the low-light task under uneven illumination, the channel information of the high-light area in the attention map is more likely to receive higher attention scores. These K attentions need to be set to 0.01 to allow the low-light channel features to be sent to CGFN for obtaining the required local information.

$$\begin{aligned} {\hat{X}}= & {} W_pSpAttention({\hat{Q}},{\hat{K}},{\hat{V}})+I \end{aligned}$$
(5)
$$\begin{aligned} SpAttention({\hat{Q}},{\hat{K}},{\hat{V}})= & {} {\hat{V}}\cdot softmax(TopK({\hat{K}}\cdot {\hat{Q}}/\lambda )) \end{aligned}$$
(6)

Here, \({\hat{Q}}\in {\mathbb {R}}^{{\hat{H}}{\hat{W}}\times {\hat{C}}},{\hat{K}}\in {\mathbb {R}} ^{{\hat{C}} \times {\hat{H}}{\hat{W}}},{\hat{V}}\in {\mathbb {R}} ^{{\hat{H}}{\hat{W}}\times {\hat{C}}}\) are obtained by reshaping tensors of the original size \({\mathbb {R}}^{{\hat{H}}\times {\hat{W}}\times {\hat{C}}}\). SpAttention denotes sparse attention. \(W_p\) represents a \(1\times 1\) point-wise convolution. \(\lambda\) is a learnable scaling parameter used to control the magnitude of the dot product of \({\hat{K}}\) and \({\hat{Q}}\).
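The following PyTorch-style sketch illustrates the channel-wise sparse attention of Eqs. (5)–(6) with a single head. The damping constant 0.01 applied to the top-k scores follows the description above; the value of k, the projection layout, and the omission of multi-head splitting are simplifying assumptions.

```python
import torch
import torch.nn as nn


class MDSA(nn.Module):
    """Single-head sketch of Multi-Dconv head Sparse Attention (Eqs. 5-6).

    The k largest channel-attention scores (typically highlight-dominated)
    are damped to 0.01 rather than kept; k must be < number of channels.
    """

    def __init__(self, channels: int, k: int = 8):
        super().__init__()
        self.k = k
        self.temperature = nn.Parameter(torch.ones(1))                          # lambda in Eq. (6)
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)             # point-wise conv
        self.qkv_dw = nn.Conv2d(channels * 3, channels * 3, kernel_size=3,
                                padding=1, groups=channels * 3)                 # depth-wise conv
        self.project_out = nn.Conv2d(channels, channels, kernel_size=1)         # W_p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.qkv_dw(self.qkv(x)).chunk(3, dim=1)
        q = q.reshape(b, c, h * w)                                 # C x HW
        k = k.reshape(b, c, h * w)
        v = v.reshape(b, c, h * w)
        attn = (q @ k.transpose(-2, -1)) / self.temperature        # C x C channel attention
        # Sparse step: damp the k largest scores instead of keeping them.
        _, topk_idx = attn.topk(self.k, dim=-1)
        attn = attn.scatter(-1, topk_idx, 0.01)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).reshape(b, c, h, w)
        return self.project_out(out) + x                           # residual, Eq. (5)
```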

Cross-gated feed-forward network

The two inputs of the Cross-Gated feed-forward Network (CGFN) are the input and output obtained through MDSA. The cross-gating part is equivalent to calculating weights on the spatial dimension and weighting specific positions, in order to compensate for the lack of spatial dimension information in the image that has not passed through MDSA.

The specific structure of the CGFN is shown in Fig. 6. Each single path of the CGFN module has two branches. One branch is a gating unit used to obtain the activation state of each pixel: a \(1\times 1\) convolutional layer expands the channel number, followed by a \(3\times 3\) depthwise convolutional layer and StarReLU to generate the gate map. The other branch does not pass through the StarReLU activation function. The two branches are then multiplied element-wise, and the cross-gating is computed across the two paths to compensate for the lack of spatial information. If the input of the CGFN from MDSA is \(X\in {\mathbb {R}}^{{\hat{H}}\times {\hat{W}}\times {\hat{C}}}\) and \(Y\in {\mathbb {R}}^{{\hat{H}}\times {\hat{W}}\times {\hat{C}}}\) is the input from the previous module that has not passed through MDSA, then the CGFN can be represented as follows:

Figure 6

Network structure diagram of CGFN module.

$$\begin{aligned} {\hat{Z}}= & {} W_o^{2}((W_m^{2}(W_o^{1}(W_m^{1}({\hat{X}})\odot {\hat{Y}})+{\hat{X}}))\odot {\hat{X}}) + {\hat{Y}} \end{aligned}$$
(7)
$$\begin{aligned} {\hat{X}}= & {} W_p^0G(X)+X \end{aligned}$$
(8)
$$\begin{aligned} {\hat{Y}}= & {} W_p^0G(Y)+Y \end{aligned}$$
(9)
$$\begin{aligned} G(X)= & {} \phi (W_d^1 W_p^1(LN(X)))\odot W_d^2 W_p^2 (LN(X)) \end{aligned}$$
(10)

where \(\odot\) denotes element-wise multiplication, \(\phi\) represents the StarReLU non-linear activation function, and LN stands for Layer Normalization. \(W_m\) denotes a softmax operation, and \(W_o\) denotes a dropout operation. \({\hat{Z}}\) serves as the input to the next module.
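A PyTorch-style sketch of Eqs. (7)–(10) follows. The StarReLU form (s·ReLU(x)² + b), the softmax dimension (spatial positions), the LayerNorm stand-in, and the channel expansion factor are assumptions; only the overall wiring mirrors the equations.

```python
import torch
import torch.nn as nn


class StarReLU(nn.Module):
    """StarReLU: s * ReLU(x)^2 + b with learnable scale and bias (assumed form)."""

    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return self.scale * torch.relu(x) ** 2 + self.bias


def spatial_softmax(t: torch.Tensor) -> torch.Tensor:
    """Softmax over spatial positions (assumed dimension for W_m)."""
    b, c, h, w = t.shape
    return torch.softmax(t.flatten(2), dim=-1).view(b, c, h, w)


class GatingUnit(nn.Module):
    """G(.) of Eq. (10): two 1x1 + 3x3 depth-wise branches, one gated by StarReLU."""

    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.norm = nn.GroupNorm(1, channels)   # stand-in for LayerNorm over (C, H, W)
        self.pw1 = nn.Conv2d(channels, hidden, 1)
        self.dw1 = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.pw2 = nn.Conv2d(channels, hidden, 1)
        self.dw2 = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = StarReLU()
        self.proj = nn.Conv2d(hidden, channels, 1)   # W_p^0 of Eqs. (8)-(9)

    def forward(self, x):
        n = self.norm(x)
        gate = self.act(self.dw1(self.pw1(n)))   # gate-map branch
        feat = self.dw2(self.pw2(n))             # branch without StarReLU
        return self.proj(gate * feat) + x        # Eqs. (8)/(9)


class CGFN(nn.Module):
    """Sketch of the cross-gating of Eq. (7)."""

    def __init__(self, channels: int, dropout: float = 0.0):
        super().__init__()
        self.gate_x = GatingUnit(channels)
        self.gate_y = GatingUnit(channels)
        self.drop1 = nn.Dropout(dropout)   # W_o^1
        self.drop2 = nn.Dropout(dropout)   # W_o^2

    def forward(self, x, y):
        # x: features that passed through MDSA; y: features that bypassed it.
        xh, yh = self.gate_x(x), self.gate_y(y)
        inner = self.drop1(spatial_softmax(xh) * yh) + xh     # W_o^1(W_m^1(X) . Y) + X
        return self.drop2(spatial_softmax(inner) * xh) + yh   # Eq. (7)
```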

Auxiliary module

The auxiliary module is necessary for unsupervised image enhancement methods as they may have limitations such as over-enhancement and color bias8. Therefore, the CNN network with high efficiency and generalization ability is chosen as the auxiliary module to converge the outputs of multiple enhancement modules to one enhancement effect, enabling the use of one enhancement module during the testing phase to achieve the same enhancement effect as the multiple enhancement modules during the training part.

As shown in Fig. 2 and Formulas (2) and (3), the purpose of the auxiliary module is to correct the input of the enhancement module, indirectly affecting its output. The input of the auxiliary module is obtained by element-wise addition of the outputs of the previous enhancement module and the previous auxiliary module, followed by division of the original low-light image by this sum. Thus, the auxiliary module obtains the features of the enhancement module and corrects the uneven illumination through the original low-light image.

The auxiliary module uses depth-wise convolution multiple times, which effectively reduces the number of parameters and the computational cost, as shown in Fig. 7. First, the input image is passed through a \(3\times 3\) convolution layer to increase the channel number and then through three CNN blocks. Finally, a \(3\times 3\) convolution layer reduces the channel dimension. As shown in Fig. 4, the CNN block enhances local details by passing the input features through \(3\times 3\) and \(5\times 5\) depth-wise convolutions, followed by the StarReLU activation function and multiple \(1\times 1\) convolutions to minimize the number of parameters. The corrected illumination information is then fed into the enhancement module, improving its enhancement effect.
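For concreteness, a PyTorch-style sketch of the auxiliary path is shown below; the channel width, the residual connection inside the CNN block, and the way the two depth-wise branches are fused are assumptions consistent with the description above.

```python
import torch
import torch.nn as nn


class StarReLU(nn.Module):   # s * ReLU(x)^2 + b, as in the CGFN sketch (assumed form)
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return self.scale * torch.relu(x) ** 2 + self.bias


class CNNBlock(nn.Module):
    """Sketch of the CNN block of Fig. 4: parallel 3x3 / 5x5 depth-wise convolutions,
    StarReLU, and 1x1 convolutions; the exact wiring is an assumption."""

    def __init__(self, channels: int):
        super().__init__()
        self.dw3 = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.dw5 = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        self.act = StarReLU()
        self.fuse = nn.Sequential(nn.Conv2d(2 * channels, channels, 1),
                                  nn.Conv2d(channels, channels, 1))

    def forward(self, x):
        local = torch.cat([self.dw3(x), self.dw5(x)], dim=1)   # multi-scale local details
        return self.fuse(self.act(local)) + x                  # residual keeps it lightweight


class AuxiliaryModule(nn.Module):
    """Sketch of Fig. 7: 3x3 conv -> three CNN blocks -> 3x3 conv (channel width assumed)."""

    def __init__(self, channels: int = 48):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)
        self.body = nn.Sequential(*[CNNBlock(channels) for _ in range(3)])
        self.tail = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, x):
        return self.tail(self.body(self.head(x)))
```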

Figure 7

Overall architecture diagram of the auxiliary module.

Training loss

To account for color preservation, artifact removal, and gradient backpropagation, the loss function needs careful design. The loss function used by CUI-Net is as follows:

$$\begin{aligned} {\mathscr {L}} = \alpha {\mathscr {L}}_c + \beta {\mathscr {L}}_s \end{aligned}$$
(11)

Here, \({\mathscr {L}}\) represents the total loss, \({\mathscr {L}}_c\) and \({\mathscr {L}}_s\) represent the correction loss and the smoothness loss respectively, and \(\alpha\) and \(\beta\) are two positive balancing parameters. In the experiments, the balancing parameters are set to \(\alpha =1.5\) and \(\beta =1\). The correction loss \({\mathscr {L}}_c\) ensures consistency between the estimated illumination and the adjusted result, that is:

$$\begin{aligned} {\mathscr {L}}_c = \sum _{x = 1}^{3} {\parallel EM_x - AM_{x-1} \parallel }^2 \end{aligned}$$
(12)

Here, \(EM_x\) is the output of the x-th enhancement module, and \(AM_x\) is the output of the x-th auxiliary module. \(AM_0\) is the original input I. As an unsupervised loss, this loss function only constrains the output through the auxiliary module.

Then, the smoothness loss is used40, that is:

$$\begin{aligned} {\mathscr {L}} _s = \sum _{i = 1}^{N} \sum _{j\in {\mathscr {N}}(i)}^{} Weight_{i,j}\mid X_i^t - X_j^t\mid \end{aligned}$$
(13)

Here, N is the total number of pixels and i denotes the i-th pixel. \({\mathscr {N}}(i)\) represents the neighboring pixels in its \(5\times 5\) window. \(Weight_{i,j}\) represents the weight, which is specified in Eq. (14), where c indexes the image channels in the YUV color space, and \(\sigma =0.1\) is the standard deviation of the Gaussian kernel.

$$\begin{aligned} Weight_{i,j} = exp(-\frac{\sum _{c}{((y_{i,c}+s_{i,c}^{t-1})-(y_{j,c}+s_{j,c}^{t-1}))}^2}{2\sigma ^2}) \end{aligned}$$
(14)

Experiment

To test the effectiveness of the algorithm, this paper verifies it on multiple datasets and tasks. Firstly, the experimental settings are given, and tests are conducted on public datasets to demonstrate the effectiveness of the algorithm through quantitative comparison and qualitative analysis with existing methods. Then, high-level tasks, including low-light object detection, dark face detection, and nighttime semantic segmentation, are tested and compared with existing algorithms to further validate the effectiveness of the algorithm. Finally, ablation experiments are conducted to verify the effectiveness of each module.

Experimental settings

The experiment is based on PyTorch and conducted on a computer with an Intel i9-10940X CPU, two RTX 3090 GPUs, and 32 GB of memory for training and testing. The main parameters are a batch size of 1, an initial learning rate of \(10^{-4}\), a weight decay of \(\epsilon =10^{-8}\), and 500 training epochs. In the enhancement module, the number of Transformer blocks is set to 4, 6, 6, and 8 from the first layer to the fourth layer, the number of attention heads in MDSA is set to 2, 4, and 8, and the number of channels is set to 48, 96, and 192. StarReLU41 and the Adan42 optimizer are introduced in CUI-Net. StarReLU is a variant of Squared ReLU designed to eliminate distribution shift. StarReLU performs well in both algorithm performance and computational efficiency due to reducing the computational cost of the activation function43. Adan can complete the training of ViT19 with only half the computational cost. Compared with the popular optimizer Adam44, Adan has an additional hyperparameter \(\beta _2\) for adjustment. \(\beta _2\) is set to 0.08 in the experiments42.
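For reference, StarReLU41 applies a learnable scale \(s\) and bias \(b\) to a squared ReLU; the following is a sketch of its form based on the cited description:

$$\begin{aligned} \mathrm {StarReLU}(x) = s\cdot \left( \mathrm {ReLU}(x)\right) ^{2} + b \end{aligned}$$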


To verify the effectiveness and superiority of the proposed algorithm, CUI-Net is compared with state-of-the-art (SOTA) methods, including EnlightenGAN9, KinD45 , ZeroDCE11, ZeroDCE++46, RUAS12, SCI13, and Uretinex-Net47. Additionally, comparisons are made in high-level vision tasks such as face detection, object detection, and semantic segmentation.

Benchmark description and evaluation metrics

For image enhancement testing, 100 random images from the MIT dataset48 and 50 random images from the LSRW dataset49 are used for testing. To quantitatively measure the algorithm’s performance, three full-reference metrics, including PSNR, SSIM, and LPIPS50, and four no-reference metrics, including NIQE51, ILNIQE52, NIMA53, and MUSIQ54, are used as evaluation metrics.

For dark face detection tasks, the DARK FACE dataset55, consisting of 1000 challenging test images, is used. 500 random images are selected as the training set, and 50 images are used for testing, with the average precision (AP) used as the evaluation metric.

For low-light object detection tasks, the ExDark dataset56 specifically designed for low-light object detection is used. 1051 images are selected as the training set, and 406 images are used for testing, with evaluation metrics including \(mAP_{0.5:0.95}\) and \(mAP_{0.5}\).

For nighttime semantic segmentation tasks, the ACDC dataset57 is used. The ACDC dataset is a self-driving dataset released in ICCV 2021. 400 dark condition images are used for training, and the remaining 106 images are used as the test set. The evaluation metrics include IoU and mIoU.

Quantitative and qualitative metrics

The quantitative results on the MIT dataset are shown in Table 1. CUI-Net achieved the best performance in SSIM, PSNR, LPIPS, and ILNIQE among the seven evaluation metrics. Specifically, CUI-Net achieved a PSNR of 19.346 dB, which is 1.0259 dB higher than the best existing algorithm’s score of 18.3201 dB, and an ILNIQE score of 31.9151, which is 1.5756 lower than that of the best existing algorithm.

Table 1 Quantitative results of three supervised metrics (PSNR, SSIM, and LPIPS) and four no-reference metrics (NIQE, NIMA, MUSIQ, and ILNIQE) on the MIT dataset.

The enhancement results on the MIT dataset are shown in Fig. 8. Compared with the ground truth (Fig. 8GT) for the input low-light original image (Fig. 8LL), the EnlightenGAN (Fig. 8a), KinD (Fig. 8b), ZeroDCE (Fig. 8d), SCI (Fig. 8f), and Uretinex (Fig. 8g) methods show inadequate enhancement, while ZeroDCE++ (Fig. 8e) shows over-enhancement. RUAS (Fig. 8c) enhances the white petals in the upper part of the image into a pinkish color, and the overall saturation is too high. In contrast, CUI-Net (Fig. 8h) shows better color restoration while maintaining realistic lighting conditions.

Figure 8

Enhancement results on the MIT dataset: (a) EnlightenGAN; (b) kinD; (c) RUAS; (d) ZeroDCE; (e) ZeroDCE++; (f) SCI; (g) Uretinex; (h) CUI-Net; (LL) is the input low-light original image; (GT) is the ground truth with sequence number E.

The quantitative results on the LSRW dataset are shown in Table 2. Among the seven evaluation metrics, CUI-Net achieved the best result in NIMA and the third-best results in PSNR, NIQE, and MUSIQ. Uretinex achieved good results on the LSRW dataset, which may be because the data augmentation method of the LSRW dataset is similar to that of the LOL dataset used in the supervised training of Uretinex. However, our unsupervised method may be less sensitive to artificially augmented datasets.

Table 2 Quantitative results of three supervised metrics (PSNR, SSIM, and LPIPS) and four non-reference metrics (NIQE, NIMA, MUSIQ, and ILNIQE) on the LSRW dataset.

The enhancement results on the LSRW dataset are shown in Fig. 9. Except for ZeroDCE++ (Fig. 9e), which shows over-enhancement, the overall enhancement effect of the EnlightenGAN (Fig. 9a), KinD (Fig. 9b), RUAS (Fig. 9c), ZeroDCE (Fig. 9d), SCI (Fig. 9f), Uretinex (Fig. 9g), and CUI-Net (Fig. 9h) methods is similar. By enlarging the selected local areas for detailed comparison, the outdoor and indoor parts of the scene are examined separately. RUAS (Fig. 10c), ZeroDCE++ (Fig. 10e), and SCI (Fig. 10f) showed over-exposure in the outdoor scenes. Uretinex (Fig. 10g), which achieved better quantitative results, also showed over-exposure. It is worth noting that even the ground truth (Fig. 10GT) shows over-enhancement in the outdoor scenes compared to the low-light original image (Fig. 10LL). Since CUI-Net (Fig. 10h) can suppress highlight areas under uneven lighting conditions, better enhancement of outdoor scenes may not always contribute to some evaluation metrics. For indoor scenes, EnlightenGAN (Fig. 10a), KinD (Fig. 10b), and ZeroDCE (Fig. 10d) resulted in blurred text and less realistic surface reflections, while CUI-Net not only enhances the details and contours of low-light areas but also restores the realistic lighting conditions of the scene. In addition, CUI-Net can enhance the text on the white paper and paper box on the desk more clearly, which may have practical applications in low-light image text extraction tasks.

Figure 9

Enhanced images on the LSRW dataset: (a) EnlightenGAN; (b) kinD; (c) RUAS; (d) ZeroDCE; (e) ZeroDCE++; (f) SCI; (g) Uretinex; (h) CUI-Net; (LL) the input low-light image; (GT) the ground truth.

Figure 10

Details of the corresponding enlarged areas in Fig. 9 of the LSRW dataset: (a) EnlightenGAN; (b) kinD; (c) RUAS; (d) ZeroDCE; (e) ZeroDCE++; (f) SCI; (g) Uretinex; (h) CUI-Net; (LL) the input low-light image; (GT) the ground truth.

Although CUI-Net has some shortcomings in quantitative metrics on the LSRW dataset, the qualitative analysis of the enhancement results shows some discrepancies between the relevant metrics and subjective observations in practical applications.

We conducted training and testing on the unpaired low-light enhancement datasets MEF58, VV, DICM59, and LIME35, with the qualitative results illustrated in Figs. 11, 12, 13, and 14, respectively. As can be observed, our method effectively prevents overexposure across all four datasets, achieves a satisfactory enhancement of details, and restores realistic shadows and lighting. This can be observed, for instance, in the details of the tabletop, facial features, the flower cluster and door numbers, and the cliff and buildings.

Figure 11

Test result display on the MEF dataset: (a) EnlightenGAN; (b) kinD; (c) RUAS; (d) ZeroDCE; (e) ZeroDCE++; (f) SCI; (g) Uretinex; (h) CUI-Net.

Figure 12

Test result display on the VV dataset: (a) EnlightenGAN; (b) kinD; (c) RUAS; (d) ZeroDCE; (e) ZeroDCE++; (f) SCI; (g) Uretinex; (h) CUI-Net.

Figure 13

Test result display on the DICM dataset: (a) EnlightenGAN; (b) kinD; (c) RUAS; (d) ZeroDCE; (e) ZeroDCE++; (f) SCI; (g) Uretinex; (h) CUI-Net.

Figure 14

Test result display on the LIME dataset: (a) EnlightenGAN; (b) kinD; (c) RUAS; (d) ZeroDCE; (e) ZeroDCE++; (f) SCI; (g) Uretinex; (h) CUI-Net.

The quantitative results are shown in Tables 3, 4, 5, and 6.

Table 3 Quantitative test results on the MEF dataset.
Table 4 Quantitative test results on the VV dataset.
Table 5 Quantitative test results on the DICM dataset.
Table 6 Quantitative test results on the LIME dataset.

From the tables, it can be observed that our method outperforms others in terms of quantitative results on unpaired low-light datasets, further demonstrating the robustness of our approach.

Dark face detection

The DSFD60 face detection framework was utilized for the experiment, which adopts the SSD61 network structure and was trained on the WIDER FACE62 dataset. In the face detection experiment, results from different low-light enhancement methods were used as inputs to DSFD. Finally, we compared the AP (average precision) at different IoU thresholds. The test results are shown in Table 7, where CUI-Net achieved the highest AP values at IoU thresholds of 0.5 and 0.6 and the second-highest AP value at an IoU threshold of 0.7.

Table 7 AP (average precision) at IoU of 0.5, 0.6, 0.7 thresholds.

Figure 15 shows the detection results of different methods and adds the low-light input image (Fig. 15LL) and its face detection result (Fig. 15LD) for comparison. The lower right corner of each method’s result image is the corresponding magnified detail image. It can be seen that at an IoU threshold of 0.5, only RUAS (Fig. 15c) and CUI-Net (Fig. 15h) can detect the face in the area pointed to by the arrow. EnlightenGAN (Fig. 15a), KinD (Fig. 15b), ZeroDCE (Fig. 15d), ZeroDCE++ (Fig. 15e), SCI (Fig. 15f), and Uretinex (Fig. 15g) failed to detect the face in that area. However, RUAS suffers from serious overexposure, and the details on the ground cannot be seen clearly. CUI-Net can not only detect more faces but also produces realistic enhancement effects, with better quantitative indicators than other SOTA methods.

Figure 15

Results of dark face detection: (a) EnlightenGAN; (b) kinD; (c) RUAS; (d) ZeroDCE; (e) ZeroDCE++; (f) SCI; (g) Uretinex; (h) CUI-Net; (LL) Unenhanced low-light image as input; (LD) result of face detection directly on the Unenhanced low-light input image.

Low-light object detection

We trained the YOLOv363 model on the ExDark object detection dataset and tested it on the ExDark validation dataset. YOLOv3 is a series of object detection frameworks and models pre-trained on the COCO dataset64. Unlike face detection experiments, we fine-tuned the YOLOv3 pre-trained model for object detection, i.e., we retrained the object detection model to evaluate the enhancement effects of all methods. Table 8 shows the quantitative results among different methods. CUI-Net achieved the best mAP values in both \(mAP_{0.5:0.95}\) and \(mAP_{0.5}\).

Table 8 Quantitative results of object detection on the ExDark dataset.

The experimental results were obtained by performing object detection on low-light images after enhancement by various SOTA algorithms. The baseline is object detection performed directly on the unenhanced low-light images. The specific detection results for the low-light image (Fig. 16LL) are shown in Fig. 16. Only RUAS (Fig. 16c), ZeroDCE++ (Fig. 16e), Uretinex (Fig. 16g), and CUI-Net (Fig. 16h) recognize most of the targets. EnlightenGAN (Fig. 16a), KinD (Fig. 16b), ZeroDCE (Fig. 16d), SCI (Fig. 16f), and the baseline (Fig. 16LD) did not detect all the targets. The overall average confidence values of RUAS, ZeroDCE++, and Uretinex are lower than those of CUI-Net. In addition, the main reason why RUAS and ZeroDCE++ have lower mAP values in Table 8 is the overexposure problem. However, CUI-Net found a good balance and was able to avoid the lower overall mAP scores caused by overexposure.

Figure 16

Experimental results of object detection on the ExDark dataset: (a) EnlightenGAN; (b) KinD; (c) RUAS; (d) ZeroDCE; (e) ZeroDCE++; (f) SCI; (g) Uretinex; (h) CUI-Net; (LL) Unenhanced low-light image as input; (LD) result of object detection directly on the Unenhanced low-light input image.

Low-light semantic segmentation

We evaluated the performance of all segmentation methods on the ACDC low-light semantic segmentation dataset using the DeepLab-V3+65 model in a pre-training and fine-tuning mode. The pre-trained model was trained on the Cityscapes dataset66. Table 9 shows the mIoU values for multiple categories and the overall average among different low-light enhancement methods. CUI-Net achieved the best mIoU score for six segmentation targets and the second-best score for seven segmentation targets. It outperformed the second-best method by 4.5 in the wall category, 1.9 in the traffic light category, and 6.6 in the motorcycle category. The overall average mIoU value was 2.8 higher than that of the second-best method.

Table 9 mIoU values for multiple categories and the overall average among different low-light enhancement methods.

Table 10 shows the mAcc values for multiple categories and the overall average among different low-light enhancement methods. CUI-Net achieved the highest mAcc values for five segmentation targets, scoring 12.7 higher than the second-best method in the motorcycle category and 22.9 higher in the rider category. CUI-Net also obtained the second-highest mAcc value for four segmentation targets, with an overall mAcc value 5 higher than that of the second-best method.

Table 10 mAcc values for multiple categories and the overall average among different low-light enhancement methods.

Figure 17 shows the overlaid results of semantic segmentation masks and enhanced images on the ACDC dataset. Overall, RUAS (Fig. 17c) and SCI (Fig. 17f) exhibited overexposure. The EnlightenGAN (Fig. 17a), KinD (Fig. 17b), ZeroDCE (Fig. 17d), ZeroDCE++ (Fig. 17e), Uretinex (Fig. 17g), and CUI-Net (Fig. 17h) methods showed no significant differences. However, for nighttime semantic segmentation applications, attention to detail is particularly important, such as the timely segmentation of pedestrians and traffic signs on the road to avoid serious accidents during nighttime autonomous driving.

Figure 17

Segmentation results on the ACDC dataset: (a) EnlightenGAN; (b) kinD; (c) RUAS; (d) ZeroDCE; (e) ZeroDCE++; (f) SCI; (g) Uretinex; (h) CUI-Net.

The local semantic segmentation details for each method corresponding to the red boxes in Fig. 17 are shown in Fig. 18. Comparing with the ground truth in Fig. 19, for the first red box region, which contains two traffic signs, EnlightenGAN (Fig. 18a), KinD (Fig. 18b), RUAS (Fig. 18c), ZeroDCE++ (Fig. 18e), and Uretinex (Fig. 18g) failed to segment both traffic signs, while ZeroDCE (Fig. 18d) and SCI (Fig. 18f) only recognized the left traffic sign. However, CUI-Net (Fig. 18h) was able to recognize both traffic signs. For the middle red box region, which contains two pedestrians and two traffic signs, only ZeroDCE++ (Fig. 18e) and Uretinex (Fig. 18g) recognized both traffic signs, while our CUI-Net (Fig. 18h) additionally recognized a pedestrian. For the right red box region, which contains two pedestrians, only KinD (Fig. 18b), SCI (Fig. 18f), and CUI-Net (Fig. 18h) were able to segment both pedestrians well. In addition, for the pedestrian crossing category, which does not exist in the ACDC dataset, Fig. 17 shows that CUI-Net has the most obvious enhancement effect, which may be useful for nighttime autonomous driving safety. Clearly, CUI-Net has potential in nighttime semantic segmentation tasks.

Figure 18

Enlarged details of the red boxes in Fig. 17: (a) EnlightenGAN; (b) KinD; (c) RUAS; (d) ZeroDCE; (e) ZeroDCE++; (f) SCI; (g) Uretinex; (h) CUI-Net.

Figure 19

Left: Ground truth; Right: Image of zoomed-in details corresponding to the red area in the ground truth image.

Ablation study

To verify whether the network structure of the enhancement module in CUI-Net can improve the model’s enhancement ability, we conducted four ablation experiments on the LSRW dataset for training and testing, and evaluated the quality of the enhanced images using SSIM, PSNR, and LPIPS.

Firstly, to verify whether Adan and StarReLU can accelerate the convergence of the model, we chose to train for 50 epochs. The results are shown in Table 11, where it can be observed that replacing GELU with StarReLU and Adam with Adan leads to better results in a smaller number of epochs.

Table 11 Ablation experiment of replacing GELU and Adam with StarReLU and Adan.

Secondly, to verify whether the network structure design of the enhancement module is effective, we replaced the five blocks in the overall network with either all CNN blocks, all Transformer blocks, or the CUI-Net configuration of three Transformer blocks and two CNN blocks for experimental analysis. The results are shown in Table 12; the network structure of CUI-Net achieves better performance.

Table 12 Replacing the five modules used in the original CUI-Net with different ones.

Thirdly, to verify whether MDSA and CGFN can improve the model’s enhancement ability, we selected MDTA and GDFN in Restormer for the ablation study. The results are shown in Table 13; both MDSA and CGFN improve the performance of the model.

Table 13 Ablation experiments were conducted to compare the network module used in the Transformer block of CUI-Net with MDTA and GDFN.

Finally, an ablation study was conducted on the sparse attention operation on channels in the MDSA module of CUI-Net. The results are shown in Table 14. The \(Topk\_normal\) operation is the usual sparse attention operation, where all attention weights except for the TopK are set to zero. In contrast, the \(Top\_CUI\) operation used in CUI-Net reduces the attention weights of the channels obtained by TopK to a very low value. The results of the ablation study indicate that the sparse attention on channels used in CUI-Net contributes to achieving better enhancement results; a minimal sketch contrasting the two operations follows Table 14.

Table 14 Ablation experiment comparing the usual sparse attention mechanism with the sparse attention mechanism used in the CUI-Net network.
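To make the comparison in Table 14 concrete, the sketch below contrasts the usual Top-k rule with the damping rule used in CUI-Net. Tensor shapes, the value of k, and the masking details are assumptions; only the 0.01 damping value comes from the text.

```python
import torch


def sparse_attention_scores(scores: torch.Tensor, k: int, mode: str = "top_cui") -> torch.Tensor:
    """Contrast of the two sparsification rules compared in Table 14.

    scores: raw channel-attention map of shape (..., C, C), with k < C.
    """
    topk_val, topk_idx = scores.topk(k, dim=-1)
    if mode == "topk_normal":
        # Usual sparse attention: keep only the top-k scores; the rest are
        # masked to -inf so that they become zero after the softmax.
        sparse = torch.full_like(scores, float("-inf")).scatter(-1, topk_idx, topk_val)
    else:
        # Top_CUI: damp the top-k (highlight-dominated) scores to a small value
        # so that low-light channels dominate after the softmax.
        sparse = scores.scatter(-1, topk_idx, 0.01)
    return sparse.softmax(dim=-1)
```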

Conclusion

In this paper, we propose a CUI-Net framework consisting of an enhancement module and an auxiliary module, which can achieve differential enhancement of low-light and highlight regions in low-light environments. In the enhancement module, an efficient low-light enhancement Transformer and CNN network are introduced to enhance low-light images by acquiring global pixel information. In the auxiliary module, a lightweight CNN network is designed to assist the enhancement module to converge better and correct lighting effects. Quantitative analysis and qualitative comparison of CUI-Net with other state-of-the-art low-light image enhancement methods were conducted on two public low-light datasets, demonstrating the effectiveness of the proposed method. Furthermore, the practicality of the method was further verified through high-level vision tasks, namely low-light object detection, dark face detection, and nighttime semantic segmentation.