Introduction

Low-light image enhancement has been studied for many years and has important applications in areas such as night-time video surveillance and autonomous vehicles. Restoring low-light images to normal-light images with enhancement algorithms provides a solid foundation for subsequent high-level vision tasks, such as object detection, object tracking, and semantic segmentation. At the same time, low-light image enhancement technology is indispensable in fields such as military security and deep-sea exploration1.

Traditional methods2,3,4,5,6,7 for low-light image enhancement are typically based on histogram equalization and Retinex-based approaches. These methods have some effect in increasing the brightness of low-light images, but they often suffer from over-enhancement and detail loss, as well as excessive noise and color distortion due to the reduction of grayscale levels, scene complexity, and unstable prior knowledge extraction1.

With the improvement of computer hardware technology, the speed of data processing has increased greatly. Many deep learning-based methods8,9,10,11,12,13 have shown good performance in the field of low-light image enhancement. Currently, most low-light image enhancement methods are based on convolutional neural networks (CNNs), which learn the mapping relationship from low-light images to normal-light images from a large amount of data through carefully designed CNN structures. However, the limited receptive field of the convolution operation in CNNs cannot fully consider long-distance pixel relationships in the input image, which affects the image enhancement effect14. The self-attention mechanism15,16 in Transformers17,18,19,20,21 can solve this problem. The self-attention mechanism models long-range dependencies, which can better preserve image details and reduce the impact of noise, thereby improving image quality22. Transformer-based methods have made important progress in low-level vision tasks such as image super-resolution23,24, image denoising25, and image dehazing26. Related Transformer methods27,28 have also been applied to low-light image enhancement and have achieved good performance, as they can better model non-local information to achieve high-quality image reconstruction. However, these methods do not enhance the local features of the image well, which is what CNNs excel at. Therefore, recent research29,30,31 has attempted to combine CNN and Transformer networks to unite their advantages and improve the performance of the corresponding tasks. For low-light enhancement tasks, the network architecture needs to account for the fact that low-light images contain far more low-light features than high-light features. Moreover, for low-light enhancement in real scenes, where paired datasets are lacking, zero-shot learning32 methods are needed to better support high-level vision tasks. Specifically, zero-shot learning here means that no paired or unpaired data is needed during training.

The substantial contributions of this study are designed to combat the issue of uneven illumination. Transformers, armed with their global attention mechanism, can comprehensively process long-range pixel relations in an input image. However, the traditional self-attention mechanism demands substantial computational resources, and its multitude of parameters could lead to overfitting. On the other hand, CNN networks are well-regarded for enhancing local features and maintaining robustness, yet they struggle to capture global context information. Integrating the two networks without thoughtful design of the CNN part could lead to ineffective learning of the global features generated by the Transformer network.

Aiming to unite the advantages of CNN’s local feature extraction and the Transformer’s global modeling, the network introduced in this study comes with specific improvements. The complexity of the Transformer module increases linearly, rather than quadratically, with image resolution, facilitating efficient acquisition of contextual information. The CNN module adopts a Transformer-like structure designed to concentrate better on the features extracted by the Transformer, compensating for the difficulty of acquiring global information and thus enhancing the model’s efficiency33. Ablation experiments were conducted during development, and multiple combinations were tested before finalizing the network architecture presented in this paper.

Particularly, the channel attention mechanism of the auxiliary module and the Multi-Dconv head Sparse Attention (MDSA) module designed in this research address, to some extent, the issue of high time and space complexity inherent in traditional Transformers. The introduction of the sparse attention mechanism provides a deeper understanding and handling of the local features in the image. In low-light enhancement tasks, overly bright local features may hinder the model’s ability to capture other critical low-light features. To mitigate this problem, the MDSA module is adopted for a more precise depiction of local features and to boost their enhancement ability, marking the first application of the improved sparse attention mechanism in low-light enhancement tasks.

Figure 1 illustrates that in unevenly lit low-light environments, conventional self-attention mechanisms or ordinary sparse self-attention mechanisms tend to place the primary focus and weight on the highlight features, which is not ideal for low-light enhancement tasks. The sparse self-attention mechanism applied in this study properly biases the main weight towards low-light features while effectively reducing the weight of highlight features, significantly improving the model’s performance in low-light enhancement tasks. This approach, unexplored in prior methods, is one of the innovations of this work.

Figure 1

Handling strategies of different attention mechanisms in unevenly lit low-light environments. The traditional self-attention mechanism generally prefers to place its main focus on highlight features. Furthermore, the conventional sparse self-attention mechanism tends to concentrate a significant portion of the weights on the highlight features. Such an approach is not ideal for low-light enhancement tasks because it results in a tendency for overexposure in highlight areas while inhibiting sufficient enhancement of details in low-light areas. However, our proposed sparse self-attention mechanism breaks away from this norm. It is capable of appropriately shifting the majority of the weights towards low-light features while simultaneously effectively reducing the weights of highlight features. This facilitates a more balanced extraction and processing of features.

Among the two inputs in the Cross Gating Feedforward Network (CGFN) module, one is processed through the MDSA module, and the other bypasses it. The MDSA module implements the sparse attention mechanism on the channel dimension. Therefore, the proposed CGFN calculates weights in the spatial dimension, addressing the lack of spatial information after the feature passes through the MDSA module. Additionally, the presence of the gating mechanism can better suppress the further propagation of information features that are unfavorable to model convergence. In low-light enhancement tasks, the feature information in the highlight area can severely hamper the enhancement quality. The CGFN module can further alleviate this problem, introducing a method not previously seen in other methodologies.

Therefore, considering the characteristics of low-light images under uneven lighting, this article proposes a more effective zero-shot learning low-light enhancement network structure. The main contributions are summarized as follows:

  • A zero-shot learning low-light enhancement network named CUI-Net was designed. The entire network comprises enhancement modules and auxiliary modules. The enhancement module merges the global attention mechanism of the Transformer with the CNN network’s ability to process local features, offering high computational efficiency and powerful modeling capability. This structure enables better handling of uneven lighting, richer feature extraction, and image enhancement in low-light environments. The CNN network in the auxiliary module augments the convergence ability of the enhancement module and indirectly rectifies the influence of lighting.

  • A Multi-Dconv head Sparse Attention (MDSA) module was designed. The MDSA module constrains highlight features at the channel level and increases the weight of important local features. This design helps quell the interference of overly bright features, allowing the model to focus on and extract low-light features better, thereby enhancing the model’s performance in low-light enhancement tasks.

  • A novel Cross Gating Feedforward Network (CGFN) was proposed. CGFN not only effectively suppresses the further spread of feature information that is not conducive to model convergence but also compensates for information loss in the spatial dimension through information exchange, thereby further boosting the efficiency and effect of the model. In low-light enhancement tasks, feature information in highlight areas can seriously disrupt the enhancement quality; the CGFN module further mitigates this problem.

  • A multitude of experiments was conducted on nine challenging datasets. Most of the experimental results indicate that CUI-Net surpasses current state-of-the-art methods in terms of image quality enhancement effects and various evaluation indicators. More importantly, CUI-Net’s superior performance in high-level visual tasks (such as object detection, face detection, and semantic segmentation) in real-world low-light scenarios further validates its practical value and effectiveness.

Related work

Traditional enhancement methods

Traditional low-light enhancement methods can be primarily divided into two types: methods based on histogram equalization (HE) and methods based on the Retinex model. Methods based on HE2,3 redistribute pixel values based on the cumulative distribution function of the input image to expand the dynamic range. However, these methods are also prone to color fidelity loss and the generation of noise, resulting in image distortion4. The Retinex theory5 decomposes low-light images into a reflectance component and an illumination component based on prior knowledge or regularization, as in the Single Scale Retinex model (SSR)6 and the Multi-Scale Retinex model (MSR)7. MSR can be considered a weighted sum of several different SSR outputs. The output of these methods may change the relative proportions of the three enhanced color channels compared to the original image, which can lead to color distortion4. Fu et al.34 proposed a fusion method that combines the advantages of the sigmoid function and histogram equalization, which improves performance compared to2,3. Guo et al.35 initialized the illumination map of the image by finding the maximum value in the RGB channels and then optimized the initial illumination map by adding a structural prior to achieve image enhancement. These methods have some effect in increasing the brightness of low-light images; however, some algorithms ignore the correlation between the bright and dark parts, resulting in color distortion in images with significant brightness differences.

Deep learning-based methods

Most network-based low-light image enhancement algorithms are CNN-based. Among CNN-based methods, Retinex-based approaches usually enhance the illumination and reflection components separately through dedicated sub-networks32. Wei et al.36 introduced the Retinex-Net model, which aims to enhance low-light images. The model comprises two parts: a Decom-Net for decomposing images into illumination and reflection components, and an Enhance-Net for adjusting the illumination. Despite its purpose, Retinex-Net results in significant color distortion, leading to less natural-looking enhanced images8. EnlightenGAN9 uses a Generative Adversarial Network (GAN), which employs an attention-based U-Net10 as the generator and a global-local discriminator to obtain the enhancement results. ZeroDCE11 trains a lightweight network (DCE-NET) to fit the brightness mapping curve, which is then used to adjust the brightness distribution of the image. Retinex-inspired Unrolling with Architecture Search (RUAS)12 uses an unfolding architecture search to handle low-light image enhancement. Self-Calibrated Illumination (SCI)13 proposes a simplified network that fits physical principles to achieve low-light enhancement and introduces a calibration process in the training stage to improve the low-light enhancement model’s ability, thereby further improving the enhancement effect.

Methods combining CNN and transformer

CNN operations provide efficiency and universality, but their receptive fields are limited and cannot fully consider long-range pixel relationships in input images, which can affect image enhancement performance. In contrast, the self-attention mechanism in Transformers focuses on modeling long-range dependencies, enabling it to capture global information well. However, it lacks attention to the most relevant information37, and its complexity grows quadratically with spatial resolution14, leading to poor performance in some tasks. Thus, combining the two effectively to improve image enhancement quality is the focus of this paper. Conformer29 uses a CNN branch and a Transformer branch and combines them through Feature Coupling Units to fuse local convolution blocks, self-attention modules, and MLP units, adjusting feature resolution and channel numbers while continually eliminating semantic differences between the CNN and Transformer branches. HNCT30 integrates CNN and Transformer while using local and non-local priors to extract features beneficial for super-resolution, with an enhanced spatial attention module to further improve performance. ECFAN31 proposes a new hybrid super-resolution method, called ACT, that combines CNN and Vision Transformer19 to effectively aggregate local and non-local features and introduces cross-scale token attention modules to effectively utilize multi-scale token representations.

After careful consideration and experimental comparison, our method uses three Transformer blocks as the encoder to preserve the most useful self-attention values, avoiding the further propagation of aggregated highlight features, allowing useful global features to be fully utilized, and transmitting useful local features to ensure that the enhanced low-light images have sufficient details. Two CNN blocks serve as the decoder to further exploit the feature information obtained from the Transformer blocks, better enhancing the details and texture information of low-light images and leveraging the advantages of CNN networks.

Sparse attention

Images captured in real-world scenarios often suffer from uneven illumination32. For example, images taken at night may contain both dark and bright areas or overexposed regions, such as areas around light sources. Existing methods often enhance both the dark and bright regions of the image simultaneously, which can affect the visual quality of the enhancement results. However, current low-light image enhancement methods have not fully addressed this open problem. Zhao et al.38 proposed sparse Transformer to select the attention degree of the model. Fu et al.37 proposed a target focus network and sparse Transformer technique for visual object tracking. The target focus network focuses on the target of interest in the search region and highlights the features of the most relevant information for better estimating the states of the target. Inspired by SparseTT37, we adapt sparse Transformer to the low-light enhancement task. For low-light images with uneven illumination, the Transformer is susceptible to the influence of high-light features when computing self-attention, resulting in higher attention values. This naturally leads to a bias towards enhancing high-light features rather than low-light features with low attention values when modeling global feature dependencies. Therefore, we propose a sparse attention operation that differs from the usual one, choosing to set high-light features to lower values to effectively suppress high-light information and focus on the most relevant information in low-light enhancement tasks.

Proposed method

In this section, the framework of CUI-Net and its two main modules, the enhancement module and the auxiliary module, are introduced. Finally, we explain the unsupervised training losses used in our neural network model.

Overall procedure

The proposed CUI-Net is a cascaded two-stage image enhancement network (Fig. 2). In the first stage, a Transformer network is introduced to obtain global information, which can better enhance the details of low-light images. In the second stage, an auxiliary network based on multiple convolutional network blocks is constructed, and the original input image is used as a constraint to control the output detail features of the first stage. Unlike traditional methods, the training part of CUI-Net requires multiple enhancement modules and auxiliary modules, while the testing part contains only one enhancement module.

Figure 2

Overall framework of the CUI-Net. Only one enhancement module is used to obtain results during the testing phase.

Here, assume that the low-light input image is \(I\in {\mathbb {R}}^{H\times W\times C}\), where the height is H, the width is W, and the number of channels is C. For RGB images, C equals 3. According to the Retinex theory, the low-light image I can be obtained by performing the following operation on the clear image R and the illumination image L5:

$$\begin{aligned} I = R\otimes L \end{aligned}$$
(1)

Therefore, the enhanced image R can be obtained through input image I and the illumination map L.

During the training process, the entire framework can be divided into two parts, namely the Enhancement Module (EM) and the Auxiliary Module (AM):

$$\begin{aligned} \varepsilon _{t}= & {} E\!M_{t}({\mathscr {A}}_{t-1}+I;\vartheta )+{\mathscr {A}}_{t-1 } \end{aligned}$$
(2)
$$\begin{aligned} {\mathscr {A}}_{t}= & {} A\!M_{t}(I\oslash \varepsilon _t;\mu ),\quad {\mathscr {A}}_{0}=I \end{aligned}$$
(3)

where \(EM_{t}\) is the \(t\)-th image enhancement module network with learnable parameters \(\vartheta\), and \(AM_t\) is the \(t\)-th auxiliary module network with learnable parameters \(\mu\). When \(t=1\), i.e., in \(EM_1\), only the original low-light image I is used as input, i.e., \(EM_1 (I;\vartheta )\); no auxiliary output is added.

Unlike the training part, the auxiliary module is not needed in the testing part; only one enhancement module is used to obtain the clear image:

$$\begin{aligned} R = I\oslash [EM(I;\vartheta )+I] \end{aligned}$$
(4)
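To make the two-stage procedure concrete, the following PyTorch-style sketch traces Eqs. (2)–(4). The `EnhancementModule`/`AuxiliaryModule` internals, the number of stages, and the small epsilon added before division are assumptions for illustration only, not values from the paper.

```python
import torch
import torch.nn as nn


class CUINetCascade(nn.Module):
    """A minimal sketch of the cascaded scheme in Eqs. (2)-(4).

    `enhancement_module` (EM) and `auxiliary_module` (AM) are placeholders for
    the networks described later in the paper.
    """

    def __init__(self, enhancement_module: nn.Module, auxiliary_module: nn.Module, stages: int = 3):
        super().__init__()
        self.em = enhancement_module   # shared parameters theta across stages
        self.am = auxiliary_module     # shared parameters mu across stages
        self.stages = stages
        self.eps = 1e-6                # guard against division by zero (assumption)

    def forward(self, low_light: torch.Tensor):
        """Training pass: returns the per-stage (epsilon_t, A_t) pairs."""
        outputs = []
        aux = None
        for t in range(self.stages):
            if t == 0:
                illum = self.em(low_light)                  # EM_1(I): no auxiliary term yet
            else:
                illum = self.em(aux + low_light) + aux      # Eq. (2)
            aux = self.am(low_light / (illum + self.eps))   # Eq. (3)
            outputs.append((illum, aux))
        return outputs

    @torch.no_grad()
    def enhance(self, low_light: torch.Tensor) -> torch.Tensor:
        """Testing pass, Eq. (4): a single enhancement module suffices."""
        return low_light / (self.em(low_light) + low_light + self.eps)
```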

Image enhancement module

The image enhancement module consists of an efficient Transformer block and a CNN block, serving as the encoder and decoder, respectively. The Transformer block enhances low-light images by filtering out information from unevenly lit channels and local details, and then transferring the useful features to the next part of the network. The core of the Transformer block lies in the Multi-Dconv head Sparse Attention (MDSA) mechanism and the Cross Gating Feedforward Network (CGFN). MDSA can effectively reduce redundant features and increase the weights of important features, thus enhancing the network’s robustness and generalization ability. The cross-gating mechanism can compensate for the lack of information in the spatial dimension, allowing useful information to propagate further and enhancing the integrity of the entire feature representation. The CNN block replaces the attention block in the traditional Transformer network with depth-wise convolutions and the feed-forward layer with a simplified CNN structure, ensuring lightness. Meanwhile, a structure similar to the Transformer network can further process feature information while retaining the generality and efficiency advantages of a convolutional neural network.

In summary, the channel-wise sparse attention and cross-gated Transformer are used as the encoder in the image enhancement module. As the number of layers increases, the extracted features become increasingly abstract and semantically rich. The CNN block is used as the decoder to extract and enhance features at a higher level, making it more suitable for image enhancement tasks under uneven lighting conditions. Realizing pixel-level information transfer and context association through convolution further improves the performance and efficiency of the model.

The specific process of the image enhancement module is shown in Fig. 3. The network structure diagrams of the Transformer and CNN modules in the enhancement module are shown in Fig. 4. First, the input low-light image I undergoes a \(3\times 3\) convolution to extract low-level features and increase the number of channels. It then passes through three Transformer encoders and two CNN decoders; residual connections, together with upsampling and downsampling operations, are utilized to extract sufficient detail features. Finally, a \(3\times 3\) convolution restores the original number of channels, and the result is added to the input low-light image I to produce the final output image. C stands for the concatenation operation.
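A schematic PyTorch-style sketch of this data flow is given below. `transformer_block` and `cnn_block` are constructors for the blocks of Fig. 4; the channel widths, sampling layout, and skip connections are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn


class EnhancementModule(nn.Module):
    """Sketch of Fig. 3: 3x3 conv -> three Transformer encoders ->
    two CNN decoders with skip concatenation -> 3x3 conv -> residual add.
    Assumes H and W are divisible by 4 so that up/downsampling sizes match."""

    def __init__(self, transformer_block, cnn_block, c: int = 48):
        super().__init__()
        self.head = nn.Conv2d(3, c, 3, padding=1)
        # Encoder: Transformer blocks at successively lower resolutions.
        self.enc1, self.enc2, self.enc3 = transformer_block(c), transformer_block(2 * c), transformer_block(4 * c)
        self.down1 = nn.Conv2d(c, 2 * c, 3, stride=2, padding=1)
        self.down2 = nn.Conv2d(2 * c, 4 * c, 3, stride=2, padding=1)
        # Decoder: CNN blocks that fuse upsampled features with encoder skips.
        self.up2 = nn.ConvTranspose2d(4 * c, 2 * c, 2, stride=2)
        self.dec2 = cnn_block(4 * c)          # concat(up2, enc2) -> 4c channels
        self.up1 = nn.ConvTranspose2d(4 * c, c, 2, stride=2)
        self.dec1 = cnn_block(2 * c)          # concat(up1, enc1) -> 2c channels
        self.tail = nn.Conv2d(2 * c, 3, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = self.enc1(self.head(x))
        f2 = self.enc2(self.down1(f1))
        f3 = self.enc3(self.down2(f2))
        d2 = self.dec2(torch.cat([self.up2(f3), f2], dim=1))   # "C" = concatenation
        d1 = self.dec1(torch.cat([self.up1(d2), f1], dim=1))
        return self.tail(d1) + x                               # residual with the input image
```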

Figure 3

Network architecture of the enhancement module.

Figure 4

The network structure diagrams of the Transformer module used in the enhancement module and the CNN module used in both the enhancement and auxiliary modules.

Multi-Dconv head sparse attention

In traditional Transformer modules, multi-head self-attention mechanisms compute global information through self-attention mechanisms in the spatial dimension, resulting in a quadratic growth in complexity with increasing resolution. The main purpose of sparse attention mechanisms is to reduce the time and space complexity of traditional Transformers39. In this paper, the channel attention mechanism used in the MDSA module not only reduces model complexity and improves efficiency, but also helps the model better understand local features in the image. In low-light enhancement tasks, the appearance of too many high-brightness local features may interfere with the model’s ability to capture other low-light features. Therefore, this paper uses sparse attention mechanisms to assist the model in better representing local features and improving its enhancement ability.

The specific structure of MDSA is shown in Fig. 5. The input tensor is denoted as \(I\in {\mathbb {R}}^{{\hat{H}}\times {\hat{W}}\times 3}\). Q, K, and V represent query, key, and value. The \(1\times 1\) point-wise convolution is applied to aggregate pixel-level cross-channel context, followed by a \(3\times 3\) depth-wise convolution to encode channel-level spatial context. The operation \(\circledR\) in the figure stands for reshape. \(I\!s\!I\!n\!M\!ap\) is used to filter out the weights in the attention map matrix that are the same as the weights in the TopK matrix, and set the corresponding weights in the attention map to 0.01.

Figure 5

Network structure diagram of MDSA module.

Different from the Vision Transformer model19, MDSA uses self-attention mechanism to calculate the similarity between each channel, i.e., attention calculation is performed on the channel dimension rather than on the spatial dimension. This enables MDSA to better capture the relationships between feature channels, thereby improving the model’s representation ability and robustness.

Specifically, the TopK operation is performed on the attention map to select the top K attention values, followed by further operations. It should be noted that, unlike general sparse attention calculation, in the low-light task under uneven illumination, the channel information of the high-light area in the attention map is more likely to receive higher attention scores. These K attentions need to be set to 0.01 to allow the low-light channel features to be sent to CGFN for obtaining the required local information.

$$\begin{aligned} {\hat{X}}= & {} W_pSpAttention({\hat{Q}},{\hat{K}},{\hat{V}})+I \end{aligned}$$
(5)
$$\begin{aligned} SpAttention({\hat{Q}},{\hat{K}},{\hat{V}})= & {} {\hat{V}}\cdot softmax(TopK({\hat{K}}\cdot {\hat{Q}}/\lambda )) \end{aligned}$$
(6)

Here, \({\hat{Q}}\in {\mathbb {R}}^{{\hat{H}}{\hat{W}}\times {\hat{C}}},{\hat{K}}\in {\mathbb {R}} ^{{\hat{C}} \times {\hat{H}}{\hat{W}}},{\hat{V}}\in {\mathbb {R}} ^{{\hat{H}}{\hat{W}}\times {\hat{C}}}\) are obtained by reshaping tensors of the original size \({\mathbb {R}}^{{\hat{H}}\times {\hat{W}}\times {\hat{C}}}\). SpAttention denotes sparse attention. \(W_p\) represents a \(1\times 1\) point-wise convolution. \(\lambda\) is a learnable scaling parameter used to control the magnitude of the dot product of \({\hat{K}}\) and \({\hat{Q}}\).
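The following PyTorch-style sketch illustrates the channel-wise sparse attention of Eqs. (5)–(6) with a single head. The damping constant 0.01 applied to the top-k scores follows the description above; the value of k, the projection layout, and the omission of multi-head splitting are simplifying assumptions.

```python
import torch
import torch.nn as nn


class MDSA(nn.Module):
    """Single-head sketch of Multi-Dconv head Sparse Attention (Eqs. 5-6).

    The k largest channel-attention scores (typically highlight-dominated)
    are damped to 0.01 rather than kept; k must be < number of channels.
    """

    def __init__(self, channels: int, k: int = 8):
        super().__init__()
        self.k = k
        self.temperature = nn.Parameter(torch.ones(1))                          # lambda in Eq. (6)
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)             # point-wise conv
        self.qkv_dw = nn.Conv2d(channels * 3, channels * 3, kernel_size=3,
                                padding=1, groups=channels * 3)                 # depth-wise conv
        self.project_out = nn.Conv2d(channels, channels, kernel_size=1)         # W_p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.qkv_dw(self.qkv(x)).chunk(3, dim=1)
        q = q.reshape(b, c, h * w)                                 # C x HW
        k = k.reshape(b, c, h * w)
        v = v.reshape(b, c, h * w)
        attn = (q @ k.transpose(-2, -1)) / self.temperature        # C x C channel attention
        # Sparse step: damp the k largest scores instead of keeping them.
        _, topk_idx = attn.topk(self.k, dim=-1)
        attn = attn.scatter(-1, topk_idx, 0.01)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).reshape(b, c, h, w)
        return self.project_out(out) + x                           # residual, Eq. (5)
```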

Cross-gated feed-forward network

The two inputs of the Cross-Gated feed-forward Network (CGFN) are the input and output obtained through MDSA. The cross-gating part is equivalent to calculating weights on the spatial dimension and weighting specific positions, in order to compensate for the lack of spatial dimension information in the image that has not passed through MDSA.

The specific structure of the CGFN is shown in Fig. 6. Each single path of the CGFN module has two branches. One branch is a gating unit used to obtain the activation state of each pixel: a \(1\times 1\) convolutional layer expands the channel number, followed by a \(3\times 3\) depthwise convolutional layer and StarReLU to generate the gate map. The other branch does not pass through the StarReLU activation function. The two branches are then multiplied element-wise, and the cross-gating is computed across the two paths to compensate for the lack of spatial information. If the input of the CGFN from MDSA is \(X\in {\mathbb {R}}^{{\hat{H}}\times {\hat{W}}\times {\hat{C}}}\) and \(Y\in {\mathbb {R}}^{{\hat{H}}\times {\hat{W}}\times {\hat{C}}}\) is the input from the previous module that has not passed through MDSA, then the CGFN can be represented as follows:

Figure 6

Network structure diagram of CGFN module.

$$\begin{aligned} {\hat{Z}}= & {} W_o^{2}((W_m^{2}(W_o^{1}(W_m^{1}({\hat{X}})\odot {\hat{Y}})+{\hat{X}}))\odot {\hat{X}}) + {\hat{Y}} \end{aligned}$$
(7)
$$\begin{aligned} {\hat{X}}= & {} W_p^0G(X)+X \end{aligned}$$
(8)
$$\begin{aligned} {\hat{Y}}= & {} W_p^0G(Y)+Y \end{aligned}$$
(9)
$$\begin{aligned} G(X)= & {} \phi (W_d^1 W_p^1(LN(X)))\odot W_d^2 W_p^2 (LN(X)) \end{aligned}$$
(10)

where \(\odot\) denotes element-wise multiplication, \(\phi\) represents the StarReLU non-linear activation function, and LN stands for Layer Normalization. \(W_m\) denotes a softmax operation, and \(W_o\) denotes a dropout operation. \({\hat{Z}}\) serves as the input to the next module.
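A PyTorch-style sketch of Eqs. (7)–(10) follows. The StarReLU form (s·ReLU(x)² + b), the softmax dimension (spatial positions), the LayerNorm stand-in, and the channel expansion factor are assumptions; only the overall wiring mirrors the equations.

```python
import torch
import torch.nn as nn


class StarReLU(nn.Module):
    """StarReLU: s * ReLU(x)^2 + b with learnable scale and bias (assumed form)."""

    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return self.scale * torch.relu(x) ** 2 + self.bias


def spatial_softmax(t: torch.Tensor) -> torch.Tensor:
    """Softmax over spatial positions (assumed dimension for W_m)."""
    b, c, h, w = t.shape
    return torch.softmax(t.flatten(2), dim=-1).view(b, c, h, w)


class GatingUnit(nn.Module):
    """G(.) of Eq. (10): two 1x1 + 3x3 depth-wise branches, one gated by StarReLU."""

    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.norm = nn.GroupNorm(1, channels)   # stand-in for LayerNorm over (C, H, W)
        self.pw1 = nn.Conv2d(channels, hidden, 1)
        self.dw1 = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.pw2 = nn.Conv2d(channels, hidden, 1)
        self.dw2 = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = StarReLU()
        self.proj = nn.Conv2d(hidden, channels, 1)   # W_p^0 of Eqs. (8)-(9)

    def forward(self, x):
        n = self.norm(x)
        gate = self.act(self.dw1(self.pw1(n)))   # gate-map branch
        feat = self.dw2(self.pw2(n))             # branch without StarReLU
        return self.proj(gate * feat) + x        # Eqs. (8)/(9)


class CGFN(nn.Module):
    """Sketch of the cross-gating of Eq. (7)."""

    def __init__(self, channels: int, dropout: float = 0.0):
        super().__init__()
        self.gate_x = GatingUnit(channels)
        self.gate_y = GatingUnit(channels)
        self.drop1 = nn.Dropout(dropout)   # W_o^1
        self.drop2 = nn.Dropout(dropout)   # W_o^2

    def forward(self, x, y):
        # x: features that passed through MDSA; y: features that bypassed it.
        xh, yh = self.gate_x(x), self.gate_y(y)
        inner = self.drop1(spatial_softmax(xh) * yh) + xh     # W_o^1(W_m^1(X) . Y) + X
        return self.drop2(spatial_softmax(inner) * xh) + yh   # Eq. (7)
```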

Auxiliary module

The auxiliary module is necessary for unsupervised image enhancement methods as they may have limitations such as over-enhancement and color bias8. Therefore, the CNN network with high efficiency and generalization ability is chosen as the auxiliary module to converge the outputs of multiple enhancement modules to one enhancement effect, enabling the use of one enhancement module during the testing phase to achieve the same enhancement effect as the multiple enhancement modules during the training part.

As shown in Fig. 2 and Formulas (2) and (3), the purpose of the auxiliary module is to correct the input of the enhancement module, indirectly affecting its output. The input of the auxiliary module is obtained by element-wise addition of the outputs of the previous enhancement module and the previous auxiliary module, followed by division of the original low-light image by this sum. Thus, the auxiliary module obtains the features of the enhancement module and corrects the uneven illumination through the original low-light image.

The auxiliary module uses depth-wise convolution multiple times, which effectively reduces the number of parameters and the computational cost, as shown in Fig. 7. First, the input image is passed through a \(3\times 3\) convolution layer to increase the channel number and then through three CNN blocks. Finally, a \(3\times 3\) convolution layer reduces the channel dimension. As shown in Fig. 4, the CNN block enhances local details by passing the input features through \(3\times 3\) and \(5\times 5\) depth-wise convolutions, followed by the StarReLU activation function and multiple \(1\times 1\) convolutions to minimize the number of parameters. The corrected illumination information is then fed into the enhancement module, improving its enhancement effect.
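For concreteness, a PyTorch-style sketch of the auxiliary path is shown below; the channel width, the residual connection inside the CNN block, and the way the two depth-wise branches are fused are assumptions consistent with the description above.

```python
import torch
import torch.nn as nn


class StarReLU(nn.Module):   # s * ReLU(x)^2 + b, as in the CGFN sketch (assumed form)
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return self.scale * torch.relu(x) ** 2 + self.bias


class CNNBlock(nn.Module):
    """Sketch of the CNN block of Fig. 4: parallel 3x3 / 5x5 depth-wise convolutions,
    StarReLU, and 1x1 convolutions; the exact wiring is an assumption."""

    def __init__(self, channels: int):
        super().__init__()
        self.dw3 = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.dw5 = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        self.act = StarReLU()
        self.fuse = nn.Sequential(nn.Conv2d(2 * channels, channels, 1),
                                  nn.Conv2d(channels, channels, 1))

    def forward(self, x):
        local = torch.cat([self.dw3(x), self.dw5(x)], dim=1)   # multi-scale local details
        return self.fuse(self.act(local)) + x                  # residual keeps it lightweight


class AuxiliaryModule(nn.Module):
    """Sketch of Fig. 7: 3x3 conv -> three CNN blocks -> 3x3 conv (channel width assumed)."""

    def __init__(self, channels: int = 48):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)
        self.body = nn.Sequential(*[CNNBlock(channels) for _ in range(3)])
        self.tail = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, x):
        return self.tail(self.body(self.head(x)))
```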

Figure 7

Overall architecture diagram of the auxiliary module.

Training loss

To account for color preservation, artifact removal, and gradient backpropagation, the loss function needs careful design. The loss function used by CUI-Net is as follows:

$$\begin{aligned} {\mathscr {L}} = \alpha {\mathscr {L}}_c + \beta {\mathscr {L}}_s \end{aligned}$$
(11)

Here, \({\mathscr {L}}\) represents the total loss, \({\mathscr {L}}_c\) and \({\mathscr {L}}_s\) represent the correction loss and the smoothness loss respectively, and \(\alpha\) and \(\beta\) are two positive balancing parameters. In the experiments, the balancing parameters are set to \(\alpha =1.5\) and \(\beta =1\). The correction loss \({\mathscr {L}}_c\) ensures consistency between the estimated illumination and the adjusted result, that is:

$$\begin{aligned} {\mathscr {L}}_c = \sum _{x = 1}^{3} {\parallel EM_x - AM_{x-1} \parallel }^2 \end{aligned}$$
(12)

Here, \(EM_x\) is the output of the x-th enhancement module, and \(AM_x\) is the output of the x-th auxiliary module. \(AM_0\) is the original input I. As an unsupervised loss, this loss function only constrains the output through the auxiliary module.

Then, the smoothness loss is used40, that is:

$$\begin{aligned} {\mathscr {L}} _s = \sum _{i = 1}^{N} \sum _{j\in {\mathscr {N}}(i)}^{} Weight_{i,j}\mid X_i^t - X_j^t\mid \end{aligned}$$
(13)

Here, N is the total number of pixels and i denotes the i-th pixel. \({\mathscr {N}}(i)\) represents the neighboring pixels in its \(5\times 5\) window. \(Weight_{i,j}\) represents the weight, which is specified in Eq. (14), where c indexes the image channels in the YUV color space, and \(\sigma =0.1\) is the standard deviation of the Gaussian kernel.

$$\begin{aligned} Weight_{i,j} = exp(-\frac{\sum _{c}{((y_{i,c}+s_{i,c}^{t-1})-(y_{j,c}+s_{j,c}^{t-1}))}^2}{2\sigma ^2}) \end{aligned}$$
(14)

Experiment

To test the effectiveness of the algorithm, this paper verifies it on multiple datasets and tasks. Firstly, the experimental settings are given, and tests are conducted on public datasets to demonstrate the effectiveness of the algorithm through quantitative comparison and qualitative analysis with existing methods. Then, high-level tasks, including low-light object detection, dark face detection, and nighttime semantic segmentation, are tested and compared with existing algorithms to further validate the effectiveness of the algorithm. Finally, ablation experiments are conducted to verify the effectiveness of each module.

Experimental settings

The experiment is based on PyTorch and conducted on a computer with an Intel i9-10940X CPU, two RTX 3090 GPUs, and 32 GB of memory for training and testing. The main parameters are a batch size of 1, an initial learning rate of \(10^{-4}\), a weight decay of \(\epsilon =10^{-8}\), and 500 training epochs. In the enhancement module, the number of Transformer blocks is set to 4, 6, 6, and 8 from the first layer to the fourth layer, the number of attention heads in MDSA is set to 2, 4, and 8, and the number of channels is set to 48, 96, and 192. StarReLU41 and the Adan42 optimizer are introduced in CUI-Net. StarReLU is a variant of Squared ReLU designed to eliminate distribution shift. StarReLU performs well in both algorithm performance and computational efficiency due to reducing the computational cost of the activation function43. Adan can complete the training of ViT19 with only half the computational cost. Compared with the popular optimizer Adam44, Adan has an additional hyperparameter \(\beta _2\) for adjustment. \(\beta _2\) is set to 0.08 in the experiments42.
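For reference, StarReLU41 applies a learnable scale \(s\) and bias \(b\) to a squared ReLU; the following is a sketch of its form based on the cited description:

$$\begin{aligned} \mathrm {StarReLU}(x) = s\cdot \left( \mathrm {ReLU}(x)\right) ^{2} + b \end{aligned}$$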


To verify the effectiveness and superiority of the proposed algorithm, CUI-Net is compared with state-of-the-art (SOTA) methods, including EnlightenGAN9, KinD45 , ZeroDCE11, ZeroDCE++46, RUAS12, SCI13, and Uretinex-Net47. Additionally, comparisons are made in high-level vision tasks such as face detection, object detection, and semantic segmentation.

Benchmark description and evaluation metrics

For image enhancement testing, 100 random images from the MIT dataset48 and 50 random images from the LSRW dataset49 are used for testing. To quantitatively measure the algorithm’s performance, three full-reference metrics, including PSNR, SSIM, and LPIPS50, and four no-reference metrics, including NIQE51, ILNIQE52, NIMA53, and MUSIQ54, are used as evaluation metrics.

For dark face detection tasks, the DARK FACE dataset55, consisting of 1000 challenging test images, is used. 500 random images are selected as the training set, and 50 images are used for testing, with the average precision (AP) used as the evaluation metric.

For low-light object detection tasks, the ExDark dataset56 specifically designed for low-light object detection is used. 1051 images are selected as the training set, and 406 images are used for testing, with evaluation metrics including \(mAP_{0.5:0.95}\) and \(mAP_{0.5}\).

For nighttime semantic segmentation tasks, the ACDC dataset57 is used. The ACDC dataset is a self-driving dataset released in ICCV 2021. 400 dark condition images are used for training, and the remaining 106 images are used as the test set. The evaluation metrics include IoU and mIoU.

Quantitative and qualitative metrics

The quantitative results on the MIT dataset are shown in Table 1. CUI-Net achieved the best performance in SSIM, PSNR, LPIPS, and ILNIQE among the seven evaluation metrics. Specifically, CUI-Net achieved a PSNR of 19.346 dB, which is 1.0259 dB higher than the best existing algorithm’s score of 18.3201 dB, and an ILNIQE score of 31.9151, which is 1.5756 lower than that of the best existing algorithm.

Table 1 Quantitative results of three supervised metrics (PSNR, SSIM, and LPIPS) and four no-reference metrics (NIQE, NIMA, MUSIQ, and ILNIQE) on the MIT dataset.

The enhancement results on the MIT dataset are shown in Fig. 8. Compared with the ground truth (Fig. 8GT) for the input low-light original image (Fig. 8LL), the EnlightenGAN (Fig. 8a), KinD (Fig. 8b), ZeroDCE (Fig. 8d), SCI (Fig. 8f), and Uretinex (Fig. 8g) methods show inadequate enhancement, while ZeroDCE++ (Fig. 8e) shows over-enhancement. RUAS (Fig. 8c) enhances the white petals in the upper part of the image into a pinkish color, and the overall saturation is too high. In contrast, CUI-Net (Fig. 8h) shows better color restoration while maintaining realistic lighting conditions.

Figure 8

Enhancement results on the MIT dataset: (a) EnlightenGAN; (b) kinD; (c) RUAS; (d) ZeroDCE; (e) ZeroDCE++; (f) SCI; (g) Uretinex; (h) CUI-Net; (LL) is the input low-light original image; (GT) is the ground truth with sequence number E.

The quantitative results on the LSRW dataset are shown in Table 2. Among the seven evaluation metrics, CUI-Net achieved the best result in NIMA and the third-best results in PSNR, NIQE, and MUSIQ. Uretinex achieved good results on the LSRW dataset, which may be because the data augmentation method of the LSRW dataset is similar to that of the LOL dataset used in the supervised training of Uretinex. However, our unsupervised method may be less sensitive to artificially augmented datasets.

Table 2 Quantitative results of three supervised metrics (PSNR, SSIM, and LPIPS) and four non-reference metrics (NIQE, NIMA, MUSIQ, and ILNIQE) on the LSRW dataset.

The enhancement results on the LSRW dataset are shown in Fig. 9. Except for ZeroDCE++ (Fig. 9e), which shows over-enhancement, the overall enhancement effect of the EnlightenGAN (Fig. 9a), KinD (Fig. 9b), RUAS (Fig. 9c), ZeroDCE (Fig. 9d), SCI (Fig. 9f), Uretinex (Fig. 9g), and CUI-Net (Fig. 9h) methods is similar. By enlarging the selected local areas for detailed comparison, the outdoor and indoor parts of the scene are examined separately. RUAS (Fig. 10c), ZeroDCE++ (Fig. 10e), and SCI (Fig. 10f) showed over-exposure in the outdoor scenes. Uretinex (Fig. 10g), which achieved better quantitative results, also showed over-exposure. It is worth noting that even the ground truth (Fig. 10GT) shows over-enhancement in the outdoor scenes compared to the low-light original image (Fig. 10LL). Since CUI-Net (Fig. 10h) can suppress highlight areas under uneven lighting conditions, better enhancement of outdoor scenes may not always contribute to some evaluation metrics. For indoor scenes, EnlightenGAN (Fig. 10a), KinD (Fig. 10b), and ZeroDCE (Fig. 10d) resulted in blurred text and less realistic surface reflections, while CUI-Net not only enhances the details and contours of low-light areas but also restores the realistic lighting conditions of the scene. In addition, CUI-Net can enhance the text on the white paper and paper box on the desk more clearly, which may have practical applications in low-light image text extraction tasks.

Figure 9

Enhanced images on the LSRW dataset: (a) EnlightenGAN; (b) kinD; (c) RUAS; (d) ZeroDCE; (e) ZeroDCE++; (f) SCI; (g) Uretinex; (h) CUI-Net; (LL) the input low-light image; (GT) the ground truth.

Figure 10

Details of the corresponding enlarged areas in Fig. 9 of the LSRW dataset: (a) EnlightenGAN; (b) kinD; (c) RUAS; (d) ZeroDCE; (e) ZeroDCE++; (f) SCI; (g) Uretinex; (h) CUI-Net; (LL) the input low-light image; (GT) the ground truth.

Although CUI-Net has some shortcomings in quantitative metrics on the LSRW dataset, the qualitative analysis of the enhancement results shows some discrepancies between the relevant metrics and subjective observations in practical applications.

We conducted training and testing on the unpaired low-light enhancement datasets MEF58, VV, DICM59, and LIME35, with the qualitative results illustrated in Figs. 11, 12, 13, and 14, respectively. As can be observed, our method effectively prevents overexposure across all four datasets, achieves a satisfactory enhancement of details, and restores realistic shadows and lighting. This can be observed, for instance, in the details of the tabletop, facial features, the flower cluster and door numbers, and the cliff and buildings.

Figure 11

Test result display on the MEF dataset: (a) EnlightenGAN; (b) kinD; (c) RUAS; (d) ZeroDCE; (e) ZeroDCE++; (f) SCI; (g) Uretinex; (h) CUI-Net.

Figure 12

Test result display on the VV dataset: (a) EnlightenGAN; (b) kinD; (c) RUAS; (d) ZeroDCE; (e) ZeroDCE++; (f) SCI; (g) Uretinex; (h) CUI-Net.

Figure 13

Test result display on the DICM dataset: (a) EnlightenGAN; (b) kinD; (c) RUAS; (d) ZeroDCE; (e) ZeroDCE++; (f) SCI; (g) Uretinex; (h) CUI-Net.

Figure 14

Test result display on the LIME dataset: (a) EnlightenGAN; (b) kinD; (c) RUAS; (d) ZeroDCE; (e) ZeroDCE++; (f) SCI; (g) Uretinex; (h) CUI-Net.

The quantitative results are shown in Tables 3, 4, 5, and 6.

Table 3 Quantitative test results on the MEF dataset.
Table 4 Quantitative test results on the VV dataset.
Table 5 Quantitative test results on the DICM dataset.
Table 6 Quantitative test results on the LIME dataset.

From the tables, it can be observed that our method outperforms others in terms of quantitative results on unpaired low-light datasets, further demonstrating the robustness of our approach.

Dark face detection

The DSFD60 face detection framework was utilized for the experiment, which adopts the SSD61 network structure and was trained on the WIDER FACE62 dataset. In the face detection experiment, results from different low-light enhancement methods were used as inputs to DSFD. Finally, we compared the AP (average precision) at different IoU thresholds. The test results are shown in Table 7, where CUI-Net achieved the highest AP values at IoU thresholds of 0.5 and 0.6 and the second-highest AP value at an IoU threshold of 0.7.

Table 7 AP (average precision) at IoU of 0.5, 0.6, 0.7 thresholds.

Figure 15 shows the detection results of different methods and adds the low-light input image (Fig. 15LL) and its face detection result (Fig. 15LD) for comparison. The lower right corner of each method’s result image is the corresponding magnified detail image. It can be seen that at an IoU threshold of 0.5, only RUAS (Fig. 15c) and CUI-Net (Fig. 15h) can detect the face in the area pointed to by the arrow. EnlightenGAN (Fig. 15a), KinD (Fig. 15b), ZeroDCE (Fig. 15d), ZeroDCE++ (Fig. 15e), SCI (Fig. 15f), and Uretinex (Fig. 15g) failed to detect the face in that area. However, RUAS suffers from serious overexposure, and the details on the ground cannot be seen clearly. CUI-Net can not only detect more faces but also produces realistic enhancement effects, with better quantitative indicators than other SOTA methods.

Figure 15

Results of dark face detection: (a) EnlightenGAN; (b) kinD; (c) RUAS; (d) ZeroDCE; (e) ZeroDCE++; (f) SCI; (g) Uretinex; (h) CUI-Net; (LL) Unenhanced low-light image as input; (LD) result of face detection directly on the Unenhanced low-light input image.

Low-light object detection

We trained the YOLOv363 model on the ExDark object detection dataset and tested it on the ExDark validation dataset. YOLOv3 is a series of object detection frameworks and models pre-trained on the COCO dataset64. Unlike face detection experiments, we fine-tuned the YOLOv3 pre-trained model for object detection, i.e., we retrained the object detection model to evaluate the enhancement effects of all methods. Table 8 shows the quantitative results among different methods. CUI-Net achieved the best mAP values in both \(mAP_{0.5:0.95}\) and \(mAP_{0.5}\).

Table 8 Quantitative results of object detection on the ExDark dataset.

The experimental results were obtained by performing object detection on low-light images after enhancement by various SOTA algorithms. The baseline is object detection performed directly on the unenhanced low-light images. The specific detection results for the low-light image (Fig. 16LL) are shown in Fig. 16. Only RUAS (Fig. 16c), ZeroDCE++ (Fig. 16e), Uretinex (Fig. 16g), and CUI-Net (Fig. 16h) recognize most of the targets. EnlightenGAN (Fig. 16a), KinD (Fig. 16b), ZeroDCE (Fig. 16d), SCI (Fig. 16f), and the baseline (Fig. 16LD) did not detect all the targets. The overall average confidence values of RUAS, ZeroDCE++, and Uretinex are lower than those of CUI-Net. In addition, the main reason why RUAS and ZeroDCE++ have lower mAP values in Table 8 is the overexposure problem. However, CUI-Net found a good balance and was able to avoid the lower overall mAP scores caused by overexposure.

Figure 16

Experimental results of object detection on the ExDark dataset: (a) EnlightenGAN; (b) KinD; (c) RUAS; (d) ZeroDCE; (e) ZeroDCE++; (f) SCI; (g) Uretinex; (h) CUI-Net; (LL) Unenhanced low-light image as input; (LD) result of object detection directly on the Unenhanced low-light input image.

Low-light semantic segmentation

We evaluated the performance of all segmentation methods on the ACDC low-light semantic segmentation dataset using the DeepLab-V3+65 model in a pre-training and fine-tuning mode. The pre-trained model was trained on the Cityscapes dataset66. Table 9 shows the mIoU values for multiple categories and the overall average among different low-light enhancement methods. CUI-Net achieved the best mIoU score for six segmentation targets and the second-best score for seven segmentation targets. It outperformed the second-best method by 4.5 in the wall category, 1.9 in the traffic light category, and 6.6 in the motorcycle category. The overall average mIoU value was 2.8 higher than that of the second-best method.

Table 9 mIoU values for multiple categories and the overall average among different low-light enhancement methods.

Table 10 shows the mAcc values for multiple categories and the overall average among different low-light enhancement methods. CUI-Net achieved the highest mAcc values for five segmentation targets, scoring 12.7 higher than the second-best method in the motorcycle category and 22.9 higher in the rider category. CUI-Net also obtained the second-highest mAcc value for four segmentation targets, with an overall mAcc value 5 higher than that of the second-best method.

Table 10 mAcc values for multiple categories and the overall average among different low-light enhancement methods.

Figure 17 shows the overlaid results of semantic segmentation masks and enhanced images on the ACDC dataset. Overall, RUAS (Fig. 17c) and SCI (Fig. 17f) exhibited overexposure. The EnlightenGAN (Fig. 17a), KinD (Fig. 17b), ZeroDCE (Fig. 17d), ZeroDCE++ (Fig. 17e), Uretinex (Fig. 17g), and CUI-Net (Fig. 17h) methods showed no significant differences. However, for nighttime semantic segmentation applications, attention to detail is particularly important, such as the timely segmentation of pedestrians and traffic signs on the road to avoid serious accidents during nighttime autonomous driving.

Figure 17

Segmentation results on the ACDC dataset: (a) EnlightenGAN; (b) kinD; (c) RUAS; (d) ZeroDCE; (e) ZeroDCE++; (f) SCI; (g) Uretinex; (h) CUI-Net.

The local semantic segmentation details for each method corresponding to the red boxes in Fig. 17 are shown in Fig. 18. Comparing with the ground truth in Fig. 19, for the first red box region, which contains two traffic signs, EnlightenGAN (Fig. 18a), KinD (Fig. 18b), RUAS (Fig. 18c), ZeroDCE++ (Fig. 18e), and Uretinex (Fig. 18g) failed to segment both traffic signs, while ZeroDCE (Fig. 18d) and SCI (Fig. 18f) only recognized the left traffic sign. However, CUI-Net (Fig. 18h) was able to recognize both traffic signs. For the middle red box region, which contains two pedestrians and two traffic signs, only ZeroDCE++ (Fig. 18e) and Uretinex (Fig. 18g) recognized both traffic signs, while our CUI-Net (Fig. 18h) additionally recognized a pedestrian. For the right red box region, which contains two pedestrians, only KinD (Fig. 18b), SCI (Fig. 18f), and CUI-Net (Fig. 18h) were able to segment both pedestrians well. In addition, for the pedestrian crossing category, which does not exist in the ACDC dataset, Fig. 17 shows that CUI-Net has the most obvious enhancement effect, which may be useful for nighttime autonomous driving safety. Clearly, CUI-Net has potential in nighttime semantic segmentation tasks.

Figure 18

Enlarged details of the red boxes in Fig. 17: (a) EnlightenGAN; (b) KinD; (c) RUAS; (d) ZeroDCE; (e) ZeroDCE++; (f) SCI; (g) Uretinex; (h) CUI-Net.

Figure 19

Left: Ground truth; Right: Image of zoomed-in details corresponding to the red area in the ground truth image.

Ablation study

To verify whether the network structure of the enhancement module in CUI-Net can improve the model’s enhancement ability, we conducted four ablation experiments on the LSRW dataset for training and testing, and evaluated the quality of the enhanced images using SSIM, PSNR, and LPIPS.

Firstly, to verify whether Adan and StarReLU can accelerate the convergence of the model, we chose to train for 50 epochs. The results are shown in Table 11, where it can be observed that replacing GELU with StarReLU and Adam with Adan leads to better results in a smaller number of epochs.

Table 11 Ablation experiment of replacing GELU and Adam with StarReLU and Adan.

Secondly, to verify whether the network structure design of the enhancement module is effective, we replaced the five blocks in the overall network with either all CNN blocks, all Transformer blocks, or the CUI-Net configuration of three Transformer blocks and two CNN blocks for experimental analysis. The results are shown in Table 12; the network structure of CUI-Net achieves better performance.

Table 12 Replacing the five modules used in the original CUI-Net with different ones.

Thirdly, to verify whether MDSA and CGFN can improve the model’s enhancement ability, we selected MDTA and GDFN in Restormer for the ablation study. The results are shown in Table 13; both MDSA and CGFN improve the performance of the model.

Table 13 Ablation experiments were conducted to compare the network module used in the Transformer block of CUI-Net with MDTA and GDFN.

Finally, an ablation study was conducted on the sparse attention operation on channels in the MDSA module of CUI-Net. The results are shown in Table 14. The \(Topk\_normal\) operation is the usual sparse attention operation, where all attention weights except for the TopK are set to zero. In contrast, the \(Top\_CUI\) operation used in CUI-Net reduces the attention weights of the channels obtained by TopK to a very low value. The results of the ablation study indicate that the sparse attention on channels used in CUI-Net contributes to achieving better enhancement results; a minimal sketch contrasting the two operations follows Table 14.

Table 14 Ablation experiment comparing the usual sparse attention mechanism with the sparse attention mechanism used in the CUI-Net network.
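To make the comparison in Table 14 concrete, the sketch below contrasts the usual Top-k rule with the damping rule used in CUI-Net. Tensor shapes, the value of k, and the masking details are assumptions; only the 0.01 damping value comes from the text.

```python
import torch


def sparse_attention_scores(scores: torch.Tensor, k: int, mode: str = "top_cui") -> torch.Tensor:
    """Contrast of the two sparsification rules compared in Table 14.

    scores: raw channel-attention map of shape (..., C, C), with k < C.
    """
    topk_val, topk_idx = scores.topk(k, dim=-1)
    if mode == "topk_normal":
        # Usual sparse attention: keep only the top-k scores; the rest are
        # masked to -inf so that they become zero after the softmax.
        sparse = torch.full_like(scores, float("-inf")).scatter(-1, topk_idx, topk_val)
    else:
        # Top_CUI: damp the top-k (highlight-dominated) scores to a small value
        # so that low-light channels dominate after the softmax.
        sparse = scores.scatter(-1, topk_idx, 0.01)
    return sparse.softmax(dim=-1)
```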

Conclusion

In this paper, we propose a CUI-Net framework consisting of an enhancement module and an auxiliary module, which can achieve differential enhancement of low-light and highlight regions in low-light environments. In the enhancement module, an efficient low-light enhancement Transformer and CNN network are introduced to enhance low-light images by acquiring global pixel information. In the auxiliary module, a lightweight CNN network is designed to assist the enhancement module to converge better and correct lighting effects. Quantitative analysis and qualitative comparison of CUI-Net with other state-of-the-art low-light image enhancement methods were conducted on two public low-light datasets, demonstrating the effectiveness of the proposed method. Furthermore, the practicality of the method was further verified through high-level vision tasks, namely low-light object detection, dark face detection, and nighttime semantic segmentation.