Introduction

Underwater object detection is crucial for assessing marine biodiversity, including the distribution, quantity, and species of marine life. It plays a key role in monitoring ecological shifts in marine environments and provides essential data for conserving fishery ecosystems [1,2,3]. The efficacy of underwater object detection hinges on the algorithm's adaptability to the multifaceted marine environment, which is distinct from terrestrial and aerial settings due to factors like water quality-induced color shifts, variable light intensity, and other optical disturbances. Challenges such as the prevalence of small-scale object clusters and the detection of marine organisms with complex textures, camouflaged against their surroundings, underscore the importance of developing robust underwater object detection algorithms for fishery ecosystem surveillance.

Current solutions for underwater object detection in complex lighting conditions are categorized into two approaches. The first leverages traditional digital image processing or deep learning-based image enhancement models to clarify images before detection, yielding impressive results [4, 5]. For instance, Zhang et al. enhanced night-time underwater videos using MSRCP before employing DetNASNet and Cascade R-CNN for precise nocturnal fish detection [6]. Similarly, Lu et al. proposed a CNN-based two-stage enhancement method to convert degraded underwater images into clear visuals for subsequent detection tasks [7]. However, these solutions are resource-intensive, requiring computational allocation for both image enhancement and detection models. The second approach bypasses image enhancement, feeding raw images into detection models with modified modules to bolster resistance to optical disturbances. For example, Liu et al. introduced a hybrid attention mechanism within a Deep Residual CNN to iteratively extract features, countering light and shadow effects in marine ecosystems [8]. Despite their complexity, such models lack dedicated modules for diverse light interferences and falter without pre-enhancement.

Addressing the challenge of detecting small target clusters, some strategies involve incorporating multi-scale feature extraction modules or deepening the network to improve overall detection capabilities. Gao et al.'s introduction of an augmented weighted bi-directional feature pyramid network (AWBiFPN) exemplifies this, enhancing the detection of fine-grained features for small objects and achieving high MAP scores across major underwater datasets [9]. However, these enhancements typically increase network size and computational demands, which is problematic for resource-constrained underwater detection applications. Efforts to streamline these expanded networks, such as pruning or novel architectural designs, often struggle to maintain original accuracy levels.

Research on detecting camouflaged underwater objects is scant, lacking a definitive focus. Some studies borrow from conventional camouflage and salient object detection techniques, adjusting network modules for better hidden target identification. Xu et al.'s adversarial learning-based adaptive frame regression network is one such example, outperforming existing models in detecting camouflaged underwater targets [10]. Yet, these adaptations may inadvertently accentuate feature disparities between the camouflaged object's edges and the background, potentially undermining the efficacy of general object detection networks. Addressing camouflage challenges may thus require innovative loss function designs.

This paper introduces a comprehensive approach: initially utilizing deep and ordinary convolutional layers with varying sizes and receptive fields to mimic the light interference countermeasures of cone and rod photoreceptors, followed by employing spatial and channel attention mechanisms for feature weighting. Subsequently, the YOLOv8 backbone's ordinary convolutions are replaced with RepVgg modules, enhancing the network's capability to detect small object clusters through structural reparameterization and feature map fusion. Finally, replacing CIOU with WIOU for regression loss minimizes the impact of adverse gradients on camouflaged features. This method markedly improves overall MAP, as well as MAP scores for small and camouflaged objects, offering a robust solution for underwater detection in complex scenarios, including light interference, small object clustering, and camouflage.

Related work

Analysis of data sets

The dataset utilized in this study, curated by Fu et al. from Dalian University of Technology, represents a comprehensive collection of underwater imagery. This dataset comprises 14,000 high-resolution images and 74,903 labeled instances, segmented into a training and testing set at an 8:2 ratio. The images maintain a minimum resolution of 171 × 262 pixels, with object heights ranging from 1 to 3618 pixels. The labeled categories span a diverse array of marine life, including fish, divers, starfish, coral, turtles, echinuses, holothurians, scallops, cuttlefish, and jellyfish, encompassing 10 distinct categories as referenced in [11]. The distribution of these categories is illustrated in Fig. 1.

Fig. 1
figure 1

Percentage of labels by category

When compared to existing Underwater Object Detection (UOD) datasets, the images in this study encompass a broader range of complex marine environments. For instance, variations in water quality across different areas contribute to light refraction and scattering issues, such as fog effects, color biases, and extreme lighting conditions, both low and high. The natural grouping behavior of certain marine species like fish and jellyfish introduces challenges in detecting small object clusters. Furthermore, marine life exhibits a high degree of morphological diversity within classes and inter-class similarities, often compounded by their camouflage abilities. Examples include cuttlefish and certain species of corals and turtles, which possess subtle camouflage traits. These elements collectively escalate the difficulty of accurately detecting objects underwater. Figure 2 visually demonstrates these complexities, showcasing instances of small object clustering, camouflage challenges, and various light interference issues like fog effects, color deviations, and extreme lighting conditions.

Fig. 2
figure 2

Complex underwater environment

Data pre-processing

This study employs two principal data augmentation strategies to enhance model performance: mosaic data augmentation [12] and mixup data augmentation; a minimal sketch of mixup is given below.
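As a concrete illustration, below is a minimal sketch of mixup as it is commonly implemented for detection tasks; the Beta-distribution parameter and the convention of keeping both images' boxes are typical choices, not details confirmed by the paper.

```python
import numpy as np

def mixup(img_a, img_b, boxes_a, boxes_b, alpha=8.0):
    """Blend two images pixel-wise and keep both sets of boxes.

    img_a, img_b: float32 arrays of identical shape (H, W, 3).
    boxes_a, boxes_b: lists of bounding-box labels for each image.
    alpha: Beta-distribution parameter (a common detection-mixup setting).
    """
    lam = np.random.beta(alpha, alpha)           # mixing coefficient in (0, 1)
    mixed = lam * img_a + (1.0 - lam) * img_b    # pixel-wise blend
    # Detection-style mixup keeps the boxes of both source images; some
    # variants also attach lam / (1 - lam) as per-box loss weights.
    return mixed, boxes_a + boxes_b
```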

The proposed model architecture is sketched in detail in Fig. 6 (right) and Fig. 7; its channel and spatial attention mechanisms, used for feature weighting, are formulated as:

$$Channel\_attention(F_{in}) = \sigma\left(W_2 * W_1 * MAXPool(F_{in}) + W_2 * W_1 * AVGPool(F_{in})\right)$$
$$Spatial\_attention(F_{in}) = \sigma\left(conv_{7 \times 7}\left(\left[MAXPool_{ch}(F_{in});\; AVGPool_{ch}(F_{in})\right]\right)\right)$$
Fig. 7
figure 7

Sketch of the two attention modules: channel attention (left) and spatial attention (right) (r in the left figure represents the channel compression multiplier in the attention module)

Equation 1: \(W_1\) is the weight of the first fully connected layer and \(W_2\) the weight of the second; these fully connected layer weights are shared between the max-pooling and average-pooling branches.
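As a concrete illustration of Eq. (1) and the spatial attention formula, here is a minimal PyTorch sketch in the style of CBAM-type modules; the reduction ratio r, the ReLU between W1 and W2, and the use of 1 × 1 convolutions as shared fully connected layers are implementation assumptions of this sketch rather than details confirmed by the paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Eq. (1): a shared two-layer MLP applied to max- and avg-pooled descriptors."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        # W1 compresses the channels by r, W2 restores them; the weights are
        # shared between the max-pool and avg-pool branches.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // r, kernel_size=1, bias=False),  # W1
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1, bias=False),  # W2
        )

    def forward(self, x):
        max_desc = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))  # MAXPool
        avg_desc = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))  # AVGPool
        return torch.sigmoid(max_desc + avg_desc)                     # sigma(...)

class SpatialAttention(nn.Module):
    """7 x 7 convolution over channel-wise max- and avg-pooled maps, then sigmoid."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x):
        max_map = torch.amax(x, dim=1, keepdim=True)   # MAXPool_ch
        avg_map = torch.mean(x, dim=1, keepdim=True)   # AVGPool_ch
        return torch.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))
```

In use, the input feature map is multiplied element-wise by the channel attention map and then by the spatial attention map.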

Small object and structural re-parameterization improvement networks

Detecting small underwater objects presents significant challenges due to their diminutive scale, mutual occlusion, or overlap among object groups, making it difficult to capture location details and distinguish boundaries [18, 19]. This paper introduces a structural reparameterization multi-scale feature extraction module, inspired by RepVgg, into the model's Backbone and Neck. This innovation enhances the model's proficiency in learning multi-scale features and extracting diverse semantic information on background, foreground, object edges, and textures. It effectively addresses the clustering issue of small underwater objects while achieving lossless compression of the model.

Distinct from conventional multi-scale feature extraction modules (e.g., SPP [20], Inception [21]), RepVgg's architecture [22] differs during training and inference phases. During training, the structure bifurcates into three branches: a 3 × 3 convolution + BN branch, a 1 × 1 convolution + BN branch, and an identity mapping + BN branch. At inference, the module undergoes structural reparameterization, simplifying into a singular 3 × 3 convolutional layer. This reparameterization process involves convolutionalizing and fusing all BN operations with the original convolutional kernel into one operator, converting each branch into a path with only 3 × 3 convolutions. Then, applying the distributive law of convolution operations, the weights and biases in the convolutional kernels are aggregated to form a unified 3 × 3 convolution, enhancing structural efficiency (Fig. 8).

Fig. 8
figure 8

RepVgg module structure reparameterization flow
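To make the training-time topology concrete, the following is a minimal PyTorch sketch of the three-branch structure described above, following the public RepVGG design; the activation choice and the restriction of the identity branch to stride-1, equal-channel blocks are standard RepVGG details assumed here.

```python
import torch.nn as nn

class RepVggBlock(nn.Module):
    """Training-time structure: 3x3 conv + BN, 1x1 conv + BN, identity + BN."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv3x3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn3x3 = nn.BatchNorm2d(channels)
        self.conv1x1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn1x1 = nn.BatchNorm2d(channels)
        self.bn_id = nn.BatchNorm2d(channels)  # identity branch (stride 1 only)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # The three branches are summed, then activated; at inference the sum
        # collapses into a single 3x3 convolution via reparameterization.
        return self.act(self.bn3x3(self.conv3x3(x))
                        + self.bn1x1(self.conv1x1(x))
                        + self.bn_id(x))
```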

In the benchmark model YOLOv8, feature maps fed into the Neck network derive from the second to fourth C2F blocks, excluding the outputs of the initial convolutional module and the first C2F block. However, because the detection of small object clusters is sensitive to spatial information, retaining the shallow feature extraction maps is crucial: these maps are rich in spatial and edge details. Consequently, the output of the first convolutional layer (aligned with the cone-rod cell module's output) and the first C2F block are preserved. These outputs undergo processing via a RepVgg module and a max-pooling module before being concatenated with the detection head's input feature maps (with dimensions of 80 × 80 and 40 × 40) along the channel dimension (Fig. 9), optimizing the model for small object detection in complex underwater environments.
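The routing just described can be pictured with a shape-level sketch that reuses the RepVggBlock above; every channel count, resolution, and pooling stride here is illustrative only, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: a shallow feature map from the first convolutional
# layer is refined by a RepVgg module, downsampled by max-pooling to match
# the 80 x 80 head input, and concatenated along the channel dimension.
shallow = torch.randn(1, 64, 320, 320)                    # early, spatially rich map
refined = RepVggBlock(64)(shallow)                        # RepVgg processing
pooled = nn.MaxPool2d(kernel_size=4, stride=4)(refined)   # 320 -> 80
head_in = torch.randn(1, 256, 80, 80)                     # existing head input
fused = torch.cat([head_in, pooled], dim=1)               # channel-wise concat
print(fused.shape)                                        # torch.Size([1, 320, 80, 80])
```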

$${y}_{i}=\frac{{x}_{i}-{u}_{i}}{\sqrt{{\sigma }_{i}^{2}+\varepsilon }}{\gamma }_{i}+{\beta }_{i}={x}_{i}\frac{{\gamma }_{i}}{{\sigma }_{i}}+{\beta }_{i}-\frac{{\gamma }_{i}{u}_{i}}{{\sigma }_{i}}$$
$$ W_{i}^{\prime } = \frac{{\gamma_{i} }}{{\sigma_{i} }} $$
$$ B^{\prime}_{i} = \beta_{i} - \frac{{\gamma_{i} u_{i} }}{{\sigma_{i} }} $$
Fig. 9
figure 9

Structure of the modified network, modified parts in color

Equation 2: BN layer convolutionalization (\({x}_{i}\) is the input, \({y}_{i}\) the output after BN, \({u}_{i}\) and \({\sigma }_{i}\) the mean and standard deviation of the \(i\)th channel, where \({\sigma }_{i}\) absorbs the \(\sqrt{{\sigma }_{i}^{2}+\varepsilon }\) stabilizer; \(\gamma_{i}\) and \(\beta_{i}\) are the learned parameters of the BN layer, and \(W^{\prime}_{i}\) and \(B^{\prime}_{i}\) the weight and bias after convolutionalization).

$$Out=\left(in*{W}_{1}+{B}_{1}\right)+\left(in*{W}_{2}+{B}_{2}\right)+\left(in*{W}_{3}+{B}_{3}\right)=in*\left({W}_{1}+{W}_{2}+{W}_{3}\right)+\left({B}_{1}+{B}_{2}+{B}_{3}\right)$$

Equation 3: parallel convolutional layer operator fusion (\(in\) represents the input feature map, \({W}_{i}\) the weight of the \(i\)th convolution, and \({B}_{i}\) the bias of the \(i\)th convolution).
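A minimal PyTorch sketch of Eqs. (2) and (3) follows: each BN layer is folded into its preceding convolution, and the resulting parallel kernels and biases are summed into one operator. Padding the 1 × 1 kernel to 3 × 3 and writing the identity branch as a centred-one 3 × 3 kernel, as the standard RepVGG procedure prescribes, are left as noted preconditions.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    """Eq. (2): fold a BN layer into the preceding (bias-free) convolution."""
    std = torch.sqrt(bn.running_var + bn.eps)                      # sigma_i
    weight = conv.weight * (bn.weight / std).reshape(-1, 1, 1, 1)  # W'_i
    bias = bn.bias - bn.weight * bn.running_mean / std             # B'_i
    return weight, bias

@torch.no_grad()
def merge_parallel_branches(weights, biases):
    """Eq. (3): sum parallel 3x3 kernels and biases into a single operator.

    Precondition: 1x1 kernels are zero-padded to 3x3, and the identity
    branch is expressed as a 3x3 kernel with a one at the centre of each
    channel, so that all summands share the same shape.
    """
    return sum(weights), sum(biases)
```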

Camouflage issues with WIOU

Camouflaged object image processing stands as a notable area within computer vision, with considerable research dedicated to the detection and segmentation of camouflaged creatures in natural environments, as evidenced by datasets such as COD10k [23] and NC4k [24]. However, underwater camouflaged object detection remains relatively underexplored, presenting significant challenges for model performance due to the intricacies of object camouflage. This paper aims to enhance the network's ability to detect camouflaged objects by focusing on sample variety and loss function optimization.

The dataset employed comprises camouflaged or weakly camouflaged underwater objects. A critical aspect of successful detection lies in identifying unique textures that predominantly or exclusively characterize a given class. However, during network forward propagation, these distinctive textures can easily blend with common or similar textures and image noise, leading to a dilution or loss of unique texture information. This fusion compromises the accuracy of camouflaged object detection. Addressing this, the paper emphasizes the importance of increasing the representation of unique camouflage features in training to boost detection efficacy.

To this end, the training samples are categorized into three types: high-quality, ordinary, and low-quality samples, as illustrated in Fig. 10. Low-quality samples, characterized by complex environmental noise, present a challenge in object identification and localization due to excessive noise in feature extraction. Accordingly, the loss associated with these samples should be minimized. High-quality samples, with clear and distinct object imaging, are easier to recognize and localize. Despite their clarity, these samples often contain a blend of unique and common textures, suggesting a reduction in computed loss. Ordinary samples, likely to contain textures specific to the camouflaged object, warrant a higher loss weighting. This stratification aims to refine the network's focus on crucial texture details, enhancing the detection of camouflaged objects in underwater settings.

Fig. 10
figure 10

Low-quality samples, normal samples, and high-quality samples (the objects in the figure are holothurians)

The WIOU loss has been developed through three iterations [25]. Because computational resource constraints make a comprehensive grid search for optimal hyperparameter combinations infeasible, this research adopts the parameter combinations recommended by the original paper and uses WIOUv3 for its advancements over WIOUv1. WIOUv1's loss computation is bifurcated into IOU and RWIOU components, with RWIOU accounting for the ratio of the distance between the center points of the labeled and predicted boxes to the diagonal length of the smallest enclosing rectangle. This approach accentuates the IOU loss for ordinary samples while diminishing the RWIOU loss for high-quality samples, thereby lessening the emphasis on center distance when the anchor and object overlap substantially (Fig. 4).

$$IOU = \frac{W_{gt} * H_{gt} + W_{pred} * H_{pred} - W^{\prime} * H^{\prime}}{W^{\prime} * H^{\prime}}$$
$$RWIOU=\text{exp}\left(\frac{{\left({x}_{pred}-{x}_{gt}\right)}^{2}+{\left({y}_{pred}-{y}_{gt}\right)}^{2}}{{W}^{2}+{H}^{2}}\right)$$
$$WiseIO{U}_{v1} = RWIOU*IOU$$

Equation 4: \(WiseIO{U}_{v1}\) loss (\(W^{\prime}\) and \(H^{\prime}\) denote the width and height of the intersection of the labeled and predicted boxes; \(W\) and \(H\) those of their smallest enclosing rectangle; written this way, the IOU term grows as the overlap shrinks, acting as a loss).

WIOUv3 introduces a novel parameter, β, representing the outlier degree calculated as the ratio of a sample's IOU to the average IOUs of all samples within a batch. A small β indicates a high-quality sample, warranting a minimal gradient assignment, whereas a large β signifies a low-quality sample, to which a small weight is assigned to mitigate adverse gradients. This system prioritizes samples of average quality for bounding box regression. The gradient assignment strategy dynamically adapts, optimizing loss values when the outlier degree meets a predefined constant, C. The dynamic nature of IOU_mean ensures that WIOUv3's sample quality criteria and gradient assignment strategies remain optimally aligned with current data characteristics, as depicted in Fig. 11, enhancing the model's overall performance in bounding box regression.

Fig. 11
figure 11

gt_box represents the ground-truth labeling box; pred_box represents the prediction box of the network

$$\beta =\frac{IOU}{IOU\_mean}$$
$$r=\frac{\beta }{\delta * {\alpha }^{\beta -\delta }}$$
$$WiseIO{U}_{v3} = r*WiseIO{U}_{v1}$$

Equation 5: \(WiseIO{U}_{v3}\) loss (\(\delta \) and \(\alpha \) are manually set hyperparameters).
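For concreteness, here is a minimal PyTorch sketch of Eqs. (4) and (5). It reads the IOU term as the IoU loss, as in the Wise-IoU paper [25] (the union-to-intersection ratio of Eq. (4) plays the same loss-like role), tracks IOU_mean externally as a running mean, and uses α and δ defaults of the kind recommended in [25]; all of these are assumptions of the sketch.

```python
import torch

def wiou_v3_loss(pred, gt, iou_loss_mean, alpha=1.9, delta=3.0):
    """pred, gt: (N, 4) boxes as (x1, y1, x2, y2); iou_loss_mean is the
    running mean IoU loss over recent batches (IOU_mean in Eq. 5)."""
    eps = 1e-7
    # Intersection and IoU-based loss term.
    x1 = torch.maximum(pred[:, 0], gt[:, 0]); y1 = torch.maximum(pred[:, 1], gt[:, 1])
    x2 = torch.minimum(pred[:, 2], gt[:, 2]); y2 = torch.minimum(pred[:, 3], gt[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou_loss = 1.0 - inter / (area_p + area_g - inter + eps)

    # RWIOU (Eq. 4): squared centre distance over the squared dimensions of
    # the smallest enclosing box; the denominator is detached from the graph.
    cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cxg, cyg = (gt[:, 0] + gt[:, 2]) / 2, (gt[:, 1] + gt[:, 3]) / 2
    W = torch.maximum(pred[:, 2], gt[:, 2]) - torch.minimum(pred[:, 0], gt[:, 0])
    H = torch.maximum(pred[:, 3], gt[:, 3]) - torch.minimum(pred[:, 1], gt[:, 1])
    dist2 = (cxp - cxg) ** 2 + (cyp - cyg) ** 2
    rwiou = torch.exp(dist2 / (W ** 2 + H ** 2 + eps).detach())

    wiou_v1 = rwiou * iou_loss                    # Eq. (4)
    beta = iou_loss.detach() / iou_loss_mean      # outlier degree (Eq. 5)
    r = beta / (delta * alpha ** (beta - delta))  # dynamic gradient gain
    return r * wiou_v1                            # Eq. (5)
```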

Analysis of experimental results

Experimental environment

The configuration of the experimental machine is shown in Table 1, and the hyperparameter settings are listed in Table 2.

Table 1 Experimental machine configuration
Table 2 Experimental hyperparameter settings

Evaluation index

The primary evaluation metrics of this experiment are the AP (AP0.75) value of each label and the MAP (MAP0.75) value over all labels. The IOU threshold for judging positive and negative samples is 0.75, and the confidence score threshold is 0.5. The formulas for calculating the MAP and the per-category AP are shown in Eq. (6).

$$P_{interp}\left(r\right) = \underset{{r}^{\prime}>r}{\text{max}}\left\{P\left({r}^{\prime}\right)\right\}$$
$$AP=\sum_{i=1}^{n}\left({r}_{i+1}-{r}_{i}\right){P}_{interp}\left({r}_{i+1}\right)$$
$$MAP=\frac{{\sum }_{i=1}^{K}A{P}_{i}}{K}$$

Equation 6: \(i\) indexes the current category, \(K\) is the total number of categories, \(r\) is the current recall value, \(P\) is the precision–recall curve, and \({P}_{interp}\) is the interpolated P–R curve used for the MAP calculation.
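For reference, a small NumPy sketch of Eq. (6); the function names are ours, and the interpolation is computed as a running maximum of precision scanned from high recall to low.

```python
import numpy as np

def interpolated_ap(recall, precision):
    """AP per Eq. (6): interpolate P(r) to its maximum over r' >= r, then
    sum the interpolated precision over each recall increment."""
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)      # assumed sorted ascending
    p_interp = np.maximum.accumulate(precision[::-1])[::-1]  # precision envelope
    r = np.concatenate(([0.0], recall))
    return float(np.sum((r[1:] - r[:-1]) * p_interp))

def mean_ap(ap_per_class):
    """MAP: average of the per-category AP values over the K categories."""
    return float(np.mean(ap_per_class))
```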

Parameter and performance of each combination (module)

The analysis of data presented in Table 3 reveals that the model achieves optimal performance with convolutional kernels sized 7 for rod blocks and 3 for cone blocks, maintaining a kernel number ratio of 3:2. This configuration yields a performance enhancement of 3.6% over the baseline model. Such improvement aligns with the physiological makeup of the human retina, wherein the number of cone cells is fewer than rod cells, and cone cells have a smaller receptive field than rod cells. This correlation substantiates the efficacy of the proposed cone-rod cell module's approach to spatial channel separation, underscoring the heuristic design's validity.

Table 3 Performance of different rod module design options
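As a purely hypothetical illustration of the Table 3 configuration, the sketch below pairs a 7 × 7 "rod" branch with a 3 × 3 "cone" branch at a 3:2 channel ratio; the real module's internals (depthwise versus ordinary convolutions, strides, and the coupling with the attention blocks) follow the paper's figures, which this sketch only approximates.

```python
import torch
import torch.nn as nn

class ConeRodStem(nn.Module):
    """Hypothetical input stem: a rod branch (7x7 kernel, wider receptive
    field, more channels) and a cone branch (3x3 kernel, finer detail,
    fewer channels), concatenated at a 3:2 rod:cone channel ratio."""
    def __init__(self, in_ch: int = 3, out_ch: int = 40):
        super().__init__()
        rod_ch, cone_ch = out_ch * 3 // 5, out_ch * 2 // 5   # 3:2 split
        self.rod = nn.Sequential(
            nn.Conv2d(in_ch, rod_ch, 7, padding=3, bias=False),
            nn.BatchNorm2d(rod_ch), nn.SiLU())
        self.cone = nn.Sequential(
            nn.Conv2d(in_ch, cone_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(cone_ch), nn.SiLU())

    def forward(self, x):
        return torch.cat([self.rod(x), self.cone(x)], dim=1)
```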

Additionally, Table 4 outlines the evolution of network structure parameters, computational demand, and MAP (Mean Average Precision) values throughout the model's design process. Notably, the incorporation of the RepVgg module at two distinct phases resulted in a cumulative MAP increase of 5.3% (2.5% + 2.8%). Subsequent application of reparameterization techniques effectively reduced the increased parameters and computational requirements, achieving lossless compression. When compared to alternative methods such as distillation, pruning, and lightweight modular design, this network architecture demonstrates superior performance enhancements, highlighting its advantages in optimizing model efficiency and effectiveness.

Table 4 Parametric quantities and calculations and performance at each stage of model modification

Comparison of full model effects

The models under comparison are lightweight object detection frameworks, detailed in Tables 5 and 6. The enhanced model introduced in this paper, termed CRWYOLO, is evaluated from three distinct perspectives to gauge its effectiveness. Firstly, the overall performance of the model is considered, which includes the global detection capability (Mean Average Precision, MAP) alongside the number of parameters and computational operations. Secondly, the model's effectiveness in detecting small-volume objects is assessed, examining the Average Precision (AP) values for six categories of small-volume labels: echinus, holothurian, fish, scallop, jellyfish, and starfish. Lastly, the model's performance in detecting camouflaged objects is analyzed through the AP values for two categories of camouflaged object labels: cuttlefish and corals.

Table 5 Model calculations and number of parameters
Table 6 AP/%

This multifaceted evaluation approach allows for a comprehensive assessment of the CRWYOLO model's capabilities, encompassing general detection efficiency, proficiency in recognizing small-volume objects, and accuracy in identifying camouflaged entities.

Global detection results

This study evaluates the viability of an improved lightweight design for underwater object detection, analyzing the model named CRWYOLO. With 4.62 million parameters, CRWYOLO exceeds only YOLOv8-n, YOLOv5_n (v6.1), and Efficientdet-b0 in parameter count among all compared models. Its computational demand, at 10.2 billion operations, is lower than that of most control models. Despite its efficiency, CRWYOLO's Mean Average Precision (MAP) significantly exceeds that of all control models, outdoing the top-performing model, YOLOv8-s, by 5.8%. This demonstrates that strategic modifications at various stages enhance underwater detection capabilities without excessively increasing the network's size or computational burden, thereby maintaining its lightweight design.

Small volume label detection results

The effectiveness of this improved approach is further assessed through the detection of small-sized object clusters across six categories: echinus, holothurian, fish, scallop, jellyfish, and starfish. The proposed detection strategy addresses the challenge of low spatial semantics at multiple scales and enhances the network's learning capacity. This results in improved detection of small-volume objects clustered in underwater environments, with notable improvements in fish, scallop, and jellyfish categories. Specifically, jellyfish detection improved by 1.9 AP values over the control model Detr-resnet50's best score of 66.2. However, enhancements in echinus, starfish, and holothurian categories were less pronounced, with echinus showing only a modest increase to an AP value of 53.1. This indicates the proposed method's effectiveness in tackling small-object clustering issues while also suggesting room for further refinement.

Figure 12 clearly illustrates the differences in detecting small object clusters between CRWYOLO and control models. The figure highlights the superior detection accuracy of CRWYOLO (Model E), particularly in densely clustered fish areas, where it outperforms control models by correctly identifying individual fish with higher probability values. In contrast, control models (A, B, D) inaccurately recognize seaweed as echinus, evidenced by orange detection boxes, showcasing CRWYOLO's enhanced detection capabilities.

Fig. 12
figure 12

A Detr-resnet50, B mobileSSD, C Efficientdet-b1, D YOLOv8-s, E Program of this paper (CRWYOLO), F Real labels

Camouflage labeling detection results

To assess the effectiveness of the proposed method in detecting camouflaged objects, the analysis now focuses on the Average Precision (AP) values for two specific categories: cuttlefish and corals. The cuttlefish category benefits from its distinctive texture, which is more readily identifiable in marine settings, resulting in impressive detection performance across the various control models. Despite this, there remains room for enhancement. The proposed method outperforms the highest AP value reported by a control model for cuttlefish detection, YOLOv8-m (91.3), achieving an increase of 1.9.

Conversely, corals, due to their high similarity to the surrounding marine environment, present a greater challenge, often leading to omissions and false detections. The control models generally exhibit poorer performance in detecting corals. However, the proposed method marks a significant improvement in this category, surpassing the top-performing control model, Centernet-resnet50 (62.2), with an increase of 6.8 in AP value.

These results underscore the utility of incorporating Wise Intersection over Union (WIOU) in underwater object detection tasks, particularly for camouflaged objects. The technique enhances detection by prioritizing the preservation of unique textural information in camouflaged entities, thereby improving the model's overall recognition capabilities.

Although control models A, B, C, and D demonstrated proficiency in identifying cuttlefish within Fig. 13, they faltered in accurately detecting coral labels. In contrast, the model developed in this study (E) effectively distinguishes corals from their environment. This achievement highlights the role of gradient weight attenuation in WIOU, which focuses on balancing the quality of samples during the loss calculation phase, thereby elevating the detection of underwater camouflaged objects.

Fig. 13
figure 13

A Detr-resnet50, B mobileSSD, C Efficientdet-b1, D YOLOv8-s, E Program of this paper (CRWYOLO), F Real labels

Other indicators of model performance

To support the credibility of the models and data in this paper, we also collected several other indicators that measure the performance of deep learning models (these metrics are not analysed here and are provided to the reader only as supplementary experimental data). Figure 14 shows the MAP, F1, precision, and recall curves of CRWYOLO.

Fig. 14
figure 14

MAP, F1, precision, and recall curves

Conclusion

This paper introduces innovative design elements focusing on model architecture and loss function optimization. The first innovation involves an input module inspired by the human retina's optic cone and rod cells. This module is adept at mitigating various types of optical noise prevalent in underwater environments. The second key enhancement is the integration of a structural reparameterization module into the network's Backbone and Neck. This addition significantly bolsters the model's capability to comprehend multi-scale features and image semantics, facilitating lossless compression. As a result, it achieves an effective balance between computational efficiency and the detection performance of small-object clusters.

In addition, the implementation of Wise Intersection over Union (WIOU) plays a crucial role in handling sample quality. It down-weights the loss contribution of noisy, lower-quality samples and suppresses the common textures in high-quality samples that might obscure the unique textures of camouflaged objects. This strategy is specifically tailored to optimize underwater object detection.

Despite these advancements, areas for further improvement remain. The cone-rod cell module, while effective in reducing light interference, adds to the model's parameter count and computational demands. Future research aims to streamline this module, reducing its resource consumption for practical engineering applications. Additionally, while structural reparameterization has improved multi-scale feature extraction, there is potential for further gains in the AP values of certain small object categories such as echinus, holothurian, fish, scallop, and jellyfish. Upcoming experiments will concentrate on improving the matching of positive and negative samples and refining the candidate frame selection algorithm, aiming to develop a more targeted approach for small object clusters.

Furthermore, following structural reparameterization, the model is suitable for deployment on specific hardware. However, its acceleration performance is suboptimal under the TensorRT deployment framework. It is hypothesized that structural reparameterization moves the fusion of operators, normally handled by the deployment framework, up to the model code level [26, 27]. This hypothesis presents an intriguing avenue for future research, offering not only optimization strategies for underwater object detection but also practical insights into the control of structural reparameterization.