Introduction

Bitter gourd (Momordica charantia L.) stands out as a distinctive and versatile melon that enjoys significant popularity in Asia. It has a long, thin, curved shape and green skin. Belonging to the gourd family, bitter gourd combines a unique bitter taste, rich nutritional value, and a distinctive appearance, endearing it to a vast consumer base. Its medicinal properties, such as promoting skin health, aiding weight loss, and providing anti-diabetic benefits, have earned it the moniker “plant insulin”. In orchards, the conventional method of manually picking bitter gourd relies largely on accumulated experience, where attributes such as large volume and a glossy surface are taken as suitable indicators; however, the picking criteria often remain ambiguous. Research indicates that1 the optimal picking time is 10–15 days after blooming, yet the asynchronous flowering period on bitter gourd vines results in fruits of varying maturity on the same plant. This lack of uniformity hinders efficient management and picking. To address these challenges, there is a pressing need for an accurate, real-time method to estimate the maturity of bitter gourd, one that aligns with storage and transportation requirements, enhances yield and quality, and overcomes the difficulties and inefficiencies of manual picking, thereby assisting farmers in the effective management of bitter gourd cultivation.

Advancements in computer vision for fruit maturity estimation in the agricultural sector have shown significant progress. Traditional machine learning methods are commonly used to distinguish the various stages of fruits, relying primarily on color features, as seen in the classification of grapes2, strawberries3, bananas4, and other fruits. Tan et al.5 proposed a distribution algorithm for discerning the maturity of blueberry fruits: a Histogram of Oriented Gradients (HOG) feature vector was used to train a Support Vector Machine (SVM) classifier to swiftly identify fruit regions, and K-Nearest Neighbors classifiers were employed to differentiate fruits at different maturity levels. The prevalence of deep learning has precipitated a surge in research on fruit ripeness recognition. A comprehensive survey by Rizzo et al.6 on fruit ripeness classification concluded that pre-processing and fine-tuning of deep learning models is the most promising approach. Faisal et al.7 introduced a multi-stage intelligent harvest decision system for date palm recognition, employing pre-trained architectures such as VGG-198, Inception-V39, and NASNet10. In another study, Chen et al.11 proposed an improved EfficientDet12 method for olive fruit maturity estimation, which introduced the Convolutional Block Attention Module (CBAM)13 into the feature extraction network to refine the feature mapping between different levels. By enhancing information flow in the feature pyramid network, the improved EfficientDet model achieved a precision of 92.89%, a recall of 93.59%, and an mAP of 94.60% on the test set. The most advanced techniques in maturity estimation include the two-stage Region-based Convolutional Neural Network (RCNN)14,15,16 and the one-stage You Only Look Once (YOLO)17,18.

Figure 2 Shooting scene diagram of bitter gourd dataset.

Following the collection phase, the bitter gourd dataset was manually cleaned, leaving a total of 1121 images. The Labelme annotation tool was used to mark the polygonal regions corresponding to bitter gourd in each image, with the annotations saved in txt format. All 1121 images were analyzed, and the results were compiled to generate bitter gourd instance objects. Subsequently, to further expand the bitter gourd dataset and give the model strong generalization ability in complex scenes, we adopted data augmentation methods, as shown in Fig. 3, including horizontal flipping, translation, rotation, and HSV transforms. As a result, we obtained a final bitter gourd dataset of 2242 images, which was divided into training, validation, and test sets at a ratio of 7:2:1. The detailed division of the bitter gourd dataset is shown in Table 1.
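For illustration, the four augmentations could be composed with torchvision as in the sketch below; the parameter ranges are assumptions, not the values used for the bitter gourd dataset, and for segmentation the polygon annotations must be transformed consistently with the images.

```python
import torchvision.transforms as T

# Minimal sketch of the four augmentations described above (horizontal
# flipping, translation, rotation, HSV/colour jitter). Parameter ranges
# are illustrative assumptions only.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    # RandomAffine covers both translation and rotation
    T.RandomAffine(degrees=15, translate=(0.1, 0.1)),
    # ColorJitter approximates an HSV transform (brightness/saturation/hue)
    T.ColorJitter(brightness=0.4, saturation=0.7, hue=0.015),
])

# usage: augmented_image = augment(pil_image)
```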

Figure 3 Four methods of image enhancement.

Table 1 Bitter gourd dataset partition.

Bitter gourd instance segmentation

In the context of bitter gourd instance segmentation, compared with box-based detection, instance segmentation provides precise positional data for automated picking robots. It accurately delineates object boundaries and shapes, analyzing and annotating each bitter gourd instance in the image individually, and thereby offers a detailed understanding of the characteristic information associated with different instances. This multi-task learning setup exploits the variations and correlations among tasks, enhancing the model's generalization capability. Notably robust against real farm background interference, instance segmentation also provides semantic insight into the object and its exact location within the surrounding environment, and thus performs well across various practical application scenarios. In 2022, the Ultralytics team introduced YOLOv5-seg, an instance segmentation model drawing inspiration from YOLACT5.

To better fit the slender, curved shape of bitter gourd, this study adopts Dynamic Snake Convolution (DS Conv), shown in Fig. 5. The DS Conv process comprises two main components: (1) Offset calculation: an offset field in the X and Y directions is derived by convolving the input feature map. This offset field dynamically adjusts the shape and position of the convolutional kernel in each application; in contrast to common convolutions with square sampling regions, the offsets allow the kernel's shape and position to adapt dynamically. (2) Output feature map: the input feature map is convolved with the offset-adjusted kernel to generate the output feature map. This convolution makes segmentation easier to learn, fits slender, bar-like structures more effectively, and prioritizes core features.
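As a heavily simplified sketch of the offset idea behind DS Conv (not the published implementation), a small convolution can predict per-position offsets for K sampling points, the feature map can be resampled at those deformed positions with bilinear interpolation, and the samples fused by a 1 × 1 convolution. All module and parameter names below are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedSnakeConv(nn.Module):
    """Illustrative sketch of offset-based sampling: a conv predicts (x, y)
    offsets for k sampling points per position, the input is resampled at the
    deformed positions, and the k samples are fused by a 1x1 convolution."""

    def __init__(self, in_ch, out_ch, k=9):
        super().__init__()
        self.k = k
        # offset field: an (x, y) displacement for each of the k sampling points
        self.offset_conv = nn.Conv2d(in_ch, 2 * k, kernel_size=3, padding=1)
        self.fuse = nn.Conv2d(in_ch * k, out_ch, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        offsets = torch.tanh(self.offset_conv(x))            # bounded offsets in [-1, 1]
        offsets = offsets.view(b, self.k, 2, h, w)

        # base sampling grid in normalized coordinates [-1, 1], (x, y) order
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=x.device),
            torch.linspace(-1, 1, w, device=x.device),
            indexing="ij",
        )
        base = torch.stack((xs, ys), dim=-1)                  # (h, w, 2)

        sampled = []
        for i in range(self.k):
            delta = offsets[:, i].permute(0, 2, 3, 1)         # (b, h, w, 2)
            grid = base.unsqueeze(0) + delta                  # displace the base grid
            sampled.append(F.grid_sample(x, grid, align_corners=True))
        return self.fuse(torch.cat(sampled, dim=1))
```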

Figure 5 Dynamic Snake Convolution (DS Conv).

Conceivably, some key techniques in image processing are also inspired by biological behaviors, such as ant colony30 foraging and echolocation by bats31 and dolphins32. The foundation of these motion-like representations in image processing is chain coding, proposed by Freeman33 in 1961. Chain coding is a method for representing image contours that converts continuous pixel contours into a series of connected chain codes; it encodes the shape of the contour and the connectivity between contour points into a sequence, making the contour convenient to represent and process. Building upon the movement behavior observed in ant colony foraging, Mouring et al.30 proposed a new image coding method inspired by ant movement trajectories, which offers a higher compression ratio and is easier to implement than other methods. Other recent research has also drawn inspiration from biological behaviors for chain-coding movements, addressing many real-world problems and exploring new aspects from various angles.
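As a simple illustration of the chain-coding idea, the sketch below encodes an 8-connected contour as Freeman direction codes; the convention (directions numbered counter-clockwise from east, with the y axis pointing down) is an assumption for the example.

```python
# Freeman 8-direction chain coding: each step between consecutive contour
# pixels is encoded as one of 8 direction codes.
FREEMAN_DIRS = {(1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
                (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7}

def chain_code(contour):
    """contour: list of (x, y) coordinates of an 8-connected pixel contour."""
    codes = []
    for (x0, y0), (x1, y1) in zip(contour, contour[1:]):
        codes.append(FREEMAN_DIRS[(x1 - x0, y1 - y0)])
    return codes

# example: chain_code([(0, 0), (1, 0), (2, 1), (2, 2)]) -> [0, 7, 6]
```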

Diverse branch block module

The diverse branch block (DBB) is a versatile building block for convolutional neural networks, introduced by Ding et al.34 with the goal of enhancing a model's feature extraction capability and robustness. In a previous study, Zhang et al.35 successfully incorporated the DBB module into the YOLOX-S model, demonstrating improved feature extraction and strong performance on open datasets. The DBB module enriches the feature space by combining branches of varying scales and complexities, thereby enhancing the representation capability of a single convolution. Similar to the Inception architecture, the inclusion of different receptive fields and multi-path convolution operations of different complexities improves feature extraction. The DBB involves six types of transformations: batch normalization (BN), branch addition, depth concatenation, multi-scale operation, average pooling, and convolution sequence. Let \(I \in {R^{C \times H \times W}}\) be the input, \(O \in {R^{D \times H^{\prime} \times W^{\prime}}}\) the output, \(F \in {R^{D \times C \times K \times K}}\) the K × K convolution kernel, and b the bias term of the K × K convolution. The convolution operation can be expressed as follows:

$$O = I \otimes F + REP(b)$$
(1)

Here \(\otimes\) denotes the convolution operation and \({\text{REP}}({\text{b}}) \in {{\text{R}}^{D \times H^{\prime} \times W^{\prime}}}\) is the bias replicated to the output shape. In practice, a BN layer is usually appended to the bias-free convolution, and the value on the jth output channel is given by Eq. (2):

$${O_{j,:,:}} = \left( {{{\left( {I \otimes F} \right)}_{j,:,:}} - {\mu_j}} \right)\frac{{\gamma_j}}{{\sigma_j}} + {\beta_j}$$
(2)

where \({\mu_j}\) and \({\sigma_j}\) are the accumulated mean and standard deviation of BN, and \({\gamma_j}\) and \({\beta_j}\) are the learned scale factor and bias. Substituting \(F_{j,:,:,:}^{\prime}\) for \(\frac{{\gamma_j}}{{\sigma_j}}{F_{j,:,:,:}}\) and \(b_j^{\prime}\) for \(- \frac{{{\mu_j}{\gamma_j}}}{{\sigma_j}} + {\beta_j}\) gives the BN-fusion formula of Transform I:

$$O_{j,:,:} = I \otimes F_{j,:,:,:}^{\prime} + b_j^{\prime}.$$
(3)
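As a minimal sketch (not the authors' implementation), the BN fusion of Transform I can be expressed in PyTorch by folding the BN statistics and affine parameters into the convolution kernel and bias:

```python
import torch

def fuse_conv_bn(weight, bn):
    """Fold a BatchNorm2d layer into the preceding conv (Transform I, Eq. (3)).
    weight: (D, C, K, K) conv kernel (conv assumed bias-free, as is usual before BN);
    bn: torch.nn.BatchNorm2d over D channels. Returns the fused kernel F' and bias b'."""
    std = torch.sqrt(bn.running_var + bn.eps)            # sigma_j
    scale = bn.weight / std                              # gamma_j / sigma_j
    fused_weight = weight * scale.reshape(-1, 1, 1, 1)   # F'_j = (gamma_j / sigma_j) F_j
    fused_bias = bn.bias - bn.running_mean * scale       # b'_j = beta_j - mu_j gamma_j / sigma_j
    return fused_weight, fused_bias
```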

Two or more convolution branches with the same configuration are additive: their outputs can be combined into a single convolution. The branch merging of Transform II is as follows:

$$F^{\prime} \leftarrow {F_1} + {F_2},\;\;b^{\prime} \leftarrow {b_1} + {b_2}$$
(4)
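Transform II can likewise be illustrated with a short sketch: two convolutions with identical configuration are merged by summing their kernels and biases. The channel sizes below are illustrative, not the model's actual layers.

```python
import torch.nn as nn

# Transform II (Eq. (4)): parallel convolutions with identical configuration
# are additive, so their kernels and biases sum into one convolution.
conv1 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
conv2 = nn.Conv2d(64, 64, kernel_size=3, padding=1)

merged = nn.Conv2d(64, 64, kernel_size=3, padding=1)
merged.weight.data = conv1.weight.data + conv2.weight.data   # F' = F1 + F2
merged.bias.data = conv1.bias.data + conv2.bias.data         # b' = b1 + b2
```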

Transform III merges a sequence of 1 × 1 conv–BN–K × K conv–BN into a single K × K conv. Applying Transform I first reduces the sequence to 1 × 1 Conv–K × K Conv. Let \({F_1} \in {R^{D \times C \times 1 \times 1}}\) be the kernel of the 1 × 1 Conv and \({F_2} \in {R^{E \times D \times K \times K}}\) the kernel of the K × K Conv, with biases \({b_1} \in {R^D}\) and \({b_2} \in {R^E}\). The output is

$$O^{\prime} = \left( {I \otimes {F_1} + REP\left( {b_1} \right)} \right) \otimes {F_2} + REP\left( {b_2} \right).$$
(5)

The fused single-convolution form of Eq. (5) is:

$$O^{\prime} = I \otimes F^{\prime} + REP\left( {b^{\prime}} \right)$$
(6)

Expanding Eq. (5) using the linearity of convolution gives:

$$O^{\prime} = I \otimes {F_1} \otimes {F_2} + REP\left( {b_1} \right) \otimes {F_2} + REP\left( {b_2} \right).$$
(7)

Since \(I \otimes {F_1}\) is a pointwise (1 × 1) linear transformation, it can be folded into \({F_2}\) by transposing \({F_1}\); accordingly, the merged kernel and bias are:

$$F^{\prime} = {F_2} \otimes TRANS\left( {F_1} \right)$$
(8)
$$REP\left( {b^{\prime}} \right) = REP\left( {b_1} \right) \otimes {F_2} + REP\left( {b_2} \right) = REP\left( {\hat b} \right) + REP\left( {b_2} \right)$$
(9)
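For illustration, the kernel-level merge of Transform III (Eqs. (8) and (9)) can be sketched as below, assuming the kernel shapes defined above; the helper name and argument order are illustrative, not the DBB library API.

```python
import torch
import torch.nn.functional as F

def merge_1x1_kxk(k1, b1, k2, b2):
    """Merge a 1x1 conv (k1: D x C x 1 x 1, bias b1: D) followed by a KxK conv
    (k2: E x D x K x K, bias b2: E) into one KxK conv. Sketch of Eqs. (8)-(9)."""
    # F' = F2 conv TRANS(F1): convolve k2 with the transposed 1x1 kernel
    merged_k = F.conv2d(k2, k1.permute(1, 0, 2, 3))              # E x C x K x K
    # REP(b_hat) = REP(b1) conv F2: b_hat_e = sum over d and spatial of k2[e,d] * b1_d
    b_hat = (k2 * b1.reshape(1, -1, 1, 1)).sum(dim=(1, 2, 3))
    return merged_k, b_hat + b2
```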

Transform V converts an average pooling layer into a convolution: a K × K average pooling with stride S applied to a C-channel feature map is equivalent to a K × K convolution with stride S whose kernel \(F^{\prime} \in {R^{C \times C \times K \times K}}\) is defined by Eq. (10):

$$F_{d,c,:,:}^{\prime} = \left\{ {\begin{array}{*{20}{l}} {\frac{1}{{K^2}}}&{\quad if\;\;d = c,} \\ 0&{\quad otherwise.} \end{array}} \right.$$
(10)

Finally, Transform VI handles multi-scale convolution fusion: a \({k_h} \times {k_w}\) \(\left( {{k_h} \leqslant K,\;{k_w} \leqslant K} \right)\) convolution kernel is converted to a K × K kernel by zero padding.
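A brief sketch of Transforms V and VI follows, using PyTorch tensors for illustration; the helper names are assumptions, not part of any published implementation.

```python
import torch
import torch.nn.functional as F

def avgpool_to_conv_kernel(channels, k):
    """Transform V (Eq. (10)): K x K average pooling over C channels equals a
    K x K convolution whose kernel is 1/K^2 on the channel diagonal, 0 elsewhere."""
    kernel = torch.zeros(channels, channels, k, k)
    for c in range(channels):
        kernel[c, c, :, :] = 1.0 / (k * k)
    return kernel

def pad_kernel_to_k(kernel, K):
    """Transform VI: zero-pad a kh x kw kernel (kh, kw <= K) up to K x K."""
    kh, kw = kernel.shape[-2:]
    pad_h, pad_w = (K - kh) // 2, (K - kw) // 2
    # F.pad takes (left, right, top, bottom) for the last two dimensions
    return F.pad(kernel, [pad_w, K - kw - pad_w, pad_h, K - kh - pad_h])
```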

A representative example of the DBB module is shown in Fig. 6; it does not use the depth concatenation of Transform IV25. Following the idea of lightweight network design and structural re-parameterization36, a combination of these transforms is used to enhance the original 3 × 3 convolution. The 1 × 1 convolution kernel is initialized to the identity matrix, and the other convolution kernels use the default initialization. Each operation has a different receptive field and complexity, which improves fine-grained recognition and greatly enriches the feature space. Finally, a nonlinear layer is added after the convolution operation to improve the nonlinear fitting ability.

Figure 6 DBB module.

Focal-EIOU loss

The intersection over union (IOU)37 is a metric commonly used to evaluate target detection models by comparing two-dimensional rectangular boxes. When a model generates a set of bounding boxes, the IOU quantifies the overlap between each predicted bounding box and the corresponding ground-truth bounding box. The IOU is calculated using Eqs. (11) and (12):

$$IoU = \frac{A \cap B}{{A \cup B}}$$
(11)
$${L_{IoU}} = 1 - IoU$$
(12)

This formula reflects the difference between positive and negative samples and is scale invariant. However, when two boxes do not overlap, IoU = 0 regardless of how far apart they are, so the loss cannot reflect the distance between the boxes and provides no gradient for updating the network. The YOLOv5 target detection model uses CIOU38 as the bounding box loss function, calculated with Eqs. (13)–(15):

$$CIoU = IoU - \left( {\frac{{{\rho^2}\left( {b,{b^{gt}}} \right)}}{{c^2}} + \alpha \upsilon } \right)$$
(13)
$$\upsilon = \frac{4}{{\pi^2}}{\left( {\arctan \frac{{{w^{gt}}}}{{{h^{gt}}}} - \arctan \frac{w}{h}} \right)^2}$$
(14)
$$\alpha = \frac{\upsilon }{{\left( {1 - IoU} \right) + \upsilon }}$$
(15)

CIoU takes the aspect ratio into account: α is a weight parameter and υ measures the consistency of the aspect ratio between the predicted and ground-truth boxes. The EIOU39 loss, Eq. (16), is as follows:

$$EIoU = IoU - \frac{{{\rho^2}\left( {b,{b^{gt}}} \right)}}{{c^2}} - \frac{{{\rho^2}\left( {w,{w^{gt}}} \right)}}{{c_w^2}} - \frac{{{\rho^2}\left( {h,{h^{gt}}} \right)}}{{c_h^2}}$$
(16)
$${L_{EIoU}} = 1 - EIoU$$
(17)

Compared with CIOU, regressing the true width and height of the prediction box rather than their ratio eliminates the negative impact of aspect-ratio ambiguity and is more conducive to optimizing network performance. Moreover, during training of the bounding box regression, the bitter gourd dataset presents a notable sample imbalance because of the excessive number of instance objects in the L1 stage, and bounding box regression plays a pivotal role in target positioning performance. Consequently, the experiment employs Focal-EIOU loss as the bounding box regression loss. This choice alleviates the model's tendency to over-concentrate on the expansion stage (L1) and offers a more effective measurement of the bounding box positioning problem. Its formula is:

$${L_{Focal-EIoU}} = IoU^{\gamma} {L_{EIoU}}$$
(18)

where γ is a parameter that controls the degree of outlier suppression.
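As an illustrative sketch (not the training code used in this study), the Focal-EIOU loss of Eqs. (16)–(18) can be written for boxes in (x1, y1, x2, y2) format as follows; the default gamma = 0.5 and the eps terms are assumptions for numerical stability, not values taken from the paper.

```python
import torch

def focal_eiou_loss(pred, target, gamma=0.5, eps=1e-7):
    """Sketch of Focal-EIoU regression loss (Eqs. (16)-(18)), boxes as (x1, y1, x2, y2)."""
    # plain IoU (Eq. (11))
    iw = (torch.min(pred[..., 2], target[..., 2]) - torch.max(pred[..., 0], target[..., 0])).clamp(0)
    ih = (torch.min(pred[..., 3], target[..., 3]) - torch.max(pred[..., 1], target[..., 1])).clamp(0)
    pw, ph = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    tw, th = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    inter = iw * ih
    iou = inter / (pw * ph + tw * th - inter + eps)

    # smallest enclosing box: width, height, squared diagonal
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # squared centre distance rho^2(b, b^gt)
    dist = ((pred[..., 0] + pred[..., 2]) - (target[..., 0] + target[..., 2])) ** 2 / 4 \
         + ((pred[..., 1] + pred[..., 3]) - (target[..., 1] + target[..., 3])) ** 2 / 4

    eiou = iou - dist / c2 - (pw - tw) ** 2 / (cw ** 2 + eps) - (ph - th) ** 2 / (ch ** 2 + eps)
    return iou.clamp(min=eps) ** gamma * (1.0 - eiou)   # Eq. (18): IoU^gamma * L_EIoU
```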

Our model

YOLOv5 (You Only Look Once version 5) is widely regarded as a classic model for object detection and has had a significant influence on computer vision and deep learning. Its key strength lies in processing images through a single-stage convolutional neural network to obtain object categories and coordinates directly. This end-to-end design gives YOLOv5 exceptional real-time performance, swiftly identifying objects in images or videos. YOLOv5 can detect multiple targets of different categories, providing precise bounding box positions while maintaining real-time performance. The YOLO series has evolved from YOLOv1 to YOLOv8, continuously improving detection accuracy, speed, architectural design, and multi-task support. YOLOv5-seg extends YOLOv5 by introducing a mask head for instance segmentation. The model consists of three components: the backbone, the neck, and the head. The backbone extracts features from images; the neck connects the backbone and head to fuse context information and enhance robustness; and the head outputs additional mask matrices for instance segmentation, following a box + class + mask approach. In the down-sampling path, YOLOv5 employs high-performance modules such as Conv, C3, and SPPF. The Conv module combines convolution, BN, and the SiLU activation function. DS Conv performs a serpentine convolution operation before up-sampling, aiming to fit the target object when the neck fuses context information. The C3 module, a more powerful alternative to an ordinary residual block, consists of a 1 × 1 compression convolution, a standard 3 × 3 convolution, and a 1 × 1 expansion convolution with residual connections for feature integration. This study enhances the C3 module with DBB. During training, the C3-DBB module is employed, but the first C3 module of the backbone is retained to extract shallow features; subsequent convolutional modules use DBB for training, allowing a rich combination of shallow features and deeper representations. While the macro structure of the network is preserved, the DBB micro-structure becomes more intricate during training. Importantly, DBB is equivalent to a single convolutional layer at inference time, so the original inference cost is maintained. The improved YOLOv5-seg model is shown in Fig. 7.
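For reference, a minimal sketch of the basic building blocks described above (the Conv module as convolution + BN + SiLU, and a simplified residual bottleneck of the kind used inside C3) is given below; this is a simplified illustration, not the exact YOLOv5-seg code.

```python
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Sketch of the basic Conv module: convolution, batch normalisation, SiLU."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Simplified residual bottleneck of the C3 style: 1x1 compression,
    3x3 convolution, and a shortcut connection."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = ConvBNSiLU(c, c // 2, k=1)
        self.cv2 = ConvBNSiLU(c // 2, c, k=3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))
```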

Figure 7 Improved YOLOv5-seg model.

Model training

This experiment was conducted on a Windows 10 operating system with Python 3.10.6, PyTorch 1.13.1, and CUDA 11.1. GPU training and inference were executed on the AutoDL experimental platform, which runs Linux; a Tesla T4 GPU was used for model training and an Nvidia RTX 4090 for inference. Training ran for 100 epochs with an initial learning rate of 0.01, momentum of 0.937, and a weight decay coefficient of 0.0005, using the SGD optimizer. The input image size was 640 × 640, with a batch size of 32 and 8 data-loading threads.
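A minimal sketch of the optimizer configuration listed above is shown below; the model object is a placeholder module, since the full improved YOLOv5-seg definition is not reproduced here.

```python
import torch

# Placeholder module standing in for the improved YOLOv5-seg network.
model = torch.nn.Conv2d(3, 16, kernel_size=3)

# SGD with the hyperparameters stated in the text.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,             # initial learning rate
    momentum=0.937,      # SGD momentum
    weight_decay=0.0005  # weight decay coefficient
)
```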