Introduction

Pineapples have been grown commercially in Taiwan for over 300 years. The pineapple is a perennial herbaceous fruit crop well suited to cultivation in the hilly areas of central and southern Taiwan; however, it is grown on flat land for the purposes of mass cultivation. Taiwan has bred a number of pineapple varieties with different production periods. By 2002, Taiwan's output had reached 2.56 million metric tons, and by 2018, annual earnings had reached 43 million US dollars.

Pineapples are multiple fruits formed from many flowers, and although the plants in a given pineapple field may flower at roughly the same time, the times at which the individual florets open can vary widely. Under these conditions, it is often necessary to harvest a single field three or four times. After flowering, the pineapple1 enters the fruit-development period (BBCH code 7), during which the fruit begins to develop and expand. The head of the fruit gradually changes from conical to smooth, and the color of the peel gradually changes from reddish brown (BBCH code 703) to dark green (BBCH code 705). During the final stage of development (BBCH code 709), the size of the fruit does not change, but the color of the peel changes from dark green to light green. In the fruit-ripening stage (BBCH code 8), the fruit gradually turns yellow and emits the characteristic pineapple aroma. Commercially available pineapple fruit can be divided into solid-sounding, hollow-sounding, and semi-hollow-sounding fruit2, according to the sound produced when the fruit is tapped. Solid-sounding fruit is characterized by low acidity, thick flesh, and a bright golden color; it is more mature than hollow-sounding fruit, and the appropriate time to harvest it is when the peel is light green and dusted with white powder (once the peel turns yellow, the fruit is overripe). Hollow-sounding fruit is characterized by lower water content, thinner fibers, strong sweetness, strong acidity, strong fragrance, and a light milky-yellow color; the appropriate time to harvest it is while the peel is ripening from green to yellow. Semi-hollow-sounding fruit falls between these two.

Mechanization has not been widely adopted for the harvesting of fruit in pineapple fields3. The lack of automated systems for assessing fruit ripeness necessitates the presence of experienced professionals in the field during harvesting. Note that non-destructive sugar-content detectors developed specifically for this purpose cannot replace professionals in the field. Sourcing workers for pineapple harvesting is a perennial problem, because the work requires heavy manual labor as well as experience in fruit identification.

It is not uncommon to cover ripening pineapples with paper hats, paper bags, or black shading nets for sun protection and the prevention of dehiscence. Unfortunately, covering the fruit in this way makes it impossible to determine its ripeness without manually lifting the cover and checking with the naked eye4. The system developed in the current study was based on the assumption that paper hats leave at least half of the fruit visible from below.

Numerous researchers have investigated methods of determining the maturity and variety of pineapples. Azman et al. used a convolutional neural network (CNN) for the indoor assessment of ripeness (unripe, partially ripe, and fully ripe), achieving a classification accuracy of 94% when applied to 27 samples of known ripeness5. Angel et al. used color analysis and image-recognition software to classify pineapple maturity into 11 levels under indoor conditions6, achieving a classification accuracy of 96% for 550 pineapple samples. Researchers in Thailand examined two pineapple varieties using gradient analysis of textures7. Sanyal et al. reported on the use of computer technology to diagnose diseases with the aim of increasing pineapple production8. Amjad et al. used VGG16 to identify bromeliads, achieving a classification accuracy of 100%9. Edwin et al. used image processing and fuzzy theory to classify post-harvest maturity into four categories, achieving a classification accuracy of 90% for partial maturity and 95% overall10. Note, however, that all of the methods described above rely on a controllable light source under indoor conditions. Wan et al. used a UAV equipped with a color camera to count pineapple crowns in a field, achieving an accuracy of 94%11.

Deep learning is a machine-learning technique that uses multi-layer artificial neural networks to analyze signals. In the current study, we employed a CNN for the real-time detection of pineapple ripeness in the field using the following four classes: unripe, ripe, ripening, and bagged. Unripe: un-harvestable; the fruit has not completed development and has not yet reached its final size. Ripe: solid-sounding fruit that must be harvested when the dark green peel turns light green; at this point, the fruit has reached its final size. Ripening: the peel is turning yellow. Bagged: the peel cannot be seen, so maturity cannot be identified.

To deal with the difficulties of real-time computing and cost, we implemented the YOLO12 pre-trained network on an NVIDIA TX2 embedded system. An Intel D435i13 camera provided the input data to the TX2, supplying RGB and depth data for each frame. We assume that this device (camera and embedded system) is mounted on the platform of an elevated tracked vehicle and attached to a simple harvesting mechanism.

Results

In this section, we discuss the effects of training on recognition performance for different databases (Table 1). We also discuss the means by which to assess accuracy in classification and identification tasks.

Table 1 Test results using various training databases.

Deep learning

During training, the root mean square error (RMSE) was used for evaluation, based on the standard deviation of the residuals, where values closer to zero indicate superior accuracy.
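For reference, the RMSE over the \(n\) training residuals takes its standard form, where \(y_i\) denotes the ground-truth value and \(\hat{y}_i\) the corresponding network prediction:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i = 1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2}}$$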

Cumulative database

We categorized the fruit as ripening or unripe using 518 training images and 360 testing images. The results were as follows: training time (19.2 h), loss (0.0042), and RMSE (0.06). The classification accuracy (CA) and box-selection accuracy (BA) are shown in Fig. 1. In the training database (Fig. 1a–d), (a)–(c) were correctly classified as ripening and (d) as unripe, with confidence scores of 94.61%, 97.79%, 96.96%, and 91.07%, respectively. In the testing database (Fig. 1e–h), (e) and (f) were correctly classified as ripening and (g) and (h) as unripe, with confidence scores of 88.68%, 86.20%, 95.93%, and 61.37%, respectively. The results on the testing dataset had lower confidence than those on the training dataset; for example, (g) and (h) show fruit under sun hats, a condition absent from the training set, yet the fruit were still correctly framed and classified. Note in particular the all-yellow fruit without a crown in Fig. 1c: this condition, which does appear in actual fields, is caused mainly by rapid changes in temperature, which reduce or eliminate the differentiation ability of the growing point. Our database was generated for on-site assessment, and the database of fruit-development conditions encountered in field management will continue to grow in the future. In Table 2 (database 1), the classification accuracy (CA) and box-selection accuracy (BA) of unripe fruit (120 images) were both 100% in the training set. Among the ripening fruit (398 images), one image failed to yield a correct bounding box; the remaining fruit were correctly framed. In the test set, the BA of the unripe fruit (21 images) was 100%, and only one image was misclassified. Among the ripening fruit (339 images), four images failed to yield a correct bounding box, giving a BA of 98.82%; of the 335 correctly framed images, 11 were misclassified, giving a CA of 96.71%.
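As an illustration of how these percentages appear to be derived (our reading of the reported counts rather than a formula stated explicitly in the text), the ripening test set gives

$$\mathrm{BA} = \frac{339 - 4}{339} \approx 98.82\%, \qquad \mathrm{CA} = \frac{335 - 11}{335} \approx 96.7\%$$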

Figure 1

Experiment results: (a–d) training images; (e–h) test images.

We also categorized the fruit as unripe, ripening, or ripe using 693 training images and 572 test images. The results were as follows: training time (25.6 h), loss (0.0036), and RMSE (0.06). For the additional category (i.e., ripe), framing of the pineapple was successful (i.e., the pineapples were correctly positioned in the frame), but some of the ripe fruit were classified as unripe due to similarities in color. In Table 2 (database 2), four images in the ripe training set (175 images) were misclassified, giving a CA of 97.91%, and the ripening BA was 99.75%, with one image in the ripening set (398 images) in which fruit detection failed. For the unripe training images, the BA was 100% and the CA was 99.17%, with one image misclassified as ripe. Among the 339 images of unripe fruit in the test set, pineapple detection failed in 8 images (BA of 97.64%), while 74 images were misclassified as ripe and five as ripening (CA of 77.34%). In the ripening test set (21 images), the fruit were detected in all images, but one image was misclassified as unripe (CA of 95.24%; BA of 100%). Among the 212 ripe test images, seven were misclassified as unripe (CA of 96.70%; BA of 100%).

Table 2 3D distance experiment: indoor. Significant values are in bold.

We also categorized the fruit as unripe, ripening, ripe, or bagged using 928 training images and 572 test images. The results were as follows: training time (34.5 h), loss (0.0056), and RMSE (0.07). The position of the fruit was correctly framed in every training image. Training images: (1) in classifying ripe pineapples, one image was misclassified as unripe and one as ripening (CA of 98.86%); (2) in classifying ripening pineapples, one image was misclassified as bagged (CA of 99.75%); (3) in classifying unripe pineapples, two images were misclassified as ripe (CA of 98.33%). In the test set, the ripe and ripening images were all correctly framed and correctly classified. Among the 339 unripe test images, three failed to yield a correct bounding box, giving a BA of 99.12%; in classification, one image was misclassified as ripening, one as bagged, and 75 as ripe (CA of 77.38%).

Increasing the number of images in the database improved detection performance. Our inability to exceed 95% classification accuracy can be attributed to the fact that ripe pineapples appear very similar to unripe pineapples.

3D Distance

We performed experiments in indoor and outdoor testing environments. In the indoor test, 3D distance estimates were verified using a checkerboard with cells measuring 21 mm. Accurate 3D reference distances could not be obtained under outdoor conditions; therefore, only the Dz (Z-axis) distance error was evaluated outdoors.

The checkerboard in the indoor test was held in a fixed position, with its distance measured using a ruler and a laser rangefinder. Depth values were obtained using the depth camera (D435i). To verify the accuracy of the calculated distances, we used the laser rangefinder as the reference, took the difference between the two measurements as the calculation error (D), and divided that difference by the reference distance to obtain the error rate (error rate of D). The architecture and arrangement of the laser rangefinder and camera are shown in Fig. 2a,d. The experiment setup is shown in Fig. 2b, and a depth map is shown in Fig. 2e. We used the intersection of black and white cells closest to the center of the screen as the center point for calculating distances from sampled coordinate points; the results are shown in Table 3. In the experiment, the Z-axis distance ranged from 250 to 800 mm, the Dx distance ranged from +58.09 to −59.58 mm, the Dy distance ranged from −62.48 mm (upper area) to +57.79 mm (lower area), and the Dz difference (laser rangefinder minus depth camera) was within +12 mm. Along the Z axis, our calculated values were more than 4 mm larger than the laser-rangefinder distances; therefore, after subtracting 6 mm from the calculated values to obtain ΔDz, the error rate of ΔDz was less than 1.12% over distances of 300–800 mm. As shown in Table 3, over the 300–800 mm range, the 3D distance error was less than 2%, meaning that at a Z-axis distance of 300 mm, the maximum error would be 6 mm.
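For clarity, the error rate referred to above is simply the absolute difference between the two measurements divided by the laser-rangefinder (reference) distance; for example, a 6 mm difference at a 300 mm range corresponds to 2%:

$$\text{error rate} = \frac{\left| d_{\text{laser}} - d_{\text{camera}} \right|}{d_{\text{laser}}} \times 100\%, \qquad \frac{6\;\text{mm}}{300\;\text{mm}} \times 100\% = 2\%$$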

Figure 2

Distance experiment setup: (a) and (d) device architecture; (b) and (e) indoor experiment; (c) and (f) field experiment.

Table 3 Z-axis distance experiment: outdoors. Significant values are in bold.

In the outdoor experiment, we mounted the laser rangefinder in alignment with the x-axis of the camera, as shown in Fig. 2c,f. During this test, we were unable to obtain precise distance measurements along the x- and y-axes; therefore, we tested accuracy only along the z-axis. Note that the point measured by the laser rangefinder does not necessarily coincide with the position of the pineapple, and the body of the pineapple is curved. We placed the device shown in Fig. 2d in a pineapple field, positioned the camera at nine different distances and positions, and compared the measurements of the laser rangefinder with those of our system. The maximum error over the nine groups of test results was 1.53% (see Table 1).

The YOLO V212 architecture made it possible to detect pineapple maturity in 99.27% of cases. In the 3D distance experiments, the error in the 3D distance estimates was less than 2% over 300–800 mm in an indoor environment. Z-axis distance estimates obtained in the field were consistent with our indoor test results, with an error of 1.53%.

Discussion

This paper presents preliminary work on the development of an automated system for the harvesting of pineapples. We used a database of images collected in the field (Tainung No. 17) for offline training. The detected fruit were classified as harvestable (ripe and ripening) or un-harvestable (unripe). Note that these assessments were obtained for fruit partially shaded from the sun using a paper bag. Performance in box selection reached 99.27%, with a 3D distance error of less than 2%. A failure to exceed a classification accuracy of 95% prompted us to subdivide the ripe samples into two groups (stage 709 and stage 8) with the aim of enhancing classification accuracy. Trained network weight files were transferred from an i7-1165G7 2.8 GHz host computer with no graphics card to an embedded system equipped with an NVIDIA TX2, thereby increasing the frame rate from 5 to 15 FPS (including distance conversion). The distance data were then transferred to the harvesting machine via an RS232 transmitter.

We expect to improve the relatively poor recognition rate achieved on the test set in the future. We will also augment the database with additional data pertaining to inflorescence, the fruit-development period, and other pineapple varieties. The preliminary design and planning of the simple harvesting machinery is shown in Fig. 3. The harvester moves back and forth to align the attached slides (green in the image) with the fruit, as shown in Fig. 3a. After alignment, the two semi-circular halves of the harvesting device are lowered to surround the fruit, as shown in Fig. 3b. The two halves are then moved to within a set distance of the stem, whereupon an electric cutting device picks the fruit, as shown in Fig. 3c.

Methods

Offline training was conducted using MATLAB 2021b running on a PC equipped with an Intel i9-11900F CPU and an RTX 3080 Ti graphics card, with the resulting weight files deployed on an embedded NVIDIA TX2. In the following, we describe how the database was obtained, outline the network architecture and the reasons for its selection, and describe how detection data were converted into 3D distances.

Figure 3

Preliminary plan for the simple harvesting mechanism: (a) lateral alignment of the harvesting mechanism; (b) lowering of the clamping device; (c) closure of the clamping device around the stem beneath the fruit.

Database

The most important part of any deep learning implementation is the collection of data and the accuracy of the ground-truth values. Collecting data in the field can be time-consuming; therefore, we discussed the issue with Ching-Shan Kuan and Hsin-Yi Tseng at the Taiwan Agricultural Research Institute before initiating the collection process. Data collected in the field were classified and labelled in accordance with BBCH1 codes throughout the growing period, and the correctness of the labels was confirmed manually. We compiled a database of 8,852 images, as follows: stage 5 (2,186), stage 6 (2,176), stage 7 (3,449; comprising stage 701 (1,990), stage 705 (1,002), and stage 709 (457)), stage 8—ripening (419), and stage 8—ripe (387). We also collected 235 images of bagged pineapples. The database, summarized in Table 4, was classified according to the BBCH scale and randomly divided into training and test sets; Table 4 lists the categories and the number of images in each. Our main objective was to assess fruit in the field during the final stage of development.

Table 4 Database categories and number of images based on BBCH code. Significant values are in bold.

The fruit was divided into four categories: bagged (Fig. 4a), ripe (Fig. 4b), ripening (Fig. 4c), and unripe (Fig. 4d). 'Ripe' refers to fleshy fruit that is entirely green but must be harvested. 'Unripe' refers to fleshy and hollow-sounding fruit that is not yet harvestable. 'Ripening' refers to hollow-sounding fruit undergoing a color change from green to yellow; the extent of this color change can be subdivided according to maturity level. The heads of the fruit in Fig. 4b,c differ slightly: the head of the fruit in (c) is slightly yellow, with the color change proceeding from the bottom upward, whereas the fruit in (b) is entirely green with no yellowing. Note that when paper bags are used to protect the fruit from sun exposure, it is not possible to determine the degree of maturity. Images showing sun hats (Fig. 4e) were not included in the training or test sets.

Figure 4

Database categories: (a) bag; (b) ripe; (c) ripening; (d) unripe; (e) sunhat.

Deep learning

Several versions of the YOLO architecture are available. The first generation14 (V1) unifies the two-stage solution (class and bounding box) under a CNN architecture as a single regression problem, with a computation speed of 45 FPS. The second generation12 (V2) offers improved accuracy and convergence, with a computation speed of 67 FPS. The third generation15 (V3) exhibits improved detection of small objects but is slower. The fourth generation16 (V4) seeks an optimal balance between resolution and the number of convolution layers, enabling a high recognition rate with high computing efficiency at a speed of 65 FPS. V517 is largely the same as V4; however, its computation speed is roughly twice as fast.

Note that in this study, detection was performed from a single station in the orchard. The objects detected in the images were larger than 300*300 px, and no more than two objects ever appeared in the same frame; thus, there was no need to detect multiple small objects. The background in the pineapple field was also less complex and less variable than that of the COCO database. Accordingly, our primary considerations in selecting a network architecture were speed and accuracy, and we opted for the YOLO V2 architecture using models pre-trained in MATLAB. The parameter settings were as follows: input images (448*448*3) with anchor boxes (344,338), (108,115), (223,290), and (238,167). The network parameters used in the experiments were as follows: feature layer (leakyrelu_24), mini-batch size (32), max epochs (200), and learning rate (0.001). To avoid overfitting, the training set was randomly selected from the database, and test images not included in the training set were used to confirm whether overfitting occurred.
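The following is a minimal MATLAB (Computer Vision Toolbox) sketch of a YOLO v2 setup consistent with the parameters listed above; the backbone choice, variable names, and the ground-truth datastore (trainingData) are illustrative assumptions rather than the exact code used in this study.

```matlab
% Network construction with the stated input size, anchors, and feature layer
inputSize   = [448 448 3];
numClasses  = 4;                                   % unripe, ripening, ripe, bagged
anchorBoxes = [344 338; 108 115; 223 290; 238 167];

baseNet      = darknet19;                          % assumed backbone containing the named layer
featureLayer = 'leakyrelu_24';                     % feature-extraction layer reported above
lgraph = yolov2Layers(inputSize, numClasses, anchorBoxes, baseNet, featureLayer);

% Training options matching the reported hyperparameters
opts = trainingOptions('sgdm', ...
    'MiniBatchSize', 32, ...
    'MaxEpochs', 200, ...
    'InitialLearnRate', 0.001);

% trainingData: datastore or table pairing field images with labeled boxes
[detector, info] = trainYOLOv2ObjectDetector(trainingData, lgraph, opts);

% Inference on a single frame
[bboxes, scores, labels] = detect(detector, imread('pineapple.jpg'));
```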

3D Distance

Note that the selected camera uses left–right dual-camera parallax to estimate distances, in conjunction with TOF technology to correct for distance error, resulting in a Z-axis distance error of less than 2%13. Note also that the working range of the camera is 20 cm to 10 m, which is in line with our needs. As shown in Fig. 5, after the object is recognized and framed via deep learning, the center point of the frame is calculated, and the Z-axis distance at the corresponding pixel position is read from the depth image. The actual three-axis distance (mm) is then computed from this Z-axis distance and the center point of the selected frame.

Figure 5

System flowchart.

After determining the position of the pineapple via deep learning, as shown in Fig. 5, we used the coordinates of the upper-left corner of the frame (xa, ya) to calculate the center-point position (xb, yb), as shown in Eq. (1). The distance value of the depth image was read at this position. To account for the risk of depth-measurement failure under strong lighting conditions, we made the depth value more robust using Eq. (2): the eight adjacent points and the center point were used to calculate the depth value \(D_{Z}\). Once the camera is installed beneath the field harvester, the body of the machine should block strong light.

$$\left( {{\text{x}}_{b} ,y_{b} } \right) = \left( {{\text{x}}_{a} + \frac{w}{2},y_{a} + \frac{h}{2}} \right)$$
(1)
$$\begin{aligned} D_{Z} &= ({\text{depth image}}\left( {{\text{x}}_{b} - 1,y_{b} - 1} \right) + {\text{depth image}}\left( {{\text{x}}_{b} ,y_{b} - 1} \right) \\ &\quad + {\text{depth image}}\left( {{\text{x}}_{b} + 1,y_{b} - 1} \right) + {\text{depth image}}\left( {{\text{x}}_{b} - 1,y_{b} } \right) \\ &\quad + {\text{depth image}}\left( {{\text{x}}_{b} ,y_{b} } \right) + {\text{depth image}}\left( {{\text{x}}_{b} + 1,y_{b} } \right) \\&\quad + {\text{depth image}}\left( {{\text{x}}_{b} - 1,y_{b} + 1} \right) + {\text{depth image}}\left( {{\text{x}}_{b} ,y_{b} + 1} \right) \\ &\quad + {\text{depth image}}\left( {{\text{x}}_{b} + 1,y_{b} + 1} \right))/{\text{num }} \\ \end{aligned}$$
(2)
$${\text{num}} = {\text{the number of the nine depth image values in }}D_{Z} {\text{ that are greater than }}1$$
(3)
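As an illustration, the following MATLAB fragment sketches Eqs. (1)–(3) under the assumption that the detector returns a box bbox = [xa ya w h] and that depthImage holds per-pixel depth in millimetres (variable names are illustrative):

```matlab
% Eq. (1): center of the detected box
xb = round(bbox(1) + bbox(3)/2);
yb = round(bbox(2) + bbox(4)/2);

% Eqs. (2)-(3): average depth over the 3x3 neighborhood, counting only
% usable pixels (invalid depth readings are ~0 and contribute essentially nothing)
win = depthImage(yb-1:yb+1, xb-1:xb+1);   % nine depth values around the center
num = nnz(win > 1);                       % Eq. (3): values greater than 1
Dz  = sum(win(:)) / num;                  % Eq. (2): sum of nine values / num
```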

The Dz values were used to calculate the x-axis distance Dx and the y-axis distance Dy. Using the known camera FOV (H: 64°, V: 41°) and camera resolution (1920*1080 pixels), the distances can be calculated using trigonometric functions via Eqs. (4)–(5). At the center of the screen, Dx = Dy = 0 mm; in the upper-right corner of the screen, Dx > 0 and Dy < 0; and in the lower-left corner of the screen, Dx < 0 and Dy > 0.

$$D_{x} = \frac{{\tan \left( \frac{H}{2} \right) \times dz \times 2}}{m} \times \left( {x_{b} - x_{1} } \right)$$
(4)
$$D_{y} = \frac{{\tan \left( \frac{V}{2} \right) \times dz \times 2}}{n} \times \left( {y_{b} - y_{1} } \right)$$
(5)

where (x1, y1) denotes the center point of the image, with x1 = m/2 and y1 = n/2. The position of the object determined using deep learning can thus be converted into a 3D distance (mm) using Eqs. (1)–(5).
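A minimal MATLAB sketch of the conversion in Eqs. (4)–(5), using the FOV and resolution values given above (the function name is illustrative):

```matlab
function [Dx, Dy] = pixelToMillimetre(xb, yb, Dz)
% Convert the box-center pixel (xb, yb) and depth Dz (mm) to lateral offsets (mm).
    H = 64;  V = 41;        % horizontal / vertical field of view (degrees)
    m = 1920; n = 1080;     % image resolution (pixels)
    x1 = m/2;  y1 = n/2;    % image center

    Dx = (tand(H/2) * Dz * 2 / m) * (xb - x1);   % Eq. (4)
    Dy = (tand(V/2) * Dz * 2 / n) * (yb - y1);   % Eq. (5)
end
```

For example, a box centered at pixel (1460, 290) with Dz = 500 mm yields Dx of roughly +163 mm and Dy of roughly −87 mm, consistent with the sign convention described above.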