Background & Summary

In the 4th Industrial Revolution, there is a substantial surge in the demand for multifunctional robots. Gripper-equipped robots have gained popularity and are pivotal for grasping tasks in various industries. They offer the manufacturing industry a distinct advantage by reducing production time and enhancing overall throughput. A major portion of these tasks requires robots to be competent in handling objects of diverse shapes, weights, and textures. However, most techniques are used to train robots for tasks suited to structured environments, where prior knowledge of the scene and objects is readily available. Such tasks are prone to significant errors and present substantial challenges in achieving full automation, particularly within unstructured environments1. In unstructured environments, objects are disordered and exhibit unknown shapes and geometries, thereby requiring robotic systems to rely on real-time perception and comprehension through robotic vision. This is in stark contrast to the structured environment, where prior knowledge and object models can be employed. Consequently, the key to addressing this challenge lies in robotic perception, which enables robots to localize, segment, and grasp objects in unstructured settings.

At present, most vision-based applications and research predominantly rely on traditional vision sensors such as RGB and RGBD sensors. Nonetheless, conventional frame-based cameras exhibit notable limitations, including high power consumption and extensive storage demands stemming from continuous full-frame sensing and data storage. Furthermore, their low sampling rates and susceptibility to motion blur can adversely impact the perceptual quality of many vision-based applications. For instance, the conventional RGB camera’s low sampling rate results in motion blur when capturing images of fast-moving objects on conveyor belts within production lines2. Thus, the accuracy and success rate of object picking and placing are reduced at the perceiving stage. Neuromorphic vision sensors draw inspiration from biological systems, particularly the visual capabilities of fly eyes, which can sense data in parallel and in real time with a microsecond-level sampling rate3,4. Leveraging these unique properties of event cameras, an increasing amount of research is exploring the applications of neuromorphic vision technology to mitigate motion blur and enhance efficiency in various domains. These applications encompass object tracking5, depth estimation6, autonomous driving7, and robotic grasping8,9,10,11.

In perception-related tasks, segmentation serves as a foundational pre-processing step, crucial for estimating the attributes of individual objects. This is particularly vital in vision-based robotic grasping applications, where the accurate localization and geometric details of each object are essential for formulating precise grasping strategies12. In other words, the quality of perception and segmentation has a direct and substantial impact on the quality of the grasping process. In recent years, learning-based approaches to segmentation and other vision-based tasks have triggered a massive surge of interest. Datasets are significant for supervised learning methods in computer vision13. Moreover, datasets allow the comparison of various algorithms to provide benchmarks14. Several RGB and RGBD-based segmentation datasets were constructed to provide ground truth for the training and evaluation of deep-learning-based segmentation approaches. For instance, EasyLabel15 offers an instance segmentation RGB-D dataset with point-wise labeled point-cloud information for cluttered objects in an indoor environment, where the depth height and the objects in clutter are varied. Also, the synthetic dataset TOD was generated for unknown object segmentation, and several event-based approaches11,22 utilize 2D spatial and temporal information. Nonetheless, these methods encounter difficulties in segmenting occluded objects. To circumvent this limitation, the inclusion of depth information from RGBD imagery can prove advantageous. Furthermore, events complemented by depth information can serve as the ground truth for deep learning-based depth estimation approaches, such as spiking neural network-based depth estimation from a monocular event camera23. However, there is a conspicuous absence of deep learning approaches for neuromorphic segmentation of tabletop objects. This shortage can be attributed to the insufficient availability of labeled data required for both training and testing. Instead of developing novel instance segmentation methodologies, transfer learning from semantic segmentation networks could be a feasible and expedited approach to accomplish instance segmentation tasks. There are several approaches targeting event-based semantic segmentation for autonomous driving, such as EV-SegNet (2019)24, VID2E (2020)25, EVDistill (2021)26, EV transfer (2022)27, and ESS (2022)28. However, the features provided by pure events are limited compared to RGB frames. Cross-modal networks, such as SA-GATE48 and CMX29, address this limitation by fusing complementary modalities.

Fig. 1

Hardware setup. Experimental hardware setup (left-side figure): three cameras are fixed on the end-effector of the UR10 manipulator. Camera configuration (right-side figure): the RGBD camera (Intel D435) is placed in the middle, and two event-based cameras (Davis 346C) are mounted on the left and right sides with a tilt angle of 5 degrees towards the middle.

To ensure the complete overlap of the left and right cameras’ fields of view, the relative tilt angle between the two event cameras is calculated as 5 degrees for the assumed height of 0.82 m. Therefore, the two event cameras are tilted by 5 degrees towards the RGBD camera. Furthermore, synchronization connectors are employed to synchronize the two event-based cameras, ensuring that events are triggered simultaneously. Given the microsecond-level sampling rate, synchronization between the event cameras and the RGBD camera is achieved by identifying the nearest timestamps.
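As a rough geometric illustration of this choice (the lateral offset of each event camera from the central optical axis is not reported here and is treated as an unknown d), a tilt angle that makes an event camera’s optical axis intersect the central axis at the tabletop distance h satisfies

$$\theta =\arctan \left(\frac{d}{h}\right),\qquad d=h\,\tan \theta \approx 0.82\times \tan {5}^{\circ }\approx 0.07\,{\rm{m}},$$

so a 5-degree tilt corresponds to a lateral offset of roughly 7 cm at the assumed 0.82 m height.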

Experimental protocol

We collected and assembled the Event-based Segmentation Dataset (ESD) into two distinct subsets: ESD-1, designated for training purposes, and ESD-2, reserved for testing, primarily focusing on unseen object segmentation tasks. The training dataset encompasses up to 10 objects, whereas the testing dataset includes up to 5 objects. The data sequences were collected under various experimental conditions, which will be elaborated on in the subsection Dataset challenging factors and attributes. These conditions encompass different object quantities (ranging from 2 to 10 objects in ESD-1 and 2 to 5 objects in ESD-2), varying lighting conditions (normal and low light), heights between the cameras and the tabletop (0.62 m and 0.82 m), occlusion conditions (with and without occlusion), varying camera movement speeds (0.15 m/s, 0.3 m/s, and 1 m/s in ESD-1, and 0.15 m/s and 1 m/s in ESD-2), as well as different camera trajectories (linear, rotary, and general motion in ESD-1, where general motion is a combination of linear and rotary motion, and linear and rotary motion in ESD-2). Furthermore, it is noteworthy that the objects present in ESD-1 differ from those in ESD-2, thereby making this dataset suitable for addressing challenges related to unknown object segmentation.

Before conducting the experiments to collect data, all of the event cameras and the RGBD camera were calibrated to obtain the intrinsic and extrinsic parameters40, which are crucial for subsequent data processing and annotation. Then, the specific conditions for each particular experiment, such as the height of the cameras and the lighting condition, were set up. The UR10 robot’s end-effector, carrying the cameras, consistently started from the same position for all experiments. Generally, the majority of experimental setups for robotic grasping adopt an “eye-in-hand” configuration10,41, where the camera is attached to the end effector. In scenarios involving object pick-and-place tasks, the camera commonly maintains a downward orientation focused on the tabletop. Consequently, manipulating the end effector, along with the attached cameras, at various speeds and trajectories facilitates the simulation and replication of conditions encountered in robotic manipulation. Additionally, in event-based camera observation, an issue arises when no events are detected because the camera’s motion direction is parallel to object edges. Therefore, we formulated three distinct motion patterns: linear motion, which is most affected by this issue; general motion, which is less affected owing to the added rotation; and pure rotary motion, which effectively eliminates it. Furthermore, in many tabletop grasping scenarios, event cameras have an “act-to-perceive” nature, where the resulting events depend heavily on the linear and angular velocity of the camera, and not just on the scene. That is why it is important to generate events using different camera motion types, so that the resulting dataset is sufficient to train models capable of generalization. Figure 2 illustrates the three designed moving trajectories in xyr space using quaternions, where xy indicates the plane that the cameras move on, and the rotation is denoted on the r axis.

Fig. 2

Designed moving trajectories in xyr space, where xy indicates the plane that cameras move on, and the rotation is denoted in r axis.
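To make the three motion patterns concrete, the sketch below generates (x, y, rotation) waypoints for the linear (ML), rotary (MR), and general (MLR) trajectories, with the rotation about the r axis expressed as a quaternion. This is an illustrative reconstruction rather than the control code used in the experiments; the sampling rate, duration, and yaw rate are assumed values.

```python
# Illustrative sketch (not the authors' control code) of the three camera-motion
# patterns, as (x, y, quaternion) waypoints in xyr space.
import numpy as np

def yaw_to_quaternion(yaw):
    # Quaternion (w, x, y, z) for a rotation of `yaw` radians about the r (vertical) axis.
    return np.array([np.cos(yaw / 2.0), 0.0, 0.0, np.sin(yaw / 2.0)])

def trajectory(kind, speed=0.15, duration=4.0, rate=100.0, yaw_rate=np.deg2rad(15.0)):
    # Return (x, y, quaternion) waypoints sampled at `rate` Hz for one motion pattern.
    t = np.arange(0.0, duration, 1.0 / rate)
    zeros = np.zeros_like(t)
    if kind == "linear":        # ML: translation only
        x, y, yaw = speed * t, zeros, zeros
    elif kind == "rotary":      # MR: rotation about r only
        x, y, yaw = zeros, zeros, yaw_rate * t
    elif kind == "general":     # MLR: combined translation and rotation
        x, y, yaw = speed * t, zeros, yaw_rate * t
    else:
        raise ValueError(f"unknown motion type: {kind}")
    return [(xi, yi, yaw_to_quaternion(ri)) for xi, yi, ri in zip(x, y, yaw)]

waypoints = trajectory("general", speed=0.15)   # e.g. a 0.15 m/s MLR pattern
```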

Overall, 115 experiments and 30 experiments were conducted for ESD-1 and ESD-2, respectively. The RGB part of the dataset consists of 14,166 annotated images. In total, 21.88 million and 20.80 million events were collected from the left and right event-based cameras, respectively.

To measure the differences between similar frames, the difference between two adjacent frames is calculated as the Root Mean Square Error (RMSE) of pixel values (Eq. 1), a measure of the average pixel-wise difference that provides a single scalar value representing the overall dissimilarity between the frames. The RMSE for the entire ESD dataset is 62.31. Additionally, we calculated the RMSE for two specific scenarios: one involving the sequence with the slowest movement, consisting of two objects with fewer features, and the other with the fastest movement, involving ten objects with more features. The calculated RMSE values for these scenarios are 55.17 and 65.61, respectively. We also quantitatively evaluated the difference between consecutive RGB frames of the widely used and newly released video datasets DAVIS42, MOSE43 and CLVOS2344. The calculated average RMSEs are 30.20, 28.69 and 26.27, respectively. By comparing the frame-difference RMSE of our dataset with those of recently published datasets, we demonstrate that our dataset has sufficient variation in the collected images. This variation supports the generalization capabilities of machine learning models trained on our dataset.

$$RMSE=\sqrt{\frac{1}{n-1}{\sum }_{i=1}^{n-1}\frac{{\left({I}_{i+1}-{I}_{i}\right)}^{2}}{N}}$$
(1)

where \({I}_{i}\) and \({I}_{i+1}\) denote the i-th and (i + 1)-th images, n is the total number of images, and N is the number of pixels per image. Additionally, even if consecutive frames look similar, the same does not necessarily apply to events. For example, performing the same linear motion at different speeds could result in similar RGB frames, while the associated event data exhibit significant differences. As depicted in Fig. 3, (a,b) show the RGB visualization captured from the RGBD camera, and the two frames look quite similar to each other. However, when we examine the corresponding event streams within 1 ms in x-y-t coordinates, a substantial difference becomes evident, as shown in (c,d). The discrepancy arises because events are generated asynchronously, leading to variations in the temporal information captured by the event data.

Fig. 3

Visualization of RGB frames captured from RGBD camera and events streams obtained by event-based cameras in linear motion in the same time interval.
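As a concrete reading of Eq. 1, the sketch below computes the frame-difference RMSE of a sequence, interpreting n as the number of images and N as the number of pixels per image; the frame size and synthetic data are placeholders for illustration.

```python
# Minimal sketch of the frame-difference metric in Eq. 1: the mean squared pixel
# difference between consecutive frames, averaged over all frame pairs, then
# square-rooted. `frames` is a list of grayscale images of identical size.
import numpy as np

def sequence_rmse(frames):
    frames = [np.asarray(f, dtype=np.float64) for f in frames]
    n = len(frames)              # n: number of images in the sequence
    num_pixels = frames[0].size  # N: pixels per image
    mse = 0.0
    for i in range(n - 1):
        diff = frames[i + 1] - frames[i]
        mse += np.sum(diff ** 2) / num_pixels
    return np.sqrt(mse / (n - 1))

# Example with synthetic data (placeholder frame size):
rng = np.random.default_rng(0)
frames = [rng.integers(0, 256, size=(480, 640)) for _ in range(10)]
print(sequence_rmse(frames))
```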

Image and events annotation

We tested different methods for the automatic annotation of RGB images and event data. Because different features appear at different perception angles of the camera, achieving precise automatic labeling of RGB images is quite challenging. Consequently, we manually labeled all RGB frames and utilized these manual annotations as references for the automatic annotation of the event-based data.

Manual annotation of RGB frames

Our proposed ESD dataset contains 11,196 images for training and 2,970 images for testing in total. We used the online web annotator CVAT45 to manually annotate the tabletop objects in each frame. CVAT offers automatic features for pixel labelling. The polylines tool is used to draw the boundaries around the objects. Dealing with occlusion is one of the challenges of annotating this dataset. The occluded object is declared as the background whereas the front object is declared as the foreground.

Furthermore, the motion blur resulting from the low sampling rate of the RGBD camera introduces ambiguity when manually labeling the boundaries of objects. We addressed this challenge by conducting a two-step labeling process, as illustrated in Fig. 4. The initial annotation was established based on manual inference of the objects’ positions in accordance with the trajectory of the camera movement. Once the corresponding events were fitted and annotated (as elaborated in section Automatic labeling of events data), we observed the event frame to check whether the mask frame had been accurately labeled with well-defined shapes and object outlines. If the event frame exhibited precise annotation, the initial annotated mask was retained as the final version. However, if the event frame indicated imprecision, we initiated the second step, re-labeling the RGB frame and continuing this process until the events were meticulously annotated.

Fig. 4

Two steps of labeling blurred images: initial annotation and re-annotation. If wrong labels appear in the event frame, a second round of labeling of the RGB mask is triggered according to the initially annotated events.

Automatic labeling of events data

Events are labeled according to the annotated RGB masks; the pseudocode for the automatic annotation of a sequence of events captured in one experiment is described in Algorithm 1.

Algorithm 1

Automatic Annotation for Events Data.

The recorded events can be considered a continuous data stream with very fine temporal resolution (a few microseconds). Thus, we divided the sequences of events into intervals “E” of around 60 ms, which is the sampling period of the RGBD camera, by finding the nearest timestamp between the events and each RGB frame. Simultaneously, the annotated mask frames “S” in RGBD image coordinates are transformed into event coordinates as “Se”, as described in Eqs. 2–4. First, forward projection is applied to transform the mask frames “S” in RGBD image coordinates into the RGBD camera coordinate frame as “Sc” using the camera intrinsic parameters (Eq. 2). As expressed in Eq. 3, the coordinate transformation is then applied twice, transforming “Sc” into the world coordinate frame and subsequently into the event camera coordinate frame, using the cameras’ extrinsic parameters. Building on that, the masks in the event camera coordinate frame are backward projected into the event image coordinates as described in Eq. 4.

$$\left\{\begin{array}{c}x=(u-{c}_{x})z/{f}_{x}\\ y=(v-{c}_{y})z/{f}_{y}\end{array}\right.$$
(2)
$$\left[\begin{array}{c}{x}_{e}\\ {y}_{e}\\ {z}_{e}\\ 1\end{array}\right]={\left[\begin{array}{cc}{{\bf{R}}}_{{\bf{e}}} & {{\bf{T}}}_{{\bf{e}}}\\ 0 & 1\end{array}\right]}^{-1}\left[\begin{array}{cc}{\bf{R}} & {\bf{T}}\\ 0 & 1\end{array}\right]\left[\begin{array}{c}x\\ y\\ z\\ 1\end{array}\right]$$
(3)
$$\left\{\begin{array}{c}{u}_{e}={f}_{xe}{x}_{e}/{z}_{e}+{c}_{xe}\\ {v}_{e}={f}_{ye}{y}_{e}/{z}_{e}+{c}_{ye}\end{array}\right.$$
(4)

where (x, y, z), (u, v), \(({x}_{e},{y}_{e},{z}_{e})\), \(({u}_{e},{v}_{e})\) and (X, Y, Z) represent the same point in the RGBD camera coordinate, RGB image plane, event camera coordinate, event image plane, and world coordinate systems, respectively. \({c}_{x},{c}_{y}\) and \({c}_{xe},{c}_{ye}\) indicate the center points of the RGB and event image planes, respectively. Similarly, \({f}_{x},{f}_{y}\) and \({f}_{xe},{f}_{ye}\) denote the focal lengths of the RGBD and event cameras, respectively. R and T express the rotation matrix and translation vector from the RGBD camera coordinate to the world coordinate system. Re and Te describe the rotation matrix and translation vector from the event camera coordinate to the world coordinate system.
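The mask transfer of Eqs. 2–4 can be sketched as follows; the function name, intrinsic matrices, and extrinsic 4 × 4 transforms are placeholders rather than the actual calibration values, and the forward mapping of labeled pixels is shown without hole filling.

```python
# Sketch of Eqs. 2-4: back-project labeled RGBD pixels to 3D with the depth map,
# transform them into the event-camera frame via the world frame, and re-project
# onto the event image plane so that the labels are inherited.
import numpy as np

def project_mask_to_event_camera(mask, depth, K_rgb, T_rgb_to_world,
                                 K_event, T_event_to_world, event_shape):
    fx, fy, cx, cy = K_rgb[0, 0], K_rgb[1, 1], K_rgb[0, 2], K_rgb[1, 2]
    v, u = np.nonzero(mask > 0)                    # labeled pixels (v = row, u = col)
    z = depth[v, u]
    x = (u - cx) * z / fx                          # Eq. 2: forward projection
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z, np.ones_like(z)])     # homogeneous points, 4 x M

    # Eq. 3: RGBD camera -> world -> event camera
    pts_e = np.linalg.inv(T_event_to_world) @ T_rgb_to_world @ pts

    fxe, fye, cxe, cye = K_event[0, 0], K_event[1, 1], K_event[0, 2], K_event[1, 2]
    ue = np.round(fxe * pts_e[0] / pts_e[2] + cxe).astype(int)   # Eq. 4
    ve = np.round(fye * pts_e[1] / pts_e[2] + cye).astype(int)

    mask_e = np.zeros(event_shape, dtype=mask.dtype)
    valid = (pts_e[2] > 0) & (ue >= 0) & (ue < event_shape[1]) \
            & (ve >= 0) & (ve < event_shape[0])
    mask_e[ve[valid], ue[valid]] = mask[v[valid], u[valid]]      # inherit labels
    return mask_e
```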

However, because the camera keeps moving, the asynchronously recorded events appear in different locations. Thus, the events between two consecutive RGB frames are sliced into sub-intervals “Eij” of 300 events each. The transformed event mask “Se” is then fitted to the event coordinates as “Set” by applying the Iterative Closest Point (ICP) algorithm46, which finds the rigid body transformation between two corresponding point sets \(X=\{{x}_{1},{x}_{2},\ldots ,{x}_{n}\}\) and \(P=\{{p}_{1},{p}_{2},\ldots ,{p}_{n}\}\). The ICP algorithm assumes that the corresponding points xi and pi are the nearest ones, so its working principle is to find the rotation matrix R and translation t that minimize the sum of squared errors E(R, t), as expressed in Eq. 5.

$$E(R,t)=\frac{1}{{N}_{p}}\mathop{\sum }\limits_{i=1}^{{N}_{p}}{\left\Vert {x}_{i}-R{p}_{i}-t\right\Vert }^{2}$$
(5)

Leveraging the transformation composed of the rotation and translation calculated through ICP, the corresponding locations of events on the annotated mask frames are obtained as \({x}_{i}^{{\prime} }=[{\bf{RT}}]{x}_{i}\). Therefore, the labels assigned to pixels on the RGB mask are inherited and applied to the events as well. The working principle is illustrated in Fig. 5.

Fig. 5

Principle of mapping an interval of events onto the RGB frame coordinates for annotation.
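For illustration, a minimal point-to-point ICP in the spirit of Eq. 5 is sketched below; it aligns one 2D point set to another using nearest-neighbour correspondences and a closed-form (SVD-based) rigid transform. Operating in 2D image coordinates and the fixed iteration count are assumptions, not details taken from the original implementation46.

```python
# Minimal point-to-point ICP sketch (illustrative, not the authors' code): at each
# iteration the nearest target point is assigned to every source point and the rigid
# transform minimising the Eq. 5 objective is recovered in closed form via SVD.
import numpy as np
from scipy.spatial import cKDTree

def icp_2d(src_xy, target_xy, iterations=20):
    # Returns accumulated rotation R (2x2) and translation t (2,) aligning src to target.
    R, t = np.eye(2), np.zeros(2)
    src = src_xy.copy()
    tree = cKDTree(target_xy)
    for _ in range(iterations):
        _, idx = tree.query(src)                 # nearest correspondences x_i
        x, p = target_xy[idx], src
        mu_x, mu_p = x.mean(axis=0), p.mean(axis=0)
        H = (p - mu_p).T @ (x - mu_x)            # cross-covariance of centred sets
        U, _, Vt = np.linalg.svd(H)
        R_step = Vt.T @ U.T
        if np.linalg.det(R_step) < 0:            # avoid reflections
            Vt[-1] *= -1
            R_step = Vt.T @ U.T
        t_step = mu_x - R_step @ mu_p
        src = (R_step @ src.T).T + t_step        # apply incremental transform
        R, t = R_step @ R, R_step @ t + t_step   # accumulate
    return R, t
```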

In our setup, we position the two event cameras and the RGBD camera at different locations, as illustrated in Fig. 1. Consequently, their views may not entirely overlap due to constraints on the distance between the objects and the cameras. As a result, areas sensed exclusively by an event camera may be erroneously labeled as background, even when they correspond to actual objects. To tackle this issue, we crop out the blind area that is sensed only by the event camera. The event-based camera captures not only spatial but also temporal information; as a result, event occurrences cropped at a specific timestamp persist at subsequent timestamps, making them amenable to further processing.

Data visualization

We partitioned ESD into training (ESD-1) and testing (ESD-2) subsets for unseen object segmentation tasks. The training and testing datasets consist of up to 10 and 5 objects, respectively. Notably, the testing dataset presents a challenge for unseen object segmentation tasks, as it features different objects compared to the training dataset, ESD-1. Data sequences were collected under various experimental conditions, which are discussed in detail in the subsequent subsection Dataset challenging factors and attributes. Examples of ESD-1 in terms of the number-of-objects attribute are visualized in Fig. 6. The raw RGB image, the annotated mask, and the corresponding annotated events (N = 3000) are illustrated for different numbers of objects. In particular, in the clusters of 2 objects, the two objects (i.e., a book and a box) are separated from each other. For clusters of more than 2 objects, there are occlusions among the objects. In addition, examples of ESD-2 for unseen object segmentation in terms of other attributes are depicted in Fig. 7.

Fig. 6

Example of ESD-1 in terms of the number-of-objects attribute, under the conditions of 0.15 m/s moving speed, normal lighting, linear movement, and 0.82 m camera height. Different colors in the RGB ground truth and annotated event mask indicate different labels. Better viewed in color.

Fig. 7

Example of unknown objects from the ESD-2 dataset in terms of the number-of-objects attribute, under the conditions of 0.15 m/s moving speed, normal lighting, linear movement, and 0.82 m camera height. Different colors in the RGB ground truth and annotated event mask indicate different labels. Better viewed in color.

Data Records

The entire ESD47 dataset is available at Figshare; its structure is shown in Fig. 8.

Fig. 8

Dataset structure. Each sequence was recorded under different experimental conditions and stored with a unique name under the training or testing path. Event-related and frame-related information is stored under the “events” and “RGB” folders, respectively. In particular, the raw images and annotated masks for each experimental condition are contained in the “RGB” subfolder. The events with RGBD information from both event cameras, the image and mask frames converted from RGBD coordinates, and the cameras’ movement are recorded under the “events” folder.

Data format

In each sequence, event-related data is stored in four distinct files within the “events” folder, corresponding to the specific conditions. The “left.mat” file contains the event information from the left Davis 346C, the RGBD information from the Intel D435, and data regarding the movement of the cameras. Similarly, “right.mat” contains the events and frames from the right event-based camera and the RGBD camera, together with the cameras’ movement information. Additionally, the synchronous image frames and mask frames converted from RGBD camera coordinates into both event-camera coordinates are stored in “events_frame.mat” and “mask_events_frame.mat”. Moreover, the raw RGB images and ground-truth masks are provided in the “RGB” folder for all experimental conditions.
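As a hedged example of accessing one sequence, the sketch below loads the four files with SciPy and lists the stored variables instead of assuming their names, since the field names are not specified in the text; files saved in MATLAB v7.3 format would require h5py instead.

```python
# Sketch of reading one sequence's "events" folder with scipy.io.loadmat.
from scipy.io import loadmat

left = loadmat("events/left.mat")                 # events + RGBD + camera motion (left Davis 346C)
right = loadmat("events/right.mat")               # same for the right event camera
frames = loadmat("events/events_frame.mat")       # image frames converted to event-camera coordinates
masks = loadmat("events/mask_events_frame.mat")   # mask frames converted to event-camera coordinates

# Inspect the stored variable names before accessing them, since the internal
# field names are not documented in this section.
for name, data in [("left", left), ("right", right), ("frames", frames), ("masks", masks)]:
    print(name, [k for k in data.keys() if not k.startswith("__")])
```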

Dataset challenging factors and attributes

We constructed the ESD47 dataset with various scenarios and challenges in an indoor cluttered environment. We briefly define the attributes below, where the symbol * varies with the specific condition:

  • Various number of objects (O*): The complexity of the scene can be affected by the number of objects. Thus, we selected different numbers of objects with various shapes and layouts to increase the diversity of the scenes. Particularly, scenes of 2, 4, 6, 8, and 10 objects were collected in ESD-1. Scenes of 2 and 5 objects were collected in ESD-2.

  • Cameras’ moving speed (S*): Motion blur is an open challenge in computer vision tasks. We collected data with different moving speeds (S015: 0.15 m/s, S03: 0.3 m/s, S1: 1 m/s) of cameras to introduce various degrees of motion blur of RGB frames.

  • Cameras’ moving trajectory (M*): From observation, events may not be captured along object edges that are parallel to the camera’s moving direction, which is challenging for event-based processing. We introduce this attribute as linear (ML), rotary (MR), and general (linear + rotary, MLR) moving trajectories to cover all edge directions.

  • Illumination Variant (*L): Illumination has a substantial impact on an object’s appearance and is still an open challenging problem in segmentation. In the dataset recording, we collected data in low lighting (LL) and normal lighting (NL) conditions.

  • Height between tabletop and cameras (*H): It affects the size of the overlap area of stereo cameras. Thus, we introduce it as one of the attributes. The dataset is collected with a higher height (HH) and a lower height (LH) indicating that the sensing areas from the stereo camera are fully overlapped and partially overlapped, respectively.

  • Occlusion condition (*O): Occlusion is a classical and challenging scenario in segmentation, caused by objects overlapping in the scene. We placed objects with occlusion (OY) and without occlusion (ON).

Sample images taken from our proposed ESD47 dataset with various attributes are shown in Fig. 9. To address the needs of applications involving unknown objects, two distinct subsets were collected, namely ESD-1 and ESD-2, each featuring different sets of objects. ESD-2 serves as a dataset for testing the performance of models on unknown objects. A total of 115 sequences are collected and labeled in ESD-1, and their attribute statistics, including different light conditions, moving speeds, moving trajectories, and objects with occlusion, are presented in Fig. 10. ESD-2 comprises 30 sequences, with the corresponding data statistics illustrated in Fig. 11. Each sequence in both sub-datasets encompasses eight key aspects: the end effector’s pose and moving velocity, the RGB frames and depth maps from the D435, and the RGB frames and event streams from the left and right Davis 346C cameras.

Fig. 9

Sample images of RGB frames, masks and annotated events selected from our proposed ESD47 dataset. (a) Shows tabletop objects under low lighting conditions. (b) Shows the motion blur scenario caused by fast camera motion at 1 m/s. (c) Shows objects occluded by others. (d) Shows the lower camera height of 0.62 m from the tabletop. Different colors in the RGB ground truth and annotated event masks indicate different labels. Better viewed in color.

Fig. 10

ESD-1 statistics: sequence (a), frame (b) and event (c) statistics in terms of attributes. ML, MR and MLR indicate linear, rotary and hybrid moving types; LN and LL represent normal and low light conditions; S015, S03, and S1 describe the camera’s moving speeds of 0.15 m/s, 0.3 m/s and 1 m/s; similarly, O2, O4, O6, O8 and O10 express sequences of 2–10 objects; the occlusion cases with and without occlusion are referred to as OY and ON, respectively. Additionally, the total quantities of sequences, frames and events are presented in (a–c), respectively. Better viewed in color.

Fig. 11

ESD-2 statistics: sequence (a), frame (b) and event (c) statistics in terms of attributes. ML and MLR indicate linear and hybrid moving types; LN and LL represent normal and low light conditions; S015 and S1 describe the camera’s moving speeds of 0.15 m/s and 1 m/s; similarly, O2 and O5 express sequences of 2 and 5 objects; the occlusion cases with and without occlusion are referred to as OY and ON, respectively. Additionally, the total quantities of sequences, frames and events are presented in (a–c), respectively. Better viewed in color.

Technical Validation

Evaluation metrics

Our dataset ESD47 provides event labels for individual objects, making it suitable for instance segmentation tasks. Additionally, ESD47 includes objects from various categories, rendering it useful for semantic segmentation as well. In this work, we assess our dataset by applying instance and semantic segmentation methods. Standard segmentation metrics, specifically accuracy and mean Intersection over Union (mIoU), are employed to quantify the testing results. Pixel accuracy, as defined in Eq. 6, measures the percentage of pixels correctly classified.

$$Acc(p,p{\prime} )=\frac{1}{N}\mathop{\sum }\limits_{i}^{N}\delta ({p}_{i},{p}_{i}^{{\prime} })$$
(6)

where p, p′, N, and δ represent the ground truth image, the predicted image, the total number of pixels, and the Kronecker delta function, respectively. However, its descriptive power is limited for cases with a significant imbalance between foreground and background pixels. Therefore, mIoU is also utilized in this work as an evaluation metric due to its effectiveness in dealing with imbalanced binary and multi-class segmentation. Mean IoU (mIoU) is calculated across classes as in Eq. 7:

$$mIoU(p,p{\prime} )=\frac{1}{C}\mathop{\sum }\limits_{c=1}^{C}\frac{\mathop{\sum }\limits_{i=1}^{N}\delta ({p}_{i,c},1)\,\delta ({p}_{i,c},{p}_{i,c}^{{\prime} })}{max\left(1,\mathop{\sum }\limits_{i=1}^{N}\delta ({p}_{i,c},1)+\delta ({p}_{i,c}^{{\prime} },1)\right)}$$
(7)

where C denotes the number of classes. If pixel i of the ground truth or the prediction belongs to class c, \({p}_{i,c}\) or \({p}_{i,c}^{{\prime} }\) is 1, respectively; otherwise, it is 0.
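A minimal sketch of both metrics is given below; it uses the standard per-class intersection-over-union formulation that Eq. 7 expresses with Kronecker deltas, and assumes integer label maps of identical shape.

```python
# Sketch of the evaluation metrics: pixel accuracy (Eq. 6) and mean IoU
# (the standard per-class formulation corresponding to Eq. 7).
import numpy as np

def pixel_accuracy(gt, pred):
    # Fraction of pixels whose predicted label matches the ground truth.
    return float(np.mean(gt == pred))

def mean_iou(gt, pred, num_classes):
    # Intersection over union, averaged over classes present in gt or pred.
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(gt == c, pred == c).sum()
        union = np.logical_or(gt == c, pred == c).sum()
        if union > 0:                 # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

gt = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 0]])
print(pixel_accuracy(gt, pred), mean_iou(gt, pred, num_classes=3))
```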

Segmentation on RGB images

There are many sophisticated approaches for RGB and RGBD instance segmentation, so we selected several well-known and widely used methods, namely FCN19, U-Net20, and DeepLab21, to evaluate our manually labeled RGB frames. The testing results on the ESD-1 and ESD-2 datasets using the mIoU metric are 59.36% for FCN, 64.19% for U-Net, and 68.77% for DeepLabV3+. Moreover, the segmentation results on the public conventional datasets MSCOCO17, PascalVoc13, and CityScape18 are also listed in Table 1. When comparing the segmentation results on known objects from ESD-1 to those of most publicly available datasets, both accuracy and mIoU scores appear lower. This discrepancy can be attributed to the shuffling of RGB frames within the sequences, which include images blurred by camera motion. However, it is worth noting that, in contrast to other datasets with complex backgrounds, ESD, which is designed specifically for tabletop objects, offers a setting that is relatively more conducive to distinguishing between foreground and background. For this reason, the segmentation results on RGB images from the MSCOCO dataset are comparatively lower. On the other hand, these evaluation results underscore the challenges posed by the RGB component of our dataset, which arise not only from object occlusions but also from the impact of motion blur. In addition, the performance on unknown objects from the ESD-2 sub-dataset exhibits a reduction of approximately 30%.

Table 1 Evaluation results of the state-of-the-art segmentation networks FCN, U-Net and DeepLab on RGB frames from ESD

Code availability

All the events were automatically labeled by MATLAB programs. All MATLAB code is available on GitHub49: https://github.com/yellow07200/ESD_labeling_tool.

References

  1. Chitta, S., Jones, E. G., Ciocarlie, M. & Hsiao, K. Mobile manipulation in unstructured environments: Perception, planning, and execution. IEEE Robotics and Automation Magazine 19, 58–71, https://doi.org/10.1109/MRA.2012.2191995 (2012).

  2. Zhang, Y. & Cheng, W. Vision-based robot sorting system. In IOP conference series: Materials science and engineering, vol. 592, 012154 (IOP Publishing, 2019).

  3. Indiveri, G. & Douglas, R. Neuromorphic vision sensors. Science 288, 1189–1190 (2000).

  4. Lichtsteiner, P., Posch, C. & Delbruck, T. A 128 × 128 120 dB 15 μs latency asynchronous temporal contrast vision sensor. IEEE Journal of Solid-State Circuits 43, 566–576 (2008).

  5. Glover, A. & Bartolozzi, C. Event-driven ball detection and gaze fixation in clutter. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2203–2208 (IEEE, 2016).

  6. Rebecq, H., Gallego, G., Mueggler, E. & Scaramuzza, D. Emvs: Event-based multi-view stereo–3d reconstruction with an event camera in real-time. International Journal of Computer Vision 126, 1394–1414 (2018).

  7. Chen, G. et al. Event-based neuromorphic vision for autonomous driving: A paradigm shift for bio-inspired visual sensing and perception. IEEE Signal Processing Magazine 37, 34–49 (2020).

  8. Naeini, F. B. et al. A novel dynamic-vision-based approach for tactile sensing applications. IEEE Transactions on Instrumentation and Measurement 69, 1881–1893 (2019).

  9. Baghaei Naeini, F., Makris, D., Gan, D. & Zweiri, Y. Dynamic-vision-based force measurements using convolutional recurrent neural networks. Sensors 20, 4469 (2020).

  10. Muthusamy, R. et al. Neuromorphic eye-in-hand visual servoing. IEEE Access 9, 55853–55870, https://doi.org/10.1109/ACCESS.2021.3071261 (2021).

  11. Huang, X. et al. Real-time gras** strategies using event camera. Journal of Intelligent Manufacturing 33, 593–615 (2022).

  12. Muthusamy, R., Huang, X., Zweiri, Y., Seneviratne, L. & Gan, D. Neuromorphic event-based slip detection and suppression in robotic gras** and manipulation. IEEE Access 8, 153364–153384 (2020).

  13. Everingham, M., Van Gool, L., Williams, C. K., Winn, J. & Zisserman, A. The pascal visual object classes (voc) challenge. International journal of computer vision 88, 303–308 (2009).

  14. Deng, J. et al. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, 248–255 (IEEE, 2009).

  15. Suchi, M., Patten, T., Fischinger, D. & Vincze, M. Easylabel: A semi-automatic pixel-wise object annotation tool for creating robotic rgb-d datasets. In 2019 International Conference on Robotics and Automation (ICRA), 6678–6684 (IEEE, 2019).

  16. Xie, C., Xiang, Y., Mousavian, A. & Fox, D. Unseen object instance segmentation for robotic environments. IEEE Transactions on Robotics 37, 1343–1359 (2021).

  17. Lin, T.-Y. et al. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V 13, 740–755 (Springer, 2014).

  18. Cordts, M. et al. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, 3213–3223 (2016).

  19. Long, J., Shelhamer, E. & Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 3431–3440 (2015).

  20. Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, 234–241 (Springer, 2015).

  21. Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K. & Yuille, A. L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40, 834–848 (2017).

  22. Barranco, F., Fermuller, C. & Ros, E. Real-time clustering and multi-target tracking using event-based sensors. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 5764–5769 (IEEE, 2018).

  23. Hidalgo-Carrió, J., Gehrig, D. & Scaramuzza, D. Learning monocular dense depth from events. In 2020 International Conference on 3D Vision (3DV), 534–542 (IEEE, 2020).

  24. Alonso, I. & Murillo, A. C. Ev-segnet: Semantic segmentation for event-based cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 0–0 (2019).

  25. Gehrig, D., Gehrig, M., Hidalgo-Carrió, J. & Scaramuzza, D. Video to events: Recycling video datasets for event cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3586–3595 (2020).

  26. Wang, L., Chae, Y., Yoon, S.-H., Kim, T.-K. & Yoon, K.-J. Evdistill: Asynchronous events to end-task learning via bidirectional reconstruction-guided cross-modal knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 608–619 (2021).

  27. Messikommer, N., Gehrig, D., Gehrig, M. & Scaramuzza, D. Bridging the gap between events and frames through unsupervised domain adaptation. IEEE Robotics and Automation Letters 7, 3515–3522 (2022).

  28. Sun, Z., Messikommer, N., Gehrig, D. & Scaramuzza, D. Ess: Learning event-based semantic segmentation from still images. In European Conference on Computer Vision, 341–357 (Springer, 2022).

  29. Liu, H., Zhang, J., Yang, K., Hu, X. & Stiefelhagen, R. Cmx: Cross-modal fusion for rgb-x semantic segmentation with transformers. arXiv preprint arXiv:2203.04838 (2022).

  30. Gehrig, D., Rüegg, M., Gehrig, M., Hidalgo-Carrió, J. & Scaramuzza, D. Combining events and frames using recurrent asynchronous multimodal networks for monocular depth prediction. IEEE Robotics and Automation Letters 6, 2822–2829 (2021).

  31. Binas, J., Neil, D., Liu, S.-C. & Delbruck, T. Ddd17: End-to-end davis driving dataset. arXiv preprint arXiv:1711.01458 (2017).

  32. Burner, L., Mitrokhin, A., Fermüller, C. & Aloimonos, Y. Evimo2: an event camera dataset for motion segmentation, optical flow, structure from motion, and visual inertial odometry in indoor scenes with monocular or stereo algorithms. arXiv preprint arXiv:2205.03467 (2022).

  33. Chaney, K. et al. M3ed: Multi-robot, multi-sensor, multi-environment event dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 4016–4023 (2023).

  34. Saxena, A. et al. Depth estimation using monocular and stereo cues. In IJCAI 7, 2197–2203 (2007).

  35. Zhou, Y., Gallego, G. & Shen, S. Event-based stereo visual odometry. IEEE Transactions on Robotics 37, 1433–1450, https://doi.org/10.1109/TRO.2021.3062252 (2021).

  36. Seitz, S. M., Curless, B., Diebel, J., Scharstein, D. & Szeliski, R. A comparison and evaluation of multi-view stereo reconstruction algorithms. In 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR'06), vol. 1, 519–528 (IEEE, 2006).

  37. Rebecq, H., Gallego, G. & Scaramuzza, D. Emvs: Event-based multi-view stereo. (2016).

  38. Tosi, F., Aleotti, F., Poggi, M. & Mattoccia, S. Learning monocular depth estimation infusing traditional stereo knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019).

  39. Kar, A., Häne, C. & Malik, J. Learning a multi-view stereo machine. Advances in neural information processing systems 30 (2017).

  40. Ayyad, A. et al. Neuromorphic vision based control for the precise positioning of robotic drilling systems. Robotics and Computer-Integrated Manufacturing 79, 102419 (2023).

  41. Du, G., Wang, K., Lian, S. & Zhao, K. Vision-based robotic gras** from object localization, object pose estimation to grasp estimation for parallel grippers: a review. Artificial Intelligence Review 54, 1677–1734 (2021).

  42. Li, X. et al. Video object segmentation with re-identification. The 2017 DAVIS Challenge on Video Object Segmentation - CVPR Workshops (2017).

  43. Ding, H. et al. Mose: A new dataset for video object segmentation in complex scenes. arXiv preprint arXiv:2302.01872 (2023).

  44. Nazemi, A., Moustafa, Z. & Fieguth, P. Clvos23: A long video object segmentation dataset for continual learning. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2496–2505, https://doi.org/10.1109/CVPRW59228.2023.00248 (IEEE Computer Society, Los Alamitos, CA, USA, 2023).

  45. Computer vision annotation tool. https://cvat.org.

  46. Besl, P. J. & McKay, N. D. Method for registration of 3-d shapes. In Sensor fusion IV: control paradigms and data structures, vol. 1611, 586–606 (SPIE, 1992).

  47. Xiaoqian, H. et al. ESD: A Neuromorphic Dataset for Object Segmentation in Indoor Cluttered Environment, Figshare, https://doi.org/10.6084/m9.figshare.c.6432548.v1 (2024).

  48. Chen, X. et al. Bi-directional cross-modality feature propagation with separation-and-aggregation gate for rgb-d semantic segmentation. In European Conference on Computer Vision, 561–577 (Springer, 2020).

  49. Xiaoqian, H. et al. A Neuromorphic Dataset for Object Segmentation in Indoor Cluttered Environment-codes, Zenodo, https://doi.org/10.5281/zenodo.1234 (2023).

Acknowledgements

This work was performed at the Advanced Research and Innovation Center (ARIC), which is funded by STRATA Manufacturing PJSC (a Mubadala company), Sandooq Al Watan under Grant SWARD-S22-015, and Khalifa University of Science and Technology.

Author information

Contributions

X.H. and A.A. conceived the experiments, built the experimental setup, conducted the data acquisition, and wrote the Matlab code for automatic labeling of events; X.H. and K.S. annotated RGB frames and events; K.S. evaluated RGB frames of our dataset; X.H. validated the events data and integration of RGB and events data; X.H. organized the dataset; Y.Z., F.N. and D.M. supervised this work; Y.Z. is the project principal investigator; all authors reviewed the manuscript.

Corresponding author

Correspondence to Yahya Zweiri.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Huang, X., Kachole, S., Ayyad, A. et al. A neuromorphic dataset for tabletop object segmentation in indoor cluttered environment. Sci Data 11, 127 (2024). https://doi.org/10.1038/s41597-024-02920-1
