Introduction

DNA damage is a major toxicological concern in drug development, as it is a critical cause of cancer as well as chronic and degenerative diseases such as heart disease and Parkinson’s disease1,2,3,4,5,6. DNA can be damaged by spontaneous chemical reactions, by reactive oxygen and nitrogen species generated during metabolism, or by exogenous agents. If damaged DNA is not repaired correctly, the damage may accumulate and lead to mutations, cell death, or senescence, which can induce various diseases including cancers and chronic diseases1. The assessment of DNA damage is therefore crucial in toxicology for detecting the genotoxic and carcinogenic potential of compounds. It is also significant in medical research for understanding the pathogenesis of diseases, including cancer and chronic diseases, and can be applied in clinical medicine to monitor these conditions7,8.

Although there are several methods to assess DNA damage9,10,11,12, the comet assay (also referred to as single-cell gel electrophoresis) is recommended for detecting DNA strand breakage in the ICH S2 (R1) guideline13. Its high sensitivity, high efficiency, and low cost make it easily accessible to researchers14,15. In the assay, damaged DNA migrates out of the nucleus during electrophoresis to form a shape that resembles the tail of a comet, whereas undamaged DNA in the nucleus forms the head. Following electrophoresis, the DNA is visualized under fluorescence microscopy by staining with a DNA-binding dye. If the comet head becomes smaller and dimmer while its tail becomes longer and brighter, the DNA in the cell is significantly damaged. Hence, each cell detected in the image is referred to as a comet.

DNA damage in comet assay images can be determined either by visual scoring or by computer image analysis16. Visual scoring subjectively classifies comets into several categories based on the degree of DNA damage (visually estimated tail length and shape), whereas computer image analysis provides a range of parameters for each comet through pixel-based computation. Although visual scoring is simple and may appeal to those who cannot afford expensive computer programs, computer image analysis is preferable for assessing DNA damage because it provides objective, quantified information about the comets.

Several image analysis methods for the comet assay, such as Comet Score, CASP, OpenComet, CometQ, and HiComet, have been proposed to facilitate the comet scoring process16,17,18,19,20. They have two main stages: comet detection or segmentation, and comet scoring. Depending on whether the detection or segmentation process is completed automatically, a tool can be classified as semi- or fully automated. The primary performance differences come from whether a tool can detect and segment each comet correctly rather than from the scoring stage, because the metric used to assess the comet score follows predefined standards and rules21. However, these image analysis methods all use hand-crafted image features with traditional machine learning (such as support vector machines) at the comet detection and segmentation stage. Users must define features manually and tune them visually, which can be laborious. As a result, these methods have difficulty capturing comets against a noisy background and can hardly distinguish two individual cells when they overlap. In contrast, deep learning (DL) can take raw image data as input and learn to detect and segment comets in an end-to-end process.

DL has achieved excellent results in computer vision fields such as image classification22, image object detection23,24, and image object segmentation. The evaluation terms used in this study (illustrated in Fig. 5a) are explained as follows:

Figure 5

Precision/recall curves of the DeepComet on the test dataset. (a) Intersection over Union (IoU) is the overlapping area between the predicted region and the ground truth region divided by the area of their union. (b) The precision/recall curves of the DeepComet at IoU > 0.5 and IoU > 0.75 on our test dataset. The precision/recall curve can be used to evaluate the performance of an object detection model.
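As a minimal illustration of the IoU definition above, it can be computed for two axis-aligned bounding boxes as follows (the same idea applies pixel-wise to segmentation masks); this is an illustrative sketch, not the evaluation code used in this study:

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    # Union = sum of areas minus the double-counted intersection.
    return inter / (area_a + area_b - inter)

# Two 2x2 boxes offset by one unit: intersection 2, union 6.
print(box_iou((0, 0, 2, 2), (1, 0, 3, 2)))  # -> 0.3333...
```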

True positive (TP)

A detection where IoU ≥ threshold and the predicted class agrees with the ground truth class.

False positive (FP)

An incorrect detection.

False negative (FN)

A ground truth object not detected.

Precision

It measures relevant objects among all detected ones; it is expressed as follows.

$$Precision = \frac{TP}{{TP + FP}}$$
(1)

Recall

It measures detected objects among all relevant ones; it is expressed as follows:

$$Recall = \frac{TP}{{TP + FN}}$$
(2)

The AP is the precision averaged across all recall values between 0 and 1 by varying the detection confidence threshold. The COCO metric uses 101 interpolated data points to calculate the AP. In the following, APmask denotes that the IoU was estimated between segmentation masks, APbb indicates that the IoU was calculated between bounding boxes, mAP is the mean AP over all classes, AP@0.5 denotes that the IoU threshold was 0.5, and AP@[0.5:0.95] indicates that the AP was averaged over 10 IoU thresholds (from 0.5 to 0.95 with a step size of 0.05).
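The 101-point interpolation can be sketched as follows; this is an illustrative implementation of the COCO-style AP (precision envelope sampled at 101 equally spaced recall thresholds), not the exact evaluation code used here:

```python
import numpy as np

def coco_ap(recalls, precisions):
    """COCO-style AP from per-detection recall/precision pairs.

    `recalls` must be sorted ascending (detections ranked by confidence).
    """
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # Precision envelope: at each recall, the max precision at >= that recall.
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Sample the envelope at 101 recall thresholds (0.00, 0.01, ..., 1.00).
    samples = np.linspace(0.0, 1.0, 101)
    idx = np.searchsorted(r, samples, side="left")
    return float(np.mean(p[idx]))

# A detector reaching recall 1.0 at precision 1.0 scores AP = 1.0.
print(coco_ap([0.5, 1.0], [1.0, 1.0]))  # -> 1.0
```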

Figure 5b shows the precision/recall curves of the DeepComet at IoU > 0.5 and IoU > 0.75 on our test dataset. The precision/recall curve can be used to evaluate the performance of an object detection model as the confidence threshold is varied, with one curve plotted per object class. A model is considered reliable if the curve stays close to y (precision) = 1, meaning that precision remains high as recall increases with the changing confidence threshold. For the segmentation of non-ghost and ghost cells, the DeepComet's precision remained over 0.9 until recall increased to approximately 0.8 at IoU > 0.5. The mAPmask@0.5 on the Normal test subset was 0.916, slightly higher than that on the Hard test subset (0.87) and on the total (Normal + Hard) test dataset (0.897). The results on the Hard test subset indicate that the DeepComet was well trained for segmenting comets in noisy images. At IoU > 0.75, although the curves dropped as expected, the precision values remained greater than 0.8 until recall increased to approximately 0.8, except on the Hard test subset, where precision fell slightly below 0.8 once recall exceeded 0.4. Unlike at IoU > 0.5, there were gaps between the non-ghost curve and the ghost curve at IoU > 0.75.

Table 2 reports APmask and mAPmask averaged over IoU thresholds from 0.5 to 0.95 with a step size of 0.05. The mAPmask@[0.5:0.95] values on the Normal, Hard, and total test datasets were 0.616, 0.584, and 0.601, respectively. In general, the DeepComet segmented ghost cells better than non-ghost cells on both subsets.

Table 2 Comet segmentation performance of DeepComet on test dataset by APmask@[0.5:0.95] and mAPmask@[0.5:0.95].

Table 3 compares the performance of the DeepComet with OpenComet Version 1.3.1 (https://www.cometbio.org)17 and HiComet (https://github.com/taehoonlee/HiComet)19. We applied the default settings in each program. Since these two programs cannot classify cells into non-ghost and ghost cells, we ignored the classification output of the DeepComet and simply treated its output as comet detections for comparison. The outputs of HiComet are the detected bounding boxes of the comets, so the APbb metric was used to evaluate and compare the results. As shown in Table 3, the DeepComet achieved much higher APbb values in all setups. The APbb@0.5 values of our results were all greater than 0.9 on the three test datasets, whereas the results obtained using OpenComet and HiComet were under 0.5. At the higher IoU thresholds, the APbb@0.75 and APbb@[0.5:0.95] values of the DeepComet were approximately 0.78 and 0.66, respectively, whereas the compared programs barely reached 0.2. Comparing the corresponding APbb results on the Normal and Hard test subsets, the gap between the two subsets was much smaller for the DeepComet than for the other methods: less than 5% for the DeepComet versus greater than 30% for the others. However, it should be considered that the experimental setup may be unfavorable for OpenComet and HiComet, as they were not optimized for our comet image dataset.

Table 3 Comet object bounding box detection performance by APbb. OpenComet and HiComet were operated with default settings.

Figures 6 and 7 show visual comparisons of the comet detection and segmentation results. Figure 6 shows representative images from the Normal and Hard test subsets. OpenComet could not separate slightly overlapped targets into individual comets. Furthermore, it did not detect the entire tail of the comet for segmentation No. 5 in the Normal image and No. 11 in the Hard image. There were also incorrect segmentations of ghost cells. The HiComet output images in the third row show that HiComet did not adequately recognize the tail part of the non-ghost cell and could not capture the entire ghost cell. In contrast, the DeepComet results in the bottom row show slightly overlapped comets correctly segmented into two individuals. The DeepComet also captured a dim tail in the Normal image and a noisy tail in the Hard image, distinguished non-ghost and ghost cells correctly, and captured objects even at the edge of the image.

Figure 6

The comparison of comet segmentation results from the DeepComet and other automatic programs.

Figure 7

The comparison of individual comet segmentation results from the DeepComet and other programs.

Figure 7 shows the segmentation results on representative comets that are challenging to segment due to overlapping, background noise, or a dim tail (bright background). Here, we added the results from Comet Assay IV Version 4.3 (Instem-Perceptive Instruments Ltd., Suffolk, Halstead, UK), a semi-automated method that requires a manual click on each comet to score it. As shown in the first row, when two comets slightly overlap, OpenComet and Comet Assay IV could not separate and recognize each comet, and HiComet recognized only one of them. The second row shows the results on an image with background noise: OpenComet did not recognize the comet as valid, and Comet Assay IV detected the comet but could not segment it accurately, whereas HiComet segmented it correctly. The images in the bottom row have a bright background and a dim comet tail; OpenComet and HiComet did not capture the entire tail, and Comet Assay IV captured an excessive area, including the background.

Comet score comparison

We validated our calculated comet scores by verifying their correlation with those from Comet Assay IV using 107 randomly selected non-ghost cells that could be easily segmented. The two most crucial comet parameters, DNA (%) in tail and Olive moment, were calculated, and the correlation between the scores obtained from the DeepComet and Comet Assay IV was analyzed using Pearson correlation. Figure 8 shows that the correlations between the scores obtained from the DeepComet and Comet Assay IV were 0.917 for DNA (%) in tail and 0.959 for Olive moment. For both scores, the correlation was positive and significantly high.
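A minimal sketch of such a Pearson correlation analysis with SciPy is shown below; the paired scores here are synthetic stand-ins (not the study's data), generated only to make the example runnable:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
# Synthetic stand-ins for 107 paired per-comet scores (illustrative only):
deepcomet_scores = rng.uniform(0, 60, 107)                      # e.g. DNA (%) in tail
reference_scores = deepcomet_scores + rng.normal(0, 5, 107)     # correlated reference

r, p_value = pearsonr(deepcomet_scores, reference_scores)
print(f"Pearson r = {r:.3f}, p = {p_value:.2e}")
```

A high r with a small p-value indicates that the two programs rank and scale the comet scores consistently.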

Figure 8

Comet score correlations between DeepComet and Comet Assay IV. For both parameters, DNA (%) in tail and Olive moment, the correlation was positive and significantly high.

In practical application, a fully automated comet image analysis program should take sets of full comet assay images as input and automatically output all detected and segmented comets with their scores. Here, the comet scoring performance of the DeepComet was validated against the ground truth, Comet Assay IV, OpenComet, and HiComet. Figure 9 shows box plots of the comet score (DNA (%) in tail and Olive moment) distributions of 50 comet assay images randomly selected from the Normal and Hard test subsets (25 from each). The ground truth comet score distribution was calculated on manually annotated comet regions. The comet scores for Comet Assay IV were generated by manually clicking on every comet in the images; for objects that Comet Assay IV did not segment properly, we manually corrected the comet regions as accurately as possible. False positive detections that had no overlapping region with the ground truth were excluded. The box plots show that the results of our method match the ground truth most closely, followed by Comet Assay IV, whose box plots are also quite similar in shape to the ground truth. OpenComet and HiComet detected almost all comets; however, the segmented comet tails were too short on our datasets, whereas Comet Assay IV tended to segment comet tails slightly wider and longer. Student's t-test was carried out at a significance level of 0.05 to identify statistical differences between the ground truth and each of the other methods. There was no significant difference between the ground truth and DeepComet, or between the ground truth and Comet Assay IV, whereas there were significant differences between the ground truth and OpenComet and between the ground truth and HiComet.
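A comparison of score distributions like the one described above can be sketched as follows; the two "methods" and their score distributions are synthetic illustrations (not the study's data), used only to show the Student's t-test at the 0.05 significance level:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
# Synthetic per-comet "DNA (%) in tail" distributions (illustrative only):
ground_truth = rng.normal(30, 10, 200)
close_method = ground_truth + rng.normal(0, 2, 200)  # tracks the ground truth
short_method = ground_truth * 0.5                    # systematically shorter tails

t_close, p_close = ttest_ind(ground_truth, close_method)
t_short, p_short = ttest_ind(ground_truth, short_method)
print(f"close method: p = {p_close:.3g} "
      f"({'significant' if p_close < 0.05 else 'not significant'})")
print(f"short method: p = {p_short:.3g} "
      f"({'significant' if p_short < 0.05 else 'not significant'})")
```

A method whose distribution matches the ground truth yields p > 0.05 (no significant difference), while a systematic under-segmentation of tails yields p < 0.05.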

Figure 9

Comet score distributions evaluated at full comet assay image level. For both parameters, DNA (%) in tail and Olive moment, the result from DeepComet was the closest to the ground truth, followed by Comet Assay IV. OpenComet and HiComet were operated with default settings.

Discussion and conclusion

In this paper, we proposed DeepComet, a comet segmentation model that utilizes the Mask R-CNN architecture. Developed by Facebook AI Research, the Mask R-CNN is one of the most popular architectures because it can easily be extended to other applications, such as human pose estimation. Its performance has also been verified in various practical applications44,45 and image segmentation competitions (https://www.kaggle.com/competitions, https://github.com/matterport/Mask_RCNN). Consequently, we selected the Mask R-CNN among the different DL models for image object segmentation (https://liveuou-my.sharepoint.com/:u:/g/personal/wooec52_liveuou_kr/EVIbAAmwXUBArB0B4uEmREMBe15FZTJB1LdU40KN8wwDUQ?e=i3jcnB).

There has been debate over the cause of ghost cells and how they should be analyzed appropriately. Due to their appearance, measurements of DNA (%) in tail through image analysis are unreliable22. The DeepComet can classify cells into non-ghost and ghost cells, which helps users apply customized measurements to ghost cells. With further post-processing (e.g., background brightness subtraction, orientation correction, and head/tail separation), the comet score can be measured precisely for each comet segmented by the DeepComet.
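For illustration, the two parameters can be computed from a background-subtracted intensity profile of a segmented comet. The head/tail split index and the Olive moment definition used below (|tail centroid − head centroid| × tail DNA fraction) are common conventions in the comet assay literature, not necessarily the exact formulas implemented in DeepComet:

```python
import numpy as np

def comet_scores(profile, head_end):
    """Two standard comet metrics from a 1-D intensity profile.

    profile  : background-subtracted intensity summed across image rows,
               indexed along the electrophoresis (migration) direction.
    head_end : index separating head pixels [0:head_end] from tail pixels.
    Returns (tail_dna_percent, olive_moment).
    """
    x = np.arange(len(profile), dtype=float)
    head, tail = profile[:head_end], profile[head_end:]
    # DNA (%) in tail: fraction of total intensity lying in the tail.
    tail_pct = 100.0 * tail.sum() / profile.sum()
    # Intensity-weighted centroids of head and tail.
    head_centroid = (x[:head_end] * head).sum() / head.sum()
    tail_centroid = (x[head_end:] * tail).sum() / tail.sum()
    # Olive moment: centroid separation times tail DNA fraction.
    olive = abs(tail_centroid - head_centroid) * tail_pct / 100.0
    return tail_pct, olive

# Toy profile: bright 2-pixel head, dim 2-pixel tail.
tail_pct, olive = comet_scores(np.array([4.0, 4.0, 1.0, 1.0]), head_end=2)
print(tail_pct, olive)  # -> 20.0 0.4
```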

In the precision/recall curve of the DeepComet at IoU > 0.75, the precision values were higher than 0.8 until recall increased to approximately 0.8, except on the Hard test dataset, where the precision values were slightly lower than 0.8. There were also gaps between the curves of non-ghost and ghost cells. Based on visual observation, this occurred because the area of ghost cells is generally much larger than that of non-ghost cells, which makes the IoU less sensitive for ghost cells. Furthermore, several non-ghost cells had very short and dim tails, which led to ambiguous annotations. The other experimental results for non-ghost and ghost cell segmentation showed that the DeepComet obtained a high mAPmask@0.5 (0.897) on the total test dataset. The results indicate that the DeepComet is superior to state-of-the-art programs in comet bounding box detection as measured by APbb. Moreover, compared with the Normal test dataset, the APbb on the Hard test dataset decreased by approximately 30% for the state-of-the-art programs, whereas the APbb of the DeepComet decreased by less than 5%. In terms of comet scoring, our results showed that the DeepComet has a high correlation with the commercial Comet Assay IV program. Furthermore, the proposed method also performed well when evaluated at the full comet assay image level. These points demonstrate that the DeepComet can be used for practical automatic comet assay image analysis.

Inter-laboratory variability of comet images is inevitable, and it is a possible reason for the large performance gap on our dataset between the DeepComet and the other methods (as shown in Table 3 and Fig. 9). Comet images from different laboratories can vary in image resolution, background noise, comet shape, brightness, contrast, etc. As each method was developed and optimized on a different dataset, DeepComet may also need to be validated on datasets other than ours. Datasets should be released from various sources to enable more appropriate comparisons and to develop more generalized algorithms.

We demonstrated that our newly proposed DL-based comet assay analysis program, DeepComet, shows strong performance on our publicly released datasets. The DeepComet and our datasets can serve as a baseline for comet image segmentation and analysis to facilitate future research in this field of toxicology and medical science. In the future, we hope to expand the datasets with multiple external sources from different experimental environments and to develop a more specific and accurate deep learning-based architecture for comet assay image analysis.

Data availability

The datasets generated during and/or analyzed during the current study are available here: https://doi.org/10.5281/zenodo.6395303.