Background

Double-balloon enteroscopy (DBE) is important for managing small bowel disease [1]. Different from capsule endoscopy (CE), DBE not only serves as a diagnostic tool but allows for tissue sampling and therapeutic intervention [2, 3]. DBE has a real-time image reading mode. However, there are inherent oversight and distraction risks in the use of this mode due to its time-consuming and technically demanding nature. Additionally, diagnosis is influenced by endoscopist experience variability. It has been previously reported that the first DBE procedures yielded false-negative results in 20–23.8% of patients [4,5,6,7]. The lesion types missed during these procedures include ulcers, diverticulum, tumours, and vascular lesions [6]. As immediate repeat inspection is not routine for DBE, missed lesions are unlikely to be detected during the same procedure. This will delay the detection and treatment of bleeding lesions, which increases the risk of recurrent bleeding and repeat examinations [6].

Artificial intelligence (AI) has become a strong focus of interest in clinical practice, owing to its potential to reduce diagnostic errors and manual workload [8,9,10,11,12]. Convolutional neural networks (CNNs) have been used for big data analysis of medical images [9, 13,14,15,16,17,18,19,20,21]. CNN-based algorithms have displayed outstanding performance in gastrointestinal endoscopy that is comparable to or even superior to that of experts [22,23,24,25,26]. AI may assist in diagnosis during endoscopic examinations by automatically detecting, characterizing, measuring, and localizing various lesions. Computer-aided diagnostic (CAD) systems could improve the examination quality and diagnostic accuracy in gastrointestinal endoscopy, hel** clinicians formulate therapeutic strategies and prognosis predictions [27,28,29,30]. Recently, there has been overwhelming literature supporting the crucial role of CNNs on CE to automatically recognize classify, and localize small bowel abnormities [31,32,33,34,35,36,37,38,39,40,41,Experts reached a consensus on the standard answer of the images

Three experts (L Zhao, A Yin, and F Liao) reached a consensus to obtain the standard answer for training and testing. If more than two experts came to the same conclusion for a given image, the conclusion was the standard answer; otherwise, the experts discussed their findings and reached a consensus. The experts, with more than 200 cases of DBE experience, were all from RHWU. They labeled images by bounding each lesion with the smallest rectangular box that enclosed the lesion through an online annotator [44], and the labels were the standard answer for training and testing Model 1. Then, they classified all images into diverticulum, protruding lesion, erosion & ulcer, and angioectasia as the standard answer for training and testing Model 2.

Protruding lesions included: polyps, nodules, epithelial tumours, submucosal tumours, and venous structures. Erosions and ulcers were classified into the same category. Erosions, ulcers, and diverticulum included lesions of various aetiologies. Angioectasia included Yano–Yamamoto classification types 1a and 1b.

Construction of the CAD system

The CAD system consisted of two CNNs. The structure of ENDOANGEL-DBE is shown in Fig. 1. You only look once (YOLO) [45] was used for detection (Model 1). YOLO is a region-based object detector that uses a single CNN to detect lesion regions. YOLO has been trained for lesion detection in digestive endoscopic examinations and is widely used in studies for its fast and accurate detection [46,47,48]. ResNet-50 [49], a residual learning framework with good generalizability, was used to build Model 2. The residual blocks of ResNet-50 utilize skip connections to avoid vanishing gradients. The dataset was trained in Google’s TensorFlow. Mode l tuning was used for training. The model tuning parameters are shown in Supplementary Table 4. Dropout, data augmentation, and early stop** with patience of val_loss were used to lower the overfitting risk (Supplementary Tables 5 to 6).

Fig. 1
figure 1

Processing flow diagram of ENDOANGEL-DBE

The threshold value of Model 1 was set to 0.02 according to the ROC curve. Because Model 2 was a four-type classification model, the label with the highest threshold value and with a threshold value ≥0.25 among the four classification labels assigned to the target image was the final classification that the model outputs.

Training and testing

The training and testing flow chart is shown in Fig. 2 and the dataset distribution is shown in Supplementary Tables 1 to 3. We evaluated the individual performance of the two models with the image test set and the overall performance of ENDOANGEL-DBE with the video test set. In addition, ENDOANGEL-DBE performance was compared with endoscopists’ performance for the video test set.

Fig. 2
figure 2

Flow chart

Assessment with image test sets

Assessment of YOLO

The area under the receiver operating characteristic (ROC) curve (AUC), sensitivity, and specificity were used to evaluate the performance of Model 1. The standard answer of the three experts, as mentioned above, was obtained using an online annotator. Model 2 used a bounding box to annotate lesions, and as long as there was one box bounding lesion, the model’s detection of the current image was considered correct.

Assessment of ResNet-50

In the image test set, lesion images abstracted from original images according to the standard answer obtained by the three experts mentioned above were used to assess the performance of Model 2. Accuracy, sensitivity, and specificity were used to evaluate the models.

Assessment with the retrospective video test set and comparison with endoscopists

We evaluated the overall performance of the proposed system on the video test set. Model 1 detected lesions with bounding boxes and Model 2 classified the boxed lesions, and the output classification results were recorded. Three experts provided standard answers for this video test set and we evaluated ENDOANGEL-DBE’s performance in terms of accuracy, sensitivity, and specificity. The videos were cropped by 25 frames per second, and the lesions were boxed by Model 1 and then input into Model 2 for classification. The models provided a diagnosis every second, and the final result of a video was obtained from the largest proportion of all the diagnoses from the video.

Four novices with less than 10 cases of DBE experience and an additional 4 experts with more than 200 cases of DBE experience participated in a human and machine contest. The 4 experts in the video test were different from the 3 experts who developed the standard answers. They diagnosed the same 65 videos independently. The test in ENDOANGEL-DBE was performed 8 times to compare the results with those 8 endoscopists.

Statistical analysis

Continuous variables are shown as the mean and standard deviation. McNemar’s test was used for comparisons between ENDOANGEL-DBE and the endoscopists. All calculations were performed using SPSS 26 (IBM, Chicago, Illinois, USA).

Ethics

The Ethics Committee approved this study at RHWU. The Ethics Committees waived informed consent in this retrospective study. Patient information was hidden during training and testing.

Results

Performance of ENDOANGEL-DBE for images

The AUC of Model 1 for the image test set was 0.947. The ROC curve is shown in Supplementary Fig. 1. The sensitivity of Model 1 in detecting lesions was 92% and the specificity was 93%. Ninety-eight out of 1318 normal images in the image test set were recognized as abnormal images, and 72 out of 915 abnormal images had no box bounding lesion. Normal mucosa, mucus, feces, light spots, dim spots, bubbles, and dark view induced false-positive results. Normal mucosa was the primary contributor to false-positive images. Most of the false-negative images were misjudged owing to dark view, lesions resembling the surrounding mucosa, and lesions with small flat haematin bases (Table 1).

Table 1 False positive images & false negative images analysis in the image test set

Typical images for classification are shown in Supplementary Fig. 2. The overall accuracy of Model 2 was 86%. The sensitivity of classifying protruding lesions was 93%, which was the highest among the four common lesions, and the lowest sensitivity was 80% (erosions & ulcers and diverticulum). ENDOANGEL-DBE also achieved high performance in classifying angioectasia in the image test set (85%). The specificity of classifying diverticulum was 99%.

Performance of ENDOANGEL-DBE for videos

There were 65 videos in the video test set. The detection sensitivity was 100% in the per-case analysis and 93% in the per-frame analysis. The total classification accuracy of ENDOANGEL-DBE was 85%. The number of correct cases that ENDOANGEL-DBE and human observe for different classes of lesions is shown in Table 2. The performance of ENDOANGEL-DBE for the video test set is shown as a confusion matrix in Supplementary Fig. 3.

Table 2 The number of correct cases diagnosed by endoscopists and ENDOANGEL-DBE in video test set. The results of endoscopists are shown as averages ± standard deviations

Human and machine comparison

The total accuracy of endoscopists was 77%. The overall performance of ENDOANGEL-DBE was superior to that of endoscopists (p < 0.01). Twenty out of 25 erosion & ulcer cases were diagnosed correctly by ENDOANGEL-DBE, and it had more correct cases than endoscopists in this class. The number of cases that ENDOANGEL and endoscopists diagnosed correctly in diverticulum and protruding lesions were comparable. In addition, the overall performance of ENDOANGEL-DBE was superior to that of novices (73%, p < 0.01). ENDOANGEL diagnosed more correct cases than the novices did in classifying erosion &ulcer and angioectasia. The overall performance of ENDOANGEL was comparable to the experts in classification (81%; p = 0.253). And it diagnosed more correct cases than experts did in classifying angioectasia. The number of correct cases that ENDOANGEL and experts diagnosed correctly was comparable in classifying the other three lesion classes (Table 2). A demonstration video is shown in Video 1.

In this video test, some erosion and angioectasia cases were confused. ENDOANGEL-DBE misdiagnosed three angioectasia cases as erosions and one erosion as angioectasia. One protruding lesion was misdiagnosed as a diverticulum because of peristalsis (Fig. 3).

Fig. 3
figure 3

Misdiagnosed cases of machine or endoscopists. A Erosions misdiagnosed as angioectasia by novices because bile affects observation. B small erosions misdiagnosed as angioectasia by novices. C Angioectasia with red background misdiagnosed as erosion&ulcer by machine and novices. D Pedunculated polyps in the lumen during peristalsis misdiagnosed as diverticulum by machine. E small angioectasia misdiagnosed as erosion& ulcer by machine. F angioectasia in the edge of view misdiagnosed as erosion& ulcer by machine

Discussion

The promising performance of CNN-based AI algorithms on CE images inspired us to explore their utility in image analysis in the field of DBE images.

In this study, we developed a CNN-based system to automatically recognize and classify small bowel lesions. This is the first study to develop a CAD system and evaluate it with both DBE images and videos. The CAD system performed similar to experts’ evaluations and better than novices. It might be a useful tool, to improve diagnostic yield in DBE, especially for novices. ENDOANGEL-DBE has the potential to reduce missed lesions, interoperator variability, and improve diagnostic accuracy for common small bowel lesions. Another potential application of ENDOANGEL-DBE is in DBE training. Automatic detection and classification CNN programs can assist in image reading training via real-time feedback, which may shorten the training time and accelerate experience accumulation for trainees. Furthermore, the classification model of this CAD system will facilitate further exploitation towards an automatic diagnosis and reporting system for DBE examination.

Automatic detection assists in reducing the miss rate of four common lesions instead of a single type, and classification will be finished after detecting them, which is one of the advantages of our study compared with previous studies [50,51,52]. Our system has reached a high accuracy in detecting and classifying small bowel lesions. To reduce false-positive cases, we used normal images as noise in the training set of the detection model, and we achieved satisfactory results. Notably, we found that ENDOANGEL-DBE’s diagnostic precision of erosion & ulcer was the lowest among the four classes of lesions for both the image and video test sets. Most of the misdiagnosed cases were small, red, and spot lesions, especially those with a red background. The targeted collection of small erosion and angioectasia images for further training improvement is necessary to improve ENDOANGEL-DBE diagnostic performance. In the video test set, dark view and red background were the main reasons for false-positive results for ENDOANGEL-DBE. Transfer learning will be used to decrease false-positive cases in further studies. In this video test, the main reason for endoscopists’ misdiagnosis is that the lesions are small, the background in the video is red in tone and bile can also affect observation. We found that ENDOANGEL-DBE diagnosed more correct cases than experts in diagnosing angioectasia in the video test set. Since there were only eight cases of angioectasia included in this video test set, we need further validation in larger test sets to make the results comparable.

Recent studies have shown that the sensitivity of AI in single disease classification under device-assisted enteroscopy has reached 88.5–97% [50,51,52]. Their training set included more images than our study, but the number of recruited cases in these studies was less than that in ours. Increasing the number of recruited cases will improve the generalizability of the system. These studies were only based on images and our study also assessed ENDOANGEL-DBE’s performance with videos. Miguel Martins et al. [51] developed a model recognizing erosions and ulcers from normal images, whose sensitivity was higher than ours. Since they did not use an independent test set, the images in their test set and training set might come from the same cases, and using such a test set might lead to higher results than those in the real world. These studies aim to detect one single type of lesions, but our system can detect and classify multiple types of lesions. AI systems for automatic detection and classification under CE achieved accuracy of 88.2–100% [31, 33, 34, 53]. These studies contained very large training sets and reached a high detection accuracy. Referring to the above research, we could attempt to use multiple binary classification models or enrich our training sets to further improve our system.

Several limitations of this study must be acknowledged. First, this is a single-centre retrospective study, and only endoscopists from RHWU participated. We should conduct a larger test among endoscopists from different hospitals to compare the performance of the CAD system and humans. We will also plan a multicentre prospective clinical trial to assess its performance in daily clinical routines. Second, the standard answer of this study is expert consensus instead of pathology. The model development was strongly dependent on the consensus of the three experts, which were humans. This study mainly focuses on endoscopic diagnosis. A comprehensive diagnostic system combining pathological and clinical results will be constructed in future studies. Additionally, to further improve our system, model selection, dataset revision, and cross-validation will be used for model training. Third, to make the comparison results fair, the endoscopists were told that the test cases were one of the four types of lesions. However, this might lead to a higher diagnostic accuracy of endoscopists than that in clinical diagnosis. A further comparison between endoscopists with and without AI in unclipped videos is needed to investigate the influence of AI. Fourth, lesions might be missed when multiple different lesions appear at the same time during examinations because of the distance from endoscopy and lesion size. ENDOANGEL-DBE needs further training using multiple lesion images for clinical use. This system will be improved to make diagnoses and assess their relevance to bleeding in future studies. Furthermore, we can develop a system with a positioning function in the future, which can pinpoint the location of lesions. The system can be trained for the assessment of lesion size and the detection of obscure small intestinal bleeding. It could also be used to write an automatic report, which will lessen the administrative workload of endoscopists.

Conclusions

In conclusion, we developed a novel computer-aided diagnostic system for small bowel lesions in DBE. This CAD system can assist especially novice endoscopists in increasing their diagnostic yield.