Background

Laryngopharyngeal cancer (LPC), comprising laryngeal cancer (LCA) and hypopharyngeal cancer, is the second most common malignancy among head and neck tumours, with more than 130,000 deaths reported in 2020 [1]. Biopsy under laryngoscopy is the gold standard for diagnosing LPC [2, 3]. In-office transnasal flexible electronic endoscopy allows direct examination of the laryngopharynx, making it the most effective tool for detecting LPC [4, 5]. However, the limited resolution and contrast of white-light imaging mean that superficial mucosal cancers can be overlooked or misdiagnosed, even by experienced endoscopists [6, 7]. As a result, patients may be diagnosed at a later stage and require a multimodal treatment approach, with poor prognosis and reduced quality of life [8,9,10]. Furthermore, precautionary biopsies are often performed to avoid missing early-stage cancer, resulting in overtreatment and emotional stress for patients [11]. Recently, endoscopic systems with narrow-band imaging (NBI), which improves the clarity and identification of epithelial and subepithelial microvessels, have played a critical role in the early diagnosis of LPC with high specificity and sensitivity [12,13,14]. However, because NBI interpretation requires long professional training and accumulated clinical experience, suspicious LPCs are at high risk of being missed during endoscopic examinations in hospitals with inexperienced laryngologists, in underdeveloped regions, and in countries with large patient volumes [15, 16].

Recently, artificial intelligence (AI) has shown great potential in assisting doctors with diagnoses across various medical fields [17,18,19]. In particular, deep learning techniques based on deep convolutional neural networks (DCNNs) have demonstrated extraordinary capabilities in medical image classification, detection, and segmentation [20, 21]. Benefiting from its super-resolution performance on microscopic images, AI can automatically infer complex microscopic imaging structures (i.e., abnormalities in the extent and colour intensity of mucosal tubular branches) and identify quantitative pixel-level features [22] that are usually imperceptible to the human eye. Several studies have demonstrated the feasibility and effectiveness of deep learning for lesion detection and the pathological classification of endoscopic images. Unfortunately, the existing research still has several limitations, particularly concerning laryngoscopy. Despite the real-time nature of endoscopy, current research is limited to the detection of single still images [23, 24], and studies integrating AI into dynamic videos are lacking. Additionally, most existing studies focus on a single light source, using either white-light imaging (WLI) or NBI alone [33,34,35,36]. Several preliminary studies have verified the feasibility of deep learning for the auxiliary diagnosis of LCA. Ren et al. established a CNN-based classifier to classify laryngeal diseases [23], and Cho et al. applied a deep learning model to discriminate among various laryngeal diseases other than malignancy [37]. Both reported high accuracy. However, in these two retrospective single-institution studies, the validation set was a small subset randomly sampled from all images in the collection. This means that several images from one patient could be distributed across both the training and validation sets, leading to data leakage and an overestimation of the test results. In contrast, our model was trained and tested on time-series sets: all training, validation, and test images were collected in different, completely non-overlapping periods, simulating the datasets of a prospective clinical trial and yielding more objective and convincing results. Xiong et al. developed a DCNN-based model using WLI images to diagnose LCA with an accuracy of 0·867 [25], and He et al. developed a CNN model using NBI scans to identify patients with LCA, with an AUC of 0·873 in an independent test set [38]. Both studies were based on a single imaging modality, which may cause focal features of the lesion to be missed, weakening the performance of AI-assisted diagnosis. Furthermore, both were applicable only to still images, which limits their practicality in clinical applications. The clinical application of AI requires the ability to analyse and diagnose complex situations in real time: video contains multiple viewing angles of the lesion and more complex diagnostic scenes that are closer to the actual clinical environment. A pilot study by Azam et al. used 624 video frames of LCA to develop a YOLO ensemble model for the automatic detection of LCA in real time [24]. That study focused on the automatic segmentation of tumour lesions using only LCA video frames, achieved an accuracy of 0·66 on 57 test images, and verified the real-time processing performance of the model on six laryngoscopy videos.
Owing to the small sample size and lack of controls, these results and their feasibility for the auxiliary diagnosis of LCA in clinical application should be treated cautiously. The system we developed requires only 26 ms to analyse one video frame, so an average of 38 video frames can be identified per second (1000 ms ÷ 26 ms ≈ 38), meeting the performance requirements for real-time detection. Furthermore, our approach achieved a diagnostic accuracy of 0·949 on an independent test set of 551 videos, demonstrating real-time dynamic recognition ability. Therefore, our system is more reliable for diagnosing LPC in real time and has higher clinical utility than previously reported models.
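To make the data-splitting strategy concrete, the following is a minimal sketch of a temporal (time-series) split of the kind described above. The record fields and cut-off dates are hypothetical, since the text states only that the training, validation, and test images were collected in non-overlapping periods.

```python
# Minimal sketch of a time-series split, assuming each image record carries a
# patient ID and an acquisition date. Field names and cut-off dates are
# hypothetical, not taken from the study.
from datetime import date

records = [
    {"patient_id": "P001", "image": "p001_wli_01.png", "acquired": date(2019, 3, 14)},
    {"patient_id": "P001", "image": "p001_nbi_01.png", "acquired": date(2019, 3, 14)},
    {"patient_id": "P002", "image": "p002_wli_01.png", "acquired": date(2020, 7, 2)},
    {"patient_id": "P003", "image": "p003_nbi_01.png", "acquired": date(2021, 1, 20)},
]

TRAIN_END = date(2019, 12, 31)  # hypothetical boundary between collection periods
VAL_END = date(2020, 12, 31)

train = [r for r in records if r["acquired"] <= TRAIN_END]
val = [r for r in records if TRAIN_END < r["acquired"] <= VAL_END]
test = [r for r in records if r["acquired"] > VAL_END]

# Because the collection periods do not overlap, all images from a single
# examination fall into one split, avoiding the patient-level leakage that
# inflates results when images are shuffled randomly. (Holds for this toy set.)
assert not ({r["patient_id"] for r in train} & {r["patient_id"] for r in test})
```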

Our system achieved satisfactory diagnostic performance, with high accuracy on both the image test set (0·956 [95% CI 0·951–0·960]) and the video test set (0·949 [95% CI 0·931–0·968]), which depended on our subsequent improvements to the U-Net. We extracted two feature representations from WLI and NBI images, respectively, each independently representing a different data type, and then fused the two features. Compared with models simply trained on mixed images, the LPAIDS produced more accurate predictions on both WLI and NBI images. Furthermore, the integration of the two features is based on linear layers, which takes less time than feature extraction from the multimodal data; this fast integration ensures that the LPAIDS can meet demanding real-time requirements. The stability and robustness of the model were validated on five additional independent external validation sets. Moreover, the diagnostic performance of our system was comparable to that of experts and higher than that of non-experts. We used the Cohen kappa coefficient to assess the consistency between the system and the laryngologists: the expert achieved the highest consistency (κ = 0·948), followed by senior laryngologists (κ: 0·755–0·811), laryngology residents (κ: 0·667–0·711), and trainees (κ: 0·514–0·610).
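As an illustration of the dual-branch design described above, the sketch below pairs one encoder per modality with a linear fusion head. The backbone, layer sizes, class count, and paired-input assumption are ours for illustration, not the published LPAIDS architecture.

```python
# Minimal PyTorch sketch of modality-specific feature extraction (WLI, NBI)
# followed by fusion through plain linear layers. All dimensions are
# illustrative assumptions.
import torch
import torch.nn as nn

def make_encoder(feat_dim):
    # Small CNN standing in for the real per-modality feature extractor.
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
    )

class DualModalityClassifier(nn.Module):
    def __init__(self, feat_dim=256, num_classes=2):
        super().__init__()
        self.wli_encoder = make_encoder(feat_dim)
        self.nbi_encoder = make_encoder(feat_dim)
        # Fusion via linear layers: cheap relative to the encoders, which is
        # why it adds little to per-frame latency.
        self.fusion = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, num_classes),
        )

    def forward(self, wli, nbi):
        fused = torch.cat([self.wli_encoder(wli), self.nbi_encoder(nbi)], dim=1)
        return self.fusion(fused)

model = DualModalityClassifier()
logits = model(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224))
```

For the agreement analysis, pairwise kappa values such as those reported above can be computed with scikit-learn's `cohen_kappa_score`.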

Despite these promising results, some limitations remain. First, this was a retrospective study, which may carry a certain degree of selection bias, so the excellent performance of the LPAIDS may not fully reflect actual clinical application. Time-series sets were used in the study design to mitigate this problem, and we have designed and prepared a multicentre prospective randomised controlled trial to verify the applicability of the system in a clinical setting. Second, our dataset mostly comprises high-quality laryngoscopy images, which may limit the scope of use of the system. However, our test sets included images acquired by different endoscopy systems from various institutions, such as Olympus and Xion, which together account for most of the endoscopy market, and we will collect more images of varying quality to enhance the generalisation ability of our system. Third, although we used a video test to demonstrate the real-time detection performance of the system, the clipped videos contained only lesions, so the system's real-time applicability in actual clinical practice remains to be evaluated. We will further work on embedding the system into the endoscopy platform so that prediction results are output during laryngoscopy, and on evaluating the model's reliability in that setting.
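As a rough illustration of what such an embedded deployment implies, the sketch below runs a placeholder per-frame prediction over a live video feed and measures latency. The `predict` stub and the device index are hypothetical, and OpenCV is used here only as a generic frame source, not as the study's deployment stack.

```python
# Hedged sketch of a real-time inference loop: grab frames, predict, overlay
# the result, and track per-frame latency.
import time
import cv2

def predict(frame):
    # Placeholder for the trained network's forward pass.
    return "benign"

cap = cv2.VideoCapture(0)  # endoscope video feed; device index is illustrative
latencies = []
while cap.isOpened() and len(latencies) < 100:
    ok, frame = cap.read()
    if not ok:
        break
    start = time.perf_counter()
    label = predict(frame)
    latencies.append(time.perf_counter() - start)
    # Overlay the prediction so the laryngologist sees it during the exam.
    cv2.putText(frame, label, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    cv2.imshow("LPAIDS", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()

if latencies:
    mean_ms = 1000 * sum(latencies) / len(latencies)
    print(f"mean latency {mean_ms:.1f} ms ≈ {1000 / mean_ms:.0f} fps")
    # At the reported 26 ms per frame, this gives 1000 / 26 ≈ 38 fps.
```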

Conclusion

We developed a DCNN-based system for the real-time detection of LPC. The system can recognise WLI and NBI imaging modalities simultaneously, achieving high accuracy and sensitivity on independent image and video test sets. Its diagnostic efficiency was equivalent to that of experts and better than that of non-experts. However, multicentre prospective verification is still needed to provide high-level evidence for detecting LPC in actual clinical practice. We believe that the LPAIDS has excellent potential for aiding the diagnosis of LPC and reducing the burden on laryngologists.