Background

Skin cancer accounts for 32.5% of all diagnosed malignancies, with an estimated 7.96 million cases occurring worldwide each year [1]. With respect to etiology, previous studies have demonstrated a deleterious association with chronic sunlight exposure, as the ultraviolet component induces deoxyribonucleic acid (DNA) damage that can subsequently trigger malignant mutations. Other possible contributors to skin cancer incidence include viral infection, drug use, and chemical exposure [2].

Pathologically, skin cancer is categorized as either melanoma or non-melanoma. Although relatively rare, the roughly three hundred thousand annual cases of melanoma are highly malignant, with a reported mortality rate of 1.6 per 100,000 worldwide [1]. By contrast, non-melanoma cases, which comprise a number of pathologically distinct entities such as basal cell carcinoma and intra-epithelial carcinoma (i.e., actinic keratosis and Bowen’s disease) [3], are less malignant, given that Mohs micrographic surgery achieves a 5-year cure rate of 98.9% [4]. Nevertheless, approximately sixty-five thousand people die worldwide each year from non-melanoma skin cancer, often in association with delayed diagnosis [1]. Furthermore, non-melanoma skin cancers such as basal cell carcinoma show an increasing trend in incidence [5] and are easily misdiagnosed [6]. This evidence clearly shows that the diagnosis of non-melanoma skin cancer is of similar importance to that of melanoma.

Currently, clinical examination and dermoscopic evaluation are the major techniques for screening skin cancers [7]. These techniques are estimated to achieve 75–84% diagnostic accuracy, indicating that human error remains a factor with these approaches [8, 9]. Given the high prevalence and life-threatening risk of this disease, a timely diagnosis is essential so that appropriate treatment can follow.

Artificial intelligence (AI) techniques are being employed to provide diagnostic assistance to dermatologists because most diagnoses rely principally on visual pattern recognition [10], a particular strength of such technology. Machine learning is a sub-field of AI that seeks to automate intellectual tasks normally performed by humans, and deep learning is in turn a subset of machine learning [11]. Numerous attempts have been made to apply machine learning techniques to support the accurate diagnosis of melanoma and non-melanoma skin cancer [9, 12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34]. As such, systematic reporting is needed for the reliable interpretation and aggregation of these results. However, comparing pre-existing skin lesion classification evidence is difficult because studies differ in the data types used and the statistical quantities reported [35].

To date, synthesized evidence on the performance of AI techniques for the diagnosis of non-melanoma skin cancer remains insufficient [7, 10]. Without reliable evidence, the application of AI to the diagnosis of non-melanoma skin cancer is frequently obstructed. Furthermore, the factors and strategies that may influence the performance of AI in diagnosing non-melanoma skin cancer remain unclear.

In view of these gaps in knowledge, the purposes of this meta-analysis are therefore: 1) to meta-analyze the diagnostic accuracy of machine learning and deep learning for non-melanoma skin cancer; and 2) to examine potential covariates that may account for the heterogeneity found among these studies. The main contributions of this study are:

  • A quantitative summary of the performance of AI for diagnosing non-melanoma skin cancer, so that the utility of AI can be assessed more efficiently and objectively.

  • Identification of potential covariates related to AI performance, since performance may improve when the strategies indicated by these covariates are adopted during model building.

  • Accumulation of knowledge on the diagnostic test accuracy of AI for non-melanoma skin cancer, so that earlier and more accurate diagnosis of non-melanoma skin cancer becomes practical.

The remainder of this paper is structured as follows. The Related work section introduces prior reviews on diagnostic test accuracy, focusing on how these reviews were planned and evaluated. The Methods section presents the research method adopted in this study. The Results section describes the analytical findings based on the collected data, the Discussion section interprets the significance of these findings, and the Conclusions section summarizes the study.

Related work

In recent years, a number of studies have reviewed the existing evidence on AI techniques for skin-lesion classification [7, 10, 23, 35,36,37]. Several themes may be observed from Table 1. First, much of the evidence is qualitative in nature [10, 35,36,37], with the exceptions of Sharma et al. [7] and Rajpara et al. [23]. Without quantitative evidence, the performance of AI-based predictive models is not easily or objectively assessed. Second, few reviews [7, 10] have focused solely on non-melanoma skin cancer, with most efforts devoted to reviewing evidence on melanoma [16, 23] or both [35, 37]. By focusing exclusively on non-melanoma skin cancer, a better understanding may yet be achieved. Third, most reviews include studies that adopted machine learning and deep learning, with the exception of Brinker et al. [35]. Although deep learning is widely considered to outperform conventional machine learning, studies that adopted machine learning should also be included to obtain a more holistic understanding of AI performance in the diagnosis of melanoma and non-melanoma skin cancers. Finally, the review components/metrics used to assess the performance of AI techniques are quite diverse. Classification methods, data source, and diagnostic accuracy are the primary components of these reviews. Reviews that followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement for Diagnostic Test Accuracy (DTA) commonly reported the pooled diagnostic odds ratio, pooled positive/negative likelihood ratios, pooled sensitivity, and pooled specificity, whereas other reviews usually reported accuracy, area under the receiver operating characteristic curve, F1-score, precision, sensitivity, or specificity separately for each individual study. This study therefore follows PRISMA-DTA in reporting summary metrics of the included studies for a global assessment of AI performance in the diagnosis of non-melanoma skin cancer.

Table 1 Prior reviews on skin cancer diagnosis based on artificial intelligence

Methods

This study was conducted according to the PRISMA statement [38] (see Additional file 1: Appendix A for the diagnostic test accuracy checklist and Additional file 2: Appendix B for the diagnostic test accuracy abstracts checklist). The Institutional Review Board of E-Da Hospital (EMRP-108–128) approved the study protocol.

Search strategy and selection process

A literature search of Scopus, PubMed, ScienceDirect, SpringerLink, and Dimensions was carried out on 31 March 2022, using keyword combinations of the terms "basal cell carcinoma", "intra-epithelial carcinoma", "Bowen’s disease", "actinic keratosis", "skin lesion", "non-melanoma skin cancer", "artificial intelligence", "machine learning", and "deep learning".

The inclusion criteria were: 1) studies investigating the diagnostic accuracy for non-melanoma skin cancer; 2) studies written in English; and 3) studies adopting machine-learning or deep-learning techniques. Studies were excluded if: 1) they investigated only melanoma skin cancer; 2) they were irrelevant to our research purpose; or 3) full texts were unavailable for examination. We located 134 potentially eligible articles, of which 95 were excluded with reason (see Fig. 1), leaving 39 articles for the quantitative meta-analysis.

Fig. 1
figure 1

Article selection process

Data extraction

From each study, we extracted the following information: authorship, publication year, sample size, types of non-melanoma skin cancer examined, whether data sources were publicly available, whether cross-validation procedures were undertaken, whether ensemble models were employed, and the type of artificial intelligence technique employed (i.e., deep learning or machine learning). Only studies that adopted a neural-network algorithm with more than one hidden layer were categorized into the deep learning group; the others were categorized into the machine learning group. For deep learning-based models, we further recorded whether pre-trained models were utilized and whether image augmentation was implemented. In addition, we extracted the original numbers of true/false positives and true/false negatives from each study to derive the outcome measures of diagnostic accuracy, including summary sensitivity, specificity, and area under the receiver operating characteristic curve. Finally, if an article classified more than one non-melanoma skin cancer simultaneously, we treated each non-melanoma skin cancer as a separate study and extracted the relevant data following the procedures listed above.
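For reference, the study-level outcome measures mentioned above follow directly from the extracted 2×2 counts; the standard definitions (not tied to any particular included study) are:

\[
\begin{aligned}
\text{Sensitivity} &= \frac{TP}{TP + FN}, \qquad
\text{Specificity} = \frac{TN}{TN + FP},\\[4pt]
LR^{+} &= \frac{\text{Sensitivity}}{1 - \text{Specificity}}, \qquad
LR^{-} = \frac{1 - \text{Sensitivity}}{\text{Specificity}},\\[4pt]
DOR &= \frac{TP \times TN}{FP \times FN} = \frac{LR^{+}}{LR^{-}}.
\end{aligned}
\]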

Methodological analysis

Regarding the quality of each included study, we evaluated the risk of bias and applicability concerns in accordance with the revised Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool, which covers four domains: patient selection, index test, reference standard, and flow and timing [30].

Statistical analysis

Following the suggestion of prior evidence [39], sensitivity and specificity were pooled with a bivariate model. The area under the receiver operating characteristic curve, diagnostic odds ratio, positive likelihood ratio, and negative likelihood ratio were also estimated. Forest plots were produced to depict the variability among the included studies. In addition, summary receiver operating characteristic curves with 95% confidence intervals (CI) and 95% prediction intervals (PI) were used to assess the existence of a threshold effect among the included studies [40]. The R statistical environment [41] with the lme4 [42] and mada [43] packages was used for the diagnostic test accuracy meta-analysis.
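To illustrate how such a pooling can be carried out, the following is a minimal sketch using the mada package; the data frame, its counts, and the column layout are illustrative assumptions, not the actual analysis code or data of this study.

```r
# Minimal sketch of a bivariate DTA meta-analysis with the mada package.
# Assumes a data frame `dat` with one row per study and the 2x2 counts
# in columns TP, FN, FP, TN (the column names expected by mada).
library(mada)

dat <- data.frame(
  TP = c(45, 30, 60, 52, 38, 71),   # hypothetical counts, illustration only
  FN = c( 5, 10,  8,  6, 12,  9),
  FP = c( 7, 12,  9, 11,  8, 13),
  TN = c(80, 70, 95, 88, 76, 102)
)

# Reitsma bivariate model: jointly pools logit-sensitivity and logit-FPR,
# accounting for their correlation (the threshold effect).
fit <- reitsma(dat)
summary(fit)          # summary sensitivity, false positive rate, and AUC

# Summary ROC curve with the summary point and confidence region,
# plus the individual study estimates.
plot(fit, sroclwd = 2, main = "Summary ROC curve")
points(fpr(dat), sens(dat), pch = 2)

# Paired forest plots of study-level sensitivity and specificity.
forest(madad(dat), type = "sens")
forest(madad(dat), type = "spec")
```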

Several meta-regressions with plausible covariates were undertaken to examine possible sources of heterogeneity among studies. The covariates included the type of non-melanoma skin cancer (basal cell carcinoma or intra-epithelial carcinoma), whether data sources were publicly available (public or proprietary), whether cross-validation procedures were undertaken, whether ensemble models were adopted, the type of AI technique employed (machine learning or deep learning), whether pre-trained deep learning models (e.g., DenseNet, ResNet, or AlexNet) were used (yes or no), and whether image augmentation procedures were used by the deep learning models (yes or no). The significance level was set to 0.05 for the present study.
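As an illustration of how one such covariate can enter the bivariate model, the sketch below extends the example above using mada's formula interface for meta-regression; the `technique` variable, its values, and the data are hypothetical and do not reproduce the covariate coding used in this study.

```r
# Sketch of a bivariate meta-regression on one study-level covariate,
# continuing the hypothetical data frame `dat` from the sketch above.
library(mada)

# Assumed ML-vs-DL flag attached to each study (illustrative only).
dat$technique <- factor(c("ML", "ML", "ML", "DL", "DL", "DL"))

# The formula interface regresses the transformed sensitivity (tsens)
# and false positive rate (tfpr) on the covariate.
fit_mr <- reitsma(dat, formula = cbind(tsens, tfpr) ~ technique)
summary(fit_mr)   # coefficient tests indicate whether the covariate
                  # explains part of the between-study heterogeneity
```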

Results

General study characteristics

Among the 39 included articles, 13 articles [38]. Common metrics for diagnostic test accuracy, including the area under the receiver operating characteristic curve, sensitivity, specificity, diagnostic odds ratio, positive likelihood ratio, and negative likelihood ratio, were included. Furthermore, to account for the threshold effect, the pooled sensitivity and specificity were estimated with a bivariate model [39]. Other metrics such as mean accuracy were not assessed in this study, since prior evidence suggests that sensitivity and specificity are more sensible parameters for meta-analysis and are clinically well known [80].

As with most meta-analyses, our study has limitations. First, the summary sensitivity and specificity should be interpreted cautiously because heterogeneity exists among the studies. Further, 72 studies were excluded owing to insufficient quantitative information; future diagnostic studies aimed at predicting non-melanoma skin cancers should report sufficient quantitative information so that subsequent meta-analyses can better characterize and profile them. The covariates identified in this study are based purely on a statistical viewpoint [81]; future research could consider the different design ideas of deep learning-based or machine learning-based approaches to identify other potential covariates. Finally, future meta-analyses may adopt emerging techniques [82,83,84,85] to cluster or classify models into different groups or categories, so that different insights become obtainable.

Conclusions

This study meta-analyzed the diagnostic test accuracy of AI techniques applied to the diagnosis of non-melanoma skin cancer, for which synthesized review evidence has been insufficient. Without a better understanding of the performance of AI for diagnosing non-melanoma skin cancer, the potential of AI may not be fully realized. The results of this quantitative meta-analysis provide a more objective synthesis of AI performance for diagnosing non-melanoma skin cancer, so that the usefulness of AI can be assessed more easily and objectively, and strategies for improving the performance of AI used to screen for non-melanoma skin cancer can be identified. A quick, safe, and non-invasive screening of non-melanoma skin cancers can thus be expected. By searching multiple online databases, 39 articles (67 studies) were included in the meta-analysis. A bivariate meta-analysis of diagnostic test accuracy was undertaken to obtain summary sensitivity, specificity, and AUC, and it showed a moderate summary sensitivity, a strong summary specificity, and a strong AUC. The type of non-melanoma skin cancer, whether data sources were publicly available, whether cross-validation procedures were undertaken, whether ensemble models were adopted, the type of AI technique employed, whether pre-trained deep-learning models were used, and whether image-augmentation procedures were used each partially explained the heterogeneity found among the primary studies. Future studies may consider adopting the suggested techniques to achieve better predictive performance of AI for the effective diagnosis of non-melanoma skin cancer.