Background

The surge in cardiovascular diseases (CVDs) has become a global challenge, with cardiovascular deaths climbing steadily from 12.1 million in 1990 to 18.6 million in 2019 [1, 2]. Risk prediction, a primary strategy for addressing this worldwide problem, has brought significant benefits to some developed countries by improving the effectiveness of lifestyle interventions and reducing economic burden [85,86,87,88,89,90,91,92,93,94] (Additional file 2: Table S6) [11, 30,31,32, 34, 38, 40, 41, 44,45,46, 128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145]. Beyond study design, statistical methods, model performance, risk of bias, AI ethics risk, replicability, and clinical implementation, application, and implication in both development and reporting assessments, the complexity and standardization of data acquisition and processing, the required resources (such as software platforms, hardware, or technical professionals), and cost-effectiveness are also focal points in many development assessments. Together, these provide the core framework for the construction of the IVS.

Independent validation score

Our IVS assessment of all 486 models (Fig. 2) classified most as “not recommended” (n = 281, 58%) or “warning” (n = 187, 38%). Only 10 (2%) were classified as “recommended,” and none were identified as “strongly recommended.” The recommended models are listed in Additional file 2: Table S7. Insufficient model transparency accounted for the largest number of “not recommended” classifications (n = 212), followed by performance (n = 56), feasibility of reproduction (n = 12), and comprehensive reasons (n = 1).
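As a purely illustrative sketch of how an IVS-style triage could be expressed programmatically, the following Python function maps component checks to the four recommendation tiers reported above. The function name, component flags, and numeric cutoffs are all invented for illustration and do not reproduce the actual IVS criteria.

```python
# Purely hypothetical sketch: these flags and cutoffs are invented for
# illustration and do NOT reproduce the actual IVS criteria.

def classify_ivs(transparent, performance_reported, reproducible, score):
    """Map component checks and an overall score (0-1) to a tier."""
    # Any failed component check rules a model out entirely.
    if not (transparent and performance_reported and reproducible):
        return "not recommended"
    if score < 0.6:      # invented cutoff
        return "warning"
    if score < 0.9:      # invented cutoff
        return "recommended"
    return "strongly recommended"

# A model failing the transparency check is excluded regardless of score.
tier = classify_ivs(transparent=False, performance_reported=True,
                    reproducible=True, score=0.95)
```

The hard gate on transparency mirrors the finding above that insufficient transparency was the single largest cause of “not recommended” classifications.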

Discussion

This systematic review is the first to encompass global AI studies of CVD prediction in the general population over more than 20 years, starting from the first article published in 2000 [72]. It presents the current status and broad trends in this field through a comprehensive search and careful selection of studies. We performed extensive data extraction and a thorough analysis of key characteristics of the publications, including predictors, populations, algorithms, performance, and bias. On this basis, we developed a tool for evaluating replicability and applicability, to screen appropriate AI-Ms for independent external validation, addressing the key issues currently hindering the development of this field. The findings and conclusions are expected to serve as a reference for algorithm developers, cohort researchers, healthcare professionals, and policymakers.

Principal findings

Our results revealed significant inefficiency in external validation and a lack of independent external validation for existing models, indicating that researchers in the field of AI risk prediction were more inclined to develop new models than to validate existing ones, although validation is crucial for informing clinical decisions [146]. Experience in the field of T-Ms research suggests that this may produce a large number of unusable prediction models; more attention should therefore be paid to external validation to avoid research waste and to facilitate the translation of high-performing predictive models into clinical practice [147,148,149]. Given that most studies used data from only one cohort, we conjecture that limited data sources may be one of the main factors restricting the implementation of external validation. Multi-center studies, and especially multi-country studies (only three were found in our review), should therefore be encouraged to establish multi-source databases.

We found that the majority of studies were conducted in Europe and North America, with only a few in developing countries in Asia and South America, and unfortunately none in Africa. Similar geographical trends have been confirmed for conventional CVD prediction models in previous literature reviews [29, 150]. However, the prevalence of CVD is increasing dramatically in low- and middle-income countries, which now account for over three quarters of CVD deaths worldwide and place a great burden on local medical systems [151,152,153,154]. Considering the influence of ethnic heterogeneity on prediction models [155], native AI-Ms tailored to these countries should be developed for local CVD prevention.

Four classic indexes, age, sex, total cholesterol, and smoking status, were the most frequently used predictors in AI-Ms (some papers did not fully report the predictors used), similar to T-Ms. More importantly, however, the following summary demonstrates that AI-Ms have profoundly transformed the use of predictors owing to their strong data computing capability. First, the median number of predictors in AI-Ms was approximately 3 times greater than that in the T-Ms collated by Damen et al. [29]. Second, beyond the classic predictors (e.g., demographics and family history, lifestyle, and laboratory measures), several new indexes have been incorporated into AI-Ms, mainly multimodal data that cannot be recognized or utilized by T-Ms at all (e.g., image factors and gene- or protein-related information). Third, the limitation of data range has been eliminated, as shown by the absence of fixed age ranges and sex-specific equations in the development of AI-Ms, both important constraints in classic T-Ms. Fourth, AI models allow data to be re-input and reused: in recurrent neural network (RNN) models, researchers gathered data repeatedly during follow-up, and these time-series data were used to retrain the AI-Ms to further improve performance [55, 112]. Another interesting improvement is that predictor screening could be executed automatically by AI instead of by classic log calculation [50, 52].
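To illustrate the idea of automated predictor screening mentioned above, the sketch below ranks candidate predictors by a simple univariate signal (absolute Pearson correlation with a binary outcome) and keeps the top k. Real AI-Ms use far richer importance measures than this; the function, variable names, and toy data are hypothetical.

```python
# Minimal stand-in for automated predictor screening: rank candidates by
# absolute Pearson correlation with a binary outcome and keep the top k.
# The function name and toy data are hypothetical.

def rank_predictors(data, outcome, top_k=2):
    """data: dict of predictor name -> list of values; outcome: list of 0/1."""
    def corr(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy) if sx and sy else 0.0
    scored = {name: abs(corr(vals, outcome)) for name, vals in data.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_k]

# Toy example: "age" and "chol" track the outcome; "noise" does not.
toy = {"age": [40, 45, 70, 75], "noise": [1, 0, 1, 0], "chol": [4, 5, 6, 7]}
selected = rank_predictors(toy, outcome=[0, 0, 1, 1], top_k=2)
```

In practice, such screening is embedded in the learning algorithm itself (e.g., tree-based importance or learned attention), which is what allows AI-Ms to sift far larger candidate predictor sets than T-Ms.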

The systematic review of specific models is imperative for head-to-head comparison of these models and for the design of relevant clinical trials [156, 157]. Our analysis of report quality was performed with reference to the TRIPOD statement and the CHARMS checklist, to inform readers how each study was carried out [158]. Worryingly, we found that many articles did not report important research information, which not only significantly restricts the readability of the articles but may also lead subsequent research to unwarrantedly neglect previous evidence [159,160,161,162]. We therefore strongly recommend that each study upload a TRIPOD statement, or the upcoming TRIPOD-AI statement designed specifically for AI prediction models, at the time of manuscript submission [12, 163, 164].

According to PROBAST, a common method for evaluating risk of bias in traditional prediction models [165], all included AI-Ms were judged to be at high risk in our summary, mainly owing to failure to consider or report competing risks under the statistical analysis domain. Similar trends of high risk have been confirmed in many previous systematic reviews of AI-Ms for other diseases, although the specific reasons differ somewhat, more frequently involving sample size, calibration, missing data handling, and so on [12, 166,167,168]. This could be another significant constraint on the independent external validation of models, in addition to the various issues mentioned earlier, which currently hinder the widespread adoption of AI-Ms in CVD clinical practice. Therefore, it is strongly suggested again that more attention be focused on statistical analysis, not only by authors during research and writing, but also by reviewers and editors during review and publication. Meanwhile, these widespread high-risk judgments prompt us to question whether the current criteria are too harsh for AI-Ms: it is unclear whether some algorithms may offset competing risks through their “black box” effect, and it should not be ignored that the classic events-per-variable (EPV) method may not be suitable for sample size calculation in some ML algorithms owing to their specific operating mechanisms [169,170,171].
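For context, the classic EPV heuristic referenced above can be stated in a few lines: the conventional rule of thumb for regression-based models is roughly 10 or more events per candidate predictor, although, as noted, this may not transfer to some ML algorithms. The function and variable names here are ours.

```python
# The classic events-per-variable (EPV) heuristic, stated explicitly.
# A rule of thumb for regression-based models is EPV >= 10; as discussed
# in the text, it may not transfer to some ML algorithms.

def events_per_variable(n_events, n_candidate_predictors):
    return n_events / n_candidate_predictors

# Example: 300 events and 40 candidate predictors give EPV = 7.5,
# below the conventional threshold of 10.
epv = events_per_variable(300, 40)
adequate = epv >= 10
```

Because AI-Ms routinely use roughly three times as many predictors as T-Ms (see above), applying this rule unmodified would penalize many of them, which is part of why we question whether the current criteria fit AI-Ms.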

Best practice guidance and specific pathways for translating AI-healthcare research into routine clinical applications have been developed. Holmes et al. summarized the AI-TREE criteria [33], while Banerjee et al. created a pragmatic framework for assessing the validity and clinical utility of ML studies [11]. Building on this prior work, on the experiences reported in studies of AI risk prediction models for various diseases [75, 172,173,174], on our own insights gained while validating existing AI models, and on a summary of existing AI research assessment guidelines and tools together with expert suggestions, we developed the IVS for screening models for independent external validation. This tool is primarily intended for researchers involved in the validation process rather than for developers during the implementation phase. In this scoring system, in addition to the two recognized criteria of transparency and risk assessment, performance and clinical implication were included to determine suitability for independent external validation; to some extent, these align with factors typically considered during model development, such as impact, cost-effectiveness, and AI ethics [11, 33]. In assessing performance, we opted for the two most widely reported and strongly recommended indices of discrimination and calibration, namely the c index and the calibration plot/table, rather than specificity or sensitivity, which are not recommended by the TRIPOD and checklist guidelines [34, 35, 158]. Furthermore, the consistency of retrospective validation datasets and the challenges of acquiring prospective study data are key factors influencing external validation [75, 172,173,174], especially for factors such as imaging, biomarkers, and genomics, which may also suffer from a lack of standardization and biased reporting [33].
Building upon the WHO's principles of model utility [42], the acquisition and handling of laboratory-based and emerging multimodal predictive factors are essential components in assessing the feasibility of independent external validation.
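As a minimal illustration of the discrimination index adopted in the IVS, the following sketch computes the c index for a binary outcome, where it coincides with the area under the ROC curve: among all (event, non-event) pairs, the fraction in which the event case received the higher predicted risk, counting ties as 0.5. The function name and toy data are ours.

```python
# Sketch of the c index for a binary outcome (equivalent to ROC AUC):
# among all (event, non-event) pairs, the fraction where the event case
# received the higher predicted risk, with ties counted as 0.5.

def c_index(risks, outcomes):
    pairs = concordant = 0.0
    for r_i, y_i in zip(risks, outcomes):
        for r_j, y_j in zip(risks, outcomes):
            if y_i == 1 and y_j == 0:   # one comparable pair
                pairs += 1
                if r_i > r_j:
                    concordant += 1
                elif r_i == r_j:
                    concordant += 0.5
    return concordant / pairs

# A model that ranks every event above every non-event has c = 1.0.
c = c_index([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
```

Survival-data variants additionally restrict the comparison to pairs that are comparable under censoring, which is one reason the c index must be reported alongside the calibration plot/table rather than in place of it.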

Our IVS results indicate that more than 95% of the models may not be suitable for independent external validation by other researchers and, as a result, may not provide useful support for subsequent clinical application. This helps explain why there has been no independent external validation research in the field of CVD AI prediction for over 20 years. In addition to the problem of model transparency, several other reasons are considered to account for the irreproducibility of the models, including increased difficulty in parameter acquisition and processing, uncertain expected performance, and low reliability owing to high risk of bias. We therefore strongly suggest that model replicability be assessed during the research process and that an IVS statement be reported at the time of submission. However, even after screening, other factors must still be considered comprehensively, such as unquantifiable AI ethics issues, because the scoring system emphasizes technical feasibility and impact. It is also important to emphasize that the current scoring system remains theoretical and requires practical validation and adjustment, necessitating input and refinement from many scholars.

Challenges and opportunities

Despite over 20 years of development, the AI field of CVD prediction experienced a surge of articles only in the past 5 years, accompanied by the phenomena noted above: emphasis on development over validation, no independent validation studies, and a large number of new algorithms studied only once. This field can thus be considered to be in an early stage of development, similar to the traditional Framingham model from the 1970s to 1990s [175, 176]. Unlike T-Ms, however, AI models are quite hard for clinical researchers to comprehend and implement owing to their complexity and “black box” nature. Meanwhile, new algorithms and new combinations of existing ones (such as model averaging and stacked regressions) continually appear, and even the same algorithm may yield rather different ranking indexes [160, 177]. It is therefore reasonable to speculate that new exploratory research will continue to dominate for the foreseeable future, which may be an inherent demand of this field, although external validation of existing models is necessary to avoid research waste, as strongly advocated by many researchers [10, 11, 164].

Several pivotal problems limiting the development of this field merit emphasis. First, resolving defects in study design and reporting, including insufficient external validation, geographical imbalance, inappropriate data sources, and missing algorithm details, depends largely on improving the research awareness and capability of all researchers in the field; this is a gradual process, so uneven development and research waste will be difficult to curb in the short term. Second, another challenge is improving model intelligibility, reproducibility, and replicability, which may far exceed what has been learned from studies of T-Ms, although some researchers have been making great efforts to explore the underlying mechanisms of AI operation, amid growing expectations of a revolutionary breakthrough [178]. Additionally, it is urgent to establish an integrated system of quality control and performance evaluation for studies in this field. This will require gradual development, although the World Health Organization (WHO) and International Telecommunication Union (ITU) have established a Focus Group on Artificial Intelligence for Health (FG-AI4H), which has begun shaping guidelines and a benchmarking process for health AI models through an international, independent, and standard evaluation framework to guide and standardize the industry's development [179].

In addition to the challenges posed by the “black box” issue leading to non-interpretable problems, biases and fairness, technical safety, preservation of human autonomy, privacy, and data security are significant AI ethics concerns within this field [20, 180]. The development of trustworthy AI in healthcare has become a crucial responsibility worldwide [181]. For instance, the European Commission has enacted both the “Ethics Guidelines for Trustworthy AI” and the “Artificial Intelligence Act” [182, 183]. Similarly, in the USA, the creation of the National AI Initiative Office aims to promote the development and utilization of trustworthy AI in both the public and private sectors [184]. Although the articles in this review have devoted limited discussion to these topics, it is essential to note that the aforementioned aspects (including improvement of model transparency and interpretability, reduction in bias risk, enhancement of reproducibility, as well as placing additional emphasis on data and privacy protection), in addition to their scientific research roles, also play a crucial role in addressing AI ethics concerns. These efforts are beneficial for alleviating public concerns about AI ethics issues related to predictive models, thereby increasing trust and acceptance of the models. These aspects improve the balance between AI-assisted decision-making and the preservation of human autonomy, facilitating the clinical application and dissemination of the models. Therefore, we strongly recommend that AI ethics considerations be thoroughly integrated into the model development and validation processes.

For AI intervention studies, well-developed guidelines for design, implementation, reporting, and evaluation have been produced by the EQUATOR Network, including STARD-AI, CONSORT-AI, and SPIRIT-AI, as well as by various scientific journals and associations [139, 142, 185,186,187]. These guidelines will also serve as a roadmap for the development of predictive AI. In practice, Banerjee et al. have designed a seven-domain, AI-specific checklist based on AHA, QUADAS-2, CHARMS, PROGRESS, TRIPOD, AI-TREE, and Christodoulou et al., to evaluate the clinical utility and validity of predictive AI algorithms [11]. Oala et al. are building a tool for AI algorithm auditing and quality control to enable more effective and reliable application of ML systems in healthcare, helping to manage dynamic workflows that may vary by use case and ML technology [188]. Collins et al. have begun developing TRIPOD-AI and PROBAST-AI for AI prediction models [21, 163, 164]. Additionally, based on the results of our IVS analysis, we are planning an independent external validation study with multiple datasets to fill the gap in the AI field of CVD prediction. These efforts are expected to propel this field into a new and mature stage of development.

Recommendations

Despite the increasing recommendations by healthcare providers and policymakers that prediction models be used within clinical practice guidelines to inform decision-making at various stages of the clinical pathway [161, 189], we suggest that experts in this field place more emphasis on the establishment and implementation of scientific research guidelines, for example, promoting ML4H supervision and management for AI prediction models [188]. Additionally, with reference to the requirements of the intervention AI statements, AI-relevant information should be added to TRIPOD-AI, such as algorithm formulas, hyperparameter tuning, predictive performance, interpretability, and sample size determination [186, 190]. Certain PROBAST items need to be modified for AI prediction models, especially 2.3, 4.1, and 4.9, owing to inappropriate standards or coefficients that do not exist in some algorithms. Items 4.6–4.8 should be renegotiated with full consideration of algorithm characteristics. Furthermore, algorithm auditing, overfitting control, sample size calculation, and identification of variables in image data should be added to PROBAST-AI.

In light of studies on conventional models, a greater responsibility falls upon AI algorithm developers: improving transparency in reporting to facilitate model reproduction, and making algorithms more comprehensible and implementable for users to enable wider clinical practice [191]. Furthermore, transparency of reporting should be improved not only at the time of publication but also during the pre-submission, review, and post-publication stages. Meanwhile, editors and reviewers should also play a key role in improving the quality of reporting.

Study limitations

This systematic review has several limitations. First, as in other studies [10, 11, 29], papers not in English, without available full text, or published in other forms (for example, conference proceedings, workshops, news reports, or unpublished work) were excluded from our review, which may lead to an underestimation of the number of models and contribute to the geographical imbalance mentioned above. Second, the potential impact of AI on healthcare might still be overestimated in the present retrospective literature analysis owing to unavoidable publication and reporting bias, despite the measures taken to reduce the omission of eligible literature [11, 192]. Third, we did not evaluate aspects of clinical usefulness such as net benefit or impact studies [159, 193, 194], which are outside our scope and require further investigation.

Conclusions

In summary, AI has triggered a promising digital revolution in CVD risk prediction. However, this field is still in its early stage, characterized by geographical imbalance, low reproducibility, a lack of independent external validation, a high risk of bias, a low rate of reports meeting quality standards, and an imperfect evaluation system. The IVS method we designed may provide a practical tool for assessing model replicability and is expected to contribute to independent external validation research and subsequent extensive clinical application. The development of AI CVD risk prediction will likely depend on the collaborative efforts of researchers, health policymakers, editors, reviewers, and quality controllers.