Abstract
Purpose
This study investigated the concordance of five publicly available Large Language Models (LLM) with the treatment recommendations of a multidisciplinary tumor board for complex breast cancer patient profiles.
Methods
Five LLM, including three versions of ChatGPT (version 4 and 3.5, with data access until September 2021 and January 2022), Llama2, and Bard, were prompted to produce treatment recommendations for 20 complex breast cancer patient profiles. LLM recommendations were compared to the recommendations of a multidisciplinary tumor board (gold standard) across surgical, endocrine, and systemic treatment, radiotherapy, and genetic testing.
Results
GPT4 demonstrated the highest concordance (70.6%) for invasive breast cancer patient profiles, followed by GPT3.5 September 2021 (58.8%), GPT3.5 January 2022 (41.2%), Llama2 (35.3%), and Bard (23.5%). When precancerous lesions of ductal carcinoma in situ were included, the ranking was identical, with lower overall concordance for each LLM (GPT4 60.0%, GPT3.5 September 2021 50.0%, GPT3.5 January 2022 35.0%, Llama2 30.0%, Bard 20.0%). GPT4 achieved full concordance (100%) for radiotherapy. Alignment was lowest for genetic testing, with concordance ranging from 55.0% (GPT3.5 January 2022, Llama2, and Bard) to 85.0% (GPT4).
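The concordance figures above can be understood as the share of cases in which an LLM matched the tumor board across all queried treatment domains. The following is a minimal illustrative sketch of such a calculation; the domain names and case encoding are assumptions for demonstration, not the study's actual data structures or scoring protocol.

```python
# Illustrative concordance calculation: a case counts as concordant only if
# the LLM recommendation matches the tumor board in every queried domain.
# Domain names and case encodings are hypothetical.

DOMAINS = ["surgery", "endocrine", "systemic", "radiotherapy", "genetics"]

def concordance(llm_recs, board_recs):
    """Percentage of cases where the LLM matches the board in all domains."""
    assert len(llm_recs) == len(board_recs) and board_recs
    hits = sum(
        all(llm[d] == board[d] for d in DOMAINS)
        for llm, board in zip(llm_recs, board_recs)
    )
    return round(100 * hits / len(board_recs), 1)
```

Under this per-case, all-domain definition, 12 fully concordant cases out of 17 invasive profiles would yield the reported 70.6%.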
Conclusion
This early feasibility study is the first to compare different LLM in breast cancer care with regard to changes in accuracy over time, i.e., with access to more data or through technological upgrades. Methodological advancement, i.e., the optimization of prompting techniques, and technological development, i.e., enabling data input control and secure data processing, are necessary in the preparation of large-scale and multicenter studies to provide evidence on their safe and reliable clinical application. At present, safe and evidenced use of LLM in clinical breast cancer care is not yet feasible.
This early feasibility study demonstrates that publicly available Large Language Models (LLM), especially GPT4, increasingly align with the decision-making of a multidisciplinary tumor board regarding high complexity breast cancer treatment choices. It indicates that methodological advancement, i.e., the optimization of prompting techniques, and technological development, i.e., enabling data input control and secure data processing, are necessary in the preparation of large-scale and multicenter studies. At present, safe and evidenced use of LLM in clinical breast cancer care is not yet feasible.
Introduction
In Germany, invasive breast cancer is the most prevalent cancer affecting women, with the annual national incidence exceeding 70,000 cases [1]. The implementation of a comprehensive nationwide mammography screening program between 2005 and 2009 resulted in an initial peak in detected breast cancer cases. Subsequently, the increased efforts led to a consistent reduction in the incidence of advanced tumors and a gradual decline in primary disease. Despite these improvements, a high disease burden of breast cancer persists, and given the aging demographic, a future increase in breast cancer incidence is anticipated. This development is accompanied by an intensifying shortage of healthcare professionals and care capacity [1, 2]. In addition, extensive research continuously expands the spectrum of treatment modalities, encompassing surgical interventions, endocrine therapy, radiotherapy, both neoadjuvant and adjuvant chemotherapy, and genetic testing for hereditary breast and ovarian cancer syndromes [3]. Moreover, the swift advancement of diagnostic and treatment technologies, including the increasing adoption of next-generation sequencing, genetic arrays for the prediction of disease prognosis or chemotherapy benefit, and the use of precision-targeted therapies such as antibody-drug conjugates, shapes a transformative phase in gynecological oncology [4, 5]. This development is marked by an abundance of evidenced knowledge and health data, which increasingly overwhelms practitioners in terms of complexity [6]. There is growing optimism that technological innovations will bridge the gap between scientific possibilities and practical healthcare delivery by supporting caregivers and enabling more individualized and effective treatment strategies in an environment with high volumes of data [7].
High expectations are set on artificial intelligence-based clinical decision support tools to augment physicians' decision-making in order to keep pace with this rapid development [8, 9]. Historically, the cumbersome digitization of German healthcare has led to a gap between technological capabilities and current practices, which continues to widen [10]. A nationwide survey conducted by the Commission Digital Medicine of the German Association of Gynecology and Obstetrics (DGGG) revealed high heterogeneity in digital infrastructure within the field of gynecology, characterized by low interoperability and outdated systems, leading to dissatisfaction among healthcare providers [11]. In contrast, most gynecology specialists are optimistic that digitization could ease their growing workloads and enhance patient care, and they foresee the adoption of smart algorithms to assist in patient treatment [12]. In the meantime, it has become normal for patients to assess new symptoms digitally before visiting a doctor, i.e., using online search engines and dedicated app-based symptom checkers [13,14,15]. This includes the recent widespread availability of Large Language Models (LLM), with tech-savvy individuals increasingly turning to public chatbots for health-related inquiries [9, 16]. This shift toward relying on easily accessible online resources, evolving from simple Google keyword searches to consulting advanced tools like ChatGPT, highlights a new reality that likewise demands the promotion of scientific evidence for the medical use of LLM.
The emergence of publicly available LLM in artificial intelligence has opened a new field in medical research, which still lacks defined methodological guard rails and best practices. Preliminary proof-of-concept analyses have indicated potential in using these models as supplementary tools in tumor boards [16,17,18,19,20,21]. In breast cancer care, a few preliminary assessments have explored the accuracy of LLM in supporting decision-making through the evaluation of brief clinical scenarios [22, 23], but also high complexity cases [24]. The most recent literature increasingly challenges the consistency of LLM, highlighting significant changes in explanatory value over short intervals and emphasizing the necessity of their ongoing monitoring [31]. The quality of AI-generated abstracts has reached a level of medical appropriateness at which experts find it challenging to distinguish them from specialist-written content in a blinded review process [32].
With regard to tumor board decision-making, Lukac et al. and Sorin et al. retrospectively compared the answers of GPT3.5 (version September 2021) to the past treatment recommendations of a single tumor board [22, 23]. These studies represent initial explorations of the technology rather than definitive benchmarks for evaluating the capabilities of ChatGPT3.5. Their experiments included only the LLM ChatGPT3.5, involved a constrained and unstructured collection of patient profiles with restricted health data, and utilized a short and limited prompting strategy. Additionally, their assessments were based on a self-developed scoring system. Notably, the studies omitted genetic testing for most cases, which is a crucial factor in the characterization of breast cancer. Both preliminary assessments inferred from their findings that the advice given by language model-based systems could align with that of a tumor board, but refrained from definitive statements about the specific performance level of LLM in their conclusions. Our research builds on the findings of Lukac et al. and Sorin et al. and seeks to extend them in a systematic manner [22, 23]. Therefore, we confirmed GPT3.5's potential for managing high complexity cases by employing a standardized prompting model and using comprehensive health data profiles as described in the methodology [24]. This subsequent early feasibility study (EFS) provides further insights by comparing different LLM versions and monitoring development over time, with access to more data and technological upgrades. It matches a generic observation by Eriksen et al. of superior performance by GPT4 for diagnosing complex clinical cases and confirms this finding in the field of breast cancer care [33]. Furthermore, it confirms the most recently raised critique of LLM regarding a persisting challenge in answer consistency in the field of breast cancer treatment [25]. This relates to the deterioration in GPT3.5's accuracy over the observation period.
This points toward the possible issue that extending data access to uncontrolled sources used for decision-making does not necessarily improve LLM accuracy but may instead lead to confusion in the models.
Limitations and implications for methodological and technological development of LLM application in breast cancer care
By monitoring the evolution of LLM, this study shows that especially the update to the GPT4 algorithm enables increasing alignment with the recommendations of the MTB. It indicates that technological applicability is rapidly developing toward the maturity needed to provide clinical decision support, even for complex decision-making in breast cancer care. Nevertheless, the study also underlines that clinical use of LLM is not yet feasible at present. Several unresolved regulatory hurdles and missing evidence on the peculiarities of clinical application should preclude their current use in clinical care. The current level of evidence regarding the use of LLM in breast cancer therapy leaves crucial questions unanswered, which can also be derived from this study.
The initially chosen prompting model only required the LLM to indicate whether or not chemotherapy should be given. However, the recommendation of multi-gene assays to assess disease prognosis and predict chemotherapy benefit was not queried. Given the increasing use of such tests and their growing clinical relevance, future prompting models should include a query on the need for multi-gene assays to assess the necessity of chemotherapy. This finding underscores the methodological need to develop sophisticated prompting models tailored to the specifics of the oncologic entity being investigated in order to improve the consistency of LLM answers.
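To make this concrete, a structured prompting model of the kind proposed here could enumerate each treatment domain explicitly, including the multi-gene assay query that the initial model lacked. The template below is purely illustrative: the wording, field names, and domain list are assumptions for demonstration and do not reproduce the study's actual standardized prompting model [24].

```python
# Hypothetical structured prompt template for a breast cancer case.
# Wording and fields are illustrative, not the study's actual prompt.

PROMPT_TEMPLATE = (
    "You are supporting a breast cancer multidisciplinary tumor board.\n"
    "Patient profile: {profile}\n"
    "For each item below, state a single recommendation:\n"
    "1. Surgical treatment\n"
    "2. Endocrine therapy\n"
    "3. Systemic (neoadjuvant/adjuvant) therapy\n"
    "4. Radiotherapy\n"
    "5. Genetic testing for hereditary breast and ovarian cancer\n"
    "6. Multi-gene assay to assess chemotherapy benefit (yes/no)\n"
)

def build_prompt(profile: str) -> str:
    """Fill the structured template with a patient profile string."""
    return PROMPT_TEMPLATE.format(profile=profile)
```

Enumerating the domains in a fixed order would also make the answers directly comparable across LLM versions and over time, which is the kind of consistency the study calls for.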
Furthermore, the study uses the recommendations of a single MTB as the gold standard for assessing concordance in LLM decision-making. Large-scale observational studies, conducted by several international study groups, have revealed notable disparities in breast cancer treatment choices and outcomes [34, 35]. There is often considerable scope for decision-making on available treatment options, such as varying intensities of chemotherapy regimens, which reflects the diversity in national standards and respective guidelines. This issue also explains the rather moderate results for DCIS profiles in this study. The LLM consistently recommended endocrine therapy, as, for example, suggested in a meta-analysis by Yan et al. from 2020 [36]. In contrast, the MTB in this study decided against endocrine therapy in the DCIS cases, a decision taken in interdisciplinary discourse and within the decision-making scope of the German guidelines. However, as a dedicated EFS with a small group of 20 cases, no conclusions should be drawn regarding LLM accuracy for different cancer subtypes, stages of the disease, i.e., precancerous or advanced metastasized illness, or treatment options. Hence, to ensure the evidence-based and safe use of LLM in breast cancer care, these open questions must be adequately addressed by further research. Subsequent studies should incorporate larger study populations and multicenter study designs to expand findings from a preclinical simulation environment into clinical care.
At a technological level, a lack of control over the sources used for decision-making and a lack of security in the processing of health data have so far prevented the use of LLM in clinical care. The deterioration in GPT3.5's accuracy over the observation period, which appears to be connected to the extension of data access, underlines how uncontrolled and enlarged input of sources can contribute to confusion in the models. It remains unclear which sources the open LLM use for decision-making, a problem also visible in the moderate DCIS results, as the evidence the LLM use to recommend endocrine therapy cannot be derived from their responses. In alignment with the Explainable AI approach, the technological application should offer the possibility of gaining control over the sources used for decision-making while ensuring security in the processing of personal health data, i.e., by limiting processing to local servers.
Opportunities for breast cancer care
Considering the findings of the national survey conducted by the Commission Digital Medicine of the German Association of Gynecology and Obstetrics (DGGG), 61.4% of specialists agree or strongly agree that intelligent algorithms will support clinicians in treating patients, and the majority support the perception that this will improve patient care (65.1% agree or strongly agree) and help reduce the increasing workload (78.4% agree or strongly agree) [12]. These expectations are accompanied by the aforementioned intensified care complexity due to the rapid increase in evidence-based knowledge and case load in gynecological oncology [4, 5]. From this perspective, easily accessible and user-friendly publicly available LLM may provide a prospective solution for breaking down prevailing barriers [37]. As presented in this study, clinical use of LLM is not yet feasible. Nevertheless, the controlled and evidence-based adaptation of LLM, i.e., the optimization of prompting techniques or the enabling of data input control and secure data processing, offers potential for LLM to bring value to patients in clinical breast cancer care.
Conclusion
This early feasibility study demonstrates that publicly available LLM, especially GPT4, increasingly align with the decision-making of a multidisciplinary tumor board, and it confirms that decision consistency remains a major issue for the application of LLM in breast cancer care. The findings underline that clinical use of LLM is not yet feasible. Nevertheless, the study gathers insights that could inform potential modifications to LLM application. Methodological advancement, i.e., the optimization of prompting techniques, and technological development, i.e., enabling data input control and secure data processing, are necessary in the preparation of large-scale and multicenter studies. These will subsequently provide further essential evidence on the safe and reliable application of LLM in breast cancer care to maximize benefits for providers and patients alike.
Data availability
Datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.
References
Brustdrüse–C 50 (2023) In: Robert Koch Institut (ed) Krebs in Deutschland für 2019/2020, 14th edition, Berlin, pp 78–81. https://doi.org/10.25646/11357
European Commission (2021) Europe’s beating cancer plan. https://health.ec.europa.eu/system/files/2022-02/eu_cancer-plan_en_0.pdf. Accessed 20 Dec 2023
German Guideline Program in Oncology (German Cancer Society, German Cancer Aid, AWMF) (2021) Interdisciplinary evidence-based practice guideline for early detection, diagnosis, treatment and follow-up of breast cancer, long version 4.4, AWMF registration number: 032/045OL. https://www.leitlinienprogramm-onkologie.de/fileadmin/user_upload/S3_Guideline_Breast_Cancer.pdf. Accessed 20 Dec 2023
Tarawneh TS, Rodepeter FR, Teply-Szymanski J et al (2022) Combined focused next-generation sequencing assays to guide precision oncology in solid tumors: a retrospective analysis from an institutional molecular tumor board. Cancers (Basel). https://doi.org/10.3390/cancers14184430
Santa-Maria CA, Wolff AC (2023) Antibody-drug conjugates in breast cancer: searching for magic bullets. J Clin Oncol 41(4):732–735. https://doi.org/10.1200/JCO.22.02217
Bhattacharya T, Brettin T, Doroshow JH et al (2019) AI meets exascale computing: advancing cancer research with large-scale high performance computing. Front Oncol 2(9):984. https://doi.org/10.3389/fonc.2019.00984
Barker AD, Lee JSH (2022) Translating “Big Data” in oncology for clinical benefit: progress or paralysis. Cancer Res 82:2072–2075. https://doi.org/10.1158/0008-5472.CAN-22-0100
Poon H (2023) Multimodal generative AI for precision health. NEJM AI. https://doi.org/10.1056/AI-S2300233
Goldberg C (2023) Patient portal. NEJM AI. https://doi.org/10.1056/AIp2300189
Thiel R, Deimel L, Schmidtmann D et al (2018) Gesundheitssystem-Vergleich Fokus Digitalisierung #SmartHealthSystems: Digitalisierungsstrategien im internationalen Vergleich. https://www.bertelsmann-stiftung.de/fileadmin/files/Projekte/Der_digitale_Patient/VV_SHS-Gesamtstudie_dt.pdf. Accessed 20 Dec 2023
Pfob A, Griewing S, Seitz K et al (2023) Current landscape of hospital information systems in gynecology and obstetrics in Germany: a survey of the commission Digital Medicine of the German Society for Gynecology and Obstetrics. Arch Gynecol Obstet 308:1823–1830. https://doi.org/10.1007/s00404-023-07223-1
Pfob A, Hillen C, Seitz K et al (2023) Status quo and future directions of digitalization in gynecology and obstetrics in Germany: a survey of the commission Digital Medicine of the German Society for Gynecology and Obstetrics. Arch Gynecol Obstet. https://doi.org/10.1007/s00404-023-07222-2
Millenson ML, Baldwin JL, Zipperer L et al (2018) Beyond Dr. Google: the evidence on consumer-facing digital tools for diagnosis. Diagnosis (Berl) 5(3):95–105. https://doi.org/10.1515/dx-2018-0009
Pergolizzi J Jr, LeQuang JAK, Vasiliu-Feltes I et al (2023) Brave new healthcare: a narrative review of digital healthcare in American medicine. Cureus. https://doi.org/10.7759/cureus.46489
Knitza J, Muehlensiepen F, Ignatyev Y et al (2022) Patient’s perception of digital symptom assessment technologies in rheumatology: results from a multicentre study. Front Public Health. https://doi.org/10.3389/fpubh.2022.844669
Betzler BK, Chen H, Cheng CY et al (2023) Large language models and their impact in ophthalmology. Lancet Digit Health 5(12):e917–e924. https://doi.org/10.1016/S2589-7500(23)00201-7
Buhr CR, Smith H, Huppertz T et al (2023) ChatGPT versus consultants: blinded evaluation on answering otorhinolaryngology case–based questions. JMIR Med Educ 9:e49183. https://doi.org/10.2196/49183
Massey PA, Montgomery C, Zhang AS (2023) Comparison of ChatGPT–3.5, ChatGPT-4, and orthopaedic resident performance on orthopaedic assessment examinations. J Am Acad Orthop Surg 31(23):1173–1179. https://doi.org/10.5435/JAAOS-D-23-00396
Roos J, Kasapovic A, Jansen T et al (2023) Artificial intelligence in medical education: comparative analysis of ChatGPT, Bing, and medical students in Germany. JMIR Med Educ. https://doi.org/10.2196/46482
Takagi S, Watari T, Erabi A et al (2023) Performance of GPT-3.5 and GPT-4 on the Japanese Medical Licensing Examination: comparison study. JMIR Med Educ. https://doi.org/10.2196/48002
Schopow N, Osterhoff G, Baur D (2023) NLP applications in clinical practice: a comparative study and augmented systematic review with ChatGPT (Preprint). JMIR Med Inform. https://doi.org/10.2196/48933
Lukac S, Dayan D, Fink V et al (2023) Evaluating ChatGPT as an adjunct for the multidisciplinary tumor board decision-making in primary breast cancer cases. Arch Gynecol Obstet 308:1831–1844. https://doi.org/10.1007/s00404-023-07130-5
Sorin V, Klang E, Sklair-Levy M et al (2023) Large language model (ChatGPT) as a support tool for breast tumor board. NPJ Breast Cancer. https://doi.org/10.1038/s41523-023-00557-8
Griewing S, Gremke N, Wagner U et al (2023) Challenging ChatGPT 3.5 in senology—an assessment of concordance with breast cancer tumor board decision making. J Pers Med. https://doi.org/10.3390/jpm13101502
Chen L, Zaharia M, Zou J (2023) How is ChatGPT's behavior changing over time? (preprint). arXiv. https://doi.org/10.48550/arXiv.2307.09009
U.S. Food and Drug Administration (2013) Investigational Device Exemptions (IDEs) for early feasibility medical device clinical studies, including certain First in Human (FIH) studies: guidance for industry and Food and Drug Administration staff. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/investigational-device-exemptions-ides-early-feasibility-medical-device-clinical-studies-including. Accessed 5 Mar 2024
Innovative Health Initiative (2023) Improving patient access to innovative medical technologies in the European Union. https://heuefs.eu/wp-content/uploads/2024/01/HEU-EFS_consortium_press_release.pdf. Accessed 5 Mar 2024
Rao A, Pang M, Kim J et al (2023) Assessing the utility of ChatGPT throughout the entire clinical workflow: development and usability study. J Med Internet Res 25:e48659. https://doi.org/10.2196/48659
Rao A, Kim J, Kamineni M et al (2023) Evaluating GPT as an adjunct for radiologic decision making: GPT-4 versus GPT-3.5 in a breast imaging pilot. J Am Coll Radiol. https://doi.org/10.1016/j.jacr.2023.05.003
Haver HL, Ambinder EB, Bahl M et al (2023) Appropriateness of breast cancer prevention and screening recommendations provided by ChatGPT. Radiology. https://doi.org/10.1148/radiol.230424
Choi HS, Song JY, Shin KH et al (2023) Developing prompts from large language model for extracting clinical information from pathology and ultrasound reports in breast cancer. Radiat Oncol J 41:209–216. https://doi.org/10.3857/roj.2023.00633
Gao CA, Howard FM, Markov NS et al (2023) Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers. NPJ Digit Med. https://doi.org/10.1038/s41746-023-00819-6
Eriksen AV, Möller S, Ryg J (2023) Use of GPT-4 to diagnose complex clinical cases. NEJM AI. https://doi.org/10.1056/aip2300031
van Walle L, Verhoeven D, Marotti L, Ponti A, Tomatis M, Rubio IT, EUSOMA Working Group (2023) Trends and variation in treatment of early breast cancer in European certified breast centres: an EUSOMA-based analysis. Eur J Cancer 192:113244. https://doi.org/10.1016/j.ejca.2023.113244
Derks MGM, Bastiaannet E, Kiderlen M et al (2018) Variation in treatment and survival of older patients with non-metastatic breast cancer in five European countries: a population-based cohort study from the EURECCA Breast Cancer Group. Br J Cancer 119:121–129. https://doi.org/10.1038/s41416-018-0090-1
Yan Y, Zhang L, Tan L et al (2020) Endocrine therapy for Ductal Carcinoma In Situ (DCIS) of the breast with Breast Conserving Surgery (BCS) and Radiotherapy (RT): a meta-analysis. Pathol Oncol Res 26:521–531. https://doi.org/10.1007/s12253-018-0553-y
Gottlieb S, Silvis L (2023) How to safely integrate large language models into health care. JAMA Health Forum 4(9):e233909. https://doi.org/10.1001/jamahealthforum.2023.3909
Acknowledgements
We would like to thank our colleague Julia Greenfield from the Institute of Digital Medicine for her support in the linguistic revision of the final manuscript.
Funding
Open Access funding enabled and organized by Projekt DEAL. The authors declare that no funds, grants or other support were received during the preparation of this manuscript. S.G. was supported by the Clinician Scientist program (SUCCESS-program) of Philipps-University of Marburg and the University Hospital of Giessen and Marburg.
Author information
Authors and Affiliations
Contributions
S Griewing Protocol, project development, data collection, data management, data analysis, manuscript writing, manuscript editing. J Knitza Project development, manuscript editing, data analysis. J Boekhoff Protocol, project development, data analysis. C Hillen Manuscript editing, data analysis. F Lechner Protocol, project development. U Wagner Manuscript editing, project development. S Kuhn Manuscript editing, project development. M Wallwiener Manuscript editing, project development. All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Sebastian Griewing, Johannes Knitza and Jelena Boekhoff. The first draft of the manuscript was written by Sebastian Griewing and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Ethics approval
This is an observational study. The Philipps-University Marburg Research Ethics Committee has confirmed that no ethical approval is required (23–300 ANZ).
Consent to participate
Does not apply.
Consent to publish
Does not apply.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Griewing, S., Knitza, J., Boekhoff, J. et al. Evolution of publicly available large language models for complex decision-making in breast cancer care. Arch Gynecol Obstet 310, 537–550 (2024). https://doi.org/10.1007/s00404-024-07565-4