Abstract
Within the domain of Natural Language Processing (NLP), Large Language Models (LLMs) are sophisticated models engineered to comprehend, generate, and manipulate human-like text on an extensive scale. They are transformer-based deep learning architectures, obtained through the scaling of model size, pretraining corpora, and computational resources. Their potential healthcare applications primarily involve chatbots and interactive systems for clinical documentation management and medical literature summarization (biomedical NLP). A key challenge in this field is research into applications for diagnostic and clinical decision support, as well as patient triage. LLMs can therefore be used for multiple tasks within patient care, research, and education. Throughout 2023, there was an escalation in the release of LLMs, some of which are applicable in the healthcare domain. This remarkable output is largely the effect of customizing pre-trained models for applications like chatbots, virtual assistants, or any system requiring human-like conversational engagement. As healthcare professionals, we recognize the imperative to stay at the forefront of knowledge. However, keeping abreast of the rapid evolution of this technology is practically unattainable and, above all, understanding its potential applications and limitations remains a subject of ongoing debate. Consequently, this article aims to provide a succinct overview of recently released LLMs, emphasizing their potential use in the field of medicine. Perspectives for a more extensive range of safe and effective applications are also discussed. The upcoming evolutionary leap involves the transition from an AI-powered model primarily designed for answering medical questions to a more versatile and practical tool for healthcare providers, such as generalist biomedical AI systems for multimodal, calibrated decision-making processes.
On the other hand, the development of more accurate virtual clinical partners could enhance patient engagement, offer personalized support, and improve chronic disease management.
Introduction
Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) that focuses on the interaction between computers and human language. Notably, NLP models enable machines to understand, interpret, and generate human-like text or speech. Large Language Models (LLMs) are advanced NLP models within the category of pre-trained language models (PLMs), achieved through the scaling of model size, pretraining corpora, and computational resources [1]. Briefly, LLMs are developed through deep learning methodologies, particularly transformer architectures. These are neural network models implementing self-attention mechanisms, which enable the model to consider the entire context rather than being restricted to fixed-size windows, and multi-head attention, which captures contextual relationships in input sequences. In this process, recurrent and convolutional layers are not required [2]. Other crucial components of the transformer architecture include encoder and decoder structures, which respectively process the input sequence and generate the output sequence. Nevertheless, the architecture of a transformer can vary depending on its specific task and design. Some transformers are designed with only an encoder structure, while others include both encoder and decoder components. For example, in tasks like language translation, where input and output sequences are involved, both encoder and decoder modules are required. Conversely, for language modeling or text classification, only an encoder may be used. Other key elements of a transformer architecture encompass feedforward neural networks, which capture complex, non-linear relationships in the data, and positional encodings, which provide information about the positions of tokens in the sequence [41].

These approaches may offer alternative learning pathways and can be used for designing interactive tools for medical education, enhancing the learning experience [42].
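To make these mechanisms concrete, the following minimal sketch computes sinusoidal positional encodings and scaled dot-product self-attention over toy token embeddings. It is written in plain Python for exposition only; real LLMs use optimized deep learning libraries, learned projection matrices, and many attention heads.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def positional_encoding(pos, d_model):
    # Sinusoidal encoding (Vaswani et al.): even dimensions use sine,
    # odd dimensions use cosine, at geometrically spaced frequencies.
    pe = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

def self_attention(q, k, v):
    # q, k, v: lists of d-dimensional vectors (one per token).
    # Each output token is a weighted mix of ALL value vectors, so the
    # context is not limited to a fixed-size window.
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
                  for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[dim] for w, vj in zip(weights, v))
                    for dim in range(d)])
    return out

# Three toy one-hot 4-dimensional token embeddings, with positional
# information added element-wise before attention is applied.
tokens = [[float(t == i) for i in range(4)] for t in range(3)]
embedded = [[e + p for e, p in zip(tok, positional_encoding(pos, 4))]
            for pos, tok in enumerate(tokens)]
context = self_attention(embedded, embedded, embedded)
print(len(context), len(context[0]))  # 3 tokens, each still 4-dimensional
```

Multi-head attention would simply run several such attention computations in parallel on learned projections of the same inputs and concatenate the results.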
LLMs can also be harnessed for generating case scenarios or quizzes, aiding medical students in practicing and refining their diagnostic and treatment planning skills within a secure and controlled environment [43]. The integration of LLMs in gamification processes represents another captivating perspective [44]. The enhancement of tools for interacting with the patient (virtual clinical partners) could lead to improving patient engagement, providing personalized support, and enhancing chronic disease management [45].
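As a concrete illustration of the quiz-generation use case above, a case scenario could be produced by sending a structured prompt to any chat-style LLM. The template below is a hypothetical sketch (no specific model or vendor API is assumed), and any generated content would still require expert review before educational use:

```python
def build_case_quiz_prompt(specialty, difficulty, n_questions):
    # Assemble an instruction for a chat-style LLM asking it to generate
    # a clinical case scenario followed by multiple-choice questions.
    # The returned string would be sent to the model endpoint of choice;
    # no particular vendor API is assumed here.
    return (
        f"You are a medical educator. Write a clinical case scenario in "
        f"{specialty} at {difficulty} difficulty, followed by "
        f"{n_questions} multiple-choice questions. For each question, give "
        f"four options, mark the correct answer, and add a one-sentence "
        f"rationale citing the relevant finding in the case."
    )

prompt = build_case_quiz_prompt("emergency medicine", "intermediate", 3)
print(prompt)
```

Keeping the prompt parameterized in this way lets an educator vary specialty and difficulty systematically, which is also the natural hook for gamification features such as levels or scoring.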
The key perspective revolves around addressing the limitations of LLMs, which encompass challenges such as misinformation, privacy issues, biases in training data, and the risk of misuse [46, 47]. The phenomenon of hallucination can dangerously propagate medical misinformation or introduce biases that have the potential to exacerbate health disparities. In a recent study, for example, Birkun and Gautam [48] showed that the advice provided by LLM chatbots (Bing, Microsoft Corporation, USA, and Bard, Google LLC, USA) for assisting a non-breathing victim lacked crucial details of resuscitation techniques and, at times, provided misleading and potentially harmful instructions. In another study, carried out to assess the accuracy of ChatGPT, Google Bard, and Microsoft Bing in distinguishing between a medical emergency and a non-emergency, the authors concluded that the examined tools need additional improvement to accurately identify different clinical situations [49]. Continuous verification of the output’s appropriateness is crucial. Significantly, in November 2022, Meta’s Galactica model was withdrawn just a few days after its release due to the generation of inaccurate data [50]. The overarching goal is to ensure NLP assurance, a comprehensive process incorporated at every stage of the NLP development lifecycle, aiming to validate and verify outcomes and make them trustworthy and explainable to non-experts. Additionally, it underscores ethical deployment, unbiased learning, and fairness toward users [51].
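One simple ingredient of such an assurance pipeline can be sketched as an automated checklist that flags generated first-aid advice in which guideline-mandated steps are absent. The checklist items and cue phrases below are simplified illustrations for exposition, not a clinical standard; a deployed system would encode actual resuscitation guidelines and combine this with human oversight:

```python
# Illustrative required steps for CPR advice, each with example cue
# phrases whose presence counts as coverage of that step.
REQUIRED_STEPS = {
    "call emergency services": ["call 911", "call emergency", "emergency number"],
    "chest compressions": ["chest compression"],
    "compression rate": ["100", "120"],
}

def missing_steps(advice):
    # Return the checklist items not covered by the generated advice.
    text = advice.lower()
    return [step for step, cues in REQUIRED_STEPS.items()
            if not any(cue in text for cue in cues)]

advice = ("If the victim is not breathing, call 911, then start chest "
          "compressions in the center of the chest.")
print(missing_steps(advice))  # the compression-rate detail is absent
```

Even a crude verifier of this kind would have flagged the incomplete resuscitation instructions reported in the chatbot studies cited above, by blocking or annotating outputs before they reach a user.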
In the training phase, the accuracy of the output relies heavily on the choice of the reference dataset [1].

Declarations
Data Availability
No datasets were generated or analysed during the current study.
References
Ouyang L, Wu J, Jiang X, Almeida D, Wainwright CL, Mishkin P, Zhang C, Agarwal S, Slama K. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems. 2022;35:730–744.
Kalyan KS, Rajasekharan A, Sangeetha S. AMMU: a survey of transformer-based biomedical pretrained language models. J Biomed Inform. 2022;126:103982.
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I. Attention Is All You Need. 2017. arXiv:1706.03762.
OpenAI. ChatGPT Release Notes. Available at: https://help.openai.com/en/articles/6825453-chatgpt-release-notes#h_4799933861. Last Accessed: December 22, 2023.
Tian S, Jin Q, Yeganova L, Lai P-T, Zhu Q, Chen X, Yang Y, Chen Q, Kim W, Comeau DC, Islamaj R, Kapoor A, Gao X, Lu Z. Opportunities and Challenges for ChatGPT and Large Language Models in Biomedicine and Health. arXiv:2306.10070 (2023).
Radford A, Narasimhan K. Improving Language Understanding by Generative Pre-Training. 2018. https://api.semanticscholar.org/CorpusID:49313245.
Cao Z, Wong K, Lin CT. Weak Human Preference Supervision for Deep Reinforcement Learning. IEEE Trans Neural Netw Learn Syst. 2021;32(12):5369–5378. doi: https://doi.org/10.1109/TNNLS.2021.3084198.
Rafailov R, Sharma A, Mitchell E, Ermon S, Manning CD, Finn C. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290 (2023).
Gao L, Biderman S, Black S, Golding L, Hoppe T, Foster C, Phang J, He H, Thite A, Nabeshima N, Presser S, Leahy C. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv:2101.00027 (2020).
Meta AI Request Form. Available at: https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform Last Accessed: December 22, 2023.
Li Y, Li Z, Zhang K, Dan R, Jiang S, Zhang Y. ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge. Cureus. 2023;15(6):e40895. doi: https://doi.org/10.7759/cureus.40895.
Microsoft Bing Blog. Available at: https://blogs.bing.com/search/november-2023/our-vision-to-bring-microsoft-copilot-to-everyone-and-more. Last Accessed: December 24, 2023.
ZDNET Information. Available at: https://www.zdnet.com/article/what-is-copilot-formerly-bing-chat-heres-everything-you-need-to-know/. Last Accessed: December 24, 2023.
Avanade Insight. Available at: https://www.avanade.com/en/blogs/avanade-insights/health-care/ai-copilot. Last Accessed: December 24, 2023.
OpenAI. GPT-4 Technical Report. arXiv:2303.08774 (2023).
The decoder. Available at: https://the-decoder.com/gpt-4-architecture-datasets-costs-and-more-leaked/ Last Accessed: December 22, 2023
Lee P, Bubeck S, Petro J. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. N Engl J Med. 2023;388(13):1233–1239. doi: https://doi.org/10.1056/NEJMsr2214184.
Bhayana R, Bleakney RR, Krishna S. GPT-4 in Radiology: Improvements in Advanced Reasoning. Radiology. 2023;307(5):e230987. doi: https://doi.org/10.1148/radiol.230987.
Jang D, Yun TR, Lee CY, Kwon YK, Kim CE. GPT-4 can pass the Korean National Licensing Examination for Korean Medicine Doctors. PLOS Digit Health. 2023;2(12):e0000416. doi: https://doi.org/10.1371/journal.pdig.0000416.
Guerra GA, Hofmann H, Sobhani S, Hofmann G, Gomez D, Soroudi D, Hopkins BS, Dallas J, Pangal DJ, Cheok S, Nguyen VN, Mack WJ, Zada G. GPT-4 Artificial Intelligence Model Outperforms ChatGPT, Medical Students, and Neurosurgery Residents on Neurosurgery Written Board-Like Questions. World Neurosurg. 2023;179:e160-e165. doi: https://doi.org/10.1016/j.wneu.2023.08.042.
Scheschenja M, Viniol S, Bastian MB, Wessendorf J, König AM, Mahnken AH. Feasibility of GPT-3 and GPT-4 for in-Depth Patient Education Prior to Interventional Radiological Procedures: A Comparative Analysis. Cardiovasc Intervent Radiol. 2023 Oct 23. doi: https://doi.org/10.1007/s00270-023-03563-2.
Spies NC, Hubler Z, Roper SM, Omosule CL, Senter-Zapata M, Roemmich BL, Brown HM, Gimple R, Farnsworth CW. GPT-4 Underperforms Experts in Detecting IV Fluid Contamination. J Appl Lab Med. 2023;8(6):1092–1100. doi: https://doi.org/10.1093/jalm/jfad058.
Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, Scales N, Tanwani A, Cole-Lewis H, Pfohl S, Payne P, Seneviratne M, Gamble P, Kelly C, Babiker A, Schärli N, Chowdhery A, Mansfield P, Demner-Fushman D, Agüera y Arcas B, Webster D, Corrado GS, Matias Y, Chou K, Gottweis J, Tomasev N, Liu Y, Rajkomar A, Barral J, Semturs C, Karthikesalingam A, Natarajan V. Large language models encode clinical knowledge. Nature. 2023;620(7972):172–180. doi: https://doi.org/10.1038/s41586-023-06291-2.
Madaan A, Tandon N, Gupta P, Hallinan S, Gao L, Wiegreffe S, Alon U, Dziri N, Prabhumoye S, Yang Y, et al. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651 (2023).
Singhal K, Tu T, Gottweis J, Sayres R, Wulczyn E, Hou L, Clark K, Pfohl S, Cole-Lewis H, Neal D, Schaekermann M, Wang A, Amin M, Lachgar S, Mansfield P, Prakash S, Green B, Dominowska E, Aguera y Arcas B, Tomasev N, Liu Y, Wong R, Semturs C, Mahdavi S. Towards Expert-Level Medical Question Answering with Large Language Models. arXiv:2305.09617v1 (2023).
Tu T, Azizi S, Driess D, Schaekermann M, Amin M, et al. Towards Generalist Biomedical AI. arXiv:2307.14334v1 (2023).
Hippocratic AI. Available at https://www.hippocraticai.com/. Last Accessed: December 24, 2023.
Hugging Face. MPT-7B. Available at: https://huggingface.co/mosaicml/mpt-7b. Last Accessed: December 24, 2023.
Kauf C, Ivanova AA, Rambelli G, Chersoni E, She JS, Chowdhury Z, Fedorenko E, Lenci A. Event Knowledge in Large Language Models: The Gap Between the Impossible and the Unlikely. Cogn Sci. 2023;47(11):e13386. doi: https://doi.org/10.1111/cogs.13386.
Touvron H, Martin L, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 (2023).
Ainslie J, Lee-Thorp J, de Jong M, Zemlyanskiy Y, Lebrón F, Sanghai S. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, Singapore. Association for Computational Linguistics. 2023.
Jiang AQ, Sablayrolles A, Mensch A, Bamford C, Singh Chaplot D, de las Casas D, Bressand F, Lengyel G, Lample G, Saulnier L, Lavaud LR, Lachaux MA, Stock P, Le Scao T, Lavril T, Wang T, Lacroix T, El Sayed W. Mistral 7B. arXiv:2310.06825 (2023).
An end-to-end guide on how to finetune an LLM (Mistral-7B) into a Medical Chat Doctor using Huggingface. Available at: https://medium.com/@SachinKhandewal/finetuning-mistral-7b-into-a-medical-chat-doctor-using-huggingface-qlora-peft-5ce15d45f581. Last Accessed: December 22, 2023.
Mistral AI. Available at: https://mistral.ai/news/mixtral-of-experts/ Last Accessed: December 24, 2023.
Nijkamp E, Xie T, Hayashi H, Pang B, Xia C, Xing C, Vig J, Yavuz S, Laban P, Krause B, Purushwalkam S, Niu T, Kryściński W, Murakhovs’ka L, Choubey PK, Fabbri A, Liu Y, Meng R, Tu L, Bhat M, Wu C-S, Savarese S, Zhou Y, Joty S, Xiong C. XGen-7B Technical Report. arXiv:2309.03450 (2023).
Peng C, Yang X, Chen A, Smith KE, PourNejatian N, Costa AB, Martin C, Flores MG, Zhang Y, Magoc T, Lipori G, Mitchell DA, Ospina NS, Ahmed MM, Hogan WR, Shenkman EA, Guo Y, Bian J, Wu Y. A study of generative large language model for medical research and healthcare. NPJ Digit Med. 2023;6(1):210. doi: https://doi.org/10.1038/s41746-023-00958-w.
Cunningham H, Ewart A, Riggs L, Huben R, Sharkey R. Sparse Autoencoders Find Highly Interpretable Features in Language Models. arXiv (2023).
Khan RA, Jawaid M, Khan AR, Sajjad M. ChatGPT – Reshaping medical education and clinical management. Pak J Med Sci. 2023;39(2):605–607. doi: https://doi.org/10.12669/pjms.39.2.7653.
Sallam M. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns. Healthcare (Basel). 2023;11(6):887. doi: https://doi.org/10.3390/healthcare11060887.
Eysenbach G. The Role of ChatGPT, Generative Language Models, and Artificial Intelligence in Medical Education: A Conversation With ChatGPT and a Call for Papers. JMIR Med Educ. 2023;9:e46885. doi: https://doi.org/10.2196/46885.
Cascella M, Cascella A, Monaco F, Shariff MN. Envisioning gamification in anesthesia, pain management, and critical care: basic principles, integration of artificial intelligence, and simulation strategies. J Anesth Analg Crit Care. 2023;3(1):33. doi: https://doi.org/10.1186/s44158-023-00118-2.
Haque A, Chowdhury N-U-R. The Future of Medicine: Large Language Models Redefining Healthcare Dynamics. TechRxiv. November 22, 2023. doi: https://doi.org/10.36227/techrxiv.24354451.v2.
Gurrapu S, Kulkarni A, Huang L, Lourentzou I, Batarseh FA. Rationalization for explainable NLP: a survey. Front Artif Intell. 2023;6:1225093. doi: https://doi.org/10.3389/frai.2023.1225093.
Cascella M, Montomoli J, Bellini V, Bignami E. Evaluating the Feasibility of ChatGPT in Healthcare: An Analysis of Multiple Clinical and Research Scenarios. J Med Syst. 2023;47(1):33. doi: https://doi.org/10.1007/s10916-023-01925-4.
Birkun AA, Gautam A. Large Language Model (LLM)-Powered Chatbots Fail to Generate Guideline-Consistent Content on Resuscitation and May Provide Potentially Harmful Advice. Prehosp Disaster Med. 2023;38(6):757–763. doi: https://doi.org/10.1017/S1049023X23006568.
Zúñiga Salazar G, Zúñiga D, Vindel CL, Yoong AM, Hincapie S, Zúñiga AB, Zúñiga P, Salazar E, Zúñiga B. Efficacy of AI Chats to Determine an Emergency: A Comparison Between OpenAI’s ChatGPT, Google Bard, and Microsoft Bing AI Chat. Cureus. 2023;15(9):e45473. doi: https://doi.org/10.7759/cureus.45473.
MIT Technology Review. Why Meta’s latest large language model survived only three days online. https://www.technologyreview.com/2022/11/18/1063487/meta-large-language-model-ai-only-survived-three-days-gpt-3-science/ Last Accessed: December 22, 2023.
Batarseh FA, Freeman L, Huang C-H. A survey on artificial intelligence assurance. J Big Data. 2021;8:7. doi: https://doi.org/10.1186/s40537-021-00445-7.
Manathunga S, Hettigoda I. Aligning Large Language Models for Clinical Tasks. arXiv:2309.02884 (2023).
Benary M, Wang XD, Schmidt M, Soll D, Hilfenhaus G, Nassir M, Sigler C, Knödler M, Keller U, Beule D, Keilholz U, Leser U, Rieke DT. Leveraging Large Language Models for Decision Support in Personalized Oncology. JAMA Netw Open. 2023;6(11):e2343689. doi: https://doi.org/10.1001/jamanetworkopen.2023.43689.
Rudin C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat Mach Intell. 2019;1:206–215. doi: https://doi.org/10.1038/s42256-019-0048-x.
Madsen A, Reddy S, Chandar S. Post-hoc Interpretability for Neural NLP: A Survey. ACM Computing Surveys. 2022;55(8):1–42. doi: https://doi.org/10.1145/3546577.
Tran D, Liu J, Dusenberry MW, Phan D, Collier M, Ren J, Han K, Wang Z, Mariet Z, Hu H, Band N, Rudner TJG, Singhal K, Nado Z, van Amersfoort J, Kirsch A, Jenatton R, Thain N, Yuan H, Buchanan K, Murphy K, Sculley D, Gal Y. Plex: towards reliability using pretrained large model extensions. Preprint at https://doi.org/10.48550/arXiv.2207.07411 (2022).
Brown T, et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020;33:1877–1901.
Lester B, Al-Rfou R, Constant N. The power of scale for parameter-efficient prompt tuning. Preprint at: https://doi.org/10.48550/arXiv.2104.08691 (2021).
Liang P, et al. Holistic evaluation of language models. Preprint at: https://doi.org/10.48550/arXiv.2211.09110 (2022).
Funding
None.
Open access funding provided by Università degli Studi di Parma within the CRUI-CARE Agreement.
Author information
Authors and Affiliations
Contributions
MC, FS, OP: 1) made substantial contributions to the conception of the work and to the acquisition, analysis, and interpretation of data; 2) drafted the work; 3) approved the version to be published; 4) agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
JM, VB, EB: 1) made substantial contributions to the conception and design of the work and to the analysis and interpretation of data; 2) revised the work critically for important intellectual content; 3) approved the version to be published; and 4) agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Corresponding author
Ethics declarations
Competing Interests
The authors declare no competing interests.
Ethical Approval
Not applicable.
Availability of Supporting data
Not applicable.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic Supplementary Material
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Cascella, M., Semeraro, F., Montomoli, J. et al. The Breakthrough of Large Language Models Release for Medical Applications: 1-Year Timeline and Perspectives. J Med Syst 48, 22 (2024). https://doi.org/10.1007/s10916-024-02045-3