Knowledge-Practice Performance Gap in Clinical Large Language Models: Systematic Review of 39 Benchmarks

doi:10.2196/84120

Published on 01.Dec.2025 in Vol 27 (2025)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/84120, first published 15.Sep.2025.

Eun Jeong Gong^{1, 2, 3}; Chang Seok Bang^{1, 2, 3}; Jae Jun Lee^{3, 4}; Gwang Ho Baik^{1, 2}

Cited by (22)

Journals

Lin X, Yang Y, Ren Y. Making Chatbots more human: deep reasoning large language models in ophthalmology. Frontiers in Medicine 2026;12
Spieser J, Balapour A, Meller J, Patra K, Shamsaei B. A Review of Multi-Agent AI Systems for Biological and Clinical Data Analysis. Methods and Protocols 2026;9(2):33
Zhu Q, Li Q, Zan Y, Lu Y, Xia L, Xia Y, Xu T. Patient-centered gastrointestinal function assessment technologies: a paradigm shift from traditional approaches to non-invasive innovations. Frontiers in Physiology 2026;17
Prause M. No skin in the game: why agentic AI requires principal-agent governance. AI and Ethics 2026;6(2)
Eltaybani S. Knowledge Cut‐Off in Large Language Models: Implications for Critical Care Nursing. Nursing in Critical Care 2026;31(3)
Lee W, Kim J, Leem J, Lee B, Lee S, Kim Y. Benchmark Evaluation of a Tool-Augmented Large Language Model Agent Using Traditional Asian Medicine Metadata. Applied Sciences 2026;16(7):3377
Mine Y, Taji T, Okazaki S, Takeda S, Shimoe S, Kaku M, Nikawa H, Kakimoto N, Murayama T. Beyond exam accuracy: Tracking a persistent-failure set reveals visual dental reasoning gaps in multimodal LLMs. Journal of Dentistry 2026;170:106675
Wang X, Yin C, He H, Guo J, Fu X, Bai F. Benchmarking public large language model responses to patient-facing inflammatory bowel disease questions: informational quality, transparency proxies, and readability. Frontiers in Public Health 2026;14
Keshav T, Chow D, Kippenberger T, Livezey J, Aranda M. Evaluating Large-Language Models Against Providers on Surgical Diagnostic Reasoning Tasks. Journal of Surgical Research 2026;322:259
Rajwal S, Pandey A, Zhang Z, Chen Y, Liu M, Das S, Rogers H, Sarker A, Xiao Y. Applications of Natural Language Processing and Large Language Models for Social Determinants of Health: Systematic Review. Journal of Medical Internet Research 2026;28:e83793
Bajwa M, Hoyt R, Knight D, Haider M. The Performance of DeepSeek R1 and Gemini 3 in Complex Medical Scenarios: Comparative Study. JMIRx Med 2026;7:e76822
Chang Y, Hsieh M, Ju P, Liu Y, Chang C. Clinical Plausibility in Large Language Model Robustness Testing for Medicine: A Scoping Review. Journal of Medical Systems 2026;50(1)
Karunanayake N. When Chatbots Become Agents: The Next Phase of Healthcare AI. Journal of Medical Systems 2026;50(1)
Yeh Y, Shih M, De Backer D, Celi L, See K, Fujii T, Ling L, Mongkolpun W, Hu H, Chen H, Chen W, Cholley B, Fong K, Ryu H, Na S, Egi M, Chan W, Chen K, Kamaleswaran R, Chuang Y, Yang C, Hsiao W, Lai S, Ku D, Jahan A, Martin G. The IMPACT framework for evaluating generative AI in critical care: development and multinational consensus validation. Annals of Intensive Care 2026;16:100078
Khosravi M, Zamaninasab Z, Khosravi F, Attar M, Arab‐Zozani M. Performance of Large Language Models in Answering Healthcare Delivery Questions: A Quantitative Cross‐Sectional Study. Health Science Reports 2026;9(6)
Khosravi M, Dindar E, Sayar B. Evaluating large language model's performance in answering principles of health course questions. Scientific Reports 2026;16(1)
Karataş S, Öner S. Pre-deployment safety and governance assessment of LLM-based clinical decision support systems: A health technology assessment-oriented evaluation framework. Health Policy and Technology 2026;15(9):101281
Zhang W, Xu J, Dong T, Hao X, Yang Q, Zhang J, Han Y. Exploring large language models as a prescription decision support tool for rational antibiotic use: A dual-framework analysis using standardized examinations and real-world clinical cases. Exploratory Research in Clinical and Social Pharmacy 2026;23:100821
Ucdal M, Ekingen E, Kurtcebe A. Benchmark Performance of a Neurosymbolic Multi Model Large Language Reasoning Pipeline Versus Board Certified Specialists and Single Model Baselines on Septic Arthritis: A Five Center Prospective Benchmarking Study with Item Level and Question Subtype Analysis (Preprint). JMIR AI 2026
Zhang Z, Chen L, Lv Z, Lv H, Sheng W, Wei Z, Wang B, Shen Y, Tian Y, Hu J, Shen Z, Lv L. Discordance Between Textual Reasoning and Visual Interpretation in Large Language Models for Low Back Pain: An Evaluation of Reliability and Clinical Implications (Preprint). JMIR Medical Informatics 2026
Dong Y, Cheng J, Ding C, Lu R. Governing Clinical Readiness Claims Derived from Medical AI Benchmark Results. Journal of Medical Systems 2026;50(1)

Conference Proceedings

Kumar A, Joshi S, Sachdeva S. JsonUtil: An Open-Source RESTful JSON-Based Dynamic Form Generation Framework validation with OpenEHR ORBDA Benchmarking Dataset. 2026 International Conference on Signal Processing and Electronics Design (ICSPED)

Citation

Please cite as:

Gong EJ, Bang CS, Lee JJ, Baik GH
Knowledge-Practice Performance Gap in Clinical Large Language Models: Systematic Review of 39 Benchmarks
J Med Internet Res 2025;27:e84120
doi: 10.2196/84120 PMID: 41325597 PMCID: 12706444


This paper is in the following e-collection/theme issue:

Digital Health Reviews (3566) Clinical Informatics (2183) Natural Language Processing (1250) mHealth in a Clinical Setting (1110) Artificial Intelligence (4612)
