Large Language Models in Worldwide Medical Exams: Platform Development and Comprehensive Analysis

Citation

Please cite as:

Zong H, Wu R, Cha J, Wang J, Wu E, Li J, Zhou Y, Zhang C, Feng W, Shen B
Large Language Models in Worldwide Medical Exams: Platform Development and Comprehensive Analysis
J Med Internet Res 2024;26:e66114
doi: 10.2196/66114 PMID: 39729356 PMCID: 11724220

This paper is in the following e-collection/theme issue:

Generative Language Models Including ChatGPT
Digital Health Reviews
e-Learning and Digital Medical Education
Reviews in Medical Education
New Methods and Approaches in Medical Education
Learning and Education
Machine Learning
