Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis

doi:10.2196/60807

Published on 25.Jul.2024 in Vol 26 (2024)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/60807, first published 22.May.2024.

Mingxin Liu¹; Tsuyoshi Okuhara²; XinYi Chang³; Ritsuko Shirabe²; Yuriko Nishiie¹; Hiroko Okada²; Takahiro Kiuchi²

Journals

Liu C, Ho C, Wu T. Custom GPTs Enhancing Performance and Evidence Compared with GPT-3.5, GPT-4, and GPT-4o? A Study on the Emergency Medicine Specialist Examination. Healthcare 2024;12(17):1726 View
Semeraro F. AI-Powered clinical assessments: GPT-4o’s role in standardizing CPR skill evaluations. Resuscitation 2024:110411 View
Liu M, Okuhara T, Dai Z, Huang W, Gu L, Okada H, Furukawa E, Kiuchi T. Evaluating the Effectiveness of advanced large language models in medical Knowledge: A Comparative study using Japanese national medical examination. International Journal of Medical Informatics 2025;193:105673 View
Taniguchi M, Lindsey J. Performance of chatbots in queries concerning fundamental concepts in photochemistry. Photochemistry and Photobiology 2025;101(4):886 View
Yau J, Saadat S, Hsu E, Murphy L, Roh J, Suchard J, Tapia A, Wiechmann W, Langdorf M. Accuracy of Prospective Assessments of 4 Large Language Model Chatbot Responses to Patient Questions About Emergency Care: Experimental Comparative Study. Journal of Medical Internet Research 2024;26:e60291 View
Liu M, Okuhara T, Huang W, Ogihara A, Nagao H, Okada H, Kiuchi T. Large Language Models in Dental Licensing Examinations: Systematic Review and Meta-Analysis. International Dental Journal 2025;75(1):213 View
Chen Y, Huang X, Yang F, Lin H, Lin H, Zheng Z, Liang Q, Zhang J, Li X. Performance of ChatGPT and Bard on the medical licensing examinations varies across different cultures: a comparison study. BMC Medical Education 2024;24(1) View
Bongco E, Cua S, Hernandez M, Pascual J, Khu K. The performance of ChatGPT versus neurosurgery residents in neurosurgical board examination-like questions: a systematic review and meta-analysis. Neurosurgical Review 2024;47(1) View
Zong H, Wu R, Cha J, Wang J, Wu E, Li J, Zhou Y, Zhang C, Feng W, Shen B. Large Language Models in Worldwide Medical Exams: Platform Development and Comprehensive Analysis. Journal of Medical Internet Research 2024;26:e66114 View
Ferraz-Costa G, Griné M, Oliveira-Santos M, Teixeira R. Performance of ChatGPT in the Portuguese National Residency Access Examination. Acta Médica Portuguesa 2024;38(3):170 View
Sabaner M, Anguita R, Antaki F, Balas M, Boberg-Ans L, Ferro Desideri L, Grauslund J, Hansen M, Klefter O, Potapenko I, Rasmussen M, Subhi Y. Opportunities and Challenges of Chatbots in Ophthalmology: A Narrative Review. Journal of Personalized Medicine 2024;14(12):1165 View
Camlet A, Kusiak A, Świetlik D. Application of Conversational AI Models in Decision Making for Clinical Periodontology: Analysis and Predictive Modeling. AI 2025;6(1):3 View
Yang H, Hu M, Most A, Hawkins W, Murray B, Smith S, Li S, Sikora A. Evaluating accuracy and reproducibility of large language model performance on critical care assessments in pharmacy education. Frontiers in Artificial Intelligence 2025;7 View
Qiu Y, Liu C. Capable exam-taker and question-generator: the dual role of generative AI in medical education assessment. Global Medical Education 2025;2(1):135 View
Erdat E, Kavak E. Benchmarking LLM chatbots’ oncological knowledge with the Turkish Society of Medical Oncology’s annual board examination questions. BMC Cancer 2025;25(1) View
Chu H, Pasion E, Yeh S, Chu G. Assessing the Ethical and Professional Capabilities of AI: A Study of ChatGPT and Google Gemini versus PREview (Situational Judgement Test) for Medical Student Applicant. Journal of Clinical Question 2024;1(3) View
Meyer A, Wetsch W, Steinbicker A, Streichert T. Through ChatGPT’s Eyes: The Large Language Model’s Stereotypes and what They Reveal About Healthcare. Journal of Medical Systems 2025;49(1) View
Waaler P, Hussain M, Molchanov I, Bongo L, Elvevåg B. Prompt Engineering an Informational Chatbot for Education on Mental Health Using a Multiagent Approach for Enhanced Compliance With Prompt Instructions: Algorithm Development and Validation. JMIR AI 2025;4:e69820 View
Zhu J, Jiang Y, Chen D, Lu Y, Huang Y, Lin Y, Fan P. High identification and positive‐negative discrimination but limited detailed grading accuracy of ChatGPT‐4o in knee osteoarthritis radiographs. Knee Surgery, Sports Traumatology, Arthroscopy 2025;33(5):1911 View
Tseng L, Lu Y, Tseng L, Chen Y, Chen H. Performance of ChatGPT-4 on Taiwanese Traditional Chinese Medicine Licensing Examinations: Cross-Sectional Study. JMIR Medical Education 2025;11:e58897 View
Wang J, Shue K, Liu L, Hu G. Preliminary evaluation of ChatGPT model iterations in emergency department diagnostics. Scientific Reports 2025;15(1) View
Kopka M, von Kalckreuth N, Feufel M. Accuracy of online symptom assessment applications, large language models, and laypeople for self–triage decisions. npj Digital Medicine 2025;8(1) View
Kim K. Technology-enhanced learning in medical education in the age of artificial intelligence. Forum for Education Studies 2025;3(2):2730 View
Rodrigues Alessi M, Gomes H, Oliveira G, Lopes de Castro M, Grenteski F, Miyashiro L, do Valle C, Tozzini Tavares da Silva L, Okamoto C. Comparative Performance of Medical Students, ChatGPT-3.5 and ChatGPT-4.0 in Answering Questions From a Brazilian National Medical Exam: Cross-Sectional Questionnaire Study. JMIR AI 2025;4:e66552 View
Al Barajraji M, Barrit S, Ben-Hamouda N, Harel E, Torcida N, Pizzarotti B, Massager N, Lechien J. AI-Driven Information for Relatives of Patients with Malignant Middle Cerebral Artery Infarction: A Preliminary Validation Study Using GPT-4o. Brain Sciences 2025;15(4):391 View
Bolgova O, Shypilova I, Mavrych V. Large Language Models in Biochemistry Education: Comparative Evaluation of Performance. JMIR Medical Education 2025;11:e67244 View
Krumsvik R. GPT-4’s capabilities for formative and summative assessments in Norwegian medicine exams—an intrinsic case study in the early phase of intervention. Frontiers in Medicine 2025;12 View
Luo D, Liu M, Yu R, Liu Y, Jiang W, Fan Q, Kuang N, Gao Q, Yin T, Zheng Z. Evaluating the performance of GPT-3.5, GPT-4, and GPT-4o in the Chinese National Medical Licensing Examination. Scientific Reports 2025;15(1) View
Yang X, Xiao Y, Liu D, Deng H, Huang J, Zhou Y, Dai C, Wu J, Liu D, Liang M, Xu C. Cross language transformation of free text into structured lobectomy surgical records from a multi center study. Scientific Reports 2025;15(1) View
Hanss K, Sarma K, Glowinski A, Krystal A, Saunders R, Halls A, Gorrell S, Reilly E. Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study. Journal of Medical Internet Research 2025;27:e69910 View
Fushimi A, Terada M, Tahara R, Nakazawa Y, Iwase M, Shibayama T, Kotti S, Yamashita N, Iesato A. Assessing the quality of Japanese online breast cancer treatment information using large language models: a comparison of ChatGPT, Claude, and expert evaluations. Breast Cancer 2025;32(5):960 View
He F, Yang M, Liu J, Gong T, Ma J, Yang T, Zhao D, Li S, Tian D. Quality and reliability of pediatric pneumonia related short videos on mainstream platforms: cross-sectional study. BMC Public Health 2025;25(1) View
Huang S, Wen C, Bai X, Li S, Wang S, Wang X, Yang D. Exploring the Application Capability of ChatGPT as an Instructor in Skills Education for Dental Medical Students: Randomized Controlled Trial. Journal of Medical Internet Research 2025;27:e68538 View
Kuribara T, Hirayama K, Hirata K. Performance evaluation of large language models for the national nursing examination in Japan. DIGITAL HEALTH 2025;11 View
Fallah H, Biazar E, Rezaei M. Artificial Intelligence in Dental Education. The Journal of the American Dental Association 2025;156(6):434 View
Tan Y, Nah S, Saw S, Rajandram R, Ong T. Evaluating the performance of artificial intelligence chatbots in answering urology questions derived from guidelines or board examinations: A systematic review. Urological Science 2026;37(2):100 View
Kim M, Hwang G, Chang J, Chang S, Roh H, Park R. Performance of Open-Source Large Language Models in Psychiatry: Usability Study Through Comparative Analysis of Non-English Records and English Translations. Journal of Medical Internet Research 2025;27:e69857 View
Mert S, Muir L, Fuchs B, Lucksch V, Vollbach F, Haas-Lützenberger E, Giunta R, Thierfelder N, Demmer W. Can artificial intelligence pass the written European Board of Hand Surgery exam?. Hand Surgery and Rehabilitation 2025;44(4):102197 View
Çolakoğlu Y, Ayten A, Sertkaya Ç, Toksal K, Karadağ S. Evaluation of Chat Generative Pretrained Transformer (ChatGPT) Performance in Answering Kidney Transplant Related Questions. The New Journal of Urology 2025;20(1):21 View
Bruneti Severino J, Nespolo Berger M, Basei de Paula P, Loures F, Todeschini S, Roeder E, Han Veiga M, Knopfholz J, Lenci Marques G. Performance Benchmarking of Open-Source Large Language Models on the Brazilian Society of Cardiology's Certification Exam. International Journal of Cardiovascular Sciences 2025;38 View
Alkalbani A, Alrawahi A, Salah A, Haghighi V, Zhang Y, Alkindi S, Sheng Q. A Systematic Review of Large Language Models in Medical Specialties: Applications, Challenges and Future Directions. Information 2025;16(6):489 View
Kaneyasu Y, Mine Y, Niitani Y, Taji T, Takeda S, Tokinaga R, Shigeishi H, Takemoto T, Kakimoto N, Murayama T, Ohta K. Analysis of multimodal large language models on visually-based questions in the Japanese National Examination for Dental Hygienists: A preliminary comparative study. Journal of Dental Sciences 2026;21(1):198 View
Ramos-Soto O, Aranguren I, Carrillo M M, Oliva D, Balderas-Mata S. Artificial intelligence in medical imaging diagnosis: are we ready for its clinical implementation?. Journal of Medical Imaging 2025;12(06) View
Alharbi L, Alrashoud R, Alotaibi B, Al Dera A, Alajlan R, AlHuthail R, Alessa D. Using Artificial Intelligence ChatGPT to Access Medical Information About Chemical Eye Injuries: Comparative Study. JMIR Formative Research 2025;9:e73642 View
Yao Z, Duan L, Xu S, Chi L, Sheng D. Performance of Large Language Models in the Non-English Context: Qualitative Study of Models Trained on Different Languages in Chinese Medical Examinations. JMIR Medical Informatics 2025;13:e69485 View
Ahmed Y, Ibrahim H, Khayal S. Evaluating advanced artificial intelligence in oncology education and clinical knowledge assessment. International Journal of Research in Medical Sciences 2025;13(7):2761 View
Bessa R, de Oliveira A, Bessa R, Sousa D, Alves R, Barbosa A, Carneiro A, Soares C, Teles A. Performance Comparison of Large Language Models on Brazil’s Medical Revalidation Exam for Foreign-Trained Graduates. Applied Sciences 2025;15(13):7134 View
Kim B, Shin W. Performance Evaluation of ChatGPT-4o on Korean Physical Therapist Licensing Examination. Physical Therapy Rehabilitation Science 2025;14(2):157 View
Duarte A, Siopa C, Chaves I. Inteligência Artificial na Prova Nacional de Acesso em Portugal: O Olhar da Psiquiatria. Acta Médica Portuguesa 2025;38(8):518 View
Mavrych V, Yousef E, Yaqinuddin A, Bolgova O. Large language models in medical education: a comparative cross-platform evaluation in answering histological questions. Medical Education Online 2025;30(1) View
Wei J, Wang X, Huang M, Xu Y, Yang W. Evaluating the Performance of ChatGPT on Board-Style Examination Questions in Ophthalmology: A Meta-Analysis. Journal of Medical Systems 2025;49(1) View
Liu Z, Zuo H, Lu Y. The Impact of ChatGPT on Students' Academic Achievement: A Meta‐Analysis. Journal of Computer Assisted Learning 2025;41(4) View
Barrit S, Ranuzzi G, Fetzer S, Al Barajraji M, Hadwe S, Zanello M, Ortler M, O’Flaherty J, Massager N, Madsen J, Dibué M, Carron R. Specialized AI and neurosurgeons in niche expertise: a proof-of-concept in neuromodulation with vagus nerve stimulation. Acta Neurochirurgica 2025;167(1) View
Hu H, Wallace D, Boateng B. Medical Education Learning Specialists in the Age of Artificial Intelligence. Cureus 2025 View
Andrew A. A Meta-Analysis of ChatGPT’s Performance on Dermatology Specialty-Level (Board-Style) Certification Questions. Indian Dermatology Online Journal 2025;16(6):939 View
Aptyka H, Großschedl J, Hartelt T. Bugbear or surefire success? Secondary school students’ conceptual learning about evolution with ChatGPT. International Journal of Science Education 2025:1 View
Krumsvik R, Johansen M, Slettvoll V. Artificial intelligence, health empowerment, and the general practitioner scheme. DIGITAL HEALTH 2025;11 View
AlSamhori J, Alkafaween A, Al-Badawi A, Alhabashneh Z, Alelaumi A, Haddad B, Nashwan A. The role of ChatGPT in improving orthopedic Patient education in low-resource settings across various orthopedic specialties. The Journal of Precision Medicine: Health and Disease 2025;3:100017 View
Özer N, Balcı Y, Bölükbaşı G, İlhan B, Güneri P. Examining the Role of Artificial Intelligence in Assessment: A Comparative Study of ChatGPT and Educator‐Generated Multiple‐Choice Questions in a Dental Exam. European Journal of Dental Education 2026;30(3):881 View
Sommer M, Arendasy M. Automatic- and Transformer-Based Automatic Item Generation: A Critical Review. Journal of Intelligence 2025;13(8):102 View
Richlitzki C, Mansoorian S, Käsmann L, Stoleriu M, Kovacs J, Sienel W, Kauffmann-Guerrero D, Duell T, Schmidt-Hegemann N, Belka C, Corradini S, Eze C. Assessing ChatGPT’s Educational Potential in Lung Cancer Radiotherapy From Clinician and Patient Perspectives: Content Quality and Readability Analysis. JMIR Cancer 2025;11:e69783 View
Krumsvik R. How capable is GPT-4 at answering exams and tests in Norwegian, and what implications could this have for education?. Nordic Journal of Digital Literacy 2025;20(2):113 View
Atahan M, Üner Ç, Aydemir M, Uzun M, Yalın M, Gölgelioğlu F. Is Readability a Proxy for Reliability? A Qualitative Evaluation of ChatGPT‐4.0 in Orthopaedic Trauma Communication. Journal of Evaluation in Clinical Practice 2025;31(5) View
Choi S, Moon Y, Jung H. ChatGPT and human dietitian responses to diet-related questions on an online Q&A platform: A comparative study. DIGITAL HEALTH 2025;11 View
Othman A, Sharqawi A, MohammedAziz A, Ali W, Alatiyyah A, Mirah M. Assessing the Accuracy and Completeness of AI-Generated Dental Responses: An Evaluation of the Chat-GPT Model. Healthcare 2025;13(17):2144 View
Tian L, Lu Y, Fei X, Lu J. Intelligent Head and Neck CTA Report Quality Detection with Large Language Models. Journal of Imaging Informatics in Medicine 2025;39(3):2727 View
Krumsvik R, Slettvoll V. Artificial intelligence and health empowerment in rural communities and landslide- or avalanche-isolated contexts: real case at a fictitious location. Frontiers in Digital Health 2025;7 View
Yang X, Chen W. The performance of ChatGPT on medical image-based assessments and implications for medical education. BMC Medical Education 2025;25(1) View
Zhang J, Sun Y, Rong Y, Li H, Jiang B, Zhao C, Liu H. Potential of AI Chatbots in Online Hair Transplantation Consultations: A Multi-metric Assessment of Three Models. Aesthetic Plastic Surgery 2025;49(21):6155 View
Jain N, Gottlich C, Fisher J, Winston T, Matullo K, Greenhill D. ChatGPT-4o is Not a Reliable Study Source for Orthopaedic Surgery Residents. JBJS Open Access 2025;10(3) View
Li Z, Xu R, Gong X, Wang C, Liu J. The top 100 most-cited articles on large language models in medicine: A bibliometric analysis. DIGITAL HEALTH 2025;11 View
García-Rudolph A, Hernández-Pena E, del Cacho N, Teixido-Font C, Navarro-Berenguel M, Opisso E. Evaluation of the clinical reasoning of GPT-4o, a multimodal generative artificial intelligence model, in 18 public gastroenterology case studies. Revista Española de Enfermedades Digestivas 2025 View
Jaleel A, Aziz U, Farid G, Zahid Bashir M, Mirza T, Khizar Abbas S, Aslam S, Sikander R. Evaluating the Potential and Accuracy of ChatGPT-3.5 and 4.0 in Medical Licensing and In-Training Examinations: Systematic Review and Meta-Analysis. JMIR Medical Education 2025;11:e68070 View
Ahmed H, Suhas D, D’Souza L, Jayaram P, Gupta A, Sache M. Evaluating the performance of five large language models in generating patient educational content for pediatric cardiothoracic procedures: a comparative study. General Thoracic and Cardiovascular Surgery 2026;74(3):227 View
Rai M, Ngaw M, Nannas N. Artificial Intelligence Performance in Introductory Biology: Passing Grades but Poor Performance at High Cognitive Complexity. Education Sciences 2025;15(10):1400 View
Lin Y, Luo Z, Ye Z, Zhong N, Zhao L, Zhang L, Li X, Chen Z, Chen Y. Applications, Challenges, and Prospects of Generative Artificial Intelligence Empowering Medical Education: Scoping Review. JMIR Medical Education 2025;11:e71125 View
Liu M, Okuhara T, Shirabe R, Nishiie Y, Xu Y, Okada H, Kiuchi T. Evaluating the Reliability and Accuracy of an AI-Powered Search Engine in Providing Responses on Dietary Supplements: Quantitative and Qualitative Evaluation. JMIR AI 2025;4:e78436 View
Idan D, Ben-Shitrit I, Volevich M, Binyamin Y, Nassar R, Nassar M, Abelson N, Zlotnik A, Einav S. Evaluating the performance of large language models versus human researchers on real world complex medical queries. Scientific Reports 2025;15(1) View
Boczkowski D, Dolata T, Radej D, Sawina P, Suleiman R, Latkowska A, Kowalczyk A, Loson-Kawalec M, Jaworski W, Wielochowska A, Olender M, Latkowska A, Dadynska P, Majchrowicz W, Stachowicz A. Assessment of the Efficacy of the Google Gemini 2.5 Pro Model in Solving the Polish State Specialization Exam in Pediatric Surgery. Cureus 2025 View
Dejean-Bouyer E, Kanlagna A, Thuau F, Perrot P, Lancien U. Performance of ChatGPT-4 on the French Board of Plastic Reconstructive and Aesthetic Surgery written exam: a descriptive study. Journal of Educational Evaluation for Health Professions 2025;22:27 View
Bolgova O, Mavrych V. Evolution of AI in anatomy education study based on comparison of current large language models against historical ChatGPT performance. Scientific Reports 2025;15(1) View
Tzanis E, Adams L, Akinci D’Antonoli T, Bressem K, Cuocolo R, Kocak B, Malamateniou C, Klontzas M. Agentic systems in radiology: Principles, opportunities, privacy risks, regulation, and sustainability concerns. Diagnostic and Interventional Imaging 2026;107(1):7 View
Angulo C, Martín-Noguerol T, Paulano-Godino F, De Caso García L, Luna A. Performance Comparison Between Two Versions of a Commercial Artificial Intelligence System for Chest Radiograph Interpretation: A Multicenter Study. Journal of Imaging Informatics in Medicine 2025 View
Sridharan K, Sivaramakrishnan G. Large language models as educational collaborators: developing non-conventional teaching aids in pharmacology & therapeutics. BMC Medical Education 2025;25(1) View
Cammaroto G, Mira F, Favier V, Nunes H, de Castro J, Carsuzaa F, Lechien J, Chiesa Estomba C, Iannella G, Vaira L, Calvo-Henriquez C, Cheong R, de Apodaca P, Lentini M, Barillari M, Maniaci A. Experts V/S AI´s 2.0: Comparative evaluation of AI models and expert consensus in obstructive sleep apnea assessment. European Archives of Oto-Rhino-Laryngology 2026;283(1):509 View
Salbas A, Yogurtcu M. Performance of Large Language Models on Radiology Residency In-Training Examination Questions. Academic Radiology 2026;33(2):337 View
Tarhan M, Sahin Ozdemir M. Comparison of the accuracy and reliability of ChatGPT-4o and Gemini in answering HIV-related questions. BMC Infectious Diseases 2025;25(1) View
Inojosa H, Ramezanzadeh A, Gasparovic-Curtini I, Wiest I, Kather J, Gilbert S, Ziemssen T. Education Research: Can Large Language Models Match MS Specialist Training?. Neurology Education 2025;4(4) View
Zhu S, Xie Y, Tang Y, Yu Z, Zhao R, Dong X. New chapter in pediatric medicine: technological evolution, application, and evaluation system of large language models. European Journal of Pediatrics 2025;184(12) View
Meretukov D, Grechukhina K, Evdokimov V, Didych D, Kondratieva S, Rakitina O, Gordeev A, Shilo P, Khatkov I, Zhukova L. Deriving Real-World Evidence from Non-English Electronic Medical Records in Hormone Receptor-Positive Breast Cancer Using Large Language Models. Cancers 2025;17(23):3836 View
Ros-Arlanzón P, Gutarra-Ávila R, Arrarte-Esteban V, Bertomeu-González V, Hernández-Blasco L, Masiá M, Navarro-Canto L, Nieto-Navarro J, Abarca J, Sempere A. When AI models take the exam: large language models vs medical students on multiple-choice course exams. Medical Education Online 2025;30(1) View
Lipinski M, Kareemi H, Elder J, Thoma B. Has generative AI made our medical exams obsolete?. Canadian Journal of Emergency Medicine 2025;27(12):944 View
Meyer A, Schömig E, Streichert T. ChatGPT and reference intervals: a comparative analysis of repeatability in GPT-3.5 Turbo, GPT-4, and GPT-4o. Frontiers in Artificial Intelligence 2025;8 View
Meyer A, Karay Y, Steinbicker A, Streichert T, Overbeek R. Performance of DeepSeek-R1, ChatGPT (GPT-o3-mini), and Gemini 2.0 Flash on German Medical Multiple-Choice Questions: Comparative Evaluation. JMIR Formative Research 2025;9:e77357 View
Hidalgo Guevara J, Hidalgo Guevara A. Effect of Generative Artificial Intelligence Use on Diagnostic Learning in Medical Students: A Quasi-Experimental Study. Salud, Ciencia y Tecnología 2025;5:1564 View
Brochu B, Cobler-Lichter M, Arcieri T, Shah N, Delamater J, Reyes A, Sussman M, Lineen E, Sands L, Hui V, Rodgers S, Thorson C. Potential and pitfalls: accuracy versus adequacy of ChatGPT’s performance on surgery shelf examination. Global Surgical Education - Journal of the Association for Surgical Education 2025;5(1) View
Kaleci A, Şahinbaş B, Ağadayı E, Çelikkaya S, Altun A, Kardan E. Performance of Large Language Models in Medical Exams: A Comparison Between ChatGPT and Medical Students. Tıp Eğitimi Dünyası 2025;24(74):135 View
Kaganovski A, Kasi A, Grinberg A, Kozlov M, Patel R, Kwon M, Lai C, Dersu I. Evaluating patient-facing eye disease information: ChatGPT-5 vs. Pfizer health answers. AJO International 2026;3(1):100215 View
Saita K, Mine Y, Amano S. What the performance of multimodal LLMs on a national licensing exam teaches us about occupational therapy education. BMC Medical Education 2026;26(1) View
Alqahtani H. Assessment of artificial intelligence chatbots in responding to dental occlusion questions: a comparative study. BMC Oral Health 2025;26(1) View
de Boer H, Young G, Bouwer H, Heath K. ChatGPT’s performance on a specialist forensic pathology examination: implications for forensic pathologists and non-specialists. International Journal of Legal Medicine 2026;140(3):1717 View
Rios-Garcia W, Silva-Jiménez S, Gálvez-Rodríguez E, Alberca-Naira Y, Via-y-Rada-Torres A, Rios-Garcia A. Assessment of ChatGPT-5 as an Artificial Intelligence Tool for Exploring Emerging Dimensions of Clinical Simulation: A Proof-of-concept Study. Journal of Medical Systems 2026;50(1) View
Strasser L, Anschuetz W, Dennstädt F, Hastings J. Performance Evaluation of Large Language Models in Multilingual Medical Multiple-Choice Questions: Mixed Methods Study. JMIR Medical Education 2026;12:e81399 View
Zouakia Z, Logak E, Szymczak A, Jais J, Burgun A, Tsopra R. AI-Driven Objective Structured Clinical Examination Generation in Digital Health Education: Comparative Analysis of Three GPT-4o Configurations. JMIR Medical Education 2026;12:e82116 View
Albaloul O, Killian C. Evaluating ChatGPT accuracy in answering a basketball common content knowledge test in physical education. Education and Information Technologies 2026;31(6):1793 View
Sun L, Li Y, Kan H, Shu J, Xu H, Li C, Shi G, Wang Z, Wang X, Jin L. Open- and closed-source LLMs in medical and engineering education. Frontiers in Medicine 2026;12 View
Cassar P, Galea F, Ferry P. Evaluating the Clinical Decision-Making Accuracy of Artificial Intelligence in Common Geriatric Syndromes Using Evidence-Based Guidelines. Cureus 2026 View
Zhu J, Hao D, Yong R. Assessing the educational quality of YouTube videos on celiac plexus blocks: Expert review and AI-based evaluation. Interventional Pain Medicine 2026;5(1):100740 View
Wani T, Liem M, Prasad N, Robinson K, Nexhip A, Tassos M, Gjorgioski S, Khan U, Boyd J, Riley M. Susceptibility of Assessment Types to AI-Generated Content in Digital Health and Health Information Management Education: Quasi-Experimental Pilot Study. JMIR Medical Education 2026;12:e82988 View
Li A, Bi X, Chen S, Hu J, Shi Y. Exploring the potential of large language models in healthcare: a focus on cardiovascular disease analysis. Health Information Science and Systems 2026;14(1) View
Obeid J, Bobier C, Gillham A, Omelianchuk A, Hurst D. Artificial intelligence in medical ethics education: a descriptive study of eight models in multiple choice question generation. BMC Medical Education 2026;26(1) View
Chen M, Wu Y, Ma J, Jia X, Gao C, Zhao F, Qiao Y. Independent and collaborative performance of large language models and healthcare professionals in diagnosis and triage. npj Digital Medicine 2026;9(1) View
Liu M, Okuhara T, Dai Z, Zhao M, Yin W, Okada H, Furukawa E, Kiuchi T. Textbook-level medical knowledge in large language models: comparative evaluation using Japanese National Medical Examination. BMC Medical Informatics and Decision Making 2026;26(1) View
Koga S. Comment on ‘Limited performance of ChatGPT-4v and ChatGPT-4o in image-based core radiology cases’. Clinical Imaging 2026;132:110746 View
Wang Q, Zou H, Zhang H, Huang Y, Tian J, Cheng W. A Survey on Medical Competence Evaluation Benchmarks for Large Language Models. Health Care Science 2026;5(1):4 View
Eskandar K. Assessing ChatGPT Accuracy Across Versions for Patient and Guideline Queries in Sacral Neuromodulation. Société Internationale d’Urologie Journal 2026;7(1):11 View
Stelling H, Kraus A, Grieb G, Güler I. Performance of Large Language Models on Cognitive Aptitude Testing: A Multi-Run Evaluation on the German Medical School Admission Test (TMS). European Journal of Investigation in Health, Psychology and Education 2026;16(2):23 View
Piussi R, Schneiderman J, Yu Y, Samuelsson K, Hamrin Senorski E. Human versus GPT-4 in qualitative analysis: A comparative reanalysis of patient interview data following anterior cruciate ligament injury rehabilitation. The Knee 2026;60:104388 View
Wang S, Chi X, Hao Q, Wang H, Tao H, Xiao J, Wu C, Deng J, Xu H, Sun R. Large language models in Chinese anesthesiology residency examinations: a comparative analysis of performance, reliability and clinical reasoning. BMC Medical Education 2026;26(1) View
Liu M, Okuhara T, Shirabe R, Nishiie Y, Chang X, Okada H, Kiuchi T. Validity and reliability of ChatGPT's responses on dietary supplements in Japan: A quality assessment and content analysis. PEC Innovation 2026;8:100461 View
Khanal A, Chataut S, Neupane A, Raut S, Ghimire U, Rana S, Bhatta J. A Study on the Performance of SOTA LLMs on Nepalese IOE Entrance Examination. European Journal of Applied Science, Engineering and Technology 2026;4(2):1 View
Koç A, Ataş A, Yosunkaya Ş, Vatansev H. Performance of large language models on sleep medicine certification examination: a comprehensive multi-model analysis. Frontiers in Medicine 2026;13 View
RAMADAN S, CALVIERI C, GRACIA-RAMOS A, ROSELLI FERRARI G, PEPE M, BIONDI-ZOCCAI G. An umbrella review encompassing 42 systematic reviews on medical applications of ChatGPT. Minerva Medica 2026;117(1) View
Aswin K, Arun R, Davood U. Rethinking prompts, reimagining conclusions: Key elements for future large language model-based transfusion medicine education studies. Transfusion Clinique et Biologique 2026;33(2):142 View
Gustafsson J, Lehtonen-Smeds E, Pakkasjärvi N. Integrating Large Language Models Into Trauma Education for Medical Students: Randomized Controlled Pilot Trial. JMIR Medical Education 2026;12:e79134 View
Zhang J, Huang L, Zhu X, Sun Y, Guo Y, Rong Y, Liu H. Accuracy of Generative AI Chatbots in Answering Plastic Surgery Examination Questions: A Comparative Evaluation of ChatGPT‐4o, Gemini Advanced, and DeepSeek‐R1. Journal of Evidence-Based Medicine 2026;19(1) View
Károlyi M, Wilzeck V, Tramèr L, Groenhoff L, von Spiczak J, Bigvava T, Alkadhi H, Manka R. Etiologic classification of suspected MINOCA using cardiovascular magnetic resonance reports: a comparison of a large language model and human readers. The International Journal of Cardiovascular Imaging 2026;42(7):1395 View
Chen X, Zhou H, Yi H, You M, Liu W, Wang L, Qin Z, Li H, Zhang X, Guo Y, Li S, Hu Y, Xiong Q, Li R, Fan L, Lao Q, Fu W, Li J, Li K. Grounding large language models in clinical diagnostics. Nature Communications 2026;17(1) View
Wang L, Jiang Y. Large Language Model–Powered Diagnostic Co-Pilot (“CapyEngine”) for Mental Disorders: Development, Evaluation, and Future Optimization Study. JMIR AI 2026;5:e70017 View
Truyts C, Rabelo A, Souza G, Lages D, Pereira A, Flato U, Reis E, Vieira J, Silveira P, Junior E. Zero-shot performance of selected large language and multimodal models on the 2023 Brazilian Portuguese medical residency exam. Scientific Reports 2026;16(1) View
Kim S, Ji M, Kim C, Yun Y, Lee G, Yoon C, Kim M. Comparative performance of large language models in answering cornea and cataract surgery questions for resident training. BMC Ophthalmology 2026;26(1) View
Bérar A, Allain J, Bouvet R. Could ChatGPT and co. replace forensic experts? A comparative study on medical liability expertise. International Journal of Legal Medicine 2026;140(4):2533 View
Lai N, Lim Y, Win M, Bhargava P, Thomas P, Ong Q. The Effectiveness of Artificial Intelligence in Undergraduate Health Professions Education: Systematic Review and Meta-Analysis of Randomized Controlled Trials. JMIR Medical Education 2026;12:e88933 View
Hao J, Zhang C. Benchmarking LLM decision support in inflammatory aneurysms: DeepSeek-R1, DeepSeek-V3 and ChatGPT-4o. Clinical and Experimental Medicine 2026;26(1) View
Hodzic S, Stevic A, Matthes J. Generative AI in practice: An umbrella review of risks, benefits, ethics, and future directions across major domains. Technology in Society 2026;87:103331 View
Parker M, Zavala-Cerna M. What don’t you understand? Using large language models to identify and characterize student misconceptions about challenging topics. Education and Information Technologies 2026 View
Karakoyun Z, Yörük M, Özdemir M, Koşar M. Evaluating the clinical decision-making performance of large language models in clinically oriented thoracic anatomy scenarios: a comparative evaluation study. BMC Medical Education 2026;26(1) View
Liu S, Liu J. Large Language Model–Based Analysis of Statin Therapy Discussions and Sentiment on Social Media: Cross-Sectional Observational Study. Journal of Medical Internet Research 2026;28:e85057 View
Abdou A, Mistry N, Campbell D, Krishnan S. A framework for generative AI-driven extraction of clinical user needs in pediatric device development. Frontiers in Digital Health 2026;8 View
Ren K, Weng Q, Chen Q, Li H, Xie D, Zeng C, Wei J, Lei G, Wang Y. The application of large language models in orthopedic postgraduate education: potentials, challenges, and future prospects. Journal of Orthopaedic Surgery and Research 2026;21(1) View
Du C, Pan Y, Ng C, Ding Y, Pan J, Xue W, Yao X, Huang J. Three large language models demonstrate competitive performance in Traditional Chinese Medicine national medical licensing examinations over two years. Scientific Reports 2026;16(1) View
Zong H, Cha J, Wang J, Song Y, Zhao Y, Shi M, Shen B. A Dataset for Evaluating Large Language Models on Chinese National Medical Licensing Examinations. Scientific Data 2026;13(1) View
Güler I, Muir L, Grieb G, Moog P, Kraus A, Stelling H. Performance and reliability of large language models on the European Board of Hand Surgery examination: a multi-model evaluation study. Journal of Hand Surgery (European Volume) 2026 View
Soubh N, Rasenack E, Haarmann H, Wiedmann F, Zabel M, Schmidt C, Suliman R, Bergau L. Performance of Vision-Enabled Large Language Models in Image-Based Electrocardiogram Interpretation: Exploratory Evaluation. Journal of Medical Internet Research 2026;28:e86692 View
Doubleday A, Cheverko C, Bolgova O, Mavrych V, Mohamed F, Westrick J, Juarez L, Rush E, Solka K, Byram J, Becker R, Gomez V, Ganeng B, Hoffman L, Roach V, Brown K, DeVaul N, Garnett C, Herriott H, Lufler R, Mussell J, Balta J, Pascoe M, Middleton J, Duffy S, Stephens G, Wilson A. Temporal trends in large language model (LLM) accuracy: A meta-analysis of multiple-choice question performance in dentistry and dental education. Journal of Dentistry 2026;171:106724 View
Wegerif R. Dialogic Intelligence: Rethinking What Education Is for in the Age of AI / Inteligencia dialógica: repensando el propósito de la educación en la era de la IA. Journal for the Study of Education and Development: Infancia y Aprendizaje 2026 View
Klimov A, Karelin A, Liapustin S, Rudnitsky S, Tolstova M, Shamonin A, Subbotin V. Using large language models for solving tests in anesthesiology and intensive care: a comparative study. Annals of Critical Care 2026;(2):176 View
Sarantopoulos A, Pana Z, Larentzakis A, Kondylis S, Maina A, Ziogas N, Ntourakis D. ChatGPT-4.0 and Medical Students: A Recognition-Gated Comparative Evaluation on Image-Based Medical Examinations. Journal of Medical Education and Curricular Development 2026;13 View
Vistari L, Yuliasri I, - Y, - A, Lumbantoruan M. Exploring Digital Transformation Readiness: Unveiling Hidden Patterns Through K-Means Clustering. Mimbar Ilmu 2025;30(3):656 View
Lu M, Cheng J, Gopalan V. Performance of multimodal large language models on image‐based surgical anatomy, anatomical pathology, and radiology questions. Anatomical Sciences Education 2026 View
Bolgova O, Mavrych V, Almidani E, Alshareef T, Kemahlı S. A Comparative Analysis of AI-Language Models’ MCQ Performance versus Medical Students Across Different Pediatric Topics. Advances in Medical Education and Practice 2026;Volume 17:1 View
Amankwaa I, Odoom A, Kasim A, Kobiah E, Diebieri M, Boateng E, Gyamfi S, Hales C. Performance of large language models on nursing licensure examinations: A systematic review and meta-analysis. Nurse Education Today 2026;165:107154 View
Dashti M, Khosraviani F, Meyari A, Amirzade-Iranaq M, Chaurasia A, Hefzi D, Ghadimi N, Tichy A, Khurshid Z, Schwendicke F. Accuracy of Large Language Models in Answering Dental Examination Questions: A Systematic Review and Meta-Analysis. International Dental Journal 2026;76(4):109609 View
Cheverko C, Mavrych V, Bolgova O, Mohamed F, Westrick J, Juarez L, Rush E, Solka K, Doubleday A, Byram J, Becker R, Gomez V, Ganeng B, Hoffman L, Roach V, Brown K, DeVaul N, Garnett C, Herriott H, Lufler R, Mussell J, Balta J, Pascoe M, Middleton J, Duffy S, Stephens G, Wilson A. The performance of ChatGPT and other large language models on multiple‐choice questions in biomedical disciplines: A meta‐analysis. Anatomical Sciences Education 2026 View
Erdağ M, Çalışkan T, Dal A, Canleblebici M, Balbaba M, Yıldırım H. Evaluation of large language models and ophthalmology trainees on ophthalmology questions from the Turkish medical specialty examination: A cross sectional study. Anadolu Kliniği Tıp Bilimleri Dergisi 2026;31(2):246 View
Schönberg N, Deschler D, Hauer J, Zeumer M. A Comparative Evaluation of Large Language Models on Pediatric Board-Style Examinations. Hospital Pediatrics 2026;16(6):e417 View
Chung J, Lin R, Dunn E, Kim G, Choi C, Mo K, Fang W, Lee D. The Accuracy of ChatGPT in Classifying Lumbar Spondylolisthesis and Compression Fractures. Journal of the American Osteopathic Academy of Orthopedics 2026;X(1) View
Stephenson E, Robinson S, Bascombe K, Okorie M. Secure AI-assisted angoff standard-setting for single best answer questions: A non-inferiority validation study. Medical Teacher 2026:1 View
Carrillo-Larco R. PeruMedQA: A Stress Evaluation Using Ten Large Language Models to Answer Medical Exams. Medical Science Educator 2026;36(3):1091 View
Albaloul O, Alajmi A, Oster N, Killian C. Systematic review of qualitative studies exploring K-12 teachers’ perceptions and experiences of using ChatGPT. Discover Artificial Intelligence 2026;6(1) View
Niu Z, Tang D, Chen J, Zhang P, Zhu C. Performance of deepseek-R1 and ChatGPT-5.4 thinking in the medical laboratory professional title examination: accuracy, stability, and comparison with interns. Frontiers in Digital Health 2026;8 View
Joachim M, Rushinek H, Laviv A. Evaluating the performance of general vs retrieval-augmented generation large language models on oral and maxillofacial surgery board examinations. International Journal of Oral and Maxillofacial Surgery 2026 View
Zeng Y, Hu X, Liu W, Deng K, Zhou M, Wang Y, Ma L, Liu Q, Meng H. Large language models as data-driven engines for benchmarking preventive and clinical knowledge in Chinese dental examinations. Frontiers in Oral Health 2026;7 View
Wang Z, Qin Y, Wu J. Performance stability despite iteration: evaluating DeepSeek and ChatGPT on Chinese medical licensing examinations. Frontiers in Medicine 2026;13 View
Kirchberger M. A 2-Tiered Rescue Protocol to Mitigate Difficulty-Based Failures of ChatGPT (GPT-5) and Gemini on the German M2 Medical Examination: Evaluation Study. JMIR Formative Research 2026;10:e86999 View
Genc O, Durgun S, Un B. Can ChatGPT graduate as a civil engineer? Exploring the role of prompt design in model behaviour. Engineering, Construction and Architectural Management 2026:1 View
Voloshyna O. The Use of Artificial Intelligence Technologies in Higher Medical Education: Benefits, Possible Risks and Ways to Improve. Lviv clinical bulletin 2026;(2 (54)):44 View
Wang B, Chen X, Yao S. Birds of a feather? How perceived similarity shapes responses to AI versus human health messages. Communication Research Reports 2026:1 View
Sheikhalishahi S, Rafiei F, Hosseini S, Haddadi A, Sadeghipour S. Benchmarking large language models on persian surgical subspecialty board examinations: a comparative study of ChatGPT-4o, ChatGPT-5, and Gemini 2.5 Flash. Scientific Reports 2026;16(1) View
Çiftçi M. New Horizons in Digital Health: The Role of ChatGPT in Knowledge Quality and Patient Education in AI-Assisted Menopause Counseling. Cukurova Anestezi ve Cerrahi Bilimler Dergisi 2026;9(2):273 View
Liu W, Huang X, Zhan M, Ye F, Yang Q, Luo H. Large language model‐driven Socratic questioning for endodontic case analysis: A randomised controlled trial. International Endodontic Journal 2026 View
Grünebaum A, Dudenhausen J, Chervenak F. Clinical artificial intelligence competence in obstetrics and gynecology: patient safety, physician accountability, and responsible use. American Journal of Obstetrics and Gynecology 2026 View
Zheng H, Zare Z, Li M, Pan Y, Ren S, Cui H, Li Y. Evaluation of ChatGPT-4o in oral and maxillofacial surgery examinations: a comparative study of performance on U.S. dental decks and chinese dental licensing examination practice questions. BMC Oral Health 2026;26(1) View
Ayhan B, Yoğurt S. Comparative performance of ChatGPT-5.2 and Gemini 3 Pro in orthopedics questions of the medical specialization examination. Anatolian Current Medical Journal 2026;8(4):747 View
Yılmaz H, Duman E, Şahin-Demirci K. Academics’ Experiences and Perceptions of ChatGPT in Nutrition and Dietetics Education: A Qualitative Study. The Journal of Nutrition 2026:101739 View
Benazzouz R, Benyagoub M, Boufatah Y, Sadeki F, Benazzouz M, Ould Setti M. Evaluation of AI support for medical training in resource-constrained settings: performance of GPT-5 Pro, Gemini 2.5 Pro, and DeepSeek V3 on real examination questions. Journal of Medical Education Development 2026;19(2):4 View
Therán León J, Díaz Cruz L, Díaz Cruz G. Benchmarking de modelos de lenguaje en cardiología: contaminación, calibración y reproducibilidad. REC: CardioClinics 2026 View

Books/Policy Documents

Xiao D, Gao C, Luo Z, Liu C, Shen S. Knowledge Science, Engineering and Management. View
Taranikanti V, Vuthaluru S. Mastering Problem-Based Learning in Health Profession Programs. View
Cox E. Artificial Intelligence in Healthcare and Biomedical Visualization. View
El Ghazi S, Charef N, Qarmiche N, Bourkhime H, Omari M, El Fakir S, Otmani N. Smart Medical, IoT & Artificial Intelligence. View

Conference Proceedings

Chen X, Xu L. 2024 5th International Conference on Information Science and Education (ICISE-IE). Effectiveness of ChatGPT in education: a meta-analysis View
Setälä M, Sikström P, Heilala V, Kärkkäinen T. 2025 International Conference on Education Technology and Computers (ICETC). Assessment of Evolving Large Language Models in Upper Secondary Mathematics View

Citation

Please cite as:

Liu M, Okuhara T, Chang X, Shirabe R, Nishiie Y, Okada H, Kiuchi T
Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis
J Med Internet Res 2024;26:e60807
doi: 10.2196/60807 PMID: 39052324 PMCID: 11310649

This paper is in the following e-collection/theme issue:

Generative Language Models Including ChatGPT; Digital Health Reviews; Natural Language Processing; Reviews in Medical Education; Chatbots and Conversational Agents
