Published in Vol 26 (2024)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/60807.
Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis

Journals

  1. Liu C, Ho C, Wu T. Custom GPTs Enhancing Performance and Evidence Compared with GPT-3.5, GPT-4, and GPT-4o? A Study on the Emergency Medicine Specialist Examination. Healthcare 2024;12(17):1726
  2. Semeraro F. AI-Powered clinical assessments: GPT-4o’s role in standardizing CPR skill evaluations. Resuscitation 2024:110411
  3. Liu M, Okuhara T, Dai Z, Huang W, Gu L, Okada H, Furukawa E, Kiuchi T. Evaluating the Effectiveness of advanced large language models in medical Knowledge: A Comparative study using Japanese national medical examination. International Journal of Medical Informatics 2025;193:105673
  4. Taniguchi M, Lindsey J. Performance of chatbots in queries concerning fundamental concepts in photochemistry. Photochemistry and Photobiology 2025;101(4):886
  5. Yau J, Saadat S, Hsu E, Murphy L, Roh J, Suchard J, Tapia A, Wiechmann W, Langdorf M. Accuracy of Prospective Assessments of 4 Large Language Model Chatbot Responses to Patient Questions About Emergency Care: Experimental Comparative Study. Journal of Medical Internet Research 2024;26:e60291
  6. Liu M, Okuhara T, Huang W, Ogihara A, Nagao H, Okada H, Kiuchi T. Large Language Models in Dental Licensing Examinations: Systematic Review and Meta-Analysis. International Dental Journal 2025;75(1):213
  7. Chen Y, Huang X, Yang F, Lin H, Lin H, Zheng Z, Liang Q, Zhang J, Li X. Performance of ChatGPT and Bard on the medical licensing examinations varies across different cultures: a comparison study. BMC Medical Education 2024;24(1)
  8. Bongco E, Cua S, Hernandez M, Pascual J, Khu K. The performance of ChatGPT versus neurosurgery residents in neurosurgical board examination-like questions: a systematic review and meta-analysis. Neurosurgical Review 2024;47(1)
  9. Zong H, Wu R, Cha J, Wang J, Wu E, Li J, Zhou Y, Zhang C, Feng W, Shen B. Large Language Models in Worldwide Medical Exams: Platform Development and Comprehensive Analysis. Journal of Medical Internet Research 2024;26:e66114
  10. Ferraz-Costa G, Griné M, Oliveira-Santos M, Teixeira R. Performance of ChatGPT in the Portuguese National Residency Access Examination. Acta Médica Portuguesa 2024;38(3):170
  11. Sabaner M, Anguita R, Antaki F, Balas M, Boberg-Ans L, Ferro Desideri L, Grauslund J, Hansen M, Klefter O, Potapenko I, Rasmussen M, Subhi Y. Opportunities and Challenges of Chatbots in Ophthalmology: A Narrative Review. Journal of Personalized Medicine 2024;14(12):1165
  12. Camlet A, Kusiak A, Świetlik D. Application of Conversational AI Models in Decision Making for Clinical Periodontology: Analysis and Predictive Modeling. AI 2025;6(1):3
  13. Yang H, Hu M, Most A, Hawkins W, Murray B, Smith S, Li S, Sikora A. Evaluating accuracy and reproducibility of large language model performance on critical care assessments in pharmacy education. Frontiers in Artificial Intelligence 2025;7
  14. Qiu Y, Liu C. Capable exam-taker and question-generator: the dual role of generative AI in medical education assessment. Global Medical Education 2025
  15. Erdat E, Kavak E. Benchmarking LLM chatbots’ oncological knowledge with the Turkish Society of Medical Oncology’s annual board examination questions. BMC Cancer 2025;25(1)
  16. Chu H, Pasion E, Yeh S, Chu G. Assessing the Ethical and Professional Capabilities of AI: A Study of ChatGPT and Google Gemini versus PREview (Situational Judgement Test) for Medical Student Applicant. Journal of Clinical Question 2024;1(3):82
  17. Meyer A, Wetsch W, Steinbicker A, Streichert T. Through ChatGPT’s Eyes: The Large Language Model’s Stereotypes and what They Reveal About Healthcare. Journal of Medical Systems 2025;49(1)
  18. Waaler P, Hussain M, Molchanov I, Bongo L, Elvevåg B. Prompt Engineering an Informational Chatbot for Education on Mental Health Using a Multiagent Approach for Enhanced Compliance With Prompt Instructions: Algorithm Development and Validation. JMIR AI 2025;4:e69820
  19. Zhu J, Jiang Y, Chen D, Lu Y, Huang Y, Lin Y, Fan P. High identification and positive‐negative discrimination but limited detailed grading accuracy of ChatGPT‐4o in knee osteoarthritis radiographs. Knee Surgery, Sports Traumatology, Arthroscopy 2025;33(5):1911
  20. Tseng L, Lu Y, Tseng L, Chen Y, Chen H. Performance of ChatGPT-4 on Taiwanese Traditional Chinese Medicine Licensing Examinations: Cross-Sectional Study. JMIR Medical Education 2025;11:e58897
  21. Wang J, Shue K, Liu L, Hu G. Preliminary evaluation of ChatGPT model iterations in emergency department diagnostics. Scientific Reports 2025;15(1)
  22. Kopka M, von Kalckreuth N, Feufel M. Accuracy of online symptom assessment applications, large language models, and laypeople for self–triage decisions. npj Digital Medicine 2025;8(1)
  23. Kim K. Technology-enhanced learning in medical education in the age of artificial intelligence. Forum for Education Studies 2025;3(2):2730
  24. Rodrigues Alessi M, Gomes H, Oliveira G, Lopes de Castro M, Grenteski F, Miyashiro L, do Valle C, Tozzini Tavares da Silva L, Okamoto C. Comparative Performance of Medical Students, ChatGPT-3.5 and ChatGPT-4.0 in Answering Questions From a Brazilian National Medical Exam: Cross-Sectional Questionnaire Study. JMIR AI 2025;4:e66552
  25. Al Barajraji M, Barrit S, Ben-Hamouda N, Harel E, Torcida N, Pizzarotti B, Massager N, Lechien J. AI-Driven Information for Relatives of Patients with Malignant Middle Cerebral Artery Infarction: A Preliminary Validation Study Using GPT-4o. Brain Sciences 2025;15(4):391
  26. Bolgova O, Shypilova I, Mavrych V. Large Language Models in Biochemistry Education: Comparative Evaluation of Performance. JMIR Medical Education 2025;11:e67244
  27. Krumsvik R. GPT-4’s capabilities for formative and summative assessments in Norwegian medicine exams—an intrinsic case study in the early phase of intervention. Frontiers in Medicine 2025;12
  28. Luo D, Liu M, Yu R, Liu Y, Jiang W, Fan Q, Kuang N, Gao Q, Yin T, Zheng Z. Evaluating the performance of GPT-3.5, GPT-4, and GPT-4o in the Chinese National Medical Licensing Examination. Scientific Reports 2025;15(1)
  29. Yang X, Xiao Y, Liu D, Deng H, Huang J, Zhou Y, Dai C, Wu J, Liu D, Liang M, Xu C. Cross language transformation of free text into structured lobectomy surgical records from a multi center study. Scientific Reports 2025;15(1)
  30. Hanss K, Sarma K, Glowinski A, Krystal A, Saunders R, Halls A, Gorrell S, Reilly E. Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study. Journal of Medical Internet Research 2025;27:e69910
  31. Fushimi A, Terada M, Tahara R, Nakazawa Y, Iwase M, Shibayama T, Kotti S, Yamashita N, Iesato A. Assessing the quality of Japanese online breast cancer treatment information using large language models: a comparison of ChatGPT, Claude, and expert evaluations. Breast Cancer 2025;32(5):960
  32. He F, Yang M, Liu J, Gong T, Ma J, Yang T, Zhao D, Li S, Tian D. Quality and reliability of pediatric pneumonia related short videos on mainstream platforms: cross-sectional study. BMC Public Health 2025;25(1)
  33. Huang S, Wen C, Bai X, Li S, Wang S, Wang X, Yang D. Exploring the Application Capability of ChatGPT as an Instructor in Skills Education for Dental Medical Students: Randomized Controlled Trial. Journal of Medical Internet Research 2025;27:e68538
  34. Kuribara T, Hirayama K, Hirata K. Performance evaluation of large language models for the national nursing examination in Japan. DIGITAL HEALTH 2025;11
  35. Fallah H, Biazar E, Rezaei M. Artificial Intelligence in Dental Education. The Journal of the American Dental Association 2025;156(6):434
  36. Tan Y, Nah S, Saw S, Rajandram R, Ong T. Evaluating the performance of artificial intelligence chatbots in answering urology questions derived from guidelines or board examinations: A systematic review. Urological Science 2025
  37. Kim M, Hwang G, Chang J, Chang S, Roh H, Park R. Performance of Open-Source Large Language Models in Psychiatry: Usability Study Through Comparative Analysis of Non-English Records and English Translations. Journal of Medical Internet Research 2025;27:e69857
  38. Mert S, Muir L, Fuchs B, Lucksch V, Vollbach F, Haas-Lützenberger E, Giunta R, Thierfelder N, Demmer W. Can artificial intelligence pass the written European Board of Hand Surgery exam?. Hand Surgery and Rehabilitation 2025;44(4):102197
  39. Çolakoğlu Y, Ayten A, Sertkaya Ç, Toksal K, Karadağ S. Evaluation of Chat Generative Pretrained Transformer (ChatGPT) Performance in Answering Kidney Transplant Related Questions. The New Journal of Urology 2025;20(1):21
  40. Bruneti Severino J, Nespolo Berger M, Basei de Paula P, Loures F, Todeschini S, Roeder E, Han Veiga M, Knopfholz J, Lenci Marques G. Performance Benchmarking of Open-Source Large Language Models on the Brazilian Society of Cardiology's Certification Exam. International Journal of Cardiovascular Sciences 2025;38
  41. Alkalbani A, Alrawahi A, Salah A, Haghighi V, Zhang Y, Alkindi S, Sheng Q. A Systematic Review of Large Language Models in Medical Specialties: Applications, Challenges and Future Directions. Information 2025;16(6):489
  42. Kaneyasu Y, Mine Y, Niitani Y, Taji T, Takeda S, Tokinaga R, Shigeishi H, Takemoto T, Kakimoto N, Murayama T, Ohta K. Analysis of multimodal large language models on visually-based questions in the Japanese National Examination for Dental Hygienists: A preliminary comparative study. Journal of Dental Sciences 2025
  43. Ramos-Soto O, Aranguren I, Carrillo M M, Oliva D, Balderas-Mata S. Artificial intelligence in medical imaging diagnosis: are we ready for its clinical implementation?. Journal of Medical Imaging 2025;12(06)
  44. Alharbi L, Alrashoud R, Alotaibi B, Al Dera A, Alajlan R, AlHuthail R, Alessa D. Using Artificial Intelligence ChatGPT to Access Medical Information About Chemical Eye Injuries: Comparative Study. JMIR Formative Research 2025;9:e73642
  45. Yao Z, Duan L, Xu S, Chi L, Sheng D. Performance of Large Language Models in the Non-English Context: Qualitative Study of Models Trained on Different Languages in Chinese Medical Examinations. JMIR Medical Informatics 2025;13:e69485
  46. Ahmed Y, Ibrahim H, Khayal S. Evaluating advanced artificial intelligence in oncology education and clinical knowledge assessment. International Journal of Research in Medical Sciences 2025;13(7):2761
  47. Bessa R, de Oliveira A, Bessa R, Sousa D, Alves R, Barbosa A, Carneiro A, Soares C, Teles A. Performance Comparison of Large Language Models on Brazil’s Medical Revalidation Exam for Foreign-Trained Graduates. Applied Sciences 2025;15(13):7134
  48. Kim B, Shin W. Performance Evaluation of ChatGPT-4o on Korean Physical Therapist Licensing Examination. Physical Therapy Rehabilitation Science 2025;14(2):157
  49. Duarte A, Siopa C, Chaves I. Inteligência Artificial na Prova Nacional de Acesso em Portugal: O Olhar da Psiquiatria. Acta Médica Portuguesa 2025;38(8):518
  50. Mavrych V, Yousef E, Yaqinuddin A, Bolgova O. Large language models in medical education: a comparative cross-platform evaluation in answering histological questions. Medical Education Online 2025;30(1)
  51. Wei J, Wang X, Huang M, Xu Y, Yang W. Evaluating the Performance of ChatGPT on Board-Style Examination Questions in Ophthalmology: A Meta-Analysis. Journal of Medical Systems 2025;49(1)
  52. Liu Z, Zuo H, Lu Y. The Impact of ChatGPT on Students' Academic Achievement: A Meta‐Analysis. Journal of Computer Assisted Learning 2025;41(4)
  53. Barrit S, Ranuzzi G, Fetzer S, Al Barajraji M, Hadwe S, Zanello M, Ortler M, O’Flaherty J, Massager N, Madsen J, Dibué M, Carron R. Specialized AI and neurosurgeons in niche expertise: a proof-of-concept in neuromodulation with vagus nerve stimulation. Acta Neurochirurgica 2025;167(1)
  54. Hu H, Wallace D, Boateng B. Medical Education Learning Specialists in the Age of Artificial Intelligence. Cureus 2025
  55. Andrew A. A Meta-Analysis of ChatGPT’s Performance on Dermatology Specialty-Level (Board-Style) Certification Questions. Indian Dermatology Online Journal 2025;16(6):939
  56. Aptyka H, Großschedl J, Hartelt T. Bugbear or surefire success? Secondary school students’ conceptual learning about evolution with ChatGPT. International Journal of Science Education 2025:1
  57. Krumsvik R, Johansen M, Slettvoll V. Artificial intelligence, health empowerment, and the general practitioner scheme. DIGITAL HEALTH 2025;11
  58. AlSamhori J, Alkafaween A, Al-Badawi A, Alhabashneh Z, Alelaumi A, Haddad B, Nashwan A. The role of ChatGPT in improving orthopedic Patient education in low-resource settings across various orthopedic specialties. The Journal of Precision Medicine: Health and Disease 2025;3:100017
  59. Özer N, Balcı Y, Bölükbaşı G, İlhan B, Güneri P. Examining the Role of Artificial Intelligence in Assessment: A Comparative Study of ChatGPT and Educator‐Generated Multiple‐Choice Questions in a Dental Exam. European Journal of Dental Education 2025
  60. Sommer M, Arendasy M. Automatic- and Transformer-Based Automatic Item Generation: A Critical Review. Journal of Intelligence 2025;13(8):102
  61. Richlitzki C, Mansoorian S, Käsmann L, Stoleriu M, Kovacs J, Sienel W, Kauffmann-Guerrero D, Duell T, Schmidt-Hegemann N, Belka C, Corradini S, Eze C. Assessing ChatGPT’s Educational Potential in Lung Cancer Radiotherapy From Clinician and Patient Perspectives: Content Quality and Readability Analysis. JMIR Cancer 2025;11:e69783
  62. Krumsvik R. How capable is GPT-4 at answering exams and tests in Norwegian, and what implications could this have for education?. Nordic Journal of Digital Literacy 2025;20(2):113
  63. Atahan M, Üner Ç, Aydemir M, Uzun M, Yalın M, Gölgelioğlu F. Is Readability a Proxy for Reliability? A Qualitative Evaluation of ChatGPT‐4.0 in Orthopaedic Trauma Communication. Journal of Evaluation in Clinical Practice 2025;31(5)
  64. Choi S, Moon Y, Jung H. ChatGPT and human dietitian responses to diet-related questions on an online Q&A platform: A comparative study. DIGITAL HEALTH 2025;11
  65. Othman A, Sharqawi A, MohammedAziz A, Ali W, Alatiyyah A, Mirah M. Assessing the Accuracy and Completeness of AI-Generated Dental Responses: An Evaluation of the Chat-GPT Model. Healthcare 2025;13(17):2144
  66. Tian L, Lu Y, Fei X, Lu J. Intelligent Head and Neck CTA Report Quality Detection with Large Language Models. Journal of Imaging Informatics in Medicine 2025
  67. Krumsvik R, Slettvoll V. Artificial intelligence and health empowerment in rural communities and landslide- or avalanche-isolated contexts: real case at a fictitious location. Frontiers in Digital Health 2025;7
  68. Yang X, Chen W. The performance of ChatGPT on medical image-based assessments and implications for medical education. BMC Medical Education 2025;25(1)
  69. Zhang J, Sun Y, Rong Y, Li H, Jiang B, Zhao C, Liu H. Potential of AI Chatbots in Online Hair Transplantation Consultations: A Multi-metric Assessment of Three Models. Aesthetic Plastic Surgery 2025
  70. Jain N, Gottlich C, Fisher J, Winston T, Matullo K, Greenhill D. ChatGPT-4o is Not a Reliable Study Source for Orthopaedic Surgery Residents. JBJS Open Access 2025;10(3)
  71. Li Z, Xu R, Gong X, Wang C, Liu J. The top 100 most-cited articles on large language models in medicine: A bibliometric analysis. DIGITAL HEALTH 2025;11
  72. García-Rudolph A, Hernández-Pena E, del Cacho N, Teixido-Font C, Navarro-Berenguel M, Opisso E. Evaluation of the clinical reasoning of GPT-4o, a multimodal generative artificial intelligence model, in 18 public gastroenterology case studies. Revista Española de Enfermedades Digestivas 2025
  73. Jaleel A, Aziz U, Farid G, Zahid Bashir M, Mirza T, Khizar Abbas S, Aslam S, Sikander R. Evaluating the Potential and Accuracy of ChatGPT-3.5 and 4.0 in Medical Licensing and In-Training Examinations: Systematic Review and Meta-Analysis. JMIR Medical Education 2025;11:e68070
  74. Ahmed H, Suhas D, D’Souza L, Jayaram P, Gupta A, Sache M. Evaluating the performance of five large language models in generating patient educational content for pediatric cardiothoracic procedures: a comparative study. General Thoracic and Cardiovascular Surgery 2025
  75. Rai M, Ngaw M, Nannas N. Artificial Intelligence Performance in Introductory Biology: Passing Grades but Poor Performance at High Cognitive Complexity. Education Sciences 2025;15(10):1400
  76. Lin Y, Luo Z, Ye Z, Zhong N, Zhao L, Zhang L, Li X, Chen Z, Chen Y. Applications, Challenges, and Prospects of Generative Artificial Intelligence Empowering Medical Education: Scoping Review. JMIR Medical Education 2025;11:e71125
  77. Liu M, Okuhara T, Shirabe R, Nishiie Y, Xu Y, Okada H, Kiuchi T. Evaluating the Reliability and Accuracy of an AI-Powered Search Engine in Providing Responses on Dietary Supplements: Quantitative and Qualitative Evaluation. JMIR AI 2025;4:e78436
  78. Idan D, Ben-Shitrit I, Volevich M, Binyamin Y, Nassar R, Nassar M, Abelson N, Zlotnik A, Einav S. Evaluating the performance of large language models versus human researchers on real world complex medical queries. Scientific Reports 2025;15(1)
  79. Boczkowski D, Dolata T, Radej D, Sawina P, Suleiman R, Latkowska A, Kowalczyk A, Loson-Kawalec M, Jaworski W, Wielochowska A, Olender M, Latkowska A, Dadynska P, Majchrowicz W, Stachowicz A. Assessment of the Efficacy of the Google Gemini 2.5 Pro Model in Solving the Polish State Specialization Exam in Pediatric Surgery. Cureus 2025
  80. Dejean-Bouyer E, Kanlagna A, Thuau F, Perrot P, Lancien U. Performance of ChatGPT-4 on the French Board of Plastic Reconstructive and Aesthetic Surgery written exam: a descriptive study. Journal of Educational Evaluation for Health Professions 2025;22:27
  81. Bolgova O, Mavrych V. Evolution of AI in anatomy education study based on comparison of current large language models against historical ChatGPT performance. Scientific Reports 2025;15(1)
  82. Tzanis E, Adams L, Akinci D’Antonoli T, Bressem K, Cuocolo R, Kocak B, Malamateniou C, Klontzas M. Agentic systems in radiology: Principles, opportunities, privacy risks, regulation, and sustainability concerns. Diagnostic and Interventional Imaging 2025
  83. Angulo C, Martín-Noguerol T, Paulano-Godino F, De Caso García L, Luna A. Performance Comparison Between Two Versions of a Commercial Artificial Intelligence System for Chest Radiograph Interpretation: A Multicenter Study. Journal of Imaging Informatics in Medicine 2025
  84. Sridharan K, Sivaramakrishnan G. Large language models as educational collaborators: developing non-conventional teaching aids in pharmacology & therapeutics. BMC Medical Education 2025;25(1)
  85. Cammaroto G, Mira F, Favier V, Nunes H, de Castro J, Carsuzaa F, Lechien J, Chiesa Estomba C, Iannella G, Vaira L, Calvo-Henriquez C, Cheong R, de Apodaca P, Lentini M, Barillari M, Maniaci A. Experts V/S AI´s 2.0: Comparative evaluation of AI models and expert consensus in obstructive sleep apnea assessment. European Archives of Oto-Rhino-Laryngology 2025
  86. Salbas A, Yogurtcu M. Performance of Large Language Models on Radiology Residency In-Training Examination Questions. Academic Radiology 2025
  87. Tarhan M, Sahin Ozdemir M. Comparison of the accuracy and reliability of ChatGPT-4o and Gemini in answering HIV-related questions. BMC Infectious Diseases 2025;25(1)
  88. Inojosa H, Ramezanzadeh A, Gasparovic-Curtini I, Wiest I, Kather J, Gilbert S, Ziemssen T. Education Research: Can Large Language Models Match MS Specialist Training?. Neurology Education 2025;4(4)
  89. Zhu S, Xie Y, Tang Y, Yu Z, Zhao R, Dong X. New chapter in pediatric medicine: technological evolution, application, and evaluation system of large language models. European Journal of Pediatrics 2025;184(12)
  90. Meretukov D, Grechukhina K, Evdokimov V, Didych D, Kondratieva S, Rakitina O, Gordeev A, Shilo P, Khatkov I, Zhukova L. Deriving Real-World Evidence from Non-English Electronic Medical Records in Hormone Receptor-Positive Breast Cancer Using Large Language Models. Cancers 2025;17(23):3836
  91. Ros-Arlanzón P, Gutarra-Ávila R, Arrarte-Esteban V, Bertomeu-González V, Hernández-Blasco L, Masiá M, Navarro-Canto L, Nieto-Navarro J, Abarca J, Sempere A. When AI models take the exam: large language models vs medical students on multiple-choice course exams. Medical Education Online 2025;30(1)

Books/Policy Documents

  1. Xiao D, Gao C, Luo Z, Liu C, Shen S. Knowledge Science, Engineering and Management.
  2. Taranikanti V, Vuthaluru S. Mastering Problem-Based Learning in Health Profession Programs.

Conference Proceedings

  1. Chen X, Xu L. 2024 5th International Conference on Information Science and Education (ICISE-IE). Effectiveness of ChatGPT in education: a meta-analysis