Unveiling GPT-4V's hidden challenges behind high accuracy on USMLE questions: Observational Study

Kim S, Wihl J, Schramm S, Berberich C, Rosenkranz E, Schmitzer L, Serguen K, Klenk C, Lenhart N, Zimmer C, Wiestler B, Hedderich D. Human-AI collaboration in large language model-assisted brain MRI differential diagnosis: a usability study. European Radiology 2025;35(9):5252
Azizoglu M, Escolino M, Kamci T, Klyuev S, Perez Bertolez S, Risteski T, Elhalaby I, Borkar N, Esposito C, Okur M, Lacher M, Mutanen A, Shehata S, Chiarenza F, Davenport M. Generative Artificial Intelligence Accuracy in Interpreting Forest Plots in Pediatric Surgery Meta-analyses: A Perspective From Pediatric Surgery Meta-analysis Study Group (PESMA). Journal of Pediatric Surgery 2025;60(7):162359
Zhou S, Xu Z, Zhang M, Xu C, Guo Y, Zhan Z, Fang Y, Ding S, Wang J, Xu K, Xia L, Yeung J, Zha D, Cai D, Melton G, Lin M, Zhang R. Large language models for disease diagnosis: a scoping review. npj Artificial Intelligence 2025;1(1)
Boltaboyeva A, Baigarayeva Z, Imanbek B, Ozhikenov K, Getahun A, Aidarova T, Karymsakova N. A Review of Innovative Medical Rehabilitation Systems with Scalable AI-Assisted Platforms for Sensor-Based Recovery Monitoring. Applied Sciences 2025;15(12):6840
Mavrych V, Yousef E, Yaqinuddin A, Bolgova O. Large language models in medical education: a comparative cross-platform evaluation in answering histological questions. Medical Education Online 2025;30(1)
Othman A, Sharqawi A, MohammedAziz A, Ali W, Alatiyyah A, Mirah M. Assessing the Accuracy and Completeness of AI-Generated Dental Responses: An Evaluation of the Chat-GPT Model. Healthcare 2025;13(17):2144
Jaleel A, Aziz U, Farid G, Zahid Bashir M, Mirza T, Khizar Abbas S, Aslam S, Sikander R. Evaluating the Potential and Accuracy of ChatGPT-3.5 and 4.0 in Medical Licensing and In-Training Examinations: Systematic Review and Meta-Analysis. JMIR Medical Education 2025;11:e68070
Kasagga A, Sapkota A, Changaramkumarath G, Abucha J, Wollel M, Somannagari N, Husami M, Hailu K, Kasagga E. Performance of ChatGPT and Large Language Models on Medical Licensing Exams Worldwide: A Systematic Review and Network Meta-Analysis With Meta-Regression. Cureus 2025
Okayo O, Panwal N, Ihuarulam O, Nzunde M, Nyimwadang F, Oladosu T, Osunde A. Transforming Medical Laboratory Science with Vision-Language Models: A Focus on Microscopy in Microbiology, Hematology, and Histopathology. Oncology, Nuclear Medicine and Transplantology 2025;1(2):onmt008
Nguyen V, Vuong T, Nguyen V, Ma H. Benchmarking large-language-model vision capabilities in oral and maxillofacial anatomy: A cross-sectional study. PLOS One 2025;20(10):e0335775
Dundas N, Law T, Brender T, Mills H, Espejo E, A. Heintz T, Wallace A, Cobert J. All That Shines Is Not Gold: Maintaining Scientific Rigor When Evaluating, Interpreting, and Reviewing Studies Using Large Language Models. Anesthesiology 2026;144(2):272
Wu S, Xu C, Xue Z, Huang Y, Xu G, Cui Y, Ma J, Ma R, Xie C. Beyond structured knowledge: performance boundaries of ChatGPT in geological-hazard question answering and the need for human-in-the-loop oversight. Frontiers in Earth Science 2026;13
Xin J, He X. Evaluating Large Language Models as Medical Consultation Tools for Double Eyelid Surgery: A Cross-Language Study in English and Chinese. Aesthetic Plastic Surgery 2026;50(5):1706
El Natour D, Abou Alfa M, Chaaban A, Assi R, Dally T, Bou Dargham B. Performance of 5 AI Models on United States Medical Licensing Examination Step 1 Questions: Comparative Observational Study. JMIR AI 2026;5:e76928
Yao Z, Zhao Y, Mitra A, Levy D, Druhl E, Tsai J, Yu H. SynthEHR-eviction: enhancing eviction SDoH detection with LLM-augmented synthetic EHR data. npj Digital Medicine 2026;9(1)
Suh P, Suh C. Do General-Purpose Multimodal Large Language Models Really See Radiologic Images or Rely on Text? Korean Journal of Radiology 2026;27(4):297
Ramsthaler F, Verhoff M. KI-basierte Bilder zu forensischen Demonstrationszwecken. Rechtsmedizin 2026;36(2):74
Strasser L, Anschuetz W, Dennstädt F, Hastings J. Performance Evaluation of Large Language Models in Multilingual Medical Multiple-Choice Questions: Mixed Methods Study. JMIR Medical Education 2026;12:e81399
Hack S, Craig J, Lin C, Fu C, Kwiatkowska M, Kocum P, Allevi F, Saibene A. Retrieval-augmented generative AI enhances clinical reasoning in odontogenic sinusitis versus maxillary sinus mucositis. European Archives of Oto-Rhino-Laryngology 2026;283(4):2353
Meshram H, Bhagat C, Puri S, Gadireddy S, Modasia B, Batheja V, Mathur R. Effect of large language model on diagnostic accuracy and clinical completeness among nephrology fellows managing transplant infection. International Urology and Nephrology 2026
Bulut M, Reyhan A. A comparison of GPT-4V’s capability in optical coherence tomography images of age-related macular degeneration with expert assessments. BMC Ophthalmology 2026;26(1)
Huy L, Anh L, Quang M, Phuc L, Trung N, Thang V, Nam L, Dung T. Performance of GPT-5 and Gemini 2.5 Pro on the Orthopaedic In-Training Examination. Orthopedic Reviews 2026;18
Sarantopoulos A, Pana Z, Larentzakis A, Kondylis S, Maina A, Ziogas N, Ntourakis D. ChatGPT-4.0 and Medical Students: A Recognition-Gated Comparative Evaluation on Image-Based Medical Examinations. Journal of Medical Education and Curricular Development 2026;13
Lu M, Cheng J, Gopalan V. Performance of multimodal large language models on image‐based surgical anatomy, anatomical pathology, and radiology questions. Anatomical Sciences Education 2026
Kirchberger M. A Two-Tiered Rescue Protocol to Mitigate Difficulty-Based Failures of ChatGPT 5 and Gemini on the German M2 Medical Exam: Evaluation Study (Preprint). JMIR Formative Research 2025
Zhu Z, Zhao Y, Li L, Wang X, Zhang Y, Zhao X. Artificial Intelligence Performance Under Different Conditions in Answering China's Standardized Training Examination for Resident Physician in Radiology: A Comparative Analysis. Health Care Science 2026
Aydogan C, Kazaz I, Hoşbul T. From Gram Stain to Decision Support: Performance of Multimodal Large Language Models in Blood Culture Microscopy. Journal of Imaging Informatics in Medicine 2026

Conference Proceedings

Lv Y, Yu Q, Wang Z, Liang Y, Wang F, Li S. 2025 7th International Conference on Artificial Intelligence Technologies and Applications (ICAITA). CoupletEval: A Novel Benchmark for Assessing Chinese Linguistic Proficiency in Large Language Models

Citation

Please cite as:

Yang Z, Yao Z, Tasmin M, Vashisht P, Jang WS, Ouyang F, Wang B, McManus D, Berlowitz D, Yu H
Unveiling GPT-4V's hidden challenges behind high accuracy on USMLE questions: Observational Study
J Med Internet Res 2025;27:e65146
doi: 10.2196/65146 PMID: 39919278 PMCID: 11845889

This paper is in the following e-collection/theme issue:

Generative Language Models Including ChatGPT; Learning and Education; Artificial Intelligence; AI Language Models in Health Care; Applications of AI