<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.0 20040830//EN" "http://dtd.nlm.nih.gov/publishing/2.0/journalpublishing.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article" dtd-version="2.0">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">JMIR</journal-id>
      <journal-id journal-id-type="nlm-ta">J Med Internet Res</journal-id>
      <journal-title>Journal of Medical Internet Research</journal-title>
      <issn pub-type="epub">1438-8871</issn>
      <publisher>
        <publisher-name>JMIR Publications</publisher-name>
        <publisher-loc>Toronto, Canada</publisher-loc>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="publisher-id">v27i1e65146</article-id>
      <article-id pub-id-type="pmid">39919278</article-id>
      <article-id pub-id-type="doi">10.2196/65146</article-id>
      <article-categories>
        <subj-group subj-group-type="heading">
          <subject>Original Paper</subject>
        </subj-group>
        <subj-group subj-group-type="article-type">
          <subject>Original Paper</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>Unveiling GPT-4V's Hidden Challenges Behind High Accuracy on USMLE Questions: Observational Study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="editor">
          <name>
            <surname>Eysenbach</surname>
            <given-names>Gunther</given-names>
          </name>
        </contrib>
        <contrib contrib-type="editor">
          <name>
            <surname>Coristine</surname>
            <given-names>Andrew</given-names>
          </name>
        </contrib>
      </contrib-group>
      <contrib-group>
        <contrib contrib-type="reviewer">
          <name>
            <surname>Kim</surname>
            <given-names>Su-Hwan</given-names>
          </name>
        </contrib>
        <contrib contrib-type="reviewer">
          <name>
            <surname>Thies</surname>
            <given-names>Bill</given-names>
          </name>
        </contrib>
        <contrib contrib-type="reviewer">
          <name>
            <surname>Heston</surname>
            <given-names>Thomas F</given-names>
          </name>
        </contrib>
      </contrib-group>
      <contrib-group>
        <contrib id="contrib1" contrib-type="author" equal-contrib="yes">
          <name name-style="western">
            <surname>Yang</surname>
            <given-names>Zhichao</given-names>
          </name>
          <degrees>PhD</degrees>
          <xref rid="aff1" ref-type="aff">1</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0002-2797-4257</ext-link>
        </contrib>
        <contrib id="contrib2" contrib-type="author" equal-contrib="yes">
          <name name-style="western">
            <surname>Yao</surname>
            <given-names>Zonghai</given-names>
          </name>
          <degrees>MS</degrees>
          <xref rid="aff1" ref-type="aff">1</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0002-5707-8410</ext-link>
        </contrib>
        <contrib id="contrib3" contrib-type="author">
          <name name-style="western">
            <surname>Tasmin</surname>
            <given-names>Mahbuba</given-names>
          </name>
          <degrees>MS</degrees>
          <xref rid="aff1" ref-type="aff">1</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0003-1884-8838</ext-link>
        </contrib>
        <contrib id="contrib4" contrib-type="author">
          <name name-style="western">
            <surname>Vashisht</surname>
            <given-names>Parth</given-names>
          </name>
          <degrees>MS</degrees>
          <xref rid="aff1" ref-type="aff">1</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0009-0002-5556-7197</ext-link>
        </contrib>
        <contrib id="contrib5" contrib-type="author">
          <name name-style="western">
            <surname>Jang</surname>
            <given-names>Won Seok</given-names>
          </name>
          <degrees>RN, MS</degrees>
          <xref rid="aff2" ref-type="aff">2</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0009-0001-5439-7299</ext-link>
        </contrib>
        <contrib id="contrib6" contrib-type="author">
          <name name-style="western">
            <surname>Ouyang</surname>
            <given-names>Feiyun</given-names>
          </name>
          <degrees>PhD</degrees>
          <xref rid="aff2" ref-type="aff">2</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0002-7061-7351</ext-link>
        </contrib>
        <contrib id="contrib7" contrib-type="author">
          <name name-style="western">
            <surname>Wang</surname>
            <given-names>Beining</given-names>
          </name>
          <degrees>BS</degrees>
          <xref rid="aff3" ref-type="aff">3</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0009-0006-5209-4848</ext-link>
        </contrib>
        <contrib id="contrib8" contrib-type="author">
          <name name-style="western">
            <surname>McManus</surname>
            <given-names>David</given-names>
          </name>
          <degrees>MD</degrees>
          <xref rid="aff4" ref-type="aff">4</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0002-9343-6203</ext-link>
        </contrib>
        <contrib id="contrib9" contrib-type="author">
          <name name-style="western">
            <surname>Berlowitz</surname>
            <given-names>Dan</given-names>
          </name>
          <degrees>MD, MPH</degrees>
          <xref rid="aff5" ref-type="aff">5</xref>
          <xref rid="aff6" ref-type="aff">6</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0002-8783-5611</ext-link>
        </contrib>
        <contrib id="contrib10" contrib-type="author" corresp="yes">
          <name name-style="western">
            <surname>Yu</surname>
            <given-names>Hong</given-names>
          </name>
          <degrees>PhD</degrees>
          <xref rid="aff1" ref-type="aff">1</xref>
          <xref rid="aff2" ref-type="aff">2</xref>
          <address>
            <institution>Miner School of Computer &#38; Information Sciences</institution>
            <institution>University of Massachusetts Lowell</institution>
            <addr-line>1 University Ave</addr-line>
            <addr-line>Lowell, MA, 01854</addr-line>
            <country>United States</country>
            <phone>1 508 612 7292</phone>
            <email>Hong_Yu@uml.edu</email>
          </address>
          <xref rid="aff6" ref-type="aff">6</xref>
          <xref rid="aff7" ref-type="aff">7</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0001-9263-5035</ext-link>
        </contrib>
      </contrib-group>
      <aff id="aff1">
        <label>1</label>
        <institution>College of Information and Computer Science</institution>
        <institution>University of Massachusetts Amherst</institution>
        <addr-line>Amherst, MA</addr-line>
        <country>United States</country>
      </aff>
      <aff id="aff2">
        <label>2</label>
        <institution>Miner School of Computer &#38; Information Sciences</institution>
        <institution>University of Massachusetts Lowell</institution>
        <addr-line>Lowell, MA</addr-line>
        <country>United States</country>
      </aff>
      <aff id="aff3">
        <label>3</label>
        <institution>Shanghai Medical College</institution>
        <institution>Fudan University</institution>
        <addr-line>Shanghai</addr-line>
        <country>China</country>
      </aff>
      <aff id="aff4">
        <label>4</label>
        <institution>Department of Medicine</institution>
        <institution>University of Massachusetts Chan Medical School</institution>
        <addr-line>Worcester, MA</addr-line>
        <country>United States</country>
      </aff>
      <aff id="aff5">
        <label>5</label>
        <institution>Department of Public Health</institution>
        <institution>University of Massachusetts Lowell</institution>
        <addr-line>Lowell, MA</addr-line>
        <country>United States</country>
      </aff>
      <aff id="aff6">
        <label>6</label>
        <institution>Center for Biomedical and Health Research in Data Sciences</institution>
        <institution>University of Massachusetts Lowell</institution>
        <addr-line>Lowell, MA</addr-line>
        <country>United States</country>
      </aff>
      <aff id="aff7">
        <label>7</label>
        <institution>Center for Healthcare Organization and Implementation Research</institution>
        <institution>VA Bedford Health Care System</institution>
        <addr-line>Bedford, MA</addr-line>
        <country>United States</country>
      </aff>
      <author-notes>
        <corresp>Corresponding Author: Hong Yu <email>Hong_Yu@uml.edu</email></corresp>
      </author-notes>
      <pub-date pub-type="collection">
        <year>2025</year>
      </pub-date>
      <pub-date pub-type="epub">
        <day>7</day>
        <month>2</month>
        <year>2025</year>
      </pub-date>
      <volume>27</volume>
      <elocation-id>e65146</elocation-id>
      <history>
        <date date-type="received">
          <day>20</day>
          <month>8</month>
          <year>2024</year>
        </date>
        <date date-type="rev-request">
          <day>16</day>
          <month>10</month>
          <year>2024</year>
        </date>
        <date date-type="rev-recd">
          <day>6</day>
          <month>11</month>
          <year>2024</year>
        </date>
        <date date-type="accepted">
          <day>26</day>
          <month>11</month>
          <year>2024</year>
        </date>
      </history>
      <copyright-statement>©Zhichao Yang, Zonghai Yao, Mahbuba Tasmin, Parth Vashisht, Won Seok Jang, Feiyun Ouyang, Beining Wang, David McManus, Dan Berlowitz, Hong Yu. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 07.02.2025.</copyright-statement>
      <copyright-year>2025</copyright-year>
      <license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/">
        <p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.</p>
      </license>
      <self-uri xlink:href="https://www.jmir.org/2025/1/e65146" xlink:type="simple"/>
      <abstract>
        <sec sec-type="background">
          <title>Background</title>
          <p>Recent advancements in artificial intelligence, such as GPT-3.5 Turbo (OpenAI) and GPT-4, have demonstrated significant potential by achieving good scores on text-only United States Medical Licensing Examination (USMLE) exams and effectively answering questions from physicians. However, the ability of these models to interpret medical images remains underexplored.</p>
        </sec>
        <sec sec-type="objective">
          <title>Objective</title>
          <p>This study aimed to comprehensively evaluate the performance, interpretability, and limitations of GPT-3.5 Turbo, GPT-4, and its successor, GPT-4 Vision (GPT-4V), specifically focusing on GPT-4V’s newly introduced image-understanding feature. By assessing the models on medical licensing examination questions that require image interpretation, we sought to highlight the strengths and weaknesses of GPT-4V in handling complex multimodal clinical information, thereby exposing hidden flaws and providing insights into its readiness for integration into clinical settings.</p>
        </sec>
        <sec sec-type="methods">
          <title>Methods</title>
          <p>This cross-sectional study tested GPT-4V, GPT-4, and ChatGPT-3.5 Turbo on a total of 227 multiple-choice questions with images from USMLE Step 1 (n=19), Step 2 clinical knowledge (n=14), Step 3 (n=18), the Diagnostic Radiology Qualifying Core Exam (DRQCE) (n=26), and AMBOSS question banks (n=150). AMBOSS provided expert-written hints and question difficulty levels. GPT-4V’s accuracy was compared with 2 state-of-the-art large language models, GPT-3.5 Turbo and GPT-4. The quality of the explanations was evaluated by choosing human preference between an explanation by GPT-4V (without hint), an explanation by an expert, or a tie, using 3 qualitative metrics: comprehensive explanation, question information, and image interpretation. To better understand GPT-4V’s explanation ability, we modified a patient case report to resemble a typical “curbside consultation” between physicians.</p>
        </sec>
        <sec sec-type="results">
          <title>Results</title>
          <p>For questions with images, GPT-4V achieved an accuracy of 84.2%, 85.7%, 88.9%, and 73.1% in Step 1, Step 2 clinical knowledge, Step 3 of USMLE, and DRQCE, respectively. It outperformed GPT-3.5 Turbo (42.1%, 50%, 50%, 19.2%) and GPT-4 (63.2%, 64.3%, 66.7%, 26.9%). When GPT-4V answered correctly, its explanations were nearly as good as those provided by domain experts from AMBOSS. However, incorrect answers often had poor explanation quality: 18.2% (10/55) contained inaccurate text, 45.5% (25/55) had inference errors, and 76.3% (42/55) demonstrated image misunderstandings. With human expert assistance, GPT-4V reduced errors by an average of 40% (22/55). GPT-4V accuracy improved with hints, maintaining stable performance across difficulty levels, while medical student performance declined as difficulty increased. In a simulated curbside consultation scenario, GPT-4V required multiple specific prompts to interpret complex case data accurately.</p>
        </sec>
        <sec sec-type="conclusions">
          <title>Conclusions</title>
          <p>GPT-4V achieved high accuracy on multiple-choice questions with images, highlighting its potential in medical assessments. However, significant shortcomings were observed in the quality of explanations when questions were answered incorrectly, particularly in the interpretation of images, which could not be efficiently resolved through expert interaction. These findings reveal hidden flaws in the image interpretation capabilities of GPT-4V, underscoring the need for more comprehensive evaluations beyond multiple-choice questions before integrating GPT-4V into clinical settings.</p>
        </sec>
      </abstract>
      <kwd-group>
        <kwd>artificial intelligence</kwd>
        <kwd>natural language processing</kwd>
        <kwd>large language model</kwd>
        <kwd>LLM</kwd>
        <kwd>ChatGPT</kwd>
        <kwd>GPT</kwd>
        <kwd>GPT-4V</kwd>
        <kwd>USMLE</kwd>
        <kwd>Medical License Exam</kwd>
        <kwd>medical image interpretation</kwd>
        <kwd>United States Medical Licensing Examination</kwd>
        <kwd>NLP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec sec-type="introduction">
      <title>Introduction</title>
      <p>Using computers to help make clinical diagnoses and guide treatments has been a goal of artificial intelligence (AI) since its inception [<xref ref-type="bibr" rid="ref1">1</xref>]. The adoption of electronic health record systems by hospitals in the United States has resulted in an unprecedented amount of digital data associated with patient encounters. Computer-assisted clinical diagnostic support systems (CDSSs) endeavor to enhance clinicians’ decisions with patient information and clinical knowledge [<xref ref-type="bibr" rid="ref2">2</xref>]. There is burgeoning interest in CDSS for enhanced imaging [<xref ref-type="bibr" rid="ref3">3</xref>] in various disciplines such as breast cancer detection [<xref ref-type="bibr" rid="ref4">4</xref>], COVID detection [<xref ref-type="bibr" rid="ref5">5</xref>], diagnosing congenital cataracts [<xref ref-type="bibr" rid="ref6">6</xref>], and hidden fracture location [<xref ref-type="bibr" rid="ref7">7</xref>]. For a decision to be trustworthy for clinicians, CDSS should not only make the prediction but also provide accurate explanations [<xref ref-type="bibr" rid="ref8">8</xref>-<xref ref-type="bibr" rid="ref10">10</xref>]. However, most previous imaging CDSSs only highlight areas deemed significant by AI [<xref ref-type="bibr" rid="ref11">11</xref>-<xref ref-type="bibr" rid="ref14">14</xref>], providing limited insight into the explanation of the diagnosis [<xref ref-type="bibr" rid="ref15">15</xref>].</p>
      <p>Recent advances in large language models (LLMs) have encouraged much discussion in health care. State-of-the-art LLMs include GPT-3.5 Turbo, a chatbot released by OpenAI in October 2022, and its successor, GPT-4, released in March 2023. The success of GPT-3.5 Turbo and GPT-4 is attributed to their conversational ability and their performance, which have approached or matched human-level competence in cognitive tasks, spanning various domains including medicine [<xref ref-type="bibr" rid="ref16">16</xref>]. Both GPT-3.5 Turbo and GPT-4 have achieved commendable results in the United States Medical Licensing Examination (USMLE), leading to discussions about the readiness of LLM applications for integration into clinical [<xref ref-type="bibr" rid="ref17">17</xref>-<xref ref-type="bibr" rid="ref19">19</xref>] and educational [<xref ref-type="bibr" rid="ref20">20</xref>-<xref ref-type="bibr" rid="ref22">22</xref>] environments.</p>
      <p>One limitation of GPT-3.5 Turbo and GPT-4 is that they can only read and generate text and are unable to process other data modalities, such as images. This limitation, known as “single modality,” is a common issue among many LLMs [<xref ref-type="bibr" rid="ref23">23</xref>]. Advancements in multimodal LLMs promise enhanced capabilities and integration with diverse data sources [<xref ref-type="bibr" rid="ref24">24</xref>-<xref ref-type="bibr" rid="ref26">26</xref>]. OpenAI’s GPT-4V is a state-of-the-art multimodal LLM equipped with image processing and understanding ability [<xref ref-type="bibr" rid="ref27">27</xref>]. However, the ability of GPT-4V to answer medical questions with images and provide explanations has not been comprehensively evaluated. In this study, we aimed to expose hidden flaws in GPT-4V’s ability to interpret clinical images by thoroughly evaluating its performance on medical licensing examination questions involving image interpretation. For GPT-4V to be useful to medical professionals, it should not only provide correct responses but also offer accurate explanations for its reasoning, especially in complex multimodal clinical scenarios [<xref ref-type="bibr" rid="ref28">28</xref>].</p>
    </sec>
    <sec sec-type="method">
      <title>Methods</title>
      <sec>
        <title>Overview</title>
        <p>This cross-sectional study aimed to expose the hidden flaws of GPT-4V in clinical image interpretation by comparing the performance between GPT-4V, GPT-4, and GPT-3.5 Turbo in answering medical licensing examination questions. This study also investigates the quality of GPT-4V explanation in answering these questions. The overview of the study is shown in <xref rid="figure1" ref-type="fig">Figure 1</xref>. This study was conducted in October 2023.</p>
        <fig id="figure1" position="float">
          <label>Figure 1</label>
          <caption>
            <p>A summary of the image question selection process and prompt to large language models.</p>
          </caption>
          <graphic xlink:href="jmir_v27i1e65146_fig1.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
        </fig>
      </sec>
      <sec>
        <title>Ethical Considerations</title>
        <p>The requirement for ethical approval and informed consent was waived by the institutional review board at the VA Bedford Health Care System because no patient data were used. The experiments were performed in accordance with the Declaration of Helsinki.</p>
      </sec>
      <sec>
        <title>Medical Exams and a Patient Case Report Collection</title>
        <p>We obtained study questions from 3 sources. USMLE consists of 3 steps required to obtain a medical license in the United States. USMLE assesses a physician’s ability to apply knowledge, concepts, and principles, which is critical to both health and disease management and is the foundation for safe, efficient patient care. Step 1 assesses foundational scientific concepts essential for medical practice, Step 2 clinical knowledge (CK) evaluates the application of clinical science for supervised patient care, and Step 3 tests the medical knowledge required for unsupervised practice. Step 1, Step 2 CK, and Step 3 of the USMLE sample exam released from the National Board of Medical Examiners consist of 119, 120, and 137 questions respectively. We accessed these questions from publicly available links [<xref ref-type="bibr" rid="ref29">29</xref>]. Each question contained multiple options to choose from. We then selected all questions with images, resulting in 19, 14, and 18 questions from Step 1, Step 2 CK, and Step 3. Medical subdomains include but are not limited to radiology, dermatology, orthopedics, ophthalmology, cardiology, and general surgery.</p>
        <p>The sample exam only included limited questions with images. Thus, we further collected similar questions from AMBOSS, a widely used question bank for medical students, which provides students’ performance on the exam. The performance enabled us to assess the comparative effectiveness of the model. For each question, AMBOSS associated an expert-written hint to tip the student to answer the question and a difficulty level that ranges from 1-5. Levels 1, 2, 3, 4, and 5 represent the easiest 20%, 20%-50%, 50%-80%, 80%-95%, and 95%-100% of questions respectively [<xref ref-type="bibr" rid="ref30">30</xref>]. Hints are designed to guide students to the correct answer. They are typically formatted as a short paragraph that describes the image. We manually checked that no hint had disclosed the answer directly. In addition to the gold standard choice, each answer is associated with a detailed explanation by AMBOSS. They were developed through an internal peer-review process involving more than 50 physicians who achieved high scores in the exam. We used a commercial license to access the questions. Since AMBOSS is not publicly available and its licensing terms restrict the automatic website scraping of its proprietary content, they are not in the CommonCrawl data set used to train GPTs [<xref ref-type="bibr" rid="ref31">31</xref>]. We randomly selected and manually downloaded 10 questions from each of the 5 difficulty levels. We repeated this process for Step 1, Step 2 CK, and Step 3. This resulted in a total number of 150 questions.</p>
        <p>In addition, we collected questions from the Diagnostic Radiology Qualifying Core Exam (DRQCE) [<xref ref-type="bibr" rid="ref32">32</xref>], which is an image-rich exam to evaluate a candidate’s foundational knowledge and clinical judgment across practice domains of diagnostic radiology, which is offered after 36 months of residency training. Since DRQCE is proprietary, we used a commercial license to access the 26 questions with images out of the 54 questions in the preparation exam offered by the American Board of Radiology. In total, we had 227 questions with images from the 3 aforementioned sources.</p>
        <p>To illustrate GPT-4V’s potential as an imaging diagnostic support tool and further expose its limitations, we used part of a patient case report [<xref ref-type="bibr" rid="ref33">33</xref>] to resemble a typical “curbside consultation” between medical professionals [<xref ref-type="bibr" rid="ref34">34</xref>]. In this case, the patient’s admission info, such as history of present illness, labs, and images of the case report will be presented to both a physician and GPT-4V. The physician can then work with GPT-4V through question answering, for example, by asking GPT-4V to help interpret images, for the final clinical diagnosis.</p>
      </sec>
      <sec>
        <title>How to Answer Image Questions Using GPT-4V Prompts</title>
        <p>GPT-4V took image and text data as inputs to generate textual outputs. Given that input format (prompt) played a key role in optimizing model performance, we followed the standard prompting guidelines of the visual question-answering task [<xref ref-type="bibr" rid="ref35">35</xref>]. Specifically, we prompted GPT-4V by first adding the image, then appending context (ie, patient information) and questions, and finally providing multiple-choice options, each separated by a new line. An example user prompt and GPT-4V response are shown in Figure S1 in <xref ref-type="supplementary-material" rid="app1">Multimedia Appendix 1</xref>. When multiple subimages existed in the image, we uploaded multiple subimages to GPT-4V. We did not append a hint to the end of the question, unless otherwise specified. The response consists of the selected option as an answer, supported by a textual explanation to substantiate the selected decision. When using GPT-3.5 Turbo and GPT-4 models that cannot handle image data, images were omitted from the prompt. These models were accessed through OpenAI application programming interfaces. Responses were collected from the September 25, 2023, version of models.</p>
      </sec>
      <sec>
        <title>Evaluation Metrics</title>
        <p>For answer accuracy, we evaluated the model’s performance by comparing the model’s choice with the correct choice provided by the exam board or question bank website. We defined accuracy as the ratio of the number of correct choices to the total number of questions.</p>
        <p>We also evaluated the quality of the explanation by preference from 3 health care professionals (1 medical doctor with 35 years of experience in internal medicine, 1 registered ward nurse with 2 years of experience, and 1 third-year medical school student). For each question from the AMBOSS data set (n=150), we first asked the health care professionals to choose their preference between an explanation by GPT-4V (without hint), an explanation by an expert, or a tie without knowing the correctness of GPT-4V’s answers. The exclusion of correctness is to avoid bias in their preference of explanations. In addition, the source of the explanations was blinded to the health care professionals, ensuring that their judgments were not influenced by knowing whether an explanation came from GPT-4V or an expert.</p>
        <p>In addition, we also asked health care professionals to evaluate the GPT-4V explanation from a sufficient and comprehensive perspective [<xref ref-type="bibr" rid="ref36">36</xref>,<xref ref-type="bibr" rid="ref37">37</xref>]. They determined if the information exists in the explanation, that consists of (1) image interpretation: GPT-4V tried to interpret the image in the explanation, and such interpretation is sufficient to support its choice; (2) question information: explanations contained information related to the textual context (ie, patient information) of the question, and such information was essential for GPT-4V’s choice; (3) comprehensive explanation: the explanation included comprehensive reasoning for all possible evidence (eg, symptoms, lab results) that leads to the final answer.</p>
        <p>Finally, for each question answered incorrectly, we asked health care professionals to check if the explanation contained any errors that consisted of (1) image misunderstanding (if the sentence in the explanation showed an incorrect interpretation of the image; eg, GPT-4V said that a bone in the image was for the hand, but it was in fact the foot); (2) text hallucination (if the sentence in the explanation contained made-up information [<xref ref-type="bibr" rid="ref38">38</xref>]; eg, claiming Saxenda was insulin); (3) reasoning error (if the sentence did not properly infer knowledge in either image or text to an answer; eg, GPT-4V reasoned that a patient took a trip within the last 3 months and therefore diagnosed the patient as having Chagas disease, despite the clinical knowledge that Chagas disease usually develops 10-20 years after infection); or (4) nonmedical error (GPT is known to struggle with tasks requiring precise spatial localization, such as identifying chess positions on the board [<xref ref-type="bibr" rid="ref27">27</xref>]).</p>
        <p>In this study, we asked an internal medicine doctor with 35 years of experience to articulate a detailed rating guideline above. Our study has shown that the medical student and nurse, both of whom participated independently, agreed with the doctor’s ratings 95% and 86% of the time, respectively. This high agreement ratio underscores the effectiveness of the standardized guidelines in ensuring consistent evaluation across varying levels of expertise.</p>
      </sec>
      <sec>
        <title>Statistical Analysis</title>
        <p>Chi-square tests and pairwise comparisons with Bonferroni corrections were used for the performance metrics of GPT-3.5 Turbo, GPT-4, and GPT-4V on visual question answering exams. GPT-4V’s accuracies on the AMBOSS data set were compared between different difficulties using unpaired chi-square tests with a significance level of 0.05. All analysis was conducted in Python software (version 3.10.11; Python Software Foundation).</p>
      </sec>
    </sec>
    <sec sec-type="results">
      <title>Results</title>
      <sec>
        <title>Overall Answer Accuracy</title>
        <p>For all questions in the USMLE sample exam (including ones without images), GPT-4V achieved an accuracy of 88.2%, 90.8%, and 92.7% among Step 1, Step 2 CK, and Step 3 of USMLE questions, respectively. In comparison, GPT-3.5 Turbo and GPT-4 achieved an accuracy of 55.1% and 81.5% in Step 1, 59.1% and 80.8% in Step 2 CK, and 60.9% and 88.3% in Step 3, respectively (<xref ref-type="table" rid="table1">Table 1</xref>). GPT-4V outperformed GPT-4 and GPT-3.5 Turbo by 11.3% (95% CI 11.1%-11.5%; <italic>P</italic>&#60;.001) and 32% (95% CI 31.7%-32.3%; <italic>P</italic>&#60;.001). The score of GPT-4V passes the standard for the USMLE (about 60%). The performance of GPT-4V across different subdomains is shown in Table S1 in <xref ref-type="supplementary-material" rid="app1">Multimedia Appendix 1</xref>.</p>
        <table-wrap position="float" id="table1">
          <label>Table 1</label>
          <caption>
            <p>Performance of GPT-3.5 Turbo, GPT-4, and GPT-4V on a USMLE sample exam from the National Board of Medical Examiners without hints.</p>
          </caption>
          <table width="1000" cellpadding="5" cellspacing="0" border="1" rules="groups" frame="hsides">
            <col width="30"/>
            <col width="460"/>
            <col width="290"/>
            <col width="220"/>
            <thead>
              <tr valign="top">
                <td colspan="2">Exam name and agents</td>
                <td colspan="2">Performance</td>
              </tr>
              <tr valign="top">
                <td colspan="2">
                  <break/>
                </td>
                <td>Questions with image, n (%)</td>
                <td>All questions, n (%)</td>
              </tr>
            </thead>
            <tbody>
              <tr valign="top">
                <td colspan="4">
                  <bold>USMLE<sup>a</sup> sample exam-Step 1<sup>b</sup></bold>
                </td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>Sample size</td>
                <td>19</td>
                <td>119</td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>GPT-3.5 Turbo</td>
                <td>8 (42.1)</td>
                <td>66 (55.1)</td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>GPT-4</td>
                <td>12 (63.2)</td>
                <td>97 (81.5)</td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>GPT-4V</td>
                <td>16 (84.2)</td>
                <td>105 (88.2)</td>
              </tr>
              <tr valign="top">
                <td colspan="4">
                  <bold>USMLE sample exam-Step 2 clinical knowledge<sup>c</sup></bold>
                </td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>Sample size</td>
                <td>14</td>
                <td>120</td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>GPT-3.5 Turbo</td>
                <td>7 (50)</td>
                <td>71 (59.1)</td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>GPT-4</td>
                <td>9 (64.3)</td>
                <td>97 (80.8)</td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>GPT-4V</td>
                <td>12 (85.7)</td>
                <td>109 (90.8)</td>
              </tr>
              <tr valign="top">
                <td colspan="4">
                  <bold>USMLE sample exam-Step 3<sup>d</sup></bold>
                </td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>Sample size</td>
                <td>18</td>
                <td>137</td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>GPT-3.5 Turbo</td>
                <td>9 (50)</td>
                <td>73 (60.9)</td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>GPT-4</td>
                <td>12 (66.7)</td>
                <td>121 (88.3)</td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>GPT-4V</td>
                <td>16 (88.9)</td>
                <td>127 (92.7)</td>
              </tr>
              <tr valign="top">
                <td colspan="4">
                  <bold>DRQCE<sup>e</sup> sample exam<sup>f</sup></bold>
                </td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>Sample size</td>
                <td>26</td>
                <td>54</td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>GPT-3.5 Turbo</td>
                <td>5 (19.2)</td>
                <td>31 (57.4)</td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>GPT-4</td>
                <td>7 (26.9)</td>
                <td>35 (64.8)</td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>GPT-4V</td>
                <td>19 (73.1)</td>
                <td>48 (88.9)</td>
              </tr>
            </tbody>
          </table>
          <table-wrap-foot>
            <fn id="table1fn1">
              <p><sup>a</sup>USMLE: United States Medical Licensing Examination.</p>
            </fn>
            <fn id="table1fn2">
              <p><sup>b</sup>19 questions with images and 119 questions in total in Step 1.</p>
            </fn>
            <fn id="table1fn3">
              <p><sup>c</sup>14 questions with images and 120 questions in total in Step 2 CK.</p>
            </fn>
            <fn id="table1fn4">
              <p><sup>d</sup>There were 18 questions with images and 137 questions in total in Step 3.</p>
            </fn>
            <fn id="table1fn5">
              <p><sup>e</sup>DRQCE: Diagnostic Radiology Qualifying Core Exam.</p>
            </fn>
            <fn id="table1fn6">
              <p><sup>f</sup>There were 26 questions with images and 54 questions in total in DRQCE.</p>
            </fn>
          </table-wrap-foot>
        </table-wrap>
        <p>For questions with images, GPT-4V achieved an accuracy of 84.2%, 85.7%, and 88.9% in Step 1, Step 2 CK, and Step 3 of USMLE questions, respectively. It outperformed GPT-3.5 Turbo and GPT-4 by 42.1% (8/19; 95% CI 36.8%-47.4%; <italic>P</italic>&#60;.001) and 21.1% (4/19; 95% CI 7.8%-34.2%; <italic>P</italic>=.01) in Step 1, 35.7% (5/14; 95% CI 3.1%-39.7%; <italic>P</italic>=.03) and 21.4% (3/14; 95% CI 4.7%-38.1%; <italic>P</italic>=.02) in Step 2 CK, 38.9% (7/18; 95% CI 32.2%-45.7%; <italic>P</italic>&#60;.001) and 22.2% (4/18; 95% CI 5.5%-38.9%; <italic>P</italic>=.02) in Step 3, respectively. Similarly, GPT-4V achieved an accuracy of 73.1%, outperforming GPT-3.5 Turbo by 53.9% (14/26; 95% CI 41.6%-66.2%; <italic>P</italic>&#60;.001) and GPT-4 by 46.2% (12/26; 95% CI 29.8%-62.5%; <italic>P</italic>&#60;.001) in DRQCE (<xref ref-type="table" rid="table1">Table 1</xref>). This highlights the superior ability of GPT-4V to interpret clinical images compared with earlier versions.</p>
      </sec>
      <sec>
        <title>Impact of Difficulty Level and Use of Hints</title>
        <p>When asking GPT-4V questions without a hint, it achieved an accuracy of 60%, 64%, and 66% for AMBOSS Step 1, Step 2 CK, and Step 3, respectively (<xref ref-type="table" rid="table2">Table 2</xref>). GPT-4V was in the 72nd, 76th, and 80th percentile with AMBOSS users who were preparing for Step 1, Step 2 CK, and Step 3, respectively. When asking GPT-4V questions with a hint, it achieved an accuracy of 84%, 86%, and 88% for AMBOSS Step 1, Step 2 CK, and Step 3, respectively. Figure S2 in <xref ref-type="supplementary-material" rid="app1">Multimedia Appendix 1</xref> is an example where GPT-4V switched the answer from incorrect to correct when a hint was provided. GPT-4V predictions on the entire AMBOSS data set with images are reported in Table S2 in <xref ref-type="supplementary-material" rid="app1">Multimedia Appendix 1</xref> (n=646). Conclusions drawn from automatic evaluation align with our findings presented in <xref ref-type="table" rid="table2">Table 2</xref> (n=150).</p>
        <table-wrap position="float" id="table2">
          <label>Table 2</label>
          <caption>
            <p>Performance of GPT-4V on AMBOSS.</p>
          </caption>
          <table width="1000" cellpadding="5" cellspacing="0" border="1" rules="groups" frame="hsides">
            <col width="30"/>
            <col width="260"/>
            <col width="0"/>
            <col width="170"/>
            <col width="0"/>
            <col width="140"/>
            <col width="0"/>
            <col width="110"/>
            <col width="0"/>
            <col width="100"/>
            <col width="0"/>
            <col width="100"/>
            <col width="0"/>
            <col width="90"/>
            <thead>
              <tr valign="top">
                <td colspan="3">AMBOSS steps and hint availability</td>
                <td colspan="11">GPT-4V accuracy on AMBOSS, %</td>
              </tr>
              <tr valign="top">
                <td colspan="3">
                  <break/>
                </td>
                <td colspan="2">Overall (n=50)</td>
                <td colspan="2">1 (n=10)</td>
                <td colspan="2">2 (n=10)</td>
                <td colspan="2">3 (n=10)</td>
                <td colspan="2">4 (n=10)</td>
                <td>5 (n=10)</td>
              </tr>
            </thead>
            <tbody>
              <tr valign="top">
                <td colspan="14">
                  <bold>Step 1</bold>
                </td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>Without hint</td>
                <td colspan="2">60</td>
                <td colspan="2">70</td>
                <td colspan="2">70</td>
                <td colspan="2">30</td>
                <td colspan="2">70</td>
                <td colspan="2">60</td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>Expert hint</td>
                <td colspan="2">84</td>
                <td colspan="2">80</td>
                <td colspan="2">80</td>
                <td colspan="2">80</td>
                <td colspan="2">90</td>
                <td colspan="2">90</td>
              </tr>
              <tr valign="top">
                <td colspan="14">
                  <bold>Step 2 clinical knowledge</bold>
                </td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>Without hint</td>
                <td colspan="2">64</td>
                <td colspan="2">80</td>
                <td colspan="2">70</td>
                <td colspan="2">70</td>
                <td colspan="2">50</td>
                <td colspan="2">50</td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>Expert hint</td>
                <td colspan="2">86</td>
                <td colspan="2">100</td>
                <td colspan="2">90</td>
                <td colspan="2">100</td>
                <td colspan="2">70</td>
                <td colspan="2">70</td>
              </tr>
              <tr valign="top">
                <td colspan="14">
                  <bold>Step 3</bold>
                </td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>Without hint</td>
                <td colspan="2">66</td>
                <td colspan="2">80</td>
                <td colspan="2">90</td>
                <td colspan="2">60</td>
                <td colspan="2">50</td>
                <td colspan="2">50</td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>Expert hint</td>
                <td colspan="2">88</td>
                <td colspan="2">90</td>
                <td colspan="2">90</td>
                <td colspan="2">90</td>
                <td colspan="2">90</td>
                <td colspan="2">80</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p><xref rid="figure2" ref-type="fig">Figure 2</xref> shows a decreasing trend in GPT-4V’s performance in the AMBOSS data set when the difficulty of questions increased (<italic>P</italic>=.04) without a hint. However, with the hint, the performance of GPT-4V plateaued across 5 difficulty levels. Importantly, the accuracies of both GPT-4V, with or without a hint, in general, outperformed the accuracies of medical students, and the gap between the performance of GPT-4V and medical students increased when the difficulty increased. On the most difficult questions, GPT-4V with hint outperformed medical students by 60% (18/30, 95% CI 56.8%-63.1%; <italic>P</italic>&#60;.001), and GPT-4V without hint outperformed medical students by 26.7% (8/30, 95% CI 24.2%-29.3%; <italic>P</italic>&#60;.001). The findings show that while GPT-4V outperforms medical students in accuracy, its performance is largely dependent on context-based hints, reflecting a fundamental flaw in image reasoning.</p>
        <fig id="figure2" position="float">
          <label>Figure 2</label>
          <caption>
            <p>Performance of GPT-4V and students on 150 AMBOSS questions with different difficulty levels.</p>
          </caption>
          <graphic xlink:href="jmir_v27i1e65146_fig2.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
        </fig>
      </sec>
      <sec>
        <title>Quality of Explanation</title>
        <p>We evaluated the user’s preference among GPT-4V-generated explanations and expert-generated explanations. When GPT-4V answered incorrectly, our results show that health care professionals overwhelmingly preferred expert explanations as shown in <xref ref-type="table" rid="table3">Table 3</xref>. In total, 47 preferred experts and 0 preferred GPT-4V. When GPT-4V answered correctly, the quality of GPT-4V-generated explanations was close to expert-generated explanations: out of 95 votes, 19 preferred experts, 15 preferred GPT-4V, and 61 preferred either. The preference for expert explanations in incorrect answers highlights key weaknesses in GPT-4V’s ability to interpret clinical images accurately and offer dependable reasoning.</p>
        <table-wrap position="float" id="table3">
          <label>Table 3</label>
          <caption>
            <p>Health care professionals preferred explanations for 150 AMBOSS questions.</p>
          </caption>
          <table width="1000" cellpadding="5" cellspacing="0" border="1" rules="groups" frame="hsides">
            <col width="30"/>
            <col width="350"/>
            <col width="250"/>
            <col width="200"/>
            <col width="170"/>
            <thead>
              <tr valign="top">
                <td colspan="2">AMBOSS steps and correctness of GPT-4V (without hint) responses</td>
                <td colspan="3">Health care professionals’ preference</td>
              </tr>
              <tr valign="top">
                <td colspan="2">
                  <break/>
                </td>
                <td>Prefer expert</td>
                <td>Ties</td>
                <td>Prefer GPT-4V</td>
              </tr>
            </thead>
            <tbody>
              <tr valign="top">
                <td colspan="5">
                  <bold>Step 1</bold>
                </td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>Correct</td>
                <td>4</td>
                <td>23</td>
                <td>3</td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>Incorrect</td>
                <td>16</td>
                <td>4</td>
                <td>0</td>
              </tr>
              <tr valign="top">
                <td colspan="5">
                  <bold>Step 2 clinical knowledge</bold>
                </td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>Correct</td>
                <td>10</td>
                <td>15</td>
                <td>7</td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>Incorrect</td>
                <td>18</td>
                <td>0</td>
                <td>0</td>
              </tr>
              <tr valign="top">
                <td colspan="5">
                  <bold>Step 3</bold>
                </td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>Correct</td>
                <td>5</td>
                <td>23</td>
                <td>5</td>
              </tr>
              <tr valign="top">
                <td>
                  <break/>
                </td>
                <td>Incorrect</td>
                <td>13</td>
                <td>4</td>
                <td>0</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>We further evaluated the quality of the GPT-4V generated explanation by verifying whether the explanation includes image and question text interpretation in Table S3 in <xref ref-type="supplementary-material" rid="app1">Multimedia Appendix 1</xref>. When examining the 95 correct answers, 84.2% (n=80) of the responses contained an interpretation of the image, while 96.8% (n=92) aptly captured the information presented in the question. On the other hand, for the 55 incorrect answers, 92.7% (n=51) interpreted the image, and 89.1% (n=49) depicted the question’s details. In terms of comprehensiveness, GPT-4V offered a comprehensive explanation in 79% (n=75) of correct responses. In contrast, only 7.2% (n=4) of the wrong responses had a comprehensive explanation that led to the GPT-4V’s choice.</p>
        <p>We also evaluated the explanations of GPT-4V’s incorrect responses and grouped them into 4 categories: image misunderstanding, text hallucination, reasoning error, and nonmedical error. Among GPT-4V responses with wrong answers (n=55), we found that 76.3% (n=42) of responses included a misunderstanding of the image, 45.5% (n=25) of responses included a reasoning error, 18.2% (n=10) of responses included text hallucination, and no responses included nonmedical errors.</p>
      </sec>
      <sec>
        <title>A Case Study of Curbside Consultation</title>
        <p>We present a clinical case study involving a 45-year-old woman with hypertension and altered mental status. As shown in Figure S3 in <xref ref-type="supplementary-material" rid="app1">Multimedia Appendix 1</xref>, a collaborative design of GPT-4V allows communication between GPT-4V and physicians. In this scenario, when asked to interpret a CT scan, GPT-4V initially provided an irrelevant answer. GPT-4V needed 5 additional physician-guided prompts to list potential diagnoses, including primary aldosteronism, hypertension, and Cushing’s syndrome. For instance, when the physician specifically prompted, “If I suspect Cushing’s syndrome due to ectopic ACTH secretion, what would be the next steps to evaluate this patient to determine the source of the hormonal abnormality?” and pointed to a specific area on the CT scan, GPT-4V was then able to respond correctly. This interaction indicates that GPT-4V struggles to autonomously interpret medical images, requiring continuous and specific prompts for accurate interpretation, which underscores its flaws in independent image reasoning.</p>
      </sec>
    </sec>
    <sec sec-type="discussion">
      <title>Discussion</title>
      <sec>
        <title>Principal Findings</title>
        <p>Recent advancements in medical question-answering systems have leveraged domain-specific transformer models. Early models such as PubMedBERT [<xref ref-type="bibr" rid="ref39">39</xref>] with 100 million parameters score around 38.3% in USMLE. The introduction of larger models marked a substantial improvement. JMLR [<xref ref-type="bibr" rid="ref40">40</xref>] with 13 billion parameters, Med-Palm [<xref ref-type="bibr" rid="ref41">41</xref>] with 540 billion parameters, and GPT-4 achieves 62.5%, 86.2%, and 90.2% respectively. However, previous works only tested these models on text-only questions without images [<xref ref-type="bibr" rid="ref20">20</xref>,<xref ref-type="bibr" rid="ref42">42</xref>-<xref ref-type="bibr" rid="ref44">44</xref>] or questions in non-English languages [<xref ref-type="bibr" rid="ref45">45</xref>,<xref ref-type="bibr" rid="ref46">46</xref>]. Unlike previous works that focus primarily on accuracy [<xref ref-type="bibr" rid="ref47">47</xref>,<xref ref-type="bibr" rid="ref48">48</xref>], we emphasize explanation quality as a crucial metric for assessing the model’s clinical applicability. In particular, we evaluated GPT-4V’s ability to interpret medical images (a new feature) to highlight hidden flaws in clinical image interpretation.</p>
        <p>We found that GPT-4V outperformed both GPT-3.5 Turbo and GPT-4 (<xref ref-type="table" rid="table1">Table 1</xref>). When evaluating all questions in the USMLE sample exam, GPT-4V achieved an accuracy of 90.7% outperforming GPT-3.5 Turbo (58.5%) and GPT-4 (83.8%). In comparison, medical students can pass the USMLE exam with more than 60% accuracy, indicating that the GPT-4V performed at a level similar to or above a medical student in the final year of study. The accuracy of GPT-4V highlights its grasp over biomedical and clinical sciences, essential for medical practice, and showcases its ability in patient management and problem-solving skills [<xref ref-type="bibr" rid="ref49">49</xref>]. Other studies further demonstrated the potential for clinical routines, such as summarizing radiology reports [<xref ref-type="bibr" rid="ref50">50</xref>] and differential diagnosis [<xref ref-type="bibr" rid="ref51">51</xref>,<xref ref-type="bibr" rid="ref52">52</xref>].</p>
        <p>For medical exam questions with images, we found that GPT-4V achieved an accuracy of 62%, which was equivalent to the 70th-80th percentile with AMBOSS medical students. This finding indicates that GPT-4V has the capability to integrate information from both text and images to answer questions, making it a promising tool for answering clinical questions based on images. However, our evaluation also reveals hidden flaws in its image interpretation, particularly in its inconsistency and the need for extensive context to provide accurate answers.</p>
        <p>Another important finding is that GPT-4V significantly outperformed medical students for questions considered difficult for the students. Specifically, our results, as shown in <xref rid="figure2" ref-type="fig">Figure 2</xref>, show that while medical students’ performance linearly decreased when the difficulty of questions increased, GPT-4V’s performance stayed relatively stable. When expert hints were provided, GPT-4V’s performance plateaued across questions at all difficulty levels. This consistent performance indicates that GPT-4V effectively addresses questions that medical students find challenging. Its advanced capabilities suggest potential as an educational assistant, particularly for complex topics. With the guidance of teachers’ hints, medical students could benefit from its advanced capabilities to understand and analyze complex medical questions.</p>
        <p>There may be multiple factors that contribute to GPT-4V’s performance on difficult questions. Instrument methods (eg, item response theory [<xref ref-type="bibr" rid="ref53">53</xref>]) have been typically used for the construction and evaluation of measurement scales and tests. For example, item response theory uses a statistical model that links an individual person’s responses to individual test items (questions on a test) to the person’s ability to correctly respond to the items and the items’ features. Therefore, medical examination test sets have been specifically selected and tailored to medical students’ performance with the intended distribution where the performance decreases when the difficulty level increases. Although more evaluation is needed to draw the conclusion that GPT-4V substantially outperformed medical students in difficult questions, our results at least show that GPT-4V performed differently.</p>
        <p>On the other hand, we found that GPT-4V’s performance was inconsistent among different medical subdomains. As shown in Table S1 in <xref ref-type="supplementary-material" rid="app1">Multimedia Appendix 1</xref>, GPT-4V achieved high accuracy on subdomains such as immunology (5/5, 100%), otolaryngology (6/6, 100%), and pulmonology (6/8, 75%), and low accuracy on others such as anatomy (1/4, 25%), emergency medicine (1/4, 25%), and pathology (5/10, 50%). This suggests that while GPT-4V shows potential in some specialties or subdomains, it may require further development to be reliable across the board. The uneven performance highlights the need for tailored approaches to enhancing the model’s capabilities where it falls short.</p>
        <p>Another advantage of GPT-4V is its ability to explain its image content. Previous studies have shown limited use of current CDSS as most of them offered limited decision explanations and thus gained limited trust among physicians (unlike their colleagues) [<xref ref-type="bibr" rid="ref54">54</xref>-<xref ref-type="bibr" rid="ref57">57</xref>]. In contrast, GPT-4V has the potential to improve the effectiveness and credibility of CDSS by providing explanations preferred by experts. As our results indicate, the quality of explanations generated by GPT-4V, when answering correctly, is close to that of expert-generated explanations. Although in more complex scenarios (such as in our curbside consult setting), GPT-4V currently requires continuous highly specialized guidance, which temporarily prevents it from enhancing physician work efficiency, this feature still has the potential to encourage physicians to adopt and use GPT-4V more confidently and broadly.</p>
        <p>In terms of explanation quality, we found that more than 80% of responses from GPT-4V provided an interpretation of the image, regardless of whether the responses were correct or not. This suggests that GPT-4V consistently takes into account the image while generating responses. Figure S1 in <xref ref-type="supplementary-material" rid="app1">Multimedia Appendix 1</xref> illustrates an example of a high-quality explanation that uses images to answer a hard question. In this example, more than 70% of students answered incorrectly on the first try, because both bacterial pneumonia and pulmonary embolism may involve symptoms such as cough. To differentiate them, GPT-4V correctly interpreted the x-ray with a radiologic sign of Hampton hump, which further increased the suspicion of pulmonary infarction rather than pneumonia [<xref ref-type="bibr" rid="ref58">58</xref>]. To show the need for an x-ray as mentioned in the explanation, we removed the image from the input, and GPT-4V switched the answer to bacterial pneumonia while also acknowledging the possibility of pulmonary infarction. This change in response demonstrated the high quality of the GPT-4V explanation, as its explanation about x-rays was not fictional and it truly needed the x-ray to answer this question.</p>
        <p>On the other hand, we found that the quality of generated explanations was poor when GPT-4V answered incorrectly. Manual analyses by health care professionals concluded that image misunderstanding was the primary reason why GPT-4V answered incorrectly. Out of 55 wrong responses, 42 (76.3%) were due to misunderstanding of the image. In comparison, only 18.2% (10/55) of the mistakes were attributed to text misinterpretation. Clearly, GPT-4V’s proficiency in processing images was considerably lagging behind its text-handling capability. This gap in capability suggests that GPT-4V’s advancements in image understanding remain nascent and require significant refinement to align with its text analysis capabilities. To circumvent its image interpretation issue, we additionally prompted GPT-4V with a short hint that described the image. We found that 40% (22/55) of responses switched to the correct answer. One potential future direction involves strengthening GPT-4V's domain-specific knowledge by integrating extensive clinical datasets into its training. For example, employing domain-adaptive pretraining methods—such as those used in MEDITRON [<xref ref-type="bibr" rid="ref59">59</xref>], which leverages medical guidelines and specialized clinical corpora—could significantly improve the model’s understanding of medical concepts, leading to more precise and contextually relevant explanations. In addition, incorporating retrieval-augmented generation based on domain-specific corpora [<xref ref-type="bibr" rid="ref40">40</xref>] would enable the model to access and retrieve pertinent clinical information during inference, grounding its explanations in verified data. This could improve factual accuracy and reduce the likelihood of incorrect or unsupported responses. Together, these strategies aim to bolster the model's capacity to provide high-quality, accurate explanations, thereby enhancing its overall reliability and usefulness in clinical applications.</p>
        <p>Creating these image-related hints requires clinical expertise, limiting the use of GPT-4V as a CDSS. In our case study, when GPT-4V delivered an irrelevant response, the physician needed to come up with correct hints for GPT-4V. These findings reveal a key limitation: GPT-4V’s reliance on external guidance from experts to interpret complex image content effectively, thereby exposing its inability to operate independently in clinical scenarios. Efforts improving GPT-4V on images include multimodal LLMs with reinforcement learning from human feedback to align the outputs of LLMs with physicians’ intentions and expectations. This alignment is critical not only for enhancing the accuracy and relevance of the responses but also for integrating GPT-4V seamlessly into clinical environments where time is of the essence [<xref ref-type="bibr" rid="ref60">60</xref>].</p>
        <p>Another significant drawback of GPT-4V involved its tendency to produce factually inaccurate responses, a problem often referred to as the hallucination effect, which is prevalent among many LLMs such as GPT-4V [<xref ref-type="bibr" rid="ref38">38</xref>]. We found that more than 18% of GPT-4V explanations contain hallucinations, potentially misleading or distracting physicians, particularly the less experienced medical students and residents. This finding emphasizes the need for robust evaluation and correction mechanisms to minimize hallucinations, which are critical to ensure GPT-4V’s reliability and safety in clinical practice. One future direction is to integrate GPT-4V and a probabilistic model with CI and citations from credible sources to show the reliability of the response [<xref ref-type="bibr" rid="ref40">40</xref>,<xref ref-type="bibr" rid="ref61">61</xref>,<xref ref-type="bibr" rid="ref62">62</xref>]. The confidence score could also help prioritize the list of differential diagnoses, making it clearer to the physician which conditions are more probable. Thereby reducing the risk of confusion and enhancing the reliability of the CDSS response when additional physician review is warranted [<xref ref-type="bibr" rid="ref15">15</xref>].</p>
      </sec>
      <sec>
        <title>Limitations</title>
        <p>This study has several limitations. First, our findings are constrained in their applicability due to the modest sample size. We gathered 227 questions from a total of 28 subdomains or specialties that included images, which might not comprehensively represent all medical disciplines. The small number of questions in each subdomain may not be sufficient to conclude that GPT-4V’s performance is inconsistent between medical subdomains. Second, the exams used to test GPT-4V are written in English. Future work could explore other languages. Third, the models used for evaluation were from September 2023, and frontier models may have evolved since then, potentially impacting the results. Fourth, we cannot guarantee that OpenAI strictly adhered to licensing terms when determining which content was included or excluded from their training sets; therefore, even though AMBOSS is not publicly available and its licensing terms restrict the automatic website scraping of its proprietary content, GPT may have already seen the data during training, potentially impacting the results. Finally, while GPT-4V has demonstrated proficiency in medical license examination, its CDSS ability remains untested. Future work could explore continued training GPT-4V in the medical domain for better CDSS integration. Medical exams provide options, but such options would rarely be provided by physicians during CDSS. Our study highlights the inherent limitations in GPT-4V’s image interpretation abilities, particularly without expert guidance. We showed that GPT-4V can reduce errors with expert hints, but in more realistic clinical environments, it required continuous highly specialized guidance to make partially correct diagnoses and subsequent examination recommendations, revealing limitations in its autonomous decision-making capabilities. Therefore, more cases with clinician questions should be explored to confirm our findings before clinical integration. 
Extrapolating the efficacy of GPT-4V to broader clinical applications requires appropriate benchmarks and further research.</p>
        <p>Regarding ethical considerations, deploying AI systems for medical advice poses significant ethical implications, especially in medical education and clinical decision-making. Incorrect AI-generated explanations risk disseminating misinformation that could misguide medical professionals, impacting patient safety and treatment outcomes. This is particularly concerning when AI is used in training settings, as it could shape the decision-making abilities of future healthcare providers in potentially harmful ways. Integrating AI into clinical workflows also raises broader societal concerns. While AI has the potential to enhance healthcare efficiency, it could alter patient care dynamics and physician roles. Overreliance on AI may reduce direct physician-patient communication, eroding trust and undermining the relationship-building essential for effective care. Physicians might also become too dependent on AI, potentially compromising their clinical judgment and their ability to critically assess AI-generated insights. Thus, integrating AI in a manner that complements human expertise (supporting rather than replacing health care providers) is vital. Moreover, current benchmarks, including the one in our study, do not fully assess an AI’s capabilities for real-world clinical decision-making. Although some LLMs perform well on benchmarks, they lack the comprehensive clinical skills and nuanced understanding required to navigate complex medical scenarios effectively. Viewing these AI models as tools that assist rather than replace clinicians is crucial to ensuring their safe and beneficial use in health care. A responsible approach is needed when deploying AI for medical advice, one that ensures ethical standards are maintained. Issues such as privacy, bias, and the broader implications of AI in society must guide the development and implementation of these systems. 
By enhancing data diversity, ensuring privacy, and fostering a transparent understanding of AI’s role, we can work toward ethical advancements in health care that enhance outcomes without compromising human oversight or patient trust. Future work should focus on developing AI technologies that are fully aligned with health care professionals, maintaining a collaborative and ethically sound approach to their integration.</p>
      </sec>
      <sec>
        <title>Conclusion</title>
        <p>In this study, GPT-4V demonstrated remarkable overall accuracy on the medical licensing examination and provided high-quality explanations when correct. The evaluation of questions with images (a relatively novel feature for GPT models) allowed us to expose hidden flaws in GPT-4V’s image interpretation abilities, offering a unique insight into its strengths and weaknesses. Its performance on image-related questions ranged from 60% to 88%, while physician misdiagnosis rates can be as high as 40% [<xref ref-type="bibr" rid="ref63">63</xref>,<xref ref-type="bibr" rid="ref64">64</xref>]. GPT-4V substantially outperformed medical students on difficult questions, but we observed severe issues in its explanations and reasoning, including hallucinations, errors, and misinterpretations. These findings reveal significant challenges in GPT-4V’s ability to independently interpret and reason through complex image-based questions, which is crucial for clinical applications. Despite its strong performance on multiple-choice questions, GPT-4V may still encounter comprehension or explanation errors. When assisted by human experts, GPT-4V reduced some errors with image-related hints. However, in realistic curbside consult settings, continuous and highly specialized prompting was still required, making it time-consuming and limiting its utility as a clinical decision support system in real-world clinical practice. <xref ref-type="table" rid="table4">Table 4</xref> lists the summary of key findings.</p>
        <table-wrap position="float" id="table4">
          <label>Table 4</label>
          <caption>
            <p>Summary of key findings.</p>
          </caption>
          <table width="1000" cellpadding="5" cellspacing="0" border="1" rules="groups" frame="hsides">
            <col width="290"/>
            <col width="710"/>
            <thead>
              <tr valign="top">
                <td>Metric</td>
                <td>Findings</td>
              </tr>
            </thead>
            <tbody>
              <tr valign="top">
                <td>Accuracy of image-based questions</td>
                <td>GPT-4V achieved 84.2% in Step 1, 85.7% in Step 2 CK<sup>a</sup>, 88.9% in Step 3, and 73.1% in DRQCE<sup>b</sup>, outperforming GPT-3.5 Turbo (42.1%, 50%, 50%, 19.2%) and GPT-4 (63.2%, 64.3%, 66.7%, 26.9%).</td>
              </tr>
              <tr valign="top">
                <td>Explanation quality</td>
                <td>When GPT-4V provided correct answers, its explanations were almost on par with those given by domain experts. However, for incorrect responses, the explanation quality was often lacking: 18.2% included inaccurate information, 45.5% involved inferencing mistakes, and 76.3% reflected misinterpretations of images.</td>
              </tr>
              <tr valign="top">
                <td>The impact of human expert hints</td>
                <td>There is a decreasing trend in GPT-4V’s performance in the AMBOSS dataset as the difficulty of questions increased without the hint. However, with the hint, the performance of GPT-4V plateaued.</td>
              </tr>
              <tr valign="top">
                <td>Performance of GPT-4V on most difficult questions</td>
                <td>GPT-4V with hints outperformed medical students by 60%, and GPT-4V without hints outperformed medical students by 26.7%.</td>
              </tr>
            </tbody>
          </table>
          <table-wrap-foot>
            <fn id="table4fn1">
              <p><sup>a</sup>CK: clinical knowledge.</p>
            </fn>
            <fn id="table4fn2">
              <p><sup>b</sup>DRQCE: Diagnostic Radiology Qualifying Core Exam.</p>
            </fn>
          </table-wrap-foot>
        </table-wrap>
        <p>Overall, our findings emphasize the need for a more comprehensive evaluation of GPT-4V’s multimodal capabilities, especially in clinical image interpretation, before considering its integration into clinical decision support systems. Future randomized clinical trials will help further verify the actual utility of GPT-4V and promote more extensive and profound integration of AI in the medical domain.</p>
      </sec>
    </sec>
  </body>
  <back>
    <app-group>
      <supplementary-material id="app1">
        <label>Multimedia Appendix 1</label>
        <p>Supplementary figures and tables.</p>
        <media xlink:href="jmir_v27i1e65146_app1.docx" xlink:title="DOCX File , 1884 KB"/>
      </supplementary-material>
    </app-group>
    <glossary>
      <title>Abbreviations</title>
      <def-list>
        <def-item>
          <term id="abb1">AI</term>
          <def>
            <p>artificial intelligence</p>
          </def>
        </def-item>
        <def-item>
          <term id="abb2">CDSS</term>
          <def>
            <p>clinical decision support system</p>
          </def>
        </def-item>
        <def-item>
          <term id="abb3">CK</term>
          <def>
            <p>clinical knowledge</p>
          </def>
        </def-item>
        <def-item>
          <term id="abb4">DRQCE</term>
          <def>
            <p>Diagnostic Radiology Qualifying Core Exam</p>
          </def>
        </def-item>
        <def-item>
          <term id="abb5">LLMs</term>
          <def>
            <p>large language models</p>
          </def>
        </def-item>
        <def-item>
          <term id="abb6">USMLE</term>
          <def>
            <p>United States Medical Licensing Examination</p>
          </def>
        </def-item>
      </def-list>
    </glossary>
    <ack>
      <p>This work was funded by the grant R01MH125027 from the National Institute of Mental Health of the National Institutes of Health. The funding source had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication. The contents of this paper do not represent the views of the National Institutes of Health.</p>
    </ack>
    <notes>
      <sec>
        <title>Data Availability</title>
        <p>The data sets generated during and/or analyzed during this study are available in the USMLE [<xref ref-type="bibr" rid="ref29">29</xref>], AMBOSS [<xref ref-type="bibr" rid="ref30">30</xref>], and DRQCE [<xref ref-type="bibr" rid="ref32">32</xref>] repositories.</p>
      </sec>
    </notes>
    <fn-group>
      <fn fn-type="con">
        <p>HY initialized the conceptualization of the project. Yang Z and Yao Z designed the study. Yang Z, Yao Z, MT, and PV implemented the methods. Yang Z, Yao Z, WJ, FO, BW, and DB performed the data analysis. Yang Z and Yao Z interpreted the results with substantial input from DM, DB, and HY. All authors contributed to manuscript preparation.</p>
      </fn>
      <fn fn-type="conflict">
        <p>None declared.</p>
      </fn>
    </fn-group>
    <ref-list>
      <ref id="ref1">
        <label>1</label>
        <nlm-citation citation-type="book">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Shortliffe</surname>
              <given-names>EH</given-names>
            </name>
            <name name-style="western">
              <surname>Cimino</surname>
              <given-names>JJ</given-names>
            </name>
          </person-group>
          <source>Biomedical Informatics: Computer Applications in Health Care and Biomedicine</source>
          <year>2014</year>
          <month>12</month>
          <day>02</day>
          <publisher-loc>New York</publisher-loc>
          <publisher-name>Springer</publisher-name>
        </nlm-citation>
      </ref>
      <ref id="ref2">
        <label>2</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Sutton</surname>
              <given-names>RT</given-names>
            </name>
            <name name-style="western">
              <surname>Pincock</surname>
              <given-names>D</given-names>
            </name>
            <name name-style="western">
              <surname>Baumgart</surname>
              <given-names>DC</given-names>
            </name>
            <name name-style="western">
              <surname>Sadowski</surname>
              <given-names>DC</given-names>
            </name>
            <name name-style="western">
              <surname>Fedorak</surname>
              <given-names>RN</given-names>
            </name>
            <name name-style="western">
              <surname>Kroeker</surname>
              <given-names>KI</given-names>
            </name>
          </person-group>
          <article-title>An overview of clinical decision support systems: benefits, risks, and strategies for success</article-title>
          <source>NPJ Digit Med</source>
          <year>2020</year>
          <volume>3</volume>
          <fpage>17</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://doi.org/10.1038/s41746-020-0221-y"/>
          </comment>
          <pub-id pub-id-type="doi">10.1038/s41746-020-0221-y</pub-id>
          <pub-id pub-id-type="medline">32047862</pub-id>
          <pub-id pub-id-type="pii">221</pub-id>
          <pub-id pub-id-type="pmcid">PMC7005290</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref3">
        <label>3</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Rajpurkar</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Lungren</surname>
              <given-names>MP</given-names>
            </name>
          </person-group>
          <article-title>The current and future state of AI interpretation of medical images</article-title>
          <source>N Engl J Med</source>
          <year>2023</year>
          <volume>388</volume>
          <issue>21</issue>
          <fpage>1981</fpage>
          <lpage>1990</lpage>
          <pub-id pub-id-type="doi">10.1056/NEJMra2301725</pub-id>
          <pub-id pub-id-type="medline">37224199</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref4">
        <label>4</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Aggarwal</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Sounderajah</surname>
              <given-names>V</given-names>
            </name>
            <name name-style="western">
              <surname>Martin</surname>
              <given-names>G</given-names>
            </name>
            <name name-style="western">
              <surname>Ting</surname>
              <given-names>DSW</given-names>
            </name>
            <name name-style="western">
              <surname>Karthikesalingam</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>King</surname>
              <given-names>D</given-names>
            </name>
            <name name-style="western">
              <surname>Ashrafian</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Darzi</surname>
              <given-names>A</given-names>
            </name>
          </person-group>
          <article-title>Diagnostic accuracy of deep learning in medical imaging: a systematic review and meta-analysis</article-title>
          <source>NPJ Digit Med</source>
          <year>2021</year>
          <volume>4</volume>
          <issue>1</issue>
          <fpage>65</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://doi.org/10.1038/s41746-021-00438-z"/>
          </comment>
          <pub-id pub-id-type="doi">10.1038/s41746-021-00438-z</pub-id>
          <pub-id pub-id-type="medline">33828217</pub-id>
          <pub-id pub-id-type="pii">10.1038/s41746-021-00438-z</pub-id>
          <pub-id pub-id-type="pmcid">PMC8027892</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref5">
        <label>5</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Wang</surname>
              <given-names>L</given-names>
            </name>
            <name name-style="western">
              <surname>Lin</surname>
              <given-names>ZQ</given-names>
            </name>
            <name name-style="western">
              <surname>Wong</surname>
              <given-names>A</given-names>
            </name>
          </person-group>
          <article-title>COVID-Net: a tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images</article-title>
          <source>Sci Rep</source>
          <year>2020</year>
          <volume>10</volume>
          <issue>1</issue>
          <fpage>19549</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://doi.org/10.1038/s41598-020-76550-z"/>
          </comment>
          <pub-id pub-id-type="doi">10.1038/s41598-020-76550-z</pub-id>
          <pub-id pub-id-type="medline">33177550</pub-id>
          <pub-id pub-id-type="pii">10.1038/s41598-020-76550-z</pub-id>
          <pub-id pub-id-type="pmcid">PMC7658227</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref6">
        <label>6</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Long</surname>
              <given-names>E</given-names>
            </name>
            <name name-style="western">
              <surname>Lin</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Liu</surname>
              <given-names>Z</given-names>
            </name>
            <name name-style="western">
              <surname>Wu</surname>
              <given-names>X</given-names>
            </name>
            <name name-style="western">
              <surname>Wang</surname>
              <given-names>L</given-names>
            </name>
            <name name-style="western">
              <surname>Jiang</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>An</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Lin</surname>
              <given-names>Z</given-names>
            </name>
            <name name-style="western">
              <surname>Li</surname>
              <given-names>X</given-names>
            </name>
            <name name-style="western">
              <surname>Chen</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Li</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Cao</surname>
              <given-names>Q</given-names>
            </name>
            <name name-style="western">
              <surname>Wang</surname>
              <given-names>D</given-names>
            </name>
            <name name-style="western">
              <surname>Liu</surname>
              <given-names>X</given-names>
            </name>
            <name name-style="western">
              <surname>Chen</surname>
              <given-names>W</given-names>
            </name>
            <name name-style="western">
              <surname>Liu</surname>
              <given-names>Y</given-names>
            </name>
          </person-group>
          <article-title>An artificial intelligence platform for the multihospital collaborative management of congenital cataracts</article-title>
          <source>Nat Biomed Eng</source>
          <year>2017</year>
          <volume>1</volume>
          <issue>2</issue>
          <fpage>0024</fpage>
          <pub-id pub-id-type="doi">10.1038/s41551-016-0024</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref7">
        <label>7</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Rayan</surname>
              <given-names>JC</given-names>
            </name>
            <name name-style="western">
              <surname>Reddy</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Kan</surname>
              <given-names>JH</given-names>
            </name>
            <name name-style="western">
              <surname>Zhang</surname>
              <given-names>W</given-names>
            </name>
            <name name-style="western">
              <surname>Annapragada</surname>
              <given-names>AV</given-names>
            </name>
          </person-group>
          <article-title>Binomial classification of pediatric elbow fractures using a deep learning multiview approach emulating radiologist decision making</article-title>
          <source>Radiol Artif Intell</source>
          <year>2019</year>
          <volume>1</volume>
          <issue>1</issue>
          <fpage>e180015</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://europepmc.org/abstract/MED/33937781"/>
          </comment>
          <pub-id pub-id-type="doi">10.1148/ryai.2019180015</pub-id>
          <pub-id pub-id-type="medline">33937781</pub-id>
          <pub-id pub-id-type="pmcid">PMC8017418</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref8">
        <label>8</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Bussone</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Stumpf</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>O'Sullivan</surname>
              <given-names>D</given-names>
            </name>
          </person-group>
          <article-title>The role of explanations on trust and reliance in clinical decision support systems</article-title>
          <year>2015</year>
          <conf-name>International Conference on Healthcare Informatics. Published online</conf-name>
          <conf-date>2015 October 23</conf-date>
          <conf-loc>USA</conf-loc>
          <pub-id pub-id-type="doi">10.1109/ichi.2015.26</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref9">
        <label>9</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Panigutti</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Beretta</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Giannotti</surname>
              <given-names>F</given-names>
            </name>
            <name name-style="western">
              <surname>Pedreschi</surname>
              <given-names>D</given-names>
            </name>
          </person-group>
          <article-title>Understanding the impact of explanations on advice-taking: a user study for AI-based clinical Decision Support Systems</article-title>
          <year>2022</year>
          <conf-name>CHI Conference on Human Factors in Computing Systems. Published online 2022</conf-name>
          <conf-date>2022 April 29</conf-date>
          <conf-loc>USA</conf-loc>
          <pub-id pub-id-type="doi">10.1145/3491102.3502104</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref10">
        <label>10</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Gaube</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Suresh</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Raue</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Lermer</surname>
              <given-names>E</given-names>
            </name>
            <name name-style="western">
              <surname>Koch</surname>
              <given-names>TK</given-names>
            </name>
            <name name-style="western">
              <surname>Hudecek</surname>
              <given-names>MFC</given-names>
            </name>
            <name name-style="western">
              <surname>Ackery</surname>
              <given-names>AD</given-names>
            </name>
            <name name-style="western">
              <surname>Grover</surname>
              <given-names>SC</given-names>
            </name>
            <name name-style="western">
              <surname>Coughlin</surname>
              <given-names>JF</given-names>
            </name>
            <name name-style="western">
              <surname>Frey</surname>
              <given-names>D</given-names>
            </name>
            <name name-style="western">
              <surname>Kitamura</surname>
              <given-names>FC</given-names>
            </name>
            <name name-style="western">
              <surname>Ghassemi</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Colak</surname>
              <given-names>E</given-names>
            </name>
          </person-group>
          <article-title>Non-task expert physicians benefit from correct explainable AI advice when reviewing X-rays</article-title>
          <source>Sci Rep</source>
          <year>2023</year>
          <volume>13</volume>
          <issue>1</issue>
          <fpage>1383</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://doi.org/10.1038/s41598-023-28633-w"/>
          </comment>
          <pub-id pub-id-type="doi">10.1038/s41598-023-28633-w</pub-id>
          <pub-id pub-id-type="medline">36697450</pub-id>
          <pub-id pub-id-type="pii">10.1038/s41598-023-28633-w</pub-id>
          <pub-id pub-id-type="pmcid">PMC9876883</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref11">
        <label>11</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Singh</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Mohammed</surname>
              <given-names>AR</given-names>
            </name>
            <name name-style="western">
              <surname>Zelek</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Lakshminarayanan</surname>
              <given-names>V</given-names>
            </name>
          </person-group>
          <article-title>Interpretation of deep learning using attributions: application to ophthalmic diagnosis</article-title>
          <source>In: Optical Engineering + Applications</source>
          <year>2020</year>
          <volume>11511</volume>
          <fpage>11</fpage>
        </nlm-citation>
      </ref>
      <ref id="ref12">
        <label>12</label>
        <nlm-citation citation-type="book">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Eitel</surname>
              <given-names>F</given-names>
            </name>
            <name name-style="western">
              <surname>Ritter</surname>
              <given-names>K</given-names>
            </name>
          </person-group>
          <person-group person-group-type="editor">
            <name name-style="western">
              <surname>Inuzuki</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Reyes</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Syeda-Mahmood</surname>
              <given-names>T</given-names>
            </name>
          </person-group>
          <article-title>Testing the robustness of attribution methods for convolutional neural networks in MRI-based Alzheimer's disease classification</article-title>
          <source>Interpretability of Machine Intelligence in Medical Image Computing and Multimodal Learning for Clinical Decision Support. Springer International Publishing</source>
          <year>2019</year>
          <publisher-loc>New York</publisher-loc>
          <publisher-name>Springer International Publishing</publisher-name>
        </nlm-citation>
      </ref>
      <ref id="ref13">
        <label>13</label>
        <nlm-citation citation-type="web">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Papanastasopoulos</surname>
              <given-names>Z</given-names>
            </name>
            <name name-style="western">
              <surname>Samala</surname>
              <given-names>RK</given-names>
            </name>
            <name name-style="western">
              <surname>Chan</surname>
              <given-names>HP</given-names>
            </name>
            <name name-style="western">
              <surname>Hadjiiski</surname>
              <given-names>L</given-names>
            </name>
            <name name-style="western">
              <surname>Paramagul</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Helvie</surname>
              <given-names>MA</given-names>
            </name>
            <name name-style="western">
              <surname>Neal</surname>
              <given-names>CH</given-names>
            </name>
          </person-group>
          <source>Explainable AI for medical imaging: deep-learning CNN ensemble for classification of estrogen receptor status from breast MRI</source>
          <access-date>2020-02-16</access-date>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://www.spiedigitallibrary.org/conference-proceedings-of-spie/11314/2549298/Explainable-AI-for-medical-imaging--deep-learning-CNN-ensemble/10.1117/12.2549298.short">https://www.spiedigitallibrary.org/conference-proceedings-of-spie/11314/2549298/Explainable-AI-for-medical-imaging--deep-learning-CNN-ensemble/10.1117/12.2549298.short</ext-link>
          </comment>
        </nlm-citation>
      </ref>
      <ref id="ref14">
        <label>14</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Shamout</surname>
              <given-names>FE</given-names>
            </name>
            <name name-style="western">
              <surname>Shen</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Wu</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Kaku</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Park</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Makino</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Jastrzębski</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Witowski</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Wang</surname>
              <given-names>D</given-names>
            </name>
            <name name-style="western">
              <surname>Zhang</surname>
              <given-names>B</given-names>
            </name>
            <name name-style="western">
              <surname>Dogra</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Cao</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Razavian</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Kudlowitz</surname>
              <given-names>D</given-names>
            </name>
            <name name-style="western">
              <surname>Azour</surname>
              <given-names>L</given-names>
            </name>
            <name name-style="western">
              <surname>Moore</surname>
              <given-names>W</given-names>
            </name>
            <name name-style="western">
              <surname>Lui</surname>
              <given-names>YW</given-names>
            </name>
            <name name-style="western">
              <surname>Aphinyanaphongs</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Fernandez-Granda</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Geras</surname>
              <given-names>KJ</given-names>
            </name>
          </person-group>
          <article-title>An artificial intelligence system for predicting the deterioration of COVID-19 patients in the emergency department</article-title>
          <source>NPJ Digit Med</source>
          <year>2021</year>
          <volume>4</volume>
          <issue>1</issue>
          <fpage>80</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://doi.org/10.1038/s41746-021-00453-0"/>
          </comment>
          <pub-id pub-id-type="doi">10.1038/s41746-021-00453-0</pub-id>
          <pub-id pub-id-type="medline">33980980</pub-id>
          <pub-id pub-id-type="pii">10.1038/s41746-021-00453-0</pub-id>
          <pub-id pub-id-type="pmcid">PMC8115328</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref15">
        <label>15</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Shen</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Heacock</surname>
              <given-names>L</given-names>
            </name>
            <name name-style="western">
              <surname>Elias</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Hentel</surname>
              <given-names>KD</given-names>
            </name>
            <name name-style="western">
              <surname>Reig</surname>
              <given-names>B</given-names>
            </name>
            <name name-style="western">
              <surname>Shih</surname>
              <given-names>G</given-names>
            </name>
            <name name-style="western">
              <surname>Moy</surname>
              <given-names>L</given-names>
            </name>
          </person-group>
          <article-title>ChatGPT and other large language models are double-edged swords</article-title>
          <source>Radiology</source>
          <year>2023</year>
          <volume>307</volume>
          <issue>2</issue>
          <fpage>e230163</fpage>
          <pub-id pub-id-type="doi">10.1148/radiol.230163</pub-id>
          <pub-id pub-id-type="medline">36700838</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref16">
        <label>16</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <collab>OpenAI</collab>
            <name name-style="western">
              <surname>Achiam</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Adler</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Agarwal</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Ahmad</surname>
              <given-names>L</given-names>
            </name>
            <name name-style="western">
              <surname>Akkaya</surname>
              <given-names>I</given-names>
            </name>
            <name name-style="western">
              <surname>Aleman</surname>
              <given-names>FL</given-names>
            </name>
            <name name-style="western">
              <surname>Almeida</surname>
              <given-names>D</given-names>
            </name>
            <name name-style="western">
              <surname>Altenschmidt</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Altman</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Anadkat</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Avila</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Babuschkin</surname>
              <given-names>I</given-names>
            </name>
            <name name-style="western">
              <surname>Balaji</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Balcom</surname>
              <given-names>V</given-names>
            </name>
            <name name-style="western">
              <surname>Baltescu</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Bao</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Bavarian</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Belgum</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Bello</surname>
              <given-names>I</given-names>
            </name>
            <name name-style="western">
              <surname>Berdine</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Bernadett-Shapiro</surname>
              <given-names>G</given-names>
            </name>
            <name name-style="western">
              <surname>Bogdonoff</surname>
              <given-names>L</given-names>
            </name>
            <name name-style="western">
              <surname>Boiko</surname>
              <given-names>O</given-names>
            </name>
            <name name-style="western">
              <surname>Boyd</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Brakman</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Brockman</surname>
              <given-names>G</given-names>
            </name>
          </person-group>
          <article-title>GPT-4 Technical Report</article-title>
          <source>ArXiv. 2023;abs/2303.08774</source>
          <year>2023</year>
          <pub-id pub-id-type="doi">10.48550/arXiv.2303.08774</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref17">
        <label>17</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Goodman</surname>
              <given-names>RS</given-names>
            </name>
            <name name-style="western">
              <surname>Patrinely</surname>
              <given-names>JR</given-names>
            </name>
            <name name-style="western">
              <surname>Stone</surname>
              <given-names>CA</given-names>
            </name>
            <name name-style="western">
              <surname>Zimmerman</surname>
              <given-names>E</given-names>
            </name>
            <name name-style="western">
              <surname>Donald</surname>
              <given-names>RR</given-names>
            </name>
            <name name-style="western">
              <surname>Chang</surname>
              <given-names>SS</given-names>
            </name>
            <name name-style="western">
              <surname>Berkowitz</surname>
              <given-names>ST</given-names>
            </name>
            <name name-style="western">
              <surname>Finn</surname>
              <given-names>AP</given-names>
            </name>
            <name name-style="western">
              <surname>Jahangir</surname>
              <given-names>E</given-names>
            </name>
            <name name-style="western">
              <surname>Scoville</surname>
              <given-names>EA</given-names>
            </name>
            <name name-style="western">
              <surname>Reese</surname>
              <given-names>TS</given-names>
            </name>
            <name name-style="western">
              <surname>Friedman</surname>
              <given-names>DL</given-names>
            </name>
            <name name-style="western">
              <surname>Bastarache</surname>
              <given-names>JA</given-names>
            </name>
            <name name-style="western">
              <surname>van der Heijden</surname>
              <given-names>YF</given-names>
            </name>
            <name name-style="western">
              <surname>Wright</surname>
              <given-names>JJ</given-names>
            </name>
            <name name-style="western">
              <surname>Ye</surname>
              <given-names>F</given-names>
            </name>
            <name name-style="western">
              <surname>Carter</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Alexander</surname>
              <given-names>MR</given-names>
            </name>
            <name name-style="western">
              <surname>Choe</surname>
              <given-names>JH</given-names>
            </name>
            <name name-style="western">
              <surname>Chastain</surname>
              <given-names>CA</given-names>
            </name>
            <name name-style="western">
              <surname>Zic</surname>
              <given-names>JA</given-names>
            </name>
            <name name-style="western">
              <surname>Horst</surname>
              <given-names>SN</given-names>
            </name>
            <name name-style="western">
              <surname>Turker</surname>
              <given-names>I</given-names>
            </name>
            <name name-style="western">
              <surname>Agarwal</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Osmundson</surname>
              <given-names>E</given-names>
            </name>
            <name name-style="western">
              <surname>Idrees</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Kiernan</surname>
              <given-names>CM</given-names>
            </name>
            <name name-style="western">
              <surname>Padmanabhan</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Bailey</surname>
              <given-names>CE</given-names>
            </name>
            <name name-style="western">
              <surname>Schlegel</surname>
              <given-names>CE</given-names>
            </name>
            <name name-style="western">
              <surname>Chambless</surname>
              <given-names>LB</given-names>
            </name>
            <name name-style="western">
              <surname>Gibson</surname>
              <given-names>MK</given-names>
            </name>
            <name name-style="western">
              <surname>Osterman</surname>
              <given-names>TJ</given-names>
            </name>
            <name name-style="western">
              <surname>Wheless</surname>
              <given-names>LE</given-names>
            </name>
            <name name-style="western">
              <surname>Johnson</surname>
              <given-names>DB</given-names>
            </name>
          </person-group>
          <article-title>Accuracy and reliability of chatbot responses to physician questions</article-title>
          <source>JAMA Netw Open</source>
          <year>2023</year>
          <volume>6</volume>
          <issue>10</issue>
          <fpage>e2336483</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://europepmc.org/abstract/MED/37782499"/>
          </comment>
          <pub-id pub-id-type="doi">10.1001/jamanetworkopen.2023.36483</pub-id>
          <pub-id pub-id-type="medline">37782499</pub-id>
          <pub-id pub-id-type="pii">2809975</pub-id>
          <pub-id pub-id-type="pmcid">PMC10546234</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref18">
        <label>18</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Decker</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Trang</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Ramirez</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Colley</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Pierce</surname>
              <given-names>L</given-names>
            </name>
            <name name-style="western">
              <surname>Coleman</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Bongiovanni</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Melton</surname>
              <given-names>GB</given-names>
            </name>
            <name name-style="western">
              <surname>Wick</surname>
              <given-names>E</given-names>
            </name>
          </person-group>
          <article-title>Large language model-based chatbot vs surgeon-generated informed consent documentation for common procedures</article-title>
          <source>JAMA Netw Open</source>
          <year>2023</year>
          <volume>6</volume>
          <issue>10</issue>
          <fpage>e2336997</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://europepmc.org/abstract/MED/37812419"/>
          </comment>
          <pub-id pub-id-type="doi">10.1001/jamanetworkopen.2023.36997</pub-id>
          <pub-id pub-id-type="medline">37812419</pub-id>
          <pub-id pub-id-type="pii">2810364</pub-id>
          <pub-id pub-id-type="pmcid">PMC10562939</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref19">
        <label>19</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Ayers</surname>
              <given-names>JW</given-names>
            </name>
            <name name-style="western">
              <surname>Poliak</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Dredze</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Leas</surname>
              <given-names>EC</given-names>
            </name>
            <name name-style="western">
              <surname>Zhu</surname>
              <given-names>Z</given-names>
            </name>
            <name name-style="western">
              <surname>Kelley</surname>
              <given-names>JB</given-names>
            </name>
            <name name-style="western">
              <surname>Faix</surname>
              <given-names>DJ</given-names>
            </name>
            <name name-style="western">
              <surname>Goodman</surname>
              <given-names>AM</given-names>
            </name>
            <name name-style="western">
              <surname>Longhurst</surname>
              <given-names>CA</given-names>
            </name>
            <name name-style="western">
              <surname>Hogarth</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Smith</surname>
              <given-names>DM</given-names>
            </name>
          </person-group>
          <article-title>Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum</article-title>
          <source>JAMA Intern Med</source>
          <year>2023</year>
          <volume>183</volume>
          <issue>6</issue>
          <fpage>589</fpage>
          <lpage>596</lpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://europepmc.org/abstract/MED/37115527"/>
          </comment>
          <pub-id pub-id-type="doi">10.1001/jamainternmed.2023.1838</pub-id>
          <pub-id pub-id-type="medline">37115527</pub-id>
          <pub-id pub-id-type="pii">2804309</pub-id>
          <pub-id pub-id-type="pmcid">PMC10148230</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref20">
        <label>20</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Kung</surname>
              <given-names>TH</given-names>
            </name>
            <name name-style="western">
              <surname>Cheatham</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Medenilla</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Sillos</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>De Leon</surname>
              <given-names>L</given-names>
            </name>
            <name name-style="western">
              <surname>Elepaño</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Madriaga</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Aggabao</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Diaz-Candido</surname>
              <given-names>G</given-names>
            </name>
            <name name-style="western">
              <surname>Maningo</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Tseng</surname>
              <given-names>V</given-names>
            </name>
          </person-group>
          <article-title>Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models</article-title>
          <source>PLOS Digit Health</source>
          <year>2023</year>
          <volume>2</volume>
          <issue>2</issue>
          <fpage>e0000198</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://europepmc.org/abstract/MED/36812645"/>
          </comment>
          <pub-id pub-id-type="doi">10.1371/journal.pdig.0000198</pub-id>
          <pub-id pub-id-type="medline">36812645</pub-id>
          <pub-id pub-id-type="pii">PDIG-D-22-00371</pub-id>
          <pub-id pub-id-type="pmcid">PMC9931230</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref21">
        <label>21</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Thirunavukarasu</surname>
              <given-names>AJ</given-names>
            </name>
            <name name-style="western">
              <surname>Hassan</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Mahmood</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Sanghera</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Barzangi</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>El Mukashfi</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Shah</surname>
              <given-names>S</given-names>
            </name>
          </person-group>
          <article-title>Trialling a large language model (ChatGPT) in general practice with the applied knowledge test: observational study demonstrating opportunities and limitations in primary care</article-title>
          <source>JMIR Med Educ</source>
          <year>2023</year>
          <volume>9</volume>
          <fpage>e46599</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://mededu.jmir.org/2023/1/e46599/"/>
          </comment>
          <pub-id pub-id-type="doi">10.2196/46599</pub-id>
          <pub-id pub-id-type="medline">37083633</pub-id>
          <pub-id pub-id-type="pii">v9i1e46599</pub-id>
          <pub-id pub-id-type="pmcid">PMC10163403</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref22">
        <label>22</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Cooper</surname>
              <given-names>AZ</given-names>
            </name>
            <name name-style="western">
              <surname>Rodman</surname>
              <given-names>A</given-names>
            </name>
          </person-group>
          <article-title>AI and medical education - A 21st-century Pandora's box</article-title>
          <source>N Engl J Med</source>
          <year>2023</year>
          <volume>389</volume>
          <issue>5</issue>
          <fpage>385</fpage>
          <lpage>387</lpage>
          <pub-id pub-id-type="doi">10.1056/NEJMp2304993</pub-id>
          <pub-id pub-id-type="medline">37522417</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref23">
        <label>23</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Khader</surname>
              <given-names>F</given-names>
            </name>
            <name name-style="western">
              <surname>Müller-Franzes</surname>
              <given-names>G</given-names>
            </name>
            <name name-style="western">
              <surname>Wang</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Han</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Tayebi Arasteh</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Haarburger</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Stegmaier</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Bressem</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Kuhl</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Nebelung</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Kather</surname>
              <given-names>JN</given-names>
            </name>
            <name name-style="western">
              <surname>Truhn</surname>
              <given-names>D</given-names>
            </name>
          </person-group>
          <article-title>Multimodal deep learning for integrating chest radiographs and clinical parameters: a case for transformers</article-title>
          <source>Radiology</source>
          <year>2023</year>
          <volume>309</volume>
          <issue>1</issue>
          <fpage>e230806</fpage>
          <pub-id pub-id-type="doi">10.1148/radiol.230806</pub-id>
          <pub-id pub-id-type="medline">37787671</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref24">
        <label>24</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Zhang</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Xu</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Usuyama</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Xu</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Bagga</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Tinn</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Preston</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Rao</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Wei</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Valluri</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Wong</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Tupini</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Wang</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Mazzola</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Shukla</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Liden</surname>
              <given-names>L</given-names>
            </name>
            <name name-style="western">
              <surname>Gao</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Lungren</surname>
              <given-names>MP</given-names>
            </name>
            <name name-style="western">
              <surname>Naumann</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Wang</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Poon</surname>
              <given-names>H</given-names>
            </name>
          </person-group>
          <article-title>Large-scale domain-specific pretraining for biomedical vision-language processing</article-title>
          <source>ArXiv. 2023;abs/2303.00915</source>
          <year>2023</year>
          <pub-id pub-id-type="doi">10.48550/arXiv.2303.00915</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref25">
        <label>25</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Tu</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Azizi</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Driess</surname>
              <given-names>D</given-names>
            </name>
            <name name-style="western">
              <surname>Schaekermann</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Amin</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Chang</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Carroll</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Lau</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Tanno</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Ktena</surname>
              <given-names>I</given-names>
            </name>
            <name name-style="western">
              <surname>Palepu</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Mustafa</surname>
              <given-names>B</given-names>
            </name>
            <name name-style="western">
              <surname>Chowdhery</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Liu</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Kornblith</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Fleet</surname>
              <given-names>D</given-names>
            </name>
            <name name-style="western">
              <surname>Mansfield</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Prakash</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Wong</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Virmani</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Semturs</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Mahdavi</surname>
              <given-names>SS</given-names>
            </name>
            <name name-style="western">
              <surname>Green</surname>
              <given-names>B</given-names>
            </name>
            <name name-style="western">
              <surname>Dominowska</surname>
              <given-names>E</given-names>
            </name>
            <name name-style="western">
              <surname>Arcas</surname>
              <given-names>BAY</given-names>
            </name>
            <name name-style="western">
              <surname>Barral</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Webster</surname>
              <given-names>D</given-names>
            </name>
            <name name-style="western">
              <surname>Corrado</surname>
              <given-names>GS</given-names>
            </name>
            <name name-style="western">
              <surname>Matias</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Singhal</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Florence</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Karthikesalingam</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Natarajan</surname>
              <given-names>V</given-names>
            </name>
          </person-group>
          <article-title>Towards generalist biomedical AI</article-title>
          <source>NEJM AI</source>
          <year>2024</year>
          <pub-id pub-id-type="doi">10.1056/aioa2300138</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref26">
        <label>26</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Cao</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Xu</surname>
              <given-names>X</given-names>
            </name>
            <name name-style="western">
              <surname>Sun</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Huang</surname>
              <given-names>X</given-names>
            </name>
            <name name-style="western">
              <surname>Shen</surname>
              <given-names>W</given-names>
            </name>
          </person-group>
          <article-title>Towards generic anomaly detection and understanding: large-scale visual-linguistic model (GPT-4V) takes the lead</article-title>
          <source>ArXiv. 2023;abs/2311.02782</source>
          <year>2023</year>
          <pub-id pub-id-type="doi">10.48550/arXiv.2311.02782</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref27">
        <label>27</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Yang</surname>
              <given-names>Z</given-names>
            </name>
            <name name-style="western">
              <surname>Li</surname>
              <given-names>L</given-names>
            </name>
            <name name-style="western">
              <surname>Lin</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Wang</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Lin</surname>
              <given-names>CC</given-names>
            </name>
            <name name-style="western">
              <surname>Liu</surname>
              <given-names>Z</given-names>
            </name>
            <name name-style="western">
              <surname>Wang</surname>
              <given-names>L</given-names>
            </name>
          </person-group>
          <article-title>The dawn of LMMs: preliminary explorations with GPT-4V(ision)</article-title>
          <source>ArXiv. 2023;abs/2309.17421</source>
          <year>2023</year>
          <pub-id pub-id-type="doi">10.48550/arXiv.2309.17421</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref28">
        <label>28</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Jin</surname>
              <given-names>Q</given-names>
            </name>
            <name name-style="western">
              <surname>Chen</surname>
              <given-names>F</given-names>
            </name>
            <name name-style="western">
              <surname>Zhou</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Xu</surname>
              <given-names>Z</given-names>
            </name>
            <name name-style="western">
              <surname>Cheung</surname>
              <given-names>JM</given-names>
            </name>
            <name name-style="western">
              <surname>Chen</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Summers</surname>
              <given-names>RM</given-names>
            </name>
            <name name-style="western">
              <surname>Rousseau</surname>
              <given-names>JF</given-names>
            </name>
            <name name-style="western">
              <surname>Ni</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Landsman</surname>
              <given-names>MJ</given-names>
            </name>
            <name name-style="western">
              <surname>Baxter</surname>
              <given-names>SL</given-names>
            </name>
            <name name-style="western">
              <surname>Al'Aref</surname>
              <given-names>SJ</given-names>
            </name>
            <name name-style="western">
              <surname>Li</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Chen</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Brejt</surname>
              <given-names>JA</given-names>
            </name>
            <name name-style="western">
              <surname>Chiang</surname>
              <given-names>MF</given-names>
            </name>
            <name name-style="western">
              <surname>Peng</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Lu</surname>
              <given-names>Z</given-names>
            </name>
          </person-group>
          <article-title>Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine</article-title>
          <source>NPJ Digit Med</source>
          <year>2024</year>
          <volume>7</volume>
          <issue>1</issue>
          <fpage>190</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://doi.org/10.1038/s41746-024-01185-7"/>
          </comment>
          <pub-id pub-id-type="doi">10.1038/s41746-024-01185-7</pub-id>
          <pub-id pub-id-type="medline">39043988</pub-id>
          <pub-id pub-id-type="pii">10.1038/s41746-024-01185-7</pub-id>
          <pub-id pub-id-type="pmcid">PMC11266508</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref29">
        <label>29</label>
        <nlm-citation citation-type="web">
          <person-group person-group-type="author">
            <collab>USMLE</collab>
          </person-group>
          <source>Sample Test Questions Step 1</source>
          <year>2023</year>
          <access-date>2024-12-17</access-date>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://www.usmle.org/sites/default/files/2021-10/Step_1_Sample_Items.pdf">https://www.usmle.org/sites/default/files/2021-10/Step_1_Sample_Items.pdf</ext-link>
          </comment>
        </nlm-citation>
      </ref>
      <ref id="ref30">
        <label>30</label>
        <nlm-citation citation-type="web">
          <person-group person-group-type="author">
            <collab>AMBOSS</collab>
          </person-group>
          <source>AMBOSS Question difficulty 10/15/2023</source>
          <access-date>2021-06-09</access-date>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://support.amboss.com/hc/en-us/articles/360035679652-Question-difficulty">https://support.amboss.com/hc/en-us/articles/360035679652-Question-difficulty</ext-link>
          </comment>
        </nlm-citation>
      </ref>
      <ref id="ref31">
        <label>31</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Brown</surname>
              <given-names>TB</given-names>
            </name>
            <name name-style="western">
              <surname>Mann</surname>
              <given-names>B</given-names>
            </name>
            <name name-style="western">
              <surname>Ryder</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Subbiah</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Kaplan</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Dhariwal</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Neelakantan</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Shyam</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Sastry</surname>
              <given-names>G</given-names>
            </name>
            <name name-style="western">
              <surname>Askell</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Agarwal</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Herbert-Voss</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Krueger</surname>
              <given-names>G</given-names>
            </name>
            <name name-style="western">
              <surname>Henighan</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Child</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Ramesh</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Ziegler</surname>
              <given-names>DM</given-names>
            </name>
            <name name-style="western">
              <surname>Wu</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Winter</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Hesse</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Chen</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Sigler</surname>
              <given-names>E</given-names>
            </name>
            <name name-style="western">
              <surname>Litwin</surname>
              <given-names>M</given-names>
            </name>
          </person-group>
          <article-title>Language models are few-shot learners</article-title>
          <source>arXiv:2005.14165</source>
          <year>2020</year>
          <pub-id pub-id-type="doi">10.5860/choice.189890</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref32">
        <label>32</label>
        <nlm-citation citation-type="web">
          <person-group person-group-type="author">
            <collab>American Board of Radiology</collab>
          </person-group>
          <source>Qualifying (Core) Exam</source>
          <access-date>2024-05-20</access-date>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://www.theabr.org/diagnostic-radiology/initial-certification/core-exam">https://www.theabr.org/diagnostic-radiology/initial-certification/core-exam</ext-link>
          </comment>
        </nlm-citation>
      </ref>
      <ref id="ref33">
        <label>33</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Pallais</surname>
              <given-names>JC</given-names>
            </name>
            <name name-style="western">
              <surname>Fenves</surname>
              <given-names>AZ</given-names>
            </name>
            <name name-style="western">
              <surname>Lu</surname>
              <given-names>MT</given-names>
            </name>
            <name name-style="western">
              <surname>Glomski</surname>
              <given-names>K</given-names>
            </name>
          </person-group>
          <article-title>Case 18-2018: A 45-year-old woman with hypertension, fatigue, and altered mental status</article-title>
          <source>N Engl J Med</source>
          <year>2018</year>
          <volume>378</volume>
          <issue>24</issue>
          <fpage>2322</fpage>
          <lpage>2333</lpage>
          <pub-id pub-id-type="doi">10.1056/nejmcpc1802825</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref34">
        <label>34</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Lee</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Bubeck</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Petro</surname>
              <given-names>J</given-names>
            </name>
          </person-group>
          <article-title>Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine</article-title>
          <source>N Engl J Med</source>
          <year>2023</year>
          <volume>388</volume>
          <issue>13</issue>
          <fpage>1233</fpage>
          <lpage>1239</lpage>
          <pub-id pub-id-type="doi">10.1056/nejmsr2214184</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref35">
        <label>35</label>
        <nlm-citation citation-type="web">
          <person-group person-group-type="author">
            <collab>OPENAI</collab>
          </person-group>
          <source>GPT-4V(ision) System Card</source>
          <access-date>2023-09-25</access-date>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://api.semanticscholar.org/CorpusID:263218031">https://api.semanticscholar.org/CorpusID:263218031</ext-link>
          </comment>
        </nlm-citation>
      </ref>
      <ref id="ref36">
        <label>36</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Yu</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Chang</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Zhang</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Jaakkola</surname>
              <given-names>T</given-names>
            </name>
          </person-group>
          <article-title>Rethinking cooperative rationalization: introspective extraction and complement control</article-title>
          <source>arXiv:1910.13294</source>
          <year>2019</year>
          <pub-id pub-id-type="doi">10.18653/v1/d19-1420</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref37">
        <label>37</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Zaidan</surname>
              <given-names>O</given-names>
            </name>
            <name name-style="western">
              <surname>Eisner</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Piatko</surname>
              <given-names>C</given-names>
            </name>
          </person-group>
          <article-title>Using “Annotator Rationales” to improve machine learning for text categorization</article-title>
          <year>2007</year>
          <conf-name>The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference</conf-name>
          <conf-date>2007 April 22-27</conf-date>
          <conf-loc>Rochester, New York</conf-loc>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://aclanthology.org/N07-1033"/>
          </comment>
          <pub-id pub-id-type="doi">10.3115/1614164.1614179</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref38">
        <label>38</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Ji</surname>
              <given-names>Z</given-names>
            </name>
            <name name-style="western">
              <surname>Lee</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Frieske</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Yu</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Su</surname>
              <given-names>D</given-names>
            </name>
            <name name-style="western">
              <surname>Xu</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Ishii</surname>
              <given-names>E</given-names>
            </name>
            <name name-style="western">
              <surname>Bang</surname>
              <given-names>YJ</given-names>
            </name>
            <name name-style="western">
              <surname>Madotto</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Fung</surname>
              <given-names>P</given-names>
            </name>
          </person-group>
          <article-title>Survey of hallucination in natural language generation</article-title>
          <source>ACM Comput. Surv</source>
          <year>2023</year>
          <volume>55</volume>
          <issue>12</issue>
          <fpage>1</fpage>
          <lpage>38</lpage>
          <pub-id pub-id-type="doi">10.1145/3571730</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref39">
        <label>39</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Gu</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Tinn</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Cheng</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Lucas</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Usuyama</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Liu</surname>
              <given-names>X</given-names>
            </name>
            <name name-style="western">
              <surname>Naumann</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Gao</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Poon</surname>
              <given-names>H</given-names>
            </name>
          </person-group>
          <article-title>Domain-Specific language model pretraining for biomedical natural language processing</article-title>
          <source>ACM Trans. Comput. Healthcare</source>
          <year>2021</year>
          <volume>3</volume>
          <issue>1</issue>
          <fpage>1</fpage>
          <lpage>23</lpage>
          <pub-id pub-id-type="doi">10.1145/3458754</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref40">
        <label>40</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Wang</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Yang</surname>
              <given-names>Z</given-names>
            </name>
            <name name-style="western">
              <surname>Yao</surname>
              <given-names>Z</given-names>
            </name>
            <name name-style="western">
              <surname>Yu</surname>
              <given-names>H</given-names>
            </name>
          </person-group>
          <article-title>JMLR: Joint Medical LLM and Retrieval Training for Enhancing Reasoning and Professional Question Answering Capability</article-title>
          <source>ArXiv. 2024;abs/2402.17887</source>
          <year>2024</year>
          <pub-id pub-id-type="doi">10.48550/arXiv.2402.17887</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref41">
        <label>41</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Singhal</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Azizi</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Tu</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Mahdavi</surname>
              <given-names>SS</given-names>
            </name>
            <name name-style="western">
              <surname>Wei</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Chung</surname>
              <given-names>HW</given-names>
            </name>
            <name name-style="western">
              <surname>Scales</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Tanwani</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Cole-Lewis</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Pfohl</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Payne</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Seneviratne</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Gamble</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Kelly</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Babiker</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Schärli</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Chowdhery</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Mansfield</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Demner-Fushman</surname>
              <given-names>D</given-names>
            </name>
            <name name-style="western">
              <surname>Agüera Y Arcas</surname>
              <given-names>B</given-names>
            </name>
            <name name-style="western">
              <surname>Webster</surname>
              <given-names>D</given-names>
            </name>
            <name name-style="western">
              <surname>Corrado</surname>
              <given-names>GS</given-names>
            </name>
            <name name-style="western">
              <surname>Matias</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Chou</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Gottweis</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Tomasev</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Liu</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Rajkomar</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Barral</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Semturs</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Karthikesalingam</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Natarajan</surname>
              <given-names>V</given-names>
            </name>
          </person-group>
          <article-title>Large language models encode clinical knowledge</article-title>
          <source>Nature</source>
          <year>2023</year>
          <volume>620</volume>
          <issue>7972</issue>
          <fpage>172</fpage>
          <lpage>180</lpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://europepmc.org/abstract/MED/37438534"/>
          </comment>
          <pub-id pub-id-type="doi">10.1038/s41586-023-06291-2</pub-id>
          <pub-id pub-id-type="medline">37438534</pub-id>
          <pub-id pub-id-type="pii">10.1038/s41586-023-06291-2</pub-id>
          <pub-id pub-id-type="pmcid">PMC10396962</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref42">
        <label>42</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Bhayana</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Krishna</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Bleakney</surname>
              <given-names>RR</given-names>
            </name>
          </person-group>
          <article-title>Performance of ChatGPT on a radiology board-style examination: insights into current strengths and limitations</article-title>
          <source>Radiology</source>
          <year>2023</year>
          <volume>307</volume>
          <issue>5</issue>
          <fpage>e230582</fpage>
          <pub-id pub-id-type="doi">10.1148/radiol.230582</pub-id>
          <pub-id pub-id-type="medline">37191485</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref43">
        <label>43</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Gilson</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Safranek</surname>
              <given-names>CW</given-names>
            </name>
            <name name-style="western">
              <surname>Huang</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Socrates</surname>
              <given-names>V</given-names>
            </name>
            <name name-style="western">
              <surname>Chi</surname>
              <given-names>L</given-names>
            </name>
            <name name-style="western">
              <surname>Taylor</surname>
              <given-names>RA</given-names>
            </name>
            <name name-style="western">
              <surname>Chartash</surname>
              <given-names>D</given-names>
            </name>
          </person-group>
          <article-title>How does ChatGPT perform on the United States medical licensing examination (USMLE)? the implications of large language models for medical education and knowledge assessment</article-title>
          <source>JMIR Med Educ</source>
          <year>2023</year>
          <volume>9</volume>
          <fpage>e45312</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://mededu.jmir.org/2023//e45312/"/>
          </comment>
          <pub-id pub-id-type="doi">10.2196/45312</pub-id>
          <pub-id pub-id-type="medline">36753318</pub-id>
          <pub-id pub-id-type="pii">v9i1e45312</pub-id>
          <pub-id pub-id-type="pmcid">PMC9947764</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref44">
        <label>44</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Brin</surname>
              <given-names>D</given-names>
            </name>
            <name name-style="western">
              <surname>Sorin</surname>
              <given-names>V</given-names>
            </name>
            <name name-style="western">
              <surname>Vaid</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Soroush</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Glicksberg</surname>
              <given-names>BS</given-names>
            </name>
            <name name-style="western">
              <surname>Charney</surname>
              <given-names>AW</given-names>
            </name>
            <name name-style="western">
              <surname>Nadkarni</surname>
              <given-names>G</given-names>
            </name>
            <name name-style="western">
              <surname>Klang</surname>
              <given-names>E</given-names>
            </name>
          </person-group>
          <article-title>Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments</article-title>
          <source>Sci Rep</source>
          <year>2023</year>
          <volume>13</volume>
          <issue>1</issue>
          <fpage>16492</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://doi.org/10.1038/s41598-023-43436-9"/>
          </comment>
          <pub-id pub-id-type="doi">10.1038/s41598-023-43436-9</pub-id>
          <pub-id pub-id-type="medline">37779171</pub-id>
          <pub-id pub-id-type="pii">10.1038/s41598-023-43436-9</pub-id>
          <pub-id pub-id-type="pmcid">PMC10543445</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref45">
        <label>45</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Nakao</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Miki</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Nakamura</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Kikuchi</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Nomura</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Hanaoka</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Yoshikawa</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Abe</surname>
              <given-names>O</given-names>
            </name>
          </person-group>
          <article-title>Capability of GPT-4V(ision) in the Japanese national medical licensing examination: evaluation study</article-title>
          <source>JMIR Med Educ</source>
          <year>2024</year>
          <volume>10</volume>
          <fpage>e54393</fpage>
          <pub-id pub-id-type="doi">10.2196/54393</pub-id>
          <pub-id pub-id-type="medline">38470459</pub-id>
          <pub-id pub-id-type="pii">v10i1e54393</pub-id>
          <pub-id pub-id-type="pmcid">PMC10966435</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref46">
        <label>46</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Takagi</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Koda</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Watari</surname>
              <given-names>T</given-names>
            </name>
          </person-group>
          <article-title>The performance of ChatGPT-4V in interpreting images and tables in the Japanese medical licensing exam</article-title>
          <source>JMIR Med Educ</source>
          <year>2024</year>
          <volume>10</volume>
          <fpage>e54283</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://mededu.jmir.org/2024//e54283/"/>
          </comment>
          <pub-id pub-id-type="doi">10.2196/54283</pub-id>
          <pub-id pub-id-type="medline">38787024</pub-id>
          <pub-id pub-id-type="pii">v10i1e54283</pub-id>
          <pub-id pub-id-type="pmcid">PMC11148840</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref47">
        <label>47</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Zhou</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Ong</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Kennedy</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Wu</surname>
              <given-names>CC</given-names>
            </name>
            <name name-style="western">
              <surname>Kazam</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Hentel</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Flanders</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Shih</surname>
              <given-names>G</given-names>
            </name>
            <name name-style="western">
              <surname>Peng</surname>
              <given-names>Y</given-names>
            </name>
          </person-group>
          <article-title>Evaluating GPT-4V (GPT-4 with Vision) on detection of radiologic findings on chest radiographs</article-title>
          <source>Radiology</source>
          <year>2024</year>
          <volume>311</volume>
          <issue>2</issue>
          <fpage>e233270</fpage>
          <pub-id pub-id-type="doi">10.1148/radiol.233270</pub-id>
          <pub-id pub-id-type="medline">38713028</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref48">
        <label>48</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Brin</surname>
              <given-names>D</given-names>
            </name>
            <name name-style="western">
              <surname>Sorin</surname>
              <given-names>V</given-names>
            </name>
            <name name-style="western">
              <surname>Barash</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Konen</surname>
              <given-names>E</given-names>
            </name>
            <name name-style="western">
              <surname>Glicksberg</surname>
              <given-names>BS</given-names>
            </name>
            <name name-style="western">
              <surname>Nadkarni</surname>
              <given-names>GN</given-names>
            </name>
            <name name-style="western">
              <surname>Klang</surname>
              <given-names>E</given-names>
            </name>
          </person-group>
          <article-title>Assessing GPT-4 multimodal performance in radiological image analysis</article-title>
          <source>Eur Radiol</source>
          <year>2024</year>
          <pub-id pub-id-type="doi">10.1007/s00330-024-11035-5</pub-id>
          <pub-id pub-id-type="medline">39214893</pub-id>
          <pub-id pub-id-type="pii">10.1007/s00330-024-11035-5</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref49">
        <label>49</label>
        <nlm-citation citation-type="web">
          <person-group person-group-type="author">
            <collab>The Federation of State Medical Boards (FSMB) and the National Board of Medical Examiners® (NBME®)</collab>
          </person-group>
          <source>Step 3 - United States Medical Licensing Examination</source>
          <access-date>2024-12-17</access-date>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://www.usmle.org/step-exams/step-3">https://www.usmle.org/step-exams/step-3</ext-link>
          </comment>
        </nlm-citation>
      </ref>
      <ref id="ref50">
        <label>50</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Elkassem</surname>
              <given-names>AA</given-names>
            </name>
            <name name-style="western">
              <surname>Smith</surname>
              <given-names>AD</given-names>
            </name>
          </person-group>
          <article-title>Potential use cases for ChatGPT in radiology reporting</article-title>
          <source>AJR Am J Roentgenol</source>
          <year>2023</year>
          <volume>221</volume>
          <issue>3</issue>
          <fpage>373</fpage>
          <lpage>376</lpage>
          <pub-id pub-id-type="doi">10.2214/AJR.23.29198</pub-id>
          <pub-id pub-id-type="medline">37095665</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref51">
        <label>51</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Hirosawa</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Harada</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Yokose</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Sakamoto</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Kawamura</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Shimizu</surname>
              <given-names>T</given-names>
            </name>
          </person-group>
          <article-title>Diagnostic accuracy of differential-diagnosis lists generated by generative pretrained transformer 3 chatbot for clinical vignettes with common chief complaints: a pilot study</article-title>
          <source>Int J Environ Res Public Health</source>
          <year>2023</year>
          <volume>20</volume>
          <issue>4</issue>
          <fpage>3378</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://www.mdpi.com/resolver?pii=ijerph20043378"/>
          </comment>
          <pub-id pub-id-type="doi">10.3390/ijerph20043378</pub-id>
          <pub-id pub-id-type="medline">36834073</pub-id>
          <pub-id pub-id-type="pii">ijerph20043378</pub-id>
          <pub-id pub-id-type="pmcid">PMC9967747</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref52">
        <label>52</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Shea</surname>
              <given-names>YF</given-names>
            </name>
            <name name-style="western">
              <surname>Lee</surname>
              <given-names>CMY</given-names>
            </name>
            <name name-style="western">
              <surname>Ip</surname>
              <given-names>WCT</given-names>
            </name>
            <name name-style="western">
              <surname>Luk</surname>
              <given-names>DWA</given-names>
            </name>
            <name name-style="western">
              <surname>Wong</surname>
              <given-names>SSW</given-names>
            </name>
          </person-group>
          <article-title>Use of GPT-4 to analyze medical records of patients with extensive investigations and delayed diagnosis</article-title>
          <source>JAMA Netw Open</source>
          <year>2023</year>
          <volume>6</volume>
          <issue>8</issue>
          <fpage>e2325000</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://europepmc.org/abstract/MED/37578798"/>
          </comment>
          <pub-id pub-id-type="doi">10.1001/jamanetworkopen.2023.25000</pub-id>
          <pub-id pub-id-type="medline">37578798</pub-id>
          <pub-id pub-id-type="pii">2808251</pub-id>
          <pub-id pub-id-type="pmcid">PMC10425828</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref53">
        <label>53</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Lalor</surname>
              <given-names>JP</given-names>
            </name>
            <name name-style="western">
              <surname>Wu</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Yu</surname>
              <given-names>H</given-names>
            </name>
          </person-group>
          <article-title>Learning latent parameters without human response patterns: item response theory with artificial crowds</article-title>
          <source>arXiv:1908.11421</source>
          <year>2019</year>
          <pub-id pub-id-type="doi">10.18653/v1/d19-1434</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref54">
        <label>54</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Liberati</surname>
              <given-names>EG</given-names>
            </name>
            <name name-style="western">
              <surname>Ruggiero</surname>
              <given-names>F</given-names>
            </name>
            <name name-style="western">
              <surname>Galuppo</surname>
              <given-names>L</given-names>
            </name>
            <name name-style="western">
              <surname>Gorli</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>González-Lorenzo</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Maraldi</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Ruggieri</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Polo Friz</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Scaratti</surname>
              <given-names>G</given-names>
            </name>
            <name name-style="western">
              <surname>Kwag</surname>
              <given-names>KH</given-names>
            </name>
            <name name-style="western">
              <surname>Vespignani</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Moja</surname>
              <given-names>L</given-names>
            </name>
          </person-group>
          <article-title>What hinders the uptake of computerized decision support systems in hospitals? A qualitative study and framework for implementation</article-title>
          <source>Implement Sci</source>
          <year>2017</year>
          <volume>12</volume>
          <issue>1</issue>
          <fpage>113</fpage>
          <pub-id pub-id-type="doi">10.1186/s13012-017-0644-2</pub-id>
          <pub-id pub-id-type="medline">28915822</pub-id>
          <pub-id pub-id-type="pii">10.1186/s13012-017-0644-2</pub-id>
          <pub-id pub-id-type="pmcid">PMC5602839</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref55">
        <label>55</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Strohm</surname>
              <given-names>L</given-names>
            </name>
            <name name-style="western">
              <surname>Hehakaya</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Ranschaert</surname>
              <given-names>ER</given-names>
            </name>
            <name name-style="western">
              <surname>Boon</surname>
              <given-names>WPC</given-names>
            </name>
            <name name-style="western">
              <surname>Moors</surname>
              <given-names>EHM</given-names>
            </name>
          </person-group>
          <article-title>Implementation of artificial intelligence (AI) applications in radiology: hindering and facilitating factors</article-title>
          <source>Eur Radiol</source>
          <year>2020</year>
          <volume>30</volume>
          <issue>10</issue>
          <fpage>5525</fpage>
          <lpage>5532</lpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://europepmc.org/abstract/MED/32458173"/>
          </comment>
          <pub-id pub-id-type="doi">10.1007/s00330-020-06946-y</pub-id>
          <pub-id pub-id-type="medline">32458173</pub-id>
          <pub-id pub-id-type="pii">10.1007/s00330-020-06946-y</pub-id>
          <pub-id pub-id-type="pmcid">PMC7476917</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref56">
        <label>56</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Van Cauwenberge</surname>
              <given-names>D</given-names>
            </name>
            <name name-style="western">
              <surname>Van Biesen</surname>
              <given-names>W</given-names>
            </name>
            <name name-style="western">
              <surname>Decruyenaere</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Leune</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Sterckx</surname>
              <given-names>S</given-names>
            </name>
          </person-group>
          <article-title>"Many roads lead to Rome and the artificial Intelligence only shows me one road": an interview study on physician attitudes regarding the implementation of computerised clinical decision support systems</article-title>
          <source>BMC Med Ethics</source>
          <year>2022</year>
          <volume>23</volume>
          <issue>1</issue>
          <fpage>50</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://bmcmedethics.biomedcentral.com/articles/10.1186/s12910-022-00787-8"/>
          </comment>
          <pub-id pub-id-type="doi">10.1186/s12910-022-00787-8</pub-id>
          <pub-id pub-id-type="medline">35524301</pub-id>
          <pub-id pub-id-type="pii">10.1186/s12910-022-00787-8</pub-id>
          <pub-id pub-id-type="pmcid">PMC9077861</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref57">
        <label>57</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Jones</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Thornton</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Wyatt</surname>
              <given-names>JC</given-names>
            </name>
          </person-group>
          <article-title>Artificial intelligence and clinical decision support: clinicians' perspectives on trust, trustworthiness, and liability</article-title>
          <source>Med Law Rev</source>
          <year>2023</year>
          <volume>31</volume>
          <issue>4</issue>
          <fpage>501</fpage>
          <lpage>520</lpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://europepmc.org/abstract/MED/37218368"/>
          </comment>
          <pub-id pub-id-type="doi">10.1093/medlaw/fwad013</pub-id>
          <pub-id pub-id-type="medline">37218368</pub-id>
          <pub-id pub-id-type="pii">7176027</pub-id>
          <pub-id pub-id-type="pmcid">PMC10681355</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref58">
        <label>58</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Patel</surname>
              <given-names>UB</given-names>
            </name>
            <name name-style="western">
              <surname>Ward</surname>
              <given-names>TJ</given-names>
            </name>
            <name name-style="western">
              <surname>Kadoch</surname>
              <given-names>MA</given-names>
            </name>
            <name name-style="western">
              <surname>Cham</surname>
              <given-names>MD</given-names>
            </name>
          </person-group>
          <article-title>Radiographic features of pulmonary embolism: Hampton's hump</article-title>
          <source>Postgrad Med J</source>
          <year>2014</year>
          <volume>90</volume>
          <issue>1065</issue>
          <fpage>420</fpage>
          <lpage>421</lpage>
          <pub-id pub-id-type="doi">10.1136/postgradmedj-2013-132097</pub-id>
          <pub-id pub-id-type="medline">24894313</pub-id>
          <pub-id pub-id-type="pii">postgradmedj-2013-132097</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref59">
        <label>59</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Chen</surname>
              <given-names>Z</given-names>
            </name>
            <name name-style="western">
              <surname>Cano</surname>
              <given-names>AH</given-names>
            </name>
            <name name-style="western">
              <surname>Romanou</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Bonnet</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Matoba</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Salvi</surname>
              <given-names>F</given-names>
            </name>
            <name name-style="western">
              <surname>Pagliardini</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Fan</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Köpf</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Mohtashami</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Sallinen</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Sakhaeirad</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Swamy</surname>
              <given-names>V</given-names>
            </name>
            <name name-style="western">
              <surname>Krawczuk</surname>
              <given-names>I</given-names>
            </name>
            <name name-style="western">
              <surname>Bayazit</surname>
              <given-names>D</given-names>
            </name>
            <name name-style="western">
              <surname>Marmet</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Montariol</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Hartley</surname>
              <given-names>MA</given-names>
            </name>
            <name name-style="western">
              <surname>Jaggi</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Bosselut</surname>
              <given-names>A</given-names>
            </name>
          </person-group>
          <article-title>MEDITRON-70B: scaling medical pretraining for large language models</article-title>
          <source>arXiv:2311.16079</source>
          <year>2023</year>
          <pub-id pub-id-type="doi">10.48550/arXiv.2311.16079</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref60">
        <label>60</label>
        <nlm-citation citation-type="web">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Kim</surname>
              <given-names>SH</given-names>
            </name>
            <name name-style="western">
              <surname>Schramm</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Berberich</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Rosenkranz</surname>
              <given-names>E</given-names>
            </name>
            <name name-style="western">
              <surname>Schmitzer</surname>
              <given-names>L</given-names>
            </name>
            <name name-style="western">
              <surname>Serguen</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Klenk</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Lenhart</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Zimmer</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Wiestler</surname>
              <given-names>B</given-names>
            </name>
            <name name-style="western">
              <surname>Hedderich</surname>
              <given-names>DM</given-names>
            </name>
          </person-group>
          <source>Human-AI collaboration in large language model-assisted brain MRI differential diagnosis: a usability study</source>
          <year>2024</year>
          <access-date>2024-02-06</access-date>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://www.medrxiv.org/content/10.1101/2024.02.05.24302099v1">https://www.medrxiv.org/content/10.1101/2024.02.05.24302099v1</ext-link>
          </comment>
        </nlm-citation>
      </ref>
      <ref id="ref61">
        <label>61</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Jiang</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Xu</surname>
              <given-names>YY</given-names>
            </name>
            <name name-style="western">
              <surname>Lu</surname>
              <given-names>X</given-names>
            </name>
          </person-group>
          <article-title>ChatGPT in radiology: evaluating proficiencies, addressing shortcomings, and proposing integrative approaches for the future</article-title>
          <source>Radiology</source>
          <year>2023</year>
          <volume>308</volume>
          <issue>1</issue>
          <fpage>e231335</fpage>
          <pub-id pub-id-type="doi">10.1148/radiol.231335</pub-id>
          <pub-id pub-id-type="medline">37432082</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref62">
        <label>62</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Sallam</surname>
              <given-names>M</given-names>
            </name>
          </person-group>
          <article-title>ChatGPT utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns</article-title>
          <source>Healthcare (Basel)</source>
          <year>2023</year>
          <volume>11</volume>
          <issue>6</issue>
          <fpage>887</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://www.mdpi.com/resolver?pii=healthcare11060887"/>
          </comment>
          <pub-id pub-id-type="doi">10.3390/healthcare11060887</pub-id>
          <pub-id pub-id-type="medline">36981544</pub-id>
          <pub-id pub-id-type="pii">healthcare11060887</pub-id>
          <pub-id pub-id-type="pmcid">PMC10048148</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref63">
        <label>63</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Gandhi</surname>
              <given-names>TK</given-names>
            </name>
            <name name-style="western">
              <surname>Kachalia</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Thomas</surname>
              <given-names>EJ</given-names>
            </name>
            <name name-style="western">
              <surname>Puopolo</surname>
              <given-names>AL</given-names>
            </name>
            <name name-style="western">
              <surname>Yoon</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Brennan</surname>
              <given-names>TA</given-names>
            </name>
            <name name-style="western">
              <surname>Studdert</surname>
              <given-names>DM</given-names>
            </name>
          </person-group>
          <article-title>Missed and delayed diagnoses in the ambulatory setting: a study of closed malpractice claims</article-title>
          <source>Ann Intern Med</source>
          <year>2006</year>
          <volume>145</volume>
          <issue>7</issue>
          <fpage>488</fpage>
          <lpage>496</lpage>
          <pub-id pub-id-type="doi">10.7326/0003-4819-145-7-200610030-00006</pub-id>
          <pub-id pub-id-type="medline">17015866</pub-id>
          <pub-id pub-id-type="pii">145/7/488</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref64">
        <label>64</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Berlin</surname>
              <given-names>L</given-names>
            </name>
          </person-group>
          <article-title>Radiologic errors, past, present and future</article-title>
          <source>Diagnosis (Berl)</source>
          <year>2014</year>
          <volume>1</volume>
          <issue>1</issue>
          <fpage>79</fpage>
          <lpage>84</lpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://www.degruyter.com/document/doi/10.1515/dx-2013-0012"/>
          </comment>
          <pub-id pub-id-type="doi">10.1515/dx-2013-0012</pub-id>
          <pub-id pub-id-type="medline">29539959</pub-id>
          <pub-id pub-id-type="pii">/j/dx.2014.1.issue-1/dx-2013-0012/dx-2013-0012.xml</pub-id>
        </nlm-citation>
      </ref>
    </ref-list>
  </back>
</article>
