Journal of Medical Internet Research (J Med Internet Res; JMIR), ISSN 1438-8871

JMIR Publications, Toronto, Canada

Article ID: v27i1e64486; PMID: 40305085; DOI: 10.2196/64486

Review

Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis

Xiaomeng

Lotfinia

Mahshad

Zhang

Xiao-Meng

Ling Wang, MD (1, 2); https://orcid.org/0000-0002-1970-0862

Jinglin Li, BD (2); https://orcid.org/0009-0004-4680-1548

Boyang Zhuang, MD (3); https://orcid.org/0009-0009-2162-7076

Shasha Huang, BD (4); https://orcid.org/0009-0003-3554-5828

Meilin Fang, BD (2); https://orcid.org/0009-0006-3499-9168

Cunze Wang, BD (2); https://orcid.org/0000-0002-7751-1242

Wen Li, BD (1); https://orcid.org/0009-0005-0542-7489

Mohan Zhang, BD (2); https://orcid.org/0009-0001-1786-0176

Shurong Gong, MD (5); https://orcid.org/0000-0003-1746-8198

1 Fuzhou University Affiliated Provincial Hospital, Shengli Clinical Medical College, Fujian Medical University, Fuzhou, China

2 School of Pharmacy, Fujian Medical University, Fuzhou, China

3 Fujian Center For Drug Evaluation and Monitoring, Fuzhou, China

4 School of Pharmacy, Fujian University of Traditional Chinese Medicine, Fuzhou, China

5 The Third Department of Critical Care Medicine, Fuzhou University Affiliated Provincial Hospital, Shengli Clinical Medical College, Fujian Medical University, Fuzhou, Fujian, China

Corresponding Author: Shurong Gong, The Third Department of Critical Care Medicine, Fuzhou University Affiliated Provincial Hospital, Shengli Clinical Medical College, Fujian Medical University, No.134 Dongjie Road, Fuzhou, Fujian, 350001, China; phone: 86 15060677447; email: shurong_gong@fjmu.edu.cn

2025; e64486; published 30 April 2025

Manuscript history dates: 18 July 2024, 14 October 2024, 4 February 2025, 3 April 2025

©Ling Wang, Jinglin Li, Boyang Zhuang, Shasha Huang, Meilin Fang, Cunze Wang, Wen Li, Mohan Zhang, Shurong Gong. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 30.04.2025.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.

Background

Large language models (LLMs) have flourished and gradually become an important research and application direction in the medical field. However, medicine is highly specialized, complex, and specific, and its accuracy requirements are therefore extremely high, so controversy remains about whether LLMs can be used in the medical field. A growing number of studies have evaluated the performance of various types of LLMs in medicine, but the conclusions are inconsistent.

Objective

This study uses a network meta-analysis (NMA) to assess the accuracy of LLMs when answering clinical research questions to provide high-level evidence for their future development and application in the medical field.

Methods

In this systematic review and NMA, we searched PubMed, Embase, Web of Science, and Scopus from inception until October 14, 2024. Studies on the accuracy of LLMs when answering clinical research questions were included and screened by reading published reports. The systematic review and NMA were conducted to compare the accuracy of different LLMs when answering clinical research questions, including objective questions, open-ended questions, top 1 diagnosis, top 3 diagnosis, top 5 diagnosis, and triage and classification. The NMA was performed within a Bayesian framework. Indirect comparisons between LLMs were performed using a grading scale. A larger surface under the cumulative ranking curve (SUCRA) value indicates a higher accuracy ranking for the corresponding LLM.

Results

The systematic review and NMA examined 168 articles encompassing 35,896 questions and 3063 clinical cases. Of the 168 studies, 40 (23.8%) were considered to have a low risk of bias, 128 (76.2%) had a moderate risk, and none were rated as having a high risk. ChatGPT-4o (SUCRA=0.9207) demonstrated strong performance in terms of accuracy for objective questions, followed by Aeyeconsult (SUCRA=0.9187) and ChatGPT-4 (SUCRA=0.8087). ChatGPT-4 (SUCRA=0.8708) excelled at answering open-ended questions. In terms of accuracy for top 1 diagnosis and top 3 diagnosis of clinical cases, human experts (SUCRA=0.9001 and SUCRA=0.7126, respectively) ranked the highest, while Claude 3 Opus (SUCRA=0.9672) performed well at the top 5 diagnosis. Gemini (SUCRA=0.9649) had the highest rated SUCRA value for accuracy in the area of triage and classification.

Conclusions

Our study indicates that ChatGPT-4o has an advantage when answering objective questions. For open-ended questions, ChatGPT-4 may be more credible. Humans are more accurate at the top 1 diagnosis and top 3 diagnosis. Claude 3 Opus performs better at the top 5 diagnosis, while Gemini has the advantage for triage and classification. This analysis offers valuable insights for clinicians and medical practitioners, helping them use LLMs effectively to improve decision-making in learning, diagnosis, and management across various clinical scenarios.

Trial Registration

PROSPERO CRD42024558245; https://www.crd.york.ac.uk/PROSPERO/view/CRD42024558245

large language models; LLM; clinical research questions; accuracy; network meta-analysis; PRISMA

Introduction

Recent research has demonstrated the considerable success of large language models (LLMs) in a multitude of natural language tasks, including automatic summarization (the generation of a condensed version of a passage of text), machine translation (the automatic translation of text from one language to another), and question-and-answer systems (the construction of a system to automatically answer questions based on a passage of text) [1]. In this context, with the development of big biomedical data and artificial intelligence, the emergence of flexible natural language processing models such as ChatGPT provides a number of new possibilities for health care and biomedical research and has the potential to be a turning point in the field [2-4].

Although LLMs have shown great potential in the medical field, medicine is a demanding field in which lives are at stake, and its complexity and specificity mean that any application must meet extremely high standards of accuracy. Controversy remains about whether LLMs can be applied to the medical field. Mu and He [5] reviewed the potential applications and challenges of ChatGPT in health care, noting that a lack of understanding of medical knowledge and specialized medical backgrounds hinders the ability of ChatGPT to delve into the complexity of medical concepts and terminology. Consequently, the capacity of ChatGPT to address specific medical queries, diagnose ailments, or furnish precise medical recommendations is restricted. Another study noted that the role of LLMs in health care may be limited by the presence of bias in training materials, their tendency to “hallucinate,” and ethical and legal considerations when LLMs provide inaccurate advice that leads to patient harm, as well as patient privacy issues [6].

Given the controversy over the application of LLMs in medicine and the continuous emergence and versioning of LLMs, more research has been devoted to evaluating the performance of various LLMs in medicine to provide stronger evidence. In addition to ChatGPT developed by OpenAI, the performance of many other LLMs, such as those from Microsoft (eg, Copilot [7]), Google (eg, Gemini [8]), and Meta (eg, LLaMA [9]), in the medical domain has also been compared. Many aspects of assessment have been included, such as medical exams [10], case text diagnosis [11], and disease classification or grading [12].

Unfortunately, the reported performance of individual LLMs varies across studies. For example, in a study by Vaishya et al [13] that explored the performance of ChatGPT-3.5, ChatGPT-4, and Google Bard when answering 120 multiple-choice questions, the results showed that Google Bard had 100% accuracy and was significantly more accurate than both ChatGPT-3.5 and ChatGPT-4 (P<.001). Another study showed that ChatGPT-4 was more accurate than Google Bard (83% vs 76%) [14]. At present, most related research is limited to a single type of LLM [15,16] or a specific domain [17,18], and there is no high-level evidence comparing the accuracy rankings of different LLMs when responding to clinical research questions.

Therefore, this study aimed to compare the accuracy of different LLMs when answering clinical research questions, including objective questions, open-ended questions, top 1 diagnosis, top 3 diagnosis, top 5 diagnosis, and triage and classification. This study aimed to provide high-level evidence-based support for future clinical applications, enabling clinical workers to better use LLMs to make more accurate and informed decisions for future learning, diagnosis, and different clinical scenarios.

Methods

Network Meta-Analysis

The network meta-analysis (NMA) was based on the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) reporting guidelines. The PRISMA checklist is shown in Multimedia Appendix 1. The Bayesian approach permits the indirect comparison of performance between LLMs that were not directly compared with each other in the included studies. The study protocol was defined and registered in the PROSPERO database prior to the commencement of the study.
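To illustrate what an indirect comparison means, the following minimal sketch uses the simple adjusted indirect comparison (Bucher) formula for two LLMs that were each compared with a common comparator. This is a frequentist simplification shown for intuition only, not the Bayesian model used in this study; the function name and all input values are hypothetical.

```python
import math

def indirect_log_or(log_or_ab, se_ab, log_or_cb, se_cb):
    """Indirect estimate of A vs C through a common comparator B.

    Inputs are log odds ratios (and standard errors) for A vs B and C vs B.
    Illustrative Bucher-style calculation, not the study's Bayesian NMA.
    """
    log_or_ac = log_or_ab - log_or_cb
    # Variances add because the two direct estimates come from independent studies
    se_ac = math.sqrt(se_ab ** 2 + se_cb ** 2)
    ci = (log_or_ac - 1.96 * se_ac, log_or_ac + 1.96 * se_ac)
    return log_or_ac, se_ac, ci

# Hypothetical direct estimates: LLM A vs B and LLM C vs B
estimate, se, (lower, upper) = indirect_log_or(0.80, 0.20, 0.30, 0.25)
print(f"indirect logOR A vs C = {estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

The indirect estimate is always less precise than either direct estimate, which is one reason sparse networks should be interpreted with caution.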

Search Strategy and Selection Criteria

A computer search of the PubMed, Embase, Web of Science, and Scopus databases was conducted to identify relevant studies on the accuracy of different LLMs when answering questions in the medical field. The search was last updated on October 14, 2024, to identify studies published since the first search, with no restrictions on the type of study. When the results of a study were reported in multiple publications, we included the publication with the richest and most recent findings. We also searched the reference lists of systematic reviews on LLMs in medicine and manually screened the references of the included reviews to identify additional relevant literature. The search subject terms were “LLM,” “generative AI,” “open AI,” “Large language model,” “ChatGPT-3.5,” “ChatGPT-4,” “Google Bard,” and “Bing,” without any language restriction. The complete search strategies for all databases are shown in Multimedia Appendix 2.

A combination of EndNote X9 deduplication and manual deduplication was used to screen the literature in accordance with the developed inclusion criteria. The results of the literature searches conducted in different databases were then combined to create a new information database, from which full texts could be downloaded. Independent review and assessment of the titles, abstracts, and full texts of the relevant literature were undertaken by 4 authors (LW, JL, BZ, and SH). The review encompassed studies using different LLM systems to respond to medical queries. Letters, conference abstracts, editorials, reviews, and expert opinions for which no information was available were excluded from the review. In addition, the following studies were excluded: those that evaluated the performance of only 1 LLM; those that assessed the performance of 2 or more LLMs without specifying the LLM versions used (eg, the article only mentioned evaluating ChatGPT without mentioning ChatGPT-3.5, ChatGPT-4, or other versions), with the updated versions and timelines of various LLMs so far shown in Multimedia Appendix 3; those that assessed the performance of 2 or more LLMs but did not provide data isolating their accuracy when answering different types of questions; and those in which the questions contained images. In addition, to reduce bias, we excluded studies that accessed LLMs through an application programming interface (API).

Assessment of Results

The primary outcome was the accuracy of LLMs when answering medical questions. This included accuracy for objective questions, open-ended questions, top 1 diagnosis, top 3 diagnosis, top 5 diagnosis, and triage and classification. Objective questions are exam questions with a clear, quantifiable answer that is usually predetermined and unique or chosen from a limited number of options. Open-ended questions have neither a fixed nor a standardized answer. Diagnosis and triage and classification questions are also open-ended, but most diagnostic questions end with “What is the most probable diagnosis?” whereas triage and classification questions end with “How would you classify this disease?” Corresponding examples are shown in Multimedia Appendix 4.

Accuracy for objective questions was calculated as the number of correctly answered questions divided by the total number of questions. For diagnosis and classification, accuracy was defined as the number of cases correctly diagnosed or triaged divided by the total number of cases. Specifically for open-ended questions, accuracy was determined based on the number of questions rated “good” or “accurate” on the accuracy scale divided by the total number of questions.
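The accuracy definitions above can be sketched as follows. The sketch assumes that "top k diagnosis" means the reference diagnosis appears among the first k entries of an LLM's ranked differential list, which is the usual reading but is not spelled out in the text; the function names and all example data are hypothetical.

```python
def top_k_accuracy(cases, k):
    """Share of cases whose reference diagnosis is in the first k ranked guesses.

    cases: list of (ranked_differential, reference_diagnosis) pairs.
    """
    if not cases:
        raise ValueError("at least one case is required")
    hits = sum(1 for ranked, truth in cases if truth in ranked[:k])
    return hits / len(cases)

def rated_accuracy(ratings, accepted=("good", "accurate")):
    """Share of open-ended answers rated "good" or "accurate" by the graders."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return sum(1 for r in ratings if r in accepted) / len(ratings)

# Hypothetical clinical cases: (LLM's ranked differential, reference diagnosis)
cases = [
    (["pneumonia", "bronchitis", "asthma"], "pneumonia"),
    (["migraine", "tension headache", "sinusitis"], "tension headache"),
    (["gastritis", "peptic ulcer", "pancreatitis", "cholecystitis"], "cholecystitis"),
    (["anemia", "hypothyroidism"], "depression"),
]
top1 = top_k_accuracy(cases, 1)
top3 = top_k_accuracy(cases, 3)
top5 = top_k_accuracy(cases, 5)

# Hypothetical grader ratings for four open-ended answers
open_ended = rated_accuracy(["good", "poor", "accurate", "fair"])
```

By construction, top 5 accuracy can never be lower than top 3 or top 1 accuracy for the same set of cases, which helps when sanity-checking extracted data.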

Data Extraction

The 4 researchers jointly extracted and verified the following data: (1) basic information about the included studies, such as study title and first author; (2) baseline characteristics and interventions of the study population; (3) key elements evaluated for risk of bias; and (4) outcome indicators and relevant outcome measure data. Raw data were extracted from each study. Disagreements were resolved through discussion and consultation with a third party.

Quality Assessment

Because they were cross-sectional studies, the quality of the included studies was evaluated using the Newcastle-Ottawa Scale [19]. The quality assessment was conducted by 3 independent researchers (LW, JL, and BZ), with a fourth researcher (SH) resolving any disagreements. A low overall risk of bias was determined when the Newcastle-Ottawa Scale score ranged from 7 to 9, moderate risk was determined when the score was between 4 and 6, and high risk was determined when the score was 0 to 3.
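The score thresholds above map directly onto a small lookup, sketched here for illustration. The function name is hypothetical; the cutoffs are those stated in the text.

```python
def nos_risk_of_bias(score):
    """Map a Newcastle-Ottawa Scale score (0 to 9) to an overall risk-of-bias label."""
    if not isinstance(score, int) or not 0 <= score <= 9:
        raise ValueError("NOS score must be an integer from 0 to 9")
    if score >= 7:
        return "low"
    if score >= 4:
        return "moderate"
    return "high"

# Example: label a handful of hypothetical study scores
labels = [nos_risk_of_bias(s) for s in (8, 6, 4, 3)]
```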

Statistical Analyses

Statistical analyses were performed using Stata 18.0 and R (version 4.3.1), with the odds ratio (OR) as the analytical statistic. Accuracy was assessed using 95% CIs and the credible interval. NMA analyses were performed on different types of LLMs.

The confidence in the NMA estimates was assessed according to the Confidence in Network Meta-Analysis (CINeMA) methodology, which is broadly based on the Grading of Recommendations Assessment, Development, and Evaluation (GRADE). An NMA was conducted within a Bayesian framework using Markov chain Monte Carlo methods and was computed using the BUGSnet and GeMTC packages in R (version 4.3.1). A network graph was constructed for each outcome in order to facilitate a comparison of the performance of multiple LLMs. The consistency between direct and indirect evidence was evaluated using a node-splitting method when there was a closed loop. If the P value between the direct, indirect, and network comparisons of the 2 interventions was >.05, we concluded that there was no statistical difference and consistency was good. The convergence of the network models derived from the Markov chain Monte Carlo simulations was assessed using trace and density plots. We used noninformative priors for all parameters and assumed common heterogeneity. Furthermore, for all LLMs, we determined the ranking probabilities, which were summarized as the surface under the cumulative ranking curve (SUCRA). Higher SUCRA values suggest a higher accuracy ranking.
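For readers unfamiliar with SUCRA, the sketch below shows how it is conventionally derived from ranking probabilities: for each treatment, the cumulative probabilities of being among the best 1, 2, ..., a-1 ranks are averaged, where a is the number of treatments. This is a generic illustration rather than the code used in this study; the function name and the ranking probabilities are hypothetical.

```python
def sucra(rank_probs):
    """Compute SUCRA per treatment from ranking probabilities.

    rank_probs maps a treatment name to a list whose entry r is the
    probability of holding rank r + 1 (index 0 is the best rank).
    """
    results = {}
    for name, probs in rank_probs.items():
        n_ranks = len(probs)
        if n_ranks < 2:
            raise ValueError("at least two treatments are required")
        if abs(sum(probs) - 1.0) > 1e-9:
            raise ValueError(f"rank probabilities for {name} must sum to 1")
        cumulative = 0.0
        area = 0.0
        # Sum the cumulative probabilities over the first a - 1 ranks
        for p in probs[:-1]:
            cumulative += p
            area += cumulative
        results[name] = area / (n_ranks - 1)
    return results

# Hypothetical ranking probabilities for three models
values = sucra({
    "model_a": [0.7, 0.2, 0.1],
    "model_b": [0.2, 0.5, 0.3],
    "model_c": [0.1, 0.3, 0.6],
})
```

A SUCRA of 1 would mean a model is certain to rank first and 0 that it is certain to rank last, so the values are relative to the other models in the same network and are not accuracy percentages.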

Results

Literature Search and Selection

A bibliographic search yielded 59,075 citations, of which 21,156 were identified as potentially eligible. Manual reading of the titles and abstracts of these records excluded 20,814 papers whose topics and interventions did not match the inclusion criteria for this study, leaving 342 for full-text review. Full-text review excluded a further 174 articles: 147 in which outcome data could not be separated or extracted or in which the questions involved images; 12 with unclear versions of the LLMs; 8 that used an API to access LLMs; and 7 for which the full text was not available. This resulted in the final inclusion of 168 articles. The literature screening process is shown in Figure 1.
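As a quick consistency check on the screening counts, the following sketch recomputes the flow. It assumes that the 174 full-text exclusions are the sum of the four listed reasons (147 + 12 + 8 + 7), which is the reading under which the numbers reconcile; the variable names are illustrative.

```python
# Counts taken from the screening description
potentially_eligible = 21_156
excluded_on_title_abstract = 20_814

# Assumed breakdown of the full-text exclusions by reason
full_text_exclusions = {
    "outcome data not separable or extractable, or image-based questions": 147,
    "unclear LLM version": 12,
    "LLM accessed through an API": 8,
    "full text not available": 7,
}

full_text_reviewed = potentially_eligible - excluded_on_title_abstract
excluded_at_full_text = sum(full_text_exclusions.values())
included = full_text_reviewed - excluded_at_full_text

print(full_text_reviewed, excluded_at_full_text, included)
```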

Figure 1

Literature screening flowchart. API: application programming interface.

Basic Characteristics of the Incorporated Literature

To assess the accuracy of different LLMs when answering medical questions, a total of 168 studies were included after screening. Together, these studies covered 35,896 questions and 3063 clinical cases. The basic information of the 168 studies is presented in Multimedia Appendix 5.

Quality Assessment of the Included Studies

In the quality assessment, 40 (40/168, 23.8%) studies were assessed as having a low overall risk of bias, while 128 (128/168, 76.2%) had a moderate overall risk of bias. No studies were identified as having a high overall risk of bias. The detailed quality assessment results for each study can be found in Multimedia Appendix 6.

Network Meta-Analysis

Objective Questions

The accuracy of LLMs when answering objective questions was reported in 105 studies [10,13,14,20-121]. The evidence network relationships are plotted in Figure 2A and involve 30 LLMs and a total of 33,838 multiple choice questions. Direct and indirect comparisons were formed for each LLM, partially forming a closed loop. The results of the indirect comparison are shown in Figure 3 and Multimedia Appendix 7. The red cells indicate there are statistically significant differences between the column-defining regimen and the row-defining regimen. The values in the green and blue cells are the logOR and 95% CI, respectively, from the comparison of the LLMs represented in the columns with the LLMs represented in the rows. A logOR value <0 indicates that the accuracy of the LLM corresponding to a column is lower than the LLM corresponding to a row. A value >0 indicates a higher accuracy. There was no evidence of statistically significant inconsistency (all P>.05) in the node-splitting test for NMA, except for Claude 2 versus ChatGPT-4 (P=.04), Bing chat versus people (P=.004), and Perplexity versus people (P=.04; Multimedia Appendix 8). The convergence of iterations was evaluated as good in trace and density plots, with the bandwidth tending toward 0 and reaching stability (Multimedia Appendix 9). The best probability ranking showed that ChatGPT-4o (SUCRA=0.9207) ranked first in terms of accuracy when answering objective questions, Aeyeconsult (SUCRA=0.9187) ranked second, and ChatGPT-4 (SUCRA=0.8087) ranked third (Table 1, Figure 4A).
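To make the sign convention of the logOR concrete, the sketch below computes a log odds ratio and a Wald 95% CI from the correct-answer counts of two models on the same question set. This is a simple pairwise, frequentist calculation for illustration, not the Bayesian NMA estimate reported here; the function name and the counts are hypothetical.

```python
import math

def log_odds_ratio(correct_a, total_a, correct_b, total_b):
    """Log odds ratio of a correct answer for model A vs model B, with a Wald 95% CI."""
    a, b = correct_a, total_a - correct_a
    c, d = correct_b, total_b - correct_b
    if 0 in (a, b, c, d):
        # Continuity correction so that the log and the variance stay finite
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, (log_or - 1.96 * se, log_or + 1.96 * se)

# Hypothetical counts: model_x answers 83/100 correctly, model_y 76/100
log_or_xy, (ci_low, ci_high) = log_odds_ratio(83, 100, 76, 100)
log_or_yx, _ = log_odds_ratio(76, 100, 83, 100)
```

A positive logOR favors the first model and a negative one favors the second; swapping the two models flips the sign. When the interval contains 0, as it does for these hypothetical counts, the difference is not statistically significant.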

Figure 2

Comparison network diagram of different outcomes, where larger nodes indicate more questions and thicker line segments indicate more questions between 2 types of large language models (LLMs) when answering (A) objective questions, (B) open-ended questions, (C) a top 1 diagnosis, (D) a top 3 diagnosis, (E) a top 5 diagnosis, and (F) triage and classification questions.

Figure 3

Indirect comparison of the accuracy of large language models (LLMs) when answering objective questions: A: instructGPT; A1: LLaMA 2; B: GPT-3; B1: LLaMA 3; C: ChatGPT-3.5; D: ChatGPT-4; D1: Mistral Large; E: ChatGPT-4o; E1: people; F1: chatENT; G: Bard; G1: ChatSonic; H: PaLM2; H1: Aeyeconsult; I: Gemini; I1: Med-PaLM 2; K: Gemini 1.5 pro; L: Bing chat; M: Copilot; N: Perplexity; O: Perplexity Pro; P: Claude; Q: Claude-instant; R: Claude 2; T: Claude 3 Opus; U: Claude 3 Sonnet; W: LLaMA 7B; X: LLaMA 13B; Y: LLaMA 33B; Z: LLaMA 65B.

Table 1

Bayesian ranking results (surface under the cumulative ranking curve [SUCRA] value) of the network meta-analysis for each large language model (LLM).

LLM	SUCRA
	Objective questions	Open-ended questions	Top 1 diagnosis	Top 3 diagnosis	Top 5 diagnosis	Triage and classification
instructGPT (A)	0.7805	—^a	—	—	—	—
LLaMA 2 (A1)	0.2086	0.4629	0.1395	—	—	—
GPT-3 (B)	0.7704	—	—	—	—	—
LLaMA 3 (B1)	0.239	—	—	—	0.7405	—
ChatGPT-3.5 (C)	0.4343	0.5548	0.5039	0.565	0.5084	0.2093
Mixtral-8x7B (C1)	—	0.6224	—	—	—	—
ChatGPT-4 (D)	0.8087	0.8708	0.693	0.6302	0.8089	0.6185
Mistral Large (D1)	0.3842	—	—	—	—	—
ChatGPT-4o (E)	0.9207	—	—	—	—	—
People (E1)	0.6172	0.6067	0.9001	0.7126	0.6241	0.4934
chatENT (F1)	0.7687	—	—	—	—	—
Bard (G)	0.4443	0.3512	0.3353	0.4329	0.0722	0.5885
ChatSonic (G1)	0.4617	—	—	—	—	—
PaLM2 (H)	0.421	0.312	0.4496	—	—	0.5197
Aeyeconsult (H1)	0.9187	—	—	—	—	—
Gemini (I)	0.4543	0.6703	0.2812	—	0.2405	0.9649
Med-PaLM 2 (I1)	0.3919	—	—	—	—	—
OcularBERT (J1)	—	0.0176	—	—	—	—
Gemini 1.5 pro (K)	0.2449	—	—	—	0.7905	—
Doctor GPT (K1)	—	0.745	—	—	—	—
Bing chat (L)	0.728	0.23	0.2073	0.4499	0.2042	0.3391
Docs-GPT Beta (L1)	—	0.212	—	—	—	—
Copilot (M)	0.7038	—	0.5048	—	0.2633	—
WebMD (M1)	—	—	0.7511	0.1452	—	0.4348
Perplexity (N)	0.4424	—	0.3980	0.4367	0.2801	—
Ada Health (N1)	—	—	0.8363	0.6273	—	0.3319
Perplexity Pro (O)	0.3821	—	—	—	—	—
Claude (P)	0.5048	—	—	—	—	—
Claude-instant (Q)	0.4949	—	—	—	—	—
Claude 2 (R)	0.4928	0.5647	—	—	—	—
Claude 3 Opus (T)	0.7365	—	—	—	0.9672	—
Claude 3 Sonnet (U)	0.5094	—	—	—	—	—
LLaMA 7B (W)	0.1131	—	—	—	—	—
LLaMA 13B (X)	0.1365	—	—	—	—	—
LLaMA 33B (Y)	0.2147	—	—	—	—	—
LLaMA 65B (Z)	0.2721	—	—	—	—	—

^aNot applicable because the LLM was not in the network.
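The rankings quoted in the text can be reproduced from Table 1 by sorting each column. The sketch below does this for two outcomes, using the SUCRA values copied from the table; the helper name is illustrative.

```python
# SUCRA values copied from Table 1 (top 1 diagnosis column)
top1_diagnosis = {
    "LLaMA 2": 0.1395, "ChatGPT-3.5": 0.5039, "ChatGPT-4": 0.693,
    "People": 0.9001, "Bard": 0.3353, "PaLM2": 0.4496,
    "Gemini": 0.2812, "Bing chat": 0.2073, "Copilot": 0.5048,
    "WebMD": 0.7511, "Perplexity": 0.3980, "Ada Health": 0.8363,
}

# SUCRA values copied from Table 1 (triage and classification column)
triage_classification = {
    "ChatGPT-3.5": 0.2093, "ChatGPT-4": 0.6185, "People": 0.4934,
    "Bard": 0.5885, "PaLM2": 0.5197, "Gemini": 0.9649,
    "Bing chat": 0.3391, "WebMD": 0.4348, "Ada Health": 0.3319,
}

def top_n(sucra_by_model, n=3):
    """Return the n model names with the largest SUCRA values, best first."""
    return sorted(sucra_by_model, key=sucra_by_model.get, reverse=True)[:n]

print(top_n(top1_diagnosis))         # ['People', 'Ada Health', 'WebMD']
print(top_n(triage_classification))  # ['Gemini', 'ChatGPT-4', 'Bard']
```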

Figure 4

Surface under the cumulative ranking curve (SUCRAs) for the accuracy, with higher rankings associated with larger outcome values, of different large language models (LLMs) when answering (A) objective questions, (B) open-ended questions, (C) the top 1 diagnosis, (D) the top 3 diagnosis, (E) the top 5 diagnosis, and (F) triage and classification questions. The letters in the keys indicate the following LLMs: A: instructGPT; A1: LLaMA 2; B: GPT-3; B1: LLaMA 3; C: ChatGPT-3.5; C1: Mixtral-8x7B; D: ChatGPT-4; D1: Mistral Large; E: ChatGPT-4o; E1: people; F1: chatENT; G: Bard; G1: ChatSonic; H: PaLM2; H1: Aeyeconsult; I: Gemini; I1: Med-PaLM 2; J1: OcularBERT; K: Gemini 1.5 pro; K1: Doctor GPT; L: Bing chat; L1: Docs-GPT Beta; M: Copilot; M1: WebMD; N: Perplexity; N1: Ada Health; O: Perplexity Pro; P: Claude; Q: Claude-instant; R: Claude 2; S: Claude 2.1; T: Claude 3 Opus; U: Claude 3 Sonnet; W: LLaMA 7B; X: LLaMA 13B; Y: LLaMA 33B; Z: LLaMA 65B.

Subgroup Analysis

We stratified the results by the medical field of the questions (Multimedia Appendix 10). Based on the results, we compared the accuracy of LLMs in 6 fields: ophthalmology, orthopedics, urology, dentistry, oncology, and radiology. In ophthalmology, the LLM with the highest accuracy was Aeyeconsult (SUCRA=0.8334), followed by ChatGPT-4 (SUCRA=0.6331) and PaLM2 (SUCRA=0.5517). In orthopedics, the accuracy rankings, from highest to lowest, were Bard (SUCRA=0.7219), people (SUCRA=0.6802), and Bing chat (SUCRA=0.4732). For urology, Bing chat (SUCRA=0.7905) was the most accurate, followed by people (SUCRA=0.6587) and ChatGPT-4 (SUCRA=0.5941). In dentistry, ChatGPT-4 (SUCRA=0.9473) was the most accurate, followed by Bard (SUCRA=0.7068) and Gemini (SUCRA=0.5535). ChatGPT-4 (SUCRA=0.9002) performed the best in oncology, followed by ChatGPT-4o (SUCRA=0.8998) and Claude (SUCRA=0.7159). In radiology, ChatGPT-4o (SUCRA=0.9053) performed the best, ChatGPT-4 (SUCRA=0.7777) was second, and Claude 3 Opus (SUCRA=0.6935) ranked third. The SUCRAs are shown in Figure 5.

Figure 5

Surface under the cumulative ranking curve (SUCRAs) for the accuracy, with higher rankings associated with larger outcome values, of different large language models (LLMs) in (A) ophthalmology, (B) orthopedics, (C) urology, (D) dentistry, (E) oncology, and (F) radiology. The letters in the keys indicate the following LLMs: C: ChatGPT-3.5; D: ChatGPT-4; E: ChatGPT-4o; E1: people; G: Bard; H: PaLM2; H1: Aeyeconsult; I: Gemini; L: Bing chat; P: Claude; T: Claude 3 Opus; U: Claude 3 Sonnet; W: LLaMA 7B; X: LLaMA 13B; Y: LLaMA 33B; Z: LLaMA 65B.

Open-Ended Questions

The accuracy of the LLMs when responding to open-ended questions was examined in 34 studies [122-155]. The relationships within the evidence network are plotted in Figure 2B and include 14 LLMs and a total of 2026 open-ended questions. Direct and indirect comparisons were formed for each LLM, partially forming a closed loop. The results of the indirect comparison are presented in Multimedia Appendix 7, where red cells indicate statistically significant differences between the column-defining regimen and the row-defining regimen. There was no evidence of a statistically significant inconsistency (all P>.05) in the node-splitting test for the NMA, except for Bard versus ChatGPT-3.5 (P=.02; Multimedia Appendix 8). The trace and density plots are shown in Multimedia Appendix 9 and indicate good iterative convergence. The best probability ranking indicated that ChatGPT-4 (SUCRA=0.8708) exhibited the highest accuracy when answering open-ended questions, followed by Claude 2.1 (SUCRA=0.7796) and Doctor GPT (SUCRA=0.7450; Table 1, Figure 4B).

Top 1 Diagnosis, Top 3 Diagnosis, and Top 5 Diagnosis

The accuracy of the top 1 diagnosis in clinical cases by LLMs was reported in 19 studies [11,156-173]. The evidence network relationship diagram is shown in Figure 2C and involves 12 LLMs and a total of 1266 clinical cases. The accuracy of LLMs for the top 3 diagnosis was reported in 7 studies [158,161,169,171,174-176]. The evidence network relationships are plotted in Figure 2D and involve 8 LLMs and a total of 453 clinical cases. The accuracy of LLMs for the top 5 diagnosis in clinical cases was reported in 7 studies [158,167,168,173,177-179]. The evidence network relationships are plotted in Figure 2E and involve 11 LLMs and a total of 443 clinical cases. Each LLM formed direct and indirect comparisons, partially closing the loop.

In terms of the top 1 diagnosis and top 5 diagnosis, the results of the indirect comparison are presented in Multimedia Appendix 7, where red cells indicate statistically significant differences between the column-defining regimen and the row-defining regimen. For the top 3 diagnosis, there was no statistically significant difference (all P>.05) in the comparisons between the LLMs (Multimedia Appendix 7). There was no evidence of a statistically significant inconsistency (all P>.05) for the top 1 diagnosis, except for Ada Health versus ChatGPT-3.5 (P=.04). For the top 3 diagnosis and top 5 diagnosis, all P values were >.05 in the node-splitting test for the NMA (Multimedia Appendix 8). Iterative convergence was good, as shown by the trace and density plots (Multimedia Appendix 9). The best probability ranking showed that, in terms of accuracy of the top 1 diagnosis in clinical cases, people ranked first (SUCRA=0.9001), Ada Health ranked second (SUCRA=0.8363), and WebMD ranked third (SUCRA=0.7511; Table 1, Figure 4C). In terms of the accuracy of the top 3 diagnosis, people ranked first (SUCRA=0.7126), ChatGPT-4 ranked second (SUCRA=0.6302), and Ada Health ranked third (SUCRA=0.6273; Table 1, Figure 4D). For the accuracy of the top 5 diagnosis, Claude 3 Opus ranked first (SUCRA=0.9672), ChatGPT-4 ranked second (SUCRA=0.8089), and Gemini 1.5 pro ranked third (SUCRA=0.7905; Table 1, Figure 4E).

Triage and Classification

The accuracy of LLMs in triage and classification was reported in 7 studies [12,167,169,174,180-182]. The evidence network relationships are plotted in Figure 2F and involve 9 LLMs and a total of 901 clinical cases. Each LLM formed direct and indirect comparisons, partially closing the loop. The results of the indirect comparison are shown in Multimedia Appendix 7. There were significant differences between Gemini and ChatGPT-3.5, ChatGPT-4, or Bing chat (P<.05). There was no evidence of a statistically significant inconsistency (all P>.05) in the node-splitting test for the NMA, except for ChatGPT-3.5 versus ChatGPT-4 (P=.045; Multimedia Appendix 8). Iterative convergence was good, as shown by the trace and density plots (Multimedia Appendix 9). The best probability ranking showed that, for the accuracy of triage and classification, Gemini ranked first (SUCRA=0.9649), ChatGPT-4 ranked second (SUCRA=0.6185), and Bard ranked third (SUCRA=0.5885), as shown in Table 1 and Figure 4F.

Discussion

Principal Findings

This study presents the most comprehensive meta-analysis to date on the accuracy of various LLMs when responding to medical queries, encompassing objective questions, open-ended questions, top 1 diagnosis, top 3 diagnosis, top 5 diagnosis, and triage and classification. Variations in accuracy among different LLMs were observed. ChatGPT-4o demonstrated the highest accuracy when answering objective questions, while ChatGPT-4 excelled at open-ended questions. The superior performance of people at the top 1 diagnosis and top 3 diagnosis suggests that human expertise is generally more dependable than LLMs in complex medical scenarios, while Claude 3 Opus seems to perform the best in the top 5 diagnosis. In terms of triage and classification, Gemini appeared to be more reliable.

In addition, we stratified LLMs according to the medical field in which the objective questions were located and explored their accuracy in 6 fields: ophthalmology, orthopedics, urology, dentistry, oncology, and radiology. We found that Aeyeconsult performed the best in ophthalmology, Bard performed the best in orthopedics, Bing chat performed the best in urology, ChatGPT-4 performed the best in both dentistry and oncology, and ChatGPT-4o had the highest accuracy in radiology.

At present, language models based on transformer architecture, whether pretrained or fine-tuned using biomedical corpora, have been proven effective in a series of natural language processing benchmarks in the biomedical field [183]. We attempted to analyze why performance differs between LLMs when answering questions. Parameter size is an important factor affecting accuracy. Research has found that, when the parameter size of the PaLM model is expanded from 8B to 40B, the accuracy of answering medical questions is doubled [184]. However, the practicality of a model depends not only on its number of parameters but also on many other factors, such as its training data, fine-tuning protocols, and overall architecture [185]. GPT-4, for example, achieved higher performance than its predecessor by adopting more advanced training data and architecture. The timeliness and accuracy of training data are also crucial for model performance. Today, models need not rely only on a limited set of pretraining data; they can also obtain the latest knowledge from the internet in real time. For example, Bing AI and Google Bard already have the ability to obtain real-time updates, and ChatGPT has also begun to follow suit by accepting plugins to expand its capabilities [185,186].

In addition, we found that some models fine-tuned from a general-purpose backend LLM can achieve higher accuracy and lower energy consumption in specific fields. For example, in the field of ophthalmology, Aeyeconsult integrates many ophthalmic data sets based on GPT-4 for training and generation [24]. This targeted training can significantly improve its performance in ophthalmic clinical tasks. Other possible data sources include clinical texts and accurate medical information, such as guidelines and peer-reviewed literature. In fact, there are already some models built or fine-tuned based on clinical text, such as SkinGPT-4 and ChatDoctor, which perform better overall than various general LLMs at biomedical natural language processing tasks [187,188].

Progress on LLMs has been very rapid, with a newer, more computationally powerful version being released every few months. However, our results show that newer versions do not necessarily outperform older ones in accuracy, possibly because fewer studies have evaluated the newer versions, which may have biased the results somewhat. In addition, updated versions such as ChatGPT-4V are multimodal models (eg, they can evaluate image-based problems), and these models may have a greater advantage for tasks such as image evaluation.

Studies indicate that LLMs outperform humans on examinations worldwide, such as medical licensing, orthopedics, and pediatrics examinations, highlighting the potential of LLMs as a study aid. For the top 1 diagnosis and top 3 diagnosis, however, human accuracy is higher than that of LLMs. Although Claude 3 Opus outperformed humans in the top 5 diagnosis, the medical field demands a high level of accuracy, and medical diagnosis involves multifaceted information and complex decision-making. We therefore still recommend that LLMs be used only as an auxiliary tool to assist doctors with more efficient data analysis and preliminary diagnostic recommendations.
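To make the top 1, top 3, and top 5 diagnosis outcomes concrete, the following is a minimal Python sketch of how a top-k diagnostic accuracy could be computed from ranked differential lists. The function name `top_k_accuracy`, the case-insensitive exact-match rule, and the three example cases are illustrative assumptions and are not taken from the included studies, which may have judged matches by clinician review.

```python
from typing import Sequence


def top_k_accuracy(
    ranked: Sequence[Sequence[str]], truth: Sequence[str], k: int
) -> float:
    """Fraction of cases whose reference diagnosis is among the first k candidates."""
    if len(ranked) != len(truth):
        raise ValueError("ranked and truth must have the same length")
    if not truth:
        raise ValueError("at least one case is required")
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = sum(
        1
        for candidates, answer in zip(ranked, truth)
        # Assumption: a hit is a case-insensitive exact string match.
        if answer.lower() in (c.lower() for c in candidates[:k])
    )
    return hits / len(truth)


# Hypothetical differential lists for three cases (illustrative, not study data).
ranked_lists = [
    ["pneumonia", "pulmonary embolism", "heart failure"],
    ["migraine", "tension headache", "subarachnoid hemorrhage"],
    ["gastritis", "appendicitis", "cholecystitis"],
]
reference = ["pneumonia", "subarachnoid hemorrhage", "pancreatitis"]

for k in (1, 3):
    print(f"top-{k} accuracy: {top_k_accuracy(ranked_lists, reference, k):.2f}")
```

In this toy example, only the first case is correct at rank 1, whereas the second case is also captured within the first 3 candidates, so top-k accuracy can only stay the same or rise as k grows. This is one reason a model's relative standing against humans may differ between the top 1 and top 5 outcomes.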

Several meta-analyses have been conducted to assess the accuracy of LLMs in health care [15,189,190]. However, these studies included only ChatGPT, and some of them simply evaluated its performance on examinations. Some studies did not differentiate between the types of questions answered by ChatGPT, which led to substantial heterogeneity between the studies and biased results.

We acknowledge certain limitations in our study. First, for the top 3 diagnosis, top 5 diagnosis, and triage and classification, the small number of included studies and their limited sample sizes may have biased the results, so caution is needed when interpreting them. Second, although we minimized heterogeneity as much as possible, we cannot deny that the inclusion of different fields of study and the complexity of LLMs (such as different instructions and questioning dates) can affect the results and generate heterogeneity, which is a further reason for cautious interpretation. In addition, we did not assess the accuracy of multimodal LLMs when solving medical image–related problems; with the development of artificial intelligence, more multimodal models are being developed, and in the future, these models will become indispensable in the exploration of image-based problems in the medical field.

Conclusion

Existing studies suggest that ChatGPT-4o has an advantage for answering objective questions. For open-ended questions, ChatGPT-4 may be more credible. Humans are more accurate in the top 1 diagnosis and top 3 diagnosis of clinical cases. Claude 3 Opus performs better in the top 5 diagnosis, while for classification accuracy, Gemini is more advantageous. Although some LLMs excel at addressing medical queries, caution is advised due to the critical need for precision and rigor in medicine. Future high-quality studies and trials are necessary to gather more scientific evidence.

Multimedia Appendix 1

PRISMA checklist.

Multimedia Appendix 2

Search strategy.

Multimedia Appendix 3

Versions and timelines of LLM iterations.

Multimedia Appendix 4

Examples of the outcomes.

Multimedia Appendix 5

Description of 168 studies included.

Multimedia Appendix 6

Quality assessment of observational study.

Multimedia Appendix 7

Indirect comparison results.

Multimedia Appendix 8

Node splitting inconsistency test.

Multimedia Appendix 9

Trace and density plots.

Multimedia Appendix 10

Objective questions are stratified according to different fields of the questions.

Abbreviations

API

application programming interface

CINEMA

Confidence in Network Meta-Analysis

GRADE

Grading of Recommendations Assessment, Development, and Evaluation

LLM

large language model

NMA

network meta-analysis

OR

odds ratio

PRISMA

Preferred Reporting Items for Systematic Reviews and Meta-Analyses

SUCRA

surface under the cumulative ranking curve

Acknowledgments

This work was supported by the Training Program for Young and Middle-aged Backbone Talents of Fujian Provincial Health Commission (grant number 2022GGA001), Natural Science Foundation of Fujian Province (grant numbers 2021J01395 and 2024J011032), Foundations of Department of Finance of Fujian Province (grant numbers Min Cai Zhi (2023) 830 and Min Cai Zhi (2024) 881), Joint Funding Projects for Innovation in Science and Technology of Fujian Province (grant number 2023Y9330), and Internal Supporting Project of Fuzhou University Affiliated Provincial Hospital (grant number 0080072220).

Data Availability

The data sets generated or analyzed during this study are available from the corresponding author on reasonable request.

Authors' Contributions

All authors were involved in the conceptualization and design of the study and reviewed all documents and materials. LW, JL, BZ, and SH collected the data, performed data analysis, interpreted the results, and wrote the first draft of the manuscript. CW, WL, and MZ were involved in the development of the protocol for the systematic review and critically reviewed the results and the manuscript. MF and SG were involved in the development of the protocol and revised the manuscript. All authors read and approved the final manuscript.

Conflicts of Interest

None declared.

Shen

Heacock

Elias

Hentel

Reig

Shih

Moy

ChatGPT and other large language models are double-edged swords

Radiology 2023 04 307 2 e230163

10.1148/radiol.230163

36700838

No authors listed

Will ChatGPT transform healthcare?

Nat Med 2023 03 14 29 3 505 506

10.1038/s41591-023-02289-5

36918736

10.1038/s41591-023-02289-5

Park

Pinto-Powell

Thesen

Lindqwister

Levy

Chacko

Gonzalez

Bridges

Schwendt

Byrum

Fong

Shasavari

Hassanpour

Preparing healthcare leaders of the digital age with an integrative artificial intelligence curriculum: a pilot study

Med Educ Online 2024 12 31 29 1 2315684

10.1080/10872981.2024.2315684

38351737

PMC10868429

Sblendorio

Dentamaro

Lo Cascio

Germini

Piredda

Cicolini

Integrating human expertise and automated methods for a dynamic and multi-parametric evaluation of large language models' feasibility in clinical decision-making

Int J Med Inform 2024 08 188 105501

10.1016/j.ijmedinf.2024.105501

38810498

S1386-5056(24)00164-3

The potential applications and challenges of ChatGPT in the medical field

IJGM 2024 03 Volume 17 817 826

10.2147/ijgm.s456659

Park

Pillai

Deng

Guo

Gupta

Paget

Naugler

Assessing the research landscape and clinical utility of large language models: a scoping review

BMC Med Inform Decis Mak 2024 03 12 24 1 72

10.1186/s12911-024-02459-6

38475802

10.1186/s12911-024-02459-6

PMC10936025

Copilot

Microsoft 2025-04-21

https://www.microsoft.com/en-us/microsoft-copilot

Gemini

Google 2025-04-21

https://gemini.google.com/

Llama

Meta 2025-04-21

https://llama.meta.com/

Tsoutsanis

Evaluation of large language model performance on the Multi-Specialty Recruitment Assessment (MSRA) exam

Comput Biol Med 2024 01 168 107794

10.1016/j.compbiomed.2023.107794

38043471

S0010-4825(23)01259-3

Shukla

Mishra

Banerjee

Verma

The comparison of ChatGPT 3.5, Microsoft Bing, and Google Gemini for diagnosing cases of neuro-ophthalmology

Cureus 2024 04 16 4 e58232

10.7759/cureus.58232

38745784

PMC11092423

Pressman

Borna

Gomez-Cabello

Haider

Forte

AI in hand surgery: assessing large language models in the classification and management of hand injuries

J Clin Med 2024 05 11 13 10 2832

10.3390/jcm13102832

38792374

jcm13102832

PMC11122623

Vaishya

Iyengar

Patralekh

Botchu

Shirodkar

Jain

Vaish

Scarlat

Effectiveness of AI-powered chatbots in responding to orthopaedic postgraduate exam questions-an observational study

Int Orthop 2024 08 15 48 8 1963 1969

10.1007/s00264-024-06182-9

38619565

10.1007/s00264-024-06182-9

Lee

Tessier

Brar

Malone

Jin

McKechnie

Jung

Kroh

Dang

ASMBS Artificial Intelligence and Digital Surgery Taskforce

Performance of artificial intelligence in bariatric surgery: comparative analysis of ChatGPT-4, Bing, and Bard in the American Society for Metabolic and Bariatric Surgery textbook of bariatric surgery questions

Surg Obes Relat Dis 2024 07 20 7 609 613

10.1016/j.soard.2024.04.014

38782611

S1550-7289(24)00169-2

Wei

Yao

Cui

Wei

Jin

Evaluation of ChatGPT-generated medical responses: a systematic review and meta-analysis

J Biomed Inform 2024 03 151 104620

10.1016/j.jbi.2024.104620

38462064

S1532-0464(24)00038-8

Kaboudi

Firouzbakht

Shahir Eftekhar

Mohammad

Fayazbakhsh

Fatemeh

Joharivarnoosfaderani

Niloufar

Ghaderi

Salar

Dehdashti

Mohammadreza

Mohtasham Kia

Yasmin

Afshari

Maryam

Vasaghi-Gharamaleki

Maryam

Haghani

Leila

Moradzadeh

Zahra

Khalaj

Fattaneh

Mohammadi

Zahra

Hasanabadi

Zahra

Shahidi

Ramin

Diagnostic accuracy of ChatGPT for patients' triage; a systematic review and meta-analysis

Arch Acad Emerg Med 2024 12 1 e60

10.22037/aaem.v12i1.2384

39290765

PMC11407534

Patil

Serrato

Chisvo

Arnaout

See

Huang

Large language models in neurosurgery: a systematic review and meta-analysis

Acta Neurochir (Wien) 2024 11 23 166 1 475

10.1007/s00701-024-06372-9

39579215

10.1007/s00701-024-06372-9

Nguyen

Dang

Nguyen

Hoang

Nguyen

Accuracy of latest large language models in answering multiple choice questions in dentistry: a comparative study

PLoS One 2025 1 29 20 1 e0317423

10.1371/journal.pone.0317423

39879192

PONE-D-24-40356

PMC11778630

Mertz

Loeb

Newcastle-Ottawa Scale: comparing reviewers' to authors' assessments

BMC Med Res Methodol 2014 04 01 14 45

10.1186/1471-2288-14-45

24690082

1471-2288-14-45

PMC4021422

Long

Subburam

Lowe

Dos Santos

André

Zhang

Hwang

Saduka

Horev

Côté

David W J

Wright

ChatENT: augmented large language model for expert knowledge retrieval in otolaryngology-head and neck surgery

Otolaryngol Head Neck Surg 2024 10 19 171 4 1042 1051

10.1002/ohn.864

38895862

Tao

Hua

Milkovich

Micieli

ChatGPT-3.5 and Bing Chat in ophthalmology: an updated evaluation of performance, readability, and informative sources

Eye (Lond) 2024 07 20 38 10 1897 1902

10.1038/s41433-024-03037-w

38509182

10.1038/s41433-024-03037-w

PMC11226422

Shieh

Tran

Kumar

Freed

Majety

Assessing ChatGPT 4.0's test performance and clinical diagnostic accuracy on USMLE STEP 2 CK and clinical case reports

Sci Rep 2024 04 23 14 1 9330

10.1038/s41598-024-58760-x

38654011

10.1038/s41598-024-58760-x

PMC11039662

Sarangi

Narayan

Mohakud

Vats

Sahani

Mondal

Assessing the capability of ChatGPT, Google Bard, and Microsoft Bing in solving radiology case vignettes

Indian J Radiol Imaging 2024 04 29 34 2 276 282

10.1055/s-0043-1777746

38549897

IJRI-23-9-2963

PMC10972658

Singer

Chow

Teng

Development and evaluation of Aeyeconsult: a novel ophthalmology chatbot leveraging verified textbook knowledge and GPT-4

J Surg Educ 2024 03 81 3 438 443

10.1016/j.jsurg.2023.11.019

38135548

S1931-7204(23)00432-4

Hanna

Smith

Mhaskar

Hanna

Performance of language models on the family medicine in-training exam

Fam Med 2024 10 2 56 9 555 560

10.22454/fammed.2024.233738

Kadoya

Arai

Tanaka

Kimura

Tozuka

Yasui

Hayashi

Katsuta

Takahashi

Inoue

Jingu

Assessing knowledge about medical physics in language-generative AI with large language model: using the medical physicist exam

Radiol Phys Technol 2024 12 10 17 4 929 937

10.1007/s12194-024-00838-2

39254919

10.1007/s12194-024-00838-2

Sallam

Al-Mahzoum

Almutawaa

Alhashash

Dashti

AlSafy

Almutairi

Barakat

The performance of OpenAI ChatGPT-4 and Google Gemini in virology multiple-choice questions: a comparative analysis of English and Arabic responses

BMC Res Notes 2024 09 03 17 1 247

10.1186/s13104-024-06920-7

39228001

10.1186/s13104-024-06920-7

PMC11373487

Gravina

Pellegrino

Palladino

Imperio

Ventura

Federico

Charting new AI education in gastroenterology: cross-sectional evaluation of ChatGPT and perplexity AI in medical residency exam

Dig Liver Dis 2024 08 56 8 1304 1311

10.1016/j.dld.2024.02.019

38503659

S1590-8658(24)00302-5

Passby

Jenko

Wernham

Performance of ChatGPT on specialty certificate examination in dermatology multiple-choice questions

Clin Exp Dermatol 2024 06 25 49 7 722 727

10.1093/ced/llad197

37264670

7188526

Sabri

Saleh

MHA

Hazrati

Merchant

Misch

Kumar

Wang

Barootchi

Performance of three artificial intelligence (AI)-based large language models in standardized testing; implications for AI-assisted dental education

J Periodontal Res 2025 02 18 60 2 121 133

10.1111/jre.13323

39030766

PMC11873669

Çamur

Eren

Cesur

Güneş

Yasin Celal

Can large language models be new supportive tools in coronary computed tomography angiography reporting?

Clin Imaging 2024 10 114 110271

10.1016/j.clinimag.2024.110271

39236553

S0899-7071(24)00201-8

Lubitz

Latario

Performance of two artificial intelligence generative language models on the orthopaedic in-training examination

Orthopedics 2024 05 47 3 e146 e150

10.3928/01477447-20240304-02

38466827

Gupta

Hamid

Jhaveri

Patel

Suthar

Comparative evaluation of AI models such as ChatGPT 3.5, ChatGPT 4.0, and Google Gemini in neuroradiology diagnostics

Cureus 2024 08 16 8 e67766

10.7759/cureus.67766

39323714

PMC11422621

Lee

Hong

Kim

Jong Won

Lee

Young Hwan

Park

Sang O

Lee

Kyeong Ryong

Comparison of the problem-solving performance of ChatGPT-3.5, ChatGPT-4, Bing Chat, and Bard for the Korean emergency medicine board examination question bank

Medicine (Baltimore) 2024 03 01 103 9 e37325

10.1097/MD.0000000000037325

38428889

00005792-202403010-00048

PMC10906566

Menekseoglu

Comparative performance of artificial intelligence models in rheumatology board-level questions: evaluating Google Gemini and ChatGPT-4o

Clin Rheumatol 2024 11 28 43 11 3507 3513

10.1007/s10067-024-07154-5

39340572

10.1007/s10067-024-07154-5

D'Anna

Gennaro

Van Cauter

Thurnher

Van Goethem

Haller

Can large language models pass official high-grade exams of the European Society of Neuroradiology courses? A direct comparison between OpenAI chatGPT 3.5, OpenAI GPT4 and Google Bard

Neuroradiology 2024 08 06 66 8 1245 1250

10.1007/s00234-024-03371-6

38705899

10.1007/s00234-024-03371-6

Altamimi

Alhumimidi

Alshehri

Alrumayan

Abdullah

Al-Khlaiwi

Thamir

Meo

Sultan A

Temsah

Mohamad-Hani

The scientific knowledge of three large language models in cardiology: multiple-choice questions examination-based performance

Ann Med Surg (Lond) 2024 06 86 6 3261 3266

10.1097/MS9.0000000000002120

38846858

AMSU-D-23-02753

PMC11152788

Schoch

Schmelz

Strauch

Borgmann

Nestler

Performance of ChatGPT-3.5 and ChatGPT-4 on the European Board of Urology (EBU) exams: a comparative analysis

World J Urol 2024 07 26 42 1 445

10.1007/s00345-024-05137-4

39060792

10.1007/s00345-024-05137-4

May

Körner-Riffard

Katharina

Kollitsch

Burger

Brookman-May

Rauchenwald

Marszalek

Eredics

Evaluating the efficacy of AI chatbots as tutors in urology: a comparative analysis of responses to the 2022 In-Service Assessment of the European Board of Urology

Urol Int 2024 3 30 108 4 359 366

10.1159/000537854

38555637

000537854

PMC11305516

Sadeq

Ghorab

RMF

Ashry

Abozaid

Banihani

Salem

Aisheh

MTA

Abuzahra

Mourid

Assker

Ayyad

Moawad

MHED

AI chatbots show promise but limitations on UK medical exam questions: a comparative performance study

Sci Rep 2024 08 14 14 1 18859

10.1038/s41598-024-68996-2

39143077

10.1038/s41598-024-68996-2

PMC11324724

Khalpey

Kumar

King

Abraham

Khalpey

Large language models take on cardiothoracic surgery: a comparative analysis of the performance of four models on American Board of Thoracic Surgery exam questions in 2023

Cureus 2024 07 16 7 e65083

10.7759/cureus.65083

39171020

PMC11337141

Patel

Fleischer

Filip

Eggerstedt

Hutz

Michaelides

Batra

Tajudeen

Comparative performance of ChatGPT 3.5 and GPT4 on rhinology standardized board examination questions

OTO Open 2024 06 27 8 2 e164

10.1002/oto2.164

38938507

OTO2164

PMC11208739

Irmici

Cozzi

Della Pepa

De Berardinis

D'Ascoli

Elisa

Cellina

Cè

Maurizio

Depretto

Scaperrotta

How do large language models answer breast cancer quiz questions? A comparative study of GPT-3.5, GPT-4 and Google Gemini

Radiol Med 2024 10 13 129 10 1463 1467

10.1007/s11547-024-01872-1

39138732

10.1007/s11547-024-01872-1

Kollitsch

Eredics

Marszalek

Rauchenwald

Brookman-May

Burger

Körner-Riffard

Katharina

May

How does artificial intelligence master urological board examinations? A comparative analysis of different large language models' accuracy and reliability in the 2022 In-Service Assessment of the European Board of Urology

World J Urol 2024 01 10 42 1 20

10.1007/s00345-023-04749-6

38197996

10.1007/s00345-023-04749-6

Morreel

Verhoeven

Mathysen

Microsoft Bing outperforms five other generative artificial intelligence chatbots in the Antwerp University multiple choice medical license exam

PLOS Digit Health 2024 02 14 3 2 e0000349

10.1371/journal.pdig.0000349

38354127

PDIG-D-23-00311

PMC10866461

Bajčetić

Mirčić

Rakočević

Đoković

Milutinović

Zaletel

Comparing the performance of artificial intelligence learning models to medical students in solving histology and embryology multiple choice questions

Ann Anat 2024 06 254 152261

10.1016/j.aanat.2024.152261

38521363

S0940-9602(24)00053-0

Canillas Del Rey

Canillas Arias

Exploring the potential of artificial intelligence in traumatology: conversational answers to specific questions

Rev Esp Cir Ortop Traumatol 2025 01 69 1 38 46

10.1016/j.recot.2024.05.004

38782358

S1888-4415(24)00086-9

Meyer

Riese

Streichert

Comparison of the performance of GPT-3.5 and GPT-4 with that of medical students on the written German Medical Licensing Examination: observational study

JMIR Med Educ 2024 02 08 10 e50965

10.2196/50965

38329802

v10i1e50965

PMC10884900

Toyama

Harigai

Abe

Nagano

Kawabata

Seki

Takase

Performance evaluation of ChatGPT, GPT-4, and Bard on the official board examination of the Japan Radiology Society

Jpn J Radiol 2024 02 04 42 2 201 207

10.1007/s11604-023-01491-2

37792149

10.1007/s11604-023-01491-2

PMC10811006

Touma

Caterini

Liblk

Is ChatGPT ready for primetime? Performance of artificial intelligence on a simulated Canadian urology board exam

Can Urol Assoc J 2024 10 10 18 10 329 332

10.5489/cuaj.8800

38896484

cuaj.8800

PMC11477513

Chan

Dong

Angelini

The performance of large language models in intercollegiate Membership of the Royal College of Surgeons examination

Ann R Coll Surg Engl 2024 11 06 106 8 700 704

10.1308/rcsann.2024.0023

38445611

PMC11528401

Patil

Huang

van der Pol

Larocque

Comparative performance of ChatGPT and Bard in a text-based radiology knowledge assessment

Can Assoc Radiol J 2024 05 14 75 2 344 350

10.1177/08465371231193716

37578849

Hubany

Scala

Hashemi

Kapoor

Saumya

Fedorova

Julia R

Vaccaro

Matthew J

Ridout

Rees P

Hedman

Casey C

Kellogg

Brian C

Leto Barone

Angelo A

ChatGPT-4 surpasses residents: a study of artificial intelligence competency in plastic surgery in-service examinations and its advancements from ChatGPT-3.5

Plast Reconstr Surg Glob Open 2024 09 12 9 e6136

10.1097/GOX.0000000000006136

39239234

GOX-D-24-00262

PMC11377087

Nakajima

Fujimori

Furuya

Kanie

Yuya

Imai

Hirotatsu

Kita

Kosuke

Uemura

Keisuke

Okada

Seiji

A comparison between GPT-3.5, GPT-4, and GPT-4V: can the large language model (ChatGPT) pass the Japanese Board of Orthopaedic Surgery examination?

Cureus 2024 03 16 3 e56402

10.7759/cureus.56402

38633935

PMC11023708

Thibaut

Dabbagh

Liverneaux

Does Google's Bard Chatbot perform better than ChatGPT on the European hand surgery exam?

Int Orthop 2024 01 15 48 1 151 158

10.1007/s00264-023-06034-y

37968408

10.1007/s00264-023-06034-y

Lum

Collins

Dennison

Guntupalli

Lohitha

Choudhary

Soham

Saiz

Augustine M

Randall

Robert L

Generative artificial intelligence performs at a second-year orthopedic resident level

Cureus 2024 03 16 3 e56104

10.7759/cureus.56104

38618358

PMC11014641

Menekşeoğlu

İş

Comparative performance of artificial ıntelligence models in physical medicine and rehabilitation board-level questions

Rev Assoc Med Bras (1992) 2024 70 7 e20240241

10.1590/1806-9282.20240241

39045939

S0104-42302024000700614

PMC11262310

Cheong

RCT

Pang

Unadkat

Mcneillis

Williamson

Joseph

Randhawa

Andrews

Paleri

Performance of artificial intelligence chatbots in sleep medicine certification board exams: ChatGPT versus Google Bard

Eur Arch Otorhinolaryngol 2024 04 281 4 2137 2143

10.1007/s00405-023-08381-3

38117307

10.1007/s00405-023-08381-3

Mesnard

Benoît

Schirmann

Aurélie

Branchereau

Julien

Perrot

Ophélie

Bogaert

Guy

Neuzillet

Yann

Lebret

Thierry

Madec

François-Xavier

Artificial intelligence: ready to pass the European Board examinations in urology?

Eur Urol Open Sci 2024 02 60 44 46

10.1016/j.euros.2024.01.002

38321995

S2666-1683(24)00211-8

PMC10845241

Ming

Guo

Cheng

Lei

Influence of model evolution and system roles on ChatGPT's performance in Chinese medical licensing exams: comparative study

JMIR Med Educ 2024 08 13 10 e52784 e52784

10.2196/52784

39140269

v10i1e52784

PMC11336778

Chow

Hasan

Zheng

Gao

Valdes

Chhabra

Raman

Choi

Lin

Simone

The accuracy of artificial intelligence ChatGPT in oncology examination questions

J Am Coll Radiol 2024 11 21 11 1800 1804

10.1016/j.jacr.2024.07.011

39098369

S1546-1440(24)00675-6

Kim

Lee

Choi

Han

Lee

Performance of ChatGPT on solving orthopedic board-style questions: a comparative analysis of ChatGPT 3.5 and ChatGPT 4

Clin Orthop Surg 2024 08 16 4 669 673

10.4055/cios23179

39092297

PMC11262944

Oura

Tatekawa

Horiuchi

Matsushita

Takita

Atsukawa

Mitsuyama

Yoshida

Murai

Tanaka

Shimono

Yamamoto

Miki

Ueda

Diagnostic accuracy of vision-language models on Japanese diagnostic radiology, nuclear medicine, and interventional radiology specialty board examinations

Jpn J Radiol 2024 12 20 42 12 1392 1398

10.1007/s11604-024-01633-0

39031270

10.1007/s11604-024-01633-0

PMC11588758

Lewandowski

Łukowicz

Paweł

Świetlik

Dariusz

Barańska-Rybak

Wioletta

ChatGPT-3.5 and ChatGPT-4 dermatological knowledge level based on the Specialty Certificate Examination in Dermatology

Clin Exp Dermatol 2024 06 25 49 7 686 691

10.1093/ced/llad255

37540015

7237242

Knoedler

Alfertshofer

Knoedler

Hoch

Funk

Cotofana

Maheta

Frank

Brébant

Vanessa

Prantl

Lamby

Pure wisdom or Potemkin villages? A comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 style questions: quantitative analysis

JMIR Med Educ 2024 01 05 10 e51148

10.2196/51148

38180782

v10i1e51148

PMC10799278

Khan

Yunus

Sohail

Rehman

Saeed

Jackson

Sharkey

Mahmood

Matyal

Artificial intelligence for anesthesiology board-style examination questions: role of large language models

J Cardiothorac Vasc Anesth 2024 05 38 5 1251 1259

10.1053/j.jvca.2024.01.032

38423884

S1053-0770(24)00090-9

Sheikh

Thongprayoon

Qureshi

Suppadungsuk

Kashani

Miao

Craici

Cheungpasitporn

Personalized medicine transformed: ChatGPT's contribution to continuous renal replacement therapy alarm management in intensive care units

J Pers Med 2024 02 22 14 3 233

10.3390/jpm14030233

38540976

jpm14030233

PMC10971480

Mayo-Yáñez

Miguel

Lechien

Maria-Saibene

Vaira

Maniaci

Chiesa-Estomba

Examining the performance of ChatGPT 3.5 and Microsoft Copilot in otolaryngology: a comparative study with otolaryngologists' evaluation

Indian J Otolaryngol Head Neck Surg 2024 08 01 76 4 3465 3469

10.1007/s12070-024-04729-1

39130248

4729

PMC11306834

Rydzewski

Dinakaran

Zhao

Ruppin

Turkbey

Citrin

Patel

Comparative evaluation of LLMs in clinical oncology

NEJM AI 2024 05 25 1 5 1

10.1056/aioa2300151

39131700

PMC11315428

Wang

Chen

Lin

Comparing ChatGPT and clinical nurses' performances on tracheostomy care: a cross-sectional study

Int J Nurs Stud Adv 2024 06 6 100181

10.1016/j.ijnsa.2024.100181

38746816

S2666-142X(24)00008-0

PMC11080343

Liang

Zhao

Peng

Zhong

Zhang

Hou

Enhanced artificial intelligence strategies in renal oncology: iterative optimization and comparative analysis of GPT 3.5 versus 4.0

Ann Surg Oncol 2024 06 12 31 6 3887 3893

10.1245/s10434-024-15107-0

38472675

10.1245/s10434-024-15107-0

Jaworski

Jasiński

Dawid

Jaworski

Hop

Aleksandra

Janek

Artur

Sławińska

Barbara

Konieczniak

Lena

Rzepka

Maciej

Jung

Maximilian

Sysło

Oliwia

Jarząbek

Victoria

Błecha

Zuzanna

Haraziński

Konrad

Jasińska

Natalia

Comparison of the performance of artificial intelligence versus medical professionals in the Polish final medical examination

Cureus 2024 08 16 8 e66011

10.7759/cureus.66011

39221376

PMC11366403

Bharatha

Ojeh

Fazle Rabbi

Campbell

Krishnamurthy

Layne-Yarde

Kumar

Springer

Connell

Majumder

Comparing the performance of ChatGPT-4 and medical students on MCQs at varied levels of Bloom’s taxonomy

AMEP 2024 05 Volume 15 393 400

10.2147/amep.s457408

Davis

ChatGPT yields a passing score on a pediatric board preparatory exam but raises red flags

Global Pediatric Health 2024 03 24 11 1

10.1177/2333794x241240327

Arango

Flynn

Zeitlin

Lorenzana

Daniel J

Miller

Andrew J

Wilson

Matthew S

Strohl

Adam B

Weiss

Lawrence E

Weir

Tristan B

The performance of ChatGPT on the American Society for Surgery of the Hand self-assessment examination

Cureus 2024 04 16 4 e58950

10.7759/cureus.58950

38800302

PMC11126365

Rojas

Burgess

Toro-Pérez

Javier

Salehi

Exploring the performance of ChatGPT versions 3.5, 4, and 4 with vision in the Chilean medical licensing examination: observational study

JMIR Med Educ 2024 04 29 10 e55048 e55048

10.2196/55048

38686550

v10i1e55048

PMC11082432

Chau

RCW

Thu

Hsung

ECM

Lam

WYH

Performance of generative artificial intelligence in dental licensing examinations

Int Dent J 2024 06 74 3 616 621

10.1016/j.identj.2023.12.007

38242810

S0020-6539(23)00989-9

PMC11123518

Thirunavukarasu

Mahmood

Malem

Foster

Sanghera

Hassan

Zhou

Wong

Chong

Shakeel

Chang

Tan

BKJ

Jain

Tan

Rauz

Ting

DSW

Ting

DSJ

Large language models approach expert-level clinical knowledge and reasoning in ophthalmology: a head-to-head cross-sectional study

PLOS Digit Health 2024 04 17 3 4 e0000341

10.1371/journal.pdig.0000341

38630683

PDIG-D-23-00293

PMC11023493

Bicknell

Butler

Whalen

Ricks

Dixon

Clark

Spaedy

Skelton

Edupuganti

Dzubinski

Tate

Dyess

Lindeman

Lehmann

ChatGPT-4 Omni performance in USMLE disciplines and clinical skills: comparative analysis

JMIR Med Educ 2024 11 06 10 e63430

10.2196/63430

39504445

v10i1e63430

PMC11611793

Haddad

Saade

Performance of ChatGPT on ophthalmology-related questions across various examination levels: observational study

JMIR Med Educ 2024 01 18 10 e50842

10.2196/50842

38236632

v10i1e50842

PMC10835593

Noda

Izaki

Kitano

Komatsu

Ichikawa

Shibagaki

Performance of ChatGPT and Bard in self-assessment questions for nephrology board renewal

Clin Exp Nephrol 2024 05 14 28 5 465 469

10.1007/s10157-023-02451-w

38353783

10.1007/s10157-023-02451-w

Yudovich

Makarova

Hague

Raman

Performance of GPT-3.5 and GPT-4 on standardized urology knowledge assessment items in the United States: a descriptive study

J Educ Eval Health Prof 2024 07 08 21 17

10.3352/jeehp.2024.21.17

38977032

jeehp.2024.21.17

PMC11893186

Kao

Tsai

Bai

Yeh

Chu

Hsu

Cheng

Hsu

Liang

Comparing the performance of ChatGPT GPT-4, Bard, and Llama-2 in the Taiwan Psychiatric Licensing Examination and in differential diagnosis with multi-center psychiatrists

Psychiatry Clin Neurosci 2024 06 26 78 6 347 352

10.1111/pcn.13656

38404249

Farhat

Chaudhry

Nadeem

Sohail

Madsen

Evaluating large language models for the National Premedical Exam in India: comparative analysis of GPT-3.5, GPT-4, and Bard

JMIR Med Educ 2024 02 21 10 e51523

10.2196/51523

38381486

v10i1e51523

PMC10918540

Gilson

Safranek

Huang

Socrates

Chi

Taylor

Chartash

How does ChatGPT perform on the United States Medical Licensing Examination (USMLE)? The implications of large language models for medical education and knowledge assessment

JMIR Med Educ 2023 02 08 9 e45312

10.2196/45312

36753318

v9i1e45312

PMC9947764

Kung

Marshall

Gauthier

Gonzalez

Jackson

Evaluating ChatGPT performance on the orthopaedic in-training examination

JB JS Open Access 2023 8 3 1

10.2106/JBJS.OA.23.00056

37693092

JBJSOA-D-23-00056

PMC10484364

Gencer

Aydin

Can ChatGPT pass the thoracic surgery exam?

Am J Med Sci 2023 10 366 4 291 295

10.1016/j.amjms.2023.08.001

37549788

S0002-9629(23)01292-2

Ali

Tang

Connolly

Zadnik Sullivan

Shin

Fridley

Asaad

Cielo

Oyelese

Doberstein

Gokaslan

Telfeian

Performance of ChatGPT and GPT-4 on neurosurgery written board examinations

Neurosurgery 2023 12 01 93 6 1353 1365

10.1227/neu.0000000000002632

37581444

00006123-202312000-00018

Massey

Montgomery

Zhang

Comparison of ChatGPT-3.5, ChatGPT-4, and orthopaedic resident performance on orthopaedic assessment examinations

J Am Acad Orthop Surg 2023 12 01 31 23 1173 1179

10.5435/JAAOS-D-23-00396

37671415

00124635-990000000-00782

PMC10627532

Suchman

Garg

Trindade

Chat Generative Pretrained Transformer fails the multiple-choice American College of Gastroenterology self-assessment test

Am J Gastroenterol 2023 12 01 118 12 2280 2282

10.14309/ajg.0000000000002320

37212584

00000434-202312000-00032

Sakai

Maeda

Ozaki

Kanda

Kurimoto

Takahashi

Performance of ChatGPT in board examinations for specialists in the Japanese Ophthalmology Society

Cureus 2023 12 15 12 e49903

10.7759/cureus.49903

38174202

PMC10763518

Huang

Gomaa

Semrau

Haderlein

Lettmaier

Weissmann

Grigo

Tkhayat

Frey

Gaipl

Distel

Maier

Fietkau

Bert

Putz

Benchmarking ChatGPT-4 on a radiation oncology in-training exam and Red Journal Gray Zone cases: potentials and challenges for AI-assisted medical education and decision making in radiation oncology

Front Oncol 2023 13 1265024

10.3389/fonc.2023.1265024

37790756

PMC10543650

Yanagita

Yokokawa

Uchida

Tawara

Ikusaka

Accuracy of ChatGPT on medical questions in the National Medical Licensing Examination in Japan: evaluation study

JMIR Form Res 2023 10 13 7 e48023

10.2196/48023

37831496

v7i1e48023

PMC10612006

Teebagy

Colwell

Wood

Yaghy

Faustina

Improved performance of ChatGPT-4 on the OKAP Examination: a comparative study with ChatGPT-3.5

J Acad Ophthalmol (2017) 2023 07 11 15 2 e184 e187

10.1055/s-0043-1774399

37701862

JAO-425

PMC10495224

Kaneda

Takahashi

Kaneda

Akashima

Okita

Misaki

Yamashiro

Ozaki

Tanimoto

Assessing the performance of GPT-3.5 and GPT-4 on the 2023 Japanese Nursing Examination

Cureus 2023 08 15 8 e42924

10.7759/cureus.42924

37667724

PMC10475149

Flores-Cohaila

García-Vicente

Abigaíl

Vizcarra-Jiménez

Sonia F

De la Cruz-Galán

Janith P

Gutiérrez-Arratia

Jesús D

Quiroga Torres

Taype-Rondan

Performance of ChatGPT on the Peruvian National Licensing Medical Examination: cross-sectional study

JMIR Med Educ 2023 09 28 9 e48039

10.2196/48039

37768724

v9i1e48039

PMC10570896

Fowler

Pullen

Birkett

Performance of ChatGPT and Bard on the official part 1 FRCOphth practice questions

Br J Ophthalmol 2024 09 20 108 10 1379 1383

10.1136/bjo-2023-324091

37932006

bjo-2023-324091

Moshirfar

Altaf

Stoakes

Tuttle

Hoopes

Artificial intelligence in ophthalmology: a comparative analysis of GPT-3.5, GPT-4, and human expertise in answering StatPearls questions

Cureus 2023 06 15 6 e40822

10.7759/cureus.40822

37485215

PMC10362981

Brin

Sorin

Vaid

Soroush

Glicksberg

Charney

Nadkarni

Klang

Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments

Sci Rep 2023 10 01 13 1 16492

10.1038/s41598-023-43436-9

37779171

PMC10543445

100

Miao

Thongprayoon

Garcia Valencia

Krisanapan

Sheikh

Davis

Mekraksakit

Suarez

Craici

Cheungpasitporn

Performance of ChatGPT on nephrology test questions

Clin J Am Soc Nephrol 2023 10 18 19 1 35 43

10.2215/cjn.0000000000000330

101

Kaneda

Namba

Kaneda

Tanimoto

Artificial intelligence in childcare: assessing the performance and acceptance of ChatGPT responses

Cureus 2023 08 15 8 e44484

10.7759/cureus.44484

37791148

PMC10544433

102

Takagi

Watari

Erabi

Sakaguchi

Performance of GPT-3.5 and GPT-4 on the Japanese Medical Licensing Examination: comparison study

JMIR Med Educ 2023 06 29 9 e48002

10.2196/48002

37384388

v9i1e48002

PMC10365615

103

Ali

Tang

Connolly

Fridley

Shin

Zadnik Sullivan

Cielo

Oyelese

Doberstein

Telfeian

Gokaslan

Asaad

Performance of ChatGPT, GPT-4, and Google Bard on a neurosurgery oral boards preparation question bank

Neurosurgery 2023 11 01 93 5 1090 1098

10.1227/neu.0000000000002551

37306460

00006123-990000000-00775

104

Ohta

The Performance of GPT-3.5, GPT-4, and Bard on the Japanese National Dentist Examination: a comparison study

Cureus 2023 12 15 12 e50369

10.7759/cureus.50369

38213361

PMC10782219

105

Watari

Takagi

Sakaguchi

Nishizaki

Shimizu

Yamamoto

Tokuda

Performance comparison of ChatGPT-4 and Japanese medical residents in the General Medicine In-Training Examination: comparison study

JMIR Med Educ 2023 12 06 9 e52202

10.2196/52202

38055323

v9i1e52202

PMC10733815

106

Roos

Kasapovic

Jansen

Kaczmarczyk

Artificial intelligence in medical education: comparative analysis of ChatGPT, Bing, and medical students in Germany

JMIR Med Educ 2023 09 04 9 e46482

10.2196/46482

37665620

v9i1e46482

PMC10507517

107

Guillen-Grima

Guillen-Aguinaga

Alas-Brun

Onambele

Ortega

Montejo

Aguinaga-Ontoso

Barach

Aguinaga-Ontoso

Evaluating the efficacy of ChatGPT in navigating the Spanish Medical Residency Entrance Examination (MIR): promising horizons for AI in clinical medicine

Clin Pract 2023 11 20 13 6 1460 1487

10.3390/clinpract13060130

37987431

clinpract13060130

PMC10660543

108

Huang

KJQ

Meaney

Kemppainen

Punnett

Leung

Assessment of resident and AI chatbot performance on the University of Toronto family medicine residency progress test: comparative study

JMIR Med Educ 2023 09 19 9 e50514

10.2196/50514

37725411

v9i1e50514

PMC10548315

109

Schubert

Wick

Venkataramani

Performance of large language models on a neurology board-style examination

JAMA Netw Open 2023 12 01 6 12 e2346721

10.1001/jamanetworkopen.2023.46721

38060223

2812620

PMC10704278

110

Torres-Zegarra

Rios-Garcia

Ñaña-Cordova

Alvaro Micael

Arteaga-Cisneros

Chalco

XCB

Ordoñez

Marina Atena Bustamante

Rios

CJG

Godoy

CAR

Quezada

KLTP

Gutierrez-Arratia

Flores-Cohaila

Performance of ChatGPT, Bard, Claude, and Bing on the Peruvian National Licensing Medical Examination: a cross-sectional study

J Educ Eval Health Prof 2023 11 20 20 30

10.3352/jeehp.2023.20.30

37981579

jeehp.2023.20.30

PMC11009012

111

Kirshteyn

Golan

Chaet

Performance of ChatGPT vs. HuggingChat on OB-GYN topics

Cureus 2024 03 16 3 e56187

10.7759/cureus.56187

38618446

PMC11015885

112

van Nuland

Erdogan

Açar

Cenkay

Contrucci

Hilbrants

Maanach

Egberts

van der Linden

Performance of ChatGPT on factual knowledge questions regarding clinical pharmacy

J Clin Pharmacol 2024 09 16 64 9 1095 1100

10.1002/jcph.2443

38623909

113

Danesh

Pazouki

Danesh

Vardar-Sengul

Artificial intelligence in dental education: ChatGPT's performance on the periodontic in-service examination

J Periodontol 2024 01 10 95 7 682 687

10.1002/jper.23-0514

114

Huang

Zhang

Caussade

Brown

Stockton Hogrogian

Yan

Pediatric dermatologists versus AI bots: evaluating the medical knowledge and diagnostic capabilities of ChatGPT

Pediatr Dermatol 2024 05 09 41 5 831 834

10.1111/pde.15649

38721744

115

Fiedler

Azua

Phillips

Ahmed

ChatGPT performance on the American Shoulder and Elbow Surgeons maintenance of certification exam

J Shoulder Elbow Surg 2024 09 33 9 1888 1893

10.1016/j.jse.2024.02.029

38580067

S1058-2746(24)00231-3

116

Coleman

Moore

Two artificial intelligence models underperform on examinations in a veterinary curriculum

J Am Vet Med Assoc 2024 05 01 262 5 692 697

10.2460/javma.23.12.0666

38382193

117

Abbas

Rehman

Comparing the performance of popular large language models on the National Board of Medical Examiners sample questions

Cureus 2024 03 16 3 e55991

10.7759/cureus.55991

38606229

PMC11007479

118

Jarou

Dakka

McGuire

Bunting

ChatGPT versus human performance on emergency medicine board preparation questions

Ann Emerg Med 2024 01 83 1 87 88

10.1016/j.annemergmed.2023.08.010

37725017

S0196-0644(23)00663-7

119

Sensoy

Citirik

Assessing the proficiency of artificial intelligence programs in the diagnosis and treatment of cornea, conjunctiva, and eyelid diseases and exploring the advantages of each other benefits

Cont Lens Anterior Eye 2024 04 47 2 102125

10.1016/j.clae.2024.102125

38443209

S1367-0484(24)00008-0

120

Guerra

Hofmann

Wong

Fathi

Mayfield

Petrigliano

Liu

ChatGPT, Bard, and Bing chat are large language processing models that answered orthopaedic in-training examination questions with similar accuracy to first-year orthopaedic surgery residents

Arthroscopy 2025 03 41 3 557 562

10.1016/j.arthro.2024.08.023

39209078

S0749-8063(24)00621-2

121

Agarwal

Goswami

Sharma

Priyanka

Evaluating ChatGPT-3.5 and Claude-2 in answering and explaining conceptual medical physiology multiple-choice questions

Cureus 2023 09 15 9 e46222

10.7759/cureus.46222

37908959

PMC10613833

122

Cheong

Zhang

Tan

Fenner

Wong

Teo

Wang

Sivaprasad

Keane

Lee

Cheung

CMG

Wong

Cheong

Song

Tham

Comparing generative and retrieval-based chatbots in answering patient questions regarding age-related macular degeneration and diabetic retinopathy

Br J Ophthalmol 2024 09 20 108 10 1443 1449

10.1136/bjo-2023-324533

38749531

bjo-2023-324533

PMC11716104

123

Zhou

Luo

Chen

Jiang

Hong

Yang

Chun

Ran

Guanghui

Juan

Yin

Chengliang

The performance of large language model-powered chatbots compared to oncology physicians on colorectal cancer queries

Int J Surg 2024 10 01 110 10 6509 6517

10.1097/JS9.0000000000001850

38935100

01279778-990000000-01734

PMC11487020

124

Kozaily

Geagea

Akdogan

Atkins

Elshazly

Guglin

Tedford

Wehbe

Accuracy and consistency of online large language model-based artificial intelligence chat platforms in answering patients' questions about heart failure

Int J Cardiol 2024 08 01 408 132115

10.1016/j.ijcard.2024.132115

38697402

S0167-5273(24)00737-X

125

Xia

Hua

Mei

Lai

Wei

Qin

Luo

Wang

Huo

Zhou

Zhang

Wang

Song

Zhou

Clinical application potential of large language model: a study based on thyroid nodules

Endocrine 2025 01 30 87 1 206 213

10.1007/s12020-024-03981-3

39080210

126

Lee

Shin

Tessier

Javidan

Jung

Hong

Strong

McKechnie

Malone

Jin

Kroh

Dang

ASMBS Artificial Intelligence and Digital Surgery Task Force

Harnessing artificial intelligence in bariatric surgery: comparative analysis of ChatGPT-4, Bing, and Bard in generating clinician-level bariatric surgery recommendations

Surg Obes Relat Dis 2024 07 20 7 603 608

10.1016/j.soard.2024.03.011

38644078

S1550-7289(24)00118-7

127

Doğan

Özçakmakcı

Gazi Bekir

Yılmaz

The performance of chatbots and the AAPOS website as a tool for amblyopia education

J Pediatr Ophthalmol Strabismus 2024 04 25 61 5 325 331

10.3928/01913913-20240409-01

38661309

128

Lee

Campbell

Patel

Hossain

Afif

Radfar

Navid

Siddiqui

Emaad

Gardin

Julius M

Unlocking health literacy: the ultimate guide to hypertension education from ChatGPT versus Google Gemini

Cureus 2024 05 16 5 e59898

10.7759/cureus.59898

38721479

PMC11078260

129

Lang

Yoseph

Gonzalez-Suarez

Kim

Fatemi

Wagner

Maldaner

Stienen

Zygourakis

Analyzing large language models' responses to common lumbar spine fusion surgery questions: a comparison between ChatGPT and Bard

Neurospine 2024 06 21 2 633 641

10.14245/ns.2448098.049

38955533

ns.2448098.049

PMC11224745

130

Iannantuono

Bracken-Clarke

Karzai

Choo-Wosoba

Gulley

Floudas

Comparison of large language models in answering immuno-oncology questions: a cross-sectional study

Oncologist 2024 05 03 29 5 407 414

10.1093/oncolo/oyae009

38309720

7600405

PMC11067804

131

Anguita

Downie

Ferro Desideri

Sagoo

Assessing large language models' accuracy in providing patient support for choroidal melanoma

Eye (Lond) 2024 11 13 38 16 3113 3117

10.1038/s41433-024-03231-w

39003430

PMC11544095

132

Zhang

Dong

Mei

Hou

Wei

Yeung

Hua

Lai

Xia

Zhou

Performance of large language models on benign prostatic hyperplasia frequently asked questions

Prostate 2024 06 84 9 807 813

10.1002/pros.24699

38558009

133

Xue

Bracken-Clarke

Iannantuono

Choo-Wosoba

Gulley

Floudas

Utility of large language models for health care professionals and patients in navigating hematopoietic stem cell transplantation: comparison of the performance of ChatGPT-3.5, ChatGPT-4, and Bard

J Med Internet Res 2024 05 17 26 e54758

10.2196/54758

38758582

v26i1e54758

PMC11143389

134

Cao

Kwon

Ghaziani

Kwo

Tse

Kesselman

Kamaya

Tse

Large language models' responses to liver cancer surveillance, diagnosis, and management questions: accuracy, reliability, readability

Abdom Radiol (NY) 2024 12 01 49 12 4286 4294

10.1007/s00261-024-04501-7

39088019

135

Monroe

Abdelhafez

Atsina

Aman

Nardo

Madani

Evaluation of responses to cardiac imaging questions by the artificial intelligence large language model ChatGPT

Clin Imaging 2024 08 112 110193

10.1016/j.clinimag.2024.110193

38820977

S0899-7071(24)00123-2

136

Chervonski

Harish

Rockman

Sadek

Teter

Jacobowitz

Berland

Lohr

Moore

Maldonado

Generative artificial intelligence chatbots may provide appropriate informational responses to common vascular surgery questions by patients

Vascular 2025 02 18 33 1 229 237

10.1177/17085381241240550

38500300

137

Kassab

Hadi El Hajjar

Wardrop

Brateanu

Accuracy of online artificial intelligence models in primary care settings

Am J Prev Med 2024 06 66 6 1054 1059

10.1016/j.amepre.2024.02.006

38354991

S0749-3797(24)00060-6

138

Al-Sharif

Penteado

Dib El Jalbout

Nahia

Topilow

Nicole J

Shoji

Marissa K

Kikkawa

Don O

Liu

Catherine Y

Korn

Bobby S

Evaluating the accuracy of ChatGPT and Google BARD in fielding oculoplastic patient queries: a comparative study on artificial versus human intelligence

Ophthalmic Plast Reconstr Surg 2024 40 3 303 311

10.1097/IOP.0000000000002567

38215452

00002341-202405000-00010

139

Mejia

Arroyave

Saturno

Ndjonko

LCM

Zaidat

Rajjoub

Ahmed

Zapolsky

Cho

Use of ChatGPT for determining clinical and surgical treatment of lumbar disc herniation with radiculopathy: a North American Spine Society guideline comparison

Neurospine 2024 03 21 1 149 158

10.14245/ns.2347052.526

38291746

ns.2347052.526

PMC10992643

140

Lee

Rao

Campbell

Radfar

Dayal

Khrais

Evaluating ChatGPT-3.5 and ChatGPT-4.0 responses on hyperlipidemia for patient education

Cureus 2024 05 16 5 e61067

10.7759/cureus.61067

38803402

PMC11128363

141

Oliveira

Coelho

Guedes

Cattoni

Carvalho

Duarte-Batista

Performance of ChatGPT 3.5 and 4 as a tool for patient support before and after DBS surgery for Parkinson's disease

Neurol Sci 2024 12 29 45 12 5757 5764

10.1007/s10072-024-07732-0

39198356

PMC11554841

142

Lim

Pushpanathan

Yew

SME

Lai

Sun

Lam

JSH

Chen

Goh

JHL

Tan

MCJ

Sheng

Cheng

Koh

VTC

Tham

Benchmarking large language models' performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard

EBioMedicine 2023 09 95 104770

10.1016/j.ebiom.2023.104770

37625267

S2352-3964(23)00336-5

PMC10470220

143

Rahsepar

Tavakoli

Kim

GHJ

Hassani

Abtin

Bedayat

How AI responds to common lung cancer questions: ChatGPT vs Google Bard

Radiology 2023 06 307 5 e230922

10.1148/radiol.230922

37310252

144

Pushpanathan

Lim

Er Yew

Chen

Hui'En Lin

Lin Goh

Wong

Wang

Jin Tan

Chang Koh

Tham

Popular large language model chatbots' accuracy, comprehensiveness, and self-awareness in answering ocular symptom queries

iScience 2023 11 17 26 11 108163

10.1016/j.isci.2023.108163

37915603

S2589-0042(23)02240-X

PMC10616302

145

Coskun

Yagiz

Ocakoglu

Dalkilic

Pehlivan

Assessing the accuracy and completeness of artificial intelligence language models in providing information on methotrexate use

Rheumatol Int 2024 03 25 44 3 509 515

10.1007/s00296-023-05473-5

37747564

146

King

Samaan

Yeo

Peng

Kunkel

Habib

Ghashghaei

A multidisciplinary assessment of ChatGPT's knowledge of amyloidosis: observational study

JMIR Cardio 2024 04 19 8 e53421

10.2196/53421

38640472

v8i1e53421

PMC11069089

147

Pinto

VBP

de Azevedo

Wroclawski

Gentile

Jesus

VLM

de Bessa Junior

Nahas

Sacomani

CAR

Sandhu

Gomes

Conformity of ChatGPT recommendations with the AUA/SUFU guideline on postprostatectomy urinary incontinence

Neurourol Urodyn 2024 04 07 43 4 935 941

10.1002/nau.25442

38451040

148

Momenaei

Wakabayashi

Shahlaee

Durrani

Pandit

Wang

Mansour

Abishek

Sridhar

Yonekawa

Kuriyan

Assessing ChatGPT-3.5 versus ChatGPT-4 performance in surgical treatment of retinal diseases: a comparative study

Ophthalmic Surg Lasers Imaging Retina 2024 08 55 8 481 482

10.3928/23258160-20240227-02

38531015

149

Stevenson

Walsh

Hibberd

Can artificial intelligence replace biochemists? A study comparing interpretation of thyroid function test results by ChatGPT and Google Bard to practising biochemists

Ann Clin Biochem 2024 03 20 61 2 143 149

10.1177/00045632231203473

37699796

150

Dronkers

EAC

Geneid

Al Yaghchi

Lechien

Evaluating the potential of AI chatbots in treatment decision-making for acquired bilateral vocal fold paralysis in adults

J Voice 2024 04 06 1

10.1016/j.jvoice.2024.02.020

38584026

S0892-1997(24)00059-6

151

Rahimli Ocakoglu

Coskun

The emerging role of AI in patient education: a comparative analysis of LLM accuracy for pelvic organ prolapse

Med Princ Pract 2024 03 25 33 4 330 337

10.1159/000538538

38527444

000538538

PMC11324208

152

Gandhi

Joesph

Rajagopal

Aparnavi

Katkuri

Dayama

Satapathy

Khatib

Gaidhane

Zahiruddin

Behera

Performance of ChatGPT on the India undergraduate community medicine examination: cross-sectional study

JMIR Form Res 2024 03 25 8 e49964

10.2196/49964

38526538

v8i1e49964

PMC11002731

153

Tariq

Malik

Khanna

Evolving landscape of large language models: an evaluation of ChatGPT and Bard in answering patient queries on colonoscopy

Gastroenterology 2024 01 166 1 220 221

10.1053/j.gastro.2023.08.033

37634736

S0016-5085(23)04916-8

154

Zhang

Zhu

Sheng

Tham

Wong

Potential multidisciplinary use of large language models for addressing queries in cardio-oncology

J Am Heart Assoc 2024 03 19 13 6 e033584

10.1161/JAHA.123.033584

38497458

PMC11010006

155

Sosa

Cung

Suhardi

Morse

Thomson

Yang

Iyer

Greenblatt

Capacity for large language model chatbots to aid in orthopedic management, research, and patient queries

J Orthop Res 2024 06 21 42 6 1276 1282

10.1002/jor.25782

38245845

156

Koga

Martin

Dickson

Evaluating the performance of large language models: ChatGPT and Google Bard in generating differential diagnoses in clinicopathological conferences of neurodegenerative disorders

Brain Pathol 2024 05 08 34 3 e13207

10.1111/bpa.13207

37553205

PMC11006994

157

Warrier

Singh

Haleem

Zaki

Eloy

The comparative diagnostic capability of large language models in otolaryngology

Laryngoscope 2024 09 02 134 9 3997 4002

10.1002/lary.31434

38563415

158

Kumar

Sivan

Bachir

Sarwar

Ruzicka

O'Malley

Lobo

Morales

Cassimatis

Hundal

Patel

Can artificial intelligence mitigate missed diagnoses by generating differential diagnoses for neurosurgeons?

World Neurosurg 2024 07 187 e1083 e1088

10.1016/j.wneu.2024.05.052

38759788

S1878-8750(24)00814-3

159

Hirosawa

Harada

Mizuta

Sakamoto

Tokumasu

Shimizu

Diagnostic performance of generative artificial intelligences for a series of complex case reports

Digit Health 2024 07 21 10 20552076241265215

10.1177/20552076241265215

39229463

10.1177_20552076241265215

PMC11369864

160

Mandalos

Tsouris

Artificial versus human intelligence in the diagnostic approach of ophthalmic case scenarios: a qualitative evaluation of performance and consistency

Cureus 2024 06 16 6 e62471

10.7759/cureus.62471

39015855

PMC11251728

161

Krusche

Callhoff

Knitza

Ruffer

Diagnostic accuracy of a large language model in rheumatology: comparison of physician and ChatGPT-4

Rheumatol Int 2024 02 24 44 2 303 306

10.1007/s00296-023-05464-6

37742280

PMC10796566

162

Delsoz

Madadi

Raja

Munir

Tamm

Mehravaran

Soleimani

Djalilian

Yousefi

Performance of ChatGPT in diagnosis of corneal eye diseases

Cornea 2024 05 01 43 5 664 670

10.1097/ICO.0000000000003492

38391243

00003226-202405000-00019

163

Kozel

Gurses

Gecici

Gökalp

Elif

Bahadir

Merenzon

Shah

Komotar

Ivan

Chat-GPT on brain tumors: an examination of artificial intelligence/machine learning's ability to provide diagnoses and treatment plans for example neuro-oncology cases

Clin Neurol Neurosurg 2024 04 239 108238

10.1016/j.clineuro.2024.108238

38507989

S0303-8467(24)00125-2

164

Stoneham

Livesey

Cooper

Mitchell

ChatGPT versus clinician: challenging the diagnostic capabilities of artificial intelligence in dermatology

Clin Exp Dermatol 2024 06 25 49 7 707 710

10.1093/ced/llad402

37979201

7429032

165

Albaladejo

Lorleac'h

Allain

[The spring of artificial intelligence: AI vs. expert for internal medicine cases]

Rev Med Interne 2024 07 45 7 409 414

10.1016/j.revmed.2024.01.012

38331591

S0248-8663(24)00032-8

166

Zandi

Fahey

Drakopoulos

Bryan

Dong

Bryar

Bidwell

Bowen

Lavine

Mirza

Exploring diagnostic precision and triage proficiency: a comparative study of GPT-4 and Bard in addressing common ophthalmic complaints

Bioengineering (Basel) 2024 01 26 11 2 120

10.3390/bioengineering11020120

38391606

bioengineering11020120

PMC10886029

167

Hirosawa

Kawamura

Harada

Mizuta

Tokumasu

Kaji

Suzuki

Shimizu

ChatGPT-generated differential diagnosis lists for complex case-derived clinical vignettes: diagnostic accuracy evaluation

JMIR Med Inform 2023 10 09 11 e48808

10.2196/48808

37812468

v11i1e48808

PMC10594139

168

Hirosawa

Harada

Yokose

Sakamoto

Kawamura

Shimizu

Diagnostic accuracy of differential-diagnosis lists generated by Generative Pretrained Transformer 3 Chatbot for clinical vignettes with common chief complaints: a pilot study

Int J Environ Res Public Health 2023 02 15 20 4 3378

10.3390/ijerph20043378

36834073

ijerph20043378

PMC9967747

169

Fraser

Crossland

Bacher

Ranney

Madsen

Hilliard

Comparison of diagnostic and triage accuracy of Ada Health and WebMD symptom checkers, ChatGPT, and Physicians for Patients in an emergency department: clinical data analysis study

JMIR Mhealth Uhealth 2023 10 03 11 e49995

10.2196/49995

37788063

v11i1e49995

PMC10582809

170

Rojas-Carabali

Cifuentes-González

Carlos

Wei

Putera

Sen

Thng

Agrawal

Elze

Sobrin

Kempen

Lee

Biswas

Nguyen

Gupta

de-la-Torre

Agrawal

Evaluating the diagnostic accuracy and management recommendations of ChatGPT in uveitis

Ocul Immunol Inflamm 2024 10 18 32 8 1526 1531

10.1080/09273948.2023.2253471

37722842

171

Gräf

Markus

Knitza

Leipe

Krusche

Welcker

Kuhn

Mucke

Hueber

Hornig

Klemm

Kleinert

Aries

Vuillerme

Simon

Kleyer

Schett

Callhoff

Comparison of physician and artificial intelligence-based symptom checker diagnostic accuracy

Rheumatol Int 2022 12 42 12 2167 2176

10.1007/s00296-022-05202-4

36087130

PMC9548469

172

Ward

Unadkat

Toscano

Kashanian

Alon

Lynch

Daniel G

Horn

Alexander C

D'Amico

Randy S

Mittler

Mark

Baum

Griffin R

A quantitative assessment of ChatGPT as a neurosurgical triaging tool

Neurosurgery 2024 08 01 95 2 487 495

10.1227/neu.0000000000002867

38353523

00006123-990000000-01055

173

Hirosawa

Mizuta

Harada

Shimizu

Comparative evaluation of diagnostic accuracy between Google Bard and physicians

Am J Med 2023 11 136 11 1119 1123.e18

10.1016/j.amjmed.2023.08.003

37643659

S0002-9343(23)00536-3

174

Lyons

Arepalli

Fromal

Choi

Jain

Artificial intelligence chatbot performance in triage of ophthalmic conditions

Can J Ophthalmol 2024 08 59 4 e301 e308

10.1016/j.jcjo.2023.07.016

37572695

S0008-4182(23)00234-X

175

Makhoul

Melkane

Khoury

Hadi

Matar

A cross-sectional comparative study: ChatGPT 3.5 versus diverse levels of medical experts in the diagnosis of ENT diseases

Eur Arch Otorhinolaryngol 2024 05 16 281 5 2717 2721

10.1007/s00405-024-08509-z

38365990

176

Shemer

Cohen

Altarescu

Atar-Vardi

Hecht

Dubinsky-Pertzov

Shoshany

Zmujack

Einan-Lifshitz

Pras

Diagnostic capabilities of ChatGPT in ophthalmology

Graefes Arch Clin Exp Ophthalmol 2024 07 06 262 7 2345 2352

10.1007/s00417-023-06363-z

38183467

177

Gunes

Cesur

The diagnostic performance of large language models and general radiologists in thoracic radiology cases: a comparative study

J Thorac Imaging 2024 09 13 2024

10.1097/RTI.0000000000000805

39269227

00005382-990000000-00153

178

Sarangi

Irodi

Panda

Nayak

DSK

Mondal

Radiological differential diagnoses based on cardiovascular and thoracic imaging patterns: perspectives of four large language models

Indian J Radiol Imaging 2024 04 28 34 2 269 275

10.1055/s-0043-1777289

38549881

IJRI-23-9-2923

PMC10972667

179

Berg

van Bakel

van de Wouw

Jie

Schipper

Jansen

O'Connor

Rory D

van Ginneken

Kurstjens

ChatGPT and generating a differential diagnosis early in an emergency department presentation

Ann Emerg Med 2024 01 83 1 83 86

10.1016/j.annemergmed.2023.08.003

37690022

S0196-0644(23)00642-X

180

Haider

Pressman

Borna

Gomez-Cabello

Sehgal

Leibovich

Forte

Evaluating large language model (LLM) performance on established breast classification systems

Diagnostics (Basel) 2024 07 11 14 14 1491

10.3390/diagnostics14141491

39061628

diagnostics14141491

PMC11275570

181

Gan

Ogbodo

Wee

Gan

González

Pedro Arcos

Performance of Google Bard and ChatGPT in mass casualty incidents triage

Am J Emerg Med 2024 01 75 72 78

10.1016/j.ajem.2023.10.034

37967485

S0735-6757(23)00576-4

182

Aiumtrakul

Thongprayoon

Arayangkool

Wannaphut

Suppadungsuk

Krisanapan

Garcia Valencia

Qureshi

Miao

Cheungpasitporn

Personalized medicine in urolithiasis: AI chatbot-assisted dietary management of oxalate for kidney stone prevention

J Pers Med 2024 01 18 14 1 107

10.3390/jpm14010107

38248809

jpm14010107

PMC10817681

183

Wang

Gao

Dantona

Hull

Sun

DRG-LLaMA: tuning LLaMA model to predict diagnosis-related group for hospitalized patients

NPJ Digit Med 2024 01 22 7 1 16

10.1038/s41746-023-00989-3

38253711

PMC10803802

184

Singhal

Azizi

Mahdavi

Wei

Chung

Scales

Tanwani

Cole-Lewis

Pfohl

Payne

Seneviratne

Gamble

Kelly

Babiker

Schärli

Nathanael

Chowdhery

Mansfield

Demner-Fushman

Agüera Y Arcas

Blaise

Webster

Corrado

Matias

Chou

Gottweis

Tomasev

Liu

Rajkomar

Barral

Semturs

Karthikesalingam

Natarajan

Large language models encode clinical knowledge

Nature 2023 08 620 7972 172 180

10.1038/s41586-023-06291-2

37438534

PMC10396962

185

Thirunavukarasu

Ting

DSJ

Elangovan

Gutierrez

Tan

Ting

DSW

Large language models in medicine

Nat Med 2023 08 17 29 8 1930 1940

10.1038/s41591-023-02448-8

37460753

186

Researcher Access Program application

OpenAI 2025-04-21

https://platform.openai.com/docs/model-index-for-researchers

187

Zhou

Sun

Chen

Chu

Zhou

Liao

Zhang

Afvari

Gao

Pre-trained multimodal large language model enhances dermatological diagnosis using SkinGPT-4

Nat Commun 2024 07 05 15 1 5649

10.1038/s41467-024-50043-3

38969632

PMC11226626

188

Zhang

Dan

Jiang

Zhang

ChatDoctor: a medical chat model fine-tuned on a large language model meta-AI (LLaMA) using medical domain knowledge

Cureus 2023 06 15 6 e40895

10.7759/cureus.40895

37492832

PMC10364849

189

Bagde

Dhopte

Alam

Basri

A systematic review and meta-analysis on ChatGPT and its utilization in medical and dental research

Heliyon 2023 12 9 12 e23050

10.1016/j.heliyon.2023.e23050

38144348

S2405-8440(23)10258-1

PMC10746423

190

Levin

Horesh

Brezinov

Meyer

Performance of ChatGPT in medical examinations: a systematic review and a meta-analysis

BJOG 2024 02 131 3 378 380

10.1111/1471-0528.17641

37604703