Journal of Medical Internet Research (J Med Internet Res)

ISSN 1438-8871

JMIR Publications, Toronto, Canada

Volume 27, issue 1, e81769

doi: 10.2196/81769

Letter to the Editor

Critical Limitations in Systematic Reviews of Large Language Models in Health Care

Zvi Weizman, MD, Prof Dr Med

Faculty of Health Sciences, Ben-Gurion University, 8 Balfour Street, Tel-Aviv, Israel

Edited by Tiffany Leung

Correspondence to: Zvi Weizman, MD, Prof Dr Med, Faculty of Health Sciences, Ben-Gurion University, 8 Balfour Street, Tel-Aviv, 6521120, Israel. Phone: 972 544888686. Email: wzvi@bgu.ac.il

Published 24 September 2025 (e81769)

Article history: 3 August 2025; 5 August 2025; 29 August 2025

Copyright 2025

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.

Related articles: https://www.jmir.org/2025/1/e82729 and https://www.jmir.org/2025/1/e71916

Keywords: letter; large language models; AI; health care; review; LLM; clinical; artificial intelligence; digital health

Introduction

I read with interest the study by Li et al [1] on the implementation of large language models (LLMs) in health care, which offers clinicians guidance for selecting appropriate models for specific tasks. Although the review provides a comprehensive overview, several limitations undermine its utility for clinical decision-making.

Citation Threshold Bias

The authors exclude journals below a citation threshold of 13,000, which introduces a publication bias: the threshold filters out innovative research from emerging or specialized journals, as documented in the methodology literature. This is problematic in a rapidly evolving field where important innovations may first appear in newer venues. Although the authors note that only 8.9% (24/270) of studies reported negative results, which could distort the overall perception of the clinical effectiveness of LLMs, they do not adequately account for this publication bias.

Flawed Performance Definition

The definition of “best performance” is problematic. The authors acknowledge that performance in one context does not guarantee similar performance in another, and they therefore state that the frequency of “best performance” should not be interpreted as a metric for comparing models. This acknowledgment undermines their quantitative analysis. The heterogeneity in evaluation metrics, datasets, and contexts across studies renders their performance comparisons essentially meaningless, a problem well documented in the AI literature [2].

Limited Quality Assessment

The review lacks a quality assessment of the included studies. A recent meta-analysis in medical AI has emphasized the importance of evaluating study design, validation approaches, and statistical rigor [3]. The authors’ approach of simply counting “best performance” instances without considering study quality, sample sizes, or validation rigor represents a significant methodological weakness.

Conceptual and Analytical Limitations

The 5-stage linear workflow model, while organizationally useful, oversimplifies the complex and iterative nature of clinical decision-making. Modern health care delivery involves parallel processes, feedback loops, and multidisciplinary coordination that this model fails to capture, thereby limiting the practical utility of its recommendations [4].

Insufficient Discussion of Clinical Validation

The authors inadequately address the critical gap between research performance and clinical validation. As noted in recent systematic reviews of AI in health care, models trained and validated on research datasets face substantial deployment challenges in medical institutions because laboratory and clinical settings differ significantly. While the authors mention this limitation, they do not adequately weigh it in their analysis.

Limited Safety and Risk Analysis

Although the authors discuss ethical concerns, their analysis of patient safety remains superficial. Recent literature emphasizes the critical importance of comprehensive risk assessment in implementing medical AI, including analysis of failure modes, error propagation, and impacts on clinical decision-making [5].

Absence of Economic Evaluation

The review lacks a comprehensive economic evaluation of LLM implementation, including cost-effectiveness analyses, resource allocation considerations, and return-on-investment assessments. Taken together, the limitations described in this letter significantly reduce the review’s clinical applicability and highlight the need for more rigorous methodological approaches in evaluating AI in health care.

Conflicts of Interest

None declared.

Abbreviations

LLM: large language model

References

1. Li, Fu, Python. Implementing large language models in health care: clinician-focused review with interactive guideline. J Med Internet Res. 2025 Jul 11;27:e71916. doi: 10.2196/71916. PMID: 40644686.

2. Chang, Yin, Liu, Cao, Lin. Applications and future prospects of medical LLMs: a survey based on the M-KAT conceptual framework. J Med Syst. 2024 Dec 27;48(1):112. doi: 10.1007/s10916-024-02132-5. PMID: 39725770.

3. Liu, Cruz Rivera, Moher. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat Med. 2020 Sep;26(9):1364-1374. doi: 10.1038/s41591-020-1034-x.

4. Sittig, Singh. A new sociotechnical model for studying health information technology in complex adaptive healthcare systems. Qual Saf Health Care. 2010 Oct;19(Suppl 3):i68-i74. doi: 10.1136/qshc.2010.042085. PMID: 20959322.

5. Sendak, Ratliff, Sarro. Real-world integration of a sepsis deep learning technology into routine clinical care: implementation study. JMIR Med Inform. 2020 Jul 15;8(7):e15182. doi: 10.2196/15182. PMID: 32673244.