Journal of Medical Internet Research (J Med Internet Res)
ISSN 1438-8871
JMIR Publications, Toronto, Canada
Volume 27, Issue 1, e82729
doi: 10.2196/82729

Letter to the Editor
Authors' Reply: Critical Limitations in Systematic Reviews of Large Language Models in Health Care

Andre Python, PhD (1,2,3); HongYi Li, BS (1,4); Jun-Fen Fu, MD, PhD (5,6,7)

1. Center for Data Science, Zhejiang University, Hangzhou, China
2. School of Medicine, Zhejiang University, Hangzhou, China
3. Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, Roosevelt Drive, Oxford, United Kingdom
4. School of Mathematical Sciences, Zhejiang University, Hangzhou, China
5. School of Medicine, Children’s Hospital of Zhejiang University, Hangzhou, China
6. National Clinical Research Center for Child Health, Hangzhou, China
7. National Regional Center for Children's Health, Hangzhou, China

Editor: Tiffany Leung

Correspondence to Andre Python, PhD, Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, Roosevelt Drive, Oxford, OX3 7BN, United Kingdom, 44 01865 287500; andre.python@well.ox.ac.uk

Published: 24 September 2025 (e82729)

Article history dates: 20 August 2025; 26 August 2025; 29 August 2025

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.

https://www.jmir.org/2025/1/e81769

https://www.jmir.org/2025/1/e71916

Keywords: large language model; LLM; clinical; artificial intelligence; AI; digital health; LLM review; review; letter

Introduction

We thank the correspondent for engaging with our original work [1] and raising constructive points in their Letter [2].

Citation Threshold Bias

We acknowledge that the citation criteria applied to select journals may exclude relevant studies from emerging or specialized venues. Given the rapidly expanding literature, our criteria were not only desirable but necessary to balance comprehensiveness with methodological quality. To mitigate the risk of omitting innovative research, we (1) screened and incorporated all relevant articles from the main database platforms as well as e-prints and (2) made available an interactive online guideline that offers clinicians an up-to-date guide.

Definition of “Best Performance”

We acknowledge the concerns associated with the performance comparison of models across heterogeneous contexts. To avoid ambiguity and misinterpretation, we stated and discussed in detail that, in our study, the term “best performance” is solely associated with the findings from the reviewed studies. Our analysis helps identify models successfully applied in clinical studies, without aiming at or implying comparison across domains. We direct readers to the excellent recent work by Liu et al [3] for a comparison of lightweight large language models (LLMs) for medical tasks.

Quality Assessment of the Included Studies

We carried out a thorough quality assessment following PRISMA guidelines [4]. This might have escaped the correspondent’s attention, as the details are provided in Multimedia Appendix 2 of our work [1].

Clinical Workflow

The suggested 5-stage workflow neither ignores nor intends to capture the complexity of clinical practice. Rather, it serves as a framework to associate the reported use of LLMs with tasks and processes familiar to clinicians, in line with a previous study [5]. Our workflow offers a practical assessment of the role and extent of LLMs applied in clinically relevant sectors of activities and tasks.

Clinical Validation Gap

We acknowledge and discuss the challenges in assessing the practicality of deploying LLMs in clinical applications. Complementary to benchmarking LLMs on research datasets, our review covers studies using LLMs in both research and clinical settings. While we identified key challenges of LLMs in real-world applications, a comprehensive assessment of discrepancies between research and clinical settings is beyond the scope of our review.

Safety and Risk Analyses

While our review discusses key concerns about the use of LLMs in clinical settings, including hallucination risks and ethical considerations, a comprehensive risk assessment is beyond its scope. Future research dedicated to tackling this key topic would require substantial effort.

Economic Evaluation

Our review assesses the associated costs of the graphics processing unit memory and its cooling requirements by process and clinical tasks. Our interactive online guideline will regularly incorporate future changes in the requirements and costs, as exemplified by the recent rise of lightweight LLMs that may offer excellent performance on consumer-grade hardware. However, a comprehensive cost-effectiveness or return-on-investment analysis is beyond the study scope.

Conclusion

These observations are a timely reminder that our current understanding of the application of LLMs in clinical settings remains provisional and that we need continual reassessment of their current and future roles in health care practice.

We declare that no part of this submission has been generated by AI.

Conflicts of Interest

None declared.

Abbreviations

LLM: large language model

References

1. Python. Implementing large language models in health care: clinician-focused review with interactive guideline. J Med Internet Res 2025 Jul 11;27:e71916. doi: 10.2196/71916. PMID: 40644686

2. Weizman. Critical limitations in systematic reviews of large language models in health care. J Med Internet Res 2025;27:e81769. doi: 10.2196/81769

3. Liu, Zhou. Application of large language models in medicine. Nat Rev Bioeng 2025;3(6):445-464. doi: 10.1038/s44222-025-00279-5

4. Page, McKenzie, Bossuyt. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 2021 Mar 29;372:n71. doi: 10.1136/bmj.n71. PMID: 33782057

5. Betzler, Chen, Cheng. Large language models and their impact in ophthalmology. Lancet Digit Health 2023 Dec;5(12):e917-e924. doi: 10.1016/S2589-7500(23)00201-7. PMID: 38000875