<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.0 20040830//EN" "journalpublishing.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="2.0" xml:lang="en" article-type="article-commentary"><front><journal-meta><journal-id journal-id-type="nlm-ta">J Med Internet Res</journal-id><journal-id journal-id-type="publisher-id">jmir</journal-id><journal-id journal-id-type="index">1</journal-id><journal-title>Journal of Medical Internet Research</journal-title><abbrev-journal-title>J Med Internet Res</abbrev-journal-title><issn pub-type="epub">1438-8871</issn><publisher><publisher-name>JMIR Publications</publisher-name><publisher-loc>Toronto, Canada</publisher-loc></publisher></journal-meta><article-meta><article-id pub-id-type="publisher-id">v28i1e95004</article-id><article-id pub-id-type="doi">10.2196/95004</article-id><article-categories><subj-group subj-group-type="heading"><subject>Commentary</subject></subj-group></article-categories><title-group><article-title>Beyond GPT-4: The Rapidly Evolving Potential of Large Language Models for Clinical Guideline Improvement</article-title></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><name name-style="western"><surname>Nelson</surname><given-names>Scott D</given-names></name><degrees>PharmD, MS</degrees><xref ref-type="aff" rid="aff1"/></contrib><contrib contrib-type="author"><name name-style="western"><surname>Wright</surname><given-names>Adam</given-names></name><degrees>PhD</degrees><xref ref-type="aff" rid="aff1"/></contrib></contrib-group><aff id="aff1"><institution>Department of Biomedical Informatics, School of Medicine, Vanderbilt University Medical Center</institution><addr-line>3401 West End Ave</addr-line><addr-line>Nashville</addr-line><addr-line>TN</addr-line><country>United States</country></aff><contrib-group><contrib contrib-type="editor"><name 
name-style="western"><surname>Leung</surname><given-names>Tiffany</given-names></name></contrib></contrib-group><author-notes><corresp>Correspondence to Scott D Nelson, PharmD, MS, Department of Biomedical Informatics, School of Medicine, Vanderbilt University Medical Center, 3401 West End Ave, Nashville, TN, 37203, United States, 1 6158759347; <email>scott.nelson@vumc.org</email></corresp></author-notes><pub-date pub-type="collection"><year>2026</year></pub-date><pub-date pub-type="epub"><day>10</day><month>4</month><year>2026</year></pub-date><volume>28</volume><elocation-id>e95004</elocation-id><history><date date-type="received"><day>09</day><month>03</month><year>2026</year></date><date date-type="rev-recd"><day>20</day><month>03</month><year>2026</year></date><date date-type="accepted"><day>20</day><month>03</month><year>2026</year></date></history><copyright-statement>&#x00A9; Scott D Nelson, Adam Wright. Originally published in the Journal of Medical Internet Research (<ext-link ext-link-type="uri" xlink:href="https://www.jmir.org">https://www.jmir.org</ext-link>), 10.4.2026. </copyright-statement><copyright-year>2026</copyright-year><license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (<ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. 
The complete bibliographic information, a link to the original publication on <ext-link ext-link-type="uri" xlink:href="https://www.jmir.org/">https://www.jmir.org/</ext-link>, as well as this copyright and license information must be included.</p></license><self-uri xlink:type="simple" xlink:href="https://www.jmir.org/2026/1/e95004"/><related-article related-article-type="commentary article" ext-link-type="doi" xlink:href="10.2196/81915" xlink:title="Comment on" xlink:type="simple">https://www.jmir.org/2026/1/e81915</related-article><abstract><p>This commentary reviews the study by Jones et al, which evaluated whether GPT-4 could improve the readability of injectable medication guidelines while preserving important safety information. The study found that GPT-4 produced modest readability gains comparable to manual revision, but also introduced omissions and meaning changes in a minority of sections. These findings highlight both the potential and limitations of early large language models (LLMs) in clinical contexts. However, this study reflects the capabilities of a specific model in a rapidly evolving domain. Since the release of GPT-4, advances in multistep reasoning, model-critique workflows, and structured validation have substantially improved the ability of newer systems to detect omissions, maintain factual fidelity, and support controlled editing. As a result, some documented limitations may stem from the constraints of a single-model, single-pass workflow rather than intrinsic flaws in LLM-assisted guideline revision. This commentary highlights the need for evaluation frameworks that can keep pace with LLM progress and emphasizes that clinical oversight and user-centered testing remain essential. 
Updated research using contemporary models is needed to determine how emerging architectures can more safely support clarity, consistency, and maintenance of clinical guidelines.</p></abstract><kwd-group><kwd>artificial intelligence</kwd><kwd>clinical guidelines</kwd><kwd>large language model</kwd><kwd>patient safety</kwd><kwd>readability</kwd><kwd>clinical decision support</kwd></kwd-group></article-meta></front><body><sec id="s1" sec-type="intro"><title>Introduction</title><p>In their recent study, Jones et al [<xref ref-type="bibr" rid="ref1">1</xref>] evaluated a GPT-4&#x2013;based pipeline for improving the readability of 20 guidelines from the United Kingdom&#x2019;s National Health Service Injectable Medicines Guide (IMG). The authors found that GPT-4 produced modest but statistically significant readability improvements, and expert pharmacist reviewers rated the revised versions as easier to understand in 26 of 60 (43%) ratings. The readability gains were comparable to those achieved by manual revision by guideline authors; however, the greatest improvements were seen for two guidelines (aminophylline and voriconazole) that had undergone iterative user testing in a previous study.</p><p>Notably, using the large language model (LLM) was not without risk. At least one pharmacist reviewer identified omissions in 30 of 153 subsections (20%), additions in 7 subsections (5%), and changes in meaning in 18 subsections (12%). Eight subsections had omissions identified by all 3 reviewers, but no additions or changes in meaning were unanimously flagged. Overall, 65% of all identified issues were flagged by only a single reviewer. 
The authors concluded that GPT-4 could help augment, rather than replace, manual expert review and user-centered testing to improve guideline readability.</p></sec><sec id="s2"><title>The Challenge of Evaluating a Moving Target</title><p>Interpreting these findings requires an appreciation of how rapidly LLM technology has advanced since the release of GPT-4 in March 2023, an eternity by the standards of current artificial intelligence development. By the time the study was published, newer models had already introduced major improvements in error checking, multistep reasoning, and structured critique workflows [<xref ref-type="bibr" rid="ref2">2</xref>,<xref ref-type="bibr" rid="ref3">3</xref>]. This creates a temporal mismatch in which clinical research and guideline development cycles operate on the scale of <italic>years</italic>, while LLM development cycles operate on the scale of <italic>months</italic> and could continue to accelerate [<xref ref-type="bibr" rid="ref4">4</xref>]. Thus, this study should be viewed as an evaluation of a specific model at a fixed point in time, not a judgment on the overall potential of LLMs in guideline development and review workflows. This is not a criticism of the study itself; the authors designed a careful, well-controlled evaluation, and their findings are valuable precisely because they document specific failure modes that future systems must address. Rather, it is a call to develop more agile evaluation frameworks that can keep pace with technological change, so that evidence generation does not perpetually lag behind the tools available for deployment.</p><p>For example, the GPT-4 pipeline relied on a single model for both the editing and quality assurance steps. This architecture creates an inherent tension: because simplification commonly involves removing content, omissions are to be expected. 
Newer multiagent systems separate generation from critique, allowing an editor model to propose revisions while a critic model checks for completeness, consistency, and factual accuracy [<xref ref-type="bibr" rid="ref5">5</xref>]. Structured reasoning frameworks, such as tree-of-thought and self-consistency, enable models to justify edits and cross-check them before finalizing. Skill-based architectures allow explicit function calls to validate medication names, units, values, and section completeness, replacing soft prompts with enforceable programmatic safeguards, while dynamic prompt optimization can iteratively refine instructions to prevent prompt-induced errors. These advances do not eliminate the need for clinician oversight, but they offer more robust mechanisms for preserving informational fidelity while improving readability.</p></sec><sec id="s3"><title>Contextualizing the Use Case</title><p>The IMG guidelines represent a relatively favorable use case for LLM revision: procedural, deterministic instructions with clear ground truth. This contrasts with other clinical practice guidelines, such as those for disease management, which pose far greater challenges because they must synthesize heterogeneous evidence, navigate evidence gaps, and rely on expert consensus drawn from nonrepresentative study populations. Moreover, the IMG is a nationally curated, professionally edited resource. Improving readability on an already well-crafted guideline is inherently challenging, yet the model still produced modest readability improvements. In practice, LLMs may offer even greater value for locally developed clinical guidelines and documents, which are often produced under time pressure with less editorial rigor. Enhancing the clarity and consistency of these documents could improve staff comprehension and ultimately lower the risk of downstream errors. 
This study tested the LLM against the hardest version of the problem, improving something already well-crafted, while the real-world opportunity may lie in lifting up the documents that need it most.</p></sec><sec id="s4"><title>The Continued Need for User-Centered Testing</title><p>User testing remains the guideline improvement technique with the strongest evidence base, and the authors&#x2019; own data confirm this. Future studies should also include the intended end users. For example, the IMG is primarily used by nurses, who may prioritize quick scannability and visual hierarchy over the pharmacological completeness that pharmacist reviewers would naturally emphasize. The interaction between content accuracy and practical usability can only be fully understood by the people who use these documents at the point of care.</p></sec><sec id="s5"><title>Beyond Readability</title><p>Improving readability is only one part of making guidelines more usable. Simplifying text naturally risks omitting information, while providing excessive detail poses its own risks: dense guidelines can cause clinicians to overlook or misinterpret critical information. In the study, 65% of errors were identified by only a single pharmacist, suggesting that many issues were subtle and difficult to detect. The real question is how we can preserve and present essential information in the clearest, most usable form. Newer multimodal models offer approaches beyond text alone. They can generate diagrams, flowcharts, and annotated step-by-step visuals, which may communicate procedural information, such as reconstitution or infusion setup, more effectively than narrative text [<xref ref-type="bibr" rid="ref6">6</xref>]. These multimodal formats can reduce cognitive load and help clinicians understand complex instructions more intuitively.</p><p>Furthermore, LLMs have broader potential in the guideline ecosystem. 
They could support translation into other languages, improving access and equity in multilingual care settings [<xref ref-type="bibr" rid="ref7">7</xref>], though clinical translation would require rigorous verification [<xref ref-type="bibr" rid="ref8">8</xref>]. LLMs may also enable just-in-time guidance by retrieving and tailoring the relevant portion of a guideline to a clinician&#x2019;s immediate question, which would often be more valuable than improving long documents clinicians may not have time to read. In addition, LLMs could assist with clinical decision support (CDS) maintenance, helping translate updated recommendations into structured CDS logic or flagging conflicts between new evidence and existing rules, reducing alert fatigue and easing the burden of keeping CDS systems current [<xref ref-type="bibr" rid="ref9">9</xref>]. These applications warrant dedicated study using current-generation models.</p></sec><sec id="s6" sec-type="conclusions"><title>Conclusion</title><p>Jones et al [<xref ref-type="bibr" rid="ref1">1</xref>] provide a rigorous, timely evaluation of GPT-4&#x2019;s capabilities and limitations in revising medication guidelines. Their findings identify clear failure modes that future systems must overcome. As LLM architectures continue to advance, updated evaluations are essential to determine how well newer systems address the documented issues and how they can safely support clinicians, guideline authors, and health care organizations. None of these advances eliminates the need for clinician oversight or user-centered testing. The goal is to equip guideline authors and informatics teams with powerful tools for improving how clinical knowledge is communicated and delivered at the point of care. The evidence base must evolve alongside the technology.</p></sec></body><back><ack><p>The authors declare the use of generative AI (GAI) in the research and writing process. 
According to the GAIDeT taxonomy (2025), the following tasks were delegated to GAI tools under full human supervision: text generation, proofreading and editing, and summarizing text. The GAI tool used was Microsoft Copilot. Responsibility for the final manuscript lies entirely with the authors. GAI tools are not listed as authors and do not bear responsibility for the final outcomes.</p></ack><notes><sec><title>Funding</title><p>The authors declared no financial support was received for this work.</p></sec></notes><fn-group><fn fn-type="con"><p>SDN conceptualized the study and was responsible for writing the original draft. SDN and AW contributed to writing, review, and editing of the manuscript.</p></fn><fn fn-type="conflict"><p>SDN serves on the advisory board for Merative Micromedex and Baxter Healthcare. AW declares no conflicts of interest.</p></fn></fn-group><glossary><title>Abbreviations</title><def-list><def-item><term id="abb1">CDS</term><def><p>clinical decision support</p></def></def-item><def-item><term id="abb2">IMG</term><def><p>Injectable Medicines Guide</p></def></def-item><def-item><term id="abb3">LLM</term><def><p>large language model</p></def></def-item></def-list></glossary><ref-list><title>References</title><ref id="ref1"><label>1</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Jones</surname><given-names>MD</given-names> </name><name name-style="western"><surname>Torgbi</surname><given-names>M</given-names> </name><name name-style="western"><surname>Tayyar Madabushi</surname><given-names>H</given-names> </name></person-group><article-title>Improving the understandability of clinical guidelines: development and evaluation of a GPT-4-based pipeline</article-title><source>J Med Internet Res</source><year>2026</year><month>02</month><day>23</day><volume>28</volume><fpage>e81915</fpage><pub-id pub-id-type="doi">10.2196/81915</pub-id><pub-id 
pub-id-type="medline">41730207</pub-id></nlm-citation></ref><ref id="ref2"><label>2</label><nlm-citation citation-type="web"><article-title>Introducing GPT-5</article-title><source>OpenAI</source><year>2025</year><access-date>2026-04-06</access-date><comment><ext-link ext-link-type="uri" xlink:href="https://openai.com/index/introducing-gpt-5">https://openai.com/index/introducing-gpt-5</ext-link></comment></nlm-citation></ref><ref id="ref3"><label>3</label><nlm-citation citation-type="web"><person-group person-group-type="author"><name name-style="western"><surname>Stanciuc</surname><given-names>AM</given-names> </name></person-group><article-title>OpenAI&#x2019;s GPT-54 sets new records on professional benchmarks</article-title><source>The Next Web</source><year>2026</year><access-date>2026-04-06</access-date><comment><ext-link ext-link-type="uri" xlink:href="https://thenextweb.com/news/openai-gpt-54-launch-computer-use-benchmarks">https://thenextweb.com/news/openai-gpt-54-launch-computer-use-benchmarks</ext-link></comment></nlm-citation></ref><ref id="ref4"><label>4</label><nlm-citation citation-type="web"><person-group person-group-type="author"><name name-style="western"><surname>Aschenbrenner</surname><given-names>L</given-names> </name></person-group><source>Situational Awareness: The Decade Ahead</source><year>2024</year><access-date>2026-04-06</access-date><comment><ext-link ext-link-type="uri" xlink:href="https://situational-awareness.ai">https://situational-awareness.ai</ext-link></comment></nlm-citation></ref><ref id="ref5"><label>5</label><nlm-citation citation-type="other"><person-group person-group-type="author"><name name-style="western"><surname>Yuan</surname><given-names>Y</given-names> </name><name name-style="western"><surname>Xie</surname><given-names>T</given-names> </name></person-group><article-title>Reinforce LLM reasoning through multi-agent reflection</article-title><source>arXiv</source><comment>Preprint posted online on 
2025</comment><pub-id pub-id-type="doi">10.48550/arXiv.2506.08379</pub-id></nlm-citation></ref><ref id="ref6"><label>6</label><nlm-citation citation-type="book"><person-group person-group-type="author"><name name-style="western"><surname>Benito</surname><given-names>MD</given-names> </name><name name-style="western"><surname>Diana-Albelda</surname><given-names>C</given-names> </name><name name-style="western"><surname>Garc&#x00ED;a-Mart&#x00ED;n</surname><given-names>&#x00C1;</given-names> </name><name name-style="western"><surname>Bescos</surname><given-names>J</given-names> </name><name name-style="western"><surname>Vi&#x00F1;olo</surname><given-names>ME</given-names> </name><name name-style="western"><surname>SanMiguel</surname><given-names>JC</given-names> </name></person-group><person-group person-group-type="editor"><name name-style="western"><surname>Wu</surname><given-names>S</given-names> </name><name name-style="western"><surname>Shabestari</surname><given-names>B</given-names> </name><name name-style="western"><surname>Xing</surname><given-names>L</given-names> </name></person-group><article-title>MIRAGE: retrieval and generation of multimodal images and texts for medical education</article-title><source>Applications of Medical Artificial Intelligence. AMAI 2025. 
Lecture Notes in Computer Science, Vol 16206</source><year>2026</year><publisher-name>Springer</publisher-name><pub-id pub-id-type="doi">10.1007/978-3-032-09569-5_11</pub-id></nlm-citation></ref><ref id="ref7"><label>7</label><nlm-citation citation-type="book"><person-group person-group-type="author"><name name-style="western"><surname>Pavithra</surname><given-names>RS</given-names> </name></person-group><person-group person-group-type="editor"><name name-style="western"><surname>Zhao</surname><given-names>W</given-names> </name><name name-style="western"><surname>D&#x2019;Souza</surname><given-names>J</given-names> </name><name name-style="western"><surname>Eger</surname><given-names>S</given-names> </name><etal/></person-group><article-title>Bridging health literacy gaps in Indian languages: multilingual LLMs for clinical text simplification</article-title><source>Proceedings of The First Workshop on Human&#x2013;LLM Collaboration for Ethical and Responsible Science Production</source><year>2025</year><publisher-name>Association for Computational Linguistics</publisher-name><pub-id pub-id-type="doi">10.18653/v1/2025.sciprodllm-1.1</pub-id></nlm-citation></ref><ref id="ref8"><label>8</label><nlm-citation citation-type="other"><person-group person-group-type="author"><name name-style="western"><surname>Schlicht</surname><given-names>IB</given-names> </name><name name-style="western"><surname>Sayin</surname><given-names>B</given-names> </name><name name-style="western"><surname>Zhao</surname><given-names>Z</given-names> </name><name name-style="western"><surname>Labont&#x00E9;</surname><given-names>FM</given-names> </name><name name-style="western"><surname>Barbera</surname><given-names>C</given-names> </name><name name-style="western"><surname>Viviani</surname><given-names>M</given-names> </name><etal/></person-group><article-title>Disparities in multilingual LLM-based healthcare Q&#x0026;A</article-title><source>arXiv</source><comment>Preprint posted online on 
2025</comment><pub-id pub-id-type="doi">10.48550/arXiv.2510.17476</pub-id></nlm-citation></ref><ref id="ref9"><label>9</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Liu</surname><given-names>S</given-names> </name><name name-style="western"><surname>Wright</surname><given-names>AP</given-names> </name><name name-style="western"><surname>Patterson</surname><given-names>BL</given-names> </name><etal/></person-group><article-title>Using AI-generated suggestions from ChatGPT to optimize clinical decision support</article-title><source>J Am Med Inform Assoc</source><year>2023</year><month>06</month><day>20</day><volume>30</volume><issue>7</issue><fpage>1237</fpage><lpage>1245</lpage><pub-id pub-id-type="doi">10.1093/jamia/ocad072</pub-id><pub-id pub-id-type="medline">37087108</pub-id></nlm-citation></ref></ref-list></back></article>