Abstract
Background: Colorectal cancer (CRC) is a leading cause of cancer morbidity and mortality worldwide. The complexity of guideline-concordant care and unstructured clinical data has driven demand for decision-support tools. Large language models (LLMs) show promise for processing clinical data and patient–provider communication, yet evidence is fragmented, and a CRC-specific synthesis across the full care continuum is lacking.
Objective: This systematic review evaluates the current applications, performance determinants, and clinical implications of LLMs across the continuum of CRC care.
Methods: Following PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses), we searched 6 databases (PubMed, Embase, Web of Science, Scopus, CINAHL, Cochrane) through April 1, 2026. Eligible studies were peer-reviewed original investigations of LLMs on CRC tasks with extractable outcomes; reviews, editorials, and abstracts were excluded. Two reviewers assessed quality with QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies-2), PROBAST (prediction model risk of bias assessment tool), and ROBINS-I (Risk of Bias in Nonrandomized Studies - of Interventions). Data on model types, applications, prompts, input/output formats, and outcomes were analyzed descriptively, with narrative synthesis per synthesis without meta-analysis (SWiM) guidelines.
Results: Of 8880 records, 37 studies met inclusion criteria (2023‐2026), mostly from China and the United States, with GPT series most frequently evaluated. Overall risk of bias was low in 10/37 studies (27.0%), moderate in 14/37 (37.8%), unclear in 7/37 (18.9%), and high or serious in 6/37 (16.2%). Problematic domains included outcome measurement, intervention classification, patient selection, and lack of blinded assessment. LLMs showed utility in automating data extraction from clinical texts, supporting patient education, aiding diagnosis, and assisting clinical decision-making, with emerging visual interpretation and multimodal capacities. Domain-specific and multimodal models showed advantages over general-purpose models in certain tasks. Performance was significantly influenced by prompt design, from zero-shot queries to fine-tuning. Despite efficiency and outcome benefits, challenges persist regarding methodological quality, data privacy, and generalizability.
Conclusions: This review provides an integrative framework synthesizing evidence across study designs and LLM categories in CRC care. Unlike prior reviews addressing gastroenterology broadly or limited to one design, it covers the full CRC continuum and, for the first time, comparatively evaluates general-purpose, domain-specific, and multimodal LLMs, clarifying how prompt engineering and heterogeneous metrics shape outcomes. Although findings support LLMs’ clinical potential, results must be interpreted cautiously, given the moderate to low overall certainty of evidence. Most studies lacked safeguards against bias, such as blinded assessment, confounder adjustment, or prospective multicenter validation. Substantial heterogeneity across tasks, LLM types, prompts, reference standards, and outcomes means reported advantages cannot be generalized. Future work should prioritize real-world integration via prospective multicenter validation, robust privacy frameworks, and rigorous human oversight. Amid a rising global CRC burden and health care disparities, this review informs clinical translation, equitable scaling, and policy on LLM deployment.
Trial Registration: PROSPERO CRD420251248261; https://www.crd.york.ac.uk/PROSPERO/view/CRD420251248261
doi:10.2196/89862
Introduction
Colorectal cancer (CRC) is the third most commonly diagnosed malignancy and the second leading cause of cancer-related mortality worldwide, with incidence projected to rise substantially through 2050 []. Contemporary CRC care spans a long continuum: risk stratification, screening, endoscopic and histopathological diagnosis, multidisciplinary treatment, and long-term surveillance. Each stage generates dense, largely unstructured clinical text and requires time-sensitive, guideline-concordant decisions []. Working through this material manually is time-consuming and error-prone because of visual fatigue and information gaps inherent in voluminous clinical notes, pulling clinicians from direct patient care and straining both providers and institutional resources [,]. Within this context, large language models (LLMs) built on the Transformer architecture have emerged as a candidate interface between complex clinical text and decision support []. Compared with conventional clinical decision-support and patient education modalities, LLMs offer several distinct advantages: automated extraction and processing of large-scale clinical follow-up records [], real-time responses to patient inquiries regarding CRC symptoms and prevention [], guidance for geographically tailored screening strategies [], and enhanced adherence to clinical quality improvement initiatives []. They are also less constrained by outpatient scheduling or geographic disparities in health care resource distribution []. These capabilities can conserve clinician time and reduce operational costs while improving the accessibility, flexibility, and scalability of CRC-related health information for patients [].
Against this backdrop, research on LLMs in CRC has expanded rapidly between 2024 and 2026, spanning the entire care continuum. In screening and early detection, GPT-4 and its successors have been evaluated for risk-stratified counseling and family-history triage for hereditary CRC syndromes [,], while multiple studies have also explored the clinical utility of LLM-based tools, notably ChatGPT (OpenAI), for preoperative screening consultations and postoperative surveillance monitoring in CRC patients [,]. In endoscopy, LLMs have been applied to automate colonoscopy report generation [,]. In pathology, LLMs have been used to extract tumor–node–metastasis (TNM) descriptors and microsatellite instability status [,]. Therapeutic decision support has emerged as a particularly active area, with LLM recommendations benchmarked against multidisciplinary tumor board consensus [,]. The accelerating volume of these publications makes a focused, structured synthesis both timely and necessary.
Nevertheless, digital health models are not without limitations, including technically inaccurate outputs attributable to hallucinations [], quality assurance concerns in complex diagnostic and therapeutic recommendations [], and challenges related to model bias, limited generalizability, and the absence of physician empathy []. The emerging literature also reflects substantial heterogeneity, with findings that vary across studies. Model selection is one key factor []. Published studies have compared various general-purpose and medically fine-tuned models, with consistent reports distinguishing the performance of GPT-4-class and domain-tuned models from that of earlier or smaller backbones in oncology evaluations, while open-source models offer data-privacy advantages but display variable accuracy across CRC tasks [,]. Equally consequential is the choice of prompt engineering strategy: zero-shot prompting, few-shot prompting, chain-of-thought reasoning, retrieval-augmented generation (RAG), and guideline-grounded prompting yield markedly different accuracy on identical CRC questions [,-], with several studies reporting accuracy gains when few-shot or RAG approaches replace naive zero-shot baselines [,,]. Additional sources of heterogeneity include differences in evaluation rubrics, question framing, and prompt language [,]. Consequently, 2 studies addressing apparently similar questions can reach opposing conclusions. Amini et al [] assessed the clinical utility of freely available LLMs for colonoscopy surveillance interval recommendations across diverse settings, finding insufficient accuracy and notable limitations. In contrast, Chang et al [], using the more capable GPT-4 model and a guideline-anchored expert panel as reference, concluded that ChatGPT-4 exhibited accuracy comparable to professional gastroenterologists.
Within the gastroenterological domain, several reviews have mapped LLM applications. Gong conducted a systematic review of LLMs in gastroenterology and gastrointestinal endoscopy, categorizing applications into knowledge-based response evaluation and document automation, with most studies focusing on GPT-series models []. Omar et al [] reviewed 57 natural language processing (NLP) and LLM studies in gastroenterology and hepatology, confirming improved data extraction from electronic health records (EHRs) but noting persistent challenges in integrating these tools into routine clinical practice. Furthermore, a recent systematic review in lung cancer identified critical methodological limitations in primary LLM studies, notably a reliance on retrospective data and unclear risk of bias []. Given the fundamental differences in oncology protocols, the specific, multi-stage clinical trajectory of CRC, spanning distinct endoscopic, pathological, and surgical phases, necessitates an isolated, disease-specific appraisal to objectively evaluate LLM viability. However, a conspicuous gap remains: no systematic review has comprehensively evaluated the evidence for LLM applications specifically within the CRC domain. In particular, the information quality of LLM outputs across the full CRC care continuum has been insufficiently addressed in prior systematic reviews. Compounding this limitation, although recent studies have demonstrated that LLMs can achieve clinician-level performance in specific clinical tasks, substantial heterogeneity in model selection, prompt engineering strategies, and evaluation metrics precludes generalizable conclusions [,].
Accordingly, this systematic review aims to evaluate the performance of different LLM categories across the full CRC care continuum, address evidence gaps arising from fragmented research practices, and provide a foundation for future research and clinical translation, covering use cases, model types, optimization strategies, limitations, and future directions. Specifically, this review seeks to (1) map LLM applications across the principal clinical domains of CRC management; (2) compare general-purpose, domain-specific, and multimodal LLMs under different prompt engineering and fine-tuning strategies; and (3) classify included studies by research design and apply the corresponding quality appraisal tools to assess the credibility of individual studies.
Methods
Eligibility Criteria
The eligibility criteria for this review were established according to the PICOS (Population, Intervention, Comparison, Outcome, Study design) framework, as detailed in .
| Criteria | Definition |
| --- | --- |
| Participants | General population or patients with CRC. |
| Intervention | Artificial intelligence (AI), specifically LLMs applied in CRC management. These may be applications used by patients or health care providers for auxiliary diagnosis, information extraction, knowledge-based question answering, treatment decision-making, predictive modeling, or scientific research. LLMs are advanced AI systems designed to process complex clinical data, support decision-making, and enable effective communication. |
| Control | Applicable exclusively to comparative study designs: standard clinical evaluation by health care professionals or conventional non-LLM computational algorithms. Studies without a control group were eligible for inclusion if the other criteria were met. |
| Outcomes | Clinical and performance effectiveness (eg, accuracy, F1-score, area under the curve, sensitivity, concordance rate) and qualitative or utility measures (eg, response completeness, clarity, comprehensiveness, guideline adherence). |
| Study types | All study types were considered (eg, exploratory or comparative designs), provided that the original research concept was implemented and tested with respect to LLMs and CRC. Nonoriginal research, such as books, book chapters, letters, reviews, and conference proceedings, was excluded. |
| Other | Only English-language articles were included. |
aLLM: large language model.
bCRC: colorectal cancer.
cAI: artificial intelligence.
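The performance measures listed under Outcomes can all be derived from a 2×2 confusion table. The following is a minimal sketch for illustration only, not code from any included study; the function name `classification_metrics` and the counts are hypothetical.

```python
import math


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def classification_metrics(tp, fp, fn, tn):
    """Compute common diagnostic accuracy measures from confusion-table counts."""
    return {
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),
        "sensitivity": _ratio(tp, tp + fn),   # also called recall
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),           # positive predictive value (precision)
        "npv": _ratio(tn, tn + fn),           # negative predictive value
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical counts for an LLM output compared against a reference standard.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Concordance rate, as used in several included studies, corresponds to the accuracy term when the reference standard is an expert or tumor board decision rather than a ground-truth label.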
Discrepancies were resolved by discussion, with arbitration by a third reviewer. This review was conducted following PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) 2020 [], with search reporting per PRISMA-S [] and narrative synthesis per SWiM guidelines [].
Information Sources
Relevant studies were identified by systematically searching 6 electronic databases: PubMed, Web of Science, Embase, Cochrane Library, CINAHL, and Scopus (search cutoff date: April 1, 2026). Each database was searched individually; no multi-database searching on a single platform was performed. No published search filters (eg, validated study design filters) were applied to any database search.
Search Strategy
The search strategy combined controlled vocabulary terms (Medical Subject Headings [MeSH] and Emtree) with free-text keywords related to CRC and LLMs. These terms were adapted for each database to maximize retrieval sensitivity. Key terms included: “colonic neoplasms,” “colorectal cancer*,” “large language models,” “artificial intelligence,” “LLM,” “GPT,” “ChatGPT,” “Claude,” “Gemini,” and “LLaMA.” The search process followed the PRISMA Search Strategy Extension []. The complete search strategy, including specific search queries, applied limits, and the number of records retrieved from each database, is provided in . The search was last updated on April 1, 2026, to capture the most recent publications prior to data synthesis.
Regarding the PRISMA-S checklist, certain items were not applicable to our methodology: study registries and regulatory databases were not searched, as research on LLMs in CRC is generally not registered as clinical trials; gray literature, institutional websites, conference proceedings, and preprint servers were not searched; aside from manually screening reference lists, no citation searching tools were used; no additional search methods such as PubMed Related Articles, personal reference libraries, or other database-embedded related-article recommendation features were employed; and stakeholders or content experts were not contacted to identify additional studies, as the designed search was considered sufficiently comprehensive through database coverage alone. Although corresponding authors were contacted via email regarding missing or ambiguous data during the data extraction process, no authors, experts, manufacturers, or other parties were specifically contacted to identify additional studies or unpublished data for inclusion in this review. The search strategy did not undergo formal external peer review (such as the PRESS checklist process) but was cross-checked and finalized by investigators within the research team. A complete PRISMA-S checklist is provided in .
Selection Process
EndNote X9.3.3 (Clarivate Analytics, US) was used for reference management and automated deduplication, followed by manual verification. Two reviewers (JL and HT) independently screened titles and abstracts, then full texts against eligibility criteria. Discrepancies were resolved by discussion, with arbitration by a third reviewer (QF). Interrater agreement was assessed using the Kappa statistic.
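For readers unfamiliar with the kappa statistic, the sketch below shows how chance-corrected agreement between 2 reviewers can be computed. It assumes Cohen kappa (the text does not specify the variant), and the screening decisions are hypothetical, not the review's data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical title/abstract screening decisions for six records.
reviewer_1 = ["include", "exclude", "exclude", "include", "exclude", "exclude"]
reviewer_2 = ["include", "exclude", "include", "include", "exclude", "exclude"]
print(round(cohen_kappa(reviewer_1, reviewer_2), 3))
```

In this toy example the reviewers agree on 5 of 6 records, but half of that agreement is expected by chance, which is why kappa is lower than the raw agreement rate.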
Data Collection Process
Two reviewers (JL and WX) independently extracted data using a predesigned form (WPS Office Excel). Extracted items included: title, first author, year, study design, LLM model, model modality, application scenario, prompt engineering approach, input/output formats, and outcome measures. Interreviewer consistency was calculated; disagreements were resolved by a third reviewer (QF). For missing or ambiguous data, corresponding authors were contacted via email; if unavailable after 2 weeks, items were recorded as “not reported” and excluded from descriptive analyses. No imputation was applied. For studies reporting multiple outcomes, we gave preference to the primary outcome defined by the authors; if none was specified, we selected the metric most central to the study’s objective through consensus between 2 reviewers. For other types of outcomes, we extracted the reported values without modification.
Data Items
To manage the inherent overlap between technical tasks, studies were categorized based on their primary terminal clinical objective. For instance, studies employing information extraction specifically to enable automated TNM staging were classified under “Auxiliary Diagnosis” rather than “Information Extraction” to prioritize clinical utility over technical subprocesses.
Study Risk of Bias Assessment
Following Omar and Levkovich [], the included studies were classified and evaluated based on the assessment design and outcome indicators of the studies rather than their clinical application fields. QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies-2) [] was applied for diagnostic accuracy studies validating LLM performance against histopathological diagnosis, endoscopist consensus, or clinical guidelines. PROBAST (prediction model risk of bias assessment tool) [] was applied for prediction model studies focusing on the development and validation of LLM-based predictive models. ROBINS-I (Risk of Bias in Nonrandomized Studies - of Interventions) [] was applied for nonrandomized intervention studies evaluating the LLM application effect or clinical value, including information extraction and knowledge-based tasks. Study classifications and corresponding tools are detailed in .
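The design-based assignment of appraisal tools described above amounts to a simple lookup. The sketch below is illustrative only; the design labels and the function name `select_rob_tool` are hypothetical and do not come from the review's extraction form.

```python
# Mapping from study design (as classified by the reviewers) to appraisal tool.
ROB_TOOL_BY_DESIGN = {
    "diagnostic_accuracy": "QUADAS-2",
    "prediction_model": "PROBAST",
    "nonrandomized_intervention": "ROBINS-I",
}


def select_rob_tool(study_design):
    """Return the risk-of-bias tool for a study design, or raise if none applies."""
    try:
        return ROB_TOOL_BY_DESIGN[study_design]
    except KeyError:
        raise ValueError(f"no appraisal tool defined for design: {study_design!r}") from None


print(select_rob_tool("diagnostic_accuracy"))
```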
Given that LLM studies differ from conventional clinical trials, 2 oncology experts (QF) made minor framework-preserving adaptations to each tool; specific adaptations are documented in . Assessment was conducted independently by 2 researchers (JL and HT), with a third (WX) resolving disagreements. Final results were reviewed by an expert (QF). Interrater agreement was evaluated using the Kappa statistic. Overall evidence strength was evaluated considering study quality, consistency of findings, and methodological limitations.
Synthesis Methods
Given the anticipated heterogeneity in clinical tasks, study designs, and outcome constructs, narrative synthesis following SWiM reporting guidelines [] was planned a priori rather than quantitative meta-analysis. Meta-analysis was not conducted for 4 reasons: (1) substantial heterogeneity across fundamentally different clinical tasks, rendering pooled estimates uninterpretable; (2) a high proportion of studies rated at moderate, serious, or high risk of bias; (3) fewer than 5 studies within any subgroup sharing comparable task definitions, input modalities, and reference standards; and (4) marked inconsistency in outcome measures precluding standardized effect size extraction.
Reporting Bias Assessment
This systematic review employed a narrative synthesis and did not perform statistical tests for publication bias. Given the absence of a quantitative meta-analysis and the substantial heterogeneity in study design and outcome reporting across included studies, methods such as funnel plots were considered inapplicable. During evidence synthesis and result interpretation, the research team conducted a qualitative assessment of potential reporting bias. By comparing the consistency between study objectives, methods, and reported outcomes, and by incorporating study registration information (where available) and author explanations, the team cautiously discussed the potential impact of missing results on study conclusions.
Results
Study Selection
A total of 8880 records were retrieved (PubMed: 4047; Embase: 1423; Web of Science: 3061; Cochrane Library: 43; Scopus: 43; CINAHL: 263). After automated and manual deduplication using EndNote X9.3.3, 6260 unique records were identified. Following title/abstract screening, 2533 full-text articles were assessed, and 37 studies met the inclusion criteria. The screening-stage Kappa was 0.85. The screening process is presented in .
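The record counts above can be checked with simple arithmetic. The sketch below uses only the figures reported in this paragraph; the derived exclusion counts are not stated in the text and assume that every record left the flow at one of the three stages shown.

```python
# Records retrieved per database, as reported in the text.
records_by_database = {
    "PubMed": 4047,
    "Embase": 1423,
    "Web of Science": 3061,
    "Cochrane Library": 43,
    "Scopus": 43,
    "CINAHL": 263,
}

total_retrieved = sum(records_by_database.values())
unique_records = 6260       # after deduplication
full_text_assessed = 2533   # after title/abstract screening
included = 37               # met inclusion criteria

# Derived counts (assumption: no other routes out of the flow).
duplicates_removed = total_retrieved - unique_records
excluded_title_abstract = unique_records - full_text_assessed
excluded_full_text = full_text_assessed - included

print(total_retrieved, duplicates_removed, excluded_title_abstract, excluded_full_text)
```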

Study Characteristics
The data extraction consistency rate was 0.97. All 37 studies were published between 2023 and 2026, 2 in 2023 [,], 11 in 2024 [,,,,-], 22 in 2025 [,,,,-], and 2 in 2026 [,]. Studies primarily originated from China [,,,,,-,,,] and the United States [,,,,,,,,,], with others from Italy [,], Germany [,], Singapore [,], Israel [], Switzerland [], Spain [], Turkey [], the United Kingdom [], South Korea [], and multinational collaborations [,,]. Application domains included auxiliary diagnosis [,,,,,,,], information extraction [,,,,,], knowledge-based question answering [,,,,,,,,], treatment decision-making [,,,,,,,], predictive modeling [,], scientific research [,], and aided nursing [].
The LLMs used varied widely, with the most frequent being OpenAI’s GPT series. Other models included Google’s Gemini, Anthropic’s Claude, Meta’s LLaMA series, as well as DeepSeek, GLM, and Qwen, among others. The best-performing models identified in comparative studies are summarized in . The results suggest that models such as GPT-4, GPT-4o, and Claude 2.1 showed relatively favorable performance in some tasks [,,,,]; o3-mini reportedly showed comparatively higher intra-model stability and expert concordance among reasoning-oriented models for multidisciplinary team decision simulation []. However, for specific tasks, lightweight models or domain-specialized models may also perform optimally [,,]. A summary of these details is provided in .
| Study | Country | LLMs used | Model type | Application domain | Best performer |
| --- | --- | --- | --- | --- | --- |
| Zeng, 2025 [] | Multi-national | ChatGPT-4.5 | Pure LLM | Treatment Decision | — |
| Zeng, 2025 [] | China | ChatGPT-4o, DeepSeek | Pure LLM | Treatment Decision | — |
| Schmutz, 2025 [] | Germany | ChatGPT 4.0 | Pure LLM | Treatment Decision | ChatGPT 4.0 |
| Chatziisaak, 2025 [] | Switzerland | ChatGPT-4 | Pure LLM | Treatment Decision | — |
| Horesh, 2025 [] | United States | ChatGPT-3.5 | Pure LLM | Treatment Decision | — |
| Kaiser, 2024 [] | United States | ChatGPT-3.5, Microsoft Copilot | Pure LLM | Treatment Decision | — |
| Garg, 2026 [] | United States | ChatGPT-4o | Pure LLM | Treatment Decision | GPT-4o |
| Qu, 2026 [] | China | ChatGPT-o3-mini, DeepSeek-R1, Qwen qwq-plus | Pure LLM | Treatment Decision | o3-mini |
| Diaz, 2025 [] | United States | AI-HOPE (LLaMA 3-based) | Pure LLM | Scientific Research | — |
| Yang, 2025 [] | United States | LLaMA 3 | Pure LLM | Scientific Research | — |
| Yang, 2025 [] | China | BGE-M3, XGBoost | Pure LLM | Predictive Modeling | XGBoost |
| Kim, 2025 [] | Singapore | BioBERT-Large, RadImageNet, 3D ResNet | Multimodal VLM | Predictive Modeling | BioBERT-Large |
| Lim, 2024 [] | Singapore | GPT-4 | Pure LLM | Knowledge QA | GPT-4 |
| Hu, 2025 [] | China | ChatGPT-4.5 | Pure LLM | Knowledge QA | ChatGPT-4.5 |
| Peng, 2024 [] | China | ChatGPT-3.5 | Pure LLM | Knowledge QA | — |
| Wang, 2024 [] | China | GPT-3.5-turbo | Pure LLM | Knowledge QA | — |
| Zhang, 2025 [] | China | ChatGPT-4o, Claude 3.5, DeepSeek | Pure LLM | Knowledge QA | ChatGPT-4o |
| Zhou, 2024 [] | China | ChatGPT, Doctor GPT, Llama-2-70B, Mixtral-8 × 7B, Bard, Claude 2.1 | Pure LLM | Knowledge QA | Claude 2.1 |
| Gorelik, 2023 [] | Israel | ChatGPT-4 | Pure LLM | Knowledge QA | — |
| Maida, 2025 [] | Italy | ChatGPT-4o | Pure LLM | Knowledge QA | — |
| Maida, 2025 [] | Multi-national | ChatGPT-4 | Pure LLM | Knowledge QA | — |
| Emile, 2023 [] | Multi-national | ChatGPT-3.5 | Pure LLM | Knowledge QA | — |
| Kepez, 2024 [] | Turkey | ChatGPT-4 | Pure LLM | Knowledge QA | ChatGPT-4 |
| Atarere, 2024 [] | United States | ChatGPT, BingChat, YouChat | Pure LLM | Knowledge QA | ChatGPT, YouChat |
| Yu, 2025 [] | China | Gemini, GPT-4, GPT-4o, Claude, Llama, DeepSeek, GLM, Qwen | Pure LLM | Information Extraction | GPT-4 |
| Chizhikova, 2025 [] | Spain | RoBERTa | Pure LLM | Information Extraction | Task-specific models |
| Alzaid, 2024 [] | UK | ChatGPT-4 Turbo, GPT-4V | Multimodal VLM | Information Extraction | — |
| Johnson, 2025 [] | United States | Gemma-2-9B-It-SPPO, Llama-3-8B-Instruct | Pure LLM | Information Extraction | Gemma-2 |
| Kim, 2025 [] | South Korea | GPT-4 | Pure LLM | Information Extraction | GPT-4 |
| Ding, 2025 [] | China | ChatGPT-4 | Multimodal VLM | Auxiliary Diagnosis | — |
| Liu, 2024 [] | China | ChatGPT-3.5, ChatGPT-4.0 | Pure LLM | Auxiliary Diagnosis | GPT-4.0 |
| Wang, 2025 [] | China | ChatGPT, Claude, ERNie, SAM | Multimodal VLM | Auxiliary Diagnosis | — |
| Ferber, 2024 [] | Germany | ChatGPT-4V | Multimodal VLM | Auxiliary Diagnosis | GPT-4V |
| Massimi, 2025 [] | Italy | ChatGPT-4o | Multimodal VLM | Auxiliary Diagnosis | GPT-4o |
| Amini, 2025 [] | United States | GPT-3.5-turbo, Bard (PaLM 2) | Pure LLM | Auxiliary Diagnosis | ChatGPT-3.5 |
| Chang, 2024 [] | United States | ChatGPT-4 | Pure LLM | Auxiliary Diagnosis | ChatGPT-4 |
| Sehgal, 2025 [] | United States | ChatGPT-4.1 | Pure LLM | Aided Nursing | — |
aLLM: large language model.
bNo intermodel comparison was performed or the metric is not applicable.
Prompt Engineering and Model Training
We synthesized prompt engineering strategies, model inputs/outputs, and evaluation metrics (). Five studies [,,,,] did not explicitly describe prompting strategies, employing basic queries primarily for educational purposes. Thirty-two studies described distinct methods, including instruction templates and instructional prompts [,,,,,,,,,,-,-], zero-shot learning [,,,,,,], few-shot learning [,], fine-tuning [,,], and hybrid approaches [,,]. Training data were text-based in 33 studies [,,,,,,,,,,-,-,-,,,-], image-based in 2 studies [,], and multimodal in 2 studies [,]. Common outcome metrics included accuracy [,,,,,,,,-,,,-,,,,,,], F1-score [,,,,], area under the curve [,,], sensitivity [,,], and concordance rate [,,,,,,,,,-,]. A categorized summary is provided in .
| Study | Prompt method or content | Model input | Model output | Outcome indicators |
| --- | --- | --- | --- | --- |
| Zeng, 2025 [] | Instruction template | Standardized patient cases | Screening and monitoring recommendations | Correct/partially correct/incorrect proportions; descriptive statistics |
| Amini, 2025 [] | Instruction template | Colonoscopy reports, pathology, history, family history | Colonoscopy interval recommendation | Agreement percentage, Fleiss’ kappa, McNemar test |
| Chang, 2024 [] | Instruction template | Deidentified clinical data, colonoscopy reports, pathology reports | Follow-up colonoscopy interval suggestions | Agreement rate, Fleiss kappa |
| Johnson, 2025 [] | Instruction template | Pathology report text | Yes/no answer | F1-score, PPV, NPV, sensitivity, specificity, MCC |
| Lim, 2024 [] | Instruction template | Patient scenario descriptions | Colonoscopy interval recommendations | Correct interval percentage, hallucination rate |
| Gorelik, 2023 [] | Instruction template | Structured endoscopy reports & free-text clinical notes | Guideline-based next-step recommendations; Patient result explanation letters | Guideline adherence, accuracy, Fleiss’ kappa |
| Alzaid, 2024 [] | Instruction template | Unstructured pathology reports | Structured JSON report with confidence | Accuracy, Kappa, AUROC |
| Kepez, 2024 [] | Instruction template | 20 common questions on colon cancer | Answer text for each question | DISCERN, GQS, JAMA criteria, Flesch-Kincaid readability, SAM, HITS, VPI, HONcode |
| Zhang, 2025 [] | Instruction template | Chinese Society of Clinical Oncology guideline standards / instructions | Colorectal cancer screening educational text | Accuracy, clarity, rigor scores |
| Yang, 2025 [] | Instruction template | Natural language queries on clinical genomic data | Mutation profiles, survival curves, odds ratios | P values, hazard ratios, odds ratios |
| Wang, 2025 [] | Instruction template | Free-text colonoscopy reports | Report-level labels | Accuracy, average precision, dice similarity coefficient, AUC |
| Sehgal, 2025 [] | Instruction template | Self-reported demographics | AI-generated personalized messages or chatbot dialogues | Intent score change, Cohen d, P values, OR, Flesch-Kincaid readability |
| Schmutz, 2025 [] | Instruction template | Clinical patient summaries and pathology reports | Treatment/diagnostic recommendations | Recommendation type, information density, consistency, level of evidence, time efficiency |
| Massimi, 2025 [] | Instruction template | Colonoscopy video frames | Paris classification | Accuracy, sensitivity, specificity, Fleiss’ kappa |
| Ding, 2025 [] | Instruction template | Pathology images and text prompts | Tissue origin, lesion classification, diagnosis | Accuracy, sensitivity, specificity, PPV, NPV, F1-score, Kappa, ICC |
| Diaz, 2025 [] | Instruction template | Natural language queries for scanning and validating clinical genomic datasets | Survival analysis results, mutation frequency comparisons, statistical significance | P values, odds ratios, survival rates |
| Chatziisaak, 2025 [] | Instruction template | Patient clinical data | Treatment recommendation | Consistency, chi-square test |
| Qu, 2026 [] | Instruction template; Multi-role prompting | Structured variables and free-text summaries from clinical records | Four-category treatment classification code | Intra-model agreement; expert-model concordance, Cohen κ |
| Garg, 2026 [] | Instruction template; Role prompting; Few-shot; Chain-of-thought; JSON schema enforcement | Colonoscopy reports, pathology reports, patient family history and preoperative diagnoses | Structured clinical entities and 2020 USMSTF-based surveillance interval recommendations; 2024 ACG/ASGE quality indicators | Case-level accuracy, Cohen κ; Fleiss’ κ; ADR, SSLDR, cecal intubation rate, bowel prep adequacy |
| Kim, 2025 [] | Instruction template; Role prompting | Unstructured preoperative abdominal CT / rectal MRI reports | Lesion location and cTNM stage and reasoning | Lesion location accuracy |
| Kim, 2025 [] | Fine-tuning | CT images and radiology report texts | Binary NAR score classification | AUC |
| Chizhikova, 2025 [] | Fine-tuning | Spanish colon MRI report texts, numerical features, categorical features | TNM staging | Accuracy, macro F1-score, precision, recall |
| Yang, 2025 [] | Fine-tuning | Clinical EHR data | Binary colorectal adenoma risk | AUC, sensitivity, specificity, F1-score, PPV, NPV, mean lead time |
| Ferber, 2024 [] | Few-shot | Cancer pathology images | Image classification labels | Accuracy, confidence interval, recall |
| Zeng, 2025 [] | Few-shot; Role prompting; Context learning | Real-world pathology report text | Recommendation on need for additional surgery | Accuracy; guideline consistency proportion |
| Peng, 2024 [] | Zero-shot | Medical questions from books | Colorectal cancer-related answers | Accuracy, comprehensiveness scores |
| Zhou, 2024 [] | Zero-shot | 150 CRC-related closed-ended questions | Yes/no answers | Accuracy |
| Liu, 2024 [] | Zero-shot | Colorectal cancer case report texts | Primary/secondary diagnoses | Accuracy |
| Wang, 2024 [] | Zero-shot | Pathology report text and related questions | Answers to pathology questions | 7-point Likert scale |
| Horesh, 2025 [] | Zero-shot | Clinical patient summaries | Next best management recommendation | Consistency with multidisciplinary team decisions, reasonableness score, interrater reliability |
| Yu, 2025 [] | Zero-shot; Chain-of-thought | Endoscopy/colonoscopy report texts | Structured JSON including lesion location, features, layer structure, distribution, diagnosis | Precision, recall, F1-score, accuracy |
| Hu, 2025 [] | Zero-shot | Patient question texts | Answer texts | Accuracy, completeness, clarity scores |
| Maida, 2025 [] | — | 15 questions on colorectal cancer screening | Text answers to questions | Accuracy, completeness, clarity scores |
| Emile, 2023 [] | — | 38 common questions on CRC prevention, diagnosis, management | Text answers | Expert consensus; consistency with guidelines |
| Atarere, 2024 [] | — | 15 questions on CRC screening concepts and 5 experience-based questions | Response appropriateness | Appropriateness rating |
| Kaiser, 2024 [] | — | Clinical scenario questions on next management | Text recommendations for clinical questions | Accuracy score, consistency, verbosity |
| Maida, 2025 [] | — | Patient queries | ChatGPT-generated answers | Expert scores, patient scores |
PPV: positive predictive value.
NPV: negative predictive value.
MCC: Matthews correlation coefficient.
AUROC: area under the receiver operating characteristic curve.
AUC: area under the curve.
OR: odds ratio.
ICC: intraclass correlation coefficient.
TNM: tumor–node–metastasis.
EHR: electronic health record.
CRC: colorectal cancer.
Prompt method was not explicitly reported.
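Several of the outcome measures abbreviated above (accuracy, sensitivity, specificity, PPV, NPV, F1-score, and MCC) derive from the same 2×2 confusion matrix. The following is a minimal sketch of how they relate, assuming a binary classification task; the function name `binary_metrics` and the example counts are hypothetical and are not taken from any included study.

```python
import math


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute common binary classification metrics from confusion-matrix counts."""
    sensitivity = _ratio(tp, tp + fn)   # also called recall
    specificity = _ratio(tn, tn + fp)
    ppv = _ratio(tp, tp + fp)           # also called precision
    npv = _ratio(tn, tn + fn)
    f1 = _ratio(2 * ppv * sensitivity, ppv + sensitivity)
    # Matthews correlation coefficient: uses all four cells of the matrix.
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = _ratio(tp * tn - fp * fn, mcc_denominator)
    return {
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    for name, value in binary_metrics(tp=40, fp=10, fn=5, tn=45).items():
        print(f"{name}: {value:.3f}")
```

Because PPV and NPV depend on outcome prevalence while sensitivity and specificity do not, studies reporting different subsets of these metrics are not directly comparable without the underlying counts.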
Risk of Bias in Studies
The included studies were categorized by research objective, and quality was assessed using the corresponding appraisal tool. The kappa value between the 2 reviewers was 0.95. Two predictive modeling studies [,] were evaluated using PROBAST (); both showed low risk of bias across the participants, predictors, and outcome domains, but one exhibited high risk of bias in the analysis domain. Eighteen diagnostic studies [,,,,,-,,,,,,,,] were assessed using QUADAS-2 (); while most demonstrated acceptable applicability, risk of bias in the patient selection domain was frequently unclear or high. Seventeen intervention studies [,,,,,,,,,,,,,,,] were appraised using ROBINS-I (); risk of bias was predominantly low for participant selection, deviations from intended interventions, and missing data, but moderate to serious for outcome measurement and classification of interventions.

Overall, 27 of the 37 included studies were rated above low risk of bias: 6 as high or serious [,-,,], 7 as unclear [,,,,,,], and 14 as moderate [,,,,,,,,,,,,,], while 10 were rated as low [,,,,,,,-]. The most problematic domains across tools were outcome measurement [,,,,,,-,,,,,,-,,,] and patient selection [,,-,,,,,]. Given these recurring concerns, particularly regarding blinding [,,,,], outcome measurement [,,,,,,-,,,,,,-,,,], and confounding [,,-,,,-,,], and the considerable heterogeneity in clinical tasks, LLMs, and outcome metrics, the overall certainty of evidence was judged to be moderate to low. Quantitative meta-analysis was not feasible; even within the largest subgroup, fewer than 5 studies were sufficiently aligned in task definition, input modality, and reference standard to permit reliable pooling. A narrative synthesis was therefore adopted, and the findings should be interpreted with caution.
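The interrater agreement reported in this section (a kappa value of 0.95 between the 2 reviewers) can be illustrated with a minimal sketch. It assumes unweighted Cohen's kappa, which the review does not specify, and it uses hypothetical risk-of-bias ratings rather than the review's data; the function name `cohens_kappa` is illustrative.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Unweighted Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must label the same nonempty set of items.")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical overall risk-of-bias ratings for 6 studies.
    reviewer_1 = ["low", "low", "moderate", "high", "unclear", "moderate"]
    reviewer_2 = ["low", "low", "moderate", "high", "moderate", "moderate"]
    print(round(cohens_kappa(reviewer_1, reviewer_2), 2))
```

Kappa corrects raw percentage agreement for the agreement expected by chance, which is why it is generally preferred over simple concordance when rating categories are unevenly used.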
Discussion
Principal Findings
Through a comprehensive analysis of 37 studies, we identified 5 primary application domains of LLMs in CRC diagnosis and treatment: auxiliary diagnosis, information extraction, knowledge-based question-answering and patient education, treatment decision support, and scientific research and predictive modeling (). These domains are often interconnected in clinical practice. For instance, information extraction frequently provides structured data to support diagnostic processes [,], while knowledge-based question-answering is widely applied in scientific communication and patient education [,,,,].
Applications of LLMs in CRC
LLMs enable the automated extraction of clinical features through NLP []. Multiple studies have utilized LLMs to extract key information from EHRs [], endoscopy reports [], radiology reports [], and pathology reports [,]. This capability assists not only in clinical staging and histological classification [] but also in predicting disease progression and treatment response []. For instance, lymph node metastasis assessment based on MRI reports [] and tumor progression prediction from radiology reports [] have shown promising accuracy. These capabilities suggest that LLMs may also be valuable in early CRC screening. Early diagnosis can effectively improve survival rates [], and mass screening achieves a high detection rate for early-stage lesions []. Wang leveraged LLMs to automatically extract knowledge from colonoscopy image-text records, enabling polyp detection and segmentation without manual annotation, thereby offering a novel approach to screening automation []. A systematic review of LLMs in gastroenterology similarly demonstrated the potential applications of LLMs in gastrointestinal endoscopy and precancerous lesion screening []. Challenges remain, including insufficient extraction performance for complex tasks and hallucinations; for example, an accuracy of only 55% was reported for LLMs in classifying pedunculated polyps, indicating that they cannot yet fully replace endoscopic experts [,]. Nevertheless, we remain optimistic about their future performance in assisting CRC diagnosis and early screening. This optimism is fueled by ongoing advancements in multimodal integration [], the development of domain-specific models [,], and the continuous optimization of training data [].
Leveraging their strong interactive capabilities and extensive knowledge, LLMs are widely evaluated for CRC medical question-answering and patient education [,]. Furthermore, advancing multimodal models now enable LLMs to jointly analyze medical images and text, offering CRC diagnostic and therapeutic suggestions in controlled settings []. Gong has recently emphasized that multimodal fusion has emerged as the dominant next-generation development trend for gastrointestinal artificial intelligence []; however, this important technological milestone has not yet received adequate attention in available systematic reviews. Ferber demonstrated that multimodal LLMs applying in-context learning achieved near-pathologist-level classification of cancer pathology images [], and Kim [] showed that combined LLM and vision deep learning architectures outperformed either modality alone for neoadjuvant rectal score prediction. Taken together, these findings preliminarily suggest that multimodal approaches can outperform single-modality models on selected image classification and clinical prediction tasks. Despite this progress, the diagnostic accuracy of current multimodal models on morphologically complex tasks remains constrained [,]. This reinforces the prevailing clinical consensus that current LLMs must be deployed strictly as decision-support adjuncts rather than autonomous diagnostic agents, thereby mitigating the significant clinical risks associated with automation bias and diagnostic delay [,]. Extraction performance also varied markedly with the underlying model architecture and optimization strategy. GPT-4, augmented with multi-strategy prompting, appeared to outperform zero-shot baselines for colonoscopy report extraction [], while a biomedically pretrained RoBERTa model showed better performance than general-purpose GPT models for TNM staging in the available evidence from Spanish-language reports [].
This discrepancy suggests that domain-adaptive and language-specific pretraining may confer semantic advantages that advanced prompt engineering alone does not readily replicate [], consistent with recent evaluations in which specialized models exhibited superior performance within data-constrained clinical settings []. Nonetheless, most of this evidence is derived from retrospective analyses and single-center validations, and prospective, multicenter clinical trials confirming generalizability and real-world efficacy are notably lacking [,,].
The NLP and named entity recognition capabilities of LLMs extend their utility beyond direct clinical support for practitioners and patients and can substantially improve the efficiency of medical research workflows []. In the domain of data extraction and analysis, Johnson leveraged the Gemma-2 model to accurately identify and extract key pathological diagnostic entities (such as dysplasia, high-grade dysplasia/adenocarcinoma, and invasive carcinoma) from unstructured pathology reports []. This high accuracy is consistent with Chen et al [], who demonstrated comparable reliability in extracting oncological variables from EHRs, supporting the view that automated information extraction is one of the most mature LLM applications. Beyond data retrieval, LLMs are increasingly serving as active engines for hypothesis generation [,]. Their probabilistic structure allows them to synthesize vast, disparate datasets and infer latent correlations that traditional algorithms might overlook []. In translational medicine, Yang developed AI-HOPE-TP53, a LLaMA 3-based conversational agent that facilitates pathway-centric analysis of clinical genomic data in early-onset CRC []. By rapidly generating statistical outputs like survival curves and hazard ratios, this system accelerates hypothesis-driven research in precision oncology []. The viability of this paradigm shift is further corroborated by Abdel-Rehim, who experimentally validated that LLM-driven pipelines can successfully identify novel, laboratory-verifiable synergistic drug combinations []. Furthermore, hybrid LLM architectures are democratizing access to complex analytical tools in routine practice. Yang et al [] developed an early colorectal adenoma risk prediction model combining BGE-M3 semantic vector encoding with the XGBoost algorithm.
By enabling clinicians without specialized computational expertise to perform sophisticated risk stratification based on LLM-processed outputs, such models substantially reduce the administrative burden and facilitate a more patient-centered clinical workflow []. The research-supportive functions of LLMs have also expanded into foundational scholarly activities, including knowledge synthesis and the drafting of study protocols, ethics materials, and preliminary manuscript sections [,]. However, the originality and factual accuracy of such artificial intelligence-generated scholarly content necessitate rigorous human oversight to ensure scientific integrity [].
Limitations of LLMs and Future Directions
Current research on LLMs in the field of CRC predominantly focuses on textual data processing []; investigations into other modalities, including CT images [,], histopathological slides [], and bioinformatics data [], remain in their nascent stages, demonstrating suboptimal output precision and task stability. General-purpose LLMs (eg, ChatGPT and the LLaMA series), predominantly pretrained on public databases, frequently manifest deficiencies such as delayed knowledge base updates, insufficient coverage of CRC subspecialty knowledge, a propensity for hallucinations, and an absence of authoritative evidence-based support for pivotal clinical content [,,]. Conversely, although existing medical-domain-specific LLMs (eg, Med-PaLM 2, BioBERT, and ClinicalBERT) possess certain advantages in general medical tasks, their comprehensive performance in complex subspecialty tasks, such as the precision treatment of CRC, still lags behind that of large-parameter general-purpose models []. More critically, the reliability and generalizability of currently well-developed diagnostic and decision-support tools are severely hindered by methodological flaws; existing evidence relies disproportionately on retrospective, single-center datasets lacking temporal or geographic stratification []. This evaluative paradigm renders models highly susceptible to overfitting and training data leakage, thereby precipitating a drastic degradation in performance within real-world clinical environments []. There remains a critical paucity of rigorous prospective, multicenter clinical validation data within this domain [,].
The risk-of-bias assessment revealed several recurring methodological weaknesses across study designs. Among diagnostic accuracy studies evaluated with QUADAS-2, the patient selection domain was the most common source of concern, with ratings of “unclear” or “high” largely attributable to unreported sampling procedures and potentially inappropriate exclusion criteria [,,,,]. For nonrandomized intervention studies assessed with ROBINS-I, the principal limitations were inadequate adjustment for confounding variables [,-,,,,] and the absence of blinded outcome assessment, both of which may bias effect estimates [,,,,,,,,,,,]. Prediction model studies appraised with PROBAST generally performed well in the participants, predictors, and outcome domains but showed weaknesses in the analysis domain, including limited sample size, unexplained participant attrition, and insufficiently described handling of missing data [,]. Collectively, these methodological limitations reduce the reliability of the current evidence base and constrain its translational applicability.
Beyond data-related constraints, the intrinsic technical vulnerabilities and compliance risks of LLMs pose substantial threats to clinical safety []. The profound sensitivity of models to version iterations and prompt variations results in exceedingly poor reproducibility of outputs across multi-institutional settings []. In the absence of specific instructional constraints, models are not only prone to hallucinations but may also exacerbate negative societal biases and stereotypes []. Uncritical acceptance of these recommendations by clinicians may engender bias, subsequently precipitating critical diagnostic delays or inappropriate clinical interventions []. Furthermore, constrained by the heterogeneity of patient requirements and the stringent governance of sensitive data, applications pertaining to patient follow-up and supportive care remain the most underdeveloped []. Moreover, the pervasive absence of data privacy and information security protocols during the cloud-based deployment of open-source LLMs further impedes their clinical translation and real-world implementation [,].
To address current technical bottlenecks, it is imperative to enhance model precision and reliability through future technological advancements []. Multimodal integration is recognized as the predominant trajectory for next-generation technological development in this domain, offering the potential to transcend the limitations of unimodal text processing []. Regarding optimization strategies, retrieval-augmented generation (RAG) is a promising approach for tailoring general-purpose models to subspecialty clinical scenarios []. By interfacing with independent, verifiable, and authoritative subspecialty knowledge bases, RAG facilitates real-time knowledge updates, can improve the concordance between model outputs and authoritative guidelines, can reduce hallucinations, and improves interpretability by linking outputs to their sources [,]. Concurrently, prompt engineering (eg, instruction templates, few-shot learning, and chain-of-thought prompting) can rapidly improve the performance of general-purpose models in specific tasks, including pathological data extraction, treatment regimen recommendation, and follow-up protocol formulation, without altering underlying model weights [,].
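The optimization strategies described above (retrieval of authoritative context, instruction templates, few-shot examples, and chain-of-thought prompting) can be sketched as a single prompt-assembly step. This is a toy illustration under stated assumptions, not the method of any included study: retrieval is approximated by word overlap rather than the embedding-based search typical of RAG systems, the knowledge-base snippets are placeholder text rather than real guideline content, and the names `retrieve` and `build_prompt` are hypothetical. No model is called; the sketch only builds the prompt string.

```python
import re
from typing import Iterable, List, Sequence, Tuple

# Hypothetical placeholder snippets; NOT real guideline content.
KNOWLEDGE_BASE: List[str] = [
    "Snippet A: placeholder statement about surveillance colonoscopy intervals.",
    "Snippet B: placeholder statement about TNM staging of rectal tumors.",
    "Snippet C: placeholder statement about adjuvant chemotherapy eligibility.",
]


def _tokens(text: str) -> set:
    """Lowercase alphanumeric word set used for overlap scoring."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: Sequence[str], k: int = 1) -> List[str]:
    """Return the k snippets sharing the most words with the query.

    Toy stand-in for the retrieval step of RAG; real systems usually
    rank by embedding similarity instead of word overlap.
    """
    query_tokens = _tokens(query)
    ranked = sorted(corpus, key=lambda s: len(query_tokens & _tokens(s)), reverse=True)
    return list(ranked[:k])


def build_prompt(
    question: str,
    corpus: Sequence[str],
    examples: Iterable[Tuple[str, str]] = (),
    chain_of_thought: bool = False,
    k: int = 1,
) -> str:
    """Assemble an instruction-template prompt with retrieved context."""
    parts = [
        # Instruction template with a role statement.
        "You are assisting a clinician. Answer using only the context provided."
    ]
    context = retrieve(question, corpus, k)
    parts.append("Context:\n" + "\n".join(f"- {snippet}" for snippet in context))
    # Few-shot examples: worked question-answer pairs shown before the query.
    for example_question, example_answer in examples:
        parts.append(f"Example question: {example_question}\nExample answer: {example_answer}")
    if chain_of_thought:
        parts.append("Reason step by step before giving the final answer.")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_prompt(
        "What is the TNM staging for this rectal tumor?",
        KNOWLEDGE_BASE,
        examples=[("Hypothetical example question", "Hypothetical example answer")],
        chain_of_thought=True,
    ))
```

Keeping the retrieved context separate from the instruction and the question makes it possible to audit which source text a given answer was conditioned on, which is the traceability property attributed to RAG above.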
Regarding clinical integration and ethical governance, future research priorities must pivot toward achieving real-world validity and safety []. Primarily, prospective, multicenter clinical validations must be conducted for diagnostic and treatment planning applications, while patient follow-up and supportive care systems must be specifically developed to rectify deficiencies in full-cycle management []. More crucially, LLMs cannot supplant medical professionals; their responsible clinical application must be strictly predicated on the establishment of a robust ethical governance framework []. This necessitates the strict enforcement of their adjunctive role under continuous human supervision, concurrent with the resolution of data privacy issues and the assurance of foundational data quality []. Ultimately, cross-disciplinary collaboration is imperative to delineate accountability, ensuring the synchronous evolution of governance frameworks and cutting-edge technologies [].
Limitations of This Systematic Review
This review has several limitations. First, the majority of included studies were retrospective and single-center in design, and no prospective multicenter clinical trials establishing real-world LLM effectiveness in CRC care were identified. Only a minority conducted independent external validation, precluding confirmation of generalizability across diverse populations and institutions. Second, the rapid publication pace of LLM research means some recent developments may not have been captured despite the April 1, 2026 search cutoff. Third, restriction to English-language publications may introduce geographic bias. Fourth, several included studies evaluated proprietary commercial models such as GPT-4 and Claude, whose architectures and training data are not fully disclosed, introducing additional transparency and reproducibility concerns. No included study reported direct industry sponsorship for LLM evaluation. Finally, the search strategy was only cross-checked internally without formal external peer review, potentially leading to omission of a few unpublished or noncore journal studies. Inherent subjectivity in quality appraisal was mitigated through independent dual assessment, third-reviewer arbitration, and expert validation [,].
Conclusions
This review establishes an integrative framework that synthesizes evidence across diverse study designs and LLM categories to compare their respective strengths and limitations in CRC care. Distinct from prior reviews that have addressed gastroenterology broadly or have been confined to a single study design, our work focuses specifically on the full-cycle CRC care continuum and, to our knowledge, is the first to comparatively evaluate general-purpose, domain-specific, and multimodal LLMs, thereby elucidating how prompt engineering and heterogeneous evaluation metrics shape reported outcomes. While our findings support the clinical potential of LLMs, these results should be interpreted with caution, given the moderate-to-low certainty of the available evidence. Most included studies failed to report key safeguards against bias, such as blinding of outcome assessors, adequate adjustment for confounders, or the use of prospective, multicenter designs to validate model generalizability. Moreover, the substantial heterogeneity we observed across task types, LLM categories, prompt engineering strategies, reference standards, and outcome measures indicates that the performance advantages reported for any specific LLM are confined to the corresponding tasks and clinical scenarios and cannot be generalized. Future efforts should therefore prioritize the integration of LLMs into real-world clinical practice, which will require prospective, multicenter validation, a robust privacy-protection framework, and rigorous human oversight to mitigate bias. Against the backdrop of a rising global CRC burden and persistent disparities in health care resource allocation, this review provides an evidence base to inform the clinical translation, equitable scaling, and policy formulation surrounding LLM deployment in CRC care.
Registration and Protocol
This systematic review was prospectively registered in the International Prospective Register of Systematic Reviews (PROSPERO) under registration number CRD420251248261. The review protocol is publicly accessible through the PROSPERO database. No separate protocol manuscript was published.
One amendment was made to the registered protocol: the literature search cutoff date was extended from November 1, 2025 to April 1, 2026, to capture the most recent publications prior to data synthesis. This amendment was implemented after the initial search had been completed and did not alter the review’s eligibility criteria, synthesis methodology, or any other prespecified procedures. The narrative synthesis approach (SWiM), quality assessment tools (QUADAS-2, PROBAST, ROBINS-I), eligibility criteria, database selection, screening processes, and data extraction methods were all carried out as prespecified in the registered protocol. No other amendments were made.
Acknowledgments
The authors sincerely thank Zhejiang Chinese Medical University and Hangzhou First People’s Hospital for providing the academic research platform, professional literature resource support, and methodological guidance for the completion of this systematic review. This manuscript was originally drafted in Chinese and subsequently translated into English. During the preparation and translation process, the authors further used ChatGPT (OpenAI) to assist with English-language polishing. All AI-generated outputs were critically reviewed and manually edited by the authors, who take full responsibility for the accuracy and integrity of the final content. The authors declare the use of generative artificial intelligence (GAI) in the research and writing process. In accordance with the GAIDeT taxonomy (2025), GAI tools were used under full human supervision for idea generation, proofreading and editing, and translation. The GAI tools used were Gemini 3 and DeepSeek. Responsibility for the content and integrity of the final manuscript rests entirely with the authors. GAI tools are not listed as authors and do not bear responsibility for the final outcomes. This declaration is submitted under the collective responsibility of the authors.
Funding
This work was supported by the Clinical Research Application Project of Zhejiang Provincial Medical and Health Science and Technology Program (grant number 2024KY190), the Hangzhou Municipal Medical and Health Science and Technology Program (grant number A20241859), and the Hangzhou Municipal Biomedical Special Project (grant number 2023WJC120).
Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Authors' Contributions
JL conceived the study, designed the methodology, and wrote the original draft. QF and WX contributed equally to data collection, analysis, and manuscript revision. HY, HT, and YL participated in data curation, validation, and discussion. All authors reviewed and approved the final manuscript. QF is the corresponding author. JL and WX contributed equally to this work.
Conflicts of Interest
None declared.
Multimedia Appendix 2
Methodological classification, appraisal tools, and evaluation metrics of the included studies.
PDF File, 108 KB
Multimedia Appendix 3
Documentation of framework-preserving adaptations to quality appraisal tools.
PDF File, 145 KB
Multimedia Appendix 4
Prompt engineering strategies and application scenarios in the included studies.
PDF File, 87 KB
References
- Wu S, Zhang Y, Lin Z, Wei M. Global burden of colorectal cancer in 2022 and projections to 2050: incidence and mortality estimates from GLOBOCAN. BMC Cancer. Nov 14, 2025;25(1):1770. [CrossRef] [Medline]
- Eng C, Yoshino T, Ruíz-García E, et al. Colorectal cancer. The Lancet. Jul 2024;404(10449):294-310. [CrossRef] [Medline]
- Sloss EA, Abdul S, Aboagyewah MA, et al. Toward alleviating clinician documentation burden: a scoping review of burden reduction efforts. Appl Clin Inform. May 2024;15(3):446-455. [CrossRef] [Medline]
- Holmgren AJ, Apathy NC, Crews J, Shanafelt T. National trends in oncology specialists’ EHR inbox work, 2019-2022. J Natl Cancer Inst. Jun 1, 2025;117(6):1253-1259. [CrossRef] [Medline]
- Wong EYT, Verlingue L, Aldea M, et al. ESMO guidance on the use of large language models in clinical practice (ELCAP). Ann Oncol. Dec 2025;36(12):1447-1457. [CrossRef] [Medline]
- Chen D, Alnassar SA, Avison KE, Huang RS, Raman S. Large language model applications for health information extraction in oncology: scoping review. JMIR Cancer. Mar 28, 2025;11(1):e65984. [CrossRef] [Medline]
- Peng W, Feng Y, Yao C, et al. Evaluating AI in medicine: a comparative analysis of expert and ChatGPT responses to colorectal cancer questions. Sci Rep. 2024;14(1):2840. [CrossRef]
- Zeng A, Steinke J, Bocse HF, De Pastena M. Dr. LLM will see you now: the ability of ChatGPT to provide geographically tailored colorectal cancer screening and surveillance recommendations. J Clin Med. Jul 18, 2025;14(14):5101. [CrossRef] [Medline]
- Gong EJ, Bang CS, Lee JJ, et al. Large language models in gastroenterology: systematic review. J Med Internet Res. Dec 20, 2024;26:e66648. [CrossRef] [Medline]
- Maida M, Ramai D, Mori Y, et al. The role of generative language systems in increasing patient awareness of colon cancer screening. Endoscopy. Mar 2025;57(3):262-268. [CrossRef] [Medline]
- Yang EW, Waldrup B, Velazquez-Villarreal E. Conversational artificial intelligence for integrating social determinants, genomics, and clinical data in precision medicine: development and implementation study of the AI-HOPE-PM system. JMIR Bioinform Biotechnol. Oct 10, 2025;6:e76553. [CrossRef] [Medline]
- Pereyra L, Schlottmann F, Steinberg L, Lasa J. Colorectal cancer prevention: is chat generative pretrained transformer (Chat GPT) ready to assist physicians in determining appropriate screening and surveillance recommendations? J Clin Gastroenterol. 2024;58(10):1022-1027. [CrossRef] [Medline]
- Amini M, Chang PW, Davis RO, et al. Comparing ChatGPT3.5 and Bard recommendations for colonoscopy intervals: bridging the gap in healthcare settings. Endosc Int Open. 2025;13(CP):a25865912. [CrossRef] [Medline]
- Chang PW, Amini MM, Davis RO, et al. ChatGPT4 outperforms endoscopists for determination of postcolonoscopy rescreening and surveillance recommendations. Clin Gastroenterol Hepatol. Sep 2024;22(9):1917-1925. [CrossRef] [Medline]
- Omar M, Nassar S, Sharif K, Glicksberg BS, Nadkarni GN, Klang E. Emerging applications of NLP and large language models in gastroenterology and hepatology: a systematic review. Front Med (Lausanne). 2024;11:1512824. [CrossRef] [Medline]
- Naito T, Nosaka T, Tanaka T, et al. Usefulness of an artificial intelligence-based colonoscopy report generation support system. Clin Endosc. Mar 2025;58(2):327-330. [CrossRef] [Medline]
- Johnson B, Bath T, Huang X, et al. Large language models for extracting histopathologic diagnoses of colorectal cancer and dysplasia from electronic health records. BMJ Open Gastroenterol. Sep 18, 2025;12(1):e001896. [CrossRef] [Medline]
- Bräutigam K, Baker AM, Koelzer VH, Kather JN, Graham TA. Integrating artificial intelligence (AI) into colorectal cancer reporting. J Pathol. Apr 2026;268(4):367-382. [CrossRef] [Medline]
- Yılmaz M, Abbaslı N, Tuna S, et al. Comparison of artificial intelligence and multidisciplinary team recommendations in the management of colorectal cancer liver metastases. Sci Rep. 2026;16(1):7278. [CrossRef]
- Qu B, Cao L, Wu C, et al. Comparison of large language models and expert multidisciplinary team decisions in colorectal cancer. BMJ Health Care Inform. Mar 10, 2026;33(1):e101780. [CrossRef] [Medline]
- Biesheuvel LA, Workum JD, Reuland M, et al. Large language models in critical care. J Intensive Med. Apr 2025;5(2):113-118. [CrossRef] [Medline]
- Emile SH, Horesh N, Freund M, et al. How appropriate are answers of online chat-based artificial intelligence (ChatGPT) to common questions on colon cancer? Surgery. Nov 2023;174(5):1273-1275. [CrossRef] [Medline]
- Haltaufderheide J, Ranisch R. The ethics of ChatGPT in medicine and healthcare: a systematic review on large language models (LLMs). NPJ Digit Med. Jul 8, 2024;7(1):183. [CrossRef] [Medline]
- Wang Q, Zou H, Zhang H, Huang Y, Tian J, Cheng W. A survey on medical competence evaluation benchmarks for large language models. Health Care Sci. Feb 2026;5(1):4-18. [CrossRef] [Medline]
- Zhou S, Luo X, Chen C, et al. The performance of large language model-powered chatbots compared to oncology physicians on colorectal cancer queries. Int J Surg. Oct 1, 2024;110(10):6509-6517. [CrossRef] [Medline]
- Jeon S, Kim HG. A comparative evaluation of chain-of-thought-based prompt engineering techniques for medical question answering. Comput Biol Med. Sep 2025;196(Pt A):110614. [CrossRef] [Medline]
- Sivarajkumar S, Kelley M, Samolyk-Mazzanti A, Visweswaran S, Wang Y. An empirical evaluation of prompting strategies for large language models in zero-shot clinical natural language processing: algorithm development and validation study. JMIR Med Inform. Apr 8, 2024;12:e55318. [CrossRef] [Medline]
- Lim DYZ, Tan YB, Koh JTE, et al. ChatGPT on guidelines: providing contextual knowledge to GPT allows it to provide advice on appropriate colonoscopy intervals. J Gastroenterol Hepatol. Jan 2024;39(1):81-106. [CrossRef] [Medline]
- Amugongo LM, Mascheroni P, Brooks S, Doering S, Seidel J. Retrieval augmented generation for large language models in healthcare: a systematic review. PLOS Digit Health. Jun 2025;4(6):e0000877. [CrossRef] [Medline]
- Yang Y, Jin Q, Huang F, Lu Z. Adversarial prompt and fine-tuning attacks threaten medical large language models. Nat Commun. Oct 9, 2025;16(1):9011. [CrossRef] [Medline]
- Savage T, Nayak A, Gallo R, Rangan E, Chen JH. Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine. NPJ Digit Med. Jan 24, 2024;7(1):20. [CrossRef] [Medline]
- Williams CYK, Miao BY, Kornblith AE, Butte AJ. Evaluating the use of large language models to provide clinical recommendations in the emergency department. Nat Commun. Oct 8, 2024;15(1):8236. [CrossRef] [Medline]
- Zhong R, Chen S, Li Z, et al. Large language models in lung cancer: systematic review. J Med Internet Res. Sep 30, 2025;27:e74177. [CrossRef] [Medline]
- Hao Y, Qiu Z, Holmes J, et al. Large language model integrations in cancer decision-making: a systematic review and meta-analysis. NPJ Digit Med. Jul 17, 2025;8(1):450. [CrossRef] [Medline]
- Clusmann J, Kolbinger FR, Muti HS, et al. The future landscape of large language models in medicine. Commun Med (Lond). Oct 10, 2023;3(1):141. [CrossRef] [Medline]
- Page MJ, McKenzie JE, Bossuyt PM, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. Mar 29, 2021;372:n71. [CrossRef] [Medline]
- Rethlefsen ML, Kirtley S, Waffenschmidt S, et al. PRISMA-S: an extension to the PRISMA statement for reporting literature searches in systematic reviews. Syst Rev. Jan 26, 2021;10(1):39. [CrossRef] [Medline]
- Campbell M, McKenzie JE, Sowden A, et al. Synthesis without meta-analysis (SWiM) in systematic reviews: reporting guideline. BMJ. Jan 16, 2020;368:l6890. [CrossRef] [Medline]
- Omar M, Levkovich I. Exploring the efficacy and potential of large language models for depression: a systematic review. J Affect Disord. Feb 15, 2025;371:234-244. [CrossRef] [Medline]
- Whiting PF, Rutjes AWS, Westwood ME, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. Oct 18, 2011;155(8):529-536. [CrossRef] [Medline]
- Moons KGM, Wolff RF, Riley RD, et al. PROBAST: a tool to assess risk of bias and applicability of prediction model studies: explanation and elaboration. Ann Intern Med. Jan 1, 2019;170(1):W1-W33. [CrossRef] [Medline]
- Sterne JA, Hernán MA, Reeves BC, et al. ROBINS-I: a tool for assessing risk of bias in non-randomised studies of interventions. BMJ. Oct 12, 2016;355:i4919. [CrossRef] [Medline]
- Gorelik Y, Ghersin I, Maza I, Klein A. Harnessing language models for streamlined postcolonoscopy patient management: a novel approach. Gastrointest Endosc. Oct 2023;98(4):639-641. [CrossRef] [Medline]
- Alzaid E, Pergola G, Evans H, Snead D, Minhas F. Large multimodal model-based standardisation of pathology reports with confidence and its prognostic significance. J Pathol Clin Res. Nov 2024;10(6):e70010. [CrossRef] [Medline]
- Atarere J, Naqvi H, Haas C, et al. Applicability of online chat-based artificial intelligence models to colorectal cancer screening. Dig Dis Sci. Mar 2024;69(3):791-797. [CrossRef] [Medline]
- Ferber D, Wölflein G, Wiest IC, et al. In-context learning enables multimodal large language models to classify cancer pathology images. Nat Commun. Nov 21, 2024;15(1):10104. [CrossRef] [Medline]
- Kaiser KN, Hughes AJ, Yang AD, et al. Accuracy and consistency of publicly available large language models as clinical decision support tools for the management of colon cancer. J Surg Oncol. Oct 2024;130(5):1104-1110. [CrossRef] [Medline]
- Kepez MS, Ugur F. Comparative evaluation of information quality on colon cancer for patients: a study of ChatGPT-4 and Google. Cureus. Nov 2024;16(11):e73989. [CrossRef] [Medline]
- Liu J, Liang X, Fang D, et al. The diagnostic ability of GPT-3.5 and GPT-4.0 in surgery: comparative analysis. J Med Internet Res. Sep 10, 2024;26:e54985. [CrossRef] [Medline]
- Wang A, Zhou J, Zhang P, et al. Large language model answers medical questions about standard pathology reports. Front Med (Lausanne). 2024;11:1402457. [CrossRef] [Medline]
- Kim HB, Tan HQ, Nei WL, Tan YCRS, Cai Y, Wang F. Impact of large language models and vision deep learning models in predicting neoadjuvant rectal score for rectal cancer treated with neoadjuvant chemoradiation. BMC Med Imaging. Jul 31, 2025;25(1):306. [CrossRef] [Medline]
- Chizhikova M, López-Úbeda P, Martín-Noguerol T, et al. Automatic TNM staging of colorectal cancer radiology reports using pre-trained language models. Comput Methods Programs Biomed. Feb 2025;259:108515. [CrossRef] [Medline]
- Horesh N, Emile SH, Gupta S, et al. Comparing the management recommendations of large language model and colorectal cancer multidisciplinary team: a pilot study. Dis Colon Rectum. Jan 1, 2025;68(1):41-47. [CrossRef] [Medline]
- Yang X, Xu J, Ji H, Li J, Yang B, Wang L. Early prediction of colorectal adenoma risk: leveraging large-language model for clinical electronic medical record data. Front Oncol. 2025;15:1508455. [CrossRef] [Medline]
- Zhang Z, Zhang ZC, Zhang SP, et al. Comparative analysis of artificial intelligence tools for the dissemination of colorectal cancer screening guidelines: a novel perspective on early screening education. Int J Surg. Nov 1, 2025;111(11):8616-8620. [CrossRef] [Medline]
- Zeng L, Cao Q, Deng J, Hu J, Pang M, Liu F. Guideline adherence in surgical decisions for T1 colorectal cancer after endoscopic resection: large language models vs clinicians. Int J Surg. Jan 1, 2026;112(1):1886-1890. [CrossRef] [Medline]
- Yu Z, Fang L, Ding Y, et al. Evaluating large language models for information extraction from gastroscopy and colonoscopy reports through multi-strategy prompting. J Biomed Inform. Aug 2025;168:104844. [CrossRef] [Medline]
- Yang EW, Waldrup B, Velazquez-Villarreal E. Conversational AI agent for precision oncology: AI-HOPE-WNT integrates clinical and genomic data to investigate WNT pathway dysregulation in colorectal cancer. Front Artif Intell. 2025;8:1624797. [CrossRef] [Medline]
- Wang S, Zhu Y, Yang Z, et al. Leveraging large language and vision models for knowledge extraction from large-scale image-text colonoscopy records. Nat Biomed Eng. Sep 16, 2025. [CrossRef] [Medline]
- Sehgal NKR, Tonneau M, Tan A, et al. Effect of static vs. conversational AI-generated messages on colorectal cancer screening intent: a randomized controlled trial. arXiv. Preprint posted online on Jul 10, 2025. [CrossRef]
- Schmutz M, Sommer S, Sander J, et al. Large language model processing capabilities of ChatGPT 4.0 to generate molecular tumor board recommendations - a critical evaluation on real world data. Oncologist. Oct 1, 2025;30(10):oyaf293. [CrossRef] [Medline]
- Massimi D, Carlini L, Mori Y, et al. Large language model for interpreting the Paris classification of colorectal polyps. Endosc Int Open. 2025;13(CP):a27030209. [CrossRef] [Medline]
- Maida M, Mori Y, Fuccio L, et al. Exploring ChatGPT effectiveness in addressing direct patient queries on colorectal cancer screening. Endosc Int Open. 2025;13(CP):a25689416. [CrossRef] [Medline]
- Hu Y, Wang S, Cai P, Artificial Intelligence Colorectal Cancer Research (AI-CORE) Working Group. Multidimensional assessment of ChatGPT in colorectal cancer postoperative consultations: analysing response variations across critical clinical domains. Digit Health. 2025;11:20552076251393297. [CrossRef] [Medline]
- Ding L, Fan L, Shen M, et al. Evaluating ChatGPT’s diagnostic potential for pathology images. Front Med (Lausanne). 2025;11:1507203. [CrossRef] [Medline]
- Diaz FC, Waldrup B, Carranza FG, Manjarrez S, Velazquez-Villarreal E. Artificial intelligence-enhanced precision medicine reveals prognostic impact of TGF-beta pathway alterations in FOLFOX-treated early-onset colorectal cancer among disproportionately affected populations. Int J Mol Sci. Sep 17, 2025;26(18):9067. [CrossRef] [Medline]
- Chatziisaak D, Burri P, Sparn M, Hahnloser D, Steffen T, Bischofberger S. Concordance of ChatGPT artificial intelligence decision-making in colorectal cancer multidisciplinary meetings: retrospective study. BJS Open. May 7, 2025;9(3):zraf040. [CrossRef] [Medline]
- Garg SK, Mau B, Hubers J, et al. Colon-Pilot: a generative AI tool for automated colonoscopy surveillance recommendations and 2024 ACG/ASGE quality benchmarking. Am J Gastroenterol. Apr 1, 2026;121(4):964-973. [CrossRef] [Medline]
- Kim JS, Baek SJ, Ryu HS, et al. Using large language models for clinical staging of colorectal cancer from imaging reports: a pilot study. Ann Surg Treat Res. Nov 2025;109(5):318-327. [CrossRef] [Medline]
- Wang L, Ma Y, Bi W, Lv H, Li Y. An entity extraction pipeline for medical text records using large language models: analytical study. J Med Internet Res. Mar 29, 2024;26:e54580. [CrossRef] [Medline]
- Chen RJ, Ding T, Lu MY, et al. Towards a general-purpose foundation model for computational pathology. Nat Med. Mar 2024;30(3):850-862. [CrossRef] [Medline]
- Zhu M, Lin H, Jiang J, et al. Large language model trained on clinical oncology data predicts cancer progression. NPJ Digit Med. Jul 2, 2025;8(1):397. [CrossRef] [Medline]
- Tariq R, Malik S, Khanna S. Evolving landscape of large language models: an evaluation of ChatGPT and Bard in answering patient queries on colonoscopy. Gastroenterology. Jan 2024;166(1):220-221. [CrossRef] [Medline]
- Maida M, Celsa C, Lau LHS, et al. The application of large language models in gastroenterology: a review of the literature. Cancers (Basel). Sep 28, 2024;16(19):3328. [CrossRef] [Medline]
- Jonnagaddala J, Shulajkovska M, Gradišek A, et al. Multimodal analysis of whole slide images in colorectal cancer. NPJ Digit Med. Nov 24, 2025;8(1):719. [CrossRef] [Medline]
- Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature. Aug 2023;620(7972):172-180. [CrossRef] [Medline]
- Giuffrè M, Kresevic S, Pugliese N, You K, Shung DL. Optimizing large language models in digestive disease: strategies and challenges to improve clinical outcomes. Liver Int. Sep 2024;44(9):2114-2124. [CrossRef] [Medline]
- Fraile Navarro D, Ijaz K, Rezazadegan D, et al. Clinical named entity recognition and relation extraction using natural language processing of medical free text: a systematic review. Int J Med Inform. Sep 2023;177:105122. [CrossRef] [Medline]
- Guo S, Shariatmadari AH, Xiong G, Zhang A. Embracing foundation models for advancing scientific discovery. Presented at: 2024 IEEE International Conference on Big Data (BigData); Dec 15-18, 2024:1746-1755; Washington, DC, USA. [CrossRef]
- Abdel-Rehim A, Zenil H, Orhobor O, et al. Scientific hypothesis generation by large language models: laboratory validation in breast cancer treatment. J R Soc Interface. Jun 2025;22(227):20240674. [CrossRef] [Medline]
- Sun D, Hadjiiski L, Gormley J, et al. Outcome prediction using multi-modal information: integrating large language model-extracted clinical information and image analysis. Cancers (Basel). Jun 29, 2024;16(13):2402. [CrossRef] [Medline]
- Kocak Z. Publication ethics in the era of artificial intelligence. J Korean Med Sci. Aug 26, 2024;39(33):e249. [CrossRef] [Medline]
- Chen D, Parsa R, Swanson K, et al. Large language models in oncology: a review. BMJ Oncol. 2025;4(1):e000759. [CrossRef] [Medline]
- Hager P, Jungmann F, Holland R, et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat Med. Sep 2024;30(9):2613-2622. [CrossRef] [Medline]
- Huang J, Yang DM, Rong R, et al. A critical assessment of using ChatGPT for extracting structured data from clinical notes. NPJ Digit Med. May 1, 2024;7(1):106. [CrossRef] [Medline]
- Liu X, Cruz Rivera S, Moher D, Calvert MJ, Denniston AK, SPIRIT-AI and CONSORT-AI Working Group. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat Med. Sep 2020;26(9):1364-1374. [CrossRef] [Medline]
- DECIDE-AI Steering Group. DECIDE-AI: new reporting guidelines to bridge the development-to-implementation gap in clinical artificial intelligence. Nat Med. Feb 2021;27(2):186-187. [CrossRef] [Medline]
- Zack T, Lehman E, Suzgun M, et al. Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study. Lancet Digit Health. Jan 2024;6(1):e12-e22. [CrossRef] [Medline]
- Tschandl P, Rinner C, Apalla Z, et al. Human-computer collaboration for skin cancer recognition. Nat Med. Aug 2020;26(8):1229-1234. [CrossRef] [Medline]
- Wornow M, Xu Y, Thapa R, et al. The shaky foundations of large language models and foundation models for electronic health records. NPJ Digit Med. Jul 29, 2023;6(1):135. [CrossRef] [Medline]
- Moor M, Banerjee O, Abad ZSH, et al. Foundation models for generalist medical artificial intelligence. Nature. Apr 2023;616(7956):259-265. [CrossRef] [Medline]
- Zakka C, Shad R, Chaurasia A, et al. Almanac - retrieval-augmented language models for clinical medicine. NEJM AI. Feb 2024;1(2). [CrossRef] [Medline]
- Artsi Y, Sorin V, Glicksberg BS, Korfiatis P, Nadkarni GN, Klang E. Large language models in real-world clinical workflows: a systematic review of applications and implementation. Front Digit Health. 2025;7:1659134. [CrossRef] [Medline]
- Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of GPT-4 on medical challenge problems. arXiv. Preprint posted online on Mar 20, 2023. [CrossRef]
- Wei J, Wang X, Schuurmans D, et al. Chain-of-thought prompting elicits reasoning in large language models. arXiv. Preprint posted online on Jan 10, 2023. [CrossRef]
- Ghassemi M, Naumann T, Schulam P, Beam AL, Chen IY, Ranganath R. A review of challenges and opportunities in machine learning for health. AMIA Jt Summits Transl Sci Proc. 2020;2020:191-200. [Medline]
- Cervantes A, Adam R, Roselló S, et al. Metastatic colorectal cancer: ESMO clinical practice guideline for diagnosis, treatment and follow-up. Ann Oncol. Jan 2023;34(1):10-32. [CrossRef] [Medline]
- Char DS, Shah NH, Magnus D. Implementing machine learning in health care - addressing ethical challenges. N Engl J Med. Mar 15, 2018;378(11):981-983. [CrossRef] [Medline]
- Amann J, Blasimme A, Vayena E, Frey D, Madai VI, Precise4Q consortium. Explainability for artificial intelligence in healthcare: a multidisciplinary perspective. BMC Med Inform Decis Mak. Nov 30, 2020;20(1):310. [CrossRef] [Medline]
- Ahmed M, Whicher D, Israni ST. Artificial Intelligence in Health Care: The Hope, the Hype, the Promise, the Peril. National Academy of Medicine; 2023. URL: http://www.ncbi.nlm.nih.gov/books/NBK605955 [Accessed 2026-03-18]
- Davenport T, Kalakota R. The potential for artificial intelligence in healthcare. Future Healthc J. Jun 2019;6(2):94-98. [CrossRef] [Medline]
Abbreviations
| CRC: colorectal cancer |
| EHR: electronic health record |
| LLM: large language model |
| NLP: natural language processing |
| PICOS: Population, Intervention, Comparison, Outcome, Study design |
| PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses |
| PRISMA-S: Preferred Reporting Items for Systematic Reviews and Meta-Analyses Literature Search Extension |
| PROBAST: prediction model risk of bias assessment tool |
| PROSPERO: prospective register of systematic reviews |
| QUADAS-2: Quality Assessment of Diagnostic Accuracy Studies-2 |
| RAG: retrieval-augmented generation |
| ROBINS-I: Risk of Bias in Nonrandomized Studies - of Interventions |
| SWiM: synthesis without meta-analysis |
| TNM: tumor–node–metastasis |
Edited by Stefano Brini; submitted 22.Dec.2025; peer-reviewed by Alexandros Sagkriotis, Ilker Tosun; final revised version received 26.Apr.2026; accepted 27.Apr.2026; published 21.May.2026.
Copyright © Jinglei Tian, Qifeng Lou, Xue Wang, Hangying Xv, Huiting Mei, Yanli Yu. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 21.May.2026.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.

