A Comparative Evaluation of Fine-Tuning, Retrieval-Augmented Generation, and Hybrid Large Language Models for Postoperative Clinical Decision Support
Date Submitted: Jan 1, 2026
Open Peer Review Period: Jan 1, 2026 - Feb 26, 2026
Background: Large language models (LLMs) have shown growing potential for clinical decision support. However, effectively integrating domain-specific medical knowledge into LLMs while maintaining accuracy, safety, and interpretability remains a key challenge for postoperative discharge instructions and patient education. Fine-tuning (FT), retrieval-augmented generation (RAG), and hybrid FT+RAG approaches are three prominent knowledge-integration strategies, yet their comparative performance in postoperative clinical contexts has not been systematically evaluated.

Objective: We aimed to compare the clinical performance, reliability, and safety characteristics of baseline, fine-tuned, retrieval-augmented, and hybrid FT+RAG LLM configurations for postoperative clinical decision support.

Methods: We conducted a controlled comparative evaluation of four LLM configurations built on Google Gemini 2.5 Flash. A total of 600 postoperative question–answer pairs were used for model adaptation and validation, and 150 queries were reserved for final evaluation. Queries included routine postoperative care questions, emergency escalation scenarios, and deliberately out-of-scope questions. Model outputs were independently assessed by three blinded clinical experts for accuracy, completeness, and relevance. Automated metrics were used to evaluate readability, faithfulness, and hallucination propensity.

Results: All knowledge-enhanced models significantly outperformed the baseline model in clinical accuracy (baseline 68.0% vs FT 92.7%, RAG 91.3%, FT+RAG 97.3%; P<.001). The hybrid FT+RAG model achieved the highest overall performance, including 100% precision, 96.7% recall, and the lowest hallucination rate. FT and RAG alone yielded comparable gains in accuracy, completeness, relevance, faithfulness, and hallucination reduction, with no statistically significant differences between them. Although the enhanced models produced shorter, more concise responses, their readability was lower than that of the baseline model.

Conclusions: Incorporating domain knowledge substantially improves the clinical performance of LLMs for postoperative decision support. Hybrid FT+RAG approaches provide the strongest overall accuracy and safety profile, although trade-offs in readability, interpretability, and rater variability remain. These findings support the use of knowledge-augmented LLMs in postoperative care while underscoring the need for careful governance, transparency, and human oversight before clinical deployment.

Clinical Trial: Not applicable
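As a point of reference for the metrics reported in the Results, the sketch below shows one standard way accuracy, precision, and recall can be computed from binary judgments. It is illustrative only, not the authors' evaluation code; the function names and the toy labels are assumptions made for this example (here, precision/recall are framed around detecting queries that require emergency escalation).

```python
# Illustrative sketch (not the study's actual pipeline): computing the
# binary classification metrics named in the abstract.

def precision_recall(truth, pred):
    """truth: whether each query truly requires escalation;
    pred: whether the model flagged it for escalation (hypothetical labels)."""
    tp = sum(1 for t, p in zip(truth, pred) if t and p)        # true positives
    fp = sum(1 for t, p in zip(truth, pred) if not t and p)    # false positives
    fn = sum(1 for t, p in zip(truth, pred) if t and not p)    # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def accuracy(labels):
    """labels: 1 if a blinded expert judged the answer clinically accurate."""
    return sum(labels) / len(labels)

# Hypothetical toy data for three evaluation queries
truth = [True, True, False]
pred = [True, False, False]
p, r = precision_recall(truth, pred)  # one missed escalation lowers recall
```

In a setup like this, a precision of 100% with recall below 100% (as reported for the FT+RAG model) would mean every flagged escalation was genuine, while a small number of true escalations went unflagged.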
