Original Paper
Abstract
Background: Accurately assessing perioperative risk is critical for informed surgical planning and patient safety. However, current prediction models often rely on structured data and overlook the nuanced clinical reasoning embedded in free-text preoperative notes. Recent advances in large language models (LLMs) have opened opportunities for harnessing unstructured clinical data, yet their application in perioperative prediction remains limited by concerns about factual accuracy. Retrieval-augmented generation (RAG) offers a promising solution—enhancing LLM performance by grounding outputs in domain-specific knowledge sources, potentially improving both predictive accuracy and clinical interpretability.
Objective: This study aimed to investigate whether integrating LLMs with RAG can improve the prediction of 30-day postoperative mortality and American Society of Anesthesiologists (ASA) physical status classification using unstructured preoperative clinical notes.
Methods: We conducted a retrospective cohort study using 24,491 medical records from a tertiary medical center, including preoperative anesthesia assessments, discharge summaries, and surgical information. To extract clinical insights from free-text data, we used the LLaMA 3.1-8B language model with RAG, with MedEmbed for text embedding and Miller’s Anesthesia as the primary retrieval source. We evaluated model performance under various configurations, including embedding models, chunk sizes, and few-shot prompting. Machine learning (ML) models, including random forest, support vector machines (SVM), Extreme Gradient Boosting (XGBoost), and logistic regression, were trained on structured features as baselines.
Results: A total of 520 (2.1%) patients experienced in-hospital 30-day postoperative mortality. The ASA physical status distribution was as follows: class I: 535 (2.2%); class II: 15,272 (62.4%); class III: 8024 (32.8%); class IV: 606 (2.5%); and class V: 54 (0.22%). For 30-day postoperative mortality prediction, the LLaMA‑RAG model achieved an F1-score of 0.4663 (95% CI 0.4654-0.4672), versus 0.2369 (95% CI 0.2341-0.2397) without few‑shot prompting, 0.0879 (95% CI 0.0717-0.1041) without RAG, and 0.0436 (95% CI 0.0292-0.0580) without either few‑shot prompting or RAG. Among ML models, XGBoost scored 0.4459 (95% CI 0.4176-0.4742); random forest, 0.3953 (95% CI 0.3791-0.4115); logistic regression, 0.2720 (95% CI 0.2647-0.2793); and SVM, 0.2474 (95% CI 0.2275-0.2673). For ASA classification, LLaMA‑RAG achieved a micro F1-score of 0.8409 (95% CI 0.8238-0.8551) versus 0.6546 (95% CI 0.6430-0.6796) without few-shot prompting, 0.6340 (95% CI 0.6157-0.6535) without RAG, and 0.4238 (95% CI 0.3952-0.4490) without either few‑shot prompting or RAG. In comparison, XGBoost achieved 0.8273 (95% CI 0.8209-0.8498); logistic regression, 0.7940 (95% CI 0.7671-0.7950); random forest, 0.7847 (95% CI 0.7637-0.7868); and SVM, 0.7697 (95% CI 0.7637-0.7697). Notably, the model demonstrated exceptional sensitivity in identifying rare but high-risk cases, such as ASA Class 5 patients and postoperative deaths.
Conclusions: The LLaMA-RAG model significantly improved the prediction of postoperative mortality and ASA classification, especially for rare high-risk cases. By grounding outputs in domain knowledge, retrieval-augmented generation enhanced both accuracy and prompt‑driven interpretability over ML and ablation models—highlighting its promise for real-world clinical decision support.
doi:10.2196/75052
Keywords
Introduction
Background
Postoperative mortality remains a significant concern for health care professionals. Accurate preoperative risk prediction enables better surgical planning and anesthesia management, which can ultimately improve patient outcomes [
- ]. Existing risk assessment tools, such as the American College of Surgeons National Surgical Quality Improvement Program [ ] and the American Society of Anesthesiologists physical status (ASA-PS) [ ], focus primarily on structured data like patient demographics and medical history. However, these models often exclude valuable information contained in unstructured clinical text, such as surgical notes and physician reports [ ].
Prior Work
The use of machine learning (ML) in predicting postoperative mortality has shown promising results. ML models leveraging electronic health records (EHRs) have demonstrated superior predictive performance compared to traditional methods, often achieving higher area under the receiver operating characteristic curve (AUROC) values [
- ]. Recent advancements in natural language processing, particularly the development of models like Bidirectional Encoder Representations from Transformers (BERT) [ ], have improved the ability to extract meaningful insights from unstructured clinical texts [ ]. However, BERT’s token length limitation constrains its ability to fully use long-form clinical notes, potentially limiting the model’s effectiveness in real-world applications [ , ]. Recent research has explored the application of large language models (LLMs) in perioperative risk prediction, specifically for predicting ASA-PS classifications. For instance, one study demonstrated that while LLMs, such as GPT-4, achieved moderate F1-scores in both in-hospital mortality (F1-score of 0.86) and ASA-PS prediction (F1-score of 0.50), their ability to provide interpretability and explanations for the classifications offered potential clinical utility [
]. Despite relatively lower performance in numeric scoring, the use of chain-of-thought reasoning in LLMs allowed for a better understanding of adjacent classifications, assisting clinicians in decision-making during the surgical planning phase [ ]. Among LLMs, the LLaMA (LLM Meta AI) series has gained attention for its high performance, open-source availability, and adaptability across diverse natural language processing tasks [
]. LLaMA 3.1, in particular, has been shown to excel in clinical prediction tasks. For example, DRG-LLaMA, a model fine-tuned from LLaMA, demonstrated superior predictive performance for assigning diagnosis-related groups using unstructured clinical notes from the MIMIC-IV dataset. The model achieved an area under the curve of 0.986 and outperformed existing methods like ClinicalBERT, showcasing its utility in handling complex health care–related classification tasks [ ]. Furthermore, a systematic review by Pressman et al [
] highlighted the broader potential of LLMs in clinical applications, including diagnostic support and risk stratification through the analysis of unstructured EHR data, underscoring the value of LLMs in perioperative care settings. These findings suggest that LLMs can complement existing risk assessment models by providing valuable insights from unstructured clinical data, despite certain limitations in numeric accuracy.
Aim
This study aimed to demonstrate and evaluate the effectiveness of using an LLM integrated with retrieval-augmented generation (LLM-RAG) to improve clinical risk prediction, specifically for 30-day postoperative mortality and ASA-PS classification, based on preoperative unstructured clinical notes. To this end, we incorporated RAG [
], a technique that dynamically retrieves and integrates relevant external information during LLM inference. In addition to predicting postoperative mortality, we included the task of ASA-PS classification to evaluate the model’s capabilities in preoperative risk stratification [ , , ]. By grounding generated outputs in retrieved domain knowledge, RAG can enhance predictive power [ ]. Furthermore, this study explores both zero-shot and few-shot prediction (also known as in-context learning) strategies to investigate their respective impacts on model performance. Few-shot prompting provides representative tasks and worked examples in the prompt to guide the model, whereas zero-shot prompting provides only the query task without examples [
]. By leveraging these approaches, we aim to address the limitations of current models—particularly in handling long clinical notes, improving interpretability [ , ], and enhancing generalization across multiple tasks. Through this research, we seek to demonstrate the potential for improved prediction accuracy, model interpretability [ ], and better clinical decision-making in real-world hospital settings [ ]. We expect that, by improving risk prediction and enabling prompt‑driven interpretability, this LLM‑RAG approach could enhance perioperative decision support.
Methods
Ethical Considerations
This retrospective observational study was approved by the institutional review board of Far Eastern Memorial Hospital (112166-F), a large academic medical center. Preoperative anesthesia assessment records, surgical information, and discharge summaries were collected from the hospital’s EHR from January 1, 2016, to July 31, 2023.
Inclusion and Exclusion Criteria
Patients included in this study were aged 18 years or older and had undergone at least one surgical procedure under general or neuraxial anesthesia (n=118,274). Cases were excluded if they had an ASA-PS classification of 6 (n=90), indicating patients declared brain-dead for organ donation. Additionally, records were excluded if they lacked critical information, including entry time (n=2), exit time, preoperative diagnosis, proposed procedure text (n=3), or those whose surgery date fell after all discharge dates (n=93,688). The final cohort consisted of 24,491 patients and was divided into training (n=19,592) and validation cohorts (n=4899;
).
Outcome Definition
The primary endpoint of this study was in-hospital 30-day postoperative mortality, defined as any patient death occurring within 30 days after surgery. Mortality status was determined based on discharge records explicitly labeled as “expired” or “critical against-advice discharge” (ie, patients who left the hospital against medical advice in a critical condition, with no subsequent readmission recorded within the 30-day window). Patients discharged alive and not readmitted within 30 days were classified as nonmortality cases.
The secondary endpoint was the ASA-PS classification, recorded at the time of preoperative anesthesia evaluation. ASA-PS was categorized into five classes (labeled as 1-5), reflecting the patient’s physical condition prior to surgery [
, , ]:- Class 1 (ASA I): A normal, healthy patient with no systemic disease. This category includes individuals who are free from any significant health conditions and are at the lowest risk for anesthesia-related complications.
- Class 2 (ASA II): A patient with mild systemic disease that does not limit physical activity. Examples include controlled hypertension or diabetes without systemic complications, mild obesity, or smoking without chronic obstructive pulmonary disease.
- Class 3 (ASA III): A patient with severe systemic disease that limits physical activity but is not incapacitating. Conditions may include poorly controlled hypertension, diabetes with systemic complications, or a history of myocardial infarction.
- Class 4 (ASA IV): A patient with severe systemic disease that is a constant threat to life. Examples include recent myocardial infarction, unstable angina, advanced liver or kidney disease, or severe heart failure.
- Class 5 (ASA V): A moribund patient who is not expected to survive without the operation. These patients typically have life-threatening conditions such as a ruptured abdominal aneurysm or massive trauma.
Data Preprocessing
Outlier Filtering
Continuous variables such as height, weight, and BMI were filtered to remove physiologically implausible values (retained ranges: height 120-250 cm; weight 25-200 kg; BMI ≤100 kg/m2). Laboratory data, including hemoglobin, hematocrit, blood sugar, and potassium, were filtered for outliers. Vital signs, including heart rate, blood pressure, respiratory rate, body temperature, and pulse oximetry, were filtered for outliers (
). Outliers were identified as potential input errors and treated as missing data.
Structured Data Preprocessing
Structured features included patient characteristics, surgical characteristics, comorbidities, preoperative laboratory values, and preoperative vital signs (
). Continuous features included age, height, weight, latest laboratory data before surgery (eg, hemoglobin, platelet count, and creatinine), and preoperative vital signs (eg, body temperature, oxygen saturation, heart rate, respiratory rate, and systolic and diastolic blood pressure). Missing values were imputed with the median value of the dataset for continuous features.
Feature type | Feature classes |  |  |  |
Structured data | ||||
Patient characteristics | ||||
Continuous | Age, height, weight, BMI | |||
Categorical | Sex (n=2), ASA-PSb (n=5), ASA emergency (n=2), department (n=22), preoperative location (n=4), anesthesia type (n=4) | |||
Surgery characteristics | ||||
Categorical | Emergency level (n=4) | |||
Comorbid conditions | ||||
Categorical | Diabetes mellitus (n=2), hyperlipidemia (n=2), hypertension (n=2), cerebrovascular accident (n=2), cardiac disease (n=2), chronic obstructive pulmonary disease (n=2), asthma (n=2), hepatic disease (n=2), renal disease (n=2), bleeding disorder (n=2), major operations (n=2), smoking (n=2), drug allergy (n=2) | |||
Preoperative laboratory values | ||||
Continuous | Hemoglobin, platelet, international normalized ratio, prothrombin time, activated partial thromboplastin time, creatinine, aspartate transaminase, alanine transaminase, blood sugar, serum sodium, serum potassium | |||
Preoperative vital signs | ||||
Continuous | Body temperature, oxygen saturation, heart rate, respiratory rate, systolic and diastolic blood pressure | |||
Categorical | Consciousness status (n=2) | |||
Unstructured data | ||||
Free text | Provider service, procedure, diagnosis, planned anesthesia, description, chief complaint, present illness, discharge diagnosis, discharge treatment |
aThe parenthesized numbers indicate the number of categories within each feature class. ASA-PS class is used as an input feature for mortality prediction but as the output label for the ASA-PS classification task.
bASA-PS: American Society of Anesthesiologists physical status.
Categorical features with only 2 classes, such as sex, comorbidities (eg, diabetes mellitus and hypertension), ASA emergency status, and consciousness status, were converted into binary encoding. Features with more than 2 classes, such as ASA-PS (5 classes), department (22 classes), emergency level (4 classes), preoperative location (4 classes), and anesthesia type (4 classes), were transformed into one-hot encodings. For all categorical features, missing values were imputed using the majority category from the training dataset.
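The encoding and imputation steps above can be sketched in a few lines of Python; the column values below are illustrative toy data, not drawn from the study cohort:

```python
# Minimal sketch of the categorical preprocessing described above:
# majority imputation, binary encoding for 2-class features, and
# one-hot encoding for multi-class features such as ASA-PS.

from collections import Counter

def impute_majority(values):
    """Replace missing (None) entries with the majority category."""
    majority = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [majority if v is None else v for v in values]

def binarize(values, positive):
    """Binary-encode a 2-class feature (eg, a comorbidity flag)."""
    return [1 if v == positive else 0 for v in values]

def one_hot(values, categories):
    """One-hot encode a multi-class feature (eg, ASA-PS with 5 classes)."""
    return [[1 if v == c else 0 for c in categories] for v in values]

# Hypothetical toy columns: ASA-PS with one missing record, a diabetes flag
asa = impute_majority([2, 3, None, 2, 4])
asa_onehot = one_hot(asa, categories=[1, 2, 3, 4, 5])
dm_flag = binarize(["yes", "no", "yes"], positive="yes")
```

In the study pipeline, the majority category is learned from the training dataset only and then applied to the validation cohort, which avoids leaking validation information into preprocessing.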
Unstructured Data Preprocessing
Unstructured data were primarily used in the LLaMA-based models, with or without RAG and with zero-shot or few-shot prompting. These free-text clinical notes included provider service, procedure, diagnosis, emergency level, planned anesthesia, description, chief complaint, present illness, discharge diagnosis, and discharge treatment. Provider service (department), emergency level, and planned anesthesia (anesthesia type) were originally recorded as text but, after preprocessing, were also used as the corresponding categorical features above. The integration of unstructured data allowed us to evaluate its ability to provide additional clinical insights and improve predictive accuracy beyond what structured data could achieve.
For each surgery, we extracted the latest preoperative anesthesia assessment and the discharge summary preceding the operation, along with up to four earlier summaries to capture longitudinal context. All text underwent cleaning to remove templates and headers, was converted to lowercase, stripped of punctuation, and uniformly tokenized. To meet the 4096-token limit of LLaMA 3.1‑8B, we first used a summarization prompt (
) to summarize each discharge note into its key clinical elements, then concatenated these summaries in chronological order. If the combined narrative exceeded the token limit, we truncated from the oldest summaries, preserving the most recent information. This pipeline distilled rich longitudinal context into a compact, structured input optimized for downstream LLM-based prediction tasks.
Model Development
LLM Setup
outlines our LLaMA‑RAG experimental framework. We split the full cohort into an 80% training and a 20% validation set. The lower section illustrates the workflow with RAG and the upper section without. During validation, cases with only 1 prior record were fed directly as zero‑ or few‑shot prompts, whereas those with up to 4 prior records were first summarized using a summarization prompt and then concatenated in chronological order before prompting. For RAG configurations, each case’s query was also sent to a vector database of domain text—chunked, embedded, and retrieved via semantic search—and the top‑ranked passages were appended to the prompt.
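The retrieval step in this workflow can be illustrated with a minimal, self-contained sketch. A toy bag-of-words cosine similarity stands in for the MedEmbed embeddings and vector database used in the study, and the reference sentences are invented placeholders, not textbook excerpts:

```python
# Toy sketch of RAG retrieval: chunk the domain text, embed each chunk,
# rank chunks by semantic similarity to the case query, and append the
# top-k passages to the prompt.

import math
from collections import Counter

def chunk(text, size):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    # Stand-in for a real embedding model such as MedEmbed
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus_chunks, top_k):
    q = embed(query)
    ranked = sorted(corpus_chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]

reference = ("severe aortic stenosis raises perioperative risk. "
             "controlled hypertension is a mild systemic disease. "
             "moribund patients are classified as asa class five.")

passages = retrieve("patient with controlled hypertension", chunk(reference, 6), top_k=2)
prompt = "Context:\n" + "\n".join(passages) + "\nQuestion: assign the ASA-PS class."
```

The real pipeline differs mainly in scale: chunks of the tuned size are embedded once, stored in a vector database, and searched at inference time rather than re-embedded per query.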

To ensure that the selected model aligns with our needs, we conducted preliminary comparisons between LLaMA 3.1-8B and LLaMA 3.2-3B using identical prompts and hyperparameters (eg, temperature, top_p) [
]. We selected these two models due to their ability to operate efficiently on available computational resources, including the V100 graphics processing unit, 4 central processing unit cores, and 90 GB of memory. The results showed that LLaMA 3.1-8B consistently outperformed its smaller counterpart under the same conditions. Specifically, for the ASA classification task, the micro F1-score improved significantly from 0.3246 (95% CI 0.3222-0.3246) with LLaMA 3.2 to 0.4544 (95% CI 0.4531-0.4548) with LLaMA 3.1. Similarly, for the mortality prediction task, the weighted area under the precision-recall curve (AUPRC) increased from 0.4311 (95% CI 0.4253-0.4369) with LLaMA 3.2 to 0.4959 (95% CI 0.4828-0.5090) with LLaMA 3.1 [ , ]. Other performance metrics, including precision, recall, and AUROC, also demonstrated consistent improvements and are detailed in . Based on these findings, LLaMA 3.1-8B was chosen as the primary model for this study.
Prompting Strategies
We used four prompting strategies to evaluate the model’s performance:
- No RAG, no few-shot: Zero-shot Q&A using original clinical notes without any prior examples or retrieval.
- No RAG, with few-shot: Incorporating one or more prior examples in the prompt but without RAG.
- RAG, no few-shot: Zero-shot Q&A with RAG using external knowledge retrieved from Miller’s Anesthesia, 2-Volume Set, 9th Edition (2020) [ ].
- RAG, with few-shot: Combining RAG with few-shot prompting, where external knowledge retrieved from Miller’s Anesthesia, 2-Volume Set, 9th Edition (2020) [ ] is used to augment the prompt.
Prompts included structured tags for role assignment, task specification, retrieved information summaries, and output formatting. Full prompt examples are shown in
.
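Assuming simple XML-like tags (the exact templates appear in the appendix referenced above), the four configurations can be sketched as variations of one prompt builder; all tag names, example notes, and labels below are hypothetical:

```python
# Illustrative assembly of the four prompting configurations:
# (no RAG, no few-shot), (few-shot only), (RAG only), (RAG + few-shot).

def build_prompt(note, retrieved=None, examples=None):
    parts = ["<role>You are an anesthesiology risk assessor.</role>",
             "<task>Predict 30-day postoperative mortality (yes/no).</task>"]
    if retrieved:  # RAG: ground the prompt in retrieved textbook passages
        parts.append("<context>" + " ".join(retrieved) + "</context>")
    if examples:   # few-shot: prepend solved examples before the query case
        for example_note, label in examples:
            parts.append(f"<example>{example_note} -> {label}</example>")
    parts.append(f"<input>{note}</input>")
    return "\n".join(parts)

zero_shot = build_prompt("82F, septic shock, emergent laparotomy")
full = build_prompt("82F, septic shock, emergent laparotomy",
                    retrieved=["sepsis sharply elevates perioperative mortality"],
                    examples=[("45M, elective hernia repair, healthy", "no")])
```

Keeping one builder with optional retrieval and example arguments mirrors how the ablation arms differ only in which prompt components are present.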
Experimental Setup
We incorporated experiments using ML models—including random forest [
], eXtreme Gradient Boosting (XGBoost) [ ], logistic regression [ ], and support vector machines (SVM) [ ]—which used structured data as input features, in contrast to the unstructured clinical notes used by the LLM-based RAG framework. To ensure a fair and robust comparison, we systematically optimized the hyperparameters of these ML models using grid search and randomized search, implemented via the Scikit-learn library [ ]. Grid search exhaustively evaluates all possible hyperparameter combinations within a predefined space, ensuring optimal model configurations, whereas randomized search provides a computationally efficient alternative by sampling hyperparameter values from a given distribution [ ]. Hyperparameter tuning was conducted using 5-fold cross-validation, optimizing for weighted AUPRC in the mortality prediction task and micro F1-score in the ASA classification task, ensuring that each model was evaluated under its best-performing configuration.
Experiments compared the LLaMA-RAG model configurations against ML baselines. Performance was benchmarked using the same training and validation cohorts, consistent preprocessing, and matched evaluation criteria. All experiments were executed on a V100 graphics processing unit with 4 central processing unit cores and 90 GB of memory.
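A minimal scikit-learn sketch of this tuning protocol follows, using a synthetic imbalanced dataset and logistic regression as a stand-in for the full baseline suite; the parameter grids shown are illustrative, not the study's actual search spaces:

```python
# Sketch of the baseline tuning protocol: 5-fold cross-validated grid
# and randomized searches, scored by average precision (AUPRC) as in
# the mortality task. Data here are synthetic stand-ins.

from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Roughly 10% positive class to mimic outcome imbalance
X, y = make_classification(n_samples=400, weights=[0.9], random_state=0)

grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.01, 0.1, 1, 10]},
                    scoring="average_precision", cv=5).fit(X, y)

rand = RandomizedSearchCV(LogisticRegression(max_iter=1000),
                          param_distributions={"C": loguniform(1e-3, 1e2)},
                          n_iter=8, scoring="average_precision",
                          cv=5, random_state=0).fit(X, y)
```

For the ASA task the same scaffold would use `scoring="f1_micro"`; each baseline is then evaluated on the held-out validation cohort with its best cross-validated configuration.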
Model Evaluation and Statistical Analysis
This study used a retrospective cohort design with a total sample size of 24,491 surgical cases, split into a training cohort (n=19,592) and a validation cohort (n=4899). No a priori sample size or power calculation was performed; instead, the study included all eligible cases within the data collection period (January 2016-July 2023).
Model performance was assessed using the F1-score for 30-day mortality prediction and micro F1-score for ASA classification. Additional metrics, including accuracy, precision, recall, and specificity, were reported. Statistical significance of model comparisons was assessed post hoc using bootstrapping with 1000 iterations, defining significance as a 95% CI for the performance difference excluding zero (P<.05).
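The bootstrap comparison described above can be sketched as follows; the labels and model predictions are synthetic placeholders, and `f1` is a small helper re-implementing the binary F1-score for the sketch:

```python
# Sketch of the post hoc comparison: resample paired predictions 1000
# times and check whether the 95% CI of the F1-score difference
# excludes zero (P < .05).

import random

def f1(y_true, y_pred):
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def bootstrap_diff_ci(y, pred_a, pred_b, n_boot=1000, seed=0):
    rng = random.Random(seed)
    n, diffs = len(y), []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(f1([y[i] for i in idx], [pred_a[i] for i in idx])
                     - f1([y[i] for i in idx], [pred_b[i] for i in idx]))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]

# Synthetic toy outcome vector and two toy models
y = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0] * 20
pred_a = y[:]                  # model A: perfect on this toy set
pred_b = [0] * len(y)          # model B: never predicts the event
lo, hi = bootstrap_diff_ci(y, pred_a, pred_b)
significant = lo > 0 or hi < 0  # CI excluding zero -> significant
```

Resampling cases (rather than predictions independently) preserves the pairing between the two models on each bootstrap replicate, which is what makes the difference CI a valid paired comparison.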
We applied Shapley Additive Explanations (SHAP) to our structured-data ML models to quantify and visualize each feature’s contribution to both 30-day postoperative mortality and ASA-PS classification, generating global importance bar plots and summary dot plots for interpretability [
].
Results
Outcome Distribution
In the final dataset of 24,491 unique patient records, a total of 520 (2.1%) patients experienced in-hospital 30-day postoperative mortality, including 424 (2.2%) deaths in the training (n=19,592) and 96 (2%) deaths in the validation cohort (n=4899). The ASA-PS class distributions were as follows: in the overall cohort (n=24,491), Class 1 (n=535, 2.2%), Class 2 (n=15,272, 62.4%), Class 3 (n=8024, 32.8%), Class 4 (n=606, 2.5%), and Class 5 (n=54, 0.22%); in the training cohort (n=19,592), Class 1 (n=433; 2.2%), Class 2 (n=12,207, 62%), Class 3 (n=6428, 33%), Class 4 (n=481, 2.4%), and Class 5 (n=43, 0.22%); in the validation cohort (n=4899), Class 1 (n=102, 2.1%), Class 2 (n=3065, 63%), Class 3 (n=1596, 33%), Class 4 (n=125, 2.6%), and Class 5 (n=11, 0.22%). These distributions were balanced across training and validation cohorts to support reliable model development and evaluation (
and ).
Feature | Training cohort (n=19,592) | Validation cohort (n=4899) | Overall cohort (n=24,491) |  |  |  |  |
Age (years), median (IQR) | 60 (47-69) | 60 (47-69) | 60 (47-69) | |||||
Sex (male), n (%) | 11,027 (56.3) | 2748 (56.1) | 13,775 (56.2) | |||||
Height (cm), median (IQR) | 162 (156-169) | 162 (156-168) | 162 (156-169) | |||||
Weight (kg), median (IQR) | 65 (56.9-75) | 65 (56.6-74) | 65 (56.9-75) | |||||
BMI (kg/m2), median (IQR) | 24 (22-27) | 24 (22-27) | 24 (22-27) | |||||
ASA-PSb, n (%) | ||||||||
1 | 433 (2.2) | 102 (2.1) | 535 (2.2) | |||||
2 | 12,207 (62) | 3065 (63) | 15,272 (62) | |||||
3 | 6428 (33) | 1596 (33) | 8024 (33) | |||||
4 | 481 (2.4) | 125 (2.6) | 606 (2.5) | |||||
5 | 43 (0.22) | 11 (0.22) | 54 (0.22) | |||||
ASA emergency, n (%) | 966 (4.9) | 234 (4.8) | 1200 (4.9) | |||||
30-day mortality, n (%) | 424 (2.2) | 96 (2) | 520 (2.1) | |||||
Emergency level of surgery, n (%) | ||||||||
Normal | 16,953 (87) | 4264 (87) | 21,217 (87) | |||||
Urgent | 2046 (10) | 485 (9.9) | 2531 (10) | |||||
Emergent | 543 (2.8) | 143 (2.9) | 686 (2.8) | |||||
Very emergent | 48 (0.24) | 7 (0.14) | 55 (0.22) | |||||
Surgery department, n (%) | ||||||||
Urology | 4858 (24.8) | 1210 (24.7) | 6068 (24.8) | |||||
Orthopedic | 3030 (15.5) | 729 (14.9) | 3759 (15.3) | |||||
General | 2497 (12.7) | 597 (12.2) | 3094 (12.6) | |||||
Cardiovascular | 2311 (11.8) | 593 (12.1) | 2904 (11.9) | |||||
Otolaryngology | 1637 (8.4) | 439 (9) | 2076 (8.5) | |||||
Gynecology | 1417 (7.2) | 371 (7.6) | 1788 (7.3) | |||||
Plastic surgery | 1012 (5.2) | 258 (5.3) | 1270 (5.2) | |||||
Neurosurgery | 784 (4) | 190 (3.9) | 974 (4) | |||||
Thoracic surgery | 632 (3.2) | 177 (3.6) | 809 (3.3) | |||||
Colorectal surgery | 390 (2) | 106 (2.2) | 496 (2) | |||||
Traumatology | 320 (1.6) | 63 (1.3) | 383 (1.6) | |||||
Others | 704 (3.6) | 166 (3.4) | 870 (3.6) | |||||
Comorbidity | ||||||||
Diabetes mellitus | 5016 (25.6) | 1222 (24.9) | 6238 (25.5) | |||||
Hyperlipidemia | 3613 (18.4) | 921 (18.8) | 4534 (18.5) | |||||
Hypertension | 9110 (46.5) | 2205 (45) | 11,315 (46.2) | |||||
Prior cerebrovascular accident | 1469 (7.5) | 369 (7.5) | 1838 (7.5) | |||||
Cardiac disease | 5008 (25.6) | 1271 (25.9) | 6279 (25.6) | |||||
Chronic obstructive pulmonary disease | 673 (3.4) | 150 (3.1) | 823 (3.4) | |||||
Asthma | 821 (4.2) | 187 (3.8) | 1008 (4.1) | |||||
Hepatic disease | 2823 (14.4) | 681 (13.9) | 3504 (14.3) | |||||
Renal disease | 3912 (20) | 916 (18.7) | 4828 (19.7) | |||||
Bleeding disorder | 4770 (24.3) | 1166 (23.8) | 5936 (24.2) | |||||
Prior major operations | 17,039 (87) | 4224 (86.2) | 21,263 (86.8) | |||||
Smoking | 5297 (27) | 1314 (26.8) | 6611 (27) | |||||
Drug allergy | 3683 (18.8) | 877 (17.9) | 4560 (18.6) | |||||
Consciousness | 17,903 (91.4) | 4507 (92) | 22,410 (91.5) | |||||
Anesthesia type, n (%) | ||||||||
General | 17,539 (89.5) | 4437 (90.6) | 21,976 (89.7) | |||||
Neuraxial | 2053 (10.5) | 462 (9.4) | 2515 (10.3) | |||||
Preoperative location, n (%) | ||||||||
Ward | 11,911 (60.8) | 2998 (61.2) | 14,909 (60.9) | |||||
Outpatient | 4937 (25.2) | 1228 (25.1) | 6165 (25.2) | |||||
Emergency department | 1009 (5.2) | 222 (4.5) | 1231 (5) | |||||
Intensive care unit | 1678 (8.6) | 435 (8.9) | 2123 (8.7) |
aCategorical variables are represented as frequency (%) and continuous variables are represented as the median (IQR).
bASA-PS: American Society of Anesthesiologists physical status.
Preliminary Hyperparameter Tuning for the LLaMA + RAG Model
Before evaluating the performance of our proposed LLaMA + RAG model against ML approaches, we conducted a series of preliminary experiments to determine the optimal configuration for embedding models, retrieval settings, and hyperparameters. Specifically, we explored:
- Embedding models: MedEmbed [ ] and PubMedBERT [ ]
- Chunk sizes: 150, 200, 250, 300, and 350
- Retrieval top-k values: 6, 8, and 12
- Temperature and top-p settings: (0.001, 0.001), (0.1, 0.1), and (1, 0.5)
The goal of these experiments was to systematically identify the most effective parameter combinations for mortality prediction and ASA classification, ensuring a fair comparison between LLaMA + RAG and traditional ML models.
Our findings revealed that:
- For mortality prediction, the best-performing configuration was MedEmbed, chunk size=250, temperature=1, top_p=0.5, retrieval top_k=8.
- For ASA classification, the optimal setup was MedEmbed, chunk size=250, temperature=0.001, top_p=0.001, retrieval top_k=8.
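The sweep over these configurations can be sketched with a simple enumeration; `evaluate` below is a placeholder scoring function standing in for a full LLaMA + RAG pipeline run, not the study's actual scoring code:

```python
# Sketch of the preliminary configuration sweep: enumerate every
# combination of embedding model, chunk size, retrieval top-k, and
# sampling settings, and keep the best-scoring configuration.

from itertools import product

embeddings = ["MedEmbed", "PubMedBERT"]
chunk_sizes = [150, 200, 250, 300, 350]
top_ks = [6, 8, 12]
samplings = [(0.001, 0.001), (0.1, 0.1), (1, 0.5)]

def evaluate(config):
    # Placeholder score; in the study this would be the validation
    # metric (weighted AUPRC or micro F1) of one full pipeline run.
    emb, chunk, k, (temp, top_p) = config
    return (emb == "MedEmbed") + (chunk == 250) + (k == 8)

best = max(product(embeddings, chunk_sizes, top_ks, samplings), key=evaluate)
```

Because each configuration requires a full inference pass over the validation cohort, such sweeps are typically the dominant cost of LLM + RAG tuning, which motivates the small, discrete search space used here.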
The complete results of these parameter selection experiments, including comparisons across different settings, are provided in
.
With these optimal settings, we proceeded to evaluate our LLaMA + RAG model against ML algorithms, including random forest, XGBoost, SVM, and logistic regression. The following sections present a comparative analysis of mortality prediction and ASA-PS classification, demonstrating the advantages of integrating RAG with LLMs for clinical risk prediction.
Performance Comparison
To evaluate the effectiveness of our proposed LLaMA-based method using unstructured clinical notes, we compared its performance against ML models trained on structured data. The comparison focused on mortality prediction and ASA-PS classification, using evaluation metrics including F1-score, accuracy, precision, recall, and specificity. The detailed results are presented in
for mortality prediction and for ASA-PS classification. Additionally, a more comprehensive set of performance metrics is provided in Tables S1 and S2 in to offer deeper insights into model performance across various evaluation perspectives.
Method | F1-score (95% CI) | Accuracy (95% CI) | Precision (95% CI) | Recall (95% CI) | Specificity (95% CI) |  |
LLaMA-RAG model | 0.4663c (0.4654-0.4672) | 0.9580 (0.9555-0.9605) | 0.3103 (0.2993-0.3213) | 0.9375c (0.9341-0.9409) | 0.9584 (0.9562-0.9606) | ||
Without few-shot, with RAG | 0.2369 (0.2341-0.2397) | 0.9106 (0.9087-0.9125) | 0.1423 (0.1281-0.1565) | 0.7083 (0.6937-0.7229) | 0.9146 (0.9139-0.9153) | ||
With few-shot, without RAG | 0.0879 (0.0717-0.1041) | 0.6440 (0.6428-0.6452) | 0.0463 (0.0410-0.0516) | 0.8750 (0.8630-0.8870) | 0.6394 (0.6387-0.6401) | ||
Without few-shot, without RAG | 0.0436 (0.0292-0.0580) | 0.5152 (0.5149-0.5155) | 0.0226 (0.0114-0.0338) | 0.5625 (0.5590-0.5660) | 0.5143 (0.5115-0.5171) | ||
Random forest with grid search | 0.3583 (0.3391-0.3775) | 0.9510 (0.9481-0.9539) | 0.2410 (0.2218-0.2602) | 0.6979 (0.6979-0.6979) | 0.9561 (0.9540-0.9582) | ||
Random forest with randomized search | 0.3953 (0.3791-0.4115) | 0.9788d (0.9766-0.9810) | 0.4474d (0.4360-0.4588) | 0.3542 (0.3259-0.3825) | 0.9913d (0.9901-0.9925) | ||
SVMe with grid search | 0.2474 (0.2209-0.2739) | 0.8981 (0.8966-0.8996) | 0.1446 (0.1385-0.1507) | 0.8542d (0.8437-0.8647) | 0.8991 (0.8988-0.8994) | ||
SVM with randomized search | 0.2474 (0.2275-0.2673) | 0.8981 (0.8956-0.9006) | 0.1446 (0.1165-0.1727) | 0.8542 (0.8478-0.8606) | 0.8991 (0.8975-0.9007) | ||
XGBoostf with grid search | 0.2992 (0.2879-0.3105) | 0.9245 (0.9235-0.9255) | 0.1829 (0.1677-0.1981) | 0.8229 (0.7965-0.8493) | 0.9265 (0.9250-0.9280) | ||
XGBoost with randomized search | 0.4459d (0.4176-0.4742) | 0.9822c (0.9817-0.9827) | 0.5738c (0.5571-0.5905) | 0.3646 (0.3379-0.3913) | 0.9946c (0.9932-0.9960) | ||
Logistic Regression with grid search | 0.2648 (0.2492-0.2804) | 0.9139 (0.9128-0.9150) | 0.1590 (0.1418-0.1762) | 0.7917 (0.7882-0.7952) | 0.9163 (0.9158-0.9168) | ||
Logistic Regression with randomized search | 0.2720 (0.2647-0.2793) | 0.9104 (0.9076-0.9132) | 0.1617 (0.1402-0.1832) | 0.8542d (0.8538-0.8546) | 0.9115 (0.9114-0.9116) |
aML: machine learning.
bRAG: retrieval-augmented generation.
cIndicates best performance in each column.
dIndicates second-best performance.
eSVM: support vector machine.
fXGBoost: extreme gradient boosting.
Method | Micro F1-score (95% CI) | Accuracy (95% CI) | Precision (95% CI) | Recall (95% CI) | Specificity (95% CI) | ||||||
LLaMA-RAG model | 0.8409d (0.8238-0.8551) | 0.7409 (0.7381-0.7435) | 0.8125 (0.7832-0.8418) | 0.7409 (0.7370-0.7448) | 0.8836 (0.8834-0.8838) | ||||||
Without few-shot, with RAG | 0.6546 (0.6430-0.6796) | 0.6546 (0.6534-0.6573) | 0.8113 (0.7986-0.8240) | 0.6546 (0.6545-0.6547) | 0.9065d (0.9039-0.9091) | ||||||
With few-shot, without RAG | 0.6340 (0.6157-0.6535) | 0.6340 (0.6328-0.6350) | 0.7501 (0.7380-0.7622) | 0.6340 (0.6087-0.6593) | 0.8076 (0.8075-0.8077) | ||||||
Without few-shot, without RAG | 0.4238 (0.3952-0.4490) | 0.4239 (0.4225-0.4251) | 0.6834 (0.6659-0.7009) | 0.4239 (0.3975-0.4503) | 0.7199 (0.7185-0.7213) | ||||||
SVMe with grid search | 0.7697 (0.7637-0.7697) | 0.7696 (0.7675-0.7710) | 0.8054 (0.7930-0.8178) | 0.7697 (0.7590-0.7804) | 0.8450 (0.8449-0.8451) | ||||||
SVM with randomized search | 0.7250 (0.7068-0.7364) | 0.7250 (0.7238-0.7268) | 0.8015 (0.7948-0.8082) | 0.7250 (0.6994-0.7506) | 0.8569f (0.8566-0.8572) | ||||||
XGBoostg with grid search | 0.8273f (0.8209-0.8498) | 0.8273d (0.8253-0.8286) | 0.8178d (0.8136-0.8220) | 0.8273d (0.8065-0.8481) | 0.8050 (0.8044-0.8056) | ||||||
XGBoost with randomized search | 0.8247 (0.8127-0.8401) | 0.8247f (0.8221-0.8272) | 0.8173f (0.7999-0.8347) | 0.8247f (0.8241-0.8253) | 0.8011 (0.8011-0.8011) | ||||||
Logistic regression with grid search | 0.7940 (0.7671-0.7950) | 0.7940 (0.7930-0.7953) | 0.8063 (0.7803-0.8323) | 0.7940 (0.7899-0.7981) | 0.8314 (0.8291-0.8337) | ||||||
Logistic regression with randomized search | 0.7867 (0.7700-0.8062) | 0.7867 (0.7848-0.7874) | 0.8074 (0.7960-0.8188) | 0.7867 (0.7755-0.7979) | 0.8369 (0.8360-0.8378) |
aASA: American Society of Anesthesiologists.
bML: machine learning.
cRAG: retrieval-augmented generation.
dIndicates best performance in each column.
eSVM: support vector machine.
fIndicates second-best performance.
gXGBoost: extreme gradient boosting.
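The micro F1-scores in the table above pool true positives, false positives, and false negatives across all classes before computing a single precision-recall pair. A minimal Python sketch on toy labels (illustrative data, not the study's):

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool per-class TP/FP/FN over all classes,
    then compute one precision/recall pair. When every case receives
    exactly one predicted class, pooled FP equals pooled FN, so the
    result coincides with plain accuracy."""
    classes = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for c in classes:
        tp += sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp += sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn += sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy ASA-like labels (classes 1-5)
y_true = [2, 2, 3, 3, 4, 5, 2, 3]
y_pred = [2, 3, 3, 3, 4, 4, 2, 2]
print(micro_f1(y_true, y_pred))  # 0.625
```

Micro averaging weights every case equally, so the dominant ASA class II drives the score more than rare classes do; the per-class true positive rates discussed later complement it for rare-case performance.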
Mortality Prediction Task
Our LLaMA-RAG model achieved the highest F1-score of 0.4663 (95% CI 0.4654-0.4672) and recall of 0.9375 (95% CI 0.9341-0.9409), indicating its strong capability in correctly identifying mortality cases (
). Among ML models, XGBoost with randomized search attained the second-highest F1-score of 0.4459 (95% CI 0.4176-0.4742), while SVM with randomized search obtained the second-highest recall of 0.8542 (95% CI 0.8478-0.8606). In ablation experiments, the F1-score dropped to 0.2369 (95% CI 0.2341-0.2397) without few-shot prompting, 0.0879 (95% CI 0.0717-0.1041) without RAG, and 0.0436 (95% CI 0.0292-0.0580) without either few-shot prompting or RAG.

Attention heatmaps reveal that, for mortality prediction, the LLaMA-RAG model consistently concentrates its attention on the “present illness” section, often supplemented by the “discharge diagnosis” field, underscoring acute clinical trajectory as the primary driver of mortality risk estimation ( and ). SHAP analysis of the XGBoost model revealed that hemoglobin was the single most influential predictor, with lower values markedly increasing predicted risk, followed by intensive care unit, tachycardia, and serum sodium, all of which exhibited the highest mean absolute SHAP values among the top 20 features ( ).

ASA Classification Task
For ASA classification, our LLaMA-RAG model outperformed ML models, achieving the highest micro F1-score of 0.8409 (95% CI 0.8238-0.8551) and specificity of 0.8836 (95% CI 0.8834-0.8838), detailed in
. Among traditional methods, XGBoost with grid search achieved the second-highest micro F1-score of 0.8273 (95% CI 0.8209-0.8498), whereas SVM with randomized search obtained the second-highest specificity of 0.8569 (95% CI 0.8566-0.8572). In ablation experiments, the micro F1-score dropped to 0.6546 (95% CI 0.6430-0.6796) without few-shot prompting, 0.6340 (95% CI 0.6157-0.6535) without RAG, and 0.4238 (95% CI 0.3952-0.4490) without either few-shot prompting or RAG.

Attention heatmaps for ASA-PS classification show that the LLaMA-RAG model consistently focuses its attention on the “present illness” section, often with additional peaks on “discharge diagnosis” and “discharge treatment,” indicating that the model leverages the patient’s most recent clinical trajectory and outcomes when assigning ASA-PS ( and ). SHAP analysis of the XGBoost model revealed that age was the single most influential predictor, with lower values markedly decreasing the predicted class, followed by elective surgery, cardiac disease, and cardiovascular surgery, all of which exhibited the highest mean absolute SHAP values among the top 20 features ( ).

Discussion
Principal Results
This study demonstrates the effectiveness of integrating an LLM with RAG for clinical risk prediction, achieving superior performance in both mortality prediction and ASA classification.
In the mortality prediction task, our LLaMA-RAG model obtained the highest AUPRC of 0.6536 (95% CI 0.6479-0.6593), AUROC of 0.9570 (95% CI 0.9543-0.9597), and F1-score of 0.4663 (95% CI 0.4654-0.4672), significantly outperforming ML models, including random forest, SVM, XGBoost, and logistic regression in AUPRC (
). While our AUROC score was significantly higher than that of SVM and logistic regression, it remained comparable to that of random forest and XGBoost ( ). For ASA multiclassification, our method achieved the highest micro F1-score of 0.8409 (95% CI 0.8238-0.8551), outperforming all other ML models ( ). In ablation experiments for both tasks, removal of either RAG or few-shot prompting led to substantial declines in the F1-score ( and ), underscoring the critical contributions of both components to the model’s discriminative performance.

However, while our approach demonstrated advantages in AUPRC, AUROC, and recall, it did not surpass all ML baselines in terms of accuracy and precision. A closer examination of confusion matrices (Tables S1 and S2 in ) provides insight into this result: our model exhibited superior sensitivity in identifying rare cases, such as ASA class 5 patients and postoperative mortality. Even when misclassifications occurred, they were predominantly within adjacent ASA classes, which may reflect the inherent subjectivity in ASA grading ( 0). This suggests that our method can support clinically meaningful risk stratification.

Comparison With Prior Works
Our method also outperformed the approach of Chung et al [ ], which used ChatGPT-4 Turbo for ASA classification: our micro F1-score of 0.8409 (95% CI 0.8238-0.8551) exceeded their best result of 0.54 (95% CI 0.49-0.60). This difference highlights the limitations of a general-domain LLM without structured retrieval augmentation. While GPT-4 Turbo benefits from extensive pretraining on diverse text sources, its lack of domain-specific retrieval may restrict its ability to accurately capture complex clinical relationships. By contrast, our retrieval-enhanced model systematically integrates domain knowledge, leading to a more interpretable and clinically reliable risk stratification process [ , ].

When compared to prior deep learning–based approaches, our method significantly outperformed the BERT–deep neural network (DNN) model of Chen et al [ ] in mortality prediction. Specifically, our LLM with RAG integration achieved a substantially higher macro F1-score (0.7222, 95% CI 0.6998-0.7446) than their model (0.307, 95% CI 0.269-0.342), indicating a superior balance between precision and recall across both mortality and survival classes. Unlike the BERT-DNN framework, which relies solely on parametric memory for text extraction, our approach dynamically retrieves specialized anesthesia knowledge from textbooks [ ]. This retrieval mechanism not only enhances contextual understanding but also mitigates the risk of hallucinated responses common in LLMs [ , ]. By leveraging domain-specific knowledge, our method achieves more reliable and clinically relevant predictions.

In addition to performance comparisons, it is important to note that our study leveraged a substantially larger and more diverse dataset (n=24,491) compared to Chung et al [ ], whose task-specific datasets were limited to 1000 cases each. This broader dataset allowed us to capture a wider range of surgical procedures, patient backgrounds, and rare outcomes, enhancing the generalizability of our model. Furthermore, unlike Chung et al [ ], who focused primarily on few-shot prompting strategies without retrieval augmentation, our integration of RAG provided domain-specific knowledge, significantly improving predictive accuracy and interpretability. Although Chen et al [ ] trained a BERT-DNN model on 121,313 cases, their approach relied on structured inputs and brief diagnostic and procedural snippets (less than 50 tokens) without RAG. By contrast, our framework leveraged both structured features and rich, LLM-summarized discharge narratives to evaluate LLaMA-RAG for 30-day mortality and ASA-PS prediction.

Key Factors Influencing Model Performance
Our findings highlight the critical role of RAG in improving LLM-driven clinical predictions. Traditional LLMs rely solely on their pretraining corpus, which limits their ability to provide up-to-date and domain-specific medical insights. By dynamically retrieving relevant knowledge from trusted medical references, RAG ensures that predictions remain factually grounded and clinically relevant [
, ]. This advantage was particularly evident in mortality prediction, where the incorporation of RAG helped mitigate class imbalance issues, leading to improved AUPRC and predictive performance ( ).

To further investigate the effectiveness of RAG-enhanced LLMs, we systematically evaluated 3 key factors influencing model performance: embedding model selection, chunk size configuration, and few-shot prompting strategies (Table S3 in and 1).

Impact of Embedding Model Selection
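Whichever embedding model is chosen, the retrieval step itself ranks candidate chunks by cosine similarity between the query vector and each chunk vector. A minimal sketch with toy 3-dimensional vectors standing in for real MedEmbed or PubMedBERT embeddings (which have hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_chunks(query_vec, chunk_vecs):
    """Indices of chunks sorted by cosine similarity to the query,
    most similar first -- the ranking step any embedding model feeds."""
    return sorted(range(len(chunk_vecs)),
                  key=lambda i: cosine(query_vec, chunk_vecs[i]),
                  reverse=True)

# Toy vectors standing in for embedded textbook chunks
query = [0.9, 0.1, 0.0]
chunks = [[0.1, 0.9, 0.0], [0.8, 0.2, 0.1], [0.0, 0.0, 1.0]]
print(rank_chunks(query, chunks))  # [1, 0, 2]
```

A domain-tuned embedding model changes only the vectors, not this ranking step, which is why embedding choice can shift retrieval quality without altering the rest of the pipeline.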
We first compared MedEmbed and PubMedBERT under identical hyperparameter settings to assess their impact on model performance. Our results demonstrated that MedEmbed consistently outperformed PubMedBERT, achieving the highest AUPRC of 0.4733 for mortality prediction and the highest micro F1-score of 0.6307 for ASA classification (Figures S1 and S2 in
). These findings suggest that embedding models optimized for medical text representations can significantly enhance LLM-based clinical predictions.

Effect of Chunk Size on Model Performance
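Fixed-length chunking, whose size is tuned in this section, can be sketched as follows; word-based splitting is used here as a simple stand-in for the tokenizer actually used when embedding the textbook:

```python
def fixed_chunks(text, chunk_size=250, overlap=0):
    """Split text into fixed-length chunks of `chunk_size` words,
    optionally overlapping consecutive chunks by `overlap` words."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), step)]

# A 600-word toy document
doc = " ".join(f"w{i}" for i in range(600))
print([len(c.split()) for c in fixed_chunks(doc, chunk_size=250)])   # [250, 250, 100]
print([len(c.split()) for c in fixed_chunks(doc, 250, overlap=50)])  # [250, 250, 200]
```

Smaller chunks give more precise retrieval targets but fragment context, while larger chunks preserve context at the cost of diluting relevance, which is the trade-off the chunk-size experiments probe.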
Next, we examined the influence of chunk size selection in retrieval-based augmentation. Our experiments revealed that a chunk size of 250 yielded the best performance across both tasks. For mortality prediction, the highest AUPRC (0.4733) was obtained at retrieval top_k=12, whereas for ASA classification, the best micro F1-score (0.6307) was achieved at retrieval top_k=6 (Figures S3 and S4 in
). These results emphasize that optimizing the chunking strategy when embedding clinical texts plays a crucial role in maximizing retrieval effectiveness and overall predictive accuracy.

Influence of Few-Shot Learning on Performance
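Few-shot prompting of the kind evaluated here amounts to prepending labeled examples to the unlabeled case. A minimal sketch; the field names and template wording are illustrative, not the study's exact prompt:

```python
def build_prompt(instruction, examples, query_note):
    """Assemble a few-shot prompt: task instruction, k labeled
    examples, then the new case left unlabeled for the model."""
    parts = [instruction]
    for note, label in examples:
        parts.append(f"Preoperative note: {note}\nASA class: {label}")
    parts.append(f"Preoperative note: {query_note}\nASA class:")
    return "\n\n".join(parts)

# Two illustrative labeled examples (not study data)
examples = [
    ("Healthy 30-year-old scheduled for elective hernia repair.", "I"),
    ("Severe COPD on home oxygen, emergency laparotomy.", "IV"),
]
prompt = build_prompt("Assign the ASA physical status class.",
                      examples,
                      "Controlled hypertension, elective knee arthroscopy.")
print(prompt.endswith("ASA class:"))  # True
```

Increasing the shot count (e.g., 5 vs 9 examples, as compared below) simply extends the `examples` list, trading prompt length for more in-context demonstrations.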
We further assessed the impact of few-shot prompting, which provided additional performance gains. For mortality prediction, incorporating 9-shot examples resulted in the highest AUPRC of 0.6536, compared to 0.6176 with 5-shot prompting. In ASA classification, few-shot learning significantly improved model performance, yielding a micro F1-score of 0.8409, a substantial increase over the 0.5938 obtained without few-shot prompting (Figures S5 and S6 in
). These findings underscore the importance of carefully curating few-shot examples to enhance the generalizability of LLM-based clinical models.

Implications for Clinical Artificial Intelligence
Collectively, our findings highlight the benefits of integrating RAG with LLMs, selecting an effective embedding model, optimizing chunk size, and leveraging few-shot prompting to improve clinical prediction performance. The ability to dynamically retrieve domain-specific knowledge makes RAG-enhanced LLMs particularly well-suited for real-world medical decision-making, where contextual interpretation of patient characteristics and clinical guidelines is essential (
2). This capability is particularly advantageous in medical domains where well-established guidelines and textbooks serve as primary decision-making references. Unlike conventional LLMs that rely solely on pretraining data, RAG dynamically retrieves up-to-date clinical knowledge, ensuring that predictions and recommendations remain aligned with current best practices. By mitigating the risks of outdated information and hallucinated responses, RAG enhances the reliability and interpretability of artificial intelligence (AI)–driven decision support.

Additionally, our confusion matrices suggest that while our method may not always outperform traditional ML models in precision and accuracy, it excels at identifying rare but clinically significant cases, such as ASA class 5 patients and postoperative mortality events ( 3). For ASA class 5, the full LLaMA-RAG model achieved a true positive rate of 54.5%, compared with 54.5% without RAG, 27.3% without few-shot prompting, and 9.1% without either component; traditional ML models identified no class 5 cases (0% true positives). For 30-day mortality, LLaMA-RAG reached a true positive rate of 94.8%, versus 87.5% without RAG, 79.2% without few-shot prompting, and 56.2% without either, while ML baselines ranged from 35.4% to 85.4%. These results demonstrate that both RAG and few-shot prompting incrementally enhance the LLaMA-RAG model’s ability to detect rare but clinically significant outcomes. This highlights the trade-off between precision and recall in clinical decision support systems, emphasizing the need to balance risk prediction sensitivity and specificity in practice.

Moreover, these results underscore the transformative potential of LLMs integrated with RAG architectures in clinical informatics. By improving both discrimination and interpretability, RAG-enhanced LLMs provide a more reliable and adaptable framework for clinical AI applications, bridging the gap between machine intelligence and human expertise in medical practice.
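The true positive rates quoted above reduce to per-class sensitivity computed from confusion-matrix counts. A minimal sketch; the counts are illustrative values consistent with the reported 54.5% rate (e.g., 6 of 11 class 5 cases), not the study's actual tallies:

```python
def true_positive_rate(tp, fn):
    """Per-class sensitivity: of all actual positives for the class,
    the fraction the model identified, i.e., TP / (TP + FN)."""
    return tp / (tp + fn)

# Illustrative counts: 6 of 11 ASA class 5 patients identified
print(f"{true_positive_rate(6, 5):.1%}")  # 54.5%
```

Because a rare class contributes few cases, aggregate metrics such as accuracy can stay high even when this per-class rate is 0%, which is why it is reported separately for ASA class 5 and mortality.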
Limitations
This study used a fixed-length chunking strategy for RAG, which may misalign with the semantic structure of clinical text. Semantic chunking, leveraging meaning-based segmentation, could enhance retrieval relevance [ , ]. Adaptive retrieval, incorporating hybrid query-based methods, may further optimize context selection [ ]. Additionally, Adaptive RAG, which adjusts retrieval depth based on query complexity, remains unexplored in our setting but has shown promise in dynamically optimizing information retrieval [ ]. Future studies should systematically evaluate these techniques to improve retrieval precision, computational efficiency, and applicability in clinical AI systems.

Conclusions
This study demonstrates the effectiveness of integrating LLMs with RAG for clinical risk prediction. Our LLaMA-RAG model demonstrated superior performance compared to both ML models and previously published deep learning approaches in mortality prediction and ASA classification [
, ], especially for rare high-risk patients, by grounding its analysis in reliable medical knowledge to deliver more accurate, interpretable, and clinically relevant predictions. Through systematic evaluations of embedding model selection, chunk size optimization, and few-shot prompting, we identified key factors influencing predictive accuracy. RAG-enhanced LLMs provide a promising pathway toward more interpretable, context-aware, and accurate clinical decision support systems.

Acknowledgments
This study was supported by grants from Far Eastern Memorial Hospital, Taiwan (grants FEMH-2024-C-052 and FEMH-2025-C-017). The sponsors had no role in the study design, data collection and analysis, publication decision, or manuscript drafting. The authors attest that there was no use of generative artificial intelligence technology in the generation of text, figures, or other informational content of this manuscript.
Data Availability
The datasets generated or analyzed during this study are not publicly available due to Institutional Review Board restrictions that prohibit open sharing of patient-level data to protect participant privacy but are available from the corresponding author upon reasonable request.
Authors' Contributions
YHC helped with software, formal analysis, investigation, visualization, data curation, writing the original draft, and writing review and editing. SJR helped with supervision, writing review, and editing. PC helped with conceptualization, methodology, investigation, resources, validation, writing review and editing, and funding acquisition.
Conflicts of Interest
None declared.
Continuous feature limits to define outliers.
XLSX File (Microsoft Excel File), 9 KB

Example prompt template used for summarizing patient records with multiple prior discharge notes, illustrating role assignment, task specification, and structured output formatting for use in few-shot learning.

PDF File (Adobe PDF File), 60 KB

Comprehensive performance metrics for all models across mortality prediction and ASA classification tasks, including accuracy, precision, recall, F1-scores, AUROC, AUPRC, and confusion matrices (Tables S1 and S2), as well as comparative results evaluating the effects of embedding models, chunk sizes, and few-shot prompting on model performance (Table S3). ASA: American Society of Anesthesiologists; AUROC: area under the receiver operating characteristic curve; AUPRC: area under the precision-recall curve.

XLSX File (Microsoft Excel File), 32 KB

Summary of laboratory and vital sign values.

XLSX File (Microsoft Excel File), 10 KB

Comparative analysis of embedding models, chunk sizes, and few-shot prompting strategies in retrieval-augmented clinical risk prediction tasks. Results include their impact on AUPRC, F1-scores, and recall for both mortality prediction and ASA classification. ASA: American Society of Anesthesiologists; AUPRC: area under the precision-recall curve.

PDF File (Adobe PDF File), 563 KB

Normalized token-level attention weights visualized for example cases in 30-day postoperative mortality prediction and ASA-PS classification tasks. ASA-PS: American Society of Anesthesiologists physical status.

PDF File (Adobe PDF File), 420 KB

Representative preoperative assessment and discharge summary excerpts used to generate the attention heatmaps in Multimedia Appendix 11.

XLSX File (Microsoft Excel File), 11 KB

Global explainability plots for XGBoost models, including (A) mean absolute SHAP value bar chart and (B) SHAP summary dot plot, for both mortality prediction and ASA-PS classification. ASA-PS: American Society of Anesthesiologists physical status; SHAP: Shapley Additive Explanations; XGBoost: Extreme Gradient Boosting.

PDF File (Adobe PDF File), 931 KB

Statistical significance testing of model performance differences for mortality prediction (AUPRC) and ASA classification (micro F1-score), based on 1,000 test set resamples and pairwise comparisons with the proposed method. ASA: American Society of Anesthesiologists; AUPRC: area under the precision-recall curve.

PDF File (Adobe PDF File), 256 KB

Misclassification analysis for ASA class 5 patients, comparing adjacent and distant errors across models. Results highlight the rate and severity of ASA classification errors, particularly in high-risk cases. ASA: American Society of Anesthesiologists.

XLSX File (Microsoft Excel File), 85 KB

Confusion matrices for mortality prediction and ASA classification across various model configurations, including comparisons of machine learning baselines, LLaMA variants, RAG integration, and few-shot prompting strategies. ASA: American Society of Anesthesiologists; RAG: retrieval-augmented generation.

PDF File (Adobe PDF File), 1825 KB

Sample LLaMA-RAG model outputs for ASA classification and mortality prediction, illustrating how the model generates clinically grounded justifications based on patient history. ASA: American Society of Anesthesiologists; RAG: retrieval-augmented generation.

XLSX File (Microsoft Excel File), 14 KB

True positive rates for ASA class 5 and postoperative mortality across baseline machine learning models, illustrating detection performance for rare but clinically critical cases. ASA: American Society of Anesthesiologists.

XLSX File (Microsoft Excel File), 84 KB

References
- Weiser TG, Regenbogen SE, Thompson KD, Haynes AB, Lipsitz SR, Berry WR, et al. An estimation of the global volume of surgery: a modelling strategy based on available data. Lancet. 2008;372(9633):139-144. [CrossRef] [Medline]
- Pearse RM, Moreno RP, Bauer P, Pelosi P, Metnitz P, Spies C, et al. Mortality after surgery in Europe: a 7 day cohort study. Lancet. 2012;380(9847):1059-1065. [FREE Full text] [CrossRef] [Medline]
- Bronsert MR, Lambert-Kerzner A, Henderson WG, Hammermeister KE, Atuanya C, Aasen DM, et al. The value of the "Surgical Risk Preoperative Assessment System" (SURPAS) in preoperative consultation for elective surgery: a pilot study. Patient Saf Surg. 2020;14:31. [FREE Full text] [CrossRef] [Medline]
- Hughes MJ, McNally S, Wigmore SJ, Deans D, Skipworth RJ. Preoperative risk stratification: identifying modifiable risks for surgical patients. In: Perioperative Medicine. cham. Springer; 2021.
- Firde M, Yetneberk T. Preoperative investigation practices for elective surgical patients: clinical audit. BMC Anesthesiol. 2024;24(1):184. [FREE Full text] [CrossRef] [Medline]
- Khuri S. The NSQIP: a new frontier in surgery. Surgery. 2005;138(5):837-843. [CrossRef] [Medline]
- Mayhew D, Mendonca V, Murthy BVS. A review of ASA physical status—historical perspectives and modern developments. Anaesthesia. 2019;74(3):373-379. [FREE Full text] [CrossRef] [Medline]
- Shickel B, Tighe PJ, Bihorac A, Rashidi P. Deep EHR: A survey of recent advances in deep learning techniques for electronic health record (EHR) analysis. IEEE J Biomed Health Inform. 2018;22(5):1589-1604. [FREE Full text] [CrossRef] [Medline]
- Corey KM, Kashyap S, Lorenzi E, Lagoo-Deenadayalan SA, Heller K, Whalen K, et al. Development and validation of machine learning models to identify high-risk surgical patients using automatically curated electronic health record data (Pythia): a retrospective, single-site study. PLoS Med. 2018;15(11):e1002701. [FREE Full text] [CrossRef] [Medline]
- Rajkomar A, Oren E, Chen K, Dai AM, Hajaj N, Hardt M, et al. Scalable and accurate deep learning with electronic health records. NPJ Digital Med. 2018;1:18. [FREE Full text] [CrossRef] [Medline]
- Soguero-Ruiz C, Hindberg K, Mora-Jiménez I, Rojo-Álvarez JL, Skrøvseth SO, Godtliebsen F, et al. Predicting colorectal surgical complications using heterogeneous clinical data and kernel methods. J Biomed Inform. 2016;61:87-96. [FREE Full text] [CrossRef] [Medline]
- Weller GB, Lovely J, Larson DW, Earnshaw BA, Huebner M. Leveraging electronic health records for predictive modeling of post-surgical complications. Stat Methods Med Res. 2018;27(11):3271-3285. [CrossRef] [Medline]
- Ren Y, Loftus TJ, Datta S, Ruppert MM, Guan Z, Miao S, et al. Performance of a machine learning algorithm using electronic health record data to predict postoperative complications and report on a mobile platform. JAMA Netw Open. 2022;5(5):e2211973. [CrossRef] [Medline]
- Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. Association for Computational Linguistics; 2019. Presented at: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); May 09, 2025:4171-4186; Minneapolis, Minnesota. [CrossRef]
- Raza S, Schwartz B. Entity and relation extraction from clinical case reports of COVID-19: a natural language processing approach. BMC Med Inform Decis Mak. 2023;23(1):20. [FREE Full text] [CrossRef] [Medline]
- Su P, Vijay-Shanker K. Investigation of improving the pre-training and fine-tuning of BERT model for biomedical relation extraction. BMC Bioinformatics. 2022;23(1):120. [FREE Full text] [CrossRef] [Medline]
- Chung P, Fong CT, Walters AM, Aghaeepour N, Yetisgen M, O'Reilly-Shah VN. Large language model capabilities in perioperative risk prediction and prognostication. JAMA Surg. 2024;159(8):928-937. [CrossRef] [Medline]
- Grattafiori A, Dubey A, Jauhri A, Pandey A, Kadian A, Al-Dahle A, et al. The Llama 3 herd of models. ArXiv. Preprint posted online on July 31, 2024. 2024. [FREE Full text]
- Wang H, Gao C, Dantona C, Hull B, Sun J. DRG-LLaMA : tuning LLaMA model to predict diagnosis-related group for hospitalized patients. NPJ Digital Med. 2024;7(1):16. [FREE Full text] [CrossRef] [Medline]
- Pressman SM, Borna S, Gomez-Cabello CA, Haider SA, Haider CR, Forte AJ. Clinical and surgical applications of large language models: a systematic review. J Clin Med. 2024;13(11):3041. [FREE Full text] [CrossRef] [Medline]
- Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv Neural Inf Process Syst. 2020;33:9549-9571. [FREE Full text]
- Saklad M. Grading of patients for surgical procedures. Anesthesiology. 1941;2(3):281-284. [CrossRef]
- Horvath B, Kloesel B, Todd M, Cole D, Prielipp R. The evolution, current value, and future of the American Society of Anesthesiologists physical status classification system. Anesthesiology. 2021;135(5):904-919. [CrossRef] [Medline]
- Hao B, Zhu H, Paschalidis I. Enhancing clinical BERT embedding using a biomedical knowledge base. International Committee on Computational Linguistics; 2020. Presented at: Proceedings of the 28th International Conference on Computational Linguistics; December 8-13, 2020:657-661; Barcelona, Spain (Online). [CrossRef]
- Alayrac JB, Donahue J, Luc P, Miech A, Barr I, Hasson Y, et al. Flamingo: a visual language model for few-shot learning. ArXiv. Preprint posted online on April 29, 2022. 2022. [FREE Full text]
- Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234-1240. [FREE Full text] [CrossRef] [Medline]
- Zhang Y, Chen Q, Yang Z, Lin H, Lu Z. BioWordVec, improving biomedical word embeddings with subword information and MeSH. Sci Data. 2019;6(1):52. [FREE Full text] [CrossRef] [Medline]
- Wang Y, Zhang L, Zhang Y, Xia Y, Yang Y, Guo J. Incorporating domain knowledge into BERT: evaluating the impact of knowledge sources on medical relation extraction. J Biomed Inform. 2021;115:103688. [CrossRef]
- Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10(3):e0118432. [FREE Full text] [CrossRef] [Medline]
- Ozenne B, Subtil F, Maucort-Boulch D. The precision-recall curve overcame the optimism of the receiver operating characteristic curve in rare diseases. J Clin Epidemiol. 2015;68(8):855-859. [CrossRef] [Medline]
- Miller RD, Cohen NH, Eriksson LI, Fleisher LA, Leslie K, Wiener-Kronish JP. Miller's Anesthesia, 2-Volume Set. Philadelphia. Elsevier; 2020.
- Breiman L. Random Forests. Mach Learn. 2001;45:5-32. [FREE Full text] [CrossRef] [Medline]
- Chen T, Guestrin C. XGBoost: a scalable tree boosting system. 2016. Presented at: KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; August 13-17, 2016:785-794; San Francisco California USA. [CrossRef]
- Cox D. The regression analysis of binary sequences. J R Stat Soc Series B Stat Methodol. 1958;20:215-242. [CrossRef]
- Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273-297. [CrossRef]
- Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825-2830. [CrossRef]
- Bergstra J, Bengio Y. Random search for hyper-parameter optimization. J Mach Learn Res. 2012;13:281-305. [FREE Full text]
- Lundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B, et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell. 2020;2(1):56-67. [FREE Full text] [CrossRef] [Medline]
- Abhinand B. MedEmbed: medical-focused embedding models. Github. 2024. URL: https://github.com/abhinand5/MedEmbed [accessed 2025-05-14]
- Gu Y, Tinn R, Cheng H, Lucas M, Usuyama N, Liu X, et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans Comput Healthcare. 2021;3(1):1-23. [FREE Full text] [CrossRef]
- Zakka C, Shad R, Chaurasia A, Dalal AR, Kim JL, Moor M, et al. Almanac—retrieval-augmented language models for clinical medicine. NEJM AI. 2024;1(2):2300068. [FREE Full text] [CrossRef] [Medline]
- Unlu O, Shin J, Mailly CJ, Oates MF, Tucci MR, Varugheese M, et al. Retrieval-augmented generation–enabled GPT-4 for clinical trial screening. NEJM AI. 2024;1(7):2400181. [CrossRef]
- Chen PF, Chen L, Lin YK, Li GH, Lai F, Lu CW, et al. Predicting postoperative mortality with deep neural networks and natural language processing: model development and validation. JMIR Med Inform. 2022;10(5):e38241. [FREE Full text] [CrossRef] [Medline]
- Mahboub A, Za'ter M, Al-Rfooh B, Estaitia Y, Jaljuli A, Hakouz A. Evaluation of semantic search and its role in retrieved-augmented-generation (RAG) for arabic language. ArXiv. Preprint posted online on March 27, 2024. 2024. [FREE Full text]
- Sawarkar K, Mangal A, Solanki S. Blended RAG: improving RAG (retriever-augmented generation) accuracy with semantic search and hybrid query-based retrievers. IEEE; 2024. Presented at: IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR); August 07-09, 2024:155-161; San Jose, CA, USA. [CrossRef]
- Jeong S, Baek J, Cho S, Hwang S, Park J. Adaptive-RAG: learning to adapt retrieval-augmented large language models through question complexity. 2024. Presented at: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers); June 16-21, 2024:7036-7050; Mexico City, Mexico. URL: https://arxiv.org/abs/2403.14403 [CrossRef]
Abbreviations
AI: artificial intelligence |
ASA-PS: American Society of Anesthesiologists physical status |
AUPRC: area under the precision-recall curve |
AUROC: area under the receiver operating characteristic curve |
BERT: Bidirectional Encoder Representations from Transformers |
DNN: deep neural network |
EHR: electronic health record |
LLM: large language model |
ML: machine learning |
RAG: Retrieval-Augmented Generation |
SHAP: Shapley Additive Explanations |
SVM: support vector machine |
XGBoost: Extreme Gradient Boosting |
Edited by J Sarvestan; submitted 27.03.25; peer-reviewed by P Chung, E Bignami; comments to author 18.04.25; revised version received 08.05.25; accepted 12.05.25; published 03.06.25.
Copyright©Ying-Hao Chen, Shanq-Jang Ruan, Pei-fu Chen. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 03.06.2025.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.