This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Timely identification of patients at high risk of clinical deterioration is key to prioritizing care, allocating resources effectively, and preventing adverse outcomes. Vital signs–based, aggregate-weighted early warning systems are commonly used to predict the risk of outcomes related to cardiorespiratory instability and sepsis, which are strong predictors of poor outcomes and mortality. Machine learning models, which can incorporate trends and capture relationships among parameters that aggregate-weighted models cannot, have recently shown promising results.
This study aimed to identify, summarize, and evaluate the available research, current state of utility, and challenges with machine learning–based early warning systems using vital signs to predict the risk of physiological deterioration in acutely ill patients, across acute and ambulatory care settings.
PubMed, CINAHL, Cochrane Library, Web of Science, Embase, and Google Scholar were searched for peer-reviewed, original studies with keywords related to “vital signs,” “clinical deterioration,” and “machine learning.” Included studies used patient vital signs along with demographics and described a machine learning model for predicting an outcome in acute and ambulatory care settings. Data were extracted following PRISMA, TRIPOD, and Cochrane Collaboration guidelines.
We identified 24 peer-reviewed studies from 417 articles for inclusion; 23 studies were retrospective, while 1 was prospective in nature. Care settings included general wards, intensive care units, emergency departments, step-down units, medical assessment units, postanesthetic wards, and home care. Machine learning models including logistic regression, tree-based methods, kernel-based methods, and neural networks were most commonly used to predict the risk of deterioration. The area under the curve for models ranged from 0.57 to 0.97.
In studies that compared performance, reported results suggest that machine learning–based early warning systems can achieve greater accuracy than aggregate-weighted early warning systems, but several areas for further research were identified. While these models have the potential to provide clinical decision support, there is a need for standardized outcome measures to allow for rigorous evaluation of performance across models. Further research needs to address the interpretability of model outputs for clinicians, the clinical efficacy of these systems through prospective study designs, and their potential impact in different clinical settings.
Patient deterioration and adverse outcomes are often preceded by abnormal vital signs [
EWS typically employ heart rate (HR), respiratory rate (RR), blood pressure (BP), peripheral oxygen saturation (SpO2), temperature, and sometimes the level of consciousness [
Some of the commonly used aggregate-weighted EWS for predicting cardiorespiratory insufficiency and mortality are the Modified Early Warning Score (MEWS) [
The predictive ability of aggregate-weighted EWS has limitations. First, the scores indicate the patient's present risk but neither incorporate trends nor provide information about the possible risk trajectory [
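To make the limitation concrete, an aggregate-weighted score can be sketched as a sum of per-parameter weights looked up from threshold bands. The bands and weights below are illustrative placeholders, not the published MEWS or NEWS bins, and the snapshot-only scoring shows why such scores cannot reflect a trend:

```python
# Minimal sketch of an aggregate-weighted early warning score.
# Thresholds and weights are illustrative only, NOT the published MEWS/NEWS bins.

def band_score(value, bands):
    """Return the weight of the first band whose (low, high) range contains value."""
    for low, high, weight in bands:
        if low <= value <= high:
            return weight
    return 3  # values outside all listed bands get the maximum weight

HR_BANDS = [(51, 100, 0), (41, 50, 1), (101, 110, 1), (111, 129, 2)]
RR_BANDS = [(9, 14, 0), (15, 20, 1), (21, 29, 2)]
SBP_BANDS = [(101, 199, 0), (81, 100, 1), (71, 80, 2)]

def aggregate_score(hr, rr, sbp):
    return (band_score(hr, HR_BANDS)
            + band_score(rr, RR_BANDS)
            + band_score(sbp, SBP_BANDS))

# The score is computed from a single snapshot, so a heart rate that is
# rising steadily but still inside the "normal" band contributes 0,
# regardless of its trajectory.
print(aggregate_score(hr=95, rr=12, sbp=120))
print(aggregate_score(hr=125, rr=24, sbp=85))
```

Because each observation is scored in isolation, two patients with identical current vital signs receive identical scores even if one is deteriorating rapidly, which is the gap that trend-aware ML models aim to fill.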
A newer approach to EWS relies on machine learning (ML). ML models learn patterns and relationships directly from data rather than relying on a rule-based system [
Two systematic reviews in 2019 [
The review conducted by Linnen et al [
These are important findings; however, to date, no review has systematically examined the evidence from studies using ML-based EWS with vital sign measurements of varying frequencies, across different care settings and clinical outcomes, in order to identify common methodological trends and limitations of current approaches and to generate recommendations for future research in this area.
The objective of this study was to scope the state of research in ML-based EWS using vital signs data for predicting the risk of physiological deterioration in patients across acute and ambulatory care settings and to identify directions for future research in this area.
A systematic scoping review was conducted by following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) extension for scoping reviews (PRISMA-ScR) framework [
We searched PubMed, CINAHL, Cochrane Library, Web of Science, Embase, and Google Scholar for peer-reviewed studies without using any filters for study design and language. Searches were also conducted without any date restrictions. The reference lists of all studies that met the inclusion criteria were screened for additional articles. The search strategy involved a series of searches using a combination of relevant keywords and synonyms, including “vital signs,” “clinical deterioration,” and “machine learning.” See
The inclusion criteria covered the following:
Peer-reviewed studies evaluating continuous or intermittent vital sign monitoring in adult patients, so that all data collection or sampling frequencies (eg, 1 measurement per minute vs 1 measurement every 2 hours) were taken into consideration;
Studies conducted using data gathered from all acute and ambulatory care settings including medical or surgical hospital wards, ICUs, step-down units, ED, and in-home care;
Quantitative, observational, retrospective, and prospective cohort studies and randomized controlled trials;
Studies that involved ML or multivariable statistical or ML models and reported some model performance measure (eg, area under the curve) [
Studies that reported mortality or any outcomes related to clinical deterioration so that EWS models and performance can be examined for all explored outcomes.
The exclusion criteria included the following:
Studies that used any laboratory values as predictors for the ML-based EWS, as this review focuses on examining time-sensitive predictions of clinical deterioration using patient parameters that are readily available across all care settings;
Studies involving pediatric or obstetric populations due to these patients having different or altered physiologies that cannot be compared to standard adult patients;
Qualitative studies, reviews, preprints, case reports, commentaries, or conference proceedings.
References from the preliminary searches were handled using Mendeley reference management software. After duplicates were removed, titles and abstracts were screened to assess preliminary eligibility. Eligible studies were then read in full length to be assessed against the inclusion and exclusion criteria.
Data were extracted from eligible studies using an extraction sheet that followed the PRISMA [
The search for “vital signs” AND “clinical deterioration” AND “machine learning” identified 417 studies after duplicate removal. During the title and abstract screening process, 386 studies were excluded. Of the 31 full-text articles that were assessed, 7 studies were excluded for not meeting the eligibility criteria: 2 studies did not use ML models to predict deterioration, 3 studies included vital sign measurements in addition to laboratory values as predictors, 1 study focused on a cohort of pregnant women, and 1 study did not meet the criteria for model performance measures. A review of the reference lists of the 24 selected studies did not yield any additional studies fulfilling the eligibility criteria (refer to
PRISMA flowchart of the search strategy and study selection.
Of the selected studies, 23 conducted a retrospective analysis of the vital signs data, while 1 study [
Studies were conducted in a variety of settings within hospitals, while the study by Larburu et al [
Study characteristics.
Authors, year | Setting(s) | Data collection | Cohort description | Event rate | Study purpose | Predictors | Measurement frequency | Outcome |
Badriyah et al, 2014 [ |
Medical assessment unit for 24 hours | Personal digital assistants running VitalPAC software | 35,585 admissions | 199 (0.56%), cardiac arrest; |
Compare the performance of a decision tree analysis with NEWSb | HRc, RRd, SBPe, temperature, SpO2, AVPUf level, % breathing air at the time of SpO2 measurement | Not specified | Cardiac arrest, unanticipated ICU admission, or death, each within 24 hours of a given vital sign observation |
Chen et al, 2017 [ |
Step-down unit | Bedside monitors | 1880 patients (1971 admissions) | 997 patients (53%) or 1056 admissions (53.6%) who experienced CRIg events | Describe the dynamic and personal character of CRI risk evolution observed through continuous vital sign monitoring of individual patients | HR, RR, SpO2 (at 1/20 Hz), SBP, DBPh | Every 2 hours | CRI |
Churpek et al, 2016 [ |
All wards at the University of Chicago and 4 North Shore University Health System hospitals | Data collected manually, documented electronically | 269,999 admissions | 16,452 outcomes (6.09%) | Whether adding trends improves accuracy of early detection of clinical deterioration and which methods are optimal for modeling trends | Temperature, HR, RR, SpO2, DBP, SBP | Every 4 hours | Development of critical illness on the wards: deaths, cardiac arrest, ICU transfers |
Chiew et al, 2019 [ |
EDi at Singapore general hospital | Measurements at triage; hospital EHRj | 214 patients | 40 patients (18.7%) met outcome | Compare the performance of HR variability–based machine learning models vs conventional risk stratification tools to predict 30-day mortality | Age, gender, ethnicity, temperature, HR, RR, SBP, DBP, GCSk, HR variability | At triage | 30-day mortality due to sepsis |
Chiu et al, 2019 [ |
Postoperative surgical wards at 4 UK adult cardiac surgical centers | VitalPac to electronically capture patients’ vital signs | Adults undergoing risk-stratified major cardiac surgery, n=13,631 | 578 patients (4.2%) with an outcome; 499 patients (3.66%) with unplanned ICU readmissions | Using logistic regression to model the association of NEWS variables with a serious patient event in the subsequent 24 hours; secondary objectives: comparing the discriminatory power of each model for events in the next 6 hours or 12 hours | RR, SpO2, SBP, HR, temperature, consciousness level | Not specified | Death, cardiac arrest, unplanned ICU readmissions |
Clifton et al, 2014 [ |
Postoperative ward of the cancer center, Oxford University Hospitals NHSl Trust, United Kingdom | Continuous vitals monitored by wearable devices; intermittent vitals monitored manually by ward staff | 200 patients in the postoperative ward following upper gastrointestinal cancer surgery | Not specified | Using continuous vitals monitoring to provide early warning of physiological deterioration, such that preventative clinical action may be taken | SpO2, HR (256 Hz), BP, RR | Continuously (SpO2, HR), intermittently (BP, RR) | Physiological deterioration |
Desautels et al, 2016 [ |
Beth Israel Deaconess Medical Center ICU | ICU bedside monitors and medical records (MIMICm-III) | 22,853 ICU stays | 2577 (11.28%) stays with confirmed sepsis | Validate a sepsis prediction method, InSight, for the new Sepsis-3 definitions and make predictions using a minimal set of variables | GCS, HR, RR, SpO2, temperature, invasive and noninvasive SBP and DBP | At least 1 measurement per hour | Onset of sepsis |
Forkan et al, 2017 [ |
Beth Israel Deaconess Medical Center ICU | ICU bedside monitors and medical records (MIMIC-II) | 1023 patients | Not specified | Develop a probabilistic model for predicting the future clinical episodes of a patient using observed vital sign values prior to the clinical event | HR, SBP, DBP, mean BP, RR, SpO2 | All samples converted to per-minute sampling | Abnormal clinical events |
Forkan et al, 2017 [ |
Beth Israel Deaconess Medical Center ICU | ICU bedside monitors and medical records (MIMIC & MIMIC-II) | 85 patients | Not specified | Develop an intelligent method for personalized monitoring and clinical decision support through early estimation of patient-specific vital sign values | HR, SBP, DBP, mean BP, RR, SpO2 | Per-minute sampling | Patient-specific anomalies, disease symptoms, and emergencies |
Forkan et al, 2017 [ |
Beth Israel Deaconess Medical Center ICU | ICU bedside monitors and medical records (MIMIC-II) | 4893 patients | Not specified | Build a prognostic model, ViSiBiD, that can accurately identify dangerous clinical events of a home-monitored patient in advance | HR, SBP, DBP, mean BP, RR, SpO2 | Per-minute sampling | Dangerous clinical events |
Guillame-Bert et al, 2017 [ |
Step-down unit | Bedside monitor measurements over 8 weeks | 297 admissions | 127 patients (43%) exhibited at least 1 real event during their stay | Forecast CRI utilizing data from continuous monitoring of physiologic vital sign measurements | HR, RR, SpO2, SBP, DBP, mean BP | Every 20 seconds (HR, RR, SpO2), every 2 hours (SBP, DBP, and mean BP) | At least 1 event threshold limit criteria exceeded for >80% of last 3 minutes |
Ho et al, 2017 [ |
Beth Israel Deaconess Medical Center ICU | ICU bedside monitors and medical records (MIMIC-II) | 763 patients | 197 patients (25.8%) experienced a cardiac arrest event | Build a cardiac arrest risk prediction model capable of early notification at time z (z ≥5 hours prior to the event) | Temperature, SpO2, HR, RR, DBP, SBP, pulse pressure index | 1 reading per hour | Cardiac arrest |
Jang et al, 2019 [ |
ED visits to a tertiary academic hospital | EHR data from ED visits | Nontraumatic ED visits | 374,605 eligible ED visits of 233,763 patients; 1097 (0.3%) patients with cardiac arrest | Develop and test artificial neural network classifiers for early detection of patients at risk of cardiac arrest in EDs | Age, sex, chief complaint, SBP, DBP, HR, RR, temperature, AVPU | Not specified | Development of cardiac arrest within 24 hours after prediction |
Kwon et al, 2018 [ |
Cardiovascular teaching hospital and community general hospital | Data collected manually by staff on general wards, by bedside monitors in ICUs | 52,131 patients | 419 patients (0.8%) with cardiac arrest; 814 (1.56%) deaths without attempted resuscitation | Predict whether an input vector belonged within the prediction time window (0.5-24 hours before the outcome) | SBP, HR, RR, temperature | 3 times a day on general wards, every 10 minutes in ICUs | Primary outcome: first cardiac arrest; secondary outcome: death without attempted resuscitation |
Kwon et al, 2018 [ |
151 EDs in Korea | Korean National Emergency Department Information System (NEDIS) | 10,967,518 ED visits | 153,217 (1.4%) in-hospital deaths; 625,117 (5.7%) critical care admissions; 2,964,367 (27.0%) hospitalizations | Validate that a DTASn identifies high-risk patients more accurately than existing triage and acuity scores | Age, sex, chief complaint, time from symptom onset to ED visit, arrival mode, trauma, initial vital signs (SBP, DBP, HR, RR, temperature), mental status | At ED admission | Primary outcome: in-hospital mortality; secondary outcome: critical care; tertiary outcome: hospitalization |
Larburu et al, 2018 [ |
OSI Bilbao-Basurto (Osakidetza) Hospital and ED admissions, ambulatory | Collected manually by clinicians and patients | 242 patients | 202 predictable decompensations | Prevent mobile heart failure patients’ decompensation using predictive models | SBP, DBP, HR, SaO2, weight | At diagnosis and 3-7 times per week in ambulatory patients | Heart failure decompensation |
Li et al, 2016 [ |
Beth Israel Deaconess Medical Center ICU | ICU bedside monitors and medical records (MIMIC-II) | 12 patients | Not specified | Adaptive online monitoring of patients in ICUs | HR, SBP, DBP, MAPo, RR | At least 1 measurement per hour | Signs of deterioration |
Liu et al, 2014 [ |
ED of a tertiary hospital in Singapore | Manual vital measurements by nurses or physicians | 702 patients with undifferentiated, nontraumatic chest pain | 29 (4.13%) patients met primary outcome | Discover the most relevant variables for risk prediction of major adverse cardiac events using clinical signs and HR variability | SBP, RR, HR | Not specified | Composite of events such as death and cardiac arrest within 72 hours of arrival at the ED |
Mao et al, 2018 [ |
ICU, inpatient wards, outpatient visits | UCSFp dataset: inpatient and outpatient visits; MIMIC-III: ICU bedside monitors | UCSF: 90,353 patients; |
UCSF: 1179 (1.3%) sepsis, 349 (0.39%) severe sepsis, 614 (0.68%) septic shock; MIMIC-III: sepsis (1.91%), severe sepsis (2.82%), septic shock (4.36%) | Sepsis prediction | SBP, DBP, HR, RR, SpO2, temperature | Hourly | Sepsis, severe sepsis, septic shock |
Olsen et al, 2018 [ |
PACUq, Rigshospitalet, University of Copenhagen, Denmark | IntelliVue MP5, BMEYE Nexfin bedside monitors during admission to the postanesthesia care unit | 178 patients | 160 (89.9%) had ≥1 microevent occurring during admission; 116 patients (65.2%) had ≥1 microevent with a duration >15 minutes | Develop a predictive algorithm detecting early signs of deterioration in the PACU using continuously collected cardiopulmonary vital signs | SpO2, SBP, HR, MAP | Every minute (SpO2, SBP, HR), every 15 minutes (MAP) | Signs of deterioration |
Shashikumar et al, 2017 [ |
Adult ICU units | ICU bedside monitors, Bedmaster system; up to 24 hours of monitoring | Patients with unselected mixed surgical procedures | 242 sepsis cases | Predict onset of sepsis 4 hours ahead of time, using commonly measured vital signs | MAP, HR, SpO2, SBP, DBP, RR, GCS, temperature, comorbidity, clinical context, admission unit, surgical specialty, wound type, age, gender, weight, race | ≥1 measurement per hour | Onset of sepsis |
Tarassenko et al, 2006 [ |
General wards at John Radcliffe Hospital in Oxford, United Kingdom | Bedside monitors for at least 24 hours per patient | 150 general-ward patients | Not specified | A real-time automated system, BioSign, which tracks patient status by combining information from vital signs | HR, RR, SpO2, skin temperature, average SBP, average DBP | Every 30 minutes (BP), every 5 seconds (other vitals) | Signs of deterioration |
Van Wyk et al, 2017 [ |
Methodist LeBonheur Hospital, Memphis, TN | Bedside monitors: Cerner CareAware iBus system | 2995 patients | 343 patients (11.5%) diagnosed with sepsis | Classify patients into sepsis and nonsepsis groups using data collected at various frequencies from the first 12 hours after admission | HR, MAP, DBP, SBP, SpO2, age, race, gender, fraction of inspired oxygen | Every minute | Sepsis detection |
Yoon et al, 2019 [ |
Beth Israel Deaconess Medical Center ICU | ICU bedside monitors and medical records (MIMIC-II) | 2809 subjects | 787 tachycardia episodes | Predicting tachycardia as a surrogate for instability | Arterial DBP, arterial SBP, HR, RR, SpO2, MAP | 1/60 Hz or 1 Hz | Tachycardia episode |
aICU: intensive care unit.
bNEWS: National Early Warning Score.
cHR: heart rate.
dRR: respiratory rate.
eSBP: systolic blood pressure.
fAVPU: alert, verbal, pain, unresponsive.
gCRI: cardiorespiratory instability.
hDBP: diastolic blood pressure.
iED: emergency department.
jEHR: electronic health record.
kGCS: Glasgow Coma Score.
lNHS: National Health Service.
mMIMIC: Medical Information Mart for Intensive Care.
nDTAS: Deep learning–based Triage and Acuity Score.
oMAP: mean arterial pressure.
pUCSF: University of California, San Francisco.
qPACU: postanesthesia care unit.
The most commonly used vital sign predictors were HR, RR, systolic BP, diastolic BP, SpO2, body temperature, level of consciousness through either the Glasgow Coma Score or the AVPU scale, and mean arterial pressure. Measurement frequencies for these variables ranged from once every 5 seconds [
The outcomes being predicted in most studies focused on cardiorespiratory insufficiency–related events. Cardiac arrest was the primary outcome in 7 [
Outcomes were first identified, and baseline models were created using predefined parameter thresholds (ground truth) consistent with the MEWS [
All included studies consider the prediction of deterioration risk to be a classification task and therefore use different types of classification models in the process, including tree-based models, linear models, kernel-based methods, and neural networks (refer to
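As an illustration of this classification framing (not a reproduction of any included study's pipeline), the four model families can be fitted to the same table of vital sign feature vectors. The data below are synthetic, with a label loosely tied to tachycardia plus hypotension purely for demonstration:

```python
# Sketch of the shared classification framing: each observation is a vector of
# vital signs, labeled by whether deterioration occurred. Data are synthetic;
# no included study's cohort or pipeline is reproduced.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.normal(80, 15, n),   # HR
    rng.normal(16, 4, n),    # RR
    rng.normal(120, 20, n),  # SBP
    rng.normal(97, 2, n),    # SpO2
    rng.normal(37, 0.5, n),  # temperature
])
# Synthetic outcome: "deterioration" when tachycardic and relatively hypotensive.
y = ((X[:, 0] > 100) & (X[:, 2] < 110)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),   # linear
    "random forest": RandomForestClassifier(random_state=0),    # tree-based
    "SVM (RBF kernel)": SVC(probability=True, random_state=0),  # kernel-based
    "neural network": MLPClassifier(max_iter=2000, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, round(model.score(X_te, y_te), 3))
```

The point of the sketch is that all four families consume the same feature matrix and emit a class (or probability); they differ only in the decision boundary they can represent, which is why the reviewed studies compare them on identical inputs.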
Measures used to assess model performance varied across the studies. The most common measure was the area under the receiver operating characteristic curve (AUROC), along with model accuracy, sensitivity, and specificity. The area under the precision-recall curve, F-score, Hamming score, and precision (positive predictive value) were reported less commonly.
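These measures are all derived from a model's predicted probabilities or thresholded predictions; a minimal sketch with scikit-learn follows, using toy values rather than results from any included study:

```python
# Computing the performance measures reported across the included studies,
# from toy predictions (not results from any study).
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             f1_score, precision_score, recall_score)

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]                        # event labels
y_prob = [0.1, 0.2, 0.15, 0.4, 0.6, 0.3, 0.55, 0.8, 0.7, 0.9]  # model scores
y_pred = [int(p >= 0.5) for p in y_prob]                       # 0.5 threshold

print("AUROC:", roc_auc_score(y_true, y_prob))                # ranking quality
print("AUPRC:", average_precision_score(y_true, y_prob))      # imbalance-aware
print("F1:", f1_score(y_true, y_pred))
print("precision (PPV):", precision_score(y_true, y_pred))
print("recall (sensitivity):", recall_score(y_true, y_pred))
```

Note that AUROC and AUPRC are threshold-free (they score the ranking of probabilities), whereas F-score, precision, and recall depend on the chosen alerting threshold, which is one reason cross-study comparison is difficult.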
Prediction windows ranged from 30 minutes to 30 days before an event.
Model performance varied substantially based on the outcome measure being predicted (eg, cardiorespiratory insufficiency vs sepsis), the ML method used (eg, linear vs tree-based), and the prediction window (eg, 30 minutes vs 4 hours before an event).
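One reason the prediction window affects reported performance is that it defines how training labels are constructed: an observation is labeled positive when the event occurs within the window after it, so widening the window changes the positive class itself. A sketch of that labeling step, using hypothetical timestamps:

```python
# Sketch of prediction-window labeling: an observation is positive when the
# event falls within `window` after the observation time.
# Timestamps are hypothetical; no study's dataset is reproduced.
from datetime import datetime, timedelta

def label_observations(obs_times, event_time, window):
    """Return 1 for observations within `window` before the event, else 0."""
    return [int(event_time - window <= t < event_time) for t in obs_times]

event = datetime(2020, 1, 1, 12, 0)  # time of the deterioration event
obs = [event - timedelta(hours=h) for h in (26, 12, 5, 1)]

# A 4-hour window marks only the last observation positive;
# a 24-hour window marks three of the four positive.
print(label_observations(obs, event, timedelta(hours=4)))
print(label_observations(obs, event, timedelta(hours=24)))
```

Because the same vital sign trace yields different label sets under different windows, AUROC values reported for, say, a 30-minute horizon are not directly comparable to those for a 24-hour horizon.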
Machine learning (ML) models and comparisons used for outcome prediction.
Study | Cohort | Event rate | ML model(s) | Missing data handling | Best ML model performance | ML model comparisons | Prediction window | Aggregate weighted EWSa comparisons |
Badriyah et al, 2014 [ |
35,585 admissions | 199 (0.56%), cardiac arrest; |
Decision tree analysis | Not specified | Decision tree predicted cardiac arrest: AUROCc=0.708; |
Not specified | Within 24 hours preceding events | NEWSd AUROC: cardiac arrest, 0.722; unanticipated ICU admission, 0.857; |
Chen et al, 2017 [ |
1880 patients (1971 admissions) | 997 patients (53%) or 1056 admissions (53.6%) who experienced CRIe events | Variant of the random forest classification model using nonrandom splits | Not specified | Random forest AUCf initially remained constant (0.58-0.60), followed by an increasing trend, with AUCs rising from 0.57 to 0.89 during the 4 hours immediately preceding events | Logistic regression: AUC=0.7; lasso logistic regression: AUC=0.82 | Within 4 hours preceding events | No comparison |
Churpek et al, 2016 [ |
269,999 admissions | 16,452 outcomes (6.09%) | Univariate analysis, bivariate analysis |
Forward imputation, median value imputation | Trends increased model accuracy compared to a model containing only current vital signs (AUC 0.78 vs 0.74); vital sign slope improved AUC by 0.013 | Not specified | Within 4 hours preceding events | No comparison |
Chiew et al, 2019 [ |
214 patients | 40 patients (18.7%) met outcome | K-nearest neighbor, random forest, adaptive boosting, gradient boosting, support vector machine | Not specified | Gradient boosting predicted 30-day sepsis-related mortality: F1 score=0.50, AUPRC=0.35, precision (PPVg)=0.62, recall=0.5 | K-nearest neighbor: F1 score=0.10, AUPRC=0.10, precision (PPV)=0.33, recall=0.6; random forest: F1 score=0.35, AUPRC=0.27, precision (PPV)=0.26, recall=0.56; adaptive boosting: F1 score=0.40, AUPRC=0.31, precision (PPV)=0.43, recall=0.38; SVMh: F1 score=0.43, AUPRC=0.29, precision (PPV)=0.33, recall=0.63 | Within 30 days preceding event | SEDSi: F1=0.40, AUPRC=0.22; qSOFAj: F1=0.32, AUPRC=0.21; NEWS: F1=0.38, AUPRC=0.28; MEWSk: F1=0.30, AUPRC=0.25 |
Chiu et al, 2019 [ |
Adults undergoing risk-stratified major cardiac surgery (n=13,631) | 578 patients (4.2%) with an outcome; 499 patients (3.66%) with unplanned ICU readmissions | Logistic regression | Observations with missing values were excluded | Logistic regression predicted the event 24 hours in advance: AUROC=0.779; 12 hours in advance: AUROC=0.815; 6 hours in advance: AUROC=0.841 | Not specified | Within 24, 12, and 6 hours preceding event | NEWS: 24 hours before event, |
Clifton et al, 2014 [ |
200 patients in the postoperative ward following upper gastrointestinal cancer surgery | Not specified | Classifiers, Gaussian process, one-class support vector machine, kernel estimate | Missing channels replaced by mean of that channel | SVM predicted deterioration: accuracy=0.94, partial AUC=0.28, sensitivity=0.96, specificity=0.93 | Conventional SVM: accuracy=0.90, partial AUC=0.26, sensitivity=0.92, specificity=0.87; Gaussian mixture models: accuracy=0.9, partial AUC=0.24, sensitivity=0.97, specificity=0.84; Gaussian processes: accuracy=0.90, partial AUC=0.26, sensitivity=0.91, specificity=0.89; kernel density estimate: accuracy=0.91, partial AUC=0.26, sensitivity=0.94, specificity=0.87 | Not specified | No comparison |
Desautels et al, 2016 [ |
22,853 ICU stays | 2577 (11.28%) stays with confirmed sepsis | Insight classifier | Carry forward imputation | Classifier predicts sepsis at onset: AUROC=0.880, APRl=0.6, accuracy=0.8; classifier predicts sepsis 4 hours before onset: AUROC=0.74, APR=0.28, accuracy=0.57 | Not specified | Within 4 hours preceding event and at time of event onset | SIRSm: AUROC= 0.609, APR= 0.160; qSOFA: AUROC= 0.772, APR=0.277; MEWS: AUROC=0.803, APR=0.327; SAPSn II: AUROC=0.700, APR=0.225; SOFA: AUROC=0.725, APR=0.284 |
Forkan et al, 2017 [ |
1023 patients | Not specified | PCAo used to separate patients into multiple categories; hidden Markov model adopted for probabilistic classification and future prediction | Data with consecutive missing values over a long period are eliminated | Hidden Markov model event prediction: accuracy=97.8%, precision=92.3%, sensitivity=97.7%, specificity=98%, F-score=95% | Neural network: accuracy=93% | Within 30 minutes preceding event | No comparison |
Forkan et al, 2017 [ |
85 patients | Not specified | Multilabel classification algorithms are applied in classifier design; result analysis with J48 decision tree, random tree and sequential minimal optimization (SMO, a simplified version of SVM) | Where ≥1 vital signs data are missing while clean values of others are available, considered as recoverable and imputed using median-pass and k-nearest neighbor filter | Predictions across 24 classifier combinations yielded a Hamming score of 90%-95%; F1-micro average of 70.1%-84%; accuracy of 60.5%-77.7% | Not specified | Within 1 hour preceding event | No comparison |
Forkan et al, 2017 [ |
4893 patients | Not specified | J48 decision tree, random forest, sequential minimal optimization, MapReduce random forest | Data with consecutive missing values over a long period are eliminated | Event prediction by random forest: within a 60-minute forecast horizon, F-score=0.96, accuracy=95.86; within a 90-minute forecast horizon, F-score=0.95, accuracy=95.35; within a 120-minute forecast horizon, F-score=0.95, accuracy=95.18 | J48 decision tree: within a 60-minute forecast horizon, F-score=0.93, accuracy=92.46; within a 90-minute forecast horizon, F-score=0.92, accuracy=91.59; within a 120-minute forecast horizon, F-score=0.91, accuracy=91.30; event prediction with sequential minimal optimization: within a 60-minute forecast horizon, F-score=0.91, accuracy=90.72; within a 90-minute forecast horizon, F-score=0.90, accuracy=90.08; within a 120-minute forecast horizon, |
1 hour preceding event | No comparison |
Guillame-Bert et al, 2017 [ |
297 admissions | 127 patients (43%) exhibited at least 1 real CRI event during their stay in the step-down unit | TITAp rules, rule fusion algorithm; mapping function from rule-based features to forecast model learned using random forest classifier | Not specified | Event forecast alert within 17 minutes, 51 seconds before onset of CRI (false alert every 12 hours); event forecast alert within 10 minutes, 58 seconds before onset of CRI (false alert every 24 hours) | Random forest: event forecast alert within 11 minutes, 25 seconds before onset of CRI (false alert every 12 hours); event forecast alert within 5 minutes, 52 seconds before onset of CRI (false alert every 24 hours) | Within 17 minutes, 51 seconds preceding CRI onset | No comparison |
Ho et al, 2017 [ |
763 patients | 197 patients (25.8%) experienced a cardiac arrest event | Temporal transfer learning-based model (TTL-Reg) | Imputed values based on the median from patients of the same gender and similar ages | TTL-Reg predicts events with an AUC of 0.63 | Not specified | Within 6 hours preceding event | No comparison |
Jang et al, 2019 [ |
Nontraumatic ED visits | 374,605 eligible ED visits of 233,763 patients; 1097 (0.3%) patients with cardiac arrest | ANNq with multilayer perceptron, ANN with LSTMr, hybrid ANN; comparison with random forest and logistic regression | Not specified | Event prediction: ANN with multilayer perceptron, AUROC=0.929; ANN with LSTM, AUROC=0.933; hybrid ANN, AUROC=0.936 | Random forest, AUROC=0.923; logistic regression, AUROC=0.914 | Within 24 hours preceding event | MEWS: AUROC=0.886 |
Kwon et al, 2018 [ |
52,131 patients | 419 patients (0.8%) with cardiac arrest; 814 (1.56%) deaths without attempted resuscitation | 3 RNN layers with LSTM to deal with time series data; compared to random forest and logistic regression | Most recent value was used; if no value available, then median value used | Event prediction: RNN, AUROC=0.85, AUPRCt=0.044 | Random forest, AUROC=0.78, AUPRC=0.014; |
30 minutes to 24 hours preceding event | MEWS: AUROC=0.603, AUPRC=0.003 |
Kwon et al, 2018 [ |
10,967,518 ED visits | 153,217 (1.4%) in-hospital deaths; 625,117 (5.7%) critical care admissions; 2,964,367 (27.0%) hospitalizations | DTASu using multilayer perceptron with 5 hidden |
Excluded | Event prediction: DTAS using multilayer perceptron, AUROC=0.935, AUPRC=0.264 | Random forest: AUROC= 0.89, AUPRC= 0.14; logistic regression: AUROC= 0.89, AUPRC=0.16 | Not specified | Korean triage and acuity score: AUROC =0.785, AUPRC=0.192; |
Larburu et al, 2018 [ |
242 patients | 202 predictable decompensations | Naïve Bayes, decision tree, random forest, SVM | Not specified | Decompensation event prediction: naïve Bayes, AUC=67% | Decision tree, neural network, random forest, support vector machine, stochastic gradient descent | Not specified | No comparison |
Li et al, 2016 [ |
12 patients | Not specified | L-PCA (combination of just-in-time learning and PCA) | Not specified | Fault detection rate with L-PCA: 20% higher than with PCA; 47% higher than with fast moving-window PCA; best detection rate achieved was 99.8% | Not specified | Not specified | No comparison |
Liu et al, 2014 [ |
702 patients with undifferentiated, nontraumatic chest pain | 29 (4.13%) patients met primary outcome | Novel variable selection framework based on ensemble learning; random forest was the independent variable selector for creating the decision ensemble | Not specified | Event prediction with ensemble learning model: AUC=0.812, cut-off score=43, sensitivity=82.8%, specificity=63.4% | Not specified | Within 72 hours of arrival at ED | TIMIv: AUC=0.637; MEWS: AUC=0.622 |
Mao et al, 2018 [ |
UCSFw: 90,353 patients; MIMICx-III: 21,604 patients | UCSF: 1179 (1.3%) sepsis, 349 (0.39%) severe sepsis, 614 (0.68%) septic shock; MIMIC-III: sepsis (1.91%), severe sepsis (2.82%), septic shock (4.36%) | Gradient tree boosting + transfer learning using MIMIC-III as source and UCSF as target | Carry forward imputation | Detection with gradient tree boosting: AUROC=0.92 for sepsis; AUROC=0.87 for severe sepsis at onset; AUROC=0.96 for septic shock 4 hours before; AUROC=0.85 for severe sepsis prediction 4 hours before | Not specified | At onset of sepsis and severe sepsis; within 4 hours preceding septic shock and severe sepsis | MEWS: AUROC=0.76; SOFA: AUROC=0.65; SIRS: AUROC=0.72 |
Olsen et al, 2018 [ |
178 patients | 160 (89.9%) had ≥1 microevent occurring during admission; 116 patients (65.2%) had ≥1 microevent with a duration >15 minutes | Random forest classifier | Not specified | Detection of early signs of deterioration with random forest: accuracy=92.2%, sensitivity=90.6%, specificity=93.0%, AUROC=96.9% | Not specified | Not specified | Compared with hospital's current alarm system: number of false alarms decreased by 85%, number of missed early signs of deterioration decreased by 73% |
Shashikumar et al, 2017 [ |
Patients with unselected mixed surgical procedures | 242 sepsis cases | Elastic net logistic classifier | Median values (if multiple measurements were available); otherwise, the old values were kept (sample-and-hold interpolation); mean imputation for replacing all remaining missing values | Event prediction: elastic net logistic classifier using entropy features alone, AUROC=0.67, accuracy=47%; elastic net logistic classifier using social demographics + EMRy features, AUROC=0.7, accuracy=50%; elastic net logistic classifier using all features, AUROC=0.78, accuracy=61% | Not specified | 4 hours prior to onset | No comparison |
Tarassenko et al, 2006 [ |
150 general-ward patients | Not specified | Biosign; data fusion method: probabilistic model of normality in five dimensions | Historic, median filtering | 95% of Biosign alerts were classified as “True” by clinical experts | Not specified | Within 120 minutes of event | No comparison |
Van Wyk et al, 2017 [ |
2995 patients | 343 patients (11.5%) diagnosed with sepsis | CNNz (constructed images using raw patient data) with random dropout to reduce overfitting; multilayer perceptron with random dropout between layers to avoid overfitting | Not specified | Event classification with a 1-minute observation frequency: CNN, accuracy=86.1%; event classification with a 10-minute observation frequency: CNN, accuracy=78.2% | Event classification with a 1-minute observation frequency: multilayer perceptron, accuracy=76.5% | Not specified | No comparison |
Yoon et al, 2019 [ |
2809 subjects | 787 tachycardia episodes | Regularized logistic regression and random forest classifiers | Discrete Fourier transform, cubic-spline interpolation of heart rate and respiratory rate data for missing data as long as ≥20% of the data were available | Event prediction: random forest, AUC=0.869, accuracy=0.806 | Logistic regression with L1 regularization, AUC=0.8284, accuracy=0.7668 | Within 3 hours preceding onset | No comparison |
aEWS: early warning system.
bICU: intensive care unit.
cAUROC: area under the receiver operating characteristic.
dNEWS: National Early Warning Score.
eCRI: cardiorespiratory instability.
fAUC: area under the curve.
gPPV: positive predictive value.
hSVM: support vector machine.
iSEDS: Singapore Emergency Department Sepsis.
jqSOFA: quick Sequential Organ Failure Assessment.
kMEWS: Modified Early Warning Score.
lAPR: area under the precision-recall curve.
mSIRS: systemic inflammatory response syndrome.
nSAPS II: Simplified Acute Physiology Score II.
oPCA: principal component analysis.
pTITA: temporal interval tree association.
qANN: artificial neural network.
rLSTM: long short-term memory.
sRNN: recurrent neural network.
tAUPRC: area under the precision-recall curve.
uDTAS: Deep learning–based Triage and Acuity Score.
vTIMI: Thrombolysis in Myocardial Infarction.
wUCSF: University of California, San Francisco.
xMIMIC: Medical Information Mart for Intensive Care.
yEMR: electronic medical record.
zCNN: convolutional neural network.
Nine studies compared the performance of ML-based EWS with aggregate-weighted EWS. Studies exploring cardiorespiratory outcomes, general physiological deterioration, or mortality carried out comparisons with NEWS [
In all 9 studies, the ML models outperformed the aggregate-weighted EWS for all clinical outcomes except for cardiac arrest in the study by Badriyah et al [
Based on this scoping review, ML-based EWS models show considerable promise, but several important avenues of research remain before these models can be effectively implemented in clinical practice.
A model’s prediction window refers to how far in advance a model is predicting an adverse event. Most studies included in our review used a prediction window between 30 minutes [
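The prediction window also determines how training labels are constructed: observations falling inside the window that precedes an adverse event are treated as positive examples. A minimal sketch of that labeling step (the function name, timestamps, and 4-hour window are illustrative, not taken from any included study):

```python
from datetime import datetime, timedelta

def label_with_prediction_window(sample_times, event_time, window_hours):
    """Mark a vital-sign sample positive when it falls inside the
    prediction window immediately preceding the adverse event."""
    window = timedelta(hours=window_hours)
    return [1 if event_time - window <= t < event_time else 0
            for t in sample_times]

# Hourly observations; hypothetical event at 10:00 with a 4-hour window.
samples = [datetime(2024, 1, 1, h) for h in range(12)]
event = datetime(2024, 1, 1, 10)
labels = label_with_prediction_window(samples, event, window_hours=4)
# Samples from 06:00 through 09:00 become positive training examples.
```

Widening the window yields earlier (more actionable) alerts but harder, noisier labels, which is why reported windows in the included studies trade off so differently.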
The studies included in this review focused on ML model development and did not explore how the output of these models would be communicated to clinicians. Since many ML models are “black boxes” [
Nearly all the studies included in this review were conducted in inpatient settings. While EWS are highly valuable in an inpatient context, there is also considerable need in the ambulatory setting, particularly postdischarge. For example, the VISION study [
All but one study [
A key observation from this review is the lack of an agreed-upon standard among the research community for reporting performance measures across studies. This makes meaningful comparison between the outcomes of these studies difficult, and where there is overlap, it is not clear that the most clinically relevant metrics have been chosen. The majority of the studies in this review report the AUROC as the main performance metric, reflecting a common practice in the ML literature. However, AUROC may not be adequate for evaluating the performance of the EWS in a clinical setting [
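The inadequacy is easy to demonstrate: at the low event prevalence typical of deterioration outcomes, a model can post a high AUROC while the precision of its alerts remains poor. A self-contained toy example (the scores and cohort sizes below are invented purely for illustration):

```python
def auroc(pos_scores, neg_scores):
    """AUROC as the probability that a random positive outranks a
    random negative (Mann-Whitney formulation; ties count as 0.5)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical imbalanced cohort: 5 deteriorating patients, 995 stable.
pos = [0.95, 0.90, 0.85, 0.80, 0.75]            # events ranked high
neg = [0.91 + 0.0005 * i for i in range(45)]    # 45 high-scoring "near misses"
neg += [0.70 - 0.0007 * i for i in range(950)]  # the rest score clearly low

a = auroc(pos, neg)                    # ~0.96 -- looks excellent
threshold = min(pos)                   # alert level that catches every event
false_alarms = sum(1 for s in neg if s >= threshold)
precision = len(pos) / (len(pos) + false_alarms)  # 0.10 -- 9 false alarms
                                                  # per true alert
```

The ranking metric and the bedside experience diverge: an AUROC near 0.96 coexists with a 10% positive predictive value, which is why precision-recall-based measures are often more informative for rare outcomes.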
As Romero-Brufau et al [
The performance of an EWS depends on the tradeoff between 2 competing goals: detecting outcomes early versus issuing few false-positive alerts to prevent alarm fatigue [
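The arithmetic behind alarm fatigue follows directly from Bayes' rule: at the low event prevalence of a general ward, even a sensitive and specific alarm produces mostly false positives. A small illustration (the 90%/90%/2% figures are hypothetical):

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value of an alert via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# A seemingly strong alarm (90% sensitivity, 90% specificity) on a ward
# where 2% of patients deteriorate:
print(round(ppv(0.90, 0.90, 0.02), 3))  # 0.155 -> ~85% of alerts are false
```

Raising the alert threshold improves this ratio but delays or misses true events, which is exactly the tradeoff the reviewed studies negotiate when choosing operating points.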
On a related note, only 9 of the studies included in our review made comparisons between their ML-based models and a “gold standard” aggregate-weighted EWS, such as MEWS or NEWS. Future research in the area should report a commonly used aggregate-weighted EWS as a baseline model, which would aid in making effective comparisons between them. NEWS may be particularly well suited to this area of research as its input variables can all be measured automatically and continuously via devices.
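Such a baseline costs little to implement. As a hedged sketch, one commonly published MEWS variant can be scored as follows (cut points differ across institutions and publications, so treat these thresholds as illustrative rather than definitive):

```python
def mews(rr, hr, sbp, temp, avpu):
    """Aggregate-weighted MEWS sketch from routine vital signs.
    Thresholds follow one commonly published version; real
    deployments should use locally validated cut points."""
    score = 0
    # Respiratory rate (breaths/min)
    if rr < 9: score += 2
    elif rr <= 14: score += 0
    elif rr <= 20: score += 1
    elif rr <= 29: score += 2
    else: score += 3
    # Heart rate (beats/min)
    if hr <= 40: score += 2
    elif hr <= 50: score += 1
    elif hr <= 100: score += 0
    elif hr <= 110: score += 1
    elif hr <= 129: score += 2
    else: score += 3
    # Systolic blood pressure (mm Hg)
    if sbp <= 70: score += 3
    elif sbp <= 80: score += 2
    elif sbp <= 100: score += 1
    elif sbp <= 199: score += 0
    else: score += 2
    # Temperature (degrees C)
    if temp < 35 or temp >= 38.5: score += 2
    # AVPU consciousness level
    score += {"alert": 0, "voice": 1, "pain": 2, "unresponsive": 3}[avpu]
    return score

# Hypothetical deteriorating patient: tachypneic, tachycardic, febrile.
print(mews(22, 115, 95, 38.6, "voice"))  # 8
```

Because every input above can be captured automatically by bedside monitors, reporting such a score alongside an ML model gives readers an immediately interpretable point of comparison.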
The search strategy was comprehensive while not being overly focused on specific clinical outcomes, sampling frequencies, or time filters. This allowed for the identification of as many studies as possible that examined the use of ML models and vital signs to predict the risk of patient deterioration. No additional studies were identified through citation tracking after the original search, suggesting that the search achieved good coverage. Unlike previous reviews, the inclusion criteria supported the examination of findings from studies conducted across a variety of clinical settings, including specialty units or wards and ambulatory care. This helped in characterizing the use of ML-based prediction models in different patient-care environments with varying clinical endpoints. Wherever the original studies provided the data, comparisons were drawn between the performance of the ML models and that of aggregate-weighted EWS. This gives an indication of the differences in accuracy of the models in predicting clinical deterioration.
The findings within this review are subject to some limitations. First, the literature search, assessment of eligibility of full-text articles, inclusion in the review, and extraction of study data were carried out by only 1 author. Second, only the findings from published studies were included in this scoping review, which may affect the results due to publication bias. While studies from a variety of settings were included, the generalizability of our findings may be limited due to the heterogeneity of patient populations, clinical practices, and study methodologies. Sampling procedures and frequencies varied across studies from single to multiple observations of patient vital signs, and clinical outcome definitions were based on different criteria or aggregate-weighted EWS. Finally, due to this variation in ML methods, prediction windows, and outcome reporting, a meta-analysis was not feasible.
Our findings suggest that ML-based EWS models incorporating easily accessible vital sign measurements are effective in predicting physiological deterioration in patients. Improved prediction performance was also observed with these models when compared to traditional aggregate-weighted risk stratification tools. The clinical impact of these ML-based EWS could be significant for clinical staff and patients through decreased false alerts and earlier detection of warning signs that enable timely intervention. However, these models require further development, and the prospective research needed to establish their actual clinical utility does not yet exist.
Search terms.
Description of ML methods and relevant terms.
Comparison between performance of ML-based EWS and aggregate-weighted EWS.
area under the receiver operating characteristic
alert, verbal, pain, unresponsive
blood pressure
emergency department
early warning system
heart rate
intensive care unit
Modified Early Warning Score
Medical Information Mart for Intensive Care
machine learning
National Early Warning Score
Preferred Reporting Items for Systematic Reviews and Meta-Analyses
quick Sequential Organ Failure Assessment
respiratory rate
Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis
SM contributed to conceptualization, data collection, data analysis, and manuscript writing. JP contributed to conceptualization, manuscript writing, and manuscript review. WN and SD contributed equally to manuscript writing and review. PD contributed to manuscript writing and review. MM and NB contributed to manuscript review.
PJD is a member of a research group with a policy of not accepting honorariums or other payments from industry for their own personal financial gain. They do accept honorariums/payments from industry to support research endeavours and costs to participate in meetings.
Based on study questions PJD has originated and grants he has written, he has received grants from Abbott Diagnostics, AstraZeneca, Bayer, Boehringer Ingelheim, Bristol-Myers Squibb, Covidien, Octapharma, Philips Healthcare, Roche Diagnostics, Siemens, and Stryker.
PJD has participated in advisory board meetings for GlaxoSmithKline, Boehringer Ingelheim, Bayer, and Quidel Canada. He also attended an expert panel meeting with AstraZeneca and Boehringer Ingelheim.
The other authors declare no conflicts of interest.