This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Predicting early respiratory failure due to COVID-19 can help triage patients to higher levels of care, allocate scarce resources, and reduce morbidity and mortality by appropriately monitoring and treating the patients at greatest risk for deterioration. Given the complexity of COVID-19, machine learning approaches may support clinical decision making for patients with this disease.
Our objective was to derive a machine learning model that predicts respiratory failure within 48 hours of admission based on data from the emergency department.
Data were collected from patients with COVID-19 who were admitted to Northwell Health acute care hospitals and were discharged, died, or spent a minimum of 48 hours in the hospital between March 1 and May 11, 2020. Of 11,525 patients, 933 (8.1%) were placed on invasive mechanical ventilation within 48 hours of admission. Variables used by the models included clinical and laboratory data commonly collected in the emergency department. We trained and validated three predictive models (two based on XGBoost and one that used logistic regression) using cross-hospital validation. We compared model performance among all three models as well as an established early warning score (Modified Early Warning Score) using receiver operating characteristic curves, precision-recall curves, and other metrics.
The XGBoost model had the highest mean accuracy (0.919; area under the curve=0.77), outperforming the other two models as well as the Modified Early Warning Score. Important predictor variables included the type of oxygen delivery used in the emergency department, patient age, Emergency Severity Index level, respiratory rate, serum lactate, and demographic characteristics.
The XGBoost model had high predictive accuracy, outperforming the other models and the Modified Early Warning Score. The clinical plausibility and predictive ability of XGBoost suggest that the model could be used to predict 48-hour respiratory failure in admitted patients with COVID-19.
On March 11, 2020, COVID-19, the disease caused by SARS-CoV-2 infection, was declared a pandemic by the World Health Organization [
Respiratory failure is the leading cause of death among patients with COVID-19, with up to one-third of patients admitted with COVID-19 requiring invasive mechanical ventilation (IMV) [
Methods of identifying patients at high risk for or in the early stages of clinical deterioration have been actively researched for decades. The field has generated many severity-of-illness calculators, early warning scores, and, more recently, predictive analytic tools that use advanced machine learning and artificial intelligence [
This retrospective observational cohort drew data from 13 acute care hospitals of Northwell Health, the largest health care system in New York State. Data were extracted from the electronic health record (EHR) Sunrise Clinical Manager (Allscripts). EHRs were screened for adult patients (aged ≥21 years) who received a positive test result for SARS-CoV-2 based on a nasopharyngeal sample tested using polymerase chain reaction assays. Included patients were hospitalized and were discharged, died, or spent a minimum of 48 hours in the hospital between March 1, 2020, and May 11, 2020. For patients who had multiple qualifying hospital admissions, only the first hospitalization was included. Patients who were transferred between hospitals within the health system were treated as one hospital encounter. A total of 11,919 patients were identified. Patients were excluded if they were placed on mechanical ventilation prior to inpatient admission. A total of 11,525 patients remained for analysis. The Institutional Review Board of Northwell Health approved the study protocol and waived the requirement for informed consent.
Data collected from EHRs included patient demographics, comorbidities, home medications, initial vitals and laboratory values, treatments (eg, oxygen therapy, mechanical ventilation), and clinical outcomes (eg, length of stay, discharge, mortality). Vitals and laboratory testing were restricted to those obtained while the patient was in the ED.
The target outcome variable was defined as intubation and mechanical ventilation within 48 hours of admission. Admission time was taken from the EHR, and the intubation event was defined as the first recorded instance of mechanical ventilation.
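As a minimal illustration of this labeling step, the sketch below assumes a hypothetical encounter-level DataFrame with datetime columns admission_time and first_vent_time (missing if the patient was never ventilated); it is not the authors' extraction code.

```python
# Minimal sketch of the 48-hour outcome label, assuming hypothetical columns
# "admission_time" and "first_vent_time" (NaT if never ventilated). Patients
# ventilated before inpatient admission are assumed to be excluded upstream.
import pandas as pd

def label_respiratory_failure(df: pd.DataFrame, horizon_hours: int = 48) -> pd.Series:
    """Return 1 if mechanical ventilation was first recorded within
    `horizon_hours` of inpatient admission, else 0."""
    delta = df["first_vent_time"] - df["admission_time"]
    within_horizon = delta <= pd.Timedelta(hours=horizon_hours)  # NaT compares as False
    return (df["first_vent_time"].notna() & within_horizon).astype(int)
```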
We evaluated three predictive models: XGBoost, XGBoost + SMOTEENN (combined oversampling using SMOTE and undersampling using edited nearest neighbors) [
The XGBoost + SMOTEENN method involves combined oversampling using SMOTE and undersampling using edited nearest neighbors on the training set before training an XGBoost model [
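As an illustration of this approach, the following sketch resamples the training split with SMOTEENN from the imbalanced-learn package and then fits an XGBoost classifier; the parameter values are placeholders rather than the study's tuned configuration.

```python
# Hedged sketch of the XGBoost + SMOTEENN approach: resample only the training
# split, then fit a standard XGBoost classifier on the resampled data.
from imblearn.combine import SMOTEENN
from xgboost import XGBClassifier

def fit_xgb_smoteenn(X_train, y_train, random_state=0):
    # SMOTE oversamples the minority (intubated) class; edited nearest neighbors
    # then removes ambiguous samples near the class boundary.
    X_res, y_res = SMOTEENN(random_state=random_state).fit_resample(X_train, y_train)
    model = XGBClassifier(n_estimators=200, max_depth=4)  # illustrative settings
    model.fit(X_res, y_res)
    return model
```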
For every learning framework, we performed cross-hospital external validation (ie, for each fold, one hospital was held out as the testing set and the remaining hospitals formed the training set). Only hospitals with >1000 patients with COVID-19 in the data set were used as testing sets, and a random sample of 1000 patients from the held-out hospital served as the testing set for each fold. Grid search was used to tune the hyperparameters of the respective models. The XGBoost model was tuned over the min_child_weight, gamma, subsample, colsample_bytree, and max_depth parameters, with search ranges of 1-20, 0.5-20, 0.2-1.0, 0.2-1.0, and 2-40, respectively.
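A hedged sketch of this leave-one-hospital-out scheme is shown below; the hospital and label column names, the specific grid points within the reported ranges, the scoring metric, and the inner cross-validation settings are assumptions for illustration only.

```python
# Sketch of cross-hospital validation with grid search, assuming hypothetical
# columns "hospital" and "intubated_48h"; grid points fall inside the reported
# ranges but are illustrative, not the authors' exact grid.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "min_child_weight": [1, 10, 20],
    "gamma": [0.5, 5, 20],
    "subsample": [0.2, 0.6, 1.0],
    "colsample_bytree": [0.2, 0.6, 1.0],
    "max_depth": [2, 10, 40],
}

def cross_hospital_validation(df, feature_cols, label_col="intubated_48h",
                              hospital_col="hospital", min_patients=1000,
                              sample_size=1000, random_state=0):
    results = {}
    counts = df[hospital_col].value_counts()
    for hospital in counts[counts > min_patients].index:
        # Hold out a random sample of 1000 patients from one hospital as the test set.
        test = df[df[hospital_col] == hospital].sample(sample_size, random_state=random_state)
        train = df[df[hospital_col] != hospital]
        search = GridSearchCV(XGBClassifier(), param_grid, scoring="roc_auc", cv=3)
        search.fit(train[feature_cols], train[label_col])
        results[hospital] = search.score(test[feature_cols], test[label_col])
    return results
```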
When data were missing, values were imputed using weighted k-nearest neighbors [
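For example, distance-weighted k-nearest neighbors imputation is available in scikit-learn; the snippet below is a toy illustration, and the number of neighbors is an assumption rather than the study's setting.

```python
# Toy illustration of distance-weighted k-nearest neighbors imputation.
import numpy as np
from sklearn.impute import KNNImputer

# Small numeric feature matrix (eg, respiratory rate, heart rate) with missing entries.
X = np.array([[21.0, 94.0], [np.nan, 110.0], [24.0, np.nan], [30.0, 120.0]])

imputer = KNNImputer(n_neighbors=2, weights="distance")  # closer neighbors weigh more
X_imputed = imputer.fit_transform(X)
```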
Calibration (reliability) curves were plotted by dividing each hospital-fold testing set into 10 bins with an increasing fraction of patients who had respiratory failure. For each bin, the fraction of positives (patients who had respiratory failure) was plotted against the mean predicted probability from the corresponding model. The Brier score was calculated for each external hospital fold, and the mean and standard deviation across folds are shown in the legend of the calibration curve. For further explanation of these measures and how they were calculated, see
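One way to reproduce such a reliability curve and Brier score for a single hospital fold is sketched below using scikit-learn's calibration_curve and brier_score_loss; this is an illustrative alternative, not the authors' exact plotting code.

```python
# Sketch of a reliability curve and Brier score for one hospital fold;
# y_true are observed outcomes and y_prob are model-predicted probabilities.
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

def plot_reliability_curve(y_true, y_prob, label="XGBoost"):
    frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
    brier = brier_score_loss(y_true, y_prob)
    plt.plot(mean_pred, frac_pos, marker="o", label=f"{label} (Brier={brier:.3f})")
    plt.plot([0, 1], [0, 1], linestyle="--", label="Perfectly calibrated")
    plt.xlabel("Mean predicted probability")
    plt.ylabel("Fraction of positives")
    plt.legend()
```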
Python 2.6 (Python Software Foundation) was used to implement our machine learning framework. The respective prediction models of XGBoost and logistic regression from the scikit-learn application programming interface (API) in Python were used [
The Modified Early Warning Score (MEWS) was computed from patient vital signs (
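For orientation, the sketch below codes one commonly cited MEWS formulation (Subbe et al, 2001); the thresholds shown come from that published scheme and may differ from the exact cut points used in this study.

```python
# Illustrative MEWS calculation using one commonly cited formulation
# (Subbe et al, 2001); thresholds are that published scheme, not necessarily
# the cut points used in this study.
def mews(systolic_bp, heart_rate, resp_rate, temp_c, avpu="alert"):
    score = 0
    # Systolic blood pressure (mm Hg)
    if systolic_bp <= 70: score += 3
    elif systolic_bp <= 80: score += 2
    elif systolic_bp <= 100: score += 1
    elif systolic_bp >= 200: score += 2
    # Heart rate (beats/minute)
    if heart_rate < 40: score += 2
    elif heart_rate <= 50: score += 1
    elif heart_rate <= 100: score += 0
    elif heart_rate <= 110: score += 1
    elif heart_rate <= 129: score += 2
    else: score += 3
    # Respiratory rate (breaths/minute)
    if resp_rate < 9: score += 2
    elif resp_rate <= 14: score += 0
    elif resp_rate <= 20: score += 1
    elif resp_rate <= 29: score += 2
    else: score += 3
    # Temperature (°C)
    if temp_c < 35 or temp_c >= 38.5: score += 2
    # Level of consciousness (AVPU scale)
    score += {"alert": 0, "voice": 1, "pain": 2, "unresponsive": 3}[avpu]
    return score
```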
During the study period, we identified 11,525 patients admitted from the ED with a diagnosis of COVID-19. Of these, 933 (8.1%) were placed on IMV within 48 hours of admission. Baseline characteristics (demographics, baseline vital signs, and laboratory measurements) for all patients are shown in
Demographic, clinical, and laboratory data from hospitalized patients.
Variables | Not intubated (n=10,592) | Intubated (n=933) | Missing (%)
Age (years), median (IQR) | 65.00 (54.00-77.00) | 66.00 (56.00-75.00) | 0
Female, n (%) | 4530 (42.8) | 327 (35.0) | 0
Primary language, English, n (%) | 8498 (80.2) | 746 (80.0) | 0
Race, n (%) | | | 0
    Black | 2199 (20.8) | 236 (25.3) | N/Aᵃ
    Asian | 889 (8.4) | 77 (8.3) | N/A
    White | 4148 (39.2) | 310 (33.2) | N/A
    Declined | 71 (0.7) | 8 (0.9) | N/A
    Other | 2884 (27.2) | 268 (28.7) | N/A
    Unknown | 401 (3.8) | 34 (3.6) | N/A
Ethnicity, n (%) | | | 0.1
    Hispanic or Latino | 2238 (21.1) | 202 (21.7) | N/A
    Not Hispanic or Latino | 7685 (72.6) | 648 (69.5) | N/A
    Declined | 43 (0.4) | 1 (0.1) | N/A
    Unknown | 618 (5.8) | 82 (8.8) | N/A
Vital signs | | |
    Systolic blood pressure (mm Hg), median (IQR) | 134.00 (118.00-150.00) | 134.00 (115.00-151.75) | 0.5
    Diastolic blood pressure (mm Hg), median (IQR) | 79.00 (70.50-87.00) | 77.00 (69.00-86.00) | 0.6
    Heart rate (beats/minute), median (IQR) | 94.00 (85.00-102.00) | 97.00 (88.50-112.00) | 0.4
    Respiratory rate (breaths/minute), median (IQR) | 21.00 (18.00-25.00) | 24.00 (20.00-32.00) | 0.8
    Temperature (°C), mean (SD) | 37.77 (0.97) | 37.86 (1.11) | 1.6
    Oxygen saturation (%), median (IQR) | 97.00 (95.00-98.00) | 96.00 (93.00-98.00) | 1.7
    BMI (kg/m²), mean (SD) | 29.12 (7.79) | 30.39 (9.21) | 47.1
Laboratory values | | |
    White blood cell count (× 10⁹/L), median (IQR) | 7.34 (5.45-9.92) | 8.25 (6.20-11.50) | 9
    Absolute neutrophil count (× 10⁹/L), median (IQR) | 5.68 (3.95-8.11) | 6.84 (4.76-9.62) | 11.5
    Absolute lymphocyte count (× 10⁹/L), median (IQR) | 0.90 (0.63-1.27) | 0.80 (0.56-1.13) | 11.5
    Hemoglobin (g/dL), mean (SD) | 12.93 (2.12) | 13.14 (2.11) | 9
    Platelets (K/μL), mean (SD) | 230.17 (101.93) | 217.19 (87.45) | 9.1
    Sodium (mmol/L), mean (SD) | 136.64 (6.21) | 135.38 (5.74) | 11.9
    Carbon dioxide (mmol/L), mean (SD) | 23.61 (3.79) | 22.67 (4.68) | 11.9
    Creatinine (mg/dL), median (IQR) | 1.03 (0.80-1.46) | 1.20 (0.92-1.75) | 12
    Bilirubin (mg/dL), median (IQR) | 0.50 (0.40-0.70) | 0.60 (0.40-0.80) | 12.5
    Ferritin (ng/mL), mean (SD) | 1283.50 (2732.65) | 1731.05 (2631.38) | 73.2
    Procalcitonin (ng/mL), mean (SD) | 1.22 (10.96) | 2.12 (8.16) | 66.3
    D-dimer (ng/mL), mean (SD) | 1871.84 (5306.42) | 2659.09 (6798.96) | 65.4
    Lactate dehydrogenase (U/L), mean (SD) | 455.61 (213.04) | 611.05 (272.16) | 71
    pH (arterial), mean (SD) | 7.42 (0.09) | 7.39 (0.11) | 96.7
    Partial pressure of oxygen (arterial, mm Hg), mean (SD) | 99.90 (65.17) | 85.26 (61.42) | 94.8
    Partial pressure of carbon dioxide (arterial, mm Hg), mean (SD) | 34.66 (9.38) | 35.38 (11.45) | 94.7
Comorbidities | | |
    Hypertension, n (%) | 1183 (11.2) | 115 (12.3) | 0
    Diabetes, n (%) | 685 (6.5) | 77 (8.3) | 0
    Coronary artery disease, n (%) | 148 (1.4) | 15 (1.6) | 0
    Asthma/chronic obstructive pulmonary disease, n (%) | 242 (2.3) | 20 (2.1) | 0
    Chronic kidney disease, n (%) | 99 (0.9) | 8 (0.9) | 0
    HIV, n (%) | 26 (0.2) | 1 (0.1) | 0
ᵃN/A: not applicable.
Based on XGBoost, the mean area under the curve (AUC) of the ROC (AUCROC) curve was 0.77 (SD 0.05) and the mean AUC of the PR curve (AUCPR) was 0.26 (SD 0.04;
Based on the XGBoost + SMOTEENN model, the mean AUCs of the ROC and PR curves were 0.76 (SD 0.03) and 0.24 (SD 0.06), respectively (
The XGBoost model for predicting respiratory failure within 48 hours. (A) ROC curve and (B) PR curve based on a cross-hospital validation performed by leaving one hospital out as a testing set and using the rest as the training set. Only hospitals with >1000 patients with COVID-19 were selected for testing sets. The mean ROC and PR curves are shown in dark blue and their corresponding standard deviations are shown in gray. The MEWS metrics are shown in light yellow. (C) The 10 variables with the highest relative importance, measured by the amount each reduced the Gini coefficient, for the largest hospital testing set. (D) Confusion matrix of predicted versus actual outcomes for the largest hospital testing set. AUC: area under the curve of ROC; AUCPR: area under the curve of the precision-recall curve; ED: emergency department; LIJ: Long Island Jewish; MEWS: Modified Early Warning Score; PR: precision-recall; ROC: receiver operating characteristic.
Mean area under the curve of the receiver operating characteristic curve, area under the curve of the precision-recall curve, accuracies, precisions, recalls, specificities, geometric means, and Fβ-score (β=4) for models investigated.
Measure | XGBoost, mean (SD) | XGBoost + SMOTEENN, mean (SD) | Logistic regression, mean (SD) | Modified Early Warning Score |
Area under the curve of the receiver operating characteristic curve | 0.77 (0.05) | 0.76 (0.03) | 0.70 (0.05) | 0.61 |
Area under the curve of the precision-recall curve | 0.26 (0.04) | 0.24 (0.06) | 0.18 (0.06) | 0.12 |
Accuracy | 0.919 (0.028) | 0.893 (0.016) | 0.915 (0.027) | 0.913 |
Precision | 0.521 (0.329) | 0.303 (0.089) | 0.322 (0.375) | 0.165 |
Recall | 0.051 (0.030) | 0.228 (0.095) | 0.009 (0.013) | 0.017 |
Specificity | 0.994 (0.005) | 0.955 (0.005) | 0.998 (0.002) | 0.992 |
Geometric mean | 0.337 (0.042) | 0.506 (0.063) | 0.285 (0.051) | 0.296 |
Fβ-score | 0.054 (0.029) | 0.226 (0.088) | 0.010 (0.014) | 0.018
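For reference, the Fβ-score reported above follows its standard definition, which we assume matches the study's appendix; with β=4, recall is weighted much more heavily than precision:

$$ F_{\beta} = \frac{(1+\beta^{2})\cdot \text{precision}\cdot \text{recall}}{\beta^{2}\cdot \text{precision} + \text{recall}}, \qquad \beta = 4. $$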
The XGBoost + SMOTEENN model for predicting respiratory failure within 48 hours. (A) ROC curve and (B) PR curve based on a cross-hospital validation performed by leaving one hospital out as a testing set and using the remaining hospitals for the training set. Only hospitals with >1000 patients with COVID-19 were selected for testing sets. The mean ROC and PR curves are shown in dark blue and their corresponding standard deviations are shown in gray. The MEWS metrics are shown in light yellow. (C) The 10 variables with the highest relative importance, measured by the amount each variable reduced the Gini coefficient. (D) Mean confusion matrix of predicted versus actual outcomes. AUC: area under the curve of ROC; AUCPR: area under the curve of the precision-recall curve; ED: emergency department; LIJ: Long Island Jewish; MEWS: Modified Early Warning Score; PR: precision-recall; ROC: receiver operating characteristic.
We also examined the performance of a logistic regression model. The mean AUCs of the ROC and PR curves were 0.70 (SD 0.05) and 0.18 (SD 0.06), respectively. Mean accuracy, precision, recall, specificity, geometric mean, and Fβ-score were 0.915 (SD 0.027), 0.322 (SD 0.375), 0.009 (SD 0.013), 0.998 (SD 0.002), 0.285 (SD 0.051), and 0.010 (SD 0.014), respectively (
The calibration curves showed that all three models were well calibrated among all hospital folds, although all three deviated from perfect calibration as the fraction of positives increased (
The logistic regression model for predicting respiratory failure within 48 hours. (A) ROC curve and (B) PR curve based on a cross-hospital validation performed by leaving one hospital out as a testing set and using the rest for the training set. Only hospitals with >1000 patients with COVID-19 were selected for testing sets. The mean ROC and PR curves are shown in dark blue and their corresponding standard deviations are shown in gray. The MEWS metrics are shown in light yellow. (C) The 10 variables with the highest relative importance, measured by the absolute value of the regression coefficient. (D) Mean confusion matrix of predicted versus actual outcomes. AUC: area under the curve of ROC; AUCPR: area under the curve of the precision-recall curve; LIJ: Long Island Jewish; MEWS: Modified Early Warning Score; PR: precision-recall; ROC: receiver operating characteristic.
We presented three models (two of which were based on XGBoost) for predicting early respiratory failure in patients given a diagnosis of COVID-19 and admitted to the hospital from the ED. One model was tilted toward precision and specificity (XGBoost) and the other was tilted toward recall (XGBoost + SMOTEENN). These models are based on baseline characteristics, ED vital signs, and laboratory measurements. Using an automated tool to estimate the probability of respiratory failure could identify at-risk patients for earlier interventions (eg, closer monitoring, critical care consultation, earlier discussions about goals of care) and improve patient outcomes.
We evaluated three machine learning models: XGBoost, XGBoost + SMOTEENN, and logistic regression [
We also constructed an XGBoost + SMOTEENN model. SMOTEENN was used to improve the sensitivity of our predictions, as our data set was imbalanced (ie, only ~8% of our COVID-19 cohort were intubated), while minimizing the loss of accuracy and calibration. Compared with XGBoost, the XGBoost + SMOTEENN model had lower accuracy and precision, but greater recall (or sensitivity; 0.228 [SD 0.095];
We also examined the performance of a logistic regression model to determine whether a compact, linear model could accurately predict patient risk (
Using the most important variables for our models, we identified clinically relevant measures that can best inform clinical decision making (
Variable importance metrics revealed that the linear logistic regression model relies primarily on laboratory variables, whereas the nonlinear XGBoost-based models prioritize clinical and demographic variables that better capture hospital-specific behavior (eg, oxygen delivery types prior to intubation) and increase the robustness of the model. However, we still need to validate whether providing these variables along with the probability of respiratory failure improves the identification of at-risk patients. Further prospective studies and randomized clinical trials are needed for this validation.
When examining the calibration of the models (
Calibration plots (reliability curves) of the XGBoost, XGBoost + SMOTEENN, and logistic regression models for respiratory failure within 48 hours. Calibration is based on the predicted probability (obtained using predict_proba in Python). To create the plots, sklearn.calibration.CalibratedClassifierCV (in Python) was used, placing the fraction of positives (respiratory failures) and the mean predicted values into 10 bins with an increasing fraction of positives for each hospital fold. The mean Brier score (SD) across all hospitals tested for the corresponding model is shown in parentheses in the figure legend.
Our study has several limitations. We extracted intubation timing from our EHR, which may contain minor inaccuracies. Although a consistent temporal inaccuracy could bias the intubation rate toward underestimation or overestimation, we believe these small inaccuracies are averaged out across our large number of cases. Another limitation is that we relied on data from a multicenter, single health system for both implementation and validation. Thus, we were unable to externally validate the models in other health systems and hospitals with different protocols, which might affect model performance. In addition, because the study is retrospective, we can only suggest associations and correlations rather than identify the main contributors that lead to intubation and mechanical ventilation. Furthermore, missing numerical variables were imputed with weighted k-nearest neighbors; conclusions drawn from these variables therefore assume that missing values are distributed similarly to the observed data, and if this assumption does not hold, the order of variable importance might change. Additionally, some clinical variables included in the model may appear to be obvious correlates of the clinical decision to intubate within 48 hours (eg, having a nonrebreather mask as the most invasive form of oxygen delivery in the ED). However, the association of these variables with the outcome is not deterministic: only 453 of 2633 patients on nonrebreather oxygen in the ED were intubated within 48 hours. Moreover, because these variables are available to clinicians and are part of their decision making, we included them in our model. Finally, we used supervised learning on a homogeneous database. Although we used cross-hospital validation and retrospectively validated our learning method, external generalizability of these learning methods to other health systems requires validation in prospective studies and randomized trials. Such high-quality evidence could provide more clues about clinical and economic impacts, as well as measures to improve them.
COVID-19 has evolved into an extremely challenging clinical and public health emergency worldwide, especially in the New York City metropolitan area. As public health measures attempt to mitigate this disaster by slowing the spread and alleviating the heavy burden placed on health care systems, clinicians must make important decisions quickly and hospital administrators must manage resources and personnel. Furthermore, as predicted by many models [
Definitions of accuracy, precision, recall, specificity, geometric means, and Fβ-score.
Modified Early Warning Score calculation based on vital sign measurements.
AUC: area under the curve
AUCPR: area under the curve of the precision-recall curve
AUCROC: area under the curve of the receiver operating characteristic curve
ED: emergency department
EHR: electronic health record
ESI: Emergency Severity Index
ICU: intensive care unit
IMV: invasive mechanical ventilation
MEWS: Modified Early Warning Score
PR: precision-recall
ROC: receiver operating characteristic
SMOTE: synthetic minority oversampling technique
SMOTEENN: oversampling using SMOTE and cleaning using edited nearest neighbors
This work was supported by R24AG064191 from the National Institute on Aging, R01LM012836 from the National Library of Medicine, and R35GM118337 from the National Institute for General Medical Sciences (all National Institutes of Health).
Members of the Northwell COVID-19 Research Consortium include: Matthew Barish, Stuart L Cohen, Kevin Coppa, Karina W Davidson, Shubham Debnath, Lawrence Lau, Todd J Levy, Alexander Makhnevich, Marc D Paradis, and Viktor Tóth.
We acknowledge and honor all of our Northwell team members who consistently put themselves in harm’s way during the COVID-19 pandemic. We dedicate this article to them, as well as all patients, as their vital contributions to knowledge about COVID-19 made it possible.
SB, DPB, and TPZ conceptualized and designed the study. SB, JSH, and TPZ had full access to all data in the study and are responsible for the integrity of the data. SB and JSH performed data extraction and cleaning. MB, PW, TM, and JSH contributed to many discussions during manuscript development. SB, DPB, and TPZ contributed to drafts of the manuscript. SB trained and validated the models. SB and TPZ designed and created the figures. DPB and TPZ critically reviewed the paper, and PW and TPZ obtained funding. The Northwell COVID-19 Research Consortium prioritized this manuscript, organized meetings between contributing authors, and provided support in finalizing the manuscript for submission. All named authors read and approved the final submitted manuscript.
None declared.