predCOVID-19: A Systematic Study of Clinical Predictive Models for Coronavirus Disease 2019

Coronavirus Disease 2019 (COVID-19) is a rapidly emerging respiratory disease caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Due to the rapid human-to-human transmission of SARS-CoV-2, many healthcare systems are at risk of exceeding their healthcare capacities, in particular in terms of SARS-CoV-2 tests, hospital and intensive care unit (ICU) beds and mechanical ventilators. Predictive algorithms could potentially ease the strain on healthcare systems by identifying those who are most likely to receive a positive SARS-CoV-2 test, be hospitalised or admitted to the ICU. Here, we study clinical predictive models that estimate, using machine learning and based on routinely collected clinical data, which patients are likely to receive a positive SARS-CoV-2 test, require hospitalisation or intensive care. To evaluate the predictive performance of our models, we perform a retrospective evaluation on clinical and blood analysis data from a cohort of 5644 patients. Our experimental results indicate that our predictive models identify (i) patients that test positive for SARS-CoV-2 a priori at a sensitivity of 75% (95% CI: 67%, 81%) and a specificity of 49% (95% CI: 46%, 51%), (ii) SARS-CoV-2 positive patients that require hospitalisation with 0.92 AUC (95% CI: 0.81, 0.98), and (iii) SARS-CoV-2 positive patients that require critical care with 0.98 AUC (95% CI: 0.95, 1.00). In addition, we determine which clinical features are predictive to what degree for each of the aforementioned clinical tasks. Our results indicate that predictive models trained on routinely collected clinical data could be used to predict clinical pathways for COVID-19, and therefore help inform care and prioritise resources.


I. INTRODUCTION
Coronavirus Disease 2019 (COVID-19) was first discovered in December 2019 in China, and has since rapidly spread to over 200 countries [1]. The COVID-19 pandemic challenges healthcare systems worldwide as a high peak capacity for testing and hospitalisation is necessary to diagnose and treat affected patients, particularly if the spread of SARS-CoV-2 is not mitigated. To avoid exceeding the available healthcare capacities, many countries have adopted social distancing policies, imposed travel restrictions, and postponed non-essential care and surgeries in order to reduce peak demand on their healthcare systems [2], [3], [4].
Fig. 1: We study the use of predictive models (light purple) to estimate whether patients are likely (i) to be SARS-CoV-2 positive, and whether SARS-CoV-2 positive patients are likely (ii) to be admitted to the hospital and (iii) to require critical care based on clinical, demographic and blood analysis data. Accurate clinical predictive models stratify patients according to individual risk, and, in this manner, help prioritise healthcare resources, such as testing, hospital and critical care capacity.
The adoption of clinical predictive models that accurately predict who is likely to require testing, hospitalisation and intensive care from routinely collected clinical data could potentially further reduce peak demand by ensuring resources are prioritised to those individuals with the highest risk (Figure 1). For example, a clinical predictive model that accurately identifies patients that are likely to test positive for SARS-CoV-2 a priori could help prioritise limited SARS-CoV-2 testing capacity. However, developing accurate clinical prediction models for SARS-CoV-2 is difficult as relationships between clinical data, hospitalisation, and intensive care unit (ICU) admission have not yet been established conclusively due to the recent emergence of SARS-CoV-2.
In this systematic study, we develop and evaluate clinical predictive models that use routinely collected clinical data to identify (i) patients that are likely to receive a positive SARS-CoV-2 test, (ii) SARS-CoV-2 positive patients that are likely to require hospitalisation, and (iii) SARS-CoV-2 positive patients that are likely to require intensive care. Using the developed predictive models, we additionally determine which clinical features are most predictive for each of the aforementioned clinical tasks. Our results indicate that predictive models could be used to predict clinical pathways for COVID-19 patients. Such predictive models may be of significant utility for healthcare systems as preserving healthcare capacity has been linked to successfully combating SARS-CoV-2 [5], [6]. This work contains the following contributions:
• We develop and systematically study predictive models for estimating the likelihoods of (i) a positive SARS-CoV-2 test in patients presenting at hospitals, (ii) hospital admission in SARS-CoV-2 positive patients, and (iii) critical care admission in SARS-CoV-2 positive patients.
• We validate the performance of the developed clinical predictive models in a retrospective evaluation using real-world data from a cohort of 5644 patients.
• We determine and quantify the predictive power of routinely collected clinical, demographic, and blood analysis data for the aforementioned clinical prediction tasks.

II. RELATED WORK
A substantial body of work is dedicated to the study, validation and implementation of predictive models for clinical tasks. Clinical predictive models have, for example, been used to predict risk of septic shock [7], [8], risk of heart failure [9], readmission following heart failure [10], [11], [12], false alarms in critical care [13], risk scores [14], outcomes [15] and mortality in pneumonia [16], [17], and mortality risk in critical care [18], [19], [20]. Predicting clinical outcomes for individual patients is difficult because a large number of confounding factors may influence patient outcomes, and collecting and accounting for these factors in an unbiased way remains an open challenge in clinical practice [21]. Systematic studies, such as the one presented in this work, enable medical practitioners to better understand, assess and potentially overcome these issues by systematically evaluating generalisation ability, expected predictive performance, and influential predictors of various clinical predictive models. Beyond the need for systematic evaluation, missingness [22], [23], [24], [25], noise [26], [27], multivariate input data [13], [28], [29], [30], and the need for interpretability [31], [32], [33], [34] have been highlighted as particularly important considerations in healthcare settings. In this work, we build on recent methodological advances to develop and systematically study clinical predictive models that may aid in prioritising healthcare resources [35] for COVID-19, and thereby help prevent a potential overextension of healthcare system capacity.

A. Clinical Predictive Models for COVID-19
Several clinical predictive models have recently been proposed for COVID-19, for example, for predicting potential COVID-19 diagnoses using data from emergency care admission exams [36] and chest imaging data [37], [38], [39], [40], [41], [42], for predicting COVID-19 related mortality from clinical risk factors [43], [44], and for predicting which patients will develop acute respiratory distress syndrome (ARDS) from patients' clinical characteristics [45]. [46] presented a review of epidemiology and clinical features associated with COVID-19, and [47] a critical review that assessed limitations and risk of bias in diagnostic and prognostic models for COVID-19. In addition, [48] performed a cohort study for clinical and laboratory predictors of COVID-19 related in-hospital mortality that identified baseline neutrophil count, age and several other clinical features as top predictors of mortality. Beyond prediction, [49] have argued for the responsible use of data in tackling the challenges posed by SARS-CoV-2.
Fig. 2: The presented multistage machine-learning pipeline consists of preprocessing (light purple) the input data x, developing multiple candidate models using the given dataset (orange), selecting the best candidate model for evaluation (blue), and evaluating the selected best model's outputs $\hat{y}$.
Owing to the recent emergence of SARS-CoV-2, there currently exists, to the best of our knowledge, no prior systematic study on clinical predictive models that predict the likelihood of a positive SARS-CoV-2 test, hospital admission and intensive care unit admission from clinical, demographic and blood analysis data and that accounts for the missingness that is characteristic of the clinical setting. We additionally assess the influence of various clinical, demographic, and blood analysis measurements on the predictions of the developed clinical predictive models.

III. METHODS AND MATERIALS
1) Problem Setting: In this setting, we are given 106 routine clinical, laboratory and demographic measurements, or features, $x_i \in x$ for presenting patients. Features may be discrete or continuous, and some features may be missing as not all tests are necessarily performed on all patients. The clinical predictive tasks consist of utilising the routine clinical features $x_i$ to predict, for a newly presenting patient, (i) the likelihood $\hat{y}_{\text{SARS-CoV-2}}$ of receiving a positive SARS-CoV-2 test result, (ii) the likelihood $\hat{y}_{\text{admission}}$ of requiring hospital admission, and (iii) the likelihood $\hat{y}_{\text{ICU}}$ of requiring intensive care. In addition, we are given a development dataset consisting of $N$ patients, their corresponding observed routine clinical features $x_i$, SARS-CoV-2 test results $y_{\text{SARS-CoV-2}} \in \{0, 1\}$, hospital admissions $y_{\text{admission}} \in \{0, 1\}$, and ICU admissions $y_{\text{ICU}} \in \{0, 1\}$, where 1 indicates the presence of an outcome. Using this development dataset, our goal is to derive clinical predictive models $\hat{f}_{\text{SARS-CoV-2}}$, $\hat{f}_{\text{admission}}$ and $\hat{f}_{\text{ICU}}$ for the aforementioned tasks in order to inform care and help prioritise scarce healthcare resources.
2) Methodology: To derive the clinical predictive models $\hat{f}_{\text{SARS-CoV-2}}$, $\hat{f}_{\text{admission}}$ and $\hat{f}_{\text{ICU}}$ from the given development dataset, we set up a systematic model development, validation, and evaluation pipeline (Fig. 2). To evaluate the generalisation ability of the developed clinical predictive models and to rule out overfitting to patients in the evaluation cohort, the development data is initially split into independent and stratified training, validation, and test folds without any patient overlap. Concretely, the multistage pipeline consists of (i) preprocessing, (ii) model development, (iii) model selection, and (iv) model evaluation stages. For preprocessing and model development, only the training fold is used, and only the validation and test folds of the development data are used for model selection and model evaluation, respectively. We outline the pipeline stages in detail in the following paragraphs.
3) Preprocessing: In the preprocessing stage, we first drop all input features that are missing for more than 99.8% of all training set patients to ensure we have a minimal amount of data for each feature. This removes a total of 9 features from the original 106 routine clinical, laboratory and demographic features. We then transform all discrete features for each patient into their one-hot encoded representation, with one out of p indicator variables set to 1 to indicate the discrete value for this patient and all others set to 0, where p is the number of unique values of the discrete feature. We define as discrete those features that have fewer than 6 unique values across all patients in the training fold. For discrete features, missing values are counted as a separate category in the one-hot representation. Next, we standardise all continuous features to have zero mean and unit standard deviation across the training fold data. Lastly, we perform multiple imputation by chained equations (MICE) to impute all missing values of every continuous feature from the respective other features in an iterative fashion [50]. We additionally add a missing indicator that is 1 if the feature was imputed by MICE and 0 if it was originally present in order to preserve missingness information in the data after imputation. After the preprocessing stage, continuous input features are standardised and fully imputed, and discrete input features are one-hot encoded. All preprocessing operations are derived only from the training fold, and naïvely applied without adjustment to validation and test folds in order to avoid information leakage.
4) Model Development: In the model development stage, we train candidate clinical predictive models $\hat{f}_{\text{SARS-CoV-2}}$, $\hat{f}_{\text{admission}}$ and $\hat{f}_{\text{ICU}}$ using supervised learning on the training fold of the preprocessed data. To derive the models from the preprocessed training fold data, we optimise various types of predictive models, and perform a hyperparameter search with m runs for each of them. The model development process yields m candidate models with different hyperparameter choices and predictive performances for each model category.
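The preprocessing stage described in 3) can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names `fit_preprocessor` and `transform` are hypothetical, missing values are assumed to be encoded as `None`, and, as a loudly simplified stand-in for MICE, missing continuous values are replaced by the training mean (0 after standardisation). The sketch does keep the property emphasised above, namely that all statistics are derived from the training fold and then applied unchanged to other folds.

```python
from statistics import mean, pstdev

MISSING = None  # assumed marker for a measurement that was not taken


def fit_preprocessor(train_rows, max_missing_frac=0.998, max_discrete_values=6):
    """Derive per-feature preprocessing rules from the training fold only."""
    spec = []
    for j in range(len(train_rows[0])):
        column = [row[j] for row in train_rows]
        observed = [v for v in column if v is not MISSING]
        missing_frac = 1 - len(observed) / len(column)
        if missing_frac > max_missing_frac:
            # Too little data for this feature: drop it.
            spec.append({"kind": "drop"})
        elif len(set(observed)) < max_discrete_values:
            # Fewer than 6 unique values: treat as discrete.
            spec.append({"kind": "discrete", "values": sorted(set(observed))})
        else:
            sigma = pstdev(observed) or 1.0  # guard against zero variance
            spec.append({"kind": "continuous", "mean": mean(observed), "std": sigma})
    return spec


def transform(rows, spec):
    """Apply training-fold rules to any fold without re-fitting."""
    out = []
    for row in rows:
        encoded = []
        for value, s in zip(row, spec):
            if s["kind"] == "drop":
                continue
            if s["kind"] == "discrete":
                # One indicator per known value, plus a separate "missing" category.
                encoded.extend(1.0 if value == v else 0.0 for v in s["values"])
                encoded.append(1.0 if value is MISSING else 0.0)
            else:
                was_missing = value is MISSING
                # Stand-in for MICE (assumption): impute the training mean,
                # which equals 0 after standardisation.
                z = 0.0 if was_missing else (value - s["mean"]) / s["std"]
                # Missing indicator preserves the missingness information.
                encoded.extend([z, 1.0 if was_missing else 0.0])
        out.append(encoded)
    return out
```

In a full pipeline, the mean-imputation line would be replaced by an iterative imputer fitted on the training fold; the missing indicator would be computed in the same way.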
5) Model Selection: In order to select the best model amongst the set of candidate models, we evaluate their predictive performance against the held-out validation fold that had not been used for model development. We choose the top candidate model by ranking all models by their evaluated predictive performance. The model selection stage using the independent validation fold enables us to optimise hyperparameters without utilising test fold data.
6) Model Evaluation: In the model evaluation stage, we evaluate the selected best clinical predictive model against the held-out test fold that had been used neither for training nor for model selection in order to estimate the expected generalisation error of the models on previously unseen data. Using this approach, every selected best model from the model selection stage is evaluated exactly once against the test fold.
Using the presented standardised model development, selection and evaluation pipeline, we compare various types of clinical predictive models in the same test setting, with exactly the same amount of hyperparameter optimisation and input features against the same test fold. This process enables us to systematically study the expected generalisation ability, predictive performance and influential features of clinical predictive models for predicting SARS-CoV-2 test results, hospital admission for SARS-CoV-2 positive patients, and ICU admission for SARS-CoV-2 positive patients.

IV. EXPERIMENTS
We conducted retrospective experiments to evaluate the predictive performance of a number of clinical predictive models on each of the presented clinical prediction tasks using the standardised development, validation and evaluation pipeline. Concretely, our experiments aimed to answer the following questions:
1. What is the expected predictive performance of the various clinical predictive models in predicting (i) SARS-CoV-2 test results for presenting patients, (ii) hospital admission for SARS-CoV-2 positive patients, and (iii) ICU admission for SARS-CoV-2 positive patients?
2. Which clinical, demographic and blood analysis features were most important for the respective best encountered predictive models for each clinical prediction task?
The following subsections describe the conducted experimental evaluation in detail.

A. Dataset and Study Cohort
We used anonymised data from a cohort of 5644 patients seen at the Hospital Israelita Albert Einstein in São Paulo, Brazil in the early months of 2020. Over the data collection time frame, the rate of SARS-CoV-2 positive patients at the hospital was around 10%, of which around 6.5% and 2.5% required hospitalisation and critical care, respectively (Table I).
Notably, younger patients were underrepresented in the SARS-CoV-2 positive group relative to the general patient population which may have been caused by the reportedly more severe disease progression in older patients [51]. Information on patient sex was not included in our dataset. We randomly split the entire available patient cohort into training (50%), validation (20%) and test folds (30%) within strata of patient age, SARS-CoV-2 test result, hospital admission status, and ICU admission status. After stratification, the three folds were approximately balanced across the stratification dimensions.
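The stratified 50%/20%/30% split described above can be sketched as follows. This is an illustrative sketch rather than the code used in the study: the function name `stratified_split`, the `strata_key` callback and the fixed seed are assumptions introduced here. Each stratum (for example, a combination of age group, test result and admission status) is shuffled and divided separately so that the three folds end up approximately balanced and share no patients.

```python
import random
from collections import defaultdict


def stratified_split(records, strata_key, fractions=(0.5, 0.2, 0.3), seed=0):
    """Split record indices into train/validation/test folds within each stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for idx, rec in enumerate(records):
        by_stratum[strata_key(rec)].append(idx)

    train, val, test = [], [], []
    for key in sorted(by_stratum):  # sorted for reproducibility
        members = by_stratum[key]
        rng.shuffle(members)
        n_train = round(fractions[0] * len(members))
        n_val = round(fractions[1] * len(members))
        train += members[:n_train]
        val += members[n_train:n_train + n_val]
        test += members[n_train + n_val:]  # remainder goes to the test fold
    return train, val, test
```

Because every index is assigned to exactly one fold, the folds cannot overlap; very small strata may deviate from the target fractions due to rounding.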

B. Models
Using the presented systematic evaluation methodology, we trained five different model types: Logistic Regression (LR), Neural Network (NN), Random Forest (RF), Support Vector Machine (SVM), and Gradient Boosting (XGB) [52]. The NN was a multi-layer perceptron (MLP) consisting of L hidden layers with N hidden units each followed by a non-linear activation function (ReLU [53], SELU [54], or ELU [55]) and batch normalisation [56], and was trained using the Adam optimiser [57] for up to 300 epochs with an early stopping patience of 12 epochs on the validation set loss.

C. Hyperparameters
We followed an unbiased, systematic approach to hyperparameter selection and optimisation. For each type of clinical predictive model, we performed a maximum of 30 hyperparameter optimisation runs with hyperparameters chosen from predefined ranges (Table II). The performance of each hyperparameter optimisation run was evaluated against the validation cohort. After computing the validation set performance, we selected the best candidate predictive model across the 30 hyperparameter optimisation runs by area under the receiver operating characteristic curve (AUC) for further evaluation against the test set.
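The selection procedure above can be sketched in a few lines of plain Python. This is an assumption-laden illustration, not the study's code: `select_best`, `train_fn` and `sample_hyperparameters` are hypothetical names, a model is assumed to be any callable that maps a feature vector to a score, and the AUC is computed with the pairwise (Mann-Whitney) formulation.

```python
import random


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC requires at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_best(train_fn, sample_hyperparameters, train, val, n_runs=30, seed=0):
    """Train n_runs candidates and keep the one with the highest validation AUC.

    train_fn(train, params) must return a callable model(x) -> score.
    sample_hyperparameters(rng) must return one hyperparameter configuration.
    val is a pair (val_X, val_y); the test fold is never touched here.
    """
    rng = random.Random(seed)
    val_X, val_y = val
    best = None
    for _ in range(n_runs):
        params = sample_hyperparameters(rng)
        model = train_fn(train, params)
        val_auc = roc_auc(val_y, [model(x) for x in val_X])
        if best is None or val_auc > best[0]:
            best = (val_auc, params, model)
    return best  # (validation AUC, hyperparameters, fitted model)
```

Only the returned model would then be scored on the test fold, which keeps the test estimate free of selection effects.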
D. Metrics
1) Predictive Performance: To assess the predictive performance of each of the developed clinical predictive models, we evaluated their performance in terms of area under the receiver operating characteristic curve (AUC), area under the precision recall curve (AUPR), sensitivity, specificity, and specificity at greater than 95% sensitivity (Spec.@95%Sens.) on the held-out test set cohorts for each task (Table I). After model development and hyperparameter optimisation, we evaluated each model type exactly once against the test set to calculate the final performance metrics. Operating thresholds for each model were the operating points on the receiver operating characteristic curve closest to the top left coordinate as calculated for the validation cohort. We chose a variety of complementary evaluation metrics in order to give a comprehensive picture of the expected performance of each clinical predictive model on the evaluated tasks. For each of the performance metrics, we additionally computed 95% confidence intervals (CIs) using bootstrap resampling with 100 bootstrap samples on the test set cohort in order to quantify the uncertainty of our analysis results. We also assessed whether differences between clinical predictive models were statistically significant at significance level α = 0.05 using pairwise t-tests with the respective best models for each task as measured by AUC.
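The threshold-dependent metrics and the bootstrap confidence intervals described above can be sketched as follows. The function names are hypothetical, scores at or above the threshold are assumed to be classified as positive, and the percentile method is assumed for the interval since the text does not state which bootstrap variant was used.

```python
import random


def sens_spec(labels, scores, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def closest_to_top_left(labels, scores):
    """Threshold whose ROC point is nearest to (FPR=0, TPR=1)."""
    best_t, best_d = None, float("inf")
    for t in sorted(set(scores)):
        sens, spec = sens_spec(labels, scores, t)
        d = (1 - sens) ** 2 + (1 - spec) ** 2
        if d < best_d:
            best_t, best_d = t, d
    return best_t


def specificity_at_sensitivity(labels, scores, min_sens=0.95):
    """Best specificity among thresholds that reach the required sensitivity."""
    best = 0.0
    for t in set(scores):
        sens, spec = sens_spec(labels, scores, t)
        if sens >= min_sens:
            best = max(best, spec)
    return best


def bootstrap_ci(labels, scores, metric, n_boot=100, alpha=0.05, seed=0):
    """Percentile bootstrap CI (assumed variant) for metric(labels, scores)."""
    if len(set(labels)) < 2:
        raise ValueError("both classes are required")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that lost one class entirely
        stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * (n_boot - 1))]
    upper = stats[int((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper
```

In line with the text, the threshold would be chosen with `closest_to_top_left` on the validation cohort and then held fixed when computing sensitivity, specificity and their intervals on the test cohort.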
2) Importance of Test Types: To quantify the importance of specific clinical, demographic and blood analysis features on each of the predicted outcomes, we utilised causal explanation (CXPlain) models [34]. CXPlain provides standardised relative feature importance attributions for any predictive model by computing the marginal contribution of each input feature towards the predictive performance of a model [58], and is therefore particularly well-suited for assessing feature importance in our diverse set of models. We used the test fold's ground truth labels to compute the exact marginal contribution of each input feature without any estimation uncertainty.
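The notion of a marginal contribution towards predictive performance can be illustrated with a simplified masking scheme: each feature is replaced in turn by a baseline value, the resulting increase in error is recorded, and the increases are normalised to sum to 1. This sketch is not the CXPlain library and does not train an explanation model; the function name `marginal_importance`, the squared-error loss and the per-feature `baseline` values are assumptions made for illustration.

```python
def marginal_importance(predict, rows, labels, baseline):
    """Relative importance of each feature from masked-feature error increases.

    predict:  callable mapping one feature vector (list) to a score
    rows:     list of feature vectors (lists of equal length)
    labels:   ground truth outcomes aligned with rows
    baseline: one replacement value per feature used when masking it
    """
    def mean_squared_error(candidate_rows):
        return sum((predict(r) - y) ** 2
                   for r, y in zip(candidate_rows, labels)) / len(labels)

    full_error = mean_squared_error(rows)
    deltas = []
    for j in range(len(rows[0])):
        # Mask feature j in every row and measure how much the error grows.
        masked = [r[:j] + [baseline[j]] + r[j + 1:] for r in rows]
        deltas.append(max(mean_squared_error(masked) - full_error, 0.0))

    total = sum(deltas)
    if total == 0:
        # No feature changes the error: fall back to uniform importance.
        return [1.0 / len(deltas)] * len(deltas)
    return [d / total for d in deltas]
```

Because the scores are normalised, they are comparable across model types, which matches the motivation given above for using a model-agnostic attribution method.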

V. RESULTS

A. Predictive Performance
In terms of predictive performance (Table III), we found that the overall best identified models by AUC were XGB for predicting SARS-CoV-2 test results, RF for predicting hospital admissions for SARS-CoV-2 positive patients, and SVM for predicting ICU admission for SARS-CoV-2 positive patients. Notably, we found that predicting positive SARS-CoV-2 results from routinely collected clinical measurements was a considerably more difficult task for clinical predictive models than predicting hospitalisation and ICU admission. Nonetheless, the best encountered clinical predictive model for predicting SARS-CoV-2 test results (XGB) achieved a respectable sensitivity of 75% (95% CI: 67%, 81%) and specificity of 49% (95% CI: 46%, 51%). After fixing the operating threshold of the model to meet a sensitivity level of at least 95% (Spec.@95% Sens.), the best XGB model for predicting SARS-CoV-2 test results would achieve a specificity of 23% (95% CI: 7%, 32%). We additionally found that the differences in predictive performance between the best XGB model for predicting SARS-CoV-2 test results and the other predictive models were significant at a pre-specified significance level of α = 0.05 (t-test) for all but the AUPR metric, where NN achieved a significantly better AUPR of 0.22 and the difference to SVM was not significant at the pre-specified significance level. On the task of predicting hospital admissions for SARS-CoV-2 positive patients, the best encountered RF model achieved a sensitivity of 55% (95% CI: 19%, 85%), a high specificity of 96% (95% CI: 92%, 98%), and a specificity at a fixed sensitivity of at least 95% (Spec.@95% Sens.) of 34% (95% CI: 29%, 97%).
Owing to the lower sample size due to the smaller cohort of SARS-CoV-2 positive patients, the performance results for predicting hospital admission generally had wider uncertainty bounds but were nonetheless significantly better for RF than the other predictive models at the pre-specified significance level of α = 0.05 (t-test) for most performance metrics with the exception of AUC where XGB achieved an AUC of 0.91 and AUPR where LR achieved an AUPR of 0.44. On the task of predicting ICU admission for SARS-CoV-2 positive patients, SVM had a sensitivity of 80% (95% CI: 36%, 100%), a specificity of 96% (95% CI: 92%, 98%), and a specificity at a fixed sensitivity of at least 95% (Spec.@95% Sens.) of 95% (95% CI: 91%, 100%). Due to the small percentage of around 3% of SARS-CoV-2 positive patients that were admitted to the ICU (Table I), uncertainty bounds were wider than for the models predicting hospital admissions, and the results of the best encountered SVM were found to be not significantly better than LR and RF in terms of AUC, LR and NN in terms of sensitivity, and NN in terms of Spec.@95% Sens. at the pre-specified significance level of α = 0.05 (t-test).

B. Feature Importance
In terms of feature importance, we found that importance scores were distributed highly unequally, relatively uniformly and highly uniformly for the best models encountered for predicting SARS-CoV-2 test results, for predicting hospital admissions for SARS-CoV-2 positive patients, and for predicting ICU admission, respectively (Figure 4). Most notably, we found that 71.7% of the importance for the best XGB model for predicting SARS-CoV-2 test results was assigned to the missing indicator corresponding to the Arterial Lactic Acid measurement, i.e. much of the marginal predictive performance gain of the XGB model was attributed to whether or not the Arterial Lactic Acid test had been ordered. Beyond Arterial Lactic Acid being missing, age, leukocyte count, platelet count, and creatinine were implied to be associated with a positive SARS-CoV-2 test result by the best encountered predictive model, which further substantiates recent independent reports of those factors being potentially associated with SARS-CoV-2 [59], [60], [61], [62], [48]. Similarly to the best encountered XGB model for predicting SARS-CoV-2 test results, the top encountered predictive models for hospital admission and ICU admission for SARS-CoV-2 positive patients assigned a considerable degree of importance to missingness patterns associated with a number of measurements. A possible explanation for missingness appearing as a top predictor across the different tasks is that decisions on whether or not to order a certain test for a given patient were influenced by patient characteristics that were not captured in the set of clinical measurements that were available to the predictive models. A controlled setting with standardised testing guidelines would be required to determine which confounding factors are behind the predictive power of the missingness patterns that have been implied to be associated with COVID-19 by the predictive models.
Beyond missingness patterns, top predictors for predicting hospital admission were lactate dehydrogenase [63], gamma-glutamyltransferase, which through abnormal liver function has been reported to be implicated in COVID-19 severity [64], and HCO3 [65]. For predicting ICU admission in SARS-CoV-2 positive patients, pCO2 and pH [48] were top predictors. Blood pH, and in particular respiratory alkalosis, has been reported to be associated with severe COVID-19 [66].

VI. DISCUSSION
We presented a systematic study of predictive models that predict SARS-CoV-2 test results, hospital admission for SARS-CoV-2 positive patients, and ICU admission for SARS-CoV-2 positive patients using routinely collected clinical measurements. Models that predict SARS-CoV-2 test results could help prioritise scarce testing capacity by identifying those individuals that are more likely to receive a positive result. Similarly, predictive models that predict which SARS-CoV-2 positive patients would be most likely to require hospital and critical care beds could help better utilise existing hospital capacity by prioritising those patients that have the highest risk of deterioration. Facilitating the efficient utilisation of scarce healthcare resources is particularly important in dealing with SARS-CoV-2 as its rapid transmission significantly increases demand for healthcare services worldwide. The main limitation of the presented study is that its experimental evaluation was based on data collected from a single study site, and its results may therefore not generalise to settings with significantly different patient populations, admission criteria, patterns of missingness, and testing guidelines. In addition, we did not have access to mortality data for the analysed cohort, and we were therefore not able to correlate our predicted individual risk scores with patient mortality, which is another related prediction task that may be of clinical importance. Future studies should include a broader set of clinical measurements and outcomes, cohorts from multiple distinct geographical sites and under varying patterns of missingness in order to determine the robustness of the clinical predictive models to these confounding factors. 
Finally, we believe that the inclusion of data from other modalities, such as genomic profiling and medical imaging, and data on co-morbidities, symptoms and treatment histories could potentially further improve predictive performance of clinical predictive models across the presented prediction tasks.

VII. CONCLUSION
We presented a systematic study in which we developed and evaluated clinical predictive models for COVID-19 that estimate (i) the likelihood of a positive SARS-CoV-2 test in patients presenting at hospitals, (ii) the likelihood of hospital admission and (iii) intensive care unit admission in SARS-CoV-2 positive patients. We evaluated our developed clinical predictive models in a retrospective evaluation using a cohort of 5644 hospital patients seen in São Paulo, Brazil. In addition, we determined the clinical, demographic and blood analysis measurements that were most important for accurately predicting SARS-CoV-2 status, hospital admissions, and ICU admissions. Our experimental results indicate that clinical predictive models may in the future potentially be used to inform care and help prioritise scarce healthcare resources by assigning personalised risk scores for individual patients using routinely collected clinical, demographic and blood analysis data. Furthermore, our findings on the importance of routine clinical measurements towards predicting clinical pathways for patients increase our understanding of the interrelations of individual risk profiles and outcomes in SARS-CoV-2. Based on our study's results, we conclude that healthcare systems should explore the use of predictive models that assess individual COVID-19 risk in order to improve healthcare resource prioritisation and inform patient care.

ACKNOWLEDGMENTS
The anonymised data used in this manuscript were generously contributed by patients at Hospital Israelita Albert Einstein in São Paulo, Brazil, and are freely available at https://www.kaggle.com/einsteindata4u/covid19.