Using Automated Machine Learning to Predict COVID-19 Patient Survival: Identifying Influential Biomarkers

Background: In a pandemic, it is important for clinicians to stratify patients and decide who receives limited medical resources. In this study, we used automated machine learning (autoML) to develop and compare multiple machine learning (ML) models that predict the chance of patient survival from COVID-19 infection, and we identified the best-performing model. In addition, we investigated which biomarkers are the most influential in generating an accurate model. We believe an ML model such as this could be a useful tool for clinicians stratifying hospitalized SARS-CoV-2 patients.

Methods: The data was retrospectively collected from Clinical Looking Glass (CLG) on all patients testing positive for COVID-19 through a nasopharyngeal specimen by real-time RT-PCR and admitted between 3/1/2020 and 7/3/2020 (4376 patients) at our institution. We collected 47 biomarkers from each patient within 36 hours before or after the index time (RT-PCR positivity) and tracked whether each patient survived for one month following this time. We utilized the autoML from H2O.ai, an open-source package for the R language. The autoML generated 20 ML models and ranked them by area under the precision-recall curve (AUCPR) on the test set. We selected the best model (model_var_47) and chose the threshold probability that maximized the F2 score to make a binary classifier: dead or alive. Subsequently, we ranked the relative importance of the variables that generated model_var_47 and chose the 10 most influential. Next, we reran the autoML with these 10 variables and likewise selected the model with the best AUCPR on the test set (model_var_10). Again, the threshold probability that maximized the F2 score for model_var_10 was chosen to make a binary classifier. We calculated and compared the sensitivity, specificity, and positive predictive value (PPV) of model_var_10 and model_var_47.
Results: The best model generated by autoML using all 47 variables was the stacked ensemble of all models (AUCPR = 0.836). The most influential variables were systolic and diastolic blood pressure, age, respiratory rate, pulse oximetry, blood urea nitrogen, lactate dehydrogenase, D-dimer, troponin, and glucose. When the autoML was retrained with these 10 most important variables, performance was not significantly affected (AUCPR = 0.828). For the binary classifiers, the sensitivity, specificity, and PPV of model_var_47 were 83.6%, 87.7%, and 69.8%, respectively, while for model_var_10 they were 90.9%, 71.1%, and 51.8%, respectively.

Conclusions: By using autoML, we developed high-performing models that predict patient mortality from COVID-19 infection. In addition, we identified the most important biomarkers correlated with mortality. This ML model can be used as a decision-support tool for medical practitioners to efficiently triage COVID-19 infected patients. Based on our literature review, this is the largest COVID-19 patient cohort used to train ML models and the first to utilize autoML. The COVID-19 survival calculator based on this study can be found at https://www.tsubomitech.com/.


Introduction:
On January 30th, 2020, the WHO declared a COVID-19 outbreak which began in Wuhan, China. As of August 1st, 2020, the CDC reported more than 4.5 million cases and 150,000 deaths from COVID-19 in the United States alone [1]. New York City (NYC) became the epicenter in the U.S., with the highest case number and deaths per capita [2]. At our institution, between 3/1/2020 and 7/3/2020, we admitted 4375 patients who tested positive for COVID-19, and 1088 patients died within 30 days of infection (Figure 1). Many regions worldwide are still fighting the first wave of the pandemic, while other areas that reopened are seeing a resurgence of new cases. In such an emergent situation, it is important for clinicians to triage patients effectively to maximize limited medical resources. In this study, we aimed to find the most important prognostic biomarkers and develop a COVID-19 mortality risk assessment tool using automated machine learning (autoML).

Some of the most frequently reported prognostic indicators in COVID-19 patients include gender, C-reactive protein (CRP), lactate dehydrogenase (LDH), and lymphocyte count. Other inflammatory markers also often appear to be elevated, such as ferritin, interleukin-6, tumor necrosis factor-alpha, and interferon gamma [3][4][5]. For example, Zhou et al. were one of the first groups to show that specific biomarkers are predictive of morbidity/mortality, in 191 patients from Wuhan. They found that older age, d-dimer levels greater than 1 ug/mL, and a higher SOFA (Sequential Organ Failure Assessment) score were associated with greater odds of in-hospital death [6]. However, many papers that have investigated biomarkers are largely focused on Chinese hospitals due to the disease's initial emergence in Wuhan [7][8][9][10]. It is essential to understand the clinical characteristics of more recent and diverse cases, considering that the virus strain may have mutated since its first appearance [11].
In recent months, several ML models have been proposed to predict COVID-19 mortality as well as disease severity. In many such studies, researchers trained only one kind of ML model (XGBoost, Random Forest, etc.) on relatively small datasets, with cohorts mainly from China. In addition, many studies evaluated model performance only on the area under the receiver operating characteristic curve (AUC) and not the area under the precision-recall curve (AUCPR) [3,12,13]. AUC is appropriate when the outcome classes (dead or alive) are relatively balanced. When the outcome is skewed, as it is here because most people survive COVID-19 infection, a model should be evaluated with AUCPR.
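Because AUCPR is central to how models are ranked in this study, a minimal sketch may clarify the metric. This is an illustrative average-precision computation in Python (a standard estimator of AUCPR, not the H2O.ai implementation), run on a toy skewed outcome where most labels are "survived" (0):

```python
def average_precision(y_true, scores):
    """Average precision: a common estimate of the area under the
    precision-recall curve (AUCPR). Each positive example contributes
    the precision at its rank when examples are sorted by score."""
    # Sort example indices by predicted score, highest first.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, ap = 0, 0.0
    n_pos = sum(y_true)
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:       # a positive at this rank
            tp += 1
            ap += tp / rank      # precision among the top `rank` examples
    return ap / n_pos

# Toy imbalanced outcome: 2 deaths (1) among 10 patients (hypothetical data).
y = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
ap = average_precision(y, scores)   # 5/6: one miss-ranked negative hurts AUCPR
```

Unlike AUC, this quantity never credits the model for correctly ranking the large pool of true negatives against each other, which is why it is the stricter metric for a rare outcome.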
In our study, we used autoML to generate various machine learning models and to automate hyperparameter tuning and optimization. From the generated models, we chose the best-performing one based on AUCPR. We also used the F2-score to evaluate the binary classifier (dead or alive). Unlike the F1-score, which gives equal weight to positive predictive value (PPV) and sensitivity, the F2-score gives more weight to sensitivity, penalizing the model more for false negatives than false positives. Furthermore, we ranked the variables by importance to understand which were the most influential in developing the top-performing models.
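The F1/F2 distinction follows from the general F-beta formula, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where beta = 2 weights recall (sensitivity) four times as heavily as precision. A minimal Python sketch, using hypothetical precision and recall values for illustration:

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 weights recall (sensitivity) more heavily
    than precision (PPV); beta = 1 gives the familiar F1 score."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A classifier with high sensitivity but modest PPV (hypothetical numbers):
p, r = 0.5, 0.9
f1 = f_beta(p, r, beta=1)   # equal weight on precision and recall
f2 = f_beta(p, r, beta=2)   # recall counted 4x as much as precision
```

For these values f2 exceeds f1, showing how the F2-score rewards a classifier that rarely misses a death even at the cost of more false alarms.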
While there are many autoML platforms, we chose the open-source H2O.ai for the diverse ML models in its autoML package [14,15]. In addition, since the package can be downloaded to a local device, one does not have to upload patient data to the cloud, reducing the risk of exposing it to a third party. The H2O.ai autoML trains and cross-validates on the following ML algorithms: XGBoost, GBM (Gradient Boosting Machine) models, a fixed grid of GLMs, a Default Random Forest (DRF), five pre-specified H2O GBMs, a near-default Deep Neural Net, an Extremely Randomized Forest (XRT), a random grid of XGBoost GBMs, a random grid of H2O GBMs, a random grid of Deep Neural Nets, and two Stacked Ensembles: one based on all previously trained models, and another based on the best model of each family [15].
Based on our literature review, this is the largest COVID-19 patient cohort used to train an ML model and the first study to utilize autoML.

Methods:
Data collection and analysis were approved by the Albert Einstein College of Medicine Institutional Review Board. The data was collected using Clinical Looking Glass (CLG), an interactive software application developed at Montefiore Medical Center for the evaluation of health care quality, effectiveness, and efficiency. The system integrates clinical and administrative datasets, allowing non-statisticians to build temporally sophisticated cohorts and outcomes [16][17][18][19].
We queried CLG to find all patients who tested positive for COVID-19 through a nasopharyngeal specimen by real-time RT-PCR and were admitted at our institution from 3/1/2020 to 7/3/2020 (4376 patients). The admitted patients tested positive within 24 hours before or after admission. The index time is the time the RT-PCR returned positive for COVID-19.
We investigated a total of 47 unique biomarkers, chosen after a literature review, and used the earliest biomarker values available within 36 hours before or after the index time. The outcome of interest was mortality from any cause at one month from the index time. We obtained cycle threshold values (Ct-values) from positive nasopharyngeal specimens collected in accordance with CDC guidance. Real-time RT-PCR was performed using the Hologic Panther Fusion SARS-CoV-2 assay. The variable NLratio is the ratio of neutrophil to lymphocyte count.
We used the open-source autoML package from H2O.ai for the R programming language [14,15,20]. For the purpose of reproducibility, we excluded the deep learning method in this study (discussed further in the Discussion section). The autoML was trained on a randomly selected 80% of the dataset (3517 patients, training set) with 10-fold cross-validation. We assigned the autoML to generate 20 machine learning models and rank them in order of performance by AUCPR on the remaining 20% of the dataset (859 patients, test set). As mentioned above, we evaluated the models with AUCPR because more patients in our cohort survived COVID-19 than died. For convenience, we named the best model generated with 47 variables "model_var_47." To make a binary classifier (dead or alive within 30 days), we chose the threshold probability that maximizes the F2 score of model_var_47. Sensitivity, specificity, and PPV were calculated for the binary classifier. As mentioned earlier, the F2-score was chosen because, unlike the F1-score, which gives equal weight to precision and sensitivity (or recall), the F2-score gives more weight to sensitivity, penalizing the model more for false negatives than false positives.
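The threshold-selection step above can be sketched as a brute-force scan over candidate probabilities, keeping the one that maximizes F2. This is an illustrative Python version, not the H2O.ai code; the labels and predicted probabilities below are hypothetical:

```python
def best_f2_threshold(y_true, probs):
    """Scan each distinct predicted probability as a candidate cutoff
    and return the (threshold, F2) pair with the highest F2 score."""
    best_t, best_f2 = 0.0, -1.0
    for t in sorted(set(probs)):
        # Classify as "dead" (positive) when predicted probability >= t.
        tp = sum(1 for y, p in zip(y_true, probs) if y == 1 and p >= t)
        fp = sum(1 for y, p in zip(y_true, probs) if y == 0 and p >= t)
        fn = sum(1 for y, p in zip(y_true, probs) if y == 1 and p < t)
        if tp == 0:
            continue                      # F2 undefined without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f2 = 5 * precision * recall / (4 * precision + recall)
        if f2 > best_f2:
            best_t, best_f2 = t, f2
    return best_t, best_f2

# Hypothetical outcomes and model-predicted mortality probabilities:
y = [0, 0, 1, 0, 1, 1]
probs = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
threshold, f2 = best_f2_threshold(y, probs)
```

Note that the F2-optimal cutoff tends to be low (0.215 and 0.151 in this study): lowering the threshold trades false positives for fewer false negatives, which F2 rewards.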
In addition, we generated variable importance rankings for the top models and chose the ten variables that had the greatest influence in making effective models. Variable importance was determined by calculating the relative influence of each variable: whether that variable was selected to split on during the tree-building process, and how much the squared error (over all trees) improved (decreased) as a result [15].
With the 10 chosen variables, we retrained the autoML. Again, we assigned the autoML to generate 20 machine learning models and rank them in order of AUCPR on the test set. For convenience, we named the best model generated with 10 variables "model_var_10." Next, to make a binary classifier, we chose the threshold probability that maximizes the F2 score of model_var_10. Sensitivity, specificity, and PPV were calculated for the binary classifier. A comparison was made between model_var_47 and model_var_10.
How the autoML handles missing values for each model is explained in the H2O.ai documentation [15].
The workflow of our method is depicted in Figure 2.

All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. This version posted October 14, 2020. https://doi.org/10.1101/2020.10.12.20211086

Results:
A data summary of cases, survival, and biomarkers is presented in Table 1.

Table 1: Summary of variables from the cohort. NLratio is the neutrophil-lymphocyte ratio; ct_value is the cycle threshold from the Hologic Panther Fusion SARS-CoV-2 assay; rr: respiratory rate; mpv: mean platelet volume.

Model_var_47 Performance:
The best-performing model with 47 variables was the Stacked Ensemble of all models, with AUCPR = 0.836; this is our model_var_47. It was followed by the Stacked Ensemble of the best model from each ML family (AUCPR = 0.834). The two Stacked Ensemble models were followed by GBM and XGBoost models, with AUCPRs of 0.830 and 0.825, respectively. The max F2-score of model_var_47 was 0.804, at a threshold probability of 0.215. The binary classifier with this threshold had sensitivity, specificity, and PPV of 83.6%, 87.7%, and 69.8%, respectively (Table 3).

Table 3: Confusion matrix of model_var_47. To make a binary classifier, we chose a threshold of 0.215 that maximized the F2 score at 0.804. Sensitivity = 83.6%, specificity = 87.7%, positive predictive value = 69.8%, negative predictive value = 94.0%.
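All four reported metrics derive directly from the cells of the confusion matrix. A small Python sketch, using hypothetical counts for illustration (not the actual counts behind Table 3), shows the relationships:

```python
def classifier_metrics(tp, fp, fn, tn):
    """Compute the four reported metrics from confusion-matrix counts,
    treating "dead" as the positive class."""
    return {
        "sensitivity": tp / (tp + fn),   # recall: deaths correctly flagged
        "specificity": tn / (tn + fp),   # survivors correctly cleared
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
    }

# Hypothetical counts, for illustration only:
m = classifier_metrics(tp=90, fp=10, fn=10, tn=190)
```

The trade-off reported in the Results follows from these formulas: lowering the classification threshold converts false negatives into true positives (raising sensitivity) while converting true negatives into false positives (lowering specificity and PPV).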

Variable Importance Ranking:
Ensemble models cannot generate a variable importance ranking. However, we can generate the variable importance ranking for the XGBoost and GBM models (Table 4). For both models, age and vital signs (blood pressure, pulse oximetry, and respiratory rate (RR)) ranked high. In addition, biomarkers such as BUN, LDH, D-dimer, creatinine (Cr), EGFR, troponin, pro-BNP, glucose, and procalcitonin also ranked highly in many models. To have at least one biomarker each for cardiac and renal function, we chose troponin and BUN to be included in the 10 variables. Glucose was also chosen for its high rank and ease of measurement in the clinical setting. Therefore, our chosen 10 variables were: systolic and diastolic blood pressure, age, pulse oximetry, respiratory rate, LDH, BUN, D-dimer, troponin, and glucose.

Table 4: Side-by-side comparison of variable importance rankings between the best-performing GBM and XGBoost models. rr: respiratory rate; mpv: mean platelet volume; NLratio: neutrophil-lymphocyte ratio.

Model_var_10 Performance:
Next, we reran the autoML with the chosen 10 variables and ranked the models in order of AUCPR. A GBM model performed best, with AUCPR = 0.828; this is our model_var_10 (Table 5).

Table 5: Output of the autoML run with 10 variables, showing the rank order of models by AUCPR, along with AUC and Logloss.
The max F2-score of model_var_10 was 0.790, at a threshold probability of 0.151. The binary classifier with this threshold had sensitivity, specificity, and positive predictive value of 90.9%, 71.1%, and 51.8%, respectively (Table 6).

Table 6: Confusion matrix of model_var_10. To make a binary classifier, we chose a threshold of 0.151 that maximized the F2 score at 0.790. Sensitivity = 90.9%, specificity = 71.1%, positive predictive value = 51.8%, negative predictive value = 95.8%.

Figure 3 shows the comparison of model_var_47 and model_var_10 by AUCPR.

Figure 3: Comparing AUCPR of model_var_47 and model_var_10.
Discussion:

Principal Results:
By using autoML, we were able to generate multiple ML models, automate hyperparameter tuning and optimization, and compare models by performance. In addition, we were able to extract the variables most predictive of patient mortality.
Both model_var_47 and model_var_10 predict mortality with high performance using clinical measurements collected early in a patient's hospital admission. When the autoML was retrained with the 10 chosen variables, the performance of the model was not significantly affected. This suggests that these biomarkers are essential in gauging a patient's severity of infection. Our models use commonly available laboratory results and do not require imaging results or advanced testing. We believe an early and convenient risk assessment of patient mortality can allow physicians to triage and prioritize resources in a highly congested medical system.
Of the models generated by autoML, the Stacked Ensemble performed best with all 47 variables, and a GBM model performed best with the 10 chosen variables. Deep learning was excluded from this study for the purpose of reproducibility. However, when the autoML was run with deep learning included, it did not perform as well as the GBM or XGBoost models; deep learning's best AUCPR on the test set was 0.807.
For both GBM and XGBoost, age and vital signs had significant influence in predicting patient mortality. LDH came out on top as the most reliable inflammatory marker. Cardiac (troponin and pro-BNP) and renal (Cr and EGFR) markers were influential, supporting studies [21] and [22]. D-dimer ranked high in many models, which supports studies that found COVID-19 can promote coagulopathy [23,24]. Glucose also ranked relatively high in our models, supporting the finding that fasting blood glucose, regardless of a previous diagnosis of diabetes, can be affected by COVID-19 infection [25]. Other variables that often came up within the top 15 of the importance ranking were creatine kinase, fibrinogen, hemoglobin, and platelet count. Regarding the Ct-value: because it is not standardized across RT-PCR platforms, results cannot be compared across different assays. In this study, we used only one platform (the Hologic Panther Fusion), and therefore many values were missing (74%). A larger dataset is required to determine whether the magnitude of a Ct-value has clinical implications.
Some variables, especially those ranked lower, seem to contribute to the model, but it is not clear how they are clinically influenced by COVID-19 infection. For example, MCV and calcium level (ranks 15 and 24 in the GBM model) appear to affect the model, but because ML models are a non-linear combination of variables, it is difficult to understand how they contribute and whether there is clinical significance. Further investigation is needed into how these biomarkers are altered in COVID-19 infected patients. However, it is interesting to note that ML models can identify these alterations.

Limitations:
We recognize there are limitations to our study. Fibrinogen, procalcitonin, and Ct-values were missing in more than 50% of our cohort; H2O.ai has an internal imputation method to fill in these missing values [15]. Our cohort was limited to patients severe enough to be admitted, and our findings may not generalize to all COVID-19 infected patients. Finally, the nature of training ML models involves randomization. In this study, we presented one representative autoML run. To check for consistency, we tested multiple random splits between the training and test sets and retrained the autoML each time. We found that Stacked Ensemble models usually perform the best, followed by GBM and XGBoost. The overall performance of each model was minimally altered by the randomization process. Variable rankings shifted slightly with each autoML run, especially among the lower-ranking variables. However, the 10 most important biomarkers we chose consistently ranked at the top.

Future Work:
We are working towards additional goals to make the model more robust and user-friendly for clinicians. First, we are in the process of gathering data from other institutions and countries to make our model more generalizable. We imagine that institutions with many COVID-19 patients can develop their own ML models specialized for their population, while institutions with minimal COVID-19 cases can use a more generalized model trained on a meta-dataset. Second, we are working to implement reinforcement learning into our model. Reinforcement learning will allow us to update the model in real time as we accumulate more data, making the model responsive to a rapidly changing environment. Lastly, we are working to open the black box of ML models to understand how they make such highly accurate decisions. For example, we would like to know: what is the cut-off value for each variable to be considered important? How are the variables related to each other? Deciphering the black box of ML models is a research field of its own. However, it will allow us to use the models more practically, and possibly provide insight into the mechanism of disease of COVID-19 infection [26].

Conclusion:
Using autoML, we generated high-performing ML models that predict mortality of COVID-19 infected patients. We also identified important variables that are strongly associated with patient outcomes. This is a proof of concept that autoML is an efficient, effective, and informative method to generate ML models and gain insight into the disease process. A model such as this may help clinicians triage patients in the current pandemic.

Conflict of Interest: None declared.