Background: Limited information is available about the presenting characteristics and dynamic clinical changes that occur in patients with COVID-19 during the early phase of the illness.
Objective: This study aimed to develop and validate machine learning models based on clinical features to assess the risk of severe disease and to triage patients with COVID-19 upon hospital admission.
Methods: This retrospective multicenter cohort study included patients with COVID-19 who were released from quarantine until April 30, 2020, in Korea. A total of 5628 patients were included in the training and testing cohorts to train and validate the models that predict clinical severity and the duration of hospitalization, and the clinical severity score was defined at four levels: mild, moderate, severe, and critical.
Results: Out of a total of 5601 patients, 4455 (79.5%), 330 (5.9%), 512 (9.1%), and 304 (5.4%) were included in the mild, moderate, severe, and critical levels, respectively. As risk factors for predicting critical patients, we selected older age, shortness of breath, a high white blood cell count, low hemoglobin levels, a low lymphocyte count, and a low platelet count. We developed 3 prediction models to classify clinical severity levels. For example, the prediction model with 6 variables yielded an area under the receiver operating characteristic curve of >0.93. We developed a web-based nomogram using these models.
Conclusions: Our prediction models, along with the web-based nomogram, are expected to be useful for the assessment of the onset of severe and critical illness among patients with COVID-19 and triage patients upon hospital admission.
COVID-19, an infectious disease, is currently spreading at an unprecedented pace. The World Health Organization declared COVID-19 a public health emergency of international concern on January 30, 2020, and subsequently a pandemic on March 11, 2020. The COVID-19 pandemic has posed challenges to public health systems worldwide.
The clinical spectrum of SARS-CoV-2 infection ranges from asymptomatic infection to fatal disease requiring mechanical ventilation. According to initial data from China, the clinical spectrum of COVID-19 is broad, with most infected individuals experiencing only mild or subclinical illness, especially in the early phase of the disease. However, recent studies have reported that approximately 14%-30% of hospitalized patients diagnosed with COVID-19 develop severe respiratory failure that requires intensive care. The wide range of outcomes observed, ranging from subpopulations that are mainly asymptomatic to those with substantial fatality rates, calls for risk stratification.
Although dexamethasone and remdesivir have recently been considered a preferred treatment strategy, it is still difficult to use them universally for all patients with COVID-19; hence, supportive treatments to protect multiorgan functions are a major resource for reducing mortality. Several promising innovative drugs and treatment strategies are under investigation; however, until they become commercially available, the capacity of the medical system remains limited, prompting the need for rationing decisions. We argue that early identification of patients at risk of severe respiratory failure would facilitate better resource planning and help set up effective organizational and clinical interventions, including early pharmacotherapy to prevent admission to the intensive care unit.
Since COVID-19 is a pandemic, many studies have assessed regional clinical features among patients. Pandemic preparedness and strategies differ among countries, and the clinical characteristics of patients admitted to medical facilities seem to vary in different cohorts.
We obtained data on 5628 patients with confirmed COVID-19 admitted to hospitals in Korea and analyzed their clinical features and clinical findings upon admission. The objectives of this study are to (1) develop models that predict which individuals are at a high risk of severe disease and their duration of hospitalization in a cohort of hospitalized patients with a confirmed diagnosis of COVID-19 and (2) generate a web-based nomogram based on these models. Our results are expected to provide clinicians with a better understanding of the clinical course of COVID-19 and a guideline for critical care rationing.
Data Source and Study Design
This is a retrospective, multicenter cohort study conducted in Korea. The data used in this study were public data provided by the Korea Disease Control and Prevention Agency (KDCA), which collected them from physicians at multiple centers. The study cohort included 5628 patients with COVID-19 confirmed through the RT–PCR test who were hospitalized or released from quarantine upon recovery by April 30, 2020.
A total of 41 variables were recorded for each patient and classified into 7 types (Table S1). Among the 41 variables provided by the KDCA, 35 were used as predictors, including demographics, physical measurements, initial vital signs, comorbidities, and laboratory findings collected upon admission. We excluded 6 pregnancy-related variables because they were applicable only to women. This study was approved by the institutional review board of Seoul National University (protocol# E2008/003-004).
Definitions of the Primary and Secondary Outcomes
In this study, the primary outcome of interest is the maximum clinical severity score (CSS). The original CSSs provided by the KDCA have 8 levels (, Table S2). The CSSs contain ordered information about the clinical severity of patients with COVID-19. For example, the lowest level (ie, level 1) represents no activity restrictions and the highest level (ie, level 8) represents death. As shown in , each patient may go through different CSS levels during the course of hospitalization. For each patient, the “max CSS” was defined as the maximum level of CSS reported through their hospital duration ( ). Instead of the original 8 levels, the severity was reclassified into 4 levels depending on the patient’s condition to determine the appropriate treatment in our study. Accordingly, the modified CSS (mCSS) was defined as mild, moderate, severe, and critical ( , Table S2). The mild group included patients with no activity restrictions, which corresponded to 1 in the original CSS levels. The moderate group displayed limited activity but did not require oxygen therapy. This group corresponded to the original CSS level of 2. Patients who received oxygen therapy were classified under the severe group and those who received ventilation or extracorporeal membrane oxygenation or those who died were classified under the critical group. The severe group corresponded to original CSS levels of 3 and 4, and the critical group corresponded to original CSS levels of 5, 6, 7, and 8.
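The regrouping described above can be sketched in a few lines of Python. This is an illustration only and not the authors' code (the study's analyses were run in R); the function names `to_mcss` and `max_mcss` and the example history are hypothetical, while the level mapping follows the text.

```python
# Hypothetical helper names; only the level mapping comes from the text.
MCSS_LABELS = {1: "mild", 2: "moderate", 3: "severe", 4: "critical"}


def to_mcss(css_level: int) -> int:
    """Map an original 8-level CSS value to the 4-level mCSS."""
    if not 1 <= css_level <= 8:
        raise ValueError(f"CSS level must be 1-8, got {css_level}")
    if css_level == 1:
        return 1  # mild: no activity restrictions
    if css_level == 2:
        return 2  # moderate: limited activity, no oxygen therapy
    if css_level <= 4:
        return 3  # severe: original levels 3 and 4
    return 4      # critical: original levels 5 to 8


def max_mcss(css_history: list[int]) -> int:
    """Return the mCSS of the maximum CSS reported during hospitalization."""
    if not css_history:
        raise ValueError("css_history must not be empty")
    return to_mcss(max(css_history))


# Hypothetical patient whose CSS moved 1 -> 3 -> 2 -> 1 during the stay.
level = max_mcss([1, 3, 2, 1])
print(level, MCSS_LABELS[level])  # prints: 3 severe
```

Because the mapping is monotone, taking the maximum before or after regrouping gives the same mCSS.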
The secondary outcome was the total duration of hospitalization from the time of admission to discharge. In Korea, a patient who tests positive for COVID-19 on the RT–PCR test is immediately admitted to a hospital or an isolation facility. Our data set contains data only on hospitalized patients with COVID-19 who had clinical findings such as blood test results.
Among the 35 predictor variables, 7 variables (body temperature, heart rate, and 5 laboratory results) were continuous, while all the other variables were categorical. Among the 7 continuous variables, body temperature and heart rate were recategorized. Specifically, body temperature was divided into 2 categories with 37.5°C as the threshold, and heart rate was divided into 3 groups: <60 beats/min, 60-100 beats/min, and ≥100 beats/min. Among the 28 original categorical variables, age, body mass index (BMI), systolic blood pressure (SBP), and diastolic blood pressure (DBP) were recategorized. Age was originally grouped into 10-year intervals: <10 years, 10-19 years, 20-29 years, 30-39 years, 40-49 years, 50-59 years, 60-69 years, 70-79 years, and ≥80 years. Of these groups, the <10 years and 10-19 years groups were merged into 1 group. BMI was originally provided in 5 groups: <18.5 kg/m², 18.5-22.9 kg/m², 23-24.9 kg/m², 25-29.9 kg/m², and ≥30 kg/m². Of these groups, the 25-29.9 kg/m² and ≥30 kg/m² groups were merged into 1 group. SBP was originally provided in 5 groups (<120 mmHg, 120-129 mmHg, 130-139 mmHg, 140-159 mmHg, and ≥160 mmHg) and DBP in 4 groups (<80 mmHg, 80-89 mmHg, 90-99 mmHg, and ≥100 mmHg). SBP and DBP were each regrouped into 2 groups, with thresholds of 140 mmHg and 90 mmHg, respectively.
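A minimal Python sketch of this recategorization is given below for illustration; it is not the authors' code. The category labels and helper names are hypothetical. The text lists both 60-100 and ≥100 beats/min, so assigning exactly 100 beats/min to the upper group is an assumption.

```python
def temperature_group(temp_c: float) -> str:
    """Binary body temperature category with 37.5 degrees C as the threshold."""
    return ">=37.5" if temp_c >= 37.5 else "<37.5"


def heart_rate_group(bpm: float) -> str:
    """Three heart rate groups; 100 beats/min goes to the upper group (assumption)."""
    if bpm < 60:
        return "<60"
    if bpm < 100:
        return "60-99"
    return ">=100"


# Variables that were already categorical are only merged, not re-binned.
AGE_MERGE = {"<10": "<20", "10-19": "<20"}
BMI_MERGE = {"25-29.9": ">=25", ">=30": ">=25"}
SBP_MERGE = {"<120": "<140", "120-129": "<140", "130-139": "<140",
             "140-159": ">=140", ">=160": ">=140"}
DBP_MERGE = {"<80": "<90", "80-89": "<90", "90-99": ">=90", ">=100": ">=90"}


def merge_level(level: str, mapping: dict[str, str]) -> str:
    """Return the merged category, or the original level if it is not merged."""
    return mapping.get(level, level)


print(temperature_group(38.1), heart_rate_group(72))
print(merge_level("10-19", AGE_MERGE), merge_level("140-159", SBP_MERGE))
```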
To analyze the primary outcome (ie, mCSS), 5601 samples were used after excluding observations with missing values. To analyze the secondary outcome (ie, the duration of hospitalization), 5387 samples were used after excluding patients who died during hospitalization. The median duration of hospitalization was 24 days. Accordingly, we classified the duration of hospitalization into two treatment groups: short-term and long-term.
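As a small illustration of a median-based split, the sketch below labels each stay as short-term or long-term. The text does not state how a stay exactly equal to the cutoff was classified, so treating only stays longer than the cutoff as long-term is an assumption, and the stay lengths are made-up values.

```python
from statistics import median


def duration_group(days: float, cutoff: float) -> str:
    """Assumption: stays strictly longer than the cutoff are long-term."""
    return "long-term" if days > cutoff else "short-term"


stays = [10, 18, 24, 24, 31, 45]  # hypothetical durations in days
cutoff = median(stays)            # 24.0 for these values
groups = [duration_group(d, cutoff) for d in stays]
print(cutoff, groups)
```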
Predictive Marker Selection Through Univariate Analysis
To identify candidate predictive markers related to the primary and secondary outcomes, univariate analysis was first performed. In the univariate analysis, mCSS was treated as a continuous variable. We assessed the association between mCSS and each predictor using the Pearson, Spearman, and Kendall rank correlation tests for continuous predictors, the two-tailed t test for binary predictors, and analysis of variance for multilevel categorical predictors. Furthermore, we performed the Cochran–Armitage trend test to identify predictors with a linear trend in mCSS. For the duration of hospitalization, we used a Cox proportional hazards (CoxPH) model to identify candidate predictive markers.
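To make the three correlation coefficients concrete, the following standard-library Python sketch computes Pearson r, Spearman ρ (Pearson on average ranks), and Kendall τ-b for a small made-up sample. It is illustrative only: it does not compute P values, the data are hypothetical, and whether the study used the τ-b variant is an assumption.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b, which corrects for ties in either variable."""
    conc = disc = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # pairs tied in both variables are ignored
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            conc += 1
        else:
            disc += 1
    denom = sqrt((conc + disc + only_x_tied) * (conc + disc + only_y_tied))
    return (conc - disc) / denom


# Hypothetical data: lymphocyte proportion (%) and mCSS (1 to 4).
lymphocytes = [35, 30, 28, 20, 15, 10]
mcss = [1, 1, 2, 3, 3, 4]
print(pearson(lymphocytes, mcss), spearman(lymphocytes, mcss),
      kendall_tau_b(lymphocytes, mcss))
```

With an ordinal outcome such as mCSS, ties are common, which is why the rank-based coefficients handle them explicitly.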
Development and Evaluation of the Prediction Model
shows the workflow for model development and evaluation. To avoid overfitting, we evaluated testing errors by splitting the total data set into training and testing data sets in a ratio of 2:1 in a stratified manner, by considering the ratio of the max CSS 4 levels and the long- or short-term group. To maintain the same scale for predictor variables, we standardized each predictor variable.
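The stratified 2:1 split and the standardization step can be sketched as follows in standard-library Python. This is not the authors' R code. The helper names are hypothetical, the toy strata are invented, and two choices are assumptions not stated in the text: the mean and SD are taken from the training set only, and the sample SD (as in R's `scale`) is used.

```python
import random
from collections import defaultdict
from statistics import mean, stdev


def stratified_split(strata, train_fraction=2 / 3, seed=0):
    """Split indices into training and testing sets within each stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for idx, stratum in enumerate(strata):
        by_stratum[stratum].append(idx)
    train, test = [], []
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum]
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return sorted(train), sorted(test)


def standardize(train_values, test_values):
    """Scale both sets with the training mean and sample SD (assumption)."""
    mu, sd = mean(train_values), stdev(train_values)
    if sd == 0:
        raise ValueError("cannot standardize a constant predictor")
    scale = lambda values: [(v - mu) / sd for v in values]
    return scale(train_values), scale(test_values)


# Hypothetical strata: (mCSS level, duration group) for 9 patients.
strata = [(1, "short")] * 6 + [(4, "long")] * 3
train_idx, test_idx = stratified_split(strata)
print(len(train_idx), len(test_idx))  # prints: 6 3

z_train, z_test = standardize([50, 60, 70], [80])
print(z_train, z_test)  # prints: [-1.0, 0.0, 1.0] [2.0]
```

Splitting within each stratum keeps the proportion of each severity level and duration group roughly equal in the training and testing sets.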
To develop models that predict the max CSS, the 4-level mCSS was combined into two levels in three ways: (1) y1: mild (mCSS=1) vs above moderate (mCSS≥2), (2) y2: below moderate (mCSS≤2) vs above severe (mCSS≥3), and (3) y3: below severe (mCSS≤3) vs critical (mCSS=4). We fit 3 logistic regression models for these binary responses. For multiple marker selection, we performed stepwise variable selection based on the area under the receiver operating characteristic curve (AUC) and least absolute shrinkage and selection operator (LASSO) regression. For both stepwise and LASSO variable selection, 5-fold cross-validation was performed. For prediction models, we considered logistic regression, random forest (RF) classification, and a support vector machine (SVM). Each prediction model was fit using the markers selected through stepwise and LASSO regression analyses. The performance of each model was evaluated on the basis of the AUC, sensitivity, and specificity. The threshold used to report sensitivity and specificity was the value that maximized balanced accuracy. All analyses were performed in R (version 3.6.1; The R Foundation).
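The three binary responses and the evaluation metrics can be illustrated with a short standard-library Python sketch. It is not the authors' implementation. The function names and the example scores are hypothetical, balanced accuracy is taken to be the mean of sensitivity and specificity, and predicting the positive class when the score is at or above the threshold is an assumption.

```python
def binary_responses(mcss: int) -> dict[str, int]:
    """Dichotomize a 4-level mCSS into the three responses y1, y2, and y3."""
    return {"y1": int(mcss >= 2), "y2": int(mcss >= 3), "y3": int(mcss == 4)}


def auc(scores, labels):
    """Rank-based AUC: probability that a positive outscores a negative."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_threshold(scores, labels):
    """Return (threshold, balanced accuracy, sensitivity, specificity)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, l in zip(scores, labels) if l == 1 and s >= t)
        tn = sum(1 for s, l in zip(scores, labels) if l == 0 and s < t)
        sens, spec = tp / n_pos, tn / n_neg
        balanced = (sens + spec) / 2
        if best is None or balanced > best[1]:
            best = (t, balanced, sens, spec)
    return best


# Hypothetical predicted probabilities and true labels for one response.
scores = [0.1, 0.2, 0.35, 0.4, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1]
print(binary_responses(3))  # prints: {'y1': 1, 'y2': 1, 'y3': 0}
print(auc(scores, labels))
print(best_threshold(scores, labels))
```

When several thresholds reach the same balanced accuracy, this sketch keeps the lowest one, which favors sensitivity; the text does not specify a tie-breaking rule.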
Demographics and Clinical Characteristics
The demographics and clinical characteristics of the 5628 patients, with particular focus on the risk predictors for mCSS or the duration of hospitalization, are presented in the table below. A complete list of all predictors is provided in Table S3. Among the patients, 1785 (31.8%) were aged over 60 years, and 2320 (41.2%) were male. In total, 1299 (29.4%) patients were overweight or obese by Asia-Pacific BMI criteria. At the time of initial admission, 1936 (35.3%) patients had an SBP of ≥140 mmHg, and 887 (15.9%) had a body temperature of ≥37.5°C. At the time of diagnosis, the patients experienced the following symptoms: fever (n=1305, 23.2%), sputum production (n=1619, 28.8%), shortness of breath (SOB) (n=666, 11.8%), and altered consciousness or confusion (ACC) (n=35, 0.6%).
The patients had the following underlying comorbidities: diabetes mellitus (DM) (n=691, 12.3%), hypertension (HTN) (n=1201, 21.4%), heart failure (HF) (n=59, 1.0%), asthma (n=128, 2.3%), and chronic obstructive pulmonary disease (COPD) (n=40, 0.7%). Initial mean laboratory values were 13.3 (SD 1.8) g/dL for hemoglobin, 39.2% (SD 5%) for hematocrit, 29.1% (SD 11.7%) for the proportion of lymphocytes, 236,733/µL (SD 82,921/µL) for the platelet count, and 6126/µL (SD 2824/µL) for the white blood cell (WBC) count.
|Variables||Value||P value for differences in the modified clinical severity score (a)||P value for differences in the duration of hospitalization (a)|
|Age (years), n (%)|| ||<.001||<.001|
|Sex, n (%)|| ||<.001||N/A (b)|
|BMI (kg/m²), n (%)|| ||.002||N/A|
|Systolic blood pressure (mmHg), n (%)|| ||<.001||.005|
|Heart rate (beats/min), n (%)|| ||.003||N/A|
|Body temperature (°C), n (%)|| ||<.001||<.001|
|Fever, n (%)|| ||<.001||<.001|
|Cough, n (%)|| ||.06||<.001|
|Sputum, n (%)|| ||.002||<.001|
|Sore throat, n (%)|| ||<.001||N/A|
|Runny nose or rhinorrhea, n (%)|| ||<.001||N/A|
|Muscle aches or myalgia, n (%)|| ||N/A||.001|
|Fatigue or malaise, n (%)|| ||<.001||.09|
|Shortness of breath or dyspnea, n (%)|| ||<.001||<.001|
|Headache, n (%)|| ||<.001||N/A|
|Altered consciousness or confusion, n (%)|| ||<.001||.04|
|Vomiting or nausea, n (%)|| ||<.001||<.001|
|Diabetes mellitus, n (%)|| ||<.001||<.001|
|Hypertension, n (%)|| ||<.001||<.001|
|Heart failure, n (%)|| ||<.001||.03|
|Chronic cardiovascular disease (except heart failure), n (%)|| ||<.001||.006|
|Asthma, n (%)|| ||.003||N/A|
|Chronic obstructive pulmonary disease, n (%)|| ||<.001||.02|
|Chronic kidney disease, n (%)|| ||<.001||N/A|
|Cancer, n (%)|| ||<.001||.07|
|Chronic liver disease, n (%)|| ||.004||N/A|
|Dementia, n (%)|| ||<.001||.002|
|Hemoglobin (g/dL), mean (SD)||13.3 (1.8)||<.001||<.001|
|Hematocrit (%), mean (SD)||39.2 (5.0)||<.001||<.001|
|Lymphocytes (%), mean (SD)||29.1 (11.7)||<.001||<.001|
|Platelet count (/μL), mean (SD)||236734 (82921)||<.001||<.001|
|White blood cell count (/μL), mean (SD)||6126 (2824)||<.001||N/A|
(a) P values were obtained through Pearson correlation analysis for the modified clinical severity score and with the Cox proportional hazards model for the duration of hospitalization.
(b) N/A: not applicable.
The Severity of COVID-19
Based on the severity of COVID-19, determined from the mCSS, patients were divided into four levels: mild (n=4455, 79.5%), moderate (n=330, 5.9%), severe (n=512, 9.1%), and critical (n=304, 5.4%). Patients aged ≥60 years accounted for 332 (64.8%) and 267 (87.8%) of the patients in the severe and critical levels, respectively. Specifically, patients aged 60-69 years accounted for 135 (26.4%) and 58 (19.1%) of the severe and critical levels, respectively; those aged 70-79 years accounted for 125 (24.4%) and 89 (29.3%); and those aged ≥80 years accounted for 72 (14.1%) and 120 (39.5%). Furthermore, with respect to the duration of hospitalization, patients aged ≥60 years were more frequently found in the long-term treatment group than in the short-term treatment group.
The table below shows the association between the 30 key prediction markers and the mCSS, determined through univariate analysis at a 5% significance level. Patients with an older age; high BMI; SBP of ≥140 mmHg; high heart rate; body temperature of ≥37.5°C; 6 subjective clinical findings including fever, sputum, fatigue or malaise, SOB, ACC, and vomiting or nausea (VN); or 10 comorbidities including DM, HTN, HF, chronic cardiovascular disease (CCD), asthma, COPD, chronic kidney disease, cancer, chronic liver disease, and dementia were likely to have a higher risk of severe disease. Men were found to be at a higher risk of having a high mCSS than women (P<.001). Furthermore, patients with a high WBC count or low values of 4 laboratory findings (hemoglobin, hematocrit, lymphocytes, and platelets) tended to be at a higher risk of severe disease (P<.001).
|Variable||t test (or ANOVA), t (or F)||t test (or ANOVA), P value||Cochran–Armitage trend test, T||Cochran–Armitage trend test, P value||Pearson correlation analysis, r||Pearson correlation analysis, P value||Spearman correlation analysis, ρ||Spearman correlation analysis, P value||Kendall rank correlation analysis, τ||Kendall rank correlation analysis, P value|
|Systolic blood pressure||0.11||<.001||N/A||<.001||0.06||<.001||0.05||<.001||0.05||<.001|
|Heart rate||N/A||<.001||N/A||N/A||0.04||.003||0.03||.03||0.03||.03|
|Runny nose or rhinorrhea||–0.19||<.001||1||N/A||–0.07||<.001||–0.07||<.001||–0.06||<.001|
|Fatigue or malaise||0.24||<.001||N/A||<.001||0.06||<.001||0.05||<.001||0.05||<.001|
|Shortness of breath||0.95||<.001||N/A||<.001||0.36||<.001||0.31||<.001||0.30||<.001|
|Altered consciousness or confusion||1.98||<.001||N/A||<.001||0.18||<.001||0.14||<.001||0.14||<.001|
|Vomiting or nausea||0.26||<.001||N/A||<.001||0.06||<.001||0.06||<.001||0.06||<.001|
|Chronic cardiovascular disease||0.6||<.001||N/A||<.001||0.12||<.001||0.12||<.001||0.11||<.001|
|Chronic obstructive pulmonary disease||0.93||<.001||N/A||<.001||0.09||<.001||0.08||<.001||0.08||<.001|
|Chronic kidney disease||1.19||<.001||N/A||<.001||0.14||<.001||0.12||<.001||0.12||<.001|
|Chronic liver disease||0.28||.02||N/A||.002||0.04||.004||0.04||.005||0.04||.005|
|White blood cells||N/A||N/A||N/A||N/A||0.12||<.001||0.04||.02||0.03||.02|
(a) For each test, a positive coefficient indicates that the predictor is positively associated with an increase in clinical severity.
(b) N/A: not applicable.
(c) Sex: female=1, male=0; clinical findings or comorbidities: yes=1, no=0.
The results of the univariate analysis for the duration of hospitalization are shown in Table S4. We identified 20 key prediction markers associated with the duration of hospitalization. Patients with an older age; SBP of ≥140 mmHg; body temperature of ≥37.5°C; 7 subjective clinical findings including fever, cough, sputum, muscle aches or myalgia, SOB, ACC, and VN; 6 comorbidities including DM, HTN, HF, CCD, COPD, and dementia; or low values of blood parameters including hemoglobin, hematocrit, lymphocytes, and platelets tended to have a long duration of hospitalization.
Development and Evaluation of the Prediction Model
To develop prediction models for mCSS and the duration of hospitalization, we selected multiple markers using AUC-based stepwise selection and the LASSO method. For the application of statistical and machine learning models, we defined three binary response variables by regrouping the 4 levels of mCSS into two levels as follows: (1) y1: mild (mCSS=1) vs above moderate (mCSS≥2), (2) y2: below moderate (mCSS≤2) vs above severe (mCSS≥3), and (3) y3: below severe (mCSS≤3) vs critical (mCSS=4). The table below shows the results of variable selection and evaluation for each y. Details regarding the selected variables are provided in Table S5. For each case, we aimed to develop a parsimonious model with high predictive power. The variables selected through the LASSO method were used in the final model for each y. For y1, predictors including older age, high body temperature, SOB, low lymphocyte value, and low platelet count were selected as risk factors. The prediction model with these 5 predictors had an AUC of ≥0.83 (ie, AUC=0.830, sensitivity=0.710, and specificity=0.843 for the RF model). For y2, older age, high body temperature, SOB, low hematocrit and lymphocyte values, and low platelet count were selected as risk factors. The prediction model with these 6 predictors yielded an AUC of ≥0.865 (ie, AUC=0.865, sensitivity=0.772, and specificity=0.842 for the RF model). For y3, older age, SOB, high WBC count, low hemoglobin and lymphocyte values, and low platelet count were selected as risk factors. The prediction model with these 6 predictors yielded an AUC of ≥0.933 (ie, AUC=0.933, sensitivity=0.895, and specificity=0.865 for the RF model).
Based on these 3 prediction models, we developed a prognostic nomogram to predict the mCSS for each patient. The nomogram is available on the internet for clinical use . shows an example of the developed nomogram. The fitted results of the logistic model used to develop the nomogram are shown in , Table S6. Based on the standardized β coefficients of the fitted results, we ranked the importance of the predictors for each model ( ) [ ]. In , the x-axis represents the standardized β coefficient, and the relative importance of the predictors is shown in descending order for each model. In all 3 prediction models, the SOB ranked first. The temperature selected in the 2 prediction models ranked second in both models, y1 and y2. In all 3 prediction models, lymphocytes ranked third.
We performed similar analyses for the duration of hospitalization. The results of variable selection and evaluation are summarized in Table S7. Stepwise selection yielded 13 predictors: age, hematocrit, cough, fatigue or malaise, platelets, muscle aches or myalgia, dementia, asthma, VN, lymphocytes, WBC count, diarrhea, and body temperature. This model yielded an AUC of ≥0.601. With the LASSO method, only age was selected, and the prediction model yielded an AUC of up to 0.571.
|Response||Variable selection method||Variables, n||Sample size, training||Sample size, testing||Model||Training, area under the curve||Training, sensitivity||Training, specificity||Testing, area under the curve||Testing, sensitivity||Testing, specificity|
|Support vector machine||0.87||0.783||0.806||0.856||0.83||0.75|
|y1||Least absolute shrinkage and selection operator||5||2686||1354||Logistic regression||0.85||0.755||0.794||0.847||0.745||0.812|
|Support vector machine||0.854||0.77||0.787||0.848||0.739||0.84|
|Support vector machine||0.887||0.834||0.771||0.854||0.868||0.687|
|y2||Least absolute shrinkage and selection operator||6||2683||1348||Logistic regression||0.881||0.757||0.832||0.877||0.812||0.793|
|Support vector machine||0.886||0.835||0.768||0.879||0.812||0.816|
|Support vector machine||0.933||0.944||0.789||0.935||0.842||0.895|
|y3||Least absolute shrinkage and selection operator||6||2691||1357||Logistic regression||0.923||0.86||0.873||0.944||0.884||0.88|
|Support vector machine||0.918||0.812||0.918||0.943||0.874||0.906|
(a) y1: mild vs above moderate.
(b) y2: below moderate vs above severe.
(c) y3: below severe vs critical.
In this study, we retrospectively assessed the characteristics of 5628 patients with COVID-19 from multiple hospitals in Korea and identified the risk factors for predicting the maximum clinical severity and duration of hospitalization. Older patients aged >60 years accounted for 31.8% of the total, and patients with mild disease accounted for 79.5% of the total. Through univariate analysis for each outcome, we identified 30 risk factors for mCSS and 20 risk factors for the duration of hospitalization. Common risk factors between mCSS and the duration of hospitalization included age, SBP, body temperature, fever, sputum, SOB, ACC, VN, DM, HTN, HF, CCD, COPD, dementia, hemoglobin, hematocrit, lymphocytes, and platelets.
We successfully developed 3 prediction models for mCSS by combining the 4 levels of mCSS into 2 levels and developed a web-based nomogram using these models. Our results indicate that older age, high body temperature, SOB, lymphopenia, low hematocrit, low hemoglobin, a low platelet count, and a high WBC count were risk factors associated with a higher maximum clinical severity of COVID-19. These 8 variables have been reported as important predictor variables in the medical literature. Specifically, age, shortness of breath, body temperature, lymphocytes, and hemoglobin have been reported as variables for predicting admission to the intensive care unit, critical illness, or severe disease. In particular, Wu et al reported that the severe group had significantly lower platelet counts and higher WBC counts than the nonsevere group. Furthermore, Zhang et al reported that hematocrit was significantly lower in the severe group than in the nonsevere group. Our study provides a list of useful risk predictors that can be widely used in a large health care organization during the pandemic.
With an increase in the number of confirmed patients, the number of severely symptomatic patients is also increasing, thus posing a challenge to the management of patients with severe disease during COVID-19 outbreaks. The wide range of outcomes observed, ranging from subpopulations that are mainly asymptomatic to those with markedly high fatality rates, calls for risk stratification. Timely identification of patients at a high risk of developing acute respiratory distress syndrome or multiple organ failure, together with risk-stratified management, can facilitate more personalized treatment plans and optimized use of medical resources and help prevent further deterioration. To identify individuals at a high risk of severe disease, the Centers for Disease Control and Prevention defined the following criteria: age ≥65 years, living in nursing homes, and having at least one underlying comorbidity including chronic lung disease, serious heart conditions, severe obesity, diabetes, chronic kidney disease, liver disease, or an immunocompromised status.
Older age and male sex, identified as risk factors of severe COVID-19 in our study, have been previously confirmed as risk factors in other countries. An elevation in body temperature results from the progression of the infection; hence, if the body temperature is high (≥37.5°C), the prognosis is likely to be poor. In addition, shortness of breath can be considered a symptom that occurs in the course of the disease, since COVID-19 is a respiratory disease. Among the hematologic abnormalities we observed, we consider 2 variables: lymphocytes and platelets. Lymphopenia and immune dysregulation may affect disease severity, especially because SARS-CoV-2 can directly infect T-lymphocytes, which may be the underlying mechanism of lymphopenia. Regarding platelet abnormalities, the development of autoimmune antibodies or immune complexes induced by viral infection may play an important role in inducing thrombocytopenia. In addition, SARS-CoV-2 can directly infect hematopoietic stem or progenitor cells, megakaryocytes, and platelets to inhibit growth and induce apoptosis; furthermore, increased platelet consumption or decreased platelet production in damaged lungs is a potential alternative mechanism that may contribute to thrombocytopenia in severe or critical pulmonary conditions.
Wynants et al reviewed 50 COVID-19 prediction models and reported that most of the models have a high risk of bias when evaluated with the prediction model risk of bias assessment tool. They found that 2 common causes of bias in prediction models for COVID-19 were the lack of external validation and selection bias. Our study also has these limitations. Since the cohort of patients with COVID-19 in this study includes those whose clinical course has not yet been completed and those who may still potentially develop severe disease, there is a chance that discharged patients without any indication of severe disease during hospitalization would later develop severe disease outside of hospital. In addition, our model was not validated with an external cohort (including foreign data), even though we divided the cohort into training and testing sets to evaluate the predictive power of the developed models. This limitation is mainly due to the restricted research environment and the limited time provided by the KDCA to prevent data leakage. Another study limitation is that the data did not include smoking status, which is a very important aspect of an individual's lifestyle, and medication history, especially the history of taking corticosteroids, was not identified. This is an important factor that is closely associated with the exacerbation of the clinical course of COVID-19. The KDCA did not provide information on the smoking status of these individuals because these data were largely missing. In the future, more variables from a larger set of patients with COVID-19 are expected to be included in the data set to increase the accuracy of the analysis.
Recently, these prediction tools have been presented in various ways worldwide, but variables with predictive power are identified slightly differently depending on the characteristics of the study population, including nationality and race. In addition, these prediction models can be updated in the current situation where the number of patients continues to rise. Therefore, to develop a model with higher predictive power, it is necessary to constantly compare and validate the results of various studies.
In this study, we developed models that predict the clinical severity of patients with COVID-19. Compared to previous studies that focused on predicting admission to the intensive care unit, critical illness, or severe disease, our model used the largest cohort and showed higher performance, even with a limited number of laboratory variables. Specifically, the model for predicting the critical group yielded an AUC of >0.93. Furthermore, we developed a web-based nomogram that can be easily applied visually.
These models are expected to be used as decision supporting tools at the initial stage of treatment; that is, they can be used to predict patients who might need intensive care owing to deterioration among most patients hospitalized with mild or asymptomatic conditions. They can also help hospitals that manage in-patients acquire and use facilities such as negative pressure beds, mechanical ventilation systems, and extracorporeal membrane oxygenation equipment that must be provided to patients with severe symptoms. If further validated through a prospective study, our prediction model might serve for both rationing decisions at health care levels and selecting patients for randomized controlled trials on new treatment options.
We thank the members of the Korea Centers for Disease Control and Prevention who provided valuable COVID-19 data during the study. This study was supported by a grant for COVID-19 from the Korea Centers for Disease Control and Prevention (grant# 2020061277D).
BO, SH, TJ, and TP conceived and designed the study. SH and TJ contributed to data analysis and generated the tables and figures. CL developed the web-based nomogram. BO and SH drafted the manuscript and contributed to the literature search. BO, SH, TJ, SK, and TP interpreted the data. All authors critically reviewed and approved the final version of the manuscript.
Conflicts of Interest
Supplementary tables (DOC file, 249 KB).
Abbreviations
ACC: altered consciousness or confusion
AUC: area under the receiver operating characteristic curve
CCD: chronic cardiovascular disease
COPD: chronic obstructive pulmonary disease
CoxPH: Cox proportional hazards
CSS: clinical severity score
DBP: diastolic blood pressure
DM: diabetes mellitus
HF: heart failure
KDCA: Korea Disease Control and Prevention Agency
LASSO: least absolute shrinkage and selection operator
mCSS: modified CSS
RF: random forest
SBP: systolic blood pressure
SOB: shortness of breath
VN: vomiting or nausea
WBC: white blood cell
Edited by C Basch; submitted 18.11.20; peer-reviewed by W Han, M Basit, S Sabarguna; comments to author 14.01.21; revised version received 04.02.21; accepted 18.03.21; published 16.04.21
Copyright
©Bumjo Oh, Suhyun Hwangbo, Taeyeong Jung, Kyungha Min, Chanhee Lee, Catherine Apio, Hyejin Lee, Seungyeoun Lee, Min Kyong Moon, Shin-Woo Kim, Taesung Park. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 16.04.2021.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.