Model for Predicting In-Hospital Mortality of Physical Trauma Patients Using Artificial Intelligence Techniques: Nationwide Population-Based Study in Korea

Background Physical trauma–related mortality places a heavy burden on society. Estimating the mortality risk in physical trauma patients is crucial to enhance treatment efficiency and reduce this burden. The most popular and accurate model is the Injury Severity Score (ISS), which is based on the Abbreviated Injury Scale (AIS), an anatomical injury severity scoring system. However, the AIS requires specialists to code the injury scale by reviewing a patient's medical record; therefore, applying the model to every hospital is impossible. Objective We aimed to develop an artificial intelligence (AI) model to predict in-hospital mortality in physical trauma patients using the International Classification of Diseases 10th Revision (ICD-10), triage scale, procedure codes, and other clinical features. Methods We used the Korean National Emergency Department Information System (NEDIS) data set (N=778,111) compiled from over 400 hospitals between 2016 and 2019. To predict in-hospital mortality, we used the following as input features: ICD-10 codes, patient age, gender, intentionality, injury mechanism, emergent symptom, Alert/Verbal/Painful/Unresponsive (AVPU) scale, Korean Triage and Acuity Scale (KTAS), and procedure codes. We proposed an ensemble of deep neural networks (EDNN) via 5-fold cross-validation and compared it with other state-of-the-art machine learning models, including traditional prediction models. We further investigated the effect of the features. Results Our proposed EDNN with all features provided the highest area under the receiver operating characteristic curve (AUROC) of 0.9507, outperforming other state-of-the-art models, including the following traditional prediction models: Adaptive Boosting (AdaBoost; AUROC of 0.9433), Extreme Gradient Boosting (XGBoost; AUROC of 0.9331), ICD-based ISS (AUROC of 0.8699 for an inclusive model and AUROC of 0.8224 for an exclusive model), and KTAS (AUROC of 0.1841).
In addition, using all features yielded a higher AUROC than using any subset of features: EDNN with ICD-10 features only provided an AUROC of 0.8964, and EDNN with all features excluding ICD-10 provided an AUROC of 0.9383. Conclusions Our proposed EDNN with all features outperforms other state-of-the-art models, including the traditional diagnostic code-based prediction model and triage scale.


Introduction
Physical trauma-related mortality places a heavy burden on individuals and society. Accurately estimating mortality risk enhances treatment efficiency and reduces this burden. To date, various models have been developed to predict the severity of physical trauma [1][2][3][4][5][6][7]. Among them, the most popular and accurate model is the Injury Severity Score (ISS), developed in the 1970s and based on the Abbreviated Injury Scale (AIS), an anatomical injury severity scoring system [1,8]. However, the AIS requires specialists to code the injury scale by reviewing a patient's medical record; therefore, applying the model to every hospital is impossible. To overcome these shortcomings, the following International Classification of Diseases (ICD)-based severity models have been introduced: the ICD-based Injury Severity Score (ICISS) [9], trauma mortality prediction models using the International Classification of Diseases 10th Revision (ICD-10) (TMPM-ICD10) [10], and the Excess Mortality Ratio-adjusted Injury Severity Score (EMR-ISS) [11]. However, ICD-based models are not as accurate as AIS-based models [8]. Since 2016, all emergency medical institutions in Korea have used the Korean Triage and Acuity Scale (KTAS), an emergency department (ED) triage system composed of 5 levels [12]. However, the KTAS relies on the practitioner's judgment and may introduce bias and be prone to human error [13].
Artificial intelligence (AI) is widely used to find complex associations between various features in medical applications [14][15][16], such as individual injuries and mortality. We recently proposed AI technology utilizing AIS codes that outperformed conventional ISS [1], providing a favorable area under the receiver operating characteristic (AUROC) of 0.908 [17]. Tran et al [18] also used AI technology for mortality prediction using the ICD-10 from the National Trauma Database (NTDB) data set, but the AUROC value was not as high as that of our previous proposed AI model.
We aimed to construct an AI model to predict in-hospital mortality in physical trauma patients using the National Emergency Department Information System (NEDIS) data set. We hypothesized that an AI model based on ICD-10 combined with other clinical features is a useful alternative. We compared the predictive performance of our model with that of other ICD-10-based models, such as the ICISS [9], the EMR-ISS [11], and an AI-driven model based on ICD-10 alone. Finally, we deployed a public website with our AI model to predict in-hospital mortality in physical trauma patients for the benefit of end users.

Ethics Approval
This study was conducted according to the TRIPOD (Transparent Reporting of a Multivariable Model for Individual Prognosis or Diagnosis) statement [19]. NEDIS data were provided by the National Emergency Medical Center (data acquisition number N20212920825).

Patients and Data Set for AI Model
The NEDIS data set was collected mandatorily from 2016 to 2019 from over 400 hospitals in South Korea. The inclusion criteria were as follows: (1) physical trauma patients (but not psychological) with a diagnostic code of S or T based on the Korean version of the ICD-10; (2) patients admitted to the intensive care unit (ICU) or general ward from the ED; and (3) patients admitted to the ICU or general ward after surgery or a procedure from the ED. The exclusion criteria were as follows: (1) patients without diagnostic codes starting with S or T (all physical trauma patients carry an S or T code, eg, S001 or T063; S codes denote trauma to a single body region, and T codes denote trauma to multiple or unspecified regions); (2) patients with diagnostic codes for frostbite (T33-T35.6), intoxication (T36-T65), or unspecified injury or complication (T66-T78, T80-T88); (3) patients transferred to another hospital or discharged from the ED after treatment; (4) patients transferred to another hospital or discharged without notifying hospital staff; (5) patients who died in the ED before ICU or general ward admission; and (6) patients with missing information.
More specifically, we first collected 7,664,443 patients with a nondisease identifier, comprising trauma patients. Since our primary outcome was in-hospital mortality in trauma patients, we had to exclude unrelated patients. We first excluded all nonhospitalized patients (n=6,464,432, 84.34%). The second most commonly excluded data were from patients transferred to another hospital (n=241,778, 3.15%); for transferred patients, the NEDIS deidentification policy assigns a new anonymous ID number, making the data redundant. In addition, we excluded patients who died in the ED (n=49,357, 0.64%) due to insufficient information about diagnostic codes, procedure codes, and other clinical features. Moreover, we excluded patients who left during hospitalization (n=889, 0.01%) and patients with missing data other than mortality information (n=35,885, 0.47%). A final total of 778,111 patient records were used for training and testing our AI model (Figure 1). We used the following variables in the NEDIS data: age, gender, intentionality, injury mechanism, emergent symptom, Alert/Verbal/Painful/Unresponsive (AVPU) scale, initial KTAS, altered KTAS, ICD-10 codes, procedure codes of surgical operation or interventional radiology, and in-hospital mortality. All included variables for the AI model are summarized in Table 1. A total of 914 AI model input features (categories) were derived from 10 variables. The AVPU scale is a simplified version of the Glasgow Coma Scale (GCS) [20,21] and includes 4 categories: A, alert; V, verbal responsive (drowsy); P, painful response (stupor, semicoma); and U, unresponsive (coma). KTAS was developed in 2012 as a severity triage system for the ED, based on the Canadian Triage and Acuity Scale (CTAS) [12]. KTAS is a standardized triage tool designed to avoid complexity and ambiguity and includes 5 levels: level 1, resuscitation; level 2, emergent; level 3, urgent; level 4, less urgent; and level 5, nonurgent.
According to NEDIS policy, KTAS should be conducted by certified faculty, and the initial KTAS should be assessed within 2 minutes of ED admission. The altered KTAS should be assessed when the ED patient deteriorates before moving to the operating room, ICU, or general ward. Regarding ICD-10, we considered 856 codes starting with S or T. The procedure codes, which are used for reimbursement claims to the National Health Insurance Review and Assessment Service, include surgery and angioembolization and are categorized as follows (summarized in Table S1 in Multimedia Appendix 1): (1) head procedure; (2) torso procedure-vascular; (3) torso procedure-abdomen; (4) torso procedure-chest; (5) torso procedure-heart; and (6) extracorporeal membrane oxygenation (ECMO). The primary outcome was in-hospital mortality, defined as a patient with a dead result code or discharged with medical futility in NEDIS.

Data Split, Data Balancing, and Cross-Validation
The data set in this study comprised both training and testing data (Table S2 in Multimedia Appendix 1). Data from 778,111 patients were divided into training and testing data with a ratio of 8:2 in a stratified fashion. The testing set was used only to independently test our developed AI model and not for training or internal validation.
We first performed 5-fold cross-validation using the training data to confirm the model's generalization ability. The training data set (n=622,488, 80%) was randomly shuffled and stratified into 5 equal groups, of which 4 groups were selected for training the model, and the remaining group was used for internal validation. This process was repeated 5 times by shifting the internal validation group. Our finalized AI model is described in the subsequent sections and was used to evaluate performance using the isolated testing data.
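The stratified 5-fold assignment described above can be sketched in pure Python; this is illustrative only (the study does not specify its implementation), with `labels` standing in for the in-hospital mortality flag:

```python
import random

def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds so that every fold
    preserves the overall class ratio (stratification)."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)             # random shuffle within each class
        for i, idx in enumerate(indices):
            folds[i % k].append(idx)     # deal round-robin across folds
    return folds

# Toy example mimicking the heavy class imbalance (2% positives).
labels = [1] * 20 + [0] * 980
folds = stratified_folds(labels, k=5)
for f in folds:
    positives = sum(labels[i] for i in f)
    print(len(f), positives)  # each fold: 200 samples, 4 positives
```

Each of the 5 folds keeps the same 2% positive rate as the full set, so every internal validation split sees a representative share of deceased patients.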
Since the number of surviving patients (n=611,481, 98.23%) was much higher than that of deceased patients (n=11,007, 1.77%), we upsampled the deceased (minority) patient data using the Synthetic Minority Oversampling Technique (SMOTE) during the model update [22]. By balancing the 2 groups, we prevented bias toward the surviving patient data.
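SMOTE generates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbors. A minimal NumPy sketch of that idea follows (the study would have used a full implementation such as imbalanced-learn; the data here are made up):

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples: pick a minority point,
    pick one of its k nearest minority neighbors, and interpolate a
    random fraction of the way between them."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Euclidean distances from point i to every minority point.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbors)
        gap = rng.random()                   # interpolation fraction in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Toy minority class (eg, deceased patients) in a 2D feature space.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_sketch(X_min, n_new=10, k=3)
print(X_new.shape)  # (10, 2)
```

Because each synthetic point is a convex combination of two real minority points, the new samples stay inside the minority class's region of the feature space rather than being mere duplicates.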

Feature Analysis
To analyze the effects of the 914 features on mortality prediction, we applied 3 machine-learning algorithms: Adaptive Boosting (AdaBoost) [23], Extreme Gradient Boosting (XGBoost) [24], and light gradient boosting machine (LightGBM) [25]. We also considered 4 ensemble models: AdaBoost with XGBoost, AdaBoost with LightGBM, XGBoost with LightGBM, and a combination of all 3 models. Finally, among the 7 machine learning models, we chose the best prediction model and presented its feature importance analysis, listing features in order of their contribution to the mortality prediction.
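On a synthetic toy problem, the feature-importance step might look like the following; the feature names, data, and thresholds are illustrative stand-ins, not the NEDIS features:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(42)
n = 1000
# Three hypothetical features: only the first two drive the outcome.
age = rng.random(n)
coma = rng.integers(0, 2, n).astype(float)
noise = rng.random(n)
X = np.column_stack([age, coma, noise])
y = ((age > 0.7) | (coma == 1)).astype(int)  # toy mortality rule

model = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
for name, imp in zip(["age", "coma", "noise"], model.feature_importances_):
    print(f"{name}: {imp:.3f}")  # informative features dominate the ranking
```

The normalized importance values sum to 1, which is the form reported in the ranked feature-importance table: uninformative features end up near zero, just as 843 of the 914 features did in the study.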

AI Prediction Model Development and Statistical Analysis
We developed a deep neural network (DNN)-based AI model using 914 features, including ICD-10 as an input layer. To find the best model, we searched hyperparameters, such as layer depth and width for fully connected (FC) layers. The last FC output layer was fed into a sigmoid layer, which provided the mortality probability. After the hyperparameter search, we found the best model with a 9-layer DNN, which comprised an input layer, 7 FC layers as hidden layers, and an output layer. The input layer was fed into a series of 7 FC layers, consisting of 512, 256, 128, 64, 32, 16, and 8 nodes, respectively. We applied dropout with a rate of 0.3 and L2 regularization for the FC hidden layers. Figure 2 shows the process flow of the AI development and DNN architecture. The prediction performance of our proposed 9-layer DNN model was evaluated with 5-fold cross-validation. Subsequently, for the final DNN-based AI model, we adopted an ensemble approach to combine the 5 models from the 5-fold cross-validation. The 914 features were inputs to 5 cross-validation models, and each provided mortality probabilities. A total of 5 probabilities were averaged, known as soft voting. Based on the ensemble DNN model, the prediction performance was evaluated with the isolated testing data set (n=155,623, 20%).
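The ensemble's soft voting reduces to averaging the 5 folds' sigmoid outputs. A NumPy sketch of the inference path follows; random weights stand in for the trained parameters, dropout is inactive at inference, and ReLU activations are assumed for the hidden layers (the text does not name the activation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dnn_forward(x, weights, biases):
    """Forward pass through the 7 hidden FC layers and the sigmoid
    output layer described in the text."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ W + b)                # FC + ReLU hidden layer
    return sigmoid(h @ weights[-1] + biases[-1])      # mortality probability

rng = np.random.default_rng(0)
sizes = [914, 512, 256, 128, 64, 32, 16, 8, 1]  # input, 7 hidden FC, output
x = rng.random(914)                             # one patient's 914 features

# Soft voting: average the probabilities of the 5 cross-validation models.
probs = []
for _ in range(5):
    weights = [rng.normal(0, 0.05, (a, b)) for a, b in zip(sizes, sizes[1:])]
    biases = [np.zeros(b) for b in sizes[1:]]
    probs.append(dnn_forward(x, weights, biases))
p_mortality = float(np.mean(probs))
print(round(p_mortality, 3))
```

Averaging the 5 probabilities (rather than majority-voting the hard labels) preserves each model's confidence, which is what allows the ensemble to smooth out fold-specific errors.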
We trained the models with the Adam optimizer and a binary cross-entropy cost function, with a learning rate of 0.001 and a batch size of 32. We implemented the models using Python (version 3).

Conventional Metrics Based on Diagnostic Code
We applied conventional metrics based on ICD-10. The ICISS utilizes survival risk ratios (SRRs) to calculate the probability of survival [9]. An SRR is defined as the number of surviving patients with a specific injury code divided by the number of all patients with that same injury code. A patient's probability of survival (Ps) is determined by multiplying all SRRs of the patient's injury codes [9]. The traditional ICISS was calculated as the product of the SRRs for up to 10 injuries [26]. Two different methods were used to calculate the ICISS. First, the inclusive SRR was calculated for each injury irrespective of associated injuries [9]. Second, the exclusive SRR was calculated as the number of survivors who had an isolated specific injury divided by the total number of patients who had only that injury [9]; thus, patients with multiple injuries were excluded from the calculation of the exclusive SRR [9]. Regarding the EMR-ISS, an injury severity grade similar to the AIS was derived from ICD-10 codes based on the quintile of the excess mortality ratio (EMR) for each ICD-10 code [11]. The EMR-ISS was calculated from the 3 maximum severity grades using data from the National Health Insurance data set, the Industrial Accident Compensation Insurance data set, and the National Death Certificate database from 2001 to 2003 [11].
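The inclusive-SRR arithmetic above can be illustrated directly; the ICD-10 codes and counts below are made up for the example:

```python
from collections import Counter
from math import prod

# Each record: (list of ICD-10 injury codes, survived flag). Toy data.
records = [
    (["S065"], True), (["S065"], True), (["S065"], False), (["S065"], True),
    (["S270"], True), (["S270"], False),
    (["S065", "S270"], True), (["S065", "S270"], False),
]

# Inclusive SRR: survivors with a code / all patients with that code,
# irrespective of associated injuries.
total, surv = Counter(), Counter()
for codes, survived in records:
    for c in set(codes):
        total[c] += 1
        surv[c] += survived
srr = {c: surv[c] / total[c] for c in total}
print(srr)  # S065: 4/6, S270: 2/4

# Probability of survival: product of the SRRs of the patient's codes.
ps = prod(srr[c] for c in ["S065", "S270"])
print(round(ps, 4))  # (4/6) * (2/4) = 0.3333
```

The exclusive SRR would restrict both numerator and denominator to patients with exactly one injury code, so the last two (multi-injury) records would be dropped from the counts.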

Initial Findings
Of the 778,111 patients included in the final analysis, 13,760 (1.77%) died during hospitalization (13,667 had a deceased code, and 93 were discharged with a medical futility code). Table 2 shows a comparison of included variables between deceased and surviving patients, and Table S3 in Multimedia Appendix 1 shows the ICD-10 comparison between deceased and surviving patients.

Ranked Feature Importance: Explainable AI
To analyze the effects of features, we first applied the data to 3 different machine learning algorithms: AdaBoost, XGBoost, and LightGBM. As summarized in Table 3, the AdaBoost model was the best classifier for predicting mortality in trauma patients. We then performed the feature importance analysis (see Figure 3 for ranked normalized feature importance) to confirm the contribution of each feature. Based on the AdaBoost model, gender had the highest importance value, followed by age, unresponsive (coma), S721 (pertrochanteric fracture of the femur), S720 (fracture of neck of femur), painful response (stupor, semicoma), injury mechanism-slip down, and torso procedure-chest. Among the 914 features, only 71 (7.77%) had nonzero values, indicating that the other 843 features did not contribute to the AdaBoost mortality prediction. Table S4 in Multimedia Appendix 1 shows the complete ranked normalized feature importance values. All of the highest-ranked features showed a statistically significant difference between the deceased and surviving groups (Table 2 and Table S3 in Multimedia Appendix 1).

Cross-Validation Result of DNNs Using a Different Set of Features According to Importance
We investigated the cross-validation performance of our DNN model with 2 input conditions: (1) the top 71 features having nonzero feature importance values from AdaBoost, the best among the machine learning models; and (2) all 914 features (Table S5 in Multimedia Appendix 1). The DNN with all 914 features provided a balanced accuracy of 0.8718 and an AUROC of 0.9513, compared with a balanced accuracy of 0.8389 and an AUROC of 0.9386 for the DNN with the top 71 features. This shows that even features with zero AdaBoost importance values can still contribute to the DNN's mortality prediction. Specifically, with all 914 features, sensitivity increased from 0.7480 to 0.8599 (a gain of more than 0.1), whereas specificity decreased only from 0.9299 to 0.8838 (a loss of less than 0.05). Therefore, we used all features in our AI model and validated the performance with the isolated testing data.

Testing Data Results
With the testing data set (n=155,623), our proposed ensemble DNN with all 914 features provided the most accurate prediction results (Table 4). Models with the 48 features excluding ICD-10 provided the next most accurate prediction results. These results showed the same trend as the cross-validation results. Figure 4 shows the AUROC curves for our model, AdaBoost, XGBoost, and LightGBM, plotted for the following feature sets: all 914 features, 48 features excluding ICD-10, and 866 features with ICD-10 only. Our model outperformed traditional methods such as inclusive SRR, exclusive SRR, EMR-ISS, and KTAS. Figure 5 shows the AUROC curves for our model and the 4 traditional models. The calculated inclusive SRR and exclusive SRR are shown in Table S6 in Multimedia Appendix 1. Finally, the model using the top 71 features from AdaBoost again provided a lower balanced accuracy of 0.8245 and AUROC of 0.9194, consistent with the cross-validation results.

AI-Driven Public Website Development
We deployed our AI model on a public website [27] to make the mortality prediction results for trauma patients publicly accessible (Figure S1 in Multimedia Appendix 1). Figure S1(a) shows the web interface where a user enters information. A user inputs age, gender, intentionality, injury mechanism, emergent symptoms, AVPU scale, initial KTAS, altered KTAS, torso procedures (chest, abdomen, vascular, and heart), head surgery, ECMO, and ICD-10 codes. For ICD-10 codes in particular, a user can enter multiple codes separated by commas (eg, S072, S224, T083). As shown in Figure S1(b), after entering the information in the web application, the user immediately obtains the mortality results. The prediction results also include the probability of mortality.

Principal Findings
Our AI model outperformed traditional ICD-10-based models and KTAS. Traditional methods produced high sensitivity and low specificity, with substantial bias in predicting mortality.
Prediction performance was optimal when all features, including ICD-10, were used as inputs. The similarity between the cross-validation results and the testing results indicates that overfitting or underfitting was minimal. In terms of ranked normalized feature importance, gender had the highest value, followed by age, coma, femur fracture, stupor, slip down, rib fracture, and head procedure. We used a population-based data set covering all types of ED in South Korea, producing more robust and reliable results. To the best of our knowledge, our study is the first to demonstrate an AI model that drastically outperforms conventional ICD-based models and triage scales using a population-based data set. Our future goal is to construct a more comprehensive model incorporating both NEDIS-based and AIS-based AI [17]. Our proposed AI model has several advantages in clinical practice. First, a specialist is not required for AIS coding, so our AI model imposes no additional coding burden. Second, our AI model demonstrates the ability to augment the KTAS provider's decision. Third, the feature importance analysis may benefit clinical decision-making and future research: deep learning is generally considered a "black box," so a feature importance analysis based on a machine learning algorithm provides meaningful insight to clinicians and researchers. Finally, we aspire to the global application of our model and have produced a publicly available web application for hospitals to utilize for the benefit of the entire trauma system [28,29].
Currently, the ISS and ICISS are the most popular risk estimation models for trauma-related mortality. More complex models containing physiologic and demographic parameters are available [2,4,5,7], but none supersedes the ISS or ICISS [1,9]. The ISS is simple to use, but AIS coding is time consuming and expensive, whereas the ICISS utilizes the diagnostic codes already recorded for insurance claims. Therefore, the ICISS is more useful for population-based data sets than the ISS [8]. The ICISS results in our study were comparable to those from previous studies [26,30]. We also applied the EMR-ISS to the NEDIS data set; it showed good performance in a previous study [11] but poor accuracy here.
Recently, several AI models have been proposed to predict trauma-related mortality. Previously, in a multicenter retrospective study in South Korea, we investigated a deep learning model using AIS codes for predicting mortality [17]. We reanalyzed the ISS system and redefined 46 new regions to discriminate the risk among different internal organs. The DNN with 46 features from the 46 new regions produced the highest accuracy, showing that an AI model can augment the performance of the AIS system. Recently, Tran et al [18] reported a machine-learning model that predicted trauma-related mortality using ICD-10. The authors used the NTDB data set and compared machine learning with the ISS and TMPM-ICD10 [10], an ICD-10-based metric; however, the accuracies of those models were comparable. In this study, our AI model drastically outperformed the ICISS and EMR-ISS. Kwon et al [31], in a retrospective observational study using a NEDIS data set including trauma and nontrauma patients, reported a deep learning-based model with higher accuracy than KTAS for predicting in-hospital mortality. To the best of our knowledge, our AI model is the most accurate model and outperforms both diagnostic code-based metrics and triage scales in trauma patients.

Limitations and Future Works
Our study has several limitations. First, this is a retrospective study and may be subject to substantial selection and survival bias; further prospective trials and validation are needed. Second, we used procedure codes as 1 of the input features, but these are not practically available at the time of ED admission. Thus, in a prospective setting, planned but unconfirmed procedure codes would have to be used for predicting in-hospital mortality. Third, we did not consider physiological signals, such as blood pressure, heart rate, and body temperature. We tried to train an AI model using physiological signals, but its performance was poor because only limited physiological signals are recorded in NEDIS: blood pressure, heart rate, and temperature values at the time of admission only. We believe that time-series physiological signals, such as the electrocardiogram, photoplethysmogram, and blood pressure waveform, could improve our proposed model. Fourth, due to the structure of the NEDIS data set, some data, such as age, are collected as categorical rather than continuous values; the prediction performance of our proposed AI model could thus be enhanced if age were available as a continuous value. Fifth, some categorized input variables for the injury mechanism may be inappropriate. For instance, "traffic accident-pedestrian, train, airplane, ship, etc" is treated as 1 variable, yet pedestrians are not associated with airplanes or ships, and pedestrians have the highest mortality in road traffic collisions; the term should therefore be separated. In future work, we plan to split this variable into multiple categories and investigate the impact of each. Sixth, we could not compare the prediction performance of our AI model with that of AIS code-based approaches such as the ISS and the New Injury Severity Score (NISS), as NEDIS does not provide AIS codes.
Recently, we presented an AI model using AIS codes to predict in-hospital mortality [17]. That model outperformed conventional methods such as the ISS and NISS on all accuracy metrics: sensitivity, specificity, balanced accuracy, and AUROC. Following that work, this study used ICD-10 codes and several clinical features instead of AIS codes and again showed that the AI model outperformed conventional methods. Our goal is to construct a more comprehensive model incorporating both NEDIS-based and AIS-based AI models. Finally, our data did not include other races or data from other countries. Currently, our public website includes the following text: "This AI model was trained and evaluated from Korean trauma patients and may not be applicable to patients in other countries." Thus, future external validation is warranted, wherein we will consider using global data to further improve our proposed AI model.

Conclusions
Our proposed AI model shows high accuracy and outperforms traditional diagnostic code-based prediction models and triage scales. We believe that our population-based AI model can facilitate better understanding and practice in physical trauma care. Moreover, this AI- and data-driven prediction model may minimize human bias and workload. However, future external validation and prospective studies are warranted to prove the true effect size.