This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Scalable and accurate health outcome prediction using electronic health record (EHR) data has gained much attention in research recently. Previous machine learning models mostly ignore relations between different types of clinical data (ie, laboratory components, International Classification of Diseases codes, and medications).
This study aimed to model such relations and build predictive models using EHR data from intensive care units. We developed innovative neural network models and compared them with the widely used logistic regression model and other state-of-the-art neural network models to predict patients' mortality using their longitudinal EHR data.
We built a set of neural network models that we collectively call long short-term memory (LSTM) outcome prediction using comprehensive feature relations, or CLOUT for short. Our CLOUT models use a correlational neural network model to identify a latent space representation between different types of discrete clinical features during a patient's encounter and integrate the latent representation into an LSTM-based predictive model framework. In addition, we designed an ablation experiment to identify risk factors from our CLOUT models. Using physicians' input as the gold standard, we compared the risk factors identified by both CLOUT and logistic regression models.
Experiments on the Medical Information Mart for Intensive Care-III dataset (selected patient population: 7537) show that CLOUT (area under the receiver operating characteristic curve=0.89) surpassed logistic regression (0.82) and other baseline NN models (<0.86). In addition, physicians' agreement with the CLOUT-derived risk factor rankings was statistically significantly higher than their agreement with the logistic regression model.
Our results support the applicability of CLOUT for realworld clinical use in identifying patients at high risk of mortality.
High-precision predictive modeling of clinical outcomes (eg, adverse events such as the onset of disease and death) is a clinically important but computationally challenging task. If physicians can be notified about the risks of adverse events in advance, they may be able to take steps to prevent them. Electronic health records (EHRs) are widely used in US hospitals and are becoming more mature over time [
Almost 6 million patients are admitted annually to intensive care units (ICUs) in the United States for airway support, for hemodynamic or respiratory monitoring, and to stabilize acute or life-threatening medical problems [
During the past several years, neural network models have shown great success in many artificial intelligence applications, including computer vision, natural language processing, and predictive modeling [
Although studies show that CNN models do not necessarily outperform conventional predictive models such as regression models [
Although NNbased predictive models have been developed, most models are based on
The main objective of this work is to develop innovative prediction models to accurately predict patient mortality using patients' longitudinal EHR data. An important component of our models is a correlational neural network, which is a special neural network model that accounts for correlations between different types of features. We modeled the relationships between different types of clinical features in the EHR through a correlational neural network and integrated them into LSTM-based predictive models for improved performance.
Our main contributions include the learning of latent features from different clinical data types and the integration of the learned latent features for outcome prediction using longitudinal EHR data. Our results show that integrating latent features yielded the best performance for predicting patient mortality using the ICU data.
In addition to evaluating our CLOUT models using the traditional evaluation metrics such as sensitivity, specificity, and area under the receiver operating characteristic curve, we studied the interpretability of our predictive models. Specifically, we designed a simple ablation experiment [
In summary, our contributions are twofold: (1) We developed an innovative long short-term memory (LSTM)–based predictive model in which a correlational neural network is integrated to identify relationships and latent representations of different clinical features. Our CLOUT model achieves state-of-the-art performance in mortality prediction, surpassing other competitive NN models and a logistic regression model. (2) We provide a comprehensive evaluation of the risk factors identified by our neural network models. Our results show that the risk factors identified by the CLOUT model agree with physicians' assessments, suggesting that CLOUT could be used in real-world clinical settings.
All models are trained and evaluated on the Medical Information Mart for Intensive Care-III (MIMIC-III) dataset, an EHR dataset made publicly available by the Massachusetts Institute of Technology Laboratory for Computational Physiology. MIMIC-III has been widely used for predictive models [
Patient demographic information (N=7537).
Characteristic  Values
Age (years)
  Mean  74.74
  Median  66.00
Sex, n (%)
  Male  4190 (55.59)
  Female  3347 (44.41)
Race, n (%)
  White  5644 (74.88)
  Black  867 (11.50)
  Hispanic  277 (3.68)
  Asian  226 (3.00)
  Other/unknown  523 (6.94)
We require two or more encounters because we remove the last encounter when making predictions, so at least one earlier encounter with data must remain. We use patient mortality as our outcome label. This label is obtained in the MIMIC dataset from hospital records and Social Security death records. In our dataset of 7537 patients, we have 2825 (37.9%) documented deaths. Further details about MIMIC are covered in
The dataset was further divided into train, validation, and test splits, each containing approximately 69.99% (5275/7537), 9.99% (753/7537), and 20.02% (1509/7537) of the patients, respectively. Once we picked the optimal model hyperparameters using the validation set, the model was retrained on the combined train-validation set, which contained 79.98% (6028/7537) of the data.
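The split proportions above can be sketched as follows. The function name, fixed seed, and use of simple truncation are illustrative assumptions, not the authors' actual procedure:

```python
import random

def split_patients(patient_ids, seed=0):
    """Shuffle and split patients into ~70/10/20 train/validation/test sets.

    Illustrative sketch only: the paper reports 5275/753/1509 patients;
    the authors' exact assignment procedure is not specified here.
    """
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(0.70 * len(ids))  # truncate, matching the reported counts
    n_val = int(0.10 * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = split_patients(range(7537))  # 5275 / 753 / 1509 patients
```

For final evaluation, the train and validation partitions would then be merged (6028 patients, 79.98%) before retraining with the chosen hyperparameters.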
Our first set of baseline models are versions of the
RETAIN by itself does not incorporate temporal information beyond the RNN framework; such fine-grained temporal information may be important to patient outcomes. For example, the severity of 2 acute myocardial infarctions separated by different durations could have different clinical implications. On the other hand, there is an option to include time features in the encounter vector. Therefore, we implemented time-aware RETAIN (TaRETAIN) models as additional baselines by concatenating time information to the input features. We experimented with two different approaches to create the time feature: the number of days elapsed since the first encounter and the number of days elapsed since the previous encounter. We call these 2 models
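The two candidate time features can be sketched as follows; the function and variable names are hypothetical, not taken from the authors' code:

```python
def time_features(encounter_days):
    """Given each encounter's day offset, return the two TaRETAIN-style
    time features: days since the first encounter, and days since the
    previous encounter (0 for the first encounter)."""
    first = encounter_days[0]
    since_first = [d - first for d in encounter_days]
    since_prev = [0] + [b - a for a, b in zip(encounter_days, encounter_days[1:])]
    return since_first, since_prev

# Encounters on days 0, 3, and 10 of a hypothetical patient timeline:
sf, sp = time_features([0, 3, 10])  # sf = [0, 3, 10], sp = [0, 3, 7]
```

Either feature would then be concatenated to the encounter's input vector before it enters the RNN.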
Another baseline model is
The CLOUT models are built upon the state-of-the-art LSTM framework. We provide a description of relevant concepts or components that are built into our CLOUT models in
Unlike other RNN models, LSTMs can learn dependencies over longer intervals more efficiently [
Given a patient with encounters, the encounter vectors derived from a CLOUT model are
Note that our LSTM architecture is commonly used for sequence data. The innovation of this work is the representation of the encounter vector that integrates different types of EHR data, which we will describe below.
Our model architecture. LSTM: long shortterm memory.
In this version of CLOUT, the encounter vector was derived by a simple concatenation of different types of features. Every patient encounter had a set of documented International Classification of Diseases (ICD) codes, medications, and laboratory components. We converted these to 3 bit-vectors, one per feature type, each of the size of its vocabulary. Bit-vectors are vectors of length equal to the vocabulary size, with 1 at the index where a feature is documented and 0 everywhere else. We passed these bit-vectors through linear embedding layers to obtain their dense vector representations. We concatenated these dense vector representations and passed the resultant vector through a nonlinear function such as the rectified linear unit [
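A minimal sketch of this encounter-vector construction, using toy embedding matrices in place of learned weights (all names here are ours; a real model would use an optimized tensor library and train the matrices):

```python
def embed(bitvec, emb_matrix):
    """Linear embedding of a multi-hot bit-vector: sum the embedding rows
    of the documented codes (equivalent to W^T x for a 0/1 vector x)."""
    out = [0.0] * len(emb_matrix[0])
    for i, bit in enumerate(bitvec):
        if bit:
            out = [o + e for o, e in zip(out, emb_matrix[i])]
    return out

def relu(v):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, x) for x in v]

def encounter_vector(icd_bits, med_bits, lab_bits, W_icd, W_med, W_lab):
    """Simple-concatenation encounter vector: embed each feature type,
    concatenate the dense vectors, then apply the ReLU non-linearity."""
    return relu(embed(icd_bits, W_icd) + embed(med_bits, W_med) + embed(lab_bits, W_lab))
```

The resulting vector is what the LSTM consumes at each encounter step.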
Recent work on word embeddings called ELMo [
ICD codes, medications, and laboratory results are not isolated, unrelated pieces of clinical information; they are clinically intertwined, or correlated. For example, as stated earlier, medications depend upon the diagnoses of the patient in that encounter. To capture the correlations among EHR data, we added a multiview latent space component, as shown in
We used a correlational neural network for 3 views (ICD codes, medications, and laboratory components) to construct the latent representation for our latent space model. This component is graphically shown in
The latent space representation is a measure of the patient condition—a combination of related information from diagnosis codes, medications, and laboratory components. The details of this component are further described in
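A rough sketch of the correlational objective for three views, assuming linear encoders and a weighting term `lam`; this illustrates the general correlational-network idea only and is not the authors' implementation:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def encode(view, W):
    """Linear encoder h = W x, a toy stand-in for the learned view encoder."""
    return [dot(row, view) for row in W]

def corr(h1, h2):
    """Pearson correlation between two hidden vectors: the quantity the
    correlational network maximizes across views."""
    n = len(h1)
    m1, m2 = sum(h1) / n, sum(h2) / n
    a = [x - m1 for x in h1]
    b = [x - m2 for x in h2]
    denom = (dot(a, a) * dot(b, b)) ** 0.5
    return dot(a, b) / denom if denom else 0.0

def corrnet_loss(h_icd, h_med, h_lab, recon_error, lam=1.0):
    """Three-view objective: minimize reconstruction error while
    maximizing the pairwise correlation of the hidden codes."""
    c = corr(h_icd, h_med) + corr(h_icd, h_lab) + corr(h_med, h_lab)
    return recon_error - lam * c
```

Training drives the three views' hidden codes toward a shared latent space, so that related diagnoses, medications, and laboratory findings map to nearby representations.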
To integrate latent space representation into the encounter vector, we first projected the encounter into this latent space to get the latent space vector,
To evaluate the effectiveness of the correlational neural network, we also implemented a traditional autoencoder with one hidden layer
Model for constructing the encounter vector. ReLU: rectified linear unit; ICD: International Classification of Diseases.
The correlational neural network for our 3 views. ICD: International Classification of Diseases.
We evaluated each of the baseline and CLOUT models on the pMIMIC dataset. We obtain true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). We report area under the receiver operating characteristic curve (AUC-ROC) scores for all models, and precision
Predictive models would be of limited clinical use if the models are not interpretable. To interpret or identify the risk factors in our CLOUT models, we conduct an ablation experiment, which has been widely used for feature engineering. We perturb the patient data to zero out the contribution of a feature and calculate the corresponding difference in output. This classical method shows the contribution of each feature, which may correspond to the risk score.
Recall that each of our CLOUT models outputs a probability score that indicates mortality risk. So, the difference in output would be the reduction in this probability, which we call the attribution weight of the given feature. We calculated the attribution weight for each ICD code, medication, and laboratory component that is documented in the patient's EHRs. These features would then constitute the risk factors associated with the mortality, and the attribution weight represents the strength of the association.
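The ablation procedure can be sketched as follows; `toy_model` and the feature names are hypothetical stand-ins for a trained CLOUT model and its inputs:

```python
import math

def attribution_weights(model_prob, encounter, features):
    """Ablation attribution: zero out one feature at a time and measure
    the drop in the predicted mortality probability. `model_prob` is any
    function mapping an encounter dict to a probability; larger drops
    indicate stronger risk factors."""
    base = model_prob(encounter)
    weights = {}
    for f in features:
        ablated = dict(encounter)
        ablated[f] = 0  # remove this feature's contribution
        weights[f] = base - model_prob(ablated)
    return weights

def toy_model(enc):
    """Hypothetical logistic scorer over two binary features."""
    z = 2.0 * enc["icd:sepsis"] + 0.3 * enc["med:aspirin"] - 1.0
    return 1.0 / (1.0 + math.exp(-z))

w = attribution_weights(toy_model,
                        {"icd:sepsis": 1, "med:aspirin": 1},
                        ["icd:sepsis", "med:aspirin"])
```

Here the heavily weighted sepsis code produces a larger attribution weight than the aspirin prescription, matching the intuition that the stronger risk factor contributes more to the predicted probability.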
Although ablation experiments have been widely used for feature engineering and interpretation of neural network models in many applications [
Therefore, we designed a comprehensive evaluation of the risk factors ranked by CLOUT and compared them with those ranked by a logistic regression model. Specifically, we ranked the risk factors at the patient level and the population level. At the patient level, each risk factor (ie, feature or variable) is weighted by its contribution to the correct prediction for the patient. We ranked the risk factors at the population level by aggregating and normalizing the attribution weights of features across the patient population.
Using stratified random sampling, we selected a subset of risk factors from the prediction models CLOUT and logistic regression, respectively, and asked 5 unbiased physicians (4 internists and 1 cardiologist), who were not privy to the reasons for doing the ranking, to independently judge the clinical relevance of those risk factors.
To reduce the total number of features that the physicians needed to evaluate, we selected features from CLOUT. Specifically, for each feature, the ablation experiment output a relevance score. We binned the features into 3 groups: (1) the top 20 features, (2) features ranked 20-50, and (3) the remaining features. From each bin, we randomly selected 4 features. We then randomly selected 1 patient and accordingly obtained a total of 12 features for that patient. We also obtained the ranked list of features by population and followed a similar binning strategy to select another 18 features distributed across the different feature sets (we purposely selected features that differ from those selected for the sample patient so that we could maximize our evaluation features). Therefore, we selected a total of 30 features (12 by patient and 18 by population).
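The patient-level binning and sampling step can be sketched as follows; the bin boundaries follow the paper, while the seed and names are arbitrary:

```python
import random

def sample_features(ranked_features, seed=0):
    """Stratified sampling of risk factors for physician review:
    4 features from each of three bins (top 20, ranks 20-50, the rest),
    for 12 features total per patient."""
    rng = random.Random(seed)
    bins = [ranked_features[:20], ranked_features[20:50], ranked_features[50:]]
    return [f for b in bins for f in rng.sample(b, 4)]

# A hypothetical ranked list of 200 feature names:
picked = sample_features(["f%d" % i for i in range(200)])  # 12 features
```

The population-level list would be sampled the same way (18 features), excluding any overlap with the patient-level sample.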
We randomized those 30 features and asked the 5 physicians, who were blinded to the CLOUT rankings, to evaluate each feature's clinical relevance. Specifically, we asked each physician to score the feature from 1 to 5 (with 1 as the least relevant and 5 the most relevant) based on their clinical knowledge or guidelines.
We calculated the Pearson correlation coefficient between physicians’ scores for pairwise agreements between physicians, and between the CLOUT scores and physicians’ scores. We also performed a
Finally, we performed another evaluation where we first averaged the scores of all the physicians to obtain a representative gold standard. We then computed the correlation coefficients between these scores and the scores from our models and the logistic regression baseline.
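These two agreement computations can be sketched as follows; the function names are ours:

```python
def pearson(x, y):
    """Pearson correlation coefficient between two score lists, eg, a
    physician's 1-5 relevance scores vs a model's attribution scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5

def mean_physician_scores(score_lists):
    """Average the physicians' scores per feature to form the
    representative gold-standard score vector."""
    return [sum(col) / len(col) for col in zip(*score_lists)]
```

Pairwise physician agreement uses `pearson` directly on two physicians' score lists; the final evaluation applies it to `mean_physician_scores` of all 5 physicians versus each model's scores.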
During our experiments, we found that models using abnormal laboratory components as input (ie, binary coding of normal/abnormal) performed better than those using all the laboratory components. Therefore, the results presented here for the pMIMIC dataset used only the abnormal labs recorded in patient encounters through an abnormal flag.
As shown in
Our latent space representation model also slightly outperformed the traditional autoencoder CLOUT model, although the difference is not statistically significant. An important result is that integrating different levels of representations (the input space plus either an autoencoder or a correlational autoencoder) substantially improves performance over a model that uses the autoencoder alone. The code for our models and experiments can be found in our CLOUT repository [
Area under the receiver operating characteristic curve scores for different models.
Method  Area under the receiver operating characteristic curve, mean (SD)
Logistic regression  0.82 (0.0103)
RETAIN^{a} (only ICD^{b})  0.82 (0.0924)
TaRETAIN^{c}  0.82 (0.0118)
TaRETAIN  0.82 (0.0919)
RETAIN (all codes)  0.86 (0.0105)
Long short-term memory with only ICD codes  0.83 (0.0104)
CLOUT^{d}—only autoencoder  0.80 (0.0116)
CLOUT—only latent space  0.81 (0.0082)
CLOUT—simple concatenation  0.88 (0.0096)
CLOUT—autoencoder concatenation  0.88 (0.0107)
CLOUT—latent space concatenation^{e}  0.89
^{a}RETAIN: Reverse Time Attention model.
^{b}ICD: International Classification of Diseases.
^{c}TaRETAIN: timeaware RETAIN.
^{d}CLOUT: L(STM) Outcome prediction using Comprehensive feature relations.
^{e}Best performing model.
The area under the receiver operating characteristic curves for various models. RETAIN: Reverse Time Attention model; CLOUT: L(STM) Outcome prediction using Comprehensive feature relations.
Precision, recall, and F-scores for top CLOUT^{a} models.
Method and class  Precision  Recall  F-score
CLOUT—simple concatenation
  0  0.85  0.82  0.83
  1  0.71  0.76  0.73
  Average  0.80  0.79  0.80
CLOUT—autoencoder concatenation
  0  0.85  0.85  0.85
  1  0.74  0.74  0.74
  Average  0.81  0.81  0.81
CLOUT—latent space concatenation
  0  0.84  0.88  0.86
  1  0.78  0.72  0.72
  Average  0.82  0.82  0.82
^{a}CLOUT: L(STM) Outcome prediction using Comprehensive feature relations.
To measure agreement among physicians, we computed the Pearson correlation coefficient between their scores. For patient-specific features,
Pearson correlation coefficients for agreement between physicians and models.
Agreement  Physician 1, r  Physician 2, r  Physician 3, r  Physician 4, r  Physician 5, r  Mean (SD)
Physicians
  Physician 1  1.00  0.81  0.56  0.61  0.88  0.72 (0.13)
  Physician 2  0.81  1.00  0.87  0.65  0.86  0.80 (0.09)
  Physician 3  0.56  0.87  1.00  0.49  0.69  0.65 (0.14)
  Physician 4  0.61  0.65  0.49  1.00  0.61  0.59 (0.06)
  Physician 5  0.88  0.86  0.69  0.61  1.00  0.76 (0.11)
Models
  Logistic regression  0.60  0.63  0.53  0.32  0.52  0.52 (0.11)
  RETAIN^{a}  0.65  0.72  0.61  0.30  0.58  0.57 (0.14)
  CLOUT^{b}—only autoencoder  −0.07  0.13  0.21  —  0.17  0.20 (0.20)
  CLOUT—only latent space  0.42  0.77  —  0.35  0.53  0.54 (0.15)
  CLOUT—simple concatenation  0.52  0.64  0.70  0.19  —  0.54 (0.19)
  CLOUT—autoencoder concatenation  0.54  0.70  0.64  0.14  0.62  0.53 (0.20)
  CLOUT—latent space concatenation  —  —  0.59  0.18  —  0.58 (0.21)
^{a}RETAIN: Reverse Time Attention model.
^{b}CLOUT: L(STM) Outcome prediction using Comprehensive feature relations.
^{c}Italicization signifies highest physicianmodel agreement in the column.
In this study, we developed innovative CLOUT models and compared them with other state-of-the-art predictive models with respect to performance on mortality prediction. We found that almost every CLOUT model surpassed the competitive baseline models (eg, RETAIN). The results support LSTM as a state-of-the-art framework for EHR-based predictive modeling.
Our results showed that the integration of different levels of latent representations (the input space plus either an autoencoder or a correlational autoencoder) substantially improves performance, from 0.80 to 0.88 AUC-ROC. The rich representation may provide extra information to the model, which in turn helps the model make better predictions. The integration of different types of features (ie, ICD codes, laboratory results, and medications), however, had mixed results. Specifically, the CLOUT model that incorporated only the abnormal laboratory results slightly surpassed the CLOUT model that incorporated all 3 feature types. This supports the importance of laboratory results for predicting mortality. Our results also suggest that there may be noisy information in the features. When CLOUT was implemented with the latent vectors included, it had the highest performance, an AUC-ROC score of 0.89 and an F1 score of 0.82. This result supports our approach of using the correlational neural network to identify latent vectors that best represent different but related clinical observations or variables.
On the other hand, when we incorporated temporal information as a feature, we saw little improvement in performance using RETAIN. A possible future direction is to explore time-dependent attention, which may allow the model to integrate the temporal information in the architecture.
For the risk factors identified by our models, the average correlation coefficient among the physicians was 0.71 (SD 0.13), and the average Pearson correlation coefficients between CLOUT and the physicians and between logistic regression and the physicians were 0.58 (SD 0.21) and 0.52 (SD 0.11), respectively. These results show a significant difference between the agreement among physicians and the agreement between the logistic regression model and the physicians (
We also calculated the agreement with RETAIN for reference and found that the average was 0.57 (SD 0.14), still slightly lower than that of the CLOUT model, although CLOUT's agreement dropped considerably for physician 4. Other CLOUT models also have slightly lower scores, as reported in
Our results show that physician 4 had low correlation scores with the other physicians as well as with our CLOUT models. For example, lactulose enema and encephalopathy not otherwise specified were scored as 2 by physician 4, whereas all the other physicians gave scores of 4 or greater. When we removed physician 4, the correlation between the latent space CLOUT model and the physicians improved from 0.58 to 0.68.
For population-level features, we performed a similar evaluation between physician scores and the CLOUT model scores; the average correlation coefficients were −0.19 for ICD codes, −0.43 for medications, and −0.37 for laboratory components, all lower than the patient-specific interpretations. This is not surprising, as many risk factors (eg, severe diseases) are rare events that are not present for patients in general.
Furthermore, CLOUT models captured important risk factors while making predictions. In general, our CLOUT models show that patients with diagnosis codes representing cranial nerve disorder and cystic liver disease were marked with a high risk of mortality. This is reasonable as those are diseases with a high risk of mortality.
Our dataset was constructed from EHR data and is, hence, prone to the standard data quality issues that EHRs typically have, as documented in the literature. EHRs are known to have missing diagnosis and medication codes for patients when compared with insurance claims. Furthermore, our analysis of ICU admissions does not account for deaths due to accidental circumstances such as car crashes. We used all the information exactly as it appears, as it is infeasible to comb through all the records to select patients for the study. Another limitation is the absence of vital sign features in our dataset, which we excluded because of the extensive preprocessing required to handle missing numerical values.
The CLOUT models have significant limitations as well. First, similar to most predictive models, the risk factors identified by the CLOUT models include confounding variables. For example, we found that patients with a prescription for a scopolamine patch have high risk scores. This medication is prescribed to terminally ill patients as part of a palliative care regimen to reduce excessive airway secretions. In this case, the actual reason for palliative care is the strong risk factor for death, not the medication, which is a confounding factor. Another limitation of our work is that our models are highly dependent on the population size; bias could be introduced when the size is small. However, such limitations exist in most predictive models not reviewed or guided by physician oversight.
We surveyed a variety of approaches to compare our models. This includes statistical approaches [
EHRs are widely available and have enormous untapped potential for predicting patients' health outcomes. EHR-based predictive models are potentially highly useful for clinical decision support. Our experiments show that incorporating comprehensive clinical information can improve predictions and that integrating latent space representations, learned through a correlational neural network, with clinical information led to the best performing CLOUT model. Our risk factor experiment with physicians also suggests that CLOUT models find more clinically relevant risk factors. Our results strongly support that CLOUT may be a useful tool for building clinical prediction models, especially among hospitalized and critically ill patient populations.
The future directions include new models to incorporate the temporal information and methods to integrate clinical notes for predictive models. We may also explore other models to integrate different views, including the Capsule network model [
The Medical Information Mart for Intensive Care-III, preprocessing, and outcome label.
Relevant machine learning components.
Correlational neural network.
Comparison with prior work.
CNN: convolutional neural network
CLOUT: L(STM) Outcome prediction using Comprehensive feature relations
EHR: electronic health record
ICD: International Classification of Diseases
ICU: intensive care unit
LSTM: long short-term memory
MIMIC-III: Medical Information Mart for Intensive Care-III
RETAIN: Reverse Time Attention model
RNN: recurrent neural network
TaRETAIN: time-aware RETAIN
This work was supported in part by the grant R01HL137794 from the National Institutes of Health (NIH). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. HY’s time was also supported by grants 5R01HL125089, R01HL135129, R01DA045816, and R01LM012817. DM’s time was also supported by grants R01HL137734, R01HL126911, R01HL13660, and R01HL141434 from the National Heart, Lung, and Blood Institute.
DM has received research grant support from Apple Computer, Bristol-Myers Squibb, Boehringer Ingelheim, Pfizer, Samsung, Philips Healthcare, Care Evolution, and Biotronik; has received consultancy fees from Bristol-Myers Squibb, Pfizer, Flexcon, and Boston Biomedical Associates; and has inventor equity in Mobile Sense Technologies, Inc, Connecticut.