This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
The key to effective stroke management is timely diagnosis and triage. Machine learning (ML) methods developed to assist in detecting stroke have focused on interpreting detailed clinical data such as clinical notes and diagnostic imaging results. However, such information may not be readily available when patients are initially triaged, particularly in rural and underserved communities.
This study aimed to develop an ML stroke prediction algorithm based on data widely available at the time of patients’ hospital presentations and assess the added value of social determinants of health (SDoH) in stroke prediction.
We conducted a retrospective study of the emergency department and hospitalization records from 2012 to 2014 from all the acute care hospitals in the state of Florida, merged with the SDoH data from the American Community Survey. A case-control design was adopted to construct stroke and stroke mimic cohorts. We compared the algorithm performance and feature importance measures of the ML models (ie, gradient boosting machine and random forest) with those of the logistic regression model based on 3 sets of predictors. To provide insights into the prediction and ultimately assist care providers in decision-making, we used TreeSHAP for tree-based ML models to explain the stroke prediction.
Our analysis included 143,203 hospital visits from unique patients; based on the principal diagnosis at discharge, 73% (n=104,662) of these patients were confirmed to have had a stroke. The approach proposed in this study has high sensitivity and is particularly effective at reducing the misdiagnosis of dangerous stroke chameleons (false-negative rate <4%). ML classifiers consistently outperformed the benchmark logistic regression in all 3 input combinations. We found significant consistency across the models in the features that explain their performance. The most important features are age, the number of chronic conditions on admission, and primary payer (eg, Medicare or private insurance). Although both the individual- and community-level SDoH features helped improve the predictive performance of the models, the inclusion of the individual-level SDoH features led to a much larger improvement (area under the receiver operating characteristic curve increased from 0.694 to 0.823) than the inclusion of the community-level SDoH features (area under the receiver operating characteristic curve increased from 0.823 to 0.829).
Using data widely available at the time of patients’ hospital presentations, we developed a stroke prediction model with high sensitivity and reasonable specificity. The prediction algorithm uses variables that are routinely collected by providers and payers and might be useful in underresourced hospitals with limited availability of sensitive diagnostic tools or incomplete data-gathering capabilities.
Diagnostic errors have emerged as a major public health problem, contributing to preventable patient harm and excess health spending. A recent US National Academies report titled “Improving Diagnosis in Health Care” suggested that medical misdiagnosis is likely to affect almost everyone at least once in their lifetime, sometimes with devastating consequences [
The diagnosis of stroke is complicated by the abundance of stroke mimics and stroke chameleons. Approximately 30% of patients admitted to hospitals with typical stroke symptoms ended up having nonstroke conditions (ie, stroke mimics) [
Machine learning (ML), a crucial branch of artificial intelligence, has the potential to identify hidden insights from a large volume of data and generate predictions on unseen data (ie, test data) by iteratively learning from example inputs (ie, training data). ML problems generally fall into 3 main types: classification and regression, which together constitute supervised learning, and unsupervised learning, which in the context of ML applications often refers to clustering. In the literature on stroke research, ML algorithms have been applied in different tasks, such as identifying factors associated with future stroke risk [
The first brain imaging for most patients with suspected stroke is a noncontrast computed tomography (CT) scan, which is completed within minutes of the arrival of the patient to the ED. However, a noncontrast CT scan is not sufficient to diagnose acute stroke, as the head CT test cannot reveal a hyperacute stroke in most cases, and it has reduced sensitivity for lacunar strokes [
Besides medical risk factors, social determinants of health (SDoH) have been shown to be associated with the risk of stroke and many other diseases [
In this study, we aimed to develop an ML stroke prediction algorithm based on data widely available at the time of patients’ hospital presentations and to assess the added value of SDoH in stroke prediction. Because the prediction model does not require clinical notes or diagnostic test results, it might be particularly useful in addressing the misdiagnosis challenges faced when dealing with walk-in patients with stroke with milder and atypical symptoms; in low-volume or nonstroke centers’ EDs, where emergency providers have limited daily exposure to stroke [
The secondary hospital discharge data this study examined were from the Healthcare Cost and Utilization Project State-specific databases maintained by the Agency for Healthcare Research and Quality. Healthcare Cost and Utilization Project databases conform to the definition of a limited data set, and review by an institutional review board is not required for the use of limited data sets [
Our data were obtained from 2 primary sources. We obtained longitudinal administrative data that contained encounter-level information on inpatient stays and ED visits from hospitals in the state of Florida. The second data source was the American Community Survey (ACS) conducted by the US Census Bureau [
We adopted a case-control design, and the initial phase of our approach was to create representative examples for model training and ensure that stroke cases and controls have clear separation. We retrospectively extracted 127,114 hospitalization records from 2012 to 2014 with a principal diagnosis of acute cerebrovascular disease in Florida using the clinical classification tool developed by the Agency for Healthcare Research and Quality [
The key for a model to accurately predict stroke is to distinguish between stroke and stroke-like conditions (“stroke mimics”). We carefully created a stroke mimic data set to simulate tricky diagnostic decision-making and distinguish between actual stroke events and stroke-like events. Using all the records involving patients with nonstroke conditions to construct a prediction model will result in the inclusion of completely irrelevant cases, such as childbirth and hip replacement, and create a highly unbalanced data set. Hence, we consulted physicians about what conditions may show initial symptoms similar to those of a patient with stroke. On the basis of their suggestions, we obtained a list of conditions using Epocrates, a mobile app that health care providers use at the point of care for clinical reference information [
We pooled the stroke and stroke mimic data sets and retained only the data collected during the first admission of the patients. We performed data deduplication once again after combining the stroke and stroke mimic data because a patient may have been first admitted with stroke and readmitted with a stroke mimic condition or vice versa. If a patient appeared in both data sets, we kept only the first occurrence. Because patients may have returned to the hospital multiple times, providers may have obtained more information about patients who were readmitted. Retaining only the index encounter of the patients ensures that our models predict stroke based solely on the information available at the time of a patient’s initial presentation at the hospital. We obtained data from 2010 to 2014, which gave us 2 years before 2012 as a “cushion period”; the patients included in the analysis were those with no records in 2010 or 2011. The “confirmed stroke” data set contains all the patients whose hospital discharge records confirmed that they had a stroke; thus, it includes not only patients with typical stroke symptoms but also those with mild and atypical symptoms. The stroke mimic data set includes patients with general presentations similar to those of patients with actual stroke, including patients with a discharge diagnosis of epilepsy, diabetes, or alcohol and drug withdrawal.
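The deduplication rule described above (after pooling the two cohorts, retain only each patient's first, or index, encounter) can be sketched in pandas. The table and column names here are illustrative, not the actual SID schema:

```python
import pandas as pd

# Toy pooled cohort: one row per hospital visit (column names are illustrative)
visits = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 3],
    "admit_date": pd.to_datetime(["2012-03-01", "2013-07-15", "2012-05-20",
                                  "2012-01-10", "2012-09-02"]),
    "cohort": ["stroke", "mimic", "mimic", "mimic", "stroke"],
})

# Sort chronologically, then keep only each patient's first (index) encounter,
# regardless of whether that encounter was a stroke or a stroke mimic
index_visits = (visits.sort_values("admit_date")
                      .drop_duplicates(subset="patient_id", keep="first"))
```

In this toy example, patient 1's stroke admission precedes their mimic readmission, so the stroke row is kept; for patient 3 the reverse holds, and the mimic row survives.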
The original SDoH data we extracted from the ACS contained a large number of features. We adopted several methods to reduce noise and dimensionality and avoid overfitting. First, we conducted exploratory data analysis such as the principal component analysis to understand the feature distribution and identify patterns and multicollinearity among features. We then combined domain knowledge and a sparse regression method (least absolute shrinkage and selection operator) to remove irrelevant features and merge highly sparse features.
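A minimal sketch of the LASSO-style screening step follows; the synthetic features stand in for the standardized ACS variables, and the penalty strength is an illustrative choice, not the value used in the study:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))   # stand-in for standardized community features
# Toy outcome driven by only the first 2 columns; the rest are noise
y = (X[:, 0] - 2 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

# An L1 (LASSO-type) penalty shrinks irrelevant coefficients to exactly zero,
# so the surviving nonzero coefficients act as the selected feature subset
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
lasso.fit(StandardScaler().fit_transform(X), y)
kept = np.flatnonzero(lasso.coef_[0])   # indices of retained features
```

Standardizing before fitting matters here: the L1 penalty is scale sensitive, so unscaled features with large variances would be unfairly favored.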
Overall, 4 categories were constructed from a large set of 431 variables in the ACS data for the 983 zip codes in Florida. These categories represent social, economic, housing, occupation, health insurance, and demographic characteristics referenced in the literature as being associated with stroke-related and cardiovascular health status (
We also performed a Markov blanket feature selection method to determine a minimal subset of relevant features that yields the optimal classification performance [
The final analysis data set was formed by merging the patient-level data with the community-level ACS data based on the patients’ zip code information.
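The zip code join can be expressed as a many-to-one left merge; the column names and values below are hypothetical:

```python
import pandas as pd

# Hypothetical encounter-level records and zip-level ACS aggregates
patients = pd.DataFrame({"visit_id": [101, 102, 103],
                         "zip": ["33101", "32801", "33101"]})
acs = pd.DataFrame({"zip": ["33101", "32801"],
                    "median_income": [41000, 52000],
                    "pct_no_insurance": [0.21, 0.14]})

# Many-to-one left join: every encounter is kept and inherits the
# community-level features of its residential zip code
analysis = patients.merge(acs, on="zip", how="left", validate="many_to_one")
```

The `validate="many_to_one"` check guards against accidental row duplication if the community-level table ever contained repeated zip codes.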
Data processing pipeline. ACS: American Community Survey; NA: not available; SDoH: social determinants of health; SID: State Inpatient Database.
We started by using the patient-level information available at the time of hospital presentation to predict a binary outcome that indicates whether the patient’s final diagnosis at discharge is stroke. We ran three different models that are well established in the literature for the training process: (1) logistic regression, (2) RF, and (3) gradient boosting machine (GBM). Each model was run with different combinations of predictor variables to assess the added predictive value of the different variables.
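The three model families can be instantiated side by side with scikit-learn; the data here are synthetic and the hyperparameter settings are placeholders, not the tuned configurations reported in this study:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + rng.normal(size=400) > 0).astype(int)  # toy stroke/mimic label

# The 3 classifier families compared in the study
models = {
    "Logit": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "GBM": GradientBoostingClassifier(random_state=0),
}
# In-sample AUC for each family on the toy data
aucs = {name: roc_auc_score(y, m.fit(X, y).predict_proba(X)[:, 1])
        for name, m in models.items()}
```

Keeping the models in a dictionary makes it straightforward to rerun the same loop over each of the predictor-variable combinations.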
Logistic regression is a popular method for modeling the relationship between a set of predictor variables and a binary outcome variable and for benchmarking [
We first tuned the hyperparameters of all 3 models to find the optimal configurations using a grid search and 5-fold cross-validation on the entire data set. The evaluation metric used in the cross-validation was the area under the receiver operating characteristic curve (AUC). We used a random 80-20 train-test split of the data set, a standard evaluation method for ML models and one typically used when designing ML-enabled diagnostic tools for providers in EDs [
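The tuning and evaluation design can be sketched as follows; the data, the hold-out seed, and the parameter grid are illustrative, not the study's actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the analysis data set
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Random 80-20 hold-out split, as in the evaluation design described above
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Grid search with 5-fold cross-validation scored by AUC
# (the grid here is illustrative only)
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={"n_estimators": [50, 100],
                                "max_depth": [3, None]},
                    scoring="roc_auc", cv=5)
grid.fit(X_tr, y_tr)
test_auc = grid.score(X_te, y_te)  # AUC of the best configuration on hold-out data
```

Because `scoring="roc_auc"` is set, `grid.score` reports AUC on the hold-out set rather than plain accuracy, matching the evaluation metric used in the cross-validation.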
As a robustness check, we adopted an alternative data split method by using historical 2012 data to predict for 2013 and using both 2012 and 2013 data to predict for 2014.
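The forward-in-time split used in this robustness check amounts to training only on years strictly before the test year; a small sketch with illustrative columns:

```python
import pandas as pd

# Toy encounter table with admission years (columns are illustrative)
df = pd.DataFrame({"year": [2012, 2012, 2013, 2013, 2014],
                   "stroke": [1, 0, 1, 1, 0]})

def temporal_split(df, test_year):
    """Train on strictly earlier years, test on the given year."""
    train = df[df["year"] < test_year]
    test = df[df["year"] == test_year]
    return train, test

train_2013, test_2013 = temporal_split(df, 2013)  # 2012 -> predict 2013
train_2014, test_2014 = temporal_split(df, 2014)  # 2012 + 2013 -> predict 2014
```

Unlike a random split, this arrangement guarantees that no future information leaks into training, which mirrors how the model would actually be deployed.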
Although ML models can produce accurate predictions, they are often treated as black-box models that lack interpretability. This is an important problem, especially in medical care because clinicians are often unwilling to accept machine recommendations without clarity regarding the underlying reasoning [
Analysis pipeline. ACS: American Community Survey; GBM: gradient boosting machine; LR: logistic regression; RF: random forest; SDoH: social determinants of health; SID: State Inpatient Database.
In the final data set, there were 143,203 hospital visits from unique patients, and it was confirmed based on the hospital discharge records that 73% (n=104,662) of them had a stroke. The prediction models included 12 patient-level features from the hospital administrative data set, combined with 16 community-level features from the ACS data set. We summarized the patient-level predictors into 3 categories: patient demographics, visit-level features, and individual-level SDoH; their summary statistics are presented in the following table (
Descriptive statistics of the patient-level predictors.
| Features | Total sample (n=143,203), mean (SD) | Stroke cohort (n=104,662), mean (SD) | Stroke mimic cohort (n=38,541), mean (SD) | P value |
| --- | --- | --- | --- | --- |
| Patient demographics | | | | |
| Age (years) | 65.2843 (19.97) | 71.1259 (14.68) | 49.4207 (23.49) | <.001 |
| Sex (female) | 0.5019 (0.50) | 0.5014 (0.50) | 0.5031 (0.50) | .03 |
| Number of chronic conditions | 6.5066 (3.21) | 7.1200 (3.00) | 4.8410 (3.17) | <.001 |
| Race | | | | |
| White | 0.6594 (0.47) | 0.6736 (0.47) | 0.6209 (0.49) | <.001 |
| Black | 0.1802 (0.38) | 0.1706 (0.38) | 0.2064 (0.40) | <.001 |
| Hispanic | 0.1348 (0.34) | 0.1302 (0.34) | 0.1472 (0.35) | <.001 |
| Other races | 0.0256 (0.16) | 0.0257 (0.16) | 0.0255 (0.16) | .04 |
| Visit-level features | | | | |
| Emergency admission | 0.9030 (0.30) | 0.9094 (0.29) | 0.8859 (0.32) | <.001 |
| Elective admission | 0.0403 (0.20) | 0.0214 (0.14) | 0.0914 (0.29) | <.001 |
| Transfer in indicator | 0.0913 (0.37) | 0.0929 (0.37) | 0.0869 (0.36) | <.001 |
| Night shifta | 0.3409 (0.47) | 0.3257 (0.47) | 0.3821 (0.49) | <.001 |
| Weekend indicator | 0.2558 (0.44) | 0.2581 (0.44) | 0.2496 (0.43) | <.001 |
| Individual-level SDoHb | | | | |
| Urban residence | 0.9529 (0.21) | 0.9515 (0.21) | 0.9567 (0.20) | <.001 |
| Primary payer | | | | |
| Medicare | 0.6239 (0.48) | 0.7027 (0.46) | 0.4099 (0.49) | <.001 |
| Medicaid | 0.1103 (0.31) | 0.0714 (0.26) | 0.2159 (0.41) | <.001 |
| Private insurance | 0.1505 (0.36) | 0.1331 (0.34) | 0.1980 (0.40) | <.001 |
| Other payers | 0.1153 (0.32) | 0.0929 (0.29) | 0.1762 (0.38) | <.001 |
| Median household income | | | | |
| 0-25th percentile | 0.4025 (0.49) | 0.3984 (0.49) | 0.4134 (0.49) | <.001 |
| 26th-50th percentile | 0.3261 (0.47) | 0.3289 (0.47) | 0.3186 (0.47) | <.001 |
| 51st-75th percentile | 0.1992 (0.40) | 0.1994 (0.40) | 0.1986 (0.40) | .04 |
| 76th-100th percentile | 0.0722 (0.26) | 0.0733 (0.26) | 0.0694 (0.25) | <.001 |
aAdmission between 7 PM and 7 AM.
bSDoH: social determinants of health.
We based our model selection on both the performance metrics and the clinical needs in actual care settings. Note that the cost of misdiagnosis is asymmetrical. Misdiagnosis of a stroke (labeling a true stroke as a nonstroke condition) might have more severe adverse consequences for both patients and providers than overdiagnosis (ie, false-positive stroke diagnosis). Hence, the selected model should provide high sensitivity while maintaining specificity within a reasonable range. Both ML models (RF and GBM) correctly detected at least 97% (101,522/104,662) of all the patients who did have a stroke and thus significantly outperformed the sensitivity of the prehospital stroke prediction scales (which ranges between 0.38 and 0.62) [
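One way to operationalize this asymmetric cost is to choose the classification threshold on the predicted probabilities so that sensitivity never drops below a floor; a minimal sketch, where the 0.97 floor and the toy scores are illustrative:

```python
import numpy as np

def threshold_for_sensitivity(y_true, scores, min_sens=0.97):
    """Largest score threshold whose sensitivity still meets the floor."""
    pos = np.sort(scores[y_true == 1])
    # Number of true strokes allowed to fall below the cutoff
    k = int(np.floor((1 - min_sens) * len(pos)))
    return pos[k]

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)                                   # toy labels
scores = np.clip(0.4 * y + rng.normal(0.3, 0.2, 1000), 0, 1)   # toy model output
t = threshold_for_sensitivity(y, scores)
sensitivity = np.mean(scores[y == 1] >= t)
```

Any specificity the model then achieves is obtained subject to this sensitivity guarantee, which matches the clinical preference for minimizing missed strokes over avoiding overdiagnosis.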
Performance of stroke prediction models.
| Input combinations and model number | Classifier | Accuracy | AUCa | Sensitivity | Specificity | Precision |
| --- | --- | --- | --- | --- | --- | --- |
| Without SDoHe | | | | | | |
| 1 | Logit | 0.828 | 0.693 | 0.960 | 0.626 | 0.893 |
| 2 | RFb | 0.804 | 0.680 | 0.928 | | 0.877 |
| 3 | GBMd | | | | 0.619 | |
| With individual-level SDoH | | | | | | |
| 4 | Logit | 0.830 | 0.810 | 0.960 | 0.630 | 0.895 |
| 5 | RF | 0.794 | 0.724 | 0.899 | | 0.868 |
| 6 | GBM | | | | 0.631 | |
| With individual- and community-level SDoH | | | | | | |
| 7 | Logit | 0.822 | 0.810 | 0.967 | 0.629 | 0.891 |
| 8 | RF | 0.831 | 0.828 | | 0.626 | 0.896 |
| 9 | GBM | | | 0.970 | | |
aAUC: area under the receiver operating characteristic curve.
bRF: random forest.
cFor each input combination, the best performance among the 3 classifiers has been italicized.
dGBM: gradient boosting machine.
eSDoH: social determinants of health.
We found consistency across the 3 models in the most important features that explain their performance (
Notably, the patients’ admission type (eg, whether it is an emergency or elective admission) and timing of admission (ie, whether they were admitted during the night shift) contributed to the accuracy of stroke prediction. Existing studies have investigated the presence of a “weekend effect” on mortality [
In addition to age, other patient-level demographic and socioeconomic factors, including gender, race, and primary payer (ie, whether the medical expenses were covered by Medicare, Medicaid, private insurance, or other payers), contribute to the models’ prediction. These findings complement the recently observed diverging stroke risk patterns among different racial and gender groups [
Comparison of feature importance: 20 most important features for gradient boosting machine (GBM; upper left), random forest (upper right), and logistic regression (bottom). ACS: American Community Survey; Qrtl: quartile.
Some community-level SDoH variables (eg, percentage of single women; percentage of people with occupations closely related to finance, retail, and manufacturing industries; and mean travel time to work) were also among the top 20 features. However, the magnitude of their impact on stroke prediction was much less than that of patient-level demographic and socioeconomic features. This is consistent with the literature [
Ablation studies are commonly used for assigning importance scores to features [
To provide insights into prediction and ultimately assist care providers in decision-making, we sought to explain the stroke prediction model using TreeSHAP [
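TreeSHAP computes these per-feature contributions efficiently for tree ensembles; the underlying Shapley definition can be illustrated by brute force on a toy additive risk score. The feature names, values, and coefficients below are invented for illustration and do not come from the fitted models:

```python
from itertools import combinations
from math import factorial

# Toy 3-feature risk score standing in for the fitted tree ensemble
FEATURES = ["age", "n_chronic", "medicare"]
x = {"age": 78, "n_chronic": 9, "medicare": 1}           # one patient
baseline = {"age": 60, "n_chronic": 5, "medicare": 0}    # reference values

def f(v):
    return 0.01 * v["age"] + 0.05 * v["n_chronic"] + 0.1 * v["medicare"]

def shapley(feature):
    # Average the feature's marginal contribution over all coalitions
    others = [g for g in FEATURES if g != feature]
    n, phi = len(FEATURES), 0.0
    for r in range(len(others) + 1):
        for coalition in combinations(others, r):
            w = factorial(r) * factorial(n - r - 1) / factorial(n)
            present = set(coalition)
            with_f = {k: x[k] if k in present | {feature} else baseline[k]
                      for k in FEATURES}
            without = {k: x[k] if k in present else baseline[k]
                       for k in FEATURES}
            phi += w * (f(with_f) - f(without))
    return phi

contributions = {g: shapley(g) for g in FEATURES}
# Local accuracy: per-feature contributions sum to f(x) - f(baseline)
gap = sum(contributions.values()) - (f(x) - f(baseline))
```

For an additive score like this one, each feature's Shapley value reduces to the change in its own term relative to the baseline, which serves as a sanity check; TreeSHAP recovers the same quantities for nonadditive tree ensembles without enumerating coalitions.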
Shapley Additive Explanations values for example patients. ACS: American Community Survey.
These examples demonstrate that individual-level predictors of stroke can differ significantly from one case to another and can be used for personalized diagnostic and treatment decisions at the point of care, whereas the population-level analysis provides an overall ranking of the important predictors of stroke at hospital presentation and can be used to develop best practice guidelines and patient management programs.
In this study, we developed an ML-based approach using routinely collected administrative data to help reduce stroke misdiagnosis. Our findings suggest that before obtaining diagnostic imaging or laboratory test results, it is possible to predict stroke based on patients’ demographics and SDoH information available at the time of hospital presentation. The algorithm had an AUC of 83%, provided accurate results (high precision of 84%), and correctly identified the vast majority (101,522/104,662, 97%) of all true stroke cases (high sensitivity).
This study fills a critical gap in the current efforts to support stroke triage, which either focuses on improving specificity in the prehospital setting or requires detailed neurological assessments and imaging results. On the one hand, advanced ML techniques have been applied to assist in automatically interpreting clinical notes and imaging, but this is based on the availability of these information sources. On the other hand, because emergency medical service personnel lack the necessary time and training to perform detailed neurological assessments, short and simple clinical methods known as prehospital stroke scales have been developed to support the initial triage in the field, such as the Cincinnati Prehospital Stroke Scale, Los Angeles Prehospital Stroke Scale, and Conveniently Grasped Field Assessment Stroke Triage. These scales have demonstrated wide performance variability in clinical practice; however, in general, they were found to have acceptable-to-good specificity but low sensitivity [
Decision support for stroke prediction. ED: emergency department.
This model can be integrated with other AI-enabled prediction or decision support systems based on EHRs in the ED to further improve stroke triage and diagnosis. Although EHR data contain rich and detailed clinical information, certain social and behavioral determinants that can also be important risk factors (eg, race) are both poorly represented (often recorded simply as “Unknown”) and inadequately characterized in the EHR [
It is important to consider specific clinical needs and care settings when comparing the various forms of performance measures reported across studies. In the case of strokes, misdiagnosis of a stroke (labeling a true stroke as a nonstroke condition) usually leads to more severe adverse patient outcomes than overdiagnosis. Although false-positive stroke mimics rarely lead to legal consequences, false negatives can cause delays in critical treatments and often give rise to accusations of medical errors. Moreover, given the inherent trade-off between sensitivity and specificity, the prehospital stroke scales’ focus on specificity (ie, reducing overdiagnosis) may result in a substantial number of misdiagnoses of strokes that need to be addressed at patients’ hospital presentations. Therefore, minimizing the false-negative rate or maximizing the sensitivity is paramount in acute care settings for both patients and providers. Several recent studies have compared the currently available clinical assessment tools such as the field stroke triage scale, National Institute of Health Stroke Scale, Los Angeles Motor Scale, and Rapid Arterial Occlusion Evaluation, which incorporate cortical signs (eg, gaze deviation, aphasia, and neglect) as well as motor dysfunction, and found that these tools had better diagnostic accuracy for detecting patients with large vessel occlusion than for distinguishing between acute stroke and stroke mimics [
This study is also one of the first large-scale studies to systematically assess the added value of SDoH information in a population-based risk-prediction setting using administrative data. Although many studies have shown that various social or behavioral factors are associated with health outcomes, very few have explicitly examined whether the knowledge of these factors improves the prediction of clinical events or health outcomes. Our results are consistent with the findings of nascent studies that link SDoH data with EHR data to predict ED visits [
This study has room for further improvement, which is left for future research. First, this was a retrospective study, and confirmation of stroke cases relied on International Classification of Diseases codes. It is desirable to have patients’ complicated medical records reviewed to ascertain stroke diagnosis; however, this process is labor intensive and expensive, especially when it is a large-scale study with hundreds of thousands of patients across different health systems. Our results require further validation but have the potential for improving stroke triage and diagnosis.
Second, the algorithm we proposed should not be considered as the gold standard for stroke diagnosis. Rather, we believe that the algorithm complements the existing stroke scoring systems used in the prehospital or emergency room settings and can be integrated into ML-enabled decision support systems that combine patients’ medical history, SDoH, and clinical data. Such a decision system would have the advantage of being agile and iterative, in the sense that the model outcome can be reassessed at regular intervals as more data are collected in the ED, as well as the integration of variables with the most promising relevance.
Third, the focus of this study was to predict stroke solely based on the information available at the time of a patient’s initial presentation at the hospital. This is because first-time or new patients with stroke make up the supermajority (77%) of the yearly US patient population with stroke [
Finally, our findings are limited to the SDoH variables available in administrative data, suggesting the importance of developing standards and tools to routinely collect and screen individual-level SDoH data and effectively integrate them into both EHR and structured claims data. Our current prediction does not require any additional effort to collect additional individual-level SDoH. The community-level ACS variables have already been incorporated as part of the best pretrained model. The patient-level details used in our prediction are (1) basic demographics including age, gender, race and ethnicity, and primary payer (ie, Medicare, Medicaid, private insurance, or others); (2) arrival information (eg, whether it was an emergency or elective admission and whether the patient was admitted during a weekend or night shift); and (3) whether the patient resided in an urban or a rural area and the quartile in which their median household income fell (
Stroke is among the most common and dangerous misdiagnosed medical conditions. Black people, Hispanic people, women, older people on Medicare, and people in rural areas are less likely to be diagnosed in time for treatment after having a stroke. Timely detection is the key to effective management and improved patient outcomes.
We developed a high-performance ML-based stroke prediction algorithm that outperforms the existing early warning scoring systems. The algorithm is based on variables routinely collected and readily available at the time of patients’ hospital presentations and may provide an opportunity for enhanced patient monitoring and stroke triage and improved health outcomes. Because the prediction model does not require clinical notes or diagnostic test results, it can be particularly useful in underresourced EDs in rural and underserved communities with limited availability of sensitive diagnostic tools and incomplete data-gathering capabilities. Moreover, the algorithm can be incorporated into an automated, AI-enabled decision support system that combines administrative data widely available at the time of ED presentation and subsequently available clinical notes and diagnostic test results to further improve stroke diagnosis, triage, and management.
The top 20 principal diagnoses in the analysis sample.
Community-level social determinants of health variables.
Tuned hyperparameters in machine learning models.
Performance of the stroke prediction models based on the alternative data split method.
Glossary of the terms used as well as the variable definitions.
Results of the ablation analysis.
ACS: American Community Survey
AUC: area under the receiver operating characteristic curve
CT: computed tomography
ED: emergency department
EHR: electronic health record
GBM: gradient boosting machine
ML: machine learning
RF: random forest
SDoH: social determinants of health
SHAP: Shapley Additive Explanations
The authors thank the Agency for Healthcare Research and Quality and its partner organization the Florida Agency for Health Care Administration for providing access to the State Inpatient Databases through the Healthcare Cost and Utilization Project.
They are grateful for the comments and suggestions from the 3 anonymous reviewers and participants of the 2019 Institute for Operations Research and the Management Sciences Healthcare Conference, 2020 Workshop on Information Technologies and Systems, Healthcare Information and Management Systems Society 2020 Big Data Symposium, and Production and Operations Management Society 31st Annual Conference, where earlier versions of this work were presented, and to graduate students at the Heinz College at Carnegie Mellon University for their help with data gathering and preliminary analyses.
MC and RP conceived the idea for this study. MC, RP, and XT designed the study. MC and XT conducted the analysis and drafted the manuscript with extensive input and critical suggestions from RP. All the authors interpreted the results, revised the manuscript, and read and approved the final manuscript.
None declared.