Natural Language Processing and Machine Learning for Identifying Incident Stroke From Electronic Health Records: Algorithm Development and Validation

Background: Stroke is an important clinical outcome in cardiovascular research. However, the ascertainment of incident stroke is typically accomplished via time-consuming manual chart abstraction. Current phenotyping efforts using electronic health records for stroke focus on case ascertainment rather than incident disease, which requires knowledge of the temporal sequence of events. Objective: The aim of this study was to develop a machine learning–based phenotyping algorithm for incident stroke ascertainment based on diagnosis codes, procedure codes, and clinical concepts extracted from clinical notes using natural language processing. Methods: The algorithm was trained and validated using an existing epidemiology cohort consisting of 4914 patients with atrial fibrillation (AF) with manually curated incident stroke events. Various combinations of feature sets and machine learning classifiers were compared. Using a heuristic rule based on the composition of concepts and codes, we further detected the stroke subtype (ischemic stroke/transient ischemic attack or hemorrhagic stroke) of each identified stroke. The algorithm was further validated using a cohort (n=150) stratified sampled from a population in Olmsted County, Minnesota (N=74,314). Results: Among the 4914 patients with AF, 740 had validated incident stroke events. The best-performing stroke phenotyping algorithm used clinical concepts, diagnosis codes, and procedure codes as features in a random forest classifier. Among patients with stroke codes in the general population sample, the best-performing model achieved a positive predictive value of 86% (43/50; 95% CI 0.74-0.93) and a negative predictive value of 96% (96/100). For subtype identification, we achieved an accuracy of 83% in the AF cohort and 80% in the general population sample. Conclusions: We developed and validated a machine learning–based algorithm that performed well for identifying incident stroke and for determining type of stroke. The algorithm also performed well on a sample from a general population, further demonstrating its generalizability and potential for adoption by other institutions.


Introduction
Stroke is a syndrome involving a rapid loss of cerebral function with vascular origin [1]. The loss of function can result in deep coma or subarachnoid hemorrhage. There are two broad categories of stroke: hemorrhagic and ischemic stroke [2]. Hemorrhage is caused by bleeding within the skull cavity, while ischemia is characterized by inadequate blood to supply a part of the brain. Stroke identification is an important outcome for various cardiovascular studies [3][4][5]. However, a challenge with stroke ascertainment is the inconsistent use of International Classification of Diseases (ICD) codes [6], which may result in inaccurate code-based ascertainment of cases [7]. Therefore, the time-consuming process of electronic health record (EHR) abstraction remains the gold standard of stroke ascertainment [8,9].
Machine learning has recently gained popularity for its ability to classify patients or make predictions on various aspects of diseases. In contrast to manually curated algorithms based on domain expertise, machine learning is a data-driven approach that can be trained on large data sets to identify and leverage complex feature relationships and improve classification and prediction tasks thereby. In terms of stroke, machine learning algorithms have been applied to predict future stroke cases [10], mortality and recurrent strokes [11,12], and treatment outcomes [13,14]. Most existing phenotyping algorithms have been developed to only differentiate between cases and noncases of diseases [15][16][17][18]; however, ascertaining incident disease (ie, first occurrence of disease) in a population is a more difficult task [8,19,20]. A recent study by Ni et al [21] examined potential predictive features of stroke occurrence including demographic, clinical, and diagnostic characteristics of patients. The authors found that diagnostic tests for stroke, such as computed tomography (CT) and magnetic resonance imaging (MRI), contributed to most of the model performance, and that the optimal feature set included imaging findings, signs and symptoms, interventions, emergency department assessments, findings from angiography and carotid ultrasound tests, ICD codes, substance use (smoking, alcohol, and street drugs) characteristics, and demographics. However, features such as signs and symptoms, substance use characteristics, and demographics may not be specific enough for disease ascertainment, as there is a high prevalence of strokelike symptoms among people without a diagnosis of stroke [22]. In addition, incorporating too many features in the model may result in overfitting without appropriate regularization. Another study [7] also used ICD and Current Procedural Terminology (CPT) [23] codes as features to classify positive, possible, and negative stroke cases. However, stroke-related clinical concepts (including both disease name concepts and symptom concepts) in unstructured clinical notes were not included in this model. Rapid adoption of EHRs has enabled secondary use of the EHR data in epidemiological research [24][25][26]. Previous studies noted the existence of bias using a single type of EHR data (ie, diagnosis codes) [27][28][29]. To avoid this bias, the Electronic Medical Records and Genomics (eMERGE) consortium [30,31] has piloted the development of EHR-based phenotyping algorithms using multiple types of EHR data [32][33][34]. This has given rise to a number of phenotyping algorithms that use both structured EHR data (eg, demographics, diagnosis and procedure codes, laboratory test results, and medications) and unstructured EHR data (eg, clinical notes, imaging reports, and discharge summaries) [35][36][37][38]. However, the eMERGE consortium algorithms are typically focused on identifying cases and noncases rather than characterizing a new-onset (ie, incident) disease in a population. Moreover, extracting information from unstructured clinical text is a nontrivial task that involves natural language processing techniques [39][40][41].
In our paper, we address existing challenges for stroke ascertainment, specifically for incident stroke. Our research objective is to develop and validate a machine learning-based phenotyping algorithm to identify incident stroke and detailed stroke subtypes based on three major EHR-derived data elements: clinical concepts extracted from clinical notes; ICD, Ninth Revision (ICD-9) diagnosis codes; and CPT procedure codes.

Methods
This study was approved by the Mayo Clinic Institutional Review Board (no. 17-008818) and is in accordance with the ethical standards mandated by the committee on responsible human experimentation. The data that support the findings of this study are available from the corresponding author upon reasonable request.

Study Design
This was a predictive modeling study that used observational cohort data for training and validation. We employed an atrial fibrillation (AF) cohort, in which all incidences of stroke were manually ascertained in a previous study [4], to train and test our phenotyping algorithm for the date of incident stroke events. We then evaluated the generalizability of our algorithm in a general population cohort.

The AF Cohort
The AF cohort comprised a patient population from Olmsted County, Minnesota, USA [4,42]. Olmsted County is an area relatively isolated from other urban centers with only a few providers delivering most care to residents, primarily Mayo Clinic and Olmsted Medical Center [43][44][45]. Extracting all health care-related events was completed through the Rochester Epidemiology Project (REP), a records linkage system [43,44]. The REP is a records linkage system that allows retrieval of nearly all health care utilization and outcomes of residents living in Olmsted County. The electronic indexes of the REP include demographic information, diagnostic and procedure codes, health care utilization data, outpatient drug prescriptions, results of laboratory tests, and information about smoking, height, weight, and body mass index. ICD-9 codes and the Mayo Clinic electrocardiograms were obtained among adults aged ≥18 years from 2000 to 2014 to ascertain AF. Patients were identified by the presence of an ICD-9 code for stroke through March 31, 2015, and then validated by manual review of the EHR. Strokes were classified as ischemic strokes/transient ischemic attack or hemorrhagic strokes [4,46]. The first (incident) event of each type of stroke after the incident AF date was ascertained, regardless of whether a patient had a prior stroke. The AF cohort included 4914 validated patients with AF, 1773 of whom were screened for a possible stroke. Table 1 shows the cohort characteristics. Manual abstraction of the EHR validated the stroke code in 740 patients. Manual ascertainment of stroke and the dates of the events were used as a gold standard to train and test the stroke algorithm.

Candidate Predictive Features
The proposed algorithm aimed to identify first (incident) stroke events within a certain time frame. The three major data elements we used were clinical concepts, ICD-9 codes, and CPT codes. To align with the manual review process, only codes and clinical notes from the AF incident date to March 31, 2015, were retrieved and processed. In our analyses, we constructed different models by varying the inclusion of CPT codes and symptom-related clinical concepts in the model feature set and compared different models' performances.
Both ICD-9 and CPT codes were extracted from the REP database. Clinical concepts were identified from the major and secondary problem list section of Mayo Clinic EHR, and from clinical notes from other REP sites using a natural language processing system, MedTagger [47]. Expert-provided vocabulary was adopted from a previous study [48] to extract clinical concepts from unstructured clinical notes. MedTagger enables a series of natural language processing processes, including regular expression matching and positive, negative, or probable identification with ConText [49,50], and is insensitive to upper and lower case. MedTagger is also able to determine if the extracted clinical concepts are referring to the patients or their family members, or if the extracted clinical concepts are in present tense and thus are referring to a current event rather than a past medical condition. We considered only documents with positive, present-tense stroke mentions that were referring to patients themselves. Table S1 in Multimedia Appendix 1 lists clinical concepts for 2 major stroke subtypes and stroke-related symptoms. Table S2 in Multimedia Appendix 1 lists ICD-9 codes for 2 stroke subtypes and stroke-related symptoms. Table  S3 in Multimedia Appendix 1 lists the CPT codes used in the stroke algorithm.
Clinical concept dates were determined by the date of the clinical notes from which clinical concepts were extracted, while ICD-9 and CPT code dates were extracted from the REP. Each visit was characterized by clinical concepts, ICD-9, and CPT codes within a 60-day window. The visit date was determined by the earliest date of any of the 3 elements in the 60-day window. If visit dates were within a 60-day window of a confirmed stroke incidence date, they were considered positive instances; otherwise, they were considered negative instances. Figure 1 demonstrates an example with an incident stroke on July 4, 2004. All visits were extracted and included in our data set if there was at least one key word or code during a 60-day window. Nurse abstractors reviewed every visit sequentially until they determined the incidence date to be July 4, 2004. All subsequent visits after a positive stroke incident were not reviewed and thus were not included in our analyses. Since the confirmed stroke incidence date fell in the date range of the third visit (June 24, 2004-August 22, 2004), we considered the combination of codes and clinical concepts in this visit to be predictive of a positive stroke incidence.

Data Analysis
After incident stroke was confirmed, visits afterwards were not reviewed by abstractors and thus excluded from our overall data set. Figure 2 shows the workflow of the algorithm training and testing process. We created a data set with 9130 confirmed visits (with stroke vs nonstroke labels) among the 1773 patients. In total, there were 746 stroke visits and 8384 nonstroke visits. The stroke incidence count (n=746) was larger than the number of patients with confirmed stroke incidence (740) because incidence dates for different subtypes of stroke (ischemic stroke/ transient ischemic attack and hemorrhagic stroke) were all recorded, such that patients might have had multiple incidence dates. We included data from a randomly selected 79.98% of our screened patients (1418/1773 patients; 7253 visits) as a training set and the remaining 20.02% of our screened patients (355/1773 patients; 1877 visits) were retained as an independent testing set. Due to the outcome imbalance in the data set (positive:negative ratio of about 1:10), we used the synthetic minority oversampling technique [51] to create oversampled training data sets with an oversampling percentage of 1000%. We considered two machine learning classifiers, logistic regression and random forest [52], to train our phenotyping models. Logistic regression served as a baseline modeling algorithm. Random forest was also chosen because of its high performance with structured input features and better model flexibility. We also considered the influence of feature groups by varying the inclusion of CPT codes and symptom terms in the input feature set. The hyperparameter tuning of the machine learning models was performed using 10-fold cross-validation. The performance metrics adopted for the machine learning task in the test set were precision, recall, and F score. The oversampling and machine learning modeling training and testing processes were implemented in Weka 3 (University of Waikato) [53]. Additional statistical summaries were performed using the R statistical software version 3.6.2 (The R Foundation for Statistical Computing). Quantitative variables are summarized as means, while nominal variables are expressed by counts and percentages.

Validation Cohort
We evaluated the generalizability of our model on a sample from a general population cohort of 71,429 patients. This cohort consisted of individuals sampled in Olmsted County, Minnesota on January 1, 2006, with an age ≥30 years and with no prior history of cardiovascular disease. We applied the best performing model based on the leave-out test set to this entire population cohort to generate incident stroke predictions. We then randomly selected 50 patients from those who had no stroke-related features (ie, de facto negative stroke predictions), 50 patients from those who were shown to have negative stroke predictions, and 50 patients from those who were shown to have positive stroke predictions and a predicted incident stroke for evaluation. This verification-based sampling strategy allowed for estimates of positive and negative predictive values (PPVs and NPVs, respectively) by conditioning on algorithm predictions. Under these conditions (n=50), the half-width of the 95% Wilson score CI for the PPVs and separate NPVs would be approximately 0.1 for a true value of 0.85.
All 150 patient cases were reviewed by 1 nurse abstractor to confirm incident stroke, which served as our gold standard. We recorded model prediction outputs on all patient visits in the 150-patient validation set. We combined visit-level true predictions to generate patient-level incidence predictions by saving only the earliest date of positive predictions as stroke incidences. We compared patient-level incidence predictions with our gold standard. True prediction in our evaluation meant the date of the predicted incident stroke was within 60 days of the abstracted stroke date. A 2 x 2 confusion matrix was used to calculate performance scores for prediction evaluation. Model performance metrics included PPV and NPV using manual evaluation as the gold standard and patient-level predictions to calculate true positives, false positives, true negatives, and false negatives. The uncertainty of these performance estimates was calculated using Wilson score 95% CI for proportions.
In addition, we developed heuristic rules to distinguish stroke subtype (ischemic stroke/transient ischemic attack or hemorrhagic stroke) of each identified stroke incidence by analyzing the composition of keyword or code input feature sets (in a window of 60 days). We counted the number of keywords or codes for each ischemic stroke/transient ischemic attack and hemorrhagic stroke. If an input feature set contained more keywords or codes for ischemic stroke/transient ischemic attack, then this incidence was considered an ischemic stroke incidence; otherwise, it was considered a hemorrhagic stroke incidence. We only evaluated correct incident stroke predictions from the previous step in the evaluation data set with manually ascertained subtypes as the gold standard. Accuracy was calculated to measure performance of the subtype identification. Table 2 shows the algorithm performance measured on the test set for 8 models run on 4 input combinations and 2 classifiers (logistic regression and random forest). The random forest classifier outperformed the logistic classifier regardless of the feature sets used. Inclusion of CPT codes as features improved the performance for the random forest model with F score increased from 0.836 (Model 3) to 0.905 (Model 1). However, in the logistic model, the inclusion of CPT codes slightly improved the F score from 0.772 (Model 4) to 0.793 (Model 2). Using comparisons to all features (Model 1 and 2) and excluding the symptom terms (Model 6 and 7) achieved better F score (values italicized in Table 2).  Table 3 shows the distribution of stroke features in the AF cohort and the general population cohort. The AF cohort had a higher proportion of stroke-related codes and concepts. Results from the evaluation of the 150 selected patient records are presented in Table 4. Prediction performance corresponded to a PPV of 0.86 (95% CI 0.74-0.93), an NPV without ICD codes of 1.00 (95% CI 0.92-1.00), and an NPV with codes of 0.92 (95% CI 0.90-0.98). No strokes were observed among patients with no eligible stroke ICD codes. For subtype characterization, we achieved an accuracy of 80% (95% CI 0.68-0.89) in the general population sample.

Principal Findings
The rapid expansion of information available in EHRs opens new opportunities to combine structured and unstructured data for research. Advances in machine learning methods and tools facilitate the combination of multimodal clinical data for effective development of phenotyping algorithms. However, performance of stroke electronic phenotyping algorithms varies by stroke subtypes [25] and phenotyping tasks (ie, case vs noncase or incident stroke phenotyping). Our previous study showed that when naïve ICD codes with clinical concept matching were used, stroke incidence identification had a PPV of 60.6% while case-versus-noncase identification had a much higher PPV of 88.7% [20].
In this study, we included clinical concepts extracted from clinical notes along with ICD-9 and CPT codes for incident stroke ascertainment. The rationale to add CPT codes is that diagnosis of stroke usually needs to be confirmed by imaging evidence and will probably be followed by therapeutic procedures. Thus, the addition of CPT codes in the model could potentially help to reduce the information redundancy effect by distinguishing between past and current events recorded in clinical notes. Our algorithm closely resembles the ascertainment process (chart review) of clinicians, which uses multiple types of EHR data (eg, diagnoses and procedure codes, unstructured clinical notes) in a parsimonious manner. Due to the redundancy and temporal ambiguity in unstructured clinical notes, we needed to construct a data set with sufficient and interpretable features from multimodal clinical data.
We found that the random forest generated better results, while the addition of CPT codes improved overall performance. This may be because imaging procedures, especially head CT or MRI, are critical in the diagnosis of stroke. Therefore, CPT codes of such procedures can be important indicators for distinguishing between incident and historical events. In addition, ICD codes and therapeutic procedures can vary significantly between incident and recurrent events. Meanwhile, we observed that the additions of stroke-related symptom concepts were not helpful for the phenotyping task. This may be due to the fact that our stroke incidence ascertainment depends largely on the ubiquitous nature of many stroke-related symptoms: they may be stroke-related but not necessarily stroke specific. Additionally, ascertainment requires well-documented evidence, such as imaging or imaging reports. Without properly recorded evidence, patients are not likely to be ascertained as stroke.
Our generalizability evaluation demonstrates that models trained using a specific disease cohort for incident stroke ascertainment can generalize well to a general patient population. This is very encouraging given there are many existing patient cohorts available. Secondary use of these patient cohorts would be a cost-effective way for developing machine learning-based phenotyping algorithms. The study also illustrates that incorporating structured EHR data, such as CPT codes, can effectively distinguish incident stroke mentions from historical events in the clinical notes.
One limitation of our study is the dependence of domain experts to provide relevant clinical concepts, ICD-9 codes, and CPT codes. In the future, we will explore advance feature engineering approaches to identify those relevant concepts or codes automatically or semiautomatically. We are also aware that our imbalance cohort data and oversampling strategies might have introduced overfitting. Although our evaluation in the general population proved the performance of the algorithm, in the future, we can adopt a case-control matching strategy to deal with imbalanced data and mitigate the potential overfitting issue.
In addition, new treatment strategies (mechanical thrombectomy) to treat stroke have been in the market in recent years, and thus the features used in our algorithm could have different weights for predictions of events in different temporal settings. A more precise strategy could consider using different features for prediction tasks in different time frames, where variations in clinical knowledge and care path have been considered.

Conclusions
In conclusion, the high prevalence of stroke and the lack of an efficient algorithm to confirm incident stroke events necessitate the development of an effective and interpretable algorithm to identify incident stroke occurrences. In this paper, we described our efforts to develop and validate an EHR-based algorithm that accurately identifies incident stroke events and goes beyond typical case-versus-noncase stroke identification. Our algorithm's good performance in a general population sample demonstrates its generalizability and potential to be adopted by other institutions.