This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Fractures as a result of osteoporosis and low bone mass are common and give rise to significant clinical, personal, and economic burden. Even after a fracture occurs, high fracture risk remains widely underdiagnosed and undertreated. Common fracture risk assessment tools utilize a subset of clinical risk factors for prediction, and often require manual data entry. Furthermore, these tools predict risk over the long term and do not explicitly provide short-term risk estimates necessary to identify patients likely to experience a fracture in the next 1-2 years.
The goal of this study was to develop and evaluate an algorithm for the identification of patients at risk of fracture in a subsequent 1- to 2-year period. In order to address the aforementioned limitations of current prediction tools, this approach focused on a short-term timeframe, automated data entry, and the use of longitudinal data to inform the predictions.
Using retrospective electronic health record data from over 1,000,000 patients, we developed Crystal Bone, an algorithm that applies machine learning techniques from natural language processing to the temporal nature of patient histories to generate short-term fracture risk predictions. Similar to how language models predict the next word in a given sentence or the topic of a document, Crystal Bone predicts whether a patient’s future trajectory might contain a fracture event, or whether the signature of the patient’s journey is similar to that of a typical future fracture patient. A holdout set with 192,590 patients was used to validate accuracy. Experimental baseline models and human-level performance were used for comparison.
The model accurately predicted 1- to 2-year fracture risk for patients aged over 50 years (area under the receiver operating characteristics curve [AUROC] 0.81). These algorithms outperformed the experimental baselines (AUROC 0.67) and showed meaningful improvements when compared to retrospective approximation of human-level performance by correctly identifying 9649 of 13,765 (70%) at-risk patients who did not receive any preventative bone-health-related medical interventions from their physicians.
These findings indicate that it is possible to use a patient’s unique medical history as it changes over time to predict the risk of short-term fracture. Validating and applying such a tool within the health care system could enable automated and widespread prediction of this risk and may help with identification of patients at very high risk of fracture.
Fractures due to osteoporosis and low bone mass are associated with a significant personal, clinical, and economic burden. These fractures are common; the risk of sustaining such a fracture increases with age, and their incidence is expected to increase worldwide as the population ages [
Several fracture risk prediction tools have been developed for clinical use. The most commonly used tools are the University of Sheffield Fracture Risk Assessment Tool, known as FRAX [
Increased risk of fracture in the next 1-2 years is not routinely assessed in clinical practice, despite the existence of rapid-acting preventative therapeutics [
To address these unmet needs, we developed
We used subsets of the Optum deidentified electronic health record data set, which contains comprehensive longitudinal electronic health record data for 91 million patients from over 140,000 providers (as of March 2018) from the United States. The subsets, which contain bone health and pan-therapeutic populations respectively, cover the time from January 1, 2007, through December 31, 2018 (Optum, email communication, August 2019).
The bone health subset was obtained by filtering for patients with osteoporosis, fractures, or bone-related medications (n=6,329,986). In the period covered by the data set, the fracture incidence rate (ie, the proportion of fractures among all events detected, which may include multiple fractures per person) was 39% in the population over 50 years of age. The bone health data set was primarily used for training the model.
The pan-therapeutic data set represented a random sample of 5% of the overall Optum electronic health record data set and contained patient data (n=3,476,219) with no filtering for any specific comorbidities or treatments; this data set had a fracture incidence rate of 8.5% in the population over 50 years of age. Because the sample was drawn from such a large population, the pan-therapeutic data set was assumed to be broadly representative of the US population. As such, we performed all model evaluations on a testing sample from this data set (a holdout data set), to better understand the generalizability of the model in a real-world setting.
Since this was a retrospective study using deidentified data, patients were not required to actively participate in the study. Therefore, neither informed consent of patients nor institutional review board approval was required.
The cohort consisted of patients who were at least 50 years of age at the time of their event; this criterion was chosen to reduce the data to a population that is more susceptible to fractures associated with osteoporosis and low bone mass. For fracture patients, an event is the date of occurrence of any qualifying fracture. Qualifying fractures are defined by a set of rules based on those used by Wright et al [
We further filtered our cohorts for patients with at least 2 years of medical history leading up to their respective events. Applying these parameters limited the bone health cohort to 3,408,494 patients and the pan-therapeutic cohort to 700,315 patients.
We applied sliding windows to the data (
Sliding window algorithm schematic. This schematic depicts the sliding window algorithm for a multifracture and nonfracture patient. Dx: diagnosis; ICD: International Classification of Diseases.
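The windowing idea can be illustrated with a minimal sketch. It assumes that each window pairs the diagnosis history before an index date with a label indicating whether a qualifying fracture occurs in the following 2 years. The stride between windows (`STEP_DAYS`), the function name `make_windows`, and the choice to start windows after 2 years of history are hypothetical illustration choices, not the study's actual implementation.

```python
from datetime import date, timedelta

HORIZON_DAYS = 730      # 2-year prediction horizon described in the text
MIN_HISTORY_DAYS = 730  # at least 2 years of prior history, as in the text
STEP_DAYS = 365         # hypothetical stride between consecutive windows


def make_windows(diagnoses, fracture_dates, first_seen, last_seen):
    """Return a list of (index_date, history_codes, label) for one patient.

    diagnoses: list of (date, icd_code) pairs sorted by date.
    fracture_dates: dates of qualifying fracture events.
    label is 1 if a fracture falls in [index_date, index_date + horizon).
    """
    windows = []
    index_date = first_seen + timedelta(days=MIN_HISTORY_DAYS)
    # Only keep windows whose full 2-year horizon is observed.
    while index_date + timedelta(days=HORIZON_DAYS) <= last_seen:
        horizon_end = index_date + timedelta(days=HORIZON_DAYS)
        history = [code for d, code in diagnoses if d < index_date]
        label = int(any(index_date <= f < horizon_end for f in fracture_dates))
        windows.append((index_date, history, label))
        index_date += timedelta(days=STEP_DAYS)
    return windows
```

Under these assumptions, a patient with a single fracture contributes positive windows before the fracture and negative windows after it, which is one way a multifracture or nonfracture patient could yield several training samples.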
There was no additional filtering based on specific diagnoses or comorbidities. For each qualifying patient, the algorithms utilized all available ICD codes in the historical sequences described above. Only codes that occurred fewer than 5 times in the full cohort were excluded, as they were too rare to be included in the diagnosis code vocabulary.
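The vocabulary rule above (drop codes seen fewer than 5 times) can be sketched as follows. The function names `build_vocab` and `encode`, and the reservation of index 0 for padding, are assumptions made for illustration only.

```python
from collections import Counter

MIN_COUNT = 5  # codes occurring fewer than 5 times in the cohort are excluded


def build_vocab(sequences, min_count=MIN_COUNT):
    """Map each sufficiently frequent ICD code to a positive integer index."""
    counts = Counter(code for seq in sequences for code in seq)
    kept = sorted(code for code, n in counts.items() if n >= min_count)
    # Index 0 is left free for padding (an assumption, not from the source).
    return {code: i + 1 for i, code in enumerate(kept)}


def encode(sequence, vocab):
    """Convert a code sequence to indices, silently skipping rare codes."""
    return [vocab[code] for code in sequence if code in vocab]


# Toy cohort: two codes appear 5 times each, one code appears once.
toy_sequences = [["M81.0", "E11.9"]] * 5 + [["Z99.9"]]
toy_vocab = build_vocab(toy_sequences)
# toy_vocab == {"E11.9": 1, "M81.0": 2}; "Z99.9" is excluded as too rare
```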
Before model training, we generated a 70:30 random split of the pan-therapeutic data, representing training and holdout subsets. Since the pan-therapeutic data set is highly imbalanced, with a fracture event incidence of only 6.5% after applying the sliding window algorithm, we oversampled additional fracture windows from the bone health data set to achieve a balanced (50:50) training set for modeling. This oversampling training paradigm was replicated for all models. The holdout set remained untouched, with the original distribution of fractures.
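A minimal sketch of this split-and-balance procedure is shown below. It assumes samples are `(features, label)` pairs and that the extra fracture windows are drawn without replacement from a second pool; the source does not state the sampling details, and the function names are hypothetical.

```python
import random


def split_70_30(samples, seed=0):
    """Randomly split samples into 70% training and 30% holdout subsets."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def balance_with_extra_positives(train, extra_positives, seed=0):
    """Add fracture windows from a second pool until classes are 50:50.

    train: list of (features, label) pairs with label in {0, 1}.
    extra_positives: additional (features, 1) pairs, e.g. from a
    fracture-enriched data set. Sampling without replacement is an
    assumption of this sketch.
    """
    rng = random.Random(seed)
    n_pos = sum(1 for _, y in train if y == 1)
    n_neg = len(train) - n_pos
    needed = max(0, n_neg - n_pos)
    if needed > len(extra_positives):
        raise ValueError("not enough extra positive samples to balance")
    balanced = list(train) + rng.sample(extra_positives, needed)
    rng.shuffle(balanced)
    return balanced
```

Only the training subset is passed to `balance_with_extra_positives`; the holdout subset keeps its original class distribution, mirroring the description above.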
Crystal Bone was inspired by techniques that are typically applied in natural language processing. However, instead of applying these techniques to text-based data, we applied them to sequences of ICD codes. Correspondingly, each ICD code was analogous to a word, and each sequence of ICD codes was analogous to a document. To this end, we implemented 2 distinct frameworks: (1) ICD code vectorization and long short-term memory networks, and (2) patient-level vectorization and extreme gradient boosting decision trees. Both approaches utilize sequences of ICD codes as inputs. The ICD code vectorization and long short-term memory framework undertakes this task by first learning semantic definitions for the codes, then evaluating the sequence of definitions through a deep learning network. The patient-level vectorization and extreme gradient boosting modeling framework employs a similar approach; however, rather than embedding individual ICD codes, it embeds the entire ICD code sequence for each patient, thereby learning “summaries” of patient sequences. This framework produces a prediction by feeding these summaries through a decision tree classifier. The model parameters were tuned to optimize AUROC; details of this process are provided in
The first framework consisted of 2 primary components. The ICD code vectorization component was responsible for learning a “definition” for each ICD code based on skip-gram architecture word embedding (word2vec) [
The long short-term memory component consisted of a neural network with long short-term memory layers, a deep learning architecture that enables the evaluation of recurrent data, such as sequences of embedded ICD codes. We trained this network with the complete training set (including oversampling from the bone health data set). The long short-term memory network predicted the likelihood of a fracture event within 2 years as a classification problem. Long short-term memory networks are a common approach for solving such problems [
Additionally, given the ubiquitous use of nonsequential features such as age and sex for predicting fracture risk, we supplied age and sex to the neural network as static features through concatenation of long short-term memory and dense layers. Furthermore, because the long short-term memory framework required all input sequences to have uniform length, we also included total diagnosis count as a static feature to account for the effects of truncating or padding the sequences. The schematic in
2D projection of ICD-10 code embeddings from the ICD code vectorization model: (a) All ICD-10 codes by the first letter (high-level category) of the code, (b) a cluster of codes related to alcohol near coordinates (2.3, 3) by code subgroups, (c) a cluster of codes related to kidney function near coordinates (3.75, 0.025) by code subgroups, and all ICD-10 fracture codes in region C (d) by region of the body, and (e) by frequency of occurrence. ICD: International Classification of Diseases; UMAP: uniform manifold approximation and projection.
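To make the word2vec analogy concrete, the sketch below shows how skip-gram training pairs could be generated from a sequence of ICD codes: each code is paired with its neighbors inside a context window, and the embedding is then trained to predict the neighbor from the center code. The window size and the function name `skipgram_pairs` are hypothetical; the embedding training itself is not shown.

```python
def skipgram_pairs(sequence, window=2):
    """Return (center, context) pairs for every code in the sequence.

    Each code is paired with the codes at most `window` positions
    before or after it, as in the skip-gram formulation of word2vec.
    """
    pairs = []
    for i, center in enumerate(sequence):
        start = max(0, i - window)
        stop = min(len(sequence), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs


# With window=1, each code is paired only with its immediate neighbors.
example_pairs = skipgram_pairs(["E11.9", "N18.3", "M81.0"], window=1)
# [("E11.9", "N18.3"), ("N18.3", "E11.9"),
#  ("N18.3", "M81.0"), ("M81.0", "N18.3")]
```

Codes that tend to co-occur in patient histories end up with similar vectors under this objective, which is consistent with the clusters of related codes described in the projection above.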
High-level architecture of the long short-term memory neural network including the dimensionality of the inputs, as well as the number of nodes in each layer. Dx: diagnosis; Icd2vec: ICD code vectorization; LSTM: long short-term memory.
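The uniform-length requirement and the static features can be sketched as follows. This is an illustration under assumptions: the maximum length, left-padding with index 0, and keeping the most recent codes when truncating are hypothetical choices, as the source does not specify them.

```python
PAD = 0  # hypothetical padding index, assumed not to map to any ICD code


def pad_or_truncate(encoded, max_len):
    """Return a sequence of exactly max_len indices.

    Long sequences keep their most recent codes; short sequences are
    left-padded. Both choices are assumptions of this sketch.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    recent = encoded[-max_len:]
    return [PAD] * (max_len - len(recent)) + recent


def model_inputs(encoded, age, sex, max_len):
    """Build the sequential input and the static feature vector.

    The total diagnosis count is taken before truncation so that the
    network still sees how much history was dropped or padded.
    """
    static = [age, sex, len(encoded)]
    return pad_or_truncate(encoded, max_len), static


seq_input, static_input = model_inputs([1, 2, 3, 4, 5, 6], age=67, sex=1, max_len=4)
# seq_input == [3, 4, 5, 6]; static_input == [67, 1, 6]
```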
Similar to the ICD code vectorization + long short-term memory modeling framework, the patient-level vectorization and extreme gradient boosting decision trees framework consists of 2 components. First, the patient-level vectorization embeds entire ICD code sequences to a 128-dimensional semantic space using the distributed bag of words framework [
The extreme gradient boosting decision trees component utilized the embeddings from the patient-level vectorization, as well as the static features of age, sex, and total diagnosis count that were incorporated in Framework 1, to predict fracture risk. This type of algorithm, also referred to as XGBoost, is a scalable tree-based modeling approach that improves the generalizability, speed, and efficacy of prediction [
An ensemble model was also evaluated. This algorithm combined the outputs of both the aforementioned frameworks with a logistic regression metaclassifier.
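A stacking step of this kind can be sketched as a logistic regression fitted on the two frameworks' predicted probabilities. The sketch below uses plain gradient descent so that it is self-contained; the learning rate, epoch count, toy scores, and function names are hypothetical, and the study's actual fitting procedure is not described in the text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_meta(p1, p2, y, lr=0.5, epochs=2000):
    """Fit logistic regression weights on two base-model scores.

    p1, p2: predicted probabilities from the two base models.
    y: binary outcomes. Uses batch gradient descent on the log loss.
    """
    w1 = w2 = b = 0.0
    n = len(y)
    for _ in range(epochs):
        g1 = g2 = gb = 0.0
        for a, c, t in zip(p1, p2, y):
            err = sigmoid(w1 * a + w2 * c + b) - t
            g1 += err * a
            g2 += err * c
            gb += err
        w1 -= lr * g1 / n
        w2 -= lr * g2 / n
        b -= lr * gb / n
    return w1, w2, b


def predict_meta(params, a, c):
    """Combine two base-model scores into one ensemble probability."""
    w1, w2, b = params
    return sigmoid(w1 * a + w2 * c + b)


# Toy scores (illustrative only, not study data).
params = fit_meta([0.9, 0.8, 0.2, 0.1], [0.7, 0.9, 0.3, 0.2], [1, 1, 0, 0])
```

In practice the metaclassifier would be fitted on predictions for samples that the base models did not train on, to avoid rewarding overfitted base scores.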
We compared these modeling frameworks to 2 baseline models. The first baseline model utilized the age and sex of each patient at each window. These were 2 of the only features shared by the FRAX tool and the GIH-BFRC models. The other shared feature is prior fracture; however, because neither the FRAX tool nor GIH-BFRC’s method of measuring this value was possible for our data set without censoring, we did not include it in the model. The second baseline incorporated age, sex, and total diagnosis count (number of ICD codes) in each sample; these represent all of the static features used by both modeling frameworks, enabling evaluation of the relative benefit of including sequential ICD code data. Both baseline models utilized extreme gradient boosting decision tree algorithms, the same classification approach that was used in Framework 2.
In addition to these baselines, we approximated human-level performance by isolating a set of retrospective physician-prescribed interventions that were identifiable in the electronic health record data set. These interventions consisted of diagnostic tests as well as pharmacologic treatments. The list of interventions was based on treatment guidelines provided by the National Osteoporosis Foundation [
List of physician interventions for human-level performance analysis.
Type and name | Pharmacologic
Dual-energy x-ray absorptiometry | No
Vertebral fracture assessment | No
Quantitative computed tomography | No
Other bone density measurements (single energy x-ray absorptiometry, radiographic absorptiometry, ultrasound, single-photon absorptiometry) | No
Bone turnover markers | No
Administration of any medications referenced below | Yes
Bisphosphonates (alendronate, alendronate-cholecalciferol, ibandronate, risedronate, zoledronic acid) | Yes
Abaloparatide | Yes
Denosumab | Yes
Raloxifene | Yes
Bazedoxifene | Yes
Romosozumab | Yes
Teriparatide | Yes
Calcitonin | Yes
Osteoporosis (M80, M81, 733.0) | No
We defined the cohort of patients who did not receive any form of intervention (diagnoses, tests, or treatments) as
We report model performance on a set of 5 primary metrics: AUROC, recall (sensitivity), specificity, precision, and area under the precision-recall curve (AUPRC).
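For readers less familiar with AUROC, the sketch below computes it from its rank interpretation: the probability that a randomly chosen fracture sample receives a higher score than a randomly chosen nonfracture sample, with ties counted as one half. The pairwise loop is quadratic and intended only for illustration; the function name and toy values are not from the study.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: binary outcomes (1 = fracture, 0 = no fracture).
    scores: predicted risk, higher meaning more likely to fracture.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Three of the four positive-negative pairs are ranked correctly.
toy_auroc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

Unlike recall, specificity, and precision, this metric does not depend on a decision threshold, which is why it is useful for comparing models whose operating points differ.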
The overall performance of the algorithms is shown through comparison of the 2 frameworks with the 2 baseline models to demonstrate the quality of each algorithm's predictions.
Comparison of model performance metrics.
Model | AUROCa | Recall | Specificity | Precision | AUPRCb |
ICD code vectorization + LSTMc | 0.812 | 0.646 | 0.812 | 0.192 | 0.462 |
Patient level vectorization + XGBoostd | 0.790 | 0.670 | 0.758 | 0.161 | 0.358 |
Ensemble | 0.818 | 0.693 | 0.777 | 0.177 | 0.463 |
Baseline (age, sex) | 0.667 | 0.787 | 0.416 | 0.0855 | 0.119 |
Baseline (age, sex, diagnosis count) | 0.668 | 0.547 | 0.707 | 0.114 | 0.130 |
aAUROC: area under the receiver operating characteristics curve.
bAUPRC: area under the precision-recall curve.
cLSTM: long short-term memory.
dXGBoost: extreme gradient boosting.
To further characterize this performance, we evaluated the ICD code vectorization and long short-term memory model on primary and subsequent fracture events. While the model performs best on subsequent fractures, both primary and subsequent fracture analyses (AUROC 0.742 and 0.910, respectively) show a marked improvement against corresponding baseline models (AUROC 0.591 and 0.747, respectively). We report detailed results of this experiment and additional evaluations of sensitivity and robustness of this model in
For windows with interventions, only 11,833 of the 69,198 detected interventions (17.1%) included treatments; the remaining 57,365 (82.9%) were either diagnoses or diagnostic tests. In the intervention cohort, Crystal Bone Framework 1 correctly captured 10,277 of the 12,244 windows (83.9%) in which a fracture occurred within 2 years. Of the windows with interventions and no fracture event, 19,235 of 56,954 (33.8%) were incorrectly flagged by our algorithm as at risk. These results suggest that Crystal Bone can recognize interventions through their associated ICD codes and adjust the predicted fracture risk accordingly. However, a deeper exploration of specific interventions is required to verify this.
Human-level performance results.
Cohort | Windows, n (%) | Flag, n (%) | No flag, n (%)
All windows | 630,445 (100) | —a | —
No intervention | 561,247 (89.0) | — | —
No intervention, fracture | 28,626 (5.1) | 16,127 (56.3) | 12,499 (43.7)
No intervention, nonfracture | 532,621 (94.9) | 91,717 (17.2) | 440,904 (82.8)
Intervention | 69,198 (11.0) | — | —
Intervention, fracture | 12,244 (17.7) | 10,277 (83.9) | 1967 (16.1)
Intervention, nonfracture | 56,954 (82.3) | 19,235 (33.8) | 37,719 (66.2)
aNot reported.
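As a consistency check, the window counts in the human-level performance table can be pooled across the two cohorts into a single confusion matrix. The sketch below does this arithmetic; the variable and function names are illustrative. The resulting recall, specificity, and precision agree with the values reported for the ICD code vectorization + LSTM framework in the model comparison table.

```python
def threshold_metrics(tp, fp, tn, fn):
    """Recall, specificity, and precision from confusion-matrix counts."""
    return {
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


# Fracture windows: cohort totals and flagged counts from the table.
fracture_total = 28626 + 12244   # without + with intervention
tp = 16127 + 10277               # fracture windows flagged as at risk
fn = fracture_total - tp         # fracture windows not flagged

# Nonfracture windows: flagged and unflagged counts from the table.
fp = 91717 + 19235
tn = 440904 + 37719

pooled = {k: round(v, 3) for k, v in threshold_metrics(tp, fp, tn, fn).items()}
print(pooled)  # {'recall': 0.646, 'specificity': 0.812, 'precision': 0.192}
```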
The overlap analysis enabled us to better understand how well Crystal Bone Framework 1 correlated with observed physician interventions through exploration of the first pharmacological treatment in the holdout set. Of the 7127 patients who received treatment, 6071 had sufficient medical history before this treatment to be evaluated by Crystal Bone Framework 1. Of those 6071 patients, 3017 (49.7%) were considered at risk of fracture within 2 years.
We evaluated the incidence of fracture within 2 years for this subgroup. Of the cohort deemed at risk by the algorithm, 684 out of 3017 (22.7%) experienced a fracture within 2 years of the first intervention date. This precision is a slight improvement over that of the algorithm on the overall holdout set, at 19.2%. Furthermore, of all 570 patients in this pharmacological intervention cohort who ultimately suffered from a fracture within 2 years, Crystal Bone Framework 1 correctly flagged 469 (82.3%).
In this study, we evaluated the performance of 2 natural language processing–inspired fracture prediction models: (1) ICD code vectorization and long short-term memory (AUROC 0.812) and (2) patient-level vectorization and extreme gradient boosting (AUROC 0.790). The performance of these models reflected a substantial improvement over 2 baseline models: (1) with age and sex (AUROC 0.667) and (2) with age, sex, and total diagnosis count (AUROC 0.668). Furthermore, these short-term prediction metrics were an improvement over cross-sectional tools for long-term time frames, such as FRAX and GIH-BFRC, which have been widely clinically accepted [
The human-level performance comparison provides deeper insight into the benefits of Crystal Bone. The retrospective labeling utilized in both the cohort and overlap analyses enabled a scalable, data-driven comparison of physician action and Crystal Bone and avoided bias that may occur through alternative methods of human-level performance evaluation [
Through the cohort analysis we learned that only a small proportion of patients received preventative interventions, including basic diagnostic tests, showcasing the extent of unmet need in the health care system [
The findings of the overlap analysis further support the merits of Crystal Bone, through demonstration of alignment with observable interventions made by physicians. Because it is impossible to confirm whether these treatment interventions were taken in response to a perceived short-term risk of fracture, we cannot expect 100% overlap between Crystal Bone and these observed interventions. We saw that Crystal Bone was aligned with these physician interventions 49.7% of the time. While this overlap is not complete, it captured 82.3% of the patients who ultimately experienced a fracture, reflecting the algorithm’s increased sensitivity for the cohort deemed at-risk by physicians. This suggests a meaningful alignment with both physician evaluation and actual observed fracture risk. Ultimately, these human-level performance comparisons, coupled with performance against baseline models and alternative risk prediction methods, suggest that Crystal Bone can fulfill a critical unmet need through identification of patients at high risk of fracture.
Various limitations exist for the approaches described, particularly from the inherent complications of using real-world data. The techniques described rely upon ICD codes recorded in electronic health record systems, which will impact the performance and validity of the models if diagnoses are not detected, incorrectly recorded, or missed due to patient dropout. Indeed, most vertebral fragility fractures are clinically silent and hence not captured in electronic health records [
In addition to data set challenges, there exist limitations inherent to assumptions of the modeling approach. The suppositions of constant time between diagnosis codes and uniform sequence length may affect performance. Exploration of more advanced methods that do not require such assumptions could improve the model and is an area of future work.
Perhaps the greatest limitation of the described approaches is that they are generally considered black box approaches and lack significant interpretability. Developing methods for improved interpretation of deep learning models is an active area of research. We have performed an initial exploration of this for the ICD code vectorization and long short-term memory model in
Exploration of model interpretability by comparison of various characteristics of the input data for the 4 prediction cohorts of the confusion matrix. FN: false negative; FP: false positive; ICD: International Classification of Diseases; TN: true negative; TP: true positive; UMAP: uniform manifold approximation and projection.
Another limitation of this study is the inability to perform direct comparisons with established risk calculators such as FRAX. Additionally, this approach has yet to be validated with external data, which is the subject of future work.
We foresee numerous applications of this work in the health care system, with benefits for patients, providers, and payers alike. For payers, Crystal Bone provides a unique opportunity to explore population health, enabling insurers to identify and address patients in need of evaluation or intervention, and preventing the large expenses associated with fracture events. For providers, direct electronic health record integration would facilitate patient care, and help identify at-risk patients who are not currently identified as such. That being said, effective implementation requires additional understanding of the impact of interventions on short-term fracture risk; while there is evidence to suggest that rapid-acting treatments and bone-forming agents can significantly decrease fracture risk on a shortened time frame [
Crystal Bone addresses the need for an automated and largely physician-independent tool that is effective at predicting short-term fracture risk. It is the first such approach that takes longitudinal patient trajectories into account, rather than focusing primarily on cross-sectional information, enabling a more personalized assessment of fracture risk. Furthermore, with automated aggregation of patient histories in an electronic health record system, the prediction of fracture risk could be entirely hands-off, without requiring a doctor or patient to manually enter any information into the software. This unique approach may facilitate broader adoption of the algorithm. Still, the lack of clinical guidelines for 1- and 2-year risk may limit adoption in the near future.
Such a tool, if widely applied, could facilitate early patient identification, and help reduce the morbidity and mortality associated with fractures. The retrospective human-level performance comparison suggests that Crystal Bone would identify patients who are currently missed in the health care system, potentially minimizing the burden on patients and the health care system overall. Given the prevalence and anticipated increase of fractures due to osteoporosis and low bone mass as the population ages, as well as the enormous personal, clinical, and economic costs associated with such fractures, Crystal Bone could provide a meaningful positive impact through reduced burden and improved outcomes.
Supplementary Information.
AUPRC: area under the precision-recall curve
AUROC: area under the receiver operating characteristics curve
FRAX: University of Sheffield Fracture Risk Assessment Tool
GIH-BFRC: Garvan Institute of Health Bone Fracture Risk Calculator
ICD: International Classification of Diseases
This study was funded by Amgen Inc. The costs covered by Amgen Inc were licensing of the Optum data set, access to the computational resources required to develop the model, and compensation for listed Amgen Inc employees. No additional funding was provided for this study.
Thank you to Optum for providing access to and assistance with the data. We would like to additionally thank the following individuals for their guidance and support in conducting this study and creating this manuscript: Inbal Lapid, Tammy Lindberg, Howard Chen, Mandy Suggitt, Lisa Humphries, Marc Doble, Nkem Ogbechie, John Page, Michi He, Akhila Balasubramanian, and Erle Davis.
YA is the first author. Technical conception, design and direction: YA, AR, PZ, and KW. Medical direction and interpretation: CH, MO, EM, and SRC. Data analysis and interpretation: YA, PZ, AWM, RP, and AM. Writing of the manuscript: YA, AR, PZ, KW, CH, MO, EM, and SRC. Authors EM and SR contributed equally. All authors contributed to critical revisions of the draft and approved the final manuscript.
YA, PZ, RP, AM, KW, and MO are employees and stock owners at Amgen Inc, the funders of this study. AR, AWM, and CH are former employees and stock owners at Amgen Inc. EM is a consulting fee recipient, grant recipient, and speaker on behalf of Amgen Inc, as well as a member of the International Osteoporosis Foundation. SRC is a consulting fee recipient and grant recipient from Amgen Inc.