Predicting outcomes in patients undergoing pancreatectomy using wearable technology and machine learning: Prospective cohort study

Background: Pancreatic cancer is the third leading cause of cancer-related deaths, and although pancreatectomy is currently the only curative treatment, it is associated with significant morbidity. Objective: The objective of this study was to evaluate the utility of wearable telemonitoring technologies to predict treatment outcomes using patient activity metrics and machine learning. Methods: In this prospective, single-center, single-cohort study, patients scheduled for pancreatectomy were provided with a wearable telemonitoring device to be worn prior to surgery. Patient clinical data were collected and all patients were evaluated using the American College of Surgeons National Surgical Quality Improvement Program surgical risk calculator (ACS-NSQIP SRC). Machine learning models were developed to predict whether patients would have a textbook outcome and compared with the ACS-NSQIP SRC using the area under the receiver operating characteristic (AUROC) curve. Results: Between February 2019 and February 2020, 48 patients completed the study. Patient activity metrics were collected over an average of 27.8 days before surgery. Patients took an average of 4162.1 (SD 4052.6) steps per day and had an average heart rate of 75.6 (SD 14.8) beats per minute. Twenty-eight (58%) patients had a textbook outcome after pancreatectomy. The group of 20 (42%) patients who did not have a textbook outcome included 14 patients with severe complications and 11 patients who required readmission.
Cos et al, Journal of Medical Internet Research


Introduction
Pancreatectomy is a particularly complex operation with a 90-day mortality rate over 4% and serious morbidity rates over 20%, even in high-volume centers [1,2]. In the recently completed Alliance for Clinical Trials in Oncology (ALLIANCE) trial A021101 [3] and PREOPANC [4] multicenter clinical trials, 53% and 68% of patients, respectively, experienced at least a moderate complication from pancreatectomy. When a complication occurs after a pancreatectomy, the cost of the procedure to the health care system nearly triples from US $31,809 to US $82,576 because of prolonged hospitalization, additional treatments, and readmissions [5,6]. Complications are especially morbid in patients with pancreas cancer, a frail population with a mean age of 70 years, with up to 40% of patients being malnourished on presentation [7]. Multiple studies have shown that patients with pancreatic cancer who experience a therapeutic complication have decreased overall survival and quality of life [8].
Patients undergoing pancreatectomy have an increased risk of postoperative complications if they have poor preoperative physical health and overall performance [9,10]. To evaluate patients for surgery, physicians perform a physical examination in the office. This is subjective and can be misleading [11-13]. The patient's condition on that day may or may not be consistent with their general health. There are simple tests such as the 6-minute walk test or the Timed Up and Go test that can be used to determine a patient's baseline physical capacity and assess if a patient is fit for the physical demands of surgery; however, these tests have not been widely adopted [11-13]. In addition, although they are more objective than a physical examination, these tests also suffer from being a single measurement at a single time point. A more widely used surgical assessment tool is the American College of Surgeons National Surgical Quality Improvement Program surgical risk calculator (ACS-NSQIP SRC) [14-16]. It uses 20 patient-specific variables to calculate the likelihood of a patient having a complication or readmission after surgery. Although these evaluation tools are helpful, there is still a major gap in the ability to objectively measure and analyze patient health status in order to determine if the patient is fit for surgery.
Recently published data suggest that telemonitoring with wearable devices equipped with a 3-axis accelerometer, and the associated analytic methodologies, perform better than traditional methods in predicting outcomes in patients who underwent pancreatectomy [15,16].
For patients undergoing pancreatectomy, this technology has the potential to improve patient selection. To evaluate the relationship between longitudinal patient activity bioinformatics and their effect on surgical outcomes, our team implemented a protocol in which we provided patients with wearable telemonitoring devices before undergoing pancreatectomy at our institution and evaluated predictive outcomes. Herein, we present a prospective cohort study of patients undergoing pancreatectomy over a 12-month period.

Study Population
From February 2019 to February 2020, eligible patients were recruited from multidisciplinary pancreas clinics. Both men and women and members of all races and ethnic groups were eligible for this trial. The inclusion criteria for our study included patients who (1) were scheduled to undergo pancreatic resection, (2) had access to a smartphone, (3) were at least 18 years of age, and (4) were able to understand and willing to sign an institutional review board (IRB)-approved informed consent document (IRB #201810002).

Study Design
We conducted a prospective, single-center, single-cohort trial evaluating the utility of telemonitoring devices to measure daily activity in patients undergoing pancreatectomy. The device used in this study was the Fitbit Inspire HR (Fitbit, Inc), which was selected because it provides remote data access from the device with a set frequency and enhanced granularity. It is also a waterproof, inexpensive, consumer-based device and designed to be compatible with most smartphones. At the time of consent, study patients were provided with a telemonitoring device and assisted in setting it up with their smartphone. Pancreatectomy typically took place more than two weeks after surgical consent, providing a minimum of two weeks of preoperative activity metric data. All clinical practices followed the standard of care.

Patient Activity Assessments
Our team developed software, compliant with the Health Insurance Portability and Accountability Act, to remotely collect activity metrics from our patient telemonitoring devices. ACS-NSQIP SRC risk calculations were evaluated and documented.

Study Outcome Measurements
All outcome measurements were prospectively collected by the study team and recorded in the patient's secure study record. All postoperative complications were coded and graded using the Modified Accordion Grading System (MAGS) [26]. The MAGS grades complications on a scale of 1 to 6, with grade 3=severe, 4=single organ system failure, 5=multiorgan system failure, and 6=death (grades 1 and 2 complications are considered nonsevere). To ensure rigor and reproducibility, surgical complications were presented and verified at a multidisciplinary pancreas conference held every week. All postoperative complications and readmissions were collected for 30 days after hospital discharge. Complications data were then used to compute the primary outcome for our study: the textbook outcome for pancreatectomy [27]. Textbook outcome was defined as the absence of postoperative pancreatic fistulae, bile leak, postpancreatectomy hemorrhage, severe complications, readmission, and in-hospital mortality. We modified our definition of textbook outcome to allow for discharging distal pancreatectomy patients with a drain on or before day 4, the standard of care in our practice.
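The textbook outcome definition above is a conjunction of absences, which can be sketched as a small predicate. This is only an illustrative sketch: the record type `PostoperativeCourse`, its field names, and the use of 0 for "no complication" are assumptions made for the example, and the modification for distal pancreatectomy patients discharged with a drain is not modeled.

```python
from dataclasses import dataclass

# MAGS grades of 3 or higher are treated as severe, as stated in the text.
SEVERE_MAGS_GRADE = 3


@dataclass
class PostoperativeCourse:
    """Hypothetical per-patient record; field names are illustrative only."""
    pancreatic_fistula: bool = False
    bile_leak: bool = False
    postpancreatectomy_hemorrhage: bool = False
    max_mags_grade: int = 0  # assumption: 0 means no complication, else 1-6
    readmitted: bool = False
    in_hospital_death: bool = False


def is_textbook_outcome(course: PostoperativeCourse) -> bool:
    """Return True only if every adverse event in the definition is absent."""
    has_severe_complication = course.max_mags_grade >= SEVERE_MAGS_GRADE
    return not (
        course.pancreatic_fistula
        or course.bile_leak
        or course.postpancreatectomy_hemorrhage
        or has_severe_complication
        or course.readmitted
        or course.in_hospital_death
    )
```

Under these assumptions, a patient whose worst complication is MAGS grade 2 and who has none of the listed events still counts as a textbook outcome, whereas a single readmission does not.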

Feature Engineering
To construct machine learning models based on activity metrics data, we applied feature engineering techniques to extract three types of features: statistical, semantic, and biobehavioral rhythmic features. We extracted first- and second-order statistical features from the daily step count, heart rate, and sleep time-series data [17]. The first-order statistical features used in our analysis were mean, maximum, minimum, skewness, and kurtosis. The second-order statistical features, which are used in medical data mining, were co-occurrence features, for which we generated energy, entropy, correlation, inertia, and local homogeneity. We then performed detrended fluctuation analysis (DFA) on the data, which evaluates long-range correlation of noisy time-series data, and used the root-mean-square deviation from the trend, namely the fluctuation, from DFA as the feature in our analysis [17]. The semantic features collected provided summaries of the patient's daily activity level and sleep quality. Examples of the semantic features were time in bed, minutes to fall asleep, daily sedentary time, and daily sedentary bout.
To account for variation in the study participation period (ie, time to surgery), the extracted patient activity features were unified to consistent dimensions. Biobehavioral rhythmic features were computed for the entire study participation period, and the statistical and semantic features were generated daily. To eliminate the variation in input feature dimensions caused by different monitoring period lengths, we used the mean and variance of each participant's daily statistical and semantic features as the final inputs to the machine learning models.
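The following minimal sketch illustrates three of the steps described above: the first-order statistics of one day's samples, a single-window-size DFA fluctuation, and the reduction of a variable number of days to a fixed-length vector by taking the mean and variance across days. The function names, the use of population moments and excess kurtosis, and the non-overlapping windows with a linear detrend are assumptions for illustration; the source does not specify these details.

```python
from statistics import fmean, pvariance


def first_order_features(series):
    """Mean, max, min, skewness, and excess kurtosis of one day's samples."""
    n = len(series)
    mu = fmean(series)
    sd = pvariance(series, mu) ** 0.5
    if sd == 0:
        skew = kurt = 0.0  # convention chosen here for a constant series
    else:
        skew = sum((x - mu) ** 3 for x in series) / (n * sd ** 3)
        kurt = sum((x - mu) ** 4 for x in series) / (n * sd ** 4) - 3.0
    return {"mean": mu, "max": max(series), "min": min(series),
            "skewness": skew, "kurtosis": kurt}


def dfa_fluctuation(series, window):
    """Root-mean-square deviation of the integrated series from a per-window
    linear trend (non-overlapping windows; a trailing partial window is dropped)."""
    if window < 2 or len(series) < window:
        raise ValueError("need window >= 2 and at least one full window")
    mu = fmean(series)
    profile, total = [], 0.0
    for x in series:  # cumulative sum of the mean-centered series
        total += x - mu
        profile.append(total)
    t_mean = (window - 1) / 2
    sxx = sum((t - t_mean) ** 2 for t in range(window))
    squared_residuals = []
    for start in range(0, len(profile) - window + 1, window):
        segment = profile[start:start + window]
        y_mean = fmean(segment)
        slope = sum((t - t_mean) * (y - y_mean)
                    for t, y in enumerate(segment)) / sxx
        squared_residuals.extend(
            (y - (y_mean + slope * (t - t_mean))) ** 2
            for t, y in enumerate(segment))
    return fmean(squared_residuals) ** 0.5


def aggregate_over_days(daily_series):
    """Collapse per-day features into a fixed-length vector (mean and variance
    across days), independent of how many days were monitored."""
    per_day = [first_order_features(day) for day in daily_series]
    aggregated = {}
    for name in per_day[0]:
        values = [features[name] for features in per_day]
        aggregated[f"mean_of_{name}"] = fmean(values)
        aggregated[f"variance_of_{name}"] = pvariance(values)
    return aggregated
```

Whatever the number of monitored days, `aggregate_over_days` returns the same set of keys, which is the property the aggregation step relies on.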

Machine Learning Methods and Statistical Considerations
Multiple machine learning models were developed, trained, and evaluated for their ability to predict outcomes by discovering complex underlying patterns from multimodal time-series patient activity data collected from wearable devices and patient clinical characteristics. To avoid overfitting, we used state-of-the-art "shallow" machine learning models, including random forest, gradient boosted trees (GBT), k-nearest neighbors (KNN), support vector machine (SVM) with linear kernel, and logistic regression (LR) with L1 penalty. A GBT model is an ensemble of weak decision trees that classifies the samples based on the predictions of those trees [22]. The algorithm iteratively fits a weak decision tree to the pseudo-residuals from the last iteration. We then employed regularization and feature selection to avoid overfitting and improve generalizability of the models. When implementing the GBT model, we explored established regularization techniques including controlling the complexity of the trees, applying shrinkage during the training process, and using stochastic gradient boosting. In general, an SVM model constructs an optimal hyperplane or a set of hyperplanes that can separate the samples of different classes by enforcing a large margin. It then makes predictions by deciding which side or region of the hyperplane the input sample should be on. In our implementation, we chose a linear kernel instead of nonlinear kernels, such as a radial basis function (RBF) kernel, because the linear kernel is less likely to overfit small data sets. LR with L1 penalty shrinks the coefficients of less important features to zero, which works well when there are many features. For the feature selection in the training phase, we implemented a mixture of feature selection methods, using the chi-square statistic as the heuristic for categorical features and the F statistic from analysis of variance (ANOVA) for continuous features.
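The continuous-feature half of the selection step described above can be illustrated with a one-way ANOVA F statistic computed per feature against the class label. This is a sketch under stated assumptions: the function names `anova_f` and `select_top_k`, the dictionary layout of the features, and the number of retained features are hypothetical, since the source does not report how many features were kept or how the two heuristics were combined.

```python
from statistics import fmean


def anova_f(values, labels):
    """One-way ANOVA F statistic of a continuous feature across label groups."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = fmean(values)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2
                     for g in groups.values())
    ss_within = sum((v - fmean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        # Perfect separation (or a constant feature): avoid dividing by zero.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_top_k(features, labels, k):
    """Rank features (name -> list of values) by F statistic; keep the top k."""
    scores = {name: anova_f(column, labels)
              for name, column in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

To avoid optimistic performance estimates, a selection step like this would be fitted on the training portion of each split only, consistent with the statement that feature selection took place in the training phase.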
When training the models, the hyperparameters were tuned. To interpret the models, we used SHapley Additive exPlanations (SHAP), which assigns each feature an importance score: the Shapley value. SHAP is an established model-agnostic explanation approach that can be used to explore models from any kind of machine learning [29].
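To make the notion of a Shapley value concrete, the sketch below computes exact Shapley values for a toy value function by averaging each feature's marginal contribution over all feature orderings. It is a worked illustration of the underlying definition only; it is not the SHAP library, which relies on model-specific or sampling-based estimators, and the value function and the name `shapley_values` are assumptions for the example.

```python
from itertools import permutations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values: the average marginal contribution of each feature
    over all orderings. Cost grows factorially, so this suits toy cases only."""
    totals = [0.0] * n_features
    for order in permutations(range(n_features)):
        coalition = set()
        previous = value_fn(frozenset(coalition))
        for feature in order:
            coalition.add(feature)
            current = value_fn(frozenset(coalition))
            totals[feature] += current - previous
            previous = current
    return [total / factorial(n_features) for total in totals]


# Toy value function (hypothetical): additive weights plus an interaction
# bonus of 2 that is earned only when features 0 and 1 are both present.
WEIGHTS = [1.0, 2.0, 3.0]


def toy_value(coalition):
    bonus = 2.0 if {0, 1} <= coalition else 0.0
    return sum(WEIGHTS[i] for i in coalition) + bonus


contributions = shapley_values(toy_value, 3)
```

In this toy case the interaction bonus is split equally between features 0 and 1, and the contributions sum to the value of the full coalition, which is the additivity property that makes Shapley values usable as per-feature importance scores.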

Missing Data
There were three possible causes of missing data: (1) improper wearing of the device, (2) lack of user compliance (not wearing the device), and (3) loss of connectivity for longer than 7 days. For patients with missing data, we applied a two-level imputation method to the activity metrics collected by our telemonitoring devices [17]. The data-level imputation was to fill the missing data points in heart rate time series if the daily data yield, defined as the fraction of the expected data points that were successfully collected, was equal to or above the threshold (10%). The imputed time-series data were then used to compute the features [23]. We applied KNN imputation to estimate the missing heart rate data based on recent step count and heart rate data in a sliding window (eg, 5 minutes). For those heart rate time series with a daily yield of less than 10% but greater than 0%, we used feature-level imputation to directly impute their corresponding statistical and semantic features. For the feature-level imputation, we again applied KNN imputation to the missing statistical and semantic features based on other available features from the same participant on the same day. Days with no data (daily yield of 0%) were discarded in the analysis.
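The routing logic of the two-level imputation, together with one possible form of the data-level KNN step, can be sketched as follows. The 10% threshold and the three outcomes come from the text; everything else is an assumption made for illustration, including the function names, the use of `None` for missing heart rate samples, the choice of step-count difference as the neighbor distance, and the default values of `k` and `window`.

```python
def imputation_level(collected_points, expected_points, threshold=0.10):
    """Decide how one day of heart rate data is handled, from its daily yield."""
    daily_yield = collected_points / expected_points
    if daily_yield == 0:
        return "discard"        # days with no data are dropped
    if daily_yield >= threshold:
        return "data_level"     # fill gaps in the time series itself
    return "feature_level"      # impute the day's derived features instead


def knn_impute_heart_rate(steps, heart_rate, k=3, window=5):
    """Fill missing heart rate samples (None) with the mean heart rate of the
    k observed samples, within +/- `window` positions, whose step counts are
    closest to the step count at the missing position."""
    imputed = list(heart_rate)
    for i, value in enumerate(heart_rate):
        if value is not None:
            continue
        lo, hi = max(0, i - window), min(len(steps), i + window + 1)
        neighbors = [(abs(steps[j] - steps[i]), heart_rate[j])
                     for j in range(lo, hi) if heart_rate[j] is not None]
        if not neighbors:
            continue  # nothing observed nearby; leave the gap unfilled
        neighbors.sort(key=lambda pair: pair[0])
        nearest = neighbors[:k]
        imputed[i] = sum(hr for _, hr in nearest) / len(nearest)
    return imputed
```

Only originally observed samples are used as neighbors, so previously imputed values never feed into later imputations.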

Model Performance Evaluation
To evaluate the effectiveness of the machine learning models in predicting postoperative outcomes, defined by the modified textbook outcome, we compared them with clinical patient performance status assessment tools, including the ACS-NSQIP SRC. Utilizing the ACS-NSQIP SRC as our baseline model, we evaluated the performance and efficacy of this approach and applied machine learning models to (1) patient clinical characteristics (demographics, comorbidities, and clinical presentation), (2) features derived from remotely collected activity metrics, and (3) patient clinical characteristics + features derived from remotely collected activity metrics. The comparative evaluation of the "patient activity-only" and "clinical characteristic-only" models assessed the predictive power of activity metrics, while the performance of a combined "patient activity + clinical characteristic" model, by design, tested whether activity metrics and clinical records complement each other to yield better results.
In our cohort, 28 (58%) patients had a textbook outcome, with the other 20 (42%) patients not achieving a textbook outcome. Fourteen patients developed 19 severe complications (MAGS score ≥3), including delayed gastric emptying (n=3), pancreatic fistula (n=3), organ space infection (n=2), postpancreatectomy hemorrhage (n=4), nonpancreatic anastomotic leak (n=1), myocardial infarction (n=1), and other (n=5). Additionally, 11 patients required readmission to the hospital. See Table 1 for univariate analyses of demographic and comorbidity features stratified by textbook outcome in our cohort. Patients took an average of 4162.1 (SD 4052.6) steps per day, had an average heart rate of 75.6 (SD 14.8) beats per minute, and had an average sleep time series of 2 (SD 1), which was a mean DFA of their sleep stages with 50-minute windows. The average ACS-NSQIP SRC calculated risk of a patient developing any complication was 27.3% (SD 6.4%), of developing a serious complication was 23.3% (SD 5.5%), and of being readmitted was 15.1% (SD 3.4%).
Utilizing the ACS-NSQIP SRC as our baseline model, we evaluated the performance and efficacy of this approach and applied machine learning models to (1) patient clinical characteristics, which included demographics, comorbidities, and clinical presentation; (2) patient activity with features derived from remotely collected activity metrics; and (3) patient clinical characteristics + patient activity with features obtained or derived from both clinical records and activity metrics. Table 2 shows the performance comparison of these models at predicting a textbook outcome. The predictive models were trained with probabilistic outputs and then the classification thresholds were adjusted to obtain a sensitivity of 0.9 in order to ensure a high detection rate and allow an equitable comparison. The AUROCs were 0.6333 for the ACS-NSQIP SRC, 0.7054 for the patient clinical characteristics model, 0.7027 for the patient activity model, and 0.7875 for the patient clinical characteristics + patient activity model. In our analysis, we observed that 15 out of 20 features with the highest impact discovered by SHAP were from the best performing GBT model trained on patient clinical characteristics + patient activity (see Table 3 for feature exemplars).
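The two evaluation steps described above, computing the AUROC from probabilistic outputs and then choosing a classification threshold that yields a sensitivity of 0.9, can be sketched as follows. The function names, the convention that label 1 marks the class to be detected, and the example scores are assumptions for illustration; the source does not state which outcome was treated as the positive class.

```python
import math


def auroc(scores, labels):
    """Probability that a random positive scores above a random negative
    (ties count as one half); equivalent to the area under the ROC curve."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def threshold_for_sensitivity(scores, labels, target=0.9):
    """Highest threshold at which at least `target` of positives are detected,
    where a sample is predicted positive if its score is >= the threshold."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1),
                       reverse=True)
    needed = math.ceil(target * len(positives) - 1e-9)  # guard float rounding
    return positives[max(needed, 1) - 1]


def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity of the rule score >= threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s < threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)
```

Fixing sensitivity at the same value for every model means the remaining threshold-dependent metrics, such as specificity, are compared at a common operating point rather than at each model's default cutoff.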
Finally, to determine if the amount of missing data affected the performance of the classification model, the average number of days with high data availability (again, defined as days with a yield greater than or equal to 50%) for correctly classified patients was compared with that for incorrectly classified patients. The difference in the average number of days with high data availability between correctly classified patients and incorrectly classified patients was not statistically significant (17 days, SD 10 days, versus 25 days, SD 25 days, respectively; P=0.12). This suggests that the amount of missing data did not affect the performance of the classification model.
Machine learning models that utilized only patient clinical characteristics outperformed the ACS-NSQIP SRC, with an AUROC of 0.7054 for LR. This was similar to machine learning models that utilized only patient activity data collected from telemonitoring (AUROC of 0.7027 for SVM). The best results were achieved with machine learning models that combined patient clinical characteristics with patient activity data (AUROC of 0.7875 for GBT). This confirmed our hypothesis that machine learning technology can outperform the standard ACS-NSQIP SRC in predicting textbook outcomes in patients who had a pancreatectomy. In addition, patient activity metrics significantly improved the predictive power.
Within the machine learning model, we utilized SHAP scores to identify features with the greatest impact. Specifically, within heart rate features, the "variance of local homogeneity" in heart rate was significantly correlated with higher SHAP values. This suggests that particular attention should be paid to patients' physiological status prior to surgery. Additionally, the "mean of intradaily stability" and "relative amplitude" of steps taken [18], which pertain to the subjects' physical mobility, were also significantly associated with higher SHAP values. The definition and derivation of these features were described by Mao et al [29]. Similar to the findings of previous studies [18,21-23], incorporating patient activity data with patient clinical data increased the performance of our machine learning models. The patient clinical data that specifically improved the models' performance included neutrophil levels, calcium levels, and a history of prior surgery. The Rotterdam Study [31] found that an elevated neutrophil count in relation to lymphocyte count (neutrophil to lymphocyte ratio) was independently associated with increased morbidity and mortality. Likewise, multiple authors have shown age-related changes in calcium metabolism and found that variations in absorption of vitamin D, as well as a decreased intake of calcium, are commonly seen in the elderly [32]; 26 (54%) of the patients in this study were aged ≥65 years at the time of surgery.
Physical activity is a targetable and modifiable behavior that has been shown to improve outcomes of cancer patients undergoing chemoradiation [33-35]. Similarly, a meta-analysis of 15 randomized controlled trials with more than 400 patients showed that prehabilitation prior to major abdominal surgery led to a significant reduction in overall and pulmonary morbidity [33].
Based on our early results, we think that the combination of patient activity metrics collected preoperatively using wearable devices and machine learning models has the potential to reliably predict operative risks. In addition, by objectively tracking activity metrics and identifying areas of weakness, the data will provide targets for preoperative optimization and allow surgeons to more efficiently engage patients in their surgical care even before they undergo a major procedure. The ultimate goal is to decrease the likelihood of postoperative complications, which we believe will have a particularly large impact on patients with pancreatic cancer, a growing population with a high proportion of elderly and frail patients.

Limitations
The study was limited by a small sample size, which could potentially increase the risk of overfitting. However, as discussed in the methods section, multiple precautions were taken to reduce the effect of overfitting. We also acknowledge the risk for selection bias, as we recruited patients with access to a smartphone, which has the potential to exclude elderly patients and patients from lower socioeconomic groups.

Conclusion
Machine learning models based on preliminary data outperform standard ACS-NSQIP SRC estimates when used to predict a textbook outcome after pancreatectomy. The highest performance at this task was observed when machine learning models incorporated patient clinical characteristics and activity metrics collected with wearable telemonitoring technology. In the future, this can provide physicians with real-time actionable data that can be used to modify management of patients undergoing pancreatectomy and develop interventions to increase patient activity.