This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Currently, the selection of patients for sequential versus concurrent chemotherapy and radiation regimens lacks evidentiary support and is based on locally optimal decisions at each step.
We aim to optimize the multistep treatment of patients with head and neck cancer and to predict multiple patient survival and toxicity outcomes; to this end, we develop, apply, and evaluate a first use of deep Q-learning (DQL) and simulation for this problem.
The treatment decision DQL digital twin and the patient’s digital twin were created, trained, and evaluated on a data set of 536 patients with oropharyngeal squamous cell carcinoma with the goal of, respectively, determining the optimal treatment decisions with respect to survival and toxicity metrics and predicting the outcomes of the optimal treatment on the patient. Of the data set of 536 patients, the models were trained on a subset of 402 (75%) patients (split randomly) and evaluated on a separate set of 134 (25%) patients. Training and evaluation of the digital twin dyad was completed in August 2020. The data set includes 3-step sequential treatment decisions and complete relevant history of the patient cohort treated at MD Anderson Cancer Center between 2005 and 2013, with radiomics analysis performed for the segmented primary tumor volumes.
On the test set, we found mean 87.35% (SD 11.15%) and median 90.85% (IQR 13.56%) accuracies in treatment outcome prediction, matching the clinicians’ outcomes and improving the (predicted) survival rate by +3.73% (95% CI –0.75% to 8.96%) and the dysphagia rate by +0.75% (95% CI –4.48% to 6.72%) when following DQL treatment decisions.
Given the prediction accuracy and the predicted improvement in medically relevant outcomes yielded by this approach, this digital twin dyad of the patient-physician dynamic treatment problem has the potential to aid physicians in determining the optimal course of treatment and in assessing its outcomes.
Head and neck cancer, which includes cancers of the larynx, throat, lips, mouth, nose, and salivary glands, is now an epidemic, with 65,000 new cases in the United States annually [
Furthermore, disposition to initial IC is then followed by a second responsive disposition to either RT or concurrent chemoradiotherapy. Inferring the optimal treatment policies for multistage decisions (eg, which treatment to administer initially and then after observing treatment response;
For this reason, in the absence of rigorous clinical trials comparing adaptive IC permutations with concurrent RT, group comparison is exceedingly difficult because simple models that account for confounders at initial disposition (eg, propensity scores) are unequipped to incorporate sequential decision processes (eg, the choice of CC
Overview of the therapy selection process, which shows two distinct phases: initial therapeutic selection and subsequent therapeutic selection.
To address multistage models of therapy selection that incorporate both relevant cancer and side-effect considerations, we introduce an approach based on
In this paper, we apply for the first time Q-learning methodology to dynamically select treatment based on multiple clinically relevant outcomes from data specific to patients with head and neck cancer. We use these methods to construct and develop optimal dynamic treatment strategies, that is, digital twins of the therapy process. In conjunction with simulation models of patient data, the treatment prescription models form a patient-physician (prescriber)
A state-of-the-art machine learning method applicable to the optimal therapy process problem is reinforcement learning (RL), in particular DQL [
We used DQL to find a treatment policy that maximizes a linear combination of multiple patient outcomes; for example, toxicological and survival outcomes. We considered a 3-step Markov decision process (MDP), with 3 actions in each episode corresponding to the three treatment decision points for each patient:
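Decision 1 (D1): induction chemotherapy (IC) or not
Decision 2 (D2): concurrent chemotherapy (CC) or radiotherapy (RT) alone
Decision 3 (D3): neck dissection (ND) or not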
More details on the setup of the MDP are described in the following sections, including the reward functions and state variables.
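As a rough illustration of the 3-step episode structure described above, the following sketch stores one patient as one episode. The class name, field names, and example feature values are hypothetical, and the choice to attach the whole reward to the last step is an assumption made for this illustration only.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Labels for the three binary decision points (D1, D2, D3).
DECISIONS = ("D1", "D2", "D3")


@dataclass(frozen=True)
class Episode:
    """One patient treated as one 3-step episode (illustrative layout only)."""
    states: Tuple[Dict[str, float], ...]  # features observed before each decision
    actions: Tuple[int, int, int]         # 1 = treatment given, 0 = not given
    reward: float                         # assumed to be observed only after D3

    def transitions(self):
        """Yield (decision, state, action, reward, is_terminal) for each step."""
        last = len(DECISIONS) - 1
        for t, name in enumerate(DECISIONS):
            # Assumption: intermediate steps carry no reward of their own.
            step_reward = self.reward if t == last else 0.0
            yield name, self.states[t], self.actions[t], step_reward, t == last


# Hypothetical patient: the state grows as more history becomes available.
episode = Episode(
    states=(
        {"age": 58.0},
        {"age": 58.0, "dlt": 0.0},
        {"age": 58.0, "dlt": 0.0, "cr_primary": 1.0},
    ),
    actions=(1, 1, 0),
    reward=1.0,
)
steps = list(episode.transitions())
```

Each tuple in `steps` is the kind of transition record that a Q-learning update would consume.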
We performed a retrospective review of 536 patients with oropharyngeal squamous cell carcinoma who were treated at the MDACC between 2005 and 2013 (
Demographics of pretreatment features (before decision 1 [D1]: induction chemotherapy or not; N=536).
Characteristics | All patients (N=536) | Training set (n=402) | Testing set (n=134) | |||||
|
||||||||
|
Age (years) at diagnosis, mean (SD) | 58.9 (9.5) | 58.5 (9.4) | 60.2 (9.6) | ||||
|
|
|||||||
|
|
I | 6 (1.1) | 2 (0.5) | 4 (3) | |||
|
|
II | 154 (28.7) | 114 (28.4) | 40 (29.9) | |||
|
|
III | 274 (51.1) | 206 (51.2) | 68 (50.7) | |||
|
|
IV | 3 (0.6) | 1 (0.2) | 2 (1.5) | |||
|
|
Not available | 99 (18.5) | 79 (19.7) | 20 (14.9) | |||
|
|
471 (87.9) | 355 (88.3) | 116 (86.6) | ||||
|
|
|||||||
|
|
Negative | 43 (8) | 33 (8.2) | 10 (7.5) | |||
|
|
Positive | 305 (56.9) | 228 (56.7) | 77 (57.5) | |||
|
|
Unknown | 188 (35.1) | 141 (35.1) | 47 (35.1) | |||
|
|
|||||||
|
|
T1 | 113 (21.1) | 87 (21.6) | 26 (19.4) | |||
|
|
T2 | 219 (40.9) | 156 (38.8) | 63 (47) | |||
|
|
T3 | 116 (21.6) | 91 (22.6) | 25 (18.7) | |||
|
|
T4 | 86 (16) | 67 (16.7) | 19 (14.2) | |||
|
|
Txd | 2 (0.4) | 1 (0.2) | 1 (0.7) | |||
|
|
|||||||
|
|
N0g | 20 (3.7) | 14 (3.5) | 6 (4.5) | |||
|
|
N1 | 249 (46.5) | 181 (45) | 68 (50.7) | |||
|
|
N2 | 250 (46.6) | 194 (48.3) | 56 (41.8) | |||
|
|
N3 | 17 (3.2) | 13 (3.2) | 4 (3) | |||
|
|
|||||||
|
|
I | 186 (34.7) | 137 (34.1) | 49 (36.6) | |||
|
|
II | 81 (15.1) | 63 (15.7) | 18 (13.4) | |||
|
|
III | 64 (11.9) | 44 (10.9) | 20 (14.9) | |||
|
|
IV | 203 (37.9) | 157 (39.1) | 46 (34.3) | |||
|
|
Not available | 2 (0.4) | 1 (0.2) | 1 (0.7) | |||
|
|
|||||||
|
|
Current | 115 (21.5) | 85 (21.1) | 30 (22.4) | |||
|
|
Former | 203 (37.9) | 151 (37.6) | 52 (38.8) | |||
|
|
Never | 218 (40.7) | 166 (41.3) | 52 (38.8) | |||
|
|
|||||||
|
|
Packs per year, mean (SD) | 17.7 (23.7) | 16.7 (22.9) | 20.5 (26) | |||
|
|
Not available, n (%) | 28 (5.2) | 21 (5.2) | 7 (5.2) | |||
|
Aspiration rate before therapy (yes), n (%) | 16 (3) | 14 (3.5) | 2 (1.5) | ||||
|
Number of affected lymph nodes, mean (SD) | 2.0 (1.3) | 2.1 (1.3) | 1.8 (1) | ||||
|
|
|||||||
|
|
Bilateral | 21 (3.9) | 16 (4) | 5 (3.7) | |||
|
|
Left | 242 (45.1) | 188 (46.8) | 54 (40.3) | |||
|
|
Right | 273 (50.9) | 198 (49.3) | 75 (56) | |||
|
|
|||||||
|
|
Base of tongue | 266 (49.6) | 204 (50.7) | 62 (46.3) | |||
|
|
Tonsil | 223 (41.6) | 158 (39.3) | 65 (48.5) | |||
|
|
Other | 47 (8.8) | 40 (10) | 7 (5.2) | |||
|
|
|||||||
|
|
African American or Black | 16 (3) | 10 (2.5) | 6 (4.5) | |||
|
|
Asian | 4 (0.7) | 3 (0.7) | 1 (0.7) | |||
|
|
Hispanic or Latino | 21 (3.9) | 17 (4.2) | 4 (3) | |||
|
|
Native American | 1 (0.2) | 1 (0.2) | 0 (0) | |||
|
|
White or other | 494 (92.2) | 371 (92.3) | 123 (91.8) |
aHPV: human papillomavirus.
bP16: protein expression 16.
cT: primary tumor.
dTx: no information about the primary tumor or it cannot be measured.
eN: lymph nodes.
fAmerican Joint Committee on Cancer’s Cancer Staging Manual, 8th edition.
gN0: nearby lymph nodes do not contain cancer.
hAJCC: American Joint Committee on Cancer.
Feature demographics before and after decision junctions (N=536).
Characteristics | All patients (N=536), n (%) | Training set (n=402), n (%) | Testing set (n=134), n (%) | |||||
|
||||||||
|
|
|||||||
|
|
None | 342 (63.8) | 250 (62.2) | 92 (68.7) | |||
|
|
Doublet | 41 (7.6) | 32 (8) | 9 (6.7) | |||
|
|
Triplet | 143 (26.7) | 111 (27.6) | 32 (23.9) | |||
|
|
Quadruplet | 7 (1.3) | 7 (1.7) | 0 (0) | |||
|
|
Not otherwise specified | 3 (0.6) | 2 (0.5) | 1 (0.7) | |||
|
Chemotherapy modification | 85 (15.9) | 65 (16.2) | 20 (14.9) | ||||
|
|
|||||||
|
|
No dose adjustment | 451 (84.1) | 336 (83.6) | 115 (85.8) | |||
|
|
Dose modified | 21 (3.9) | 16 (4) | 5 (3.7) | |||
|
|
Dose delayed | 10 (1.9) | 9 (2.2) | 1 (0.7) | |||
|
|
Dose cancelled | 18 (3.4) | 13 (3.2) | 5 (3.7) | |||
|
|
Dose delayed and modified | 6 (1.1) | 5 (1.2) | 1 (0.7) | |||
|
|
Regimen modification | 29 (5.4) | 22 (5.5) | 7 (5.2) | |||
|
|
Unknown | 1 (0.2) | 1 (0.2) | 0 (0) | |||
|
Dose-limiting toxicity | 95 (17.7) | 73 (18.2) | 22 (16.4) | ||||
|
|
|||||||
|
|
0 | 446 (83.2) | 334 (83.1) | 112 (83.6) | |||
|
|
1 | 7 (1.3) | 6 (1.5) | 1 (0.7) | |||
|
|
2 | 33 (6.2) | 26 (6.5) | 7 (5.2) | |||
|
|
3 | 41 (7.6) | 29 (7.2) | 12 (9) | |||
|
|
4 | 9 (1.7) | 7 (1.7) | 2 (1.5) | |||
|
Imaging (yes) | 194 (36.2) | 152 (37.8) | 42 (31.3) | ||||
|
Complete response, primary (1c, as opposed to 0d) | 84 (15.7) | 67 (16.7) | 17 (12.7) | ||||
|
Complete response, nodal (1) | 16 (3) | 14 (3.5) | 2 (1.5) | ||||
|
Partial response, primary (1) | 89 (16.6) | 70 (17.4) | 19 (14.2) | ||||
|
Partial response, nodal (1) | 156 (29.1) | 125 (31.1) | 31 (23.1) | ||||
|
Stable disease, primary (1) | 11 (2.1) | 8 (2) | 3 (2.2) | ||||
|
Stable disease, nodal (1) | 10 (1.9) | 6 (1.5) | 4 (3) | ||||
|
||||||||
|
|
|||||||
|
|
None | 126 (23.5) | 89 (22.1) | 37 (27.6) | |||
|
|
Platinum based | 257 (47.9) | 198 (49.3) | 59 (44) | |||
|
|
Cetuximab based | 129 (24.1) | 95 (23.6) | 34 (25.4) | |||
|
|
Other | 24 (4.5) | 20 (5) | 4 (3) | |||
|
Concurrent chemotherapy modification (1) | 99 (18.5) | 77 (19.2) | 22 (16.4) | ||||
|
Complete response, primary 2 (1) | 450 (84.1) | 336 (83.8) | 114 (85.1) | ||||
|
Complete response, nodal 2 (1) | 247 (46.1) | 186 (46.3) | 61 (45.5) | ||||
|
Partial response, primary 2 (1) | 77 (14.4) | 58 (14.4) | 19 (14.2) | ||||
|
Partial response, nodal 2 (1) | 257 (47.9) | 191 (47.5) | 66 (49.3) | ||||
|
Stable disease, primary 2 (1) | 2 (0.4) | 2 (0.5) | 0 (0) | ||||
|
Stable disease, nodal 2 (1) | 10 (1.9) | 6 (1.5) | 4 (3) | ||||
|
Dose-limiting toxicity 2 (also included for dermatological, neurological, gastrointestinal, hematological, nephrological, vascular, and other) | 102 (19) | 80 (19.9) | 22 (16.4) | ||||
|
||||||||
|
Four-year overall survival (alive) | 457 (85.3) | 344 (85.6) | 113 (84.3) | ||||
|
Feeding tube 6 months (yes) | 98 (18.3) | 77 (19.2) | 21 (15.7) | ||||
|
Aspiration rate after therapy (yes) | 98 (18.3) | 79 (19.7) | 19 (14.2) | ||||
|
Dysphagia (yes) | 154 (28.7) | 122 (30.3) | 32 (23.9) |
aD1: decision 1 (induction chemotherapy or not).
bD2: decision 2 (concurrent chemotherapy or radiotherapy alone).
cThe patient survived for at least four years after the treatment ended.
dAll other events.
eD3: decision 3 (neck dissection or not).
Demographics of physicians’ decisions (N=536).
Characteristics | All patients (N=536), n (%) | Training set (n=402), n (%) | Testing set (n=134), n (%)
D1a: yes | 194 (36.2) | 152 (37.8) | 42 (31.3)
D2b: yes | 410 (76.5) | 313 (77.9) | 97 (72.4)
D3c: yes | 111 (20.7) | 84 (20.9) | 27 (20.1)
aD1: decision 1 (induction chemotherapy or not).
bD2: decision 2 (concurrent chemotherapy or radiotherapy alone).
cD3: decision 3 (neck dissection or not).
The data were collected after approval from the MDACC institutional review board (PA16-0303 and retrospective RCR03-0800).
We focused on two outcome measures: (1) four-year
Equation (1) was used as the total reward in training the DQL models
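Equation (1) itself is not reproduced in this text, so the following is only a minimal sketch of a reward formed as a linear combination of a survival term and a toxicity term. The weights `w_os` and `w_dp`, the function names, and the reading of dysphagia as "feeding tube or aspiration" are assumptions made for illustration, not the published formula.

```python
def dysphagia_flag(feeding_tube_6m: bool, aspiration_after: bool) -> bool:
    """Assumed composite: dysphagia is present if either symptom is present."""
    return feeding_tube_6m or aspiration_after


def total_reward(alive_4y: bool, dysphagia: bool,
                 w_os: float = 1.0, w_dp: float = 1.0) -> float:
    """Weighted sum of a survival term and a toxicity term (hypothetical weights)."""
    survival_term = 1.0 if alive_4y else 0.0
    toxicity_term = 0.0 if dysphagia else 1.0  # reward the absence of dysphagia
    return w_os * survival_term + w_dp * toxicity_term


# A patient alive at 4 years with no feeding tube and no aspiration.
best_case = total_reward(True, dysphagia_flag(False, False))
# A survival-only objective is obtained by setting the toxicity weight to 0.
os_only = total_reward(True, True, w_dp=0.0)
```

Setting `w_dp` to 0 gives a survival-only objective, which mirrors the distinction between the OS-only and OS+DP models discussed later in the paper.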
The state variables are illustrated in
Group 1: pretreatment features (before D1)
Group 2: post–IC-decision features (after D1 and before D2)
Group 3: post–CC-decision features (after D2 and before D3)
Group 4: primary outcomes after ND decision (after D3)
The features in
The data set was randomly split into training (402/536, 75%) and testing (134/536, 25%) sets. To reduce the radiomic feature dimensionality (approximately 1000) [
The first model to be trained was Q3, which represents ND (D3), based on the final outcomes, the treatment decisions made in D3, and the patient’s history before D3. We tuned the learning rate so that the mean reward converged smoothly instead of fluctuating drastically. The training for D3 was terminated when the NN weights had converged. Next, the model for D2 was trained based on the result of Q3 instead of the final outcomes, and D1 was trained based on the result of Q2. The models were constructed and trained using the PyTorch framework with graphics processing unit acceleration. Once the models had been trained, they were used in a forward order, as opposed to the training order, to prescribe the optimal treatment at each decision step. This is illustrated in
Overview of deep Q-learning model training. RL: reinforcement learning.
Overview of applying deep Q-learning model to make treatment prescriptions. RL: reinforcement learning.
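The backward training order and forward prescription order described above can be sketched as follows. The paper trains neural networks in PyTorch. This sketch swaps in a simple tabular average as the regressor so that it stays short and runnable, and the state labels and rewards are invented toy values.

```python
from collections import defaultdict


def fit_q(samples):
    """Average the target per (state, action); a tabular stand-in for an NN."""
    sums, counts = defaultdict(float), defaultdict(int)
    for state, action, target in samples:
        sums[(state, action)] += target
        counts[(state, action)] += 1
    return {key: sums[key] / counts[key] for key in sums}


def best_value(q, state, actions=(0, 1)):
    """max over actions of Q(state, action); unseen pairs default to 0."""
    return max(q.get((state, a), 0.0) for a in actions)


def train_backward(episodes):
    """episodes: list of ((s1, s2, s3), (a1, a2, a3), final_reward)."""
    # D3 is fitted first, directly on the final outcomes.
    q3 = fit_q([(s[2], a[2], r) for s, a, r in episodes])
    # D2 is fitted on the value of the best D3 action, not on the outcomes.
    q2 = fit_q([(s[1], a[1], best_value(q3, s[2])) for s, a, r in episodes])
    # D1 is fitted on the value of the best D2 action.
    q1 = fit_q([(s[0], a[0], best_value(q2, s[1])) for s, a, r in episodes])
    return q1, q2, q3


def prescribe(q, state, actions=(0, 1)):
    """Greedy action at one decision point (forward use of a trained model)."""
    return max(actions, key=lambda a: q.get((state, a), 0.0))


# Toy data with hypothetical discrete states.
episodes = [
    (("low", "resp", "clear"), (0, 1, 0), 2.0),
    (("low", "resp", "clear"), (0, 1, 1), 1.0),
    (("low", "none", "resid"), (1, 0, 1), 1.0),
    (("low", "none", "resid"), (1, 0, 0), 0.0),
]
q1, q2, q3 = train_backward(episodes)
first_action = prescribe(q1, "low")
```

At prescription time the models are applied in the order Q1, Q2, Q3, each seeing only the state available at its own decision point.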
We constructed multiple shallow-to-deep NNs with an increasing number of layers until the deepest model showed poor performance because of overfitting. We sampled 1000 separate training sets from the initial training data and trained a separate model on each of these sets, thus obtaining bootstrapped models with 95% CIs. Because of the high computational cost of bootstrapping, we will report in the
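A minimal sketch of how 1000 bootstrap resamples can yield a 95% CI is given below. The text does not state how the intervals were derived, so the percentile method used here is an assumption. The `statistic` callable stands in for "train a model on the resample and score it", and the example data are invented.

```python
import random


def bootstrap_ci(data, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for `statistic` evaluated on resamples of `data`."""
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    k = round(n_boot * alpha / 2)  # number of estimates trimmed from each tail
    return estimates[k], estimates[n_boot - 1 - k]


def mean(values):
    return sum(values) / len(values)


# Hypothetical per-patient survival flags (85 alive out of 100).
survival_flags = [1] * 85 + [0] * 15
lower, upper = bootstrap_ci(survival_flags, mean)
```

Replacing `mean` with a function that fits and evaluates a model would reproduce the far more expensive procedure described in the text.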
By prescribing an optimal treatment at each treatment junction, the DQL models constructed a
As the DQL goal is not to replicate clinicians’ decisions but to find an optimal, potentially different treatment, our evaluation includes building a treatment simulator (TS) model that, given a patient’s history and the prescribed treatment, predicts the outcome of that treatment. The TS consists of a transition model for each intermediate and final outcome measure, built using a support vector classifier (SVC). For example, in the case of D1, an SVC was trained for each group 2 feature in
The TS serves as an in silico
Illustration of the treatment simulator for D2. Those for D1 and D3 are similar, and their input features are from group 1 and groups 1-3, respectively. SVC: support vector classifier. D1: decision 1. D2: decision 2. D3: decision 3.
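The structure of one TS step (one transition model per predicted feature, each taking the patient history plus the prescribed decision as input) can be sketched as follows. To keep the example dependency-free, a 1-nearest-neighbour rule is swapped in for the SVC used in the paper. The class name, feature names, and data are hypothetical.

```python
def nearest_label(train_rows, train_labels, row):
    """1-nearest-neighbour prediction (a stand-in for the SVC in the paper)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best = min(range(len(train_rows)),
               key=lambda i: squared_distance(train_rows[i], row))
    return train_labels[best]


class TreatmentSimulatorStep:
    """One transition model per target feature for a single decision point."""

    def __init__(self, target_names):
        self.target_names = list(target_names)
        self.rows = []
        self.labels = {}

    def fit(self, histories, decisions, targets):
        # Model input: the patient history concatenated with the decision.
        self.rows = [list(h) + [d] for h, d in zip(histories, decisions)]
        self.labels = {name: [t[name] for t in targets]
                       for name in self.target_names}
        return self

    def predict(self, history, decision):
        row = list(history) + [decision]
        return {name: nearest_label(self.rows, self.labels[name], row)
                for name in self.target_names}


# Toy training data: one history feature, a binary decision, two targets.
histories = [[0.0], [0.0], [1.0], [1.0]]
decisions = [0, 1, 0, 1]
targets = [
    {"dlt": 0, "cr": 0},
    {"dlt": 1, "cr": 1},
    {"dlt": 0, "cr": 0},
    {"dlt": 1, "cr": 0},
]
step = TreatmentSimulatorStep(["dlt", "cr"]).fit(histories, decisions, targets)
predicted = step.predict([0.0], 1)
```

Chaining three such steps, with each step's predictions appended to the history of the next, gives a simulator of the full treatment sequence.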
The DQL models were evaluated against the TS because our goal is not to replicate physicians’ decisions but to learn from the final reward and then quantitatively evaluate the treatment decisions learned by the DQL model. Such
At the same time, to the best of our knowledge, there is no existing rule-based approach (eg, decision trees) that is suitable for this task. We note that although very generic methods such as decision trees could be customized for a single-step prediction, they do not account for the sequential nature of this decision-making process. Furthermore, ultimately, evaluating such rule-based approaches would encounter the same
Although we tested the DQL models against the TS that allows on-policy evaluation, we emphasize two important considerations in the evaluation protocol:
TS was not used for training. Instead, we intentionally trained DQL on a tabular observation data set of 402 patients. This is because if we did train on the TS, the learned model would overfit the simulated environment, thereby overestimating the test performance (which is also measured from the TS). This deliberate decoupling of training and testing strategy, which is also adopted by Yauney and Shah [
The learned agent did not have access to the TS at test time, and the decisions were based solely on the current state. The TS was invoked only to simulate the environment, that is, generating the consequent state arising from the proposed decision and treatment, allowing the performance to be evaluated. This was consistent with the model-free nature of the DQL and ensured a fair evaluation by avoiding peeking into the real dynamics under which the test was conducted.
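The two considerations above can be sketched as an evaluation loop in which each policy sees only the current state and the simulator is called solely to produce the consequences of the chosen action. The function name, state keys, policies, and simulators are toy placeholders, not the trained models.

```python
def rollout(initial_state, policies, simulators, reward_fn):
    """Run one simulated patient through all decision points."""
    state = dict(initial_state)
    actions = []
    for policy, simulate in zip(policies, simulators):
        action = policy(state)  # the decision uses the current state only
        actions.append(action)
        # The simulator supplies the consequent features; the policy never calls it.
        state = {**state, **simulate(state, action)}
    return actions, reward_fn(state)


# Hypothetical hand-written policies for D1, D2, and D3.
policies = [
    lambda s: 0,
    lambda s: 1 if s["t_stage"] >= 3 else 0,
    lambda s: 0 if s["cr_nodal"] else 1,
]
# Hypothetical deterministic simulators, one per decision point.
simulators = [
    lambda s, a: {"dlt": a},
    lambda s, a: {"cr_nodal": a},
    lambda s, a: {"alive": 1, "dysphagia": 0},
]


def reward_fn(state):
    return state["alive"] + (1 - state["dysphagia"])


actions, reward = rollout({"t_stage": 3}, policies, simulators, reward_fn)
```

Because the policies are plain functions of the state, swapping the simulator for a different one changes the evaluation but not the decisions available to the agent at each step.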
Incidentally, even if the TS
Although sampling is a natural way to estimate it, the high dimensionality of the state space demands a large number of samples from the TS to accurately compute the expected reward.
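To make the sampling cost concrete, the following sketch enumerates every fixed sequence of 3 binary decisions (2^3 = 8 plans) and estimates the expected reward of each by Monte Carlo sampling. The simulator here is a toy with invented probabilities that ignores patient state entirely, so the sketch understates the real difficulty.

```python
import itertools
import random


def sample_reward(plan, rng):
    """Toy stochastic simulator (hypothetical probabilities, not fitted to data)."""
    p_alive = 0.80 + 0.05 * plan[1] - 0.02 * plan[0]
    p_dysphagia = 0.20 + 0.05 * plan[0] + 0.05 * plan[2]
    alive = rng.random() < p_alive
    dysphagia = rng.random() < p_dysphagia
    return float(alive) + float(not dysphagia)


def expected_reward(plan, n_samples=2000, seed=0):
    """Monte Carlo estimate of the expected reward of one fixed plan."""
    rng = random.Random(seed)
    return sum(sample_reward(plan, rng) for _ in range(n_samples)) / n_samples


# Every fixed sequence of (D1, D2, D3) decisions.
plans = list(itertools.product((0, 1), repeat=3))
estimates = {plan: expected_reward(plan) for plan in plans}
best_plan = max(estimates, key=estimates.get)
```

With a real simulator, each estimate would have to be repeated per patient state, which is where the sample count grows quickly.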
In practice, closed-loop planning is clearly preferred, where later actions are chosen to best respond to the outcome of preceding decisions and treatments, leading to the mathematical optimization formulation as
As a result, we must compute the state value (V[s]-functions) or the state-action value (Q[s,a]-functions). Because of the complexity of state space, both of them are nontrivial, even given the TS. Compared with open-loop planning, an additional layer of difficulty is incurred here because one needs to estimate 8
To summarize, this
The TS performance was evaluated using 2 accuracy measures, without running the DQL. The
The DQL models were then evaluated by comparing the OS and DP rates (as computed by the TS) resulting from the DQL treatment decisions with the outcomes observed under physician treatment on a separate test set. To facilitate interpretation, we computed the similarity between each of the DQL model’s decisions and the physicians’ decisions, considering each decision point independently. This evaluation does not need the TS. To further support interpretation, the policy followed by each model was analyzed by computing the increase (or decrease) in prescription rate for each treatment decision compared with the physicians’ ad hoc prescriptions to express whether the model was more (or less) likely to prescribe a certain treatment when compared with actual physicians.
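The two interpretation aids described above (per-decision similarity to physicians and the change in prescription rate) reduce to simple counts. The sketch below uses invented decision tuples for 4 patients, and the function names are illustrative.

```python
def per_decision_similarity(model_decisions, physician_decisions):
    """Fraction of patients with matching decisions, per decision point."""
    n = len(model_decisions)
    n_steps = len(model_decisions[0])
    return [
        sum(m[t] == p[t] for m, p in zip(model_decisions, physician_decisions)) / n
        for t in range(n_steps)
    ]


def prescription_rate_change(model_decisions, physician_decisions):
    """Model 'yes' rate minus physician 'yes' rate, per decision point."""
    n = len(model_decisions)
    n_steps = len(model_decisions[0])
    return [
        (sum(m[t] for m in model_decisions)
         - sum(p[t] for p in physician_decisions)) / n
        for t in range(n_steps)
    ]


# Hypothetical (D1, D2, D3) decisions for 4 patients.
model = [(0, 1, 0), (0, 1, 0), (1, 1, 0), (0, 1, 1)]
physician = [(1, 0, 1), (0, 1, 0), (1, 1, 0), (0, 0, 0)]

similarity = per_decision_similarity(model, physician)
overall_similarity = sum(similarity) / len(similarity)
rate_change = prescription_rate_change(model, physician)
```

A positive entry in `rate_change` means the model prescribes that treatment more often than the physicians did, and a negative entry means less often.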
We also evaluated the DQL treatment decisions by examining compliance with the National Comprehensive Cancer Network guidelines of acceptable care [
We first report the performance of the TS and the simulation performance of the DQL models and compare the DQL recommendations with the physician decision process, both in terms of per-decision similarity and overall similarity, that is, averaging the similarity for each decision point for each model. To report compliance, and to ensure quality and facilitate reproducibility, we provide a formal presentation of the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis checklist, formalized in Table S3 in
The complete
One-step prediction accuracy of treatment simulation (with 95% CIs) based on out-of-bag evaluation of 1000 stratified bootstrapped samples.
Predicted outcome | Accuracy without radiomics (%; 95% CI) | Accuracy with radiomics (%; 95% CI) |
Overall survival (4 years) | 78.23 (73.20-82.92) |
|
Feeding tube (6 months) | 74.37 (68.81-79.40) |
|
Aspiration rate after therapy |
|
73.96 (68.04-78.76) |
Prescribed chemotherapy (single, doublet, triplet, quadruplet, none, or not otherwise specified) |
|
82.77 (78.06-87.13) |
Chemotherapy modification (yes or no) |
|
80.22 (75.98-84.82) |
Dose modified | 92.39 (89.23-94.95) |
|
Dose delayed |
|
92.35 (88.56-95.52) |
Dose cancelled | 91.58 (87.68-94.77) |
|
Regimen modification |
|
91.79 (54.7-95.05) |
DLTb (yes or no) | 81.51 (77.25-85.42) |
|
DLT: dermatological |
|
90.58 (87.05-93.3) |
DLT: neurological | 92.17 (88.66-95.1) |
|
DLT: gastrointestinal | 89.60 (85.86-92.96) |
|
DLT: hematological | 90.10 (86.17-93.23) |
|
DLT: nephrological |
|
98.50 (96.55-99.52) |
DLT: vascular | 98.45 (96.45-100) |
|
DLT: infection (pneumonia) |
|
98.44 (96.37-99.50) |
DLT: other |
|
92.35 (83.17-96.98) |
DLT: grade | 73.85 (53.84-79.9) |
|
No imaging (0=no and 1=yes) | 100 (100-100) | 100 (100-100) |
Complete response, primary | 83.51 (78.82-87.56) |
|
Complete response, nodal | 94.79 (90.5-97.03) |
|
Partial response, primary |
|
80.32 (75.89-84.85) |
Partial response, nodal | 92.93 (90-95.65) | 92.93 (90.05-95.52) |
Stable disease, primary | 95.10 (91.96-97.84) |
|
Stable disease, nodal | 96.58 (94.47-98.05) |
|
Concurrent chemotherapy regimen |
|
65.99 (59.91-71.8) |
Concurrent chemotherapy modification (yes or no) | 70.53 (64.92-76.06) |
|
Complete response, primary 2 |
|
77.35 (29.95-84.57) |
Complete response, nodal 2 | 55.50 (49.01-61.54) |
|
Partial response, primary 2 | 78.92 (74.26-83.25) |
|
Partial response, nodal 2 | 52.50 (46.19-58.03) |
|
Stable disease, primary 2 | 99.48 (98.46-100) | 99.48 (98.41-100) |
Stable disease, nodal 2 | 96.50 (94.12-98.04) |
|
DLT: dermatological 2 | 91.99 (87.63-95.17) |
|
DLT: neurological 2 |
|
91.97 (88.29-94.69) |
DLT: gastrointestinal 2 | 89.74 (85.22-93.65) |
|
DLT: hematological 2 | 92.71 (89.42-95.16) |
|
DLT: nephrological 2 | 92.25 (88.17-97.94) |
|
DLT: vascular 2 | 100 (99.45-100) | 100 (99.02-100) |
DLT: other 2 |
|
93.24 (89.23-96.14) |
aValues in italics indicate whether higher accuracy is achieved by including or excluding radiomics.
bDLT: dose-limiting toxicity.
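The out-of-bag evaluation on stratified bootstrap samples named in the table caption can be sketched as follows. The resampling details are an assumption (indices are drawn with replacement within each class), and a majority-class predictor replaces the actual classifier so that the example stays self-contained.

```python
import random
from collections import defaultdict


def stratified_bootstrap(labels, rng):
    """Resample indices with replacement within each class."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    in_bag = []
    for indices in by_class.values():
        in_bag.extend(rng.choice(indices) for _ in indices)
    out_of_bag = sorted(set(range(len(labels))) - set(in_bag))
    return in_bag, out_of_bag


def oob_accuracies(features, labels, fit, n_boot=200, seed=0):
    """Accuracy on the out-of-bag rows of each bootstrap sample."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        in_bag, oob = stratified_bootstrap(labels, rng)
        if not oob:
            continue  # nothing left to evaluate on for this resample
        predict = fit([features[i] for i in in_bag], [labels[i] for i in in_bag])
        hits = sum(predict(features[i]) == labels[i] for i in oob)
        scores.append(hits / len(oob))
    return scores


def fit_majority(train_features, train_labels):
    """Stand-in classifier: always predict the most frequent training label."""
    majority = max(set(train_labels), key=train_labels.count)
    return lambda row: majority


# Toy data: 8 positive and 4 negative cases with a dummy feature.
labels = [1] * 8 + [0] * 4
features = [[float(i)] for i in range(len(labels))]
scores = oob_accuracies(features, labels, fit_majority)
mean_accuracy = sum(scores) / len(scores)
```

Sorting `scores` and reading off the 2.5th and 97.5th percentiles would give an interval of the kind reported in the table.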
Recall from group 4 in
The complete performance of all DQL models on simulated patient outcomes is presented in Table S4 in
For the purposes of this paper, we consider the
To assess model parsimony (ie, the minimum number of layers for maintaining equivalent predictive performance),
Model performance for the combined outcome (overall survival+dysphagia) models without (left) and with radiomics (right). The figure shows the performance for overall survival (top) and toxicity (dysphagia; bottom), with varying numbers of layers showing treatment simulation results on the test data.
The similarity rates (with 95% CIs) with respect to physicians’ treatment
The distributions (with 95% CIs) of the T and N stages of patients in the test set, separated by chemotherapeutic treatment prescribed by the best-performing model, are presented in
The rates at which models choose a certain policy compared with the physicians’ treatment rate at each decision point are shown in
Tumor stage demographics of patients based on the chemotherapeutic treatment decisions of the best-performing model (n=134).
Demographics | Chemotherapy | No chemotherapy, no induction chemotherapy, radiotherapy alone (%; 95% CI) | ||||||
|
Induction chemotherapy | No induction chemotherapy, concurrent chemotherapy (%; 95% CI) |
|
|||||
|
Concurrent chemotherapy (%; 95% CI) | Radiotherapy alone (%; 95% CI) |
|
|
||||
|
||||||||
|
T1 | 23.08 (0-65.38) | 3.85 (0-26.92) | 69.23 (26.92-96.15) | 0 (0-23.08) | |||
|
T2 | 25.40 (6.35-55.56) | 3.17 (0-22.22) | 66.67 (38.06-88.89) | 0 (0-20.63) | |||
|
T3 | 32 (8-64.1) | 4 (0-28) | 60 (28-88) | 0 (0-16) | |||
|
T4 | 36.84 (5.26-84.21) | 5.26 (0-31.58) | 52.63 (10.53-94.74) | 0 (0-15.79) | |||
|
Txb | 0 (0-100) | 0 (0-100) | 100 (0-100) | 0 (0-100) | |||
|
||||||||
|
N0d | 20 (0-100) | 0 (0-40) | 60 (0-100) | 0 (0-40) | |||
|
N1 | 17.39 (0-73.91) | 0 (0-30.43) | 73.91 (17.39-95.65) | 0 (0-21.74) | |||
|
N2 | 29.41 (9.80-56.89) | 3.92 (0-22.55) | 62.75 (36.27-81.37) | 0 (0-17.65) | |||
|
N3 | 25 (0-100) | 0 (0-50) | 50 (0-100) | 0 (0-25) | |||
|
||||||||
|
N0 | 16.67 (0-100) | 0 (0-33.33) | 66.67 (0-100) | 0 (0-33.33) | |||
|
N1 | 22.06 (2.94-60.29) | 2.94 (0-26.47) | 70.59 (30.88-92.65) | 0 (0-22.06) | |||
|
N2 | 33.93 (8.93-69.64) | 3.57 (0-26.79) | 58.93 (25-83.93) | 0 (0-14.29) | |||
|
N3 | 25 (0-100) | 0 (0-50) | 50 (0-100) | 0 (0-25) |
aT: primary tumor.
bTx: no information about the primary tumor or it cannot be measured.
cN: lymph nodes.
dN0: nearby lymph nodes do not contain cancer.
eAmerican Joint Committee on Cancer’s Cancer Staging Manual, 8th edition.
Absolute increase (or decrease) of treatment decision rate compared with physicians’ decisions. The plots refer to decisions 1 (top), 2 (middle), and 3 (bottom) on the test set and for models considering only overall survival as an outcome measure (left) or overall survival+dysphagia (right) without radiomics. Y: yes.
The training time for a single DQL model did not significantly vary between shallower and deeper NNs and was just a few minutes on average for a complete model. With 1000-sample bootstrapping, the training time was accordingly longer, costing >24 hours to generate the results shown in
As there is no practical way of verifying counterfactual
The patient in the first case study differed in every decision: the treatment sequence prescribed by their clinician team was D1: IC, D2: RT, and D3: ND, whereas the DQL sequence was D1: not IC, D2: CC, and D3: not ND. During our discussion, upon retrieving and examining the medical records, the oncologists described this case as having “a very unique and strange presentation” with bilateral disease involving the retropharyngeal lymph node (RPN). As the MDACC has historically associated RPN involvement with increased metastatic risk in published series [
The patient we considered in the second case study featured disagreement only in the first decision. The treatment sequence prescribed by their clinician team was D1: not IC, D2: CC, and D3: not ND, whereas the DQL sequence was D1: IC, D2: CC, and D3: not ND. Upon examining the medical records, the oncologists noted that the patient had only 1 functioning kidney; therefore, in the first stage, the team decided to prescribe a low-dose chemotherapy regimen treatment as a precaution to prevent renal injury [
Overall, the physician review in both these instances that we investigated in detail suggests that, in the absence of specific
The high average, median, and overall accuracies provided by the TS in predicting the outcomes of treatments indicate that the TS is a valid digital twin for the treatment process when predicting the outcome of a treatment sequence. Our results also indicate that the Q-learning models indeed capture the nature of the dynamic treatment problem and provide a valid solution. Our models showed consistent improvements for all the outcome features taken into account, as well as moderate similarity to physicians’ decisions. Overall, these results indicate that DQL modeling can serve as a digital twin of the treatment decision process and TS modeling can serve as a digital twin of the patient treatment. When combined, DQL and TS constitute a valid patient-physician
Furthermore, our results show that the DQL models that consider OS+DP outperform models considering only OS in terms of simulated survival rate. As the absence of DP (FT or AR) symptoms is positively correlated with OS, rewarding the absence of these symptoms indirectly helps maximize OS as well.
Moreover, OS+DP models show higher similarity to actual physician decisions because they represent a finer-grained approximation of the decision process than models that include only OS as an outcome, including more of the features considered by the physician when choosing an optimal treatment.
Surprisingly, given the abundance of data on radiomics models for head and neck cancer [
Our findings also justify the choice of a deep NN model instead of a regular linear model: although using DQL reduces model parsimony, the linear models (ie, the NNs with 0 hidden layers) perform worse than the deeper models in terms of simulated performance, CI variance, and similarity.
Furthermore, per
When comparing OS-only models with OS+DP models, the prescription rates presented in
Although the proposed approach was shown to be effective in dynamically selecting optimal treatment strategy for patients with oropharyngeal squamous cell carcinoma, it is not without limitations. Because of the retrospective nature of the data set, our Q-learning models had to be evaluated through the TS, a supervised learning model, which might be seen as self-referential. However, the TS is a necessary approach before prospective application because evaluating the models based on physician similarity alone would not reflect the purpose of DQL. Intuitively, the goal of DQL is not to replicate the decisions taken by physicians in the data set but to learn from these decisions and their effect to discern between optimal and nonoptimal choices with respect to a given outcome measure.
Furthermore, because we train our
In conclusion, we constructed a DQL modeling approach to make optimized sequential treatment decisions based on a set of desired outcomes in head and neck cancer therapy and paired it with a simulation of the treatment process for evaluation purposes. This modeling approach represents, to our knowledge, the first application of DQL with simulation as a
Supplementary Tables 1-5.
AJCC: American Joint Committee on Cancer
AR: aspiration rate
CC: concurrent chemotherapy
DP: dysphagia
DQL: deep Q-learning
FT: feeding-tube dependence
IC: induction chemotherapy
MDACC: MD Anderson Cancer Center
MDP: Markov decision process
ND: neck dissection
NN: neural network
OS: overall survival
RL: reinforcement learning
RPN: retropharyngeal lymph node
RT: radiotherapy
SVC: support vector classifier
TS: treatment simulator
The authors thank all members of the Electronic Visualization Laboratory, members of the MD Anderson Head and Neck Cancer Quantitative Imaging Collaborative Group, and the authors’ collaborators at the University of Iowa and University of Minnesota. This work was directly supported by the National Institutes of Health (NCI-R01-CA214825 and NCI-R01CA225190) and the National Science Foundation (CNS-1625941 and CNS-1828265). Direct infrastructure support was provided for this project by the multidisciplinary Stiefel Oropharyngeal Research Fund of the University of Texas MD Anderson Cancer Center Charles and Daneen Stiefel Center for Head and Neck Cancer, the MD Anderson Cancer Center Support Grant (P30CA016672), and the MD Anderson Program in Image Guided Cancer Therapy. CDF received funding and salary support
The data set used in the data analysis is publicly available [
ET, XZ, GC, CDF, AW, and GEM designed and developed the machine learning models and were responsible for data extraction and curation, statistical analysis, and interpretation. LVD, ASRM, and CDF were responsible for direct patient care provision, direct clinical data collection, interpretation, and analytic support. GC supervised statistical analysis, data extraction, and analytic support and is the guarantor of statistical quality. ET, XZ, AW, GC, LVD, ASRM, CDF, and GEM were responsible for manuscript writing and editing. XZ, GC, CDF, and GEM, as the primary investigators, conceived, coordinated, and directed all study activities and were responsible for data collection, project integrity, manuscript content, editorial oversight, and correspondence. All authors made substantial contributions to the conception or design of the work or the acquisition, analysis, or interpretation of data; drafted the manuscript or revised it critically for important intellectual content; gave final approval of the version to be published; and agreed to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
None declared.