This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
The dynamic tracking of tumors with radiation beams in radiation therapy requires the prediction of real-time target locations prior to beam delivery, because beam adjustment and gated tracking introduce time latency.
In this study, a deep learning model that was based on a temporal convolutional neural network was developed to predict internal target locations by using multiple external markers.
Respiratory signals from 69 treatment fractions of 21 patients with cancer who were treated with the CyberKnife Synchrony device (Accuray Incorporated) were used to train and test the model. The model's performance was evaluated by comparing it to a long short-term memory (LSTM) model in terms of the root mean square errors (RMSEs) between real and predicted respiratory signals. The effect of the number of external markers was also investigated.
The average RMSEs of predicted (ahead time=400 ms) respiratory motion in the superior-inferior, anterior-posterior, and left-right directions and in 3D space were 0.49 mm, 0.28 mm, 0.25 mm, and 0.67 mm, respectively.
The experiment results demonstrated that the temporal convolutional neural network–based respiratory prediction model could predict respiratory signals with submillimeter accuracy.
The aim of radiation therapy is not only to deliver lethal doses of radiation to target tumors but also to minimize the dose of unnecessary radiation delivered to the surrounding healthy tissues and structures [
Many methods have been investigated to reduce the effect of respiratory motion, which mainly include the following:
Adding a margin around the target tumor: a 10- to 15-mm margin is commonly added to the radiation treatment field to avoid missing the tumor, but this may result in unnecessary radiation exposure to healthy tissues and structures [
Breath hold: patients need to hold their breath during treatment to temporarily stop respiration, but this is not applicable for some patients, such as older and pediatric patients [
Beam tracking: radiation beams track a moving tumor dynamically to ensure that the tumor target is constantly within the treatment field [
All beam tracking methods must compensate for the latency of various sources, such as latencies from beam adjustment and image capture times [
Recently, deep learning approaches based on long short-term memory (LSTM) have been successfully used to solve time series prediction problems in several fields. For example, Ma et al [
The tumor motion data (69 treatment fractions of 21 patients) used in this study were obtained from an open data set that was recorded by the CyberKnife Synchrony (Accuray Incorporated) tracking system at a sampling rate of 25 Hz [
The general scheme for the prediction process of 2 models is outlined in
Flowchart of the prediction algorithm.
The arrangement of respiratory signals used for network training and validation.
Position type | Data for training | Data for validation
Position of marker 1 | M^{a}1_{SI^{b}, AP^{c}, LR^{d}} (1, 2, …, t_{s}) | M1_{SI, AP, LR} (t_{s}+1, t_{s}+2, …, t_{s}+t_{end})
Position of marker 2 | M2_{SI, AP, LR} (1, 2, …, t_{s}) | M2_{SI, AP, LR} (t_{s}+1, t_{s}+2, …, t_{s}+t_{end})
Position of marker 3 | M3_{SI, AP, LR} (1, 2, …, t_{s}) | M3_{SI, AP, LR} (t_{s}+1, t_{s}+2, …, t_{s}+t_{end})
Position of a tumor | T^{e}_{SI, AP, LR} (1, 2, …, t_{s}) | T_{SI, AP, LR} (t_{s}+1, t_{s}+2, …, t_{s}+t_{end})
^{a}M: external marker position.
^{b}SI: superior-inferior.
^{c}AP: anterior-posterior.
^{d}LR: left-right.
^{e}T: tumor position.
For the training process, the training input data and prediction target data were first used to tune the hyperparameters by using cross-validation and were then used to train the model. The external markers’ positions during the first input period of the training process (ie, the time between t=1 and t=t_{delay}) were used as the training input data for predicting the tumor positions (target positions) at a specific time frame (t=t_{delay}+t_{ahead}). This training process was repeated to predict the next tumor position until either the threshold of the cost function or the preset maximum iteration number was reached. Each pair of data points (ie, the input data, M[t+1,…, t+t_{delay}], vs the output data, T[t+t_{delay}+t_{ahead}]) constituted one training sample. “M” denoted the 3 external markers’ positions (M1, M2, and M3) in 3 directions (the superior-inferior, anterior-posterior, and left-right directions), and t_{ahead} represented the ahead time needed for making predictions.
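The sliding-window pairing described above can be sketched as follows; the function name, the stacked array shapes, and the 0-based indexing are illustrative assumptions rather than details from the original implementation.

```python
import numpy as np

def build_training_pairs(markers, tumor, t_delay, t_ahead):
    """Slide a window over the signals to build (input, target) pairs.

    markers: array of shape (T, 9) -- 3 external markers x 3 directions
             (SI, AP, LR), used as the model input
    tumor:   array of shape (T, 3) -- internal tumor position (SI, AP, LR)
    Each input window M[t+1, ..., t+t_delay] (1-based times, as in the
    text) is paired with the tumor position T[t+t_delay+t_ahead].
    """
    n_pairs = len(markers) - t_delay - t_ahead + 1
    X = np.stack([markers[t:t + t_delay] for t in range(n_pairs)])
    y = np.stack([tumor[t + t_delay + t_ahead - 1] for t in range(n_pairs)])
    return X, y
```

With a 25-Hz sampling rate, a 400-ms ahead time corresponds to t_ahead=10 samples.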
For the evaluation process, the testing signals, which were represented as M(t_{s}+1, t_{s}+2,…, t_{s}+t_{end}) and T(t_{s}+1, t_{s}+2,…, t_{s}+t_{end}), were used to evaluate the developed model. Similar to the training process, the external markers’ positions during the first input period of the evaluation process (ie, the time between t=t_{s}+1 and t=t_{s}+t_{delay}) were used to predict the tumor’s position (T’[t_{s}+t_{delay}+t_{ahead}]) at a specific time (t=t_{s}+t_{delay}+t_{ahead}). This process was repeated to predict subsequent tumor positions continuously. The external signals recorded during radiation therapy (ie, the time between t=t_{s}+t_{end}−t_{delay}−t_{ahead}+1 and t=t_{s}+t_{end}−t_{ahead}) were used to predict the final tumor position (T’[t_{s}+t_{end}]). Finally, the predicted signals (T’[t_{s}+t_{delay}+t_{ahead}],…, T’[t_{s}+t_{end}]) were compared to the real tumor positions (T[t_{s}+t_{delay}+t_{ahead}],…, T[t_{s}+t_{end}]).
The recurrent neural network (RNN) is a particular type of neural network that allows for self-recurrent connections and transmits parameters across different time steps. An RNN model can store information from former time steps. However, it is difficult for the RNN to memorize long-term information due to vanishing and exploding gradients [
The LSTM layer is a special RNN layer that overcomes the RNN’s weakness in memorizing long-term information [
The structure of an LSTM layer. LSTM: long short-term memory.
The TCN model was based on a transformation of a 1D fully convolutional network that was used for sequential prediction problems. The TCN model used a multilayer network to learn information over a long time span. Sequence information was transmitted layer by layer across the network until prediction results were obtained. The architecture of the TCN model is illustrated in
The TCN model used causal convolutions, in which the output at time t was convolved only with elements from previous layers at time t and earlier, to ensure that no leakage occurred from the future into the past.
The TCN model used dilated convolutions to ensure that each hidden layer had the same size as the input sequence and to increase the receptive field (ie, learning longer lengths of information).
The input of the TCN model was sampled at intervals determined by the dilation factor. The dilated convolution was defined as follows:

F(s) = Σ_{i=0}^{k−1} f(i) · x_{s−d·i} (1)

In equation 1, x was the input sequence, f was a convolution filter of size k, d was the dilation factor, and the index s−d·i ensured that only the current and past elements of the sequence were used.
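As an illustration of the dilated causal convolution in equation 1, a minimal direct implementation can be written from the definition; the function name and the zero padding before the start of the sequence are assumptions of this sketch.

```python
import numpy as np

def dilated_causal_conv(x, f, d):
    """Dilated causal convolution: F(s) = sum_{i=0}^{k-1} f(i) * x[s - d*i].

    Indices before the start of the sequence are treated as zeros, so the
    output at time s depends only on x[s] and earlier samples -- no
    information leaks from the future into the past.
    """
    k = len(f)
    y = np.zeros(len(x), dtype=float)
    for s in range(len(x)):
        for i in range(k):
            j = s - d * i          # step back d samples per filter tap
            if j >= 0:
                y[s] += f[i] * x[j]
    return y
```

With d=1 this reduces to an ordinary causal convolution; larger dilation factors widen the receptive field without adding parameters.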
Residual networks [
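The residual connection used in the TCN's residual blocks can be sketched in a few lines; the function names and the placement of the ReLU after the addition are assumptions of this illustration, not details from the original architecture.

```python
import numpy as np

def relu(x):
    """Rectified linear unit, applied elementwise."""
    return np.maximum(0.0, x)

def residual_block(x, transform):
    """Residual connection: output = ReLU(x + F(x)).

    The block learns a modification F of the identity mapping rather
    than a full transformation of the input, which eases the training
    of deep networks.
    """
    return relu(x + transform(x))
```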
The architecture of the temporal convolutional neural network model. "d" was the dilation factor. Conv: convolution; ReLU: rectified linear unit.
With regard to the TCN model, previous TCN studies [
The respiratory signals from 69 treatment fractions of 21 patients with cancer who were treated with the CyberKnife Synchrony (Accuray Incorporated) device were used to evaluate the proposed model. Of the 69 treatment fractions, 5 were used to tune the hyperparameters, and the remaining 64 were used to evaluate prediction performance. For each of the 69 treatment fractions, signals acquired during approximately the first 3 minutes (4500 data points) were used as the training signals for the prediction model, and signals from the following 30 seconds were used as the test signals for assessing the effectiveness of the proposed model. The ahead time (t_{ahead}) used in this study was 400 ms [
The root mean square errors (RMSEs) between real and predicted signals of respiratory motion in a 3D space were used for assessment [
RMSE = √((1/N) Σ_{i=1}^{N} ‖T′(i) − T(i)‖²) (5)

In equation 5, T′(i) and T(i) denoted the predicted and real tumor positions at time point i, respectively, and N was the number of evaluated time points.
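The RMSE computation can be sketched as follows; this assumes the 3D RMSE is the root of the time-averaged squared Euclidean distance between predicted and real positions, which is consistent with the relationship between the per-axis and 3D values reported in the results.

```python
import numpy as np

def rmse_3d(pred, real):
    """3D RMSE: sqrt of the mean (over time) squared Euclidean distance
    between predicted and real positions, each of shape (T, 3)."""
    err = pred - real                               # columns: SI, AP, LR
    return float(np.sqrt(np.mean(np.sum(err ** 2, axis=1))))

def rmse_per_axis(pred, real):
    """Per-direction RMSEs (SI, AP, LR), returned as a length-3 array."""
    return np.sqrt(np.mean((pred - real) ** 2, axis=0))
```

Note that, under this definition, the squared 3D RMSE equals the sum of the squared per-axis RMSEs.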
The root mean square errors (RMSEs) of the three prediction models.
Direction | RMSEs (mm) of the LSTM^{a} model | RMSEs (mm) of the TCN^{b} model | RMSEs (mm) of the no-prediction model
Anterior-posterior direction | 0.29 | 0.28 | 0.50
Left-right direction | 0.27 | 0.25 | 0.45
Superior-inferior direction | 0.55 | 0.49 | 1.04
3D space | 0.73 | 0.67 | 1.36
^{a}LSTM: long shortterm memory.
^{b}TCN: temporal convolutional neural network.
We investigated the hyperparameters in the 4D hyperparameter space (5 options per hyperparameter; 625 experiments) for both the TCN and LSTM models by using the grid search method on 5 randomly selected treatment fractions. The options and results of hyperparameter tuning are depicted in
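The exhaustive grid search over the 4D hyperparameter space can be sketched as follows; the option lists mirror the TCN rows of the tuning table, while the `grid_search` function name and the `evaluate` callback (returning a validation RMSE for one configuration) are placeholder assumptions.

```python
import itertools

# Option lists for the 4 TCN hyperparameters (5 options each -> 5**4 = 625 runs)
GRID = {
    "n_layers": [4, 5, 6, 7, 8],
    "filter_size": [1, 3, 5, 7, 9],
    "n_input_neurons": [5, 10, 15, 20, 25],
    "learning_rate": [0.0001, 0.001, 0.005, 0.01, 0.1],
}

def grid_search(evaluate, grid=GRID):
    """Try every combination in the grid; keep the configuration with the
    lowest validation score (eg, RMSE averaged over the tuning fractions)."""
    keys = list(grid)
    best_cfg, best_score = None, float("inf")
    for values in itertools.product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = evaluate(cfg)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```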
The RMSEs for respiratory motion in all directions. These were determined by using the LSTM and TCN models and different ahead times for each treatment fraction. AP: anteriorposterior; LR: leftright; LSTM: long shortterm memory; RMSE: root mean square error; SI: superiorinferior; TCN: temporal convolutional neural network.
The performance comparison between the TCN and LSTM methods for predicting motion in the (A) superiorinferior direction, (B) leftright direction, and (C) anteriorposterior direction. LSTM: long shortterm memory; TCN: temporal convolutional neural network.
The options and results of hyperparameter tuning.
Models and hyperparameters | Hyperparameter options | Hyperparameter selected
TCN model
Number of layers | 4, 5, 6, 7, and 8 | 5
Filter size | 1, 3, 5, 7, and 9 | 9
Number of neurons in the input layer | 5, 10, 15, 20, and 25 | 15
Learning rate | 0.0001, 0.001, 0.005, 0.01, and 0.1 | 0.001
LSTM^{a} model
Number of LSTM layers | 1, 2, 3, 4, and 5 | 2
Learning rate | 0.0001, 0.001, 0.005, 0.01, and 0.1 | 0.01
Number of hidden units per layer | 10, 50, 100, 150, 200, and 250 | 200
Number of neurons in the input layer | 5, 10, 15, 20, and 25 | 20
^{a}LSTM: long shortterm memory.
As illustrated in
The root mean square errors (RMSEs) of the temporal convolutional neural network model for each external marker (EM).
Direction | RMSEs for all EMs | RMSEs for EMs 1 and 2 | RMSEs for EMs 1 and 3 | RMSEs for EMs 2 and 3 | RMSEs for EM 1 | RMSEs for EM 2 | RMSEs for EM 3
Anterior-posterior direction | 0.28 | 0.28 | 0.28 | 0.28 | 0.29 | 0.29 | 0.29
Left-right direction | 0.25 | 0.26 | 0.26 | 0.25 | 0.27 | 0.26 | 0.26
Superior-inferior direction | 0.49 | 0.51 | 0.50 | 0.50 | 0.52 | 0.53 | 0.53
3D space | 0.67 | 0.69 | 0.68 | 0.68 | 0.71 | 0.72 | 0.72
A comparison of RMSEs for respiratory motion in a 3D space among each treatment fraction. A: Results of the TCN model using 1 external marker compared to those of the TCN model using all 3 external markers. B: Results of the TCN model using 2 external markers compared to those of the TCN model using all 3 external markers. RMSE: root mean square error; TCN: temporal convolutional neural network.
The effects of different components in the temporal convolutional neural network layer. A: Residual blocks. B: FS. FS: filter size; RMSE: root mean square error.
A TCN model for predicting respiratory motion by using external markers’ prior signals was developed and tested in this study. The experiments demonstrated that the TCN model performed better than the LSTM model in predicting future respiratory signals with a 400-ms ahead time.
As is well known, hyperparameter settings have a large influence on the prediction performance of machine learning models. This also holds true for our TCN and LSTM models. We tuned 4 major hyperparameters for both the TCN and LSTM models. Among these hyperparameters, the number of neurons in the input layer and the learning rate were tested for both models. Having a large number of neurons in the input layer allows for the inclusion of more features in the models. Useful features may increase prediction accuracy; however, redundant features may also be brought in along with the useful ones. Hence, if this hyperparameter is too large, prediction performance may degrade. The best numbers of neurons in the input layer for the TCN and LSTM models in this study were 15 and 20, respectively. The learning rate is an important hyperparameter in the model optimization process. If the learning rate is too large, the model may oscillate around the global minimum instead of converging. On the other hand, if this value is too small, the training time and the risk of overfitting increase. Learning rates of 0.001 and 0.01 were selected as the optimal values for the TCN and LSTM models, respectively. In addition to the two abovementioned hyperparameters, the number of layers and the filter size were investigated for the TCN model, whereas the number of LSTM layers and the number of hidden units per layer were tested for the LSTM model. With regard to the TCN model, the size of the effective window (receptive field) increased as the number of layers and the filter size increased. Hence, these two hyperparameters should guarantee that the receptive field of the TCN model covers enough context for respiratory signal prediction. The optimal values for these two hyperparameters in our experiments were 5 and 9, respectively.
With regard to the LSTM model, on one hand, a deeper LSTM model (ie, one with a large number of LSTM layers) may represent a more complicated relationship among respiratory signals and improve prediction performance. On the other hand, a deeper LSTM model also has an increased risk of overfitting and slower convergence. In this study, the prediction performance results of the LSTM model were comparable when the number of LSTM layers was over 2. Hence, we selected 2 as the optimal number of LSTM layers. Further, the number of hidden units per layer determined the width of each LSTM layer. We also found that having a large number of hidden units per layer was helpful for establishing a more complicated prediction model but at the same time increased the risk of overfitting and slowed convergence.
The effect that different numbers of external markers had on prediction performance was also investigated in this study. The TCN model had the best prediction performance when it used all three markers’ positions. As shown in
Finally, we studied the influence of the different components (the filter size and residual blocks) in the TCN model. The size of the effective window (receptive field) increased with the filter size. Hence, the model’s prediction performance initially improved as the filter size increased. However, prediction performance improved only slightly as the filter size increased further. This may be because the receptive field that resulted from using a filter size of 3 already provided enough context for the respiratory signal prediction task. On the other hand, we observed that the residual block architecture enhanced the model’s prediction performance substantially. We believe that this was because the residual blocks effectively allowed the TCN model to learn modifications of an identity mapping instead of a full transformation, which was crucial for the deep neural network architecture.
A deep learning approach based on the TCN architecture was developed in this study to predict internal tumor positions with a 400-ms ahead time based on external markers’ positions. The results demonstrated that this model could predict tumor positions accurately. Further, the prediction performance of the TCN model using all 3 external markers was more robust and accurate than that of the TCN model using 1 or 2 external markers.
long short-term memory
root mean square error
recurrent neural network
temporal convolutional neural network
This work was partially supported by the National Natural Science Foundation of China (62103366), the General Project of Chongqing Natural Science Foundation (grant cstc2020jcyjmsxm2928), Seed Grant of the First Affiliated Hospital of Chongqing Medical University (grant PYJJ2019208), Chongqing Municipal Bureau of Human Resources and Social Security Fund (grant cx2018147), and Medical Research Key Project of Jiangsu Health Commission (grant ZDB 2020022).
None declared.