Establishing Machine Learning Models to Predict Curative Resection in Early Gastric Cancer with Undifferentiated Histology: Development and Usability Study

Background: Undifferentiated type of early gastric cancer (U-EGC) is included among the expanded indications of endoscopic submucosal dissection (ESD); however, the rate of curative resection remains unsatisfactory. Endoscopists predict the probability of curative resection by considering the size and shape of the lesion and whether ulcers are present or not. The location of the lesion, indicating the likely technical difficulty, is also considered. Objective: The aim of this study was to establish machine learning (ML) models to better predict the possibility of curative resection in U-EGC prior to ESD. Methods: A nationwide cohort of 2703


Introduction
Endoscopic submucosal dissection (ESD) is indicated for the treatment of patients with early gastric cancer (EGC) satisfying prespecified criteria based on histological differentiation, lesion size, morphology, and the presence or absence of ulcers in the target lesion. The long-term prognosis following ESD for cases of EGC meeting the ESD criteria (achievement of curative resection) is comparable to that achieved with surgical resection [1,2]. In the context of histology, the undifferentiated type of EGC (U-EGC) generally refers to poorly differentiated adenocarcinoma, signet-ring cell carcinoma, or mucinous adenocarcinoma [3,4]. Although U-EGC is included among the expanded indications of ESD (mucosal U-EGC <2 cm without ulceration and without evidence of lymphovascular invasion), the rate of curative resection in U-EGC has remained very low, reported previously as 61.4% in a meta-analysis and 36.4% in a nationwide cohort study in Korea [5,6]. This implies that an unmet need persists regarding the accurate prediction of curative resection in U-EGC (ie, difficulty in adopting a precise ESD indication). Therefore, proper candidate selection prior to ESD is important.
Endoscopists predict the probability of curative resection by considering the size and shape of the lesion and whether ulcers are present or not. These components together compose the indications of ESD. In addition, lesion location, which can suggest the expected technical difficulty during the procedure and hint at the general condition of the patient, is also considered prior to conducting ESD. However, U-EGC has distinctive growth patterns relative to differentiated-type EGC [3,4,6,7]. U-EGC is known to extend laterally along the proliferative zone in the intermediate layer of the mucosa (subepithelial spreading), and the development pattern from the intermediate layer could lead to nonexposure to the surface mucosa, limiting the precise measurement of lesion size [5,8]. Subepithelial-spreading signet-ring cell carcinoma is more prevalent than the epithelial-spreading type in cases with background atrophy or intestinal metaplasia of the gastric mucosa [9,10]. Further, ESD of poorly differentiated adenocarcinoma presents a stronger association with submucosal invasion relative to that of signet-ring cell carcinoma [6]. Although adopting a precise indication is a key ability of endoscopists, U-EGC itself is a risk factor for a greater out-of-indication rate, leading to noncurative resection [11,12].
With the extensive production and collection of medical data, the application of artificial intelligence has been attempted in clinical practice [13]. Machine learning (ML) is a mathematical artificial intelligence algorithm automatically built from given data to predict precise outcomes in uncertain conditions without being explicitly programmed [14]. Examples of ML include Bayesian inferences, decision trees, support vector machines, deep neural networks, and ensemble methods (bagging or boosting) [14]. In short, ML is a type of applied statistical technique that can achieve high predictive accuracy. We aimed to establish an ML model to better predict the possibility of curative resection in U-EGC prior to ESD.

Ethical Statement
This study was approved by the Institutional Review Board of the Chuncheon Sacred Heart Hospital, Korea (no. 2020-07-019). It adhered to the principles expressed in the Declaration of Helsinki.

Data Sets
A nationwide cohort of 2703 U-EGCs treated by ESD (n=967) or surgery (n=1736) from 2006 to 2015 composed the training and internal validation groups. Eligible subjects were retrospectively enrolled from 18 university hospitals in Korea. Separately, an independent data set from the Korean ESD registry with 275 U-EGCs and an Asan Medical Center data set with 127 U-EGCs treated by ESD were used for external validation. Subjects in the Korean ESD registry data set were retrospectively identified from 8 institutions in Korea [6], having been treated with ESD from 2006 to 2015, while subjects in the Asan Medical Center data set were treated by ESD from 2007 to 2013. All these data sets were mutually exclusive.

ML Models
A broad range of supervised ML classifiers was tested for the establishment of a curative resection prediction model in U-EGC. In total, 18 ML classifiers were assessed, including naïve Bayes in Bayesian inferences, linear-discriminant analysis, logistic regression in generalized linear modeling, linear support vector machine, stochastic gradient descent, decision tree, k-nearest neighbors, deep neural networks, bagging ensemble methods (bagging classifier, random forest, and voting classifier), boosting ensemble methods (gradient boosting, adaptive boosting, categorical boosting, extreme gradient boosting [XGBoost], light gradient boosting machine, and histogram-based gradient boosting), and a stacking ensemble method (stacking classifier). The Gaussian naïve Bayes classifier is a model based on Bayes' theorem with the assumption that the features are independent of each other. A generalized linear model is the extension of a linear model to cases where the dependent variable is not normally distributed; we adopted the logistic regression classifier for this study. The support vector machine is a model that defines a decision boundary (hyperplane), that is, a reference line for classification. Stochastic gradient descent is a model for fitting linear classifiers under convex loss functions, such as support vector machine and logistic regression [15]. The decision tree is an algorithm that automatically finds rules in the data and creates tree-based classification rules. k-nearest neighbors is a classification algorithm that relies on distance metrics to measure similarity. Deep neural networks refer to an artificial neural network with multiple hidden layers between the input and output layers that learns from input data and optimizes the output classification with mathematical calculations.
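As an illustration of one of the simpler classifiers listed above, the following minimal Python sketch shows how a k-nearest neighbors classifier could assign a label by majority vote among the closest training points. It is not the implementation used in this study: the function name `knn_predict`, the Euclidean distance, the two features (age and lesion size), and the toy values are all assumptions made purely for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; Euclidean distance is an
    assumed choice of distance metric.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (age in years, lesion size in mm) -> 1 = curative, 0 = noncurative.
toy_train = [
    ((45, 8), 1), ((50, 10), 1), ((62, 12), 1),
    ((70, 25), 0), ((66, 30), 0), ((58, 28), 0),
]

small_lesion = knn_predict(toy_train, (52, 9))
large_lesion = knn_predict(toy_train, (65, 27))
```

In practice, features on different scales (years versus millimeters) would usually be standardized before computing distances.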
Ensemble algorithms combine multiple classification models to achieve better performance and can be classified as bagging, boosting, or stacking methods. Bagging is a parallel ensemble method that fits individual models to random samples of the data set and aggregates the predictions of each model for the final classification (bootstrap aggregation) [15]. This meta-estimator can reduce the variance of each classification model by introducing randomization into model establishment and then creating an ensemble out of the resulting models. As such, bagging reduces overfitting of the ML model [15]. Separately, boosting algorithms conduct ensemble modeling sequentially by learning from the errors of the previous model and updating the weight of subsequent models to optimize the loss functions and reduce the overall bias. In contrast with learning from homogeneous weak models in the bagging and boosting algorithms, stacking algorithms learn from heterogeneous models, creating a meta-model for the final classification. For the current ML analysis of this study, we used bagging classification, random forest, and voting classification for the bagging ensemble methods and gradient boosting, adaptive boosting, categorical boosting, XGBoost, light gradient boosting machine, and histogram-based gradient boosting for the boosting methods. For the stacking algorithm, we chose stacking classification. All the ML classifiers were imported from the scikit-learn package version 0.23.2 using the Python programming language (version 3.8.5, Python Software Foundation). Figure 1 shows the types of ML classifiers examined in this study.
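The bootstrap aggregation idea described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the scikit-learn bagging classifier used in the study: the weak learner is a hypothetical one-feature threshold rule (`fit_stump`), and the lesion sizes and labels are invented toy values.

```python
import random
from collections import Counter


def fit_stump(samples):
    """Fit a one-feature rule that predicts 1 when x <= threshold.

    Returns the smallest observed threshold with the highest training accuracy.
    """
    best_threshold, best_correct = None, -1
    for threshold in sorted({x for x, _ in samples}):
        correct = sum((x <= threshold) == bool(y) for x, y in samples)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def bagging_fit(samples, n_models=25, seed=0):
    """Fit one stump per bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    thresholds = []
    for _ in range(n_models):
        bootstrap = [rng.choice(samples) for _ in samples]
        thresholds.append(fit_stump(bootstrap))
    return thresholds


def bagging_predict(thresholds, x):
    """Aggregate the individual stump predictions by majority vote."""
    votes = Counter(int(x <= t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (lesion size in mm, 1 = curative / 0 = noncurative).
toy_data = [(5, 1), (8, 1), (10, 1), (12, 1), (15, 1),
            (18, 0), (22, 0), (25, 0), (30, 0), (35, 0)]

ensemble = bagging_fit(toy_data)
pred_small = bagging_predict(ensemble, 6)
pred_large = bagging_predict(ensemble, 28)
```

Because each stump sees a different bootstrap sample, the individual thresholds vary, and the vote averages out that variation. Boosting differs in that the models are fitted sequentially, each one focusing on the errors of its predecessors.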

Variables, Primary Outcome, and Data Splitting
A total of 18 ML classifiers were used for the establishment of prediction models of curative resection with the following variables: age; sex; location, size, and shape of the lesion; and whether ulcers were present or not. The primary outcome was the accuracy of the established ML models for the prediction of curative resection with the given variables of the lesions. Thus, the main metric was the classification accuracy. Each data set was prepared in the .csv file format. After uploading the .csv files to the Google Colaboratory analysis platform, the 2703 U-EGC data points were randomly split into training and internal validation sets at a ratio of 9:1.
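A random 9:1 split of this kind can be sketched as follows. The helper name `split_train_validation`, the fixed seed, and the rounding rule are assumptions for illustration; the exact splitting routine and seed used in the study are not stated, so the resulting set sizes may differ slightly from those of the original analysis.

```python
import random


def split_train_validation(records, validation_fraction=0.1, seed=42):
    """Shuffle the records and hold out a fraction for internal validation."""
    rng = random.Random(seed)
    shuffled = list(records)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Stand-in identifiers for the 2703 lesions; real rows would come from the .csv file.
lesion_ids = list(range(2703))
train_ids, validation_ids = split_train_validation(lesion_ids)
```

Fixing the seed makes the split reproducible, which matters when several classifiers are compared on the same internal validation set.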

Definitions of the Variables
Among the variables used in this study, patient age and the size of the lesion were continuous variables and the others were considered categorical variables. The location of the lesion was categorized by both longitudinal location (lower-third, mid-third, and upper-third) and circular location (lesser curvature, greater curvature, posterior wall, and anterior wall). The shape of the lesion was defined in accordance with the Japanese classification as elevated, flat, or depressed according to the morphological characteristics. According to this system, type I (protruded) and type IIa (superficial elevated) were considered as elevated, type IIb (flat) and type IIc (superficial depressed) were considered as flat, and type III (excavated) was considered as depressed [4]. Curative resection was defined as complete resection of U-EGC with a diameter of 2 cm or less and a lesion confined to the mucosa, with negative lateral and deep resection margins and no lymphovascular invasion. Noncurative resection referred to cases in which the resected lesion did not fulfill these criteria.
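The shape grouping and the curative resection criteria defined above can be written down as a small rule set. The sketch below is only an illustration: the class name `ResectedLesion`, its field names, and the dictionary `SHAPE_GROUPS` are hypothetical and do not come from the study's data files, and complete resection is represented here by the two margin fields as an assumption.

```python
from dataclasses import dataclass

# Grouping of the Japanese macroscopic types used for the shape variable.
SHAPE_GROUPS = {
    "I": "elevated", "IIa": "elevated",
    "IIb": "flat", "IIc": "flat",
    "III": "depressed",
}


@dataclass
class ResectedLesion:
    """Pathological findings after resection (hypothetical field names)."""
    size_mm: float
    confined_to_mucosa: bool
    lateral_margin_negative: bool
    deep_margin_negative: bool
    lymphovascular_invasion: bool


def is_curative(lesion: ResectedLesion) -> bool:
    """Apply the criteria: <=2 cm, mucosal, clear margins, no lymphovascular invasion."""
    return (
        lesion.size_mm <= 20
        and lesion.confined_to_mucosa
        and lesion.lateral_margin_negative
        and lesion.deep_margin_negative
        and not lesion.lymphovascular_invasion
    )


curative_case = ResectedLesion(12, True, True, True, False)
oversized_case = ResectedLesion(24, True, True, True, False)
invasive_case = ResectedLesion(12, True, True, True, True)
```

Any lesion failing at least one criterion is labeled noncurative, which gives the binary target that the classifiers are trained to predict.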

Statistical Analysis and Explainable Artificial Intelligence
Continuous variables were expressed as mean (SD) and categorical variables were expressed as numbers and percentages. Descriptive synthesis was conducted to reveal the baseline characteristics of the training and internal validation data set and the external validation data sets. To add to the interpretability of the established ML model, we performed an explainable artificial intelligence analysis. To elucidate the variables associated with lesions either accurately or inaccurately determined by the ML model, univariable analysis was conducted (Student t test and Fisher exact test for continuous and categorical variables, respectively). A two-tailed P value of less than .05 was adopted as the threshold for statistical significance. These analyses were performed using SPSS version 24.0 (IBM Corporation). Additionally, a feature importance (or permutation importance) analysis was completed to reveal which variables primarily contributed to the model's decision process [16,17]. This assessment measures the increase in predictive error when the values of a certain feature are randomly shuffled; shuffling an unimportant feature therefore does not affect the performance of the model [15]. Feature importance is measured by the F-score, which represents the ratio between the explained and the unexplained variance [17]. A decision process tree was plotted to visualize the step-by-step process of the decision making of the established ML model using the Graphviz package (version 0.14.1; AT&T Labs Research). A partial-dependence plot tool box (version 0.2.0) in the scikit-learn package was adopted to visualize the important features for the ML model, and the target plot and interaction plot were visualized [18,19]. Shapley additive explanations (version 0.35.0) analysis is an approach used to explain the output of any ML model using Shapley values and the degree of independence between features. The Shapley value expresses how much each feature contributes to creating the overall performance and represents feature importance while maintaining consistent and locally accurate additive feature attribution for a particular prediction [20].
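The shuffling procedure behind permutation importance can be illustrated with a short, library-free sketch. It scores a feature by the mean drop in accuracy after shuffling, which is an assumed choice for illustration rather than the F-score reported in this study; the function names, the toy model, and the toy rows are likewise hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model output matches the label."""
    return sum(model(row) == y for row, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, feature, n_repeats=20, seed=0):
    """Mean drop in accuracy after randomly shuffling one feature column."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        column = [row[feature] for row in rows]
        rng.shuffle(column)
        shuffled_rows = [{**row, feature: value} for row, value in zip(rows, column)]
        drops.append(baseline - accuracy(model, shuffled_rows, labels))
    return sum(drops) / n_repeats


def toy_model(row):
    """Hypothetical stand-in model that only looks at lesion size."""
    return int(row["size_mm"] <= 20)


rows = [{"age": a, "size_mm": s}
        for a, s in [(45, 5), (70, 10), (52, 15), (66, 25), (48, 30), (73, 35)]]
labels = [toy_model(row) for row in rows]

size_importance = permutation_importance(toy_model, rows, labels, "size_mm")
age_importance = permutation_importance(toy_model, rows, labels, "age")
```

Because the toy model ignores age, shuffling the age column leaves every prediction unchanged and its importance is zero, whereas shuffling lesion size degrades accuracy.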

Characteristics of the Training, Internal Validation, and External Validation Data Sets
The training and internal validation data sets contained not only endoscopically resected cases but also surgically removed cases of U-EGC. The first external validation data set was composed of a nationwide cohort of cases of ESD performed for U-EGC, while the second external validation data set consisted of cases of ESD performed for U-EGC from a single hospital with the largest degree of ESD experience to date in Korea. Therefore, the included data sets were marked by different clinical characteristics. Table 1 [...]. The GridSearchCV library [15] was used to automatically search among multiple parameter values to fit the estimators of an ML model. By using the GridSearchCV analysis, we found the optimal hyperparameters for the best performance as follows: learning rate 0.4, maximum depth 6, and number of estimators 100. Figure 2 shows the confusion matrix for the XGBoost classifier in the internal validation data set. Figure 3 and Figure 4 show the confusion matrices for the XGBoost classifier in the first and second external validation data sets, respectively. Table 3 shows the univariable analysis for the associated factors of lesions determined accurately or inaccurately in the curative resection of U-EGC by the XGBoost classifier. Notably, there was no single significant factor associated with lesions determined either accurately or inaccurately by the XGBoost classifier. Figure 5 shows the feature importance plot for the XGBoost classifier. Age, endoscopic size, and morphology of the lesions were the three most significant factors for the establishment of the ML model, in sequence. Multimedia Appendix 1 illustrates the decision process tree for the XGBoost classifier prior to adopting the GridSearchCV library. This simplified tree shows the step-by-step determination process of the ML model. The final leaf score is inserted in the following equation: $p(x) = 1 / (1 + e^{-\text{leaf score}})$.
Any value over 0.5 (50%) indicates curative resection and any value less than 0.5 indicates noncurative resection, as predicted by the XGBoost classifier [21]. Multimedia Appendix 2 shows the final decision process tree for the XGBoost classifier after adopting the GridSearchCV library, which presented the best performance in the internal validation. Endoscopic size of the lesion, patient age, and longitudinal location of the lesion were the important factors, in sequence. Multimedia Appendix 3 shows the partial-dependence target plot for the feature of endoscopic size of the lesion in the first external validation assessment. The probability of curative resection for the lesions with sizes ranging from 4 mm to 10 mm reached 80%. Meanwhile, U-EGC lesions with sizes ranging from 20.78 mm to 26.22 mm showed the lowest probability of curative resection at 16.1%. Multimedia Appendix 4 presents the two-way partial-dependence target plot for the features of endoscopic size of the lesion and patient age in the first external validation cohort. Given that the color of the circle above the imaginary line of Y=X is darker than that below the line, the endoscopic size and age are suggested to be correlated with curative resection of U-EGC. Multimedia Appendix 5 shows the partial-dependence interaction plot for the features of endoscopic size of the lesion and age in the first external validation group. Given that the contour lines are generally parallel to the Y-axis, the probability of curative resection is more dependent on the endoscopic size of the lesion.
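The conversion of a leaf score into a predicted class, as described earlier in this section, amounts to applying the logistic function and a 0.5 cutoff. The sketch below assumes a single simplified tree as in Multimedia Appendix 1; in a full boosted ensemble, the scores contributed by the individual trees are generally summed before the logistic function is applied. The function names and example scores are hypothetical.

```python
import math


def leaf_score_to_probability(leaf_score: float) -> float:
    """Logistic transform: p(x) = 1 / (1 + e^(-leaf score))."""
    return 1.0 / (1.0 + math.exp(-leaf_score))


def predict_curative(leaf_score: float, threshold: float = 0.5) -> bool:
    """True when the predicted probability exceeds the threshold.

    A probability of exactly 0.5 is treated as noncurative here; the text
    does not specify how that boundary case is handled.
    """
    return leaf_score_to_probability(leaf_score) > threshold


# Hypothetical leaf scores for illustration only.
p_positive = leaf_score_to_probability(1.2)
p_negative = leaf_score_to_probability(-0.8)
```

A positive leaf score therefore always maps to a probability above 0.5 (curative), and a negative leaf score to a probability below 0.5 (noncurative).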

Discussion
This study demonstrates the good performance of an ML model applied to the prediction of curative resection of U-EGC prior to ESD, suggesting a potential benefit of ML modeling for decision making in this part of clinical practice [22]. Moreover, thorough external validations confirmed the higher rate of curative resection predicted by ML modeling as compared with curative resection rates reported by clinicians. To our knowledge, this is the first study to establish and confirm the predictive performance of an artificial intelligence model for the therapeutic outcomes of ESD for U-EGC. Indeed, ML is characterized as a computer-aided prediction method and its most important benefit in this context consists of the improvement in predictive accuracy for curative resection prior to ESD. The proper selection of candidates is essential before beginning ESD. The most fundamental hypothesis is that endoscopic resection can be performed with curative intent in cases of EGC without lymph node metastasis. Therefore, indications of ESD were established using a combination of factors associated with a negligible lymph-node metastasis rate from the retrospective analysis of surgically resected specimens [3]. These indications are categorized by differentiated-type EGC and U-EGC according to the differentiation, specific size, and morphological and histological conditions of the involved lesion. However, optical endoscopic determination of the factors stated above is operator dependent. In the study of a Korean multicenter registry of ESD for U-EGC, there was a discrepancy between pre-ESD indications and post-ESD criteria in 36.7% of all the lesions [6]. Underestimation of the size was the most common reason for noncurative resection (71.4%), followed by underestimation of the depth of invasion (32%) and unpredictability of lymphovascular invasion (14.9%) [6].
Although adopting a precise indication is important, U-EGC itself is a risk factor for an enhanced out-of-indication rate, leading to noncurative resection; therefore, more strict indications might be necessary for pursuing the ESD of U-EGC [11,12].
Another important finding of this study is the presentation of the reasoning process of the ML model through the explainable artificial intelligence analysis. Notably, there is a tradeoff between accuracy and interpretability in the classification model of ML [14]. Although the ML approach exhibited high degrees of accuracy based on complex calculations, it is characterized by low interpretability (artificial intelligence is more generally characterized as being of a "black-box nature") [14]. Conventional statistical analyses such as univariate or multivariate logistic regression analyses in previous studies have shown the reasons underlying the lower curative resection rate of ESD for U-EGC [5,6]. However, there is a limitation in the explanatory power of the overall model (low accuracy) in these studies. The XGBoost classifier used parallel-tree boosting analysis to provide highly efficient and accurate predictions. Through the ensemble model and extensive explainable artificial intelligence analysis, we identified the size of the lesion as being the most important feature for the successful prediction of curative resection in the ESD of U-EGC. Although a prospective trial of ESD for U-EGC that satisfied the expanded indication reported an excellent long-term survival rate [6,23,24], more cautious application or restriction of ESD indications has been recommended, especially regarding the size categorization [3,25]. Most recently published studies have also indicated that small intramucosal U-EGC lesions measuring less than 1.0 cm or 1.5 cm without lymphovascular invasion should be considered as ESD candidates [26,27]. The explainable artificial intelligence analysis in our study also revealed that U-EGC lesions of less than 1 cm have the greatest probability of curative resection (Multimedia Appendix 3).
Considering that the aim of this study was not the validation of current ESD criteria, further studies with robust analysis would elucidate the value of these findings.
In the context of ecological factors, age and gender have been tested together with the endoscopic factors as potential variables for the prediction of the curative resection rate. However, these variables were not consistently identified as important indicators for predicting curative resection [28-30]. Although feature importance analysis (Figure 5) and Shapley additive explanations analysis (Multimedia Appendix 6) in our study revealed that age is an important variable for the ML determination process, explainable artificial intelligence analysis is currently an experimental method for understanding how ML reaches its judgments. It is presumed that the reason ML shows higher accuracy than traditional statistics is that it performs a complex operation that considers all variables. It is true that age is an important factor influencing ML judgment, but further explainable artificial intelligence analyses are needed to explain how much it affects the actual curative resection.
Although this study established and rigorously validated the predictive performance of the designed ML model, several inevitable limitations became apparent. First, there was some discrepancy in the validation performance between the first and second external data sets. The indications of ESD for U-EGC have not been approved by all endoscopists. Therefore, practice patterns adopting ESD indications for U-EGC have been heterogeneous depending on the institution. The first external validation data set was more heterogeneous with respect to the baseline characteristics and therapeutic outcomes. However, the second data set was collected from a single institution, thus providing a more discrete application pattern of the ESD indication for U-EGC. Second, patient age was an important feature in the explainable artificial intelligence analysis; however, this feature does not perfectly reflect the general condition of the patient. Further, there is no age factor for ESD indications. However, the general condition of the patients is frequently considered in the determination of whether to pursue ESD. Therefore, clinical factors that reliably reflect patients' health status other than age should be developed and considered so as to attain the most favorable therapeutic outcomes of ESD. Third, the training and internal validation data sets included cases that were surgically resected as well as endoscopically resected cases. Endoscopists decide whether to perform ESD or surgery when they detect U-EGC. In other words, it has not been determined which U-EGC is a candidate for ESD or surgery. All the U-EGCs resected with surgery or ESD were included as it was not always accurate and appropriate for the endoscopists to differentiate between ESD and surgery. If only U-EGCs that were resected by ESD were collected, a clear ESD candidate would have been collected, which in itself may be a selection bias.
In conclusion, we established an ML model capable of accurately predicting the curative resection of U-EGC prior to ESD by considering the morphological and ecological characteristics of the lesions. A clinical application study in a randomized controlled manner would elucidate the real value of this ML model.