This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Artificial intelligence (AI) for gastric cancer diagnosis has drawn increasing attention in recent years. AI plays a more important role in early gastric cancer than in advanced gastric cancer because early gastric cancer is not easily identified in clinical practice. However, to our knowledge, past syntheses have paid limited attention to populations with early gastric cancer.
The purpose of this study was to evaluate the diagnostic accuracy of AI in the diagnosis of early gastric cancer from endoscopic images.
We conducted a systematic review from database inception to June 2020 of all studies assessing the performance of AI in the endoscopic diagnosis of early gastric cancer. Studies not concerning early gastric cancer were excluded. The outcome of interest was the diagnostic accuracy (comprising sensitivity, specificity, and accuracy) of AI systems. Study quality was assessed on the basis of the revised Quality Assessment of Diagnostic Accuracy Studies. Meta-analysis was primarily based on a bivariate mixed-effects model. A summary receiver operating characteristic curve and a hierarchical summary receiver operating characteristic curve were constructed, and the area under the curve was computed.
We analyzed 12 retrospective case-control studies (n=11,685) in which AI identified early gastric cancer from endoscopic images. The pooled sensitivity and specificity of AI for early gastric cancer diagnosis were 0.86 (95% CI 0.75-0.92) and 0.90 (95% CI 0.84-0.93), respectively. The area under the curve was 0.94. Sensitivity analyses of studies using support vector machines and narrow-band imaging demonstrated more consistent results.
To our knowledge, this was the first synthesis of studies on the use of AI with endoscopic images to diagnose early gastric cancer. AI may support the diagnosis of early gastric cancer. However, the optimal combination of imaging techniques and algorithms remains unclear. Competing AI models for the diagnosis of early gastric cancer are worthy of future investigation.
PROSPERO CRD42020193223; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=193223
Gastric cancer is the fifth most common cancer and the third leading cause of cancer deaths worldwide, contributing to 19.1 million disability-adjusted life years in 2017 [
Although the use of AI has increased significantly in many fields, including health care [
Since the breakthrough of deep learning in the 2010s, the use of AI in clinical practice has increased dramatically [
Early gastric cancer was defined as mucosal and submucosal (T1) gastric cancer irrespective of lymph node involvement. Studies involving advanced gastric cancer, precancerous lesions such as intestinal metaplasia and dysplasia, and gastric cancer without specific annotations were excluded. The accuracy of AI was defined as the area under the hierarchical summary receiver operating characteristic curve or the area under the curve (AUC).
This meta-analysis was performed according to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. We systematically searched the PubMed, Embase, Cochrane Library, and Web of Science databases for studies that assessed the diagnostic accuracy of AI in early gastric cancer from endoscopic images from database inception to June 2020. We used “gastric cancer,” “endoscopy,” and “artificial intelligence” as relevant terms with Boolean operators “OR” and “AND” (
The quality of the included studies was assessed independently by 2 authors (P-CC and L-YR) on the basis of the revised Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2), and all disagreements were resolved through discussion with a third author (Y-NK). The assessment covered risk of bias and applicability concerns in the QUADAS-2 domains: patient selection, index test, reference standard, and flow and timing. From the included studies, we extracted the number of endoscopic images of lesions correctly diagnosed as early gastric cancer (ie, true positives), the number of images of benign lesions misdiagnosed as malignant (ie, false positives), the number of images of malignant lesions misdiagnosed as benign (ie, false negatives), and the number of images of benign lesions correctly diagnosed as benign (ie, true negatives). We also extracted data on the country of origin, AI methods, and image modalities used.
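As an illustration of these 2×2 definitions, the per-study sensitivity, specificity, and accuracy can be computed from the extracted image counts as follows (the counts shown are hypothetical and not taken from any included study):

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Diagnostic accuracy measures from a 2x2 table of endoscopic images:
    tp/fn count malignant images, fp/tn count benign images."""
    sensitivity = tp / (tp + fn)                # proportion of cancers detected
    specificity = tn / (tn + fp)                # proportion of benign lesions cleared
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # overall agreement with pathology
    return sensitivity, specificity, accuracy

# Hypothetical counts for illustration only
sens, spec, acc = diagnostic_metrics(tp=80, fp=10, fn=20, tn=90)
```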
The primary outcome was the accuracy of AI in diagnosing early gastric cancer from endoscopic images. Secondary outcomes focused on sensitivity analyses of (a) different AI methods, (b) endoscopic imaging modalities, (c) studies that compared AI and endoscopist performance, (d) studies that evaluated larger gastric lesions (>20 mm), (e) studies that simply differentiated abnormal from normal lesions rather than using pathological staging, and (f) studies that separated the training and testing data sets during AI training. Sensitivity analysis was conducted if a subgroup contained more than 2 studies. We assessed the heterogeneity of the included studies. Following extraction, the data were primarily analyzed using Stata 14 (StataCorp) except for subgroups with fewer than 4 studies. The midas and metandi commands were used to determine sensitivity, specificity, and AUC and to analyze the summary receiver operating characteristic (SROC) and hierarchical summary receiver operating characteristic (HSROC) curves. Basic formulas for the analyses were as follows:
D = a + bS, where D = logit(TPR) − logit(FPR) = ln(DOR) and S = logit(TPR) + logit(FPR)

In the formulas, “a” is the intercept, “b” is the slope, and DOR refers to the diagnostic odds ratio; TPR is the true-positive rate, and FPR is the false-positive rate. The modchk tool was used to examine goodness-of-fit and bivariate normality before SROC analysis in a bivariate mixed-effects model. The metabias command and the pubbias syntax were used to perform the Egger test and the Deeks funnel plot asymmetry test, respectively. The Egger test for diagnostic meta-analysis was based on the formula proposed by Hasselblad and Hedges; the formula mainly detects publication bias by testing the standard normal deviate among the included studies [
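The SROC transforms just described can be sketched in Python (a minimal illustration of the D and S quantities entering the regression; function and variable names are ours):

```python
import math

def logit(p):
    """Log-odds of a proportion p in (0, 1)."""
    return math.log(p / (1 - p))

def sroc_transforms(tpr, fpr):
    """Quantities for the SROC regression D = a + b*S:
    D = ln(DOR) = logit(TPR) - logit(FPR); S = logit(TPR) + logit(FPR)."""
    d = logit(tpr) - logit(fpr)   # log diagnostic odds ratio
    s = logit(tpr) + logit(fpr)   # proxy for the diagnostic threshold
    return d, s

# Illustrative values near the pooled estimates (sensitivity 0.86, specificity 0.90)
d, s = sroc_transforms(tpr=0.86, fpr=0.10)
```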
SND = a + b × precision

In the regression model, with intercept “a” and slope “b”, the standard normal deviate (SND) of each study is regressed on its precision (the inverse of the standard error); an intercept that deviates significantly from 0 indicates publication bias.
I² = 100% × (Q − df)/Q

where Q refers to Cochran Q, and df refers to the degrees of freedom (the number of included studies minus 1).
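The I² heterogeneity statistic can be computed from Cochran Q as follows (a minimal sketch; truncating negative values to 0 is the usual convention):

```python
def i_squared(q, df):
    """Higgins I^2 (%) from Cochran Q and its degrees of freedom
    (number of studies minus 1); negative values are truncated to 0."""
    return max(0.0, 100.0 * (q - df) / q)

# Illustrative: 12 studies (df = 11) with a hypothetical Q of 40
i2 = i_squared(q=40.0, df=11)
```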
Of the 5591 studies identified in the literature review, 5265 underwent title and abstract screening after removal of duplicates. The flowchart of the literature review process was constructed according to the PRISMA flowchart format (
Flowchart of the study selection process according to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) format. AI: artificial intelligence.
Detailed information on the 12 studies is listed in
We also assessed the quality of the studies along with the risk of bias according to the revised QUADAS-2 tool (
Characteristics of the included studies.
Study ID | Country of origin | Testing image number | Reference standard | Image modality | AIa method | AI training and testing data set | Standard reference | Endoscopist comparison | Other information |
Kubota et al, 2012 [ | Japan | 902 | Pathology | Not mentionedb | Multilayer neural network | Not separated | Unclear | No | Detected with pathological grading prediction
Miyaki et al, 2013 [ | Japan | 92 | Pathology | FICEc | SVMd (scale-invariant feature transform) | Separated | Pathology | No | Differentiated early gastric cancer from noncancerous tissues
Liu et al, 2016 [ | China | 400 | Pathology | Not mentionedb | Principal component discriminant analysis (YCbCr color space) | Separated | Pathology | No | Differentiated early gastric cancer from normal tissues
Kanesaka et al, 2018 [ | Japan | 81 | Pathology | NBIe | SVM (grey-level co-occurrence feature) | Separated | Pathology | No | Included only depressed type early gastric cancers that were <10 mm in size
Sakai et al, 2018 [ | Japan | 926 | Pathology | WLIf | CNNg | Not separated | Pathology | No | —h
Yamakawa et al, 2018 [ | Japan | 817 | Uncleari | Not mentionedj | Not mentioned | Separated | Unclear | No | Differentiated early gastric cancer from nonneoplastic tissues
Cho et al, 2019 [ | Korea | 200 | Pathology | WLI | CNN | Separated | Pathology | Yes | Detected early gastric cancer with pathological grading prediction
Namikawa et al, 2019 [ | Japan | 1479j | Uncleari | WLI, NBI, Chromok | CNN | Separated | Pathology | No | Differentiated early gastric cancer from gastric ulcers
Wu et al, 2019 [ | China | 200 | Pathology | WLI, NBI, BLIl | CNN | Separated | Pathology | Yes | Differentiated early gastric cancer from gastritis and normal tissues
Yoon et al, 2019 [ | Korea | 3390 | Pathology | WLI | CNN | Not separated | Pathology | No | —
Horiuchi et al, 2020 [ | Japan | 258 | Pathology | NBI | CNN | Separated | Pathology | No | Differentiated early gastric cancer from
Ikenoyama et al, 2020 [ | Japan | 2940 | Pathology | WLI | CNN (Single Shot MultiBox Detector) | Separated | Pathology | Yes | Included only early gastric lesions that were <20 mm
aAI: artificial intelligence.
bStudies that failed to mention imaging modalities.
cFICE: flexible spectral imaging color enhancement.
dSVM: support vector machine.
eNBI: narrow-band imaging.
fWLI: white light imaging.
gCNN: convolutional neural network.
hNot available.
iStudies that mentioned early gastric cancer but without reference to pathological staging.
jStudies were reported in meeting abstracts.
kChromo: chromoendoscopy.
lBLI: blue laser imaging.
To assess the diagnostic ability of AI to detect early gastric cancer from endoscopic images, we performed a meta-analysis of the 12 selected studies. Goodness-of-fit (
We assessed the diagnostic performance of various AI methods and endoscopic imaging modalities for early gastric cancer (
For endoscopic imaging modalities, studies using WLI had a sensitivity and specificity of 0.73 (95% CI 0.42-0.91) and 0.89 (95% CI 0.76-0.96), respectively. Studies using NBI reported a sensitivity and specificity of 0.96 (95% CI 0.92-0.98) and 0.83 (95% CI 0.54-0.95), respectively. The accuracy of the NBI group (AUC=0.96) was higher than that of the WLI group (AUC=0.90;
Overall sensitivity and specificity of artificial intelligence–assisted diagnosis of early gastric cancer. (A) Goodness-of-fit; (B) bivariate normality; (C) forest plot of overall sensitivity; and (D) forest plot of overall specificity. FP: false positive; TN: true negative.
Summary receiver operating characteristic curve, HSROC, AUC, and the Deeks funnel plot asymmetry test of artificial intelligence–assisted diagnosis of early gastric cancer. AUC: area under the curve; ESS: effective sample sizes; HSROC: hierarchical summary receiver operating characteristic; SENS: sensitivity; SPEC: specificity; SROC: summary receiver operator characteristic.
We excluded some studies with a high risk of bias and performed sensitivity analysis on the remaining studies (Tables S2-S5
Pooled sensitivity, specificity, and accuracy of the studies included in the meta-analysis and sensitivity analysis.
Group (studies and number of patients) | Sensitivity (95% CI) | I², % | Specificity (95% CI) | I², % | AUCa
Overall (12 studies, n=11,685) | 0.86 (0.75-0.92) | 97 | 0.90 (0.84-0.93) | 97 | 0.94
AIb method
Deep learning (8 studies, n=10,295) | 0.84 (0.69-0.93) | 98 | 0.88 (0.80-0.93) | 98 | 0.93
Nondeep learning (3 studies, n=573) | 0.91 (0.86-0.95) | 18 | 0.90 (0.87-0.93) | 0 | 0.96
Imaging modality
WLIc (4 studies, n=7456) | 0.73 (0.42-0.91) | 99 | 0.89 (0.76-0.96) | 99 | 0.902
NBId (2 studies, n=339) | 0.96 (0.92-0.98) | 0 | 0.83 (0.54-0.95) | 51 | 0.959
Sensitivity analysis
Excluding studies with unknown method (11 studies, n=10,868) | 0.87 (0.76-0.93) | 97 | 0.89 (0.83-0.93) | 97 | 0.936
Excluding studies with sample size <100 (10 studies, n=11,512) | 0.84 (0.71-0.92) | 97 | 0.89 (0.83-0.94) | 98 | 0.932
Excluding studies without separation of testing data (9 studies, n=6467) | 0.85 (0.70-0.93) | 96 | 0.90 (0.86-0.93) | 91 | 0.934
Excluding studies with any of the abovementioned situations (6 studies, n=5477) | 0.84 (0.62-0.94) | 98 | 0.89 (0.83-0.93) | 92 | 0.923
aAUC: area under the curve.
bAI: artificial intelligence.
cWLI: white light imaging.
dNBI: narrow-band imaging.
To our knowledge, this was the first systematic review and meta-analysis of AI-assisted endoscopic diagnosis of early gastric cancer. The accuracy, sensitivity, and specificity were 0.94, 0.86, and 0.90, respectively. High heterogeneity was noted. Sensitivity analysis revealed less heterogeneity in studies using nondeep learning AI methods and endoscopic NBI.
Our results indicate good sensitivity and specificity of AI-assisted detection of early gastric cancer. However, high heterogeneity was also noted among the included studies, which may be attributed to between-study differences in machine learning methods and imaging modalities [
Three of the included studies compared the diagnostic performance of AI with that of endoscopists (n=91) for early gastric cancer detection. The endoscopists were assigned to only 1 subgroup because of inconsistent definitions of expert and nonexpert endoscopists between studies. The sensitivity and specificity of AI were 0.67 and 0.87, respectively, and those of the endoscopists were 0.68 and 0.92, respectively. In both groups, diagnostic performance varied widely, with high heterogeneity. The diagnostic performance of AI compared favorably with that reported for endoscopists using WLI in other studies; a meta-analysis reported a pooled sensitivity and specificity of 48% and 67% for endoscopists using WLI, whereas those for endoscopists using NBI were 83% and 97%, respectively [
Only 2 of the included studies restricted evaluation to small lesions [
Some studies have explored the application of AI to other aspects of gastroendoscopy. For example, Wu et al [
The considerable advancement of AI in precise image recognition challenges the role of physicians in disease diagnosis. AI systems offer certain advantages over physician diagnosis, the foremost of which are faster image processing and the capacity for continuous work. In all included studies that specified image processing time, AI systems were faster than endoscopists. AI assistance may reduce the risk of human error that arises from performing numerous endoscopic examinations. Moreover, training an AI system is considerably faster and less complicated than training an endoscopist: well-trained AI systems learn by analyzing numerous images, whereas endoscopists rely on individual skill and clinical experience, and their training is expensive and time-consuming because of the steep learning curve for the various image-enhancing techniques. In addition, given its high sensitivity and specificity, AI may serve as a double-check system during or after endoscopy. AI allows for a second opinion, which is particularly valuable now that gastroendoscopy has been popularized and nationwide screening for gastric cancer has been implemented.
Our study had several limitations. First, all the included studies were retrospective case-control studies performed in Asia; some compared early gastric cancer with normal gastric tissues, and some compared it with benign gastric lesions such as ulcers and gastritis. The possibility of selection bias cannot be ruled out. A randomized controlled trial comparing the diagnostic performance of AI and endoscopists for early and advanced gastric cancer (NCT04040374) is currently underway. Second, all the studies identified gastric lesions from still, clear endoscopic images; images with blood or mucus were excluded. In daily practice, however, gastroendoscopy is recorded in video format, and still images are captured only for suspicious lesions. Blood, food debris, mucus, and foam, which reduce the accuracy of AI, are commonly encountered during examination [
To our knowledge, this is the first meta-analysis of the performance of AI in detecting early gastric cancer using endoscopic images. The available evidence suggests that AI can support the diagnosis of early gastric cancer; however, the optimal combination of imaging techniques and algorithms remains unclear. Larger prospective cohort studies should be conducted to further validate the diagnostic accuracy of AI. Moreover, competing AI models for the detection of early gastric cancer are worthy of future investigation.
Supplementary File 1. Search strategy (primary search strategy).
Supplementary File 2. Study quality assessment according to the QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies [revised]).
Supplementary File 3. Forest plot of empirical Bayes predicted and observed findings.
Supplementary File 4. Scatter matrix.
Supplementary File 5. Egger’s test.
Supplementary File 6. Subgroup analysis for studies that used deep learning.
Supplementary File 7. Subgroup analysis for studies without deep learning.
Supplementary File 8. Subgroup analysis for studies that used white light imaging.
Supplementary File 9. Subgroup analysis for studies that used narrow band imaging techniques.
Supplementary Table 1. Characteristics of the studies that compared diagnostic performance of artificial intelligence to endoscopists and its sensitivity analysis.
Supplementary Table 2. Sensitivity analysis of the studies that included gastric lesions other than small gastric cancer lesions.
Supplementary Table 3. Sensitivity analysis of the studies that did not detect early gastric cancer lesions based on pathological grading.
Supplementary Table 4. Sensitivity analysis of the studies that separated training and testing data set during artificial intelligence training.
Supplementary Table 5. Sensitivity analysis of the studies with low risk on index test.
AI: artificial intelligence
AUC: area under the curve
HSROC: hierarchical summary receiver operating characteristic
NBI: narrow-band imaging
PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses
QUADAS-2: Quality Assessment of Diagnostic Accuracy Studies (revised)
SROC: summary receiver operating characteristic
WLI: white light imaging
None declared.