This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Artificial intelligence approaches can integrate complex features and can be used to predict a patient’s risk of developing lung cancer, thereby decreasing the need for unnecessary and expensive diagnostic interventions.
The aim of this study was to use electronic medical records to prescreen patients who are at risk of developing lung cancer.
We randomly selected 2 million participants from the Taiwan National Health Insurance Research Database who received care between 1999 and 2013. We built a predictive lung cancer screening model with neural networks that were trained and validated using pre-2012 data, and we tested the model prospectively on post-2012 data. An age- and gender-matched subgroup that was 10 times larger than the original lung cancer group was used to assess the predictive power of the electronic medical record. Discrimination (area under the receiver operating characteristic curve [AUC]) and calibration analyses were performed.
The analysis included 11,617 patients with lung cancer and 1,423,154 control patients. The model achieved AUCs of 0.90 for the overall population and 0.87 in patients ≥55 years of age. The AUC in the matched subgroup was 0.82. The positive predictive value was highest (14.3%) among people aged ≥55 years with a pre-existing history of lung disease.
Our model achieved excellent performance in predicting lung cancer within 1 year and has potential to be deployed for digital patient screening. Convolutional neural networks facilitate the effective use of electronic medical records to identify individuals at high risk for developing lung cancer.
Lung cancer is a leading cause of cancer death worldwide, and to reduce its mortality, early detection is crucial. The National Lung Cancer Screening Trial (NLST) revealed that screening with low-dose computed tomography (LDCT) can reduce the mortality associated with lung cancer by 20% [
In the era of digital medicine, artificial intelligence, specifically convolutional neural networks (CNNs), has performed well on image-related prediction tasks. In lung cancer research, CNNs have been applied to LDCT and chest radiographic images to facilitate detection and classification of pulmonary nodules; these models demonstrate performance that is comparable to that achieved by human experts [
In predicting lung cancer risk, the EMR should be suited to the task of identifying high-risk individuals [
Deidentified EMRs of 2 million patients who received care between January 01, 1999, and December 31, 2013, were initially sampled from the Taiwan National Health Insurance Research Database (NHIRD). These EMRs included the demographic information, diagnoses, and procedure codes from the
Previous validation studies that focused on lung cancer using the NHIRD have shown a positive predictive value (PPV) of 95% [
Inclusion and exclusion criteria for the study.
The index date for patients with lung cancer was defined as the date of first diagnosis. For the control patients, the index dates were randomly selected from their medical history.
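The index date rule above can be illustrated with a minimal sketch. The function names, the example dates, and the choice of uniform sampling over calendar days are assumptions made for illustration; the text states only that control index dates were selected at random from the medical history.

```python
import random
from datetime import date, timedelta


def case_index_date(lung_cancer_diagnosis_dates):
    """Index date for a case: the date of the first lung cancer diagnosis."""
    return min(lung_cancer_diagnosis_dates)


def control_index_date(first_record, last_record, rng):
    """Index date for a control: a random day within the recorded history.

    Uniform sampling over calendar days is an assumption of this sketch.
    """
    span_days = (last_record - first_record).days
    return first_record + timedelta(days=rng.randint(0, span_days))


rng = random.Random(42)  # fixed seed so the illustration is reproducible
case_date = case_index_date([date(2010, 5, 3), date(2009, 11, 20)])
control_date = control_index_date(date(1999, 1, 1), date(2013, 12, 31), rng)
```

Here `case_date` is the earlier of the two hypothetical diagnosis dates, and `control_date` falls somewhere inside the sampled observation window.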
The inputs included age, gender, and an image representing the patient’s 3-year history of diagnosis and medication. The image was input into Xception, a 126-layer neural network, in which feature extraction was performed. The final layer of the Xception network was connected to an average pooling layer and then connected to a fully connected layer with the patient’s age and gender.
Visualization of the hidden layer of the model using t-distributed stochastic neighbor embedding. Avg: average; fc: fully connected layer.
We performed 3 subgroup analyses to investigate the performance of the model in different populations. According to the age criteria used in previous trials focused on lung cancer screening [
All patient data were split into training, validation, and testing sets based on their respective index dates. Data with index dates prior to December 31, 2012, were used for training and internal validation, and data with index dates after that date were used for prospective testing. The patients’ age, gender, and image-like arrays described above were used as inputs to generate the model (
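A temporal split of this kind can be sketched as follows. The validation fraction, the random shuffling of the pre-cutoff data, and the treatment of the cutoff day itself are assumptions of this sketch; the text specifies only that data up to the end of 2012 were used for training and internal validation and later data for prospective testing.

```python
import random
from datetime import date

CUTOFF = date(2012, 12, 31)


def temporal_split(records, val_fraction=0.1, seed=0):
    """Split (patient_id, index_date) pairs by index date.

    Records on or before CUTOFF go to training/validation; later records
    form the prospective testing set. val_fraction is a hypothetical value.
    """
    development = [r for r in records if r[1] <= CUTOFF]
    testing = [r for r in records if r[1] > CUTOFF]
    rng = random.Random(seed)
    rng.shuffle(development)
    n_val = int(len(development) * val_fraction)
    return development[n_val:], development[:n_val], testing


records = [
    ("p1", date(2010, 6, 1)),
    ("p2", date(2011, 3, 15)),
    ("p3", date(2012, 12, 31)),
    ("p4", date(2013, 2, 1)),
    ("p5", date(2013, 9, 30)),
]
train, val, test = temporal_split(records, val_fraction=0.34)
```

Splitting on the index date rather than at random keeps every testing record later in time than the records the model was fitted on, which is what makes the evaluation prospective.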
Lung cancer risk prediction was treated as a binary classification task using supervised learning. The model was trained to determine whether a given patient was likely to develop lung cancer within 1 year. The Xception architecture [
To understand the model prediction, occlusion sensitivity analysis was performed by iteratively masking information from a single diagnosis or medication followed by evaluating any changes in the model prediction [
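The masking procedure can be sketched in a few lines. The toy scoring function, its weights, and the feature names below are hypothetical stand-ins for the trained network and are not taken from the study; only the loop that removes one item at a time and records the change in the prediction reflects the described analysis.

```python
import math


def occlusion_sensitivity(predict, history):
    """Return the drop in predicted risk when each item is masked in turn."""
    baseline = predict(history)
    changes = {}
    for code in history:
        # Hide a single diagnosis or medication and re-run the prediction.
        masked = {k: v for k, v in history.items() if k != code}
        changes[code] = baseline - predict(masked)
    return changes


# Hypothetical weights for a toy logistic model (not the study's CNN).
TOY_WEIGHTS = {"pulmonary_fibrosis": 1.2, "insulin": 0.4, "open_wound": 0.0}


def toy_predict(history):
    score = -3.0 + sum(TOY_WEIGHTS.get(code, 0.0) * count
                       for code, count in history.items())
    return 1.0 / (1.0 + math.exp(-score))


patient = {"pulmonary_fibrosis": 1, "insulin": 2, "open_wound": 1}
changes = occlusion_sensitivity(toy_predict, patient)
ranked = sorted(changes, key=changes.get, reverse=True)
```

Items whose removal lowers the predicted risk the most rank first; an item with no influence on the toy model produces a change of zero.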
A total of 11,617 lung cancer patients and 1,423,154 control patients were identified in our data set. The mean age of the lung cancer group was 66.62 years (SD 14.01); the overall data set included 856,558 (59.7%) men and 578,213 (40.3%) women. The baseline demographics of this patient cohort and the assigned subgroups are summarized in
Demographics of the patients with lung cancer and control patients (N=1,434,771).
Subgroup | Group | Patients, n | Age (years), mean (SD) | Male gender, n (%) | Mean diagnosis record count (SD), n | Mean medication record count (SD), n |
Whole population | Lung cancer | 11,617 | 66.62 (14.01) | 6931 (59.7) | 121.62 (113.19) | 202.68 (208.97) |
Whole population | Control | 1,423,154 | 44.95 (16.32) | 683,375 (48.0) | 66.09 (76.60) | 105.99 (135.54) |
Matching age and gender | Lung cancer | 11,617 | 66.62 (14.01) | 6931 (59.7) | 121.62 (113.19) | 202.68 (208.97) |
Matching age and gender | Control | 116,169 | 66.62 (14.01) | 69,310 (59.7) | 117.99 (113.67) | 190.22 (196.78) |
Age ≥55 years | Lung cancer | 9261 | 71.99 (9.46) | 5673 (61.3) | 135.12 (116.31) | 227.81 (218.12) |
Age ≥55 years | Control | 385,052 | 66.57 (9.04) | 56,730 (48.6) | 114.23 (106.76) | 184.50 (189.50) |
Age <55 years | Lung cancer | 2356 | 45.50 (7.55) | 1258 (53.4) | 68.58 (80.42) | 103.90 (126.71) |
Age <55 years | Control | 1,038,102 | 36.93 (9.85) | 496,256 (47.8) | 48.23 (51.36) | 76.87 (93.45) |
History of lung disease | Lung cancer | 3565 | 70.79 (12.73) | 2244 (63.0) | 175.12 (134.36) | 297.56 (245.55) |
History of lung disease | Control | 182,098 | 53.01 (18.09) | 85,070 (46.7) | 125.17 (114.53) | 204.85 (204.66) |
No history of lung disease | Lung cancer | 8052 | 64.77 (14.16) | 4687 (58.2) | 97.94 (93.08) | 160.67 (174.80) |
No history of lung disease | Control | 1,270,651 | 43.77 (15.70) | 598,305 (48.2) | 57.42 (64.94) | 91.48 (115.23) |
aLung diseases included asbestosis, bronchiectasis, chronic bronchitis, chronic obstructive pulmonary disease, emphysema, fibrosis, pneumonia, sarcoidosis, silicosis, and tuberculosis. More information is provided in Table S11 in
For all patients, the model revealed an AUC of 0.821 when the input image-like array included sequential diagnostic information only. By contrast, the AUC was 0.894 when the input features included sequential medication information only; when the sequential diagnostic and medication information was simplified to binary variables, the model performance decreased (AUC=0.827). When both sequential diagnostic and medication information were integrated, the model reached an AUC of 0.902 on prospective testing, with a sensitivity of 0.804 and specificity of 0.837 (Table S12 in
The model performance at different age cutoffs was then investigated. Screening using an age cutoff of 55 years revealed a superior AUC of 0.871 compared to those obtained when cutoffs of 50 or 60 years were used (0.866 and 0.863, respectively) (Table S13,
Subgroup analyses were performed for an age- and gender-matched subgroup, for patients above and below 55 years of age, and for patients with or without a history of lung disease. For the matched analysis, we identified an age- and gender-matched control subgroup that was 10 times larger than the original lung cancer subgroup. In this subgroup, the model achieved an AUC of 0.818 (SD 0.005) with a sensitivity of 0.647 (SD 0.017) and a specificity of 0.873 (SD 0.023), as shown in
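One way to draw a 10:1 age- and gender-matched control group is sketched below. Exact matching on age in years and on gender, sampling without replacement, and all names in the code are assumptions of this sketch; the text does not describe the matching procedure in detail.

```python
import random
from collections import defaultdict


def match_controls(cases, controls, ratio=10, seed=0):
    """Select up to `ratio` controls per case with the same age and gender."""
    rng = random.Random(seed)
    pool = defaultdict(list)
    for person in controls:
        pool[(person["age"], person["gender"])].append(person)
    for bucket in pool.values():
        rng.shuffle(bucket)  # random order within each age/gender stratum
    matched = []
    for case in cases:
        bucket = pool[(case["age"], case["gender"])]
        matched.extend(bucket[:ratio])
        del bucket[:ratio]  # sample without replacement
    return matched


cases = [{"id": "c1", "age": 66, "gender": "M"},
         {"id": "c2", "age": 70, "gender": "F"}]
controls = (
    [{"id": f"m{i}", "age": 66, "gender": "M"} for i in range(5)]
    + [{"id": f"f{i}", "age": 70, "gender": "F"} for i in range(5)]
    + [{"id": "x1", "age": 30, "gender": "M"}]
)
matched = match_controls(cases, controls, ratio=2)
```

Because the matched controls share the age and gender distribution of the cases, any remaining discrimination has to come from the diagnosis and medication history rather than from demographics.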
Discrimination performance (testing set) of the model in the subgroups.
Subgroup | Lung cancer group, n | Control, n | Testing AUCa (SD) | Testing sensitivity (SD) | Testing specificity (SD) | PPVb (SD), % | NPVc (SD), % |
Whole population | 1304 | 138,640 | 0.898 (0.002) | 0.805 (0.015) | 0.825 (0.018) | 4.2 (0.3) | 99.8 (0) |
Matching age and gender | 1304 | 13,040 | 0.818 (0.005) | 0.647 (0.017) | 0.873 (0.023) | 34.6 (0.4) | 96.0 (0.1) |
Age ≥55 years | 1046 | 43,328 | 0.869 (0.002) | 0.784 (0.011) | 0.785 (0.016) | 8.1 (0.5) | 99.3 (0) |
Age <55 years | 258 | 95,312 | 0.815 (0.007) | 0.620 (0.080) | 0.838 (0.054) | 1.1 (0.2) | 99.9 (0) |
History of lung disease | 361 | 16,596 | 0.829 (0.021) | 0.816 (0.021) | 9.0 (0.8) | 0.995 (0.1) | |
No history of lung disease | 943 | 122,044 | 0.887 (0.002) | 0.781 (0.025) | 0.827 (0.026) | 3.4 (0.5) | 99.8 (0.0) |
Age ≥55 years with history | 318 | 8184 | 0.875 (0.005) | 0.755 (0.047) | 0.819 (0.044) | | 98.9 (0.2) |
Age ≥55 years with no history | 728 | 35,144 | 0.865 (0.003) | 0.775 (0.019) | 0.786 (0.018) | 7.0 (0.4) | 99.4 (0.0) |
Age <55 years with history | 43 | 8412 | 0.909 (0.006) | 0.777 (0.054) | 0.891 (0.036) | 3.8 (1.0) | 99.9 (0.0) |
Age <55 years with no history | 215 | 86,900 | 0.797 (0.008) | 0.533 (0.048) | 0.865 (0.026) | | 99.9 (0.0) |
aAUC: area under the curve.
bPPV: positive predictive value.
cNPV: negative predictive value.
dItalic text indicates the best performance for the parameter.
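The large differences in PPV between subgroups with similar sensitivity and specificity follow from differences in prevalence. The sketch below applies Bayes' rule to the reported mean sensitivity, mean specificity, and group sizes of two subgroups; the function names are illustrative, and because the reported values are averages over repeated runs, the results only approximate the tabulated PPVs.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


def npv(sensitivity, specificity, prevalence):
    """Negative predictive value from Bayes' rule."""
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)


# Whole testing population: 1304 cancers among 1304 + 138,640 patients.
whole_prev = 1304 / (1304 + 138640)
whole_ppv = ppv(0.805, 0.825, whole_prev)      # roughly 0.04

# Matched subgroup: 1304 cancers among 1304 + 13,040 patients.
matched_prev = 1304 / (1304 + 13040)
matched_ppv = ppv(0.647, 0.873, matched_prev)  # roughly 0.34
```

With prevalence below 1% in the whole population, even a specificity above 0.8 leaves most positive predictions false, whereas the 10:1 matched design raises prevalence to about 9% and the PPV with it.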
The model’s hidden layer outputs of 1000 patients with cancer (red dots) and 9000 control patients (green dots) were visualized using t-SNE (
Occlusion sensitivity analysis further revealed specific diagnosis and medication factors that were associated with an increased risk of developing lung cancer. Interestingly, “other noninfectious gastroenteritis and colitis” and “other agents for local oral treatment” were associated with the highest risks of developing lung cancer with respect to patient diagnosis and medication, respectively. The top 20 factors identified in the analysis are summarized in
Prediction analysis of the prospective testing data set (N=139,944).
Subgroup | Group | Patients, n | Age (years), mean (SD) | Male gender, n (%) | Mean diagnosis record count (SD), n | Mean medication record count (SD), n |
Whole population | True positive | 1052 | 69.91 (11.58) | 617 (58.65) | 141.75 (113.31) | 210.7 (186.32) |
Whole population | False positive | 22,624 | 69.19 (12.48) | 12,641 (55.87) | 114.96 (111.04) | 159.14 (171.74) |
Whole population | True negative | 116,016 | 41.94 (13.14) | 53,671 (46.26) | 63.08 (67.53) | 81.46 (101.84) |
Whole population | False negative | 252 | 50.96 (10.79) | 134 (53.17) | 81.37 (95.67) | 104.03 (139.98) |
Age ≥55 years | True positive | 851 | 72.86 (9.25) | 510 (59.93) | 146.32 (110.84) | 217.88 (181.04) |
Age ≥55 years | False positive | 10,989 | 74.88 (9.66) | 6640 (60.42) | 124.11 (119.27) | 170.8 (179.15) |
Age ≥55 years | True negative | 32,339 | 63.28 (6.58) | 13,871 (42.89) | 110.24 (97.26) | 152.69 (154.96) |
Age ≥55 years | False negative | 195 | 64.62 (6.63) | 106 (54.36) | 125.98 (132.09) | 185.08 (216.55) |
Age <55 years | True positive | 209 | 47.87 (6.07) | 113 (54.07) | 83.3 (87.98) | 106.48 (128.64) |
Age <55 years | False positive | 32,765 | 46.78 (6.58) | 18,422 (56.22) | 59.4 (63.22) | 74.38 (92.27) |
Age <55 years | True negative | 62,547 | 32.45 (7.43) | 27,379 (43.77) | 48.67 (48.88) | 60.74 (71.36) |
Age <55 years | False negative | 49 | 36.22 (5.82) | 22 (44.90) | 63.98 (63.75) | 83.88 (115.66) |
History of lung disease | True positive | 300 | 72.86 (11.18) | 182 (60.67) | 184.91 (118.07) | 278.71 (194.81) |
History of lung disease | False positive | 2791 | 75.41 (11.97) | 1750 (62.70) | 180.66 (140.56) | 253.68 (214.05) |
History of lung disease | True negative | 13,805 | 49.34 (15.6) | 5876 (42.56) | 119.33 (102.8) | 162.24 (162.85) |
History of lung disease | False negative | 61 | 61.41 (12.11) | 34 (55.74) | 171.72 (155.81) | 246.79 (226.86) |
No history of lung disease | True positive | 757 | 68.45 (11.4) | 442 (58.39) | 120.97 (104.28) | 177.03 (172.5) |
No history of lung disease | False positive | 23,328 | 66.54 (12.25) | 12,881 (55.22) | 95.23 (94.24) | 130.24 (146.34) |
No history of lung disease | True negative | 98,716 | 40.39 (12.27) | 45,805 (46.40) | 56.19 (59.51) | 71.56 (88.63) |
No history of lung disease | False negative | 186 | 48.19 (10.32) | 93 (50.00) | 65.08 (66.98) | 81.69 (101.83) |
Age ≥55 years with history | True positive | 255 | 74.89 (9.03) | 160 (62.75) | 188.33 (119.58) | 284.4 (193.99) |
Age ≥55 years with history | False positive | 1778 | 78.53 (9.16) | 1205 (67.77) | 188.16 (142.99) | 263 (215.97) |
Age ≥55 years with history | True negative | 6406 | 66.38 (7.88) | 2669 (41.66) | 169.82 (121.41) | 239.26 (195.71) |
Age ≥55 years with history | False negative | 63 | 70.44 (7.81) | 35 (55.56) | 203.87 (148.87) | 308.17 (221.29) |
Age ≥55 years with no history | True positive | 587 | 71.76 (9.24) | 347 (59.11) | 126.04 (102.89) | 185.01 (166.72) |
Age ≥55 years with no history | False positive | 8958 | 73.86 (9.69) | 5281 (58.95) | 104.85 (103.3) | 142.56 (154.72) |
Age ≥55 years with no history | True negative | 26,186 | 62.73 (6.27) | 11,356 (43.37) | 98.04 (87.47) | 135.09 (139.76) |
Age ≥55 years with no history | False negative | 141 | 63.47 (6.25) | 74 (52.48) | 100.89 (103.77) | 148.73 (195.18) |
Age <55 years with history | True positive | 37 | 48.89 (6.08) | 18 (48.65) | 120.46 (100.27) | 157.62 (173.25) |
Age <55 years with history | False positive | 1080 | 46.56 (7.56) | 653 (60.46) | 85.56 (72.24) | 109.78 (108.74) |
Age <55 years with history | True negative | 7332 | 37.7 (9.58) | 3099 (42.27) | 86.84 (75.16) | 113.06 (116.51) |
Age <55 years with history | False negative | 6 | 43.33 (9.24) | 3 (50.00) | 103.67 (98.36) | 149.83 (152.85) |
Age <55 years with no history | True positive | 172 | 47.55 (6.07) | 95 (55.23) | 74.94 (83.33) | 94.44 (114.72) |
Age <55 years with no history | False positive | 30,982 | 46.56 (6.56) | 17,478 (56.41) | 55.1 (58.63) | 68.47 (84.96) |
Age <55 years with no history | True negative | 55,918 | 32.06 (7.25) | 24,571 (43.94) | 45.68 (45.68) | 56.64 (65.81) |
Age <55 years with no history | False negative | 43 | 35.65 (5.54) | 19 (44.19) | 59.88 (56.98) | 78.84 (108.63) |
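As a worked example, the standard screening metrics can be recomputed from the first four rows of the prediction analysis table, whose counts sum to the 1304 patients with lung cancer and 138,640 control patients of the whole testing set. The function name is illustrative; the formulas are the usual definitions.

```python
def screening_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV, and NPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Counts for the whole testing population (N=139,944).
whole = screening_metrics(tp=1052, fp=22624, tn=116016, fn=252)
# sensitivity ~0.807, specificity ~0.837, ppv ~0.044, npv ~0.998
```

These single-run values are close to the sensitivity of 0.804 and specificity of 0.837 reported for the integrated model.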
Visualization of the hidden layer of the model using t-distributed stochastic neighbor embedding.
Top 20 factors related to lung cancer learned by the model.
Rank | Factor | Lung cancer risk increase (%), mean (SD) |
1 | Other noninfectious gastroenteritis and colitis | 1.85 (1.01) |
2 | Other congenital anomalies of the circulatory system | 1.84 (2.21) |
3 | Other agents for local oral treatment | 1.76 (1.02) |
4 | Antidotes | 1.69 (1.55) |
5 | Postinflammatory pulmonary fibrosis | 1.69 (1.43) |
6 | Metronidazole | 1.69 (1.29) |
7 | Acariasis | 1.65 (1.73) |
8 | Antiviral drugs | 1.57 (1.03) |
9 | Orchitis and epididymitis | 1.57 (1.48) |
10 | Pneumococcal pneumonia | 1.52 (0.93) |
11 | Buflomedil | 1.44 (1.76) |
12 | Danazol | 1.42 (1.41) |
13 | Calcineurin inhibitors | 1.42 (1.29) |
14 | Other disorders of the urethra and urinary tract | 1.37 (1.34) |
15 | Angina pectoris | 1.35 (1.44) |
16 | Other nonorganic psychoses | 1.35 (1.99) |
17 | Respiratory conditions due to other and unspecified external agents | 1.33 (1.33) |
18 | Open wound of back | 1.33 (2.46) |
19 | Hydrazinophthalazine derivatives | 1.31 (1.57) |
20 | Insulin | 1.30 (1.51) |
In this study, we explored the possibility of predicting lung cancer using a CNN with diagnosis and medication history extracted from EMRs as a data source. Unlike other proposed lung cancer risk models, our model does not rely on self-reported parameters such as smoking/cessation history, family history, socioeconomic status, or BMI. This model could be readily deployed as a means to evaluate centralized health care databases and perform efficient population-based screening. Such an approach has potential to improve the accuracy of current screening methods, as it can identify those most likely to benefit from interventions [
Lung cancer prediction models are under investigation with the goal of identifying high-risk populations that might benefit from LDCT screening. A variety of parameters have been used for prediction, including epidemiologic factors (eg, socioeconomic status, BMI, and smoking history), clinical history (eg, family history and individual history of lung disease), and results of clinical examinations (eg, blood tests, genetic analysis, and imaging results). The PLCOm2012 model is the most widely validated, with AUCs of 0.78 to 0.82 [
We recognize that direct comparisons between models may not be fully appropriate, as the target populations and predicted outcomes can vary. Previous reports suggested that the performance of models is inflated when nonsmokers and younger subjects (<55 years of age) are included in the study groups [
In the original NLST trial, the PPV for the LDCT was determined to be 3.4% [
The inclusion of an age- and gender-matched subgroup was necessary to explore the roles of clinical diagnosis and medication history in the predictions generated by our model; evaluation of this subgroup prevented the confounding effects of age and its correlations to clinical history (eg, older people are typically prescribed more chronic disease-related medications). With this consideration, our model achieved an AUC of 0.818. These findings can be compared to the model proposed by Spitz et al [
Our model demonstrated the worst performance in young patients without pre-existing lung diseases. This finding suggests that identifying high-risk patients among young and asymptomatic patients is still the most challenging task. Further studies are required to assess the performance of the model in patients with different staging. One of the major concerns with respect to the use of lung cancer prediction models is that they tend to select individuals who are older and who have multiple comorbidities [
Although deep learning is often considered a “black box,” and it is often challenging to explain the reasoning behind the outcomes, our study used t-SNE and occlusion sensitivity analysis to identify the most critical of the contributing parameters. Our occlusion sensitivity analysis revealed that many of the important factors were those associated with a history of preexisting lung conditions (eg, postinflammatory pulmonary fibrosis and pneumococcal pneumonia) and medications used to treat smoking-related diseases (eg, buflomedil for peripheral arterial disease and angina pectoris, and insulin for insulin resistance of diabetes mellitus) with increased cancer risk (eg, congenital anomalies of the circulatory system [
Although our model achieved excellent discriminative performance, its calibration was poor: the direct numeric output overestimated the actual risk. This is a known phenomenon associated with modern neural networks [
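Calibration of this kind is commonly inspected by grouping predictions into probability bins and comparing the mean predicted risk with the observed event rate in each bin. The sketch below shows one such reliability table; the equal-width binning, the number of bins, and the toy data are assumptions for illustration and do not reproduce the study's calibration analysis.

```python
def reliability_table(probabilities, outcomes, n_bins=10):
    """Return (mean predicted risk, observed rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probabilities, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # equal-width bins on [0, 1]
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            rows.append((mean_pred, observed, len(members)))
    return rows


# Toy data: predictions of 0.9 where only 3 of 10 patients have the outcome.
probs = [0.9] * 10 + [0.1] * 10
labels = [1, 1, 1] + [0] * 7 + [1] + [0] * 9
rows = reliability_table(probs, labels)
```

A bin whose mean predicted risk exceeds its observed rate, as in the high-risk bin of the toy data, indicates overestimation of risk even when the ranking of patients is good.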
Our model used nonimaging medical information from EMRs; however, we still used a CNN as the model backbone. The study design and aims differ from those of other lung cancer studies that used CNNs to analyze computed tomography (CT) scans and determine whether a pulmonary nodule is malignant. Those models were used to automatically identify suspicious nodules that were already present on CT scans, whereas our model attempted to identify patients at high risk of developing lung cancer in the future.
There are several limitations to this study. First, the data collection was limited to the NHIRD of Taiwan, and the patient records do not include tissue histology or lung cancer staging data. Patients with small cell lung cancer and mutation-rich non–small cell lung cancer (eg, epidermal growth factor receptor, anaplastic lymphoma kinase, ROS-1) could not be separated. These specific types may have different disease courses and risk factors; they are therefore usually not included in traditional screening, and the benefit of screening for them is undetermined. Our subgroup analysis restricted to patients with pre-existing lung diseases did not mitigate this issue entirely. Similarly, the NHIRD does not include information on patients’ lifestyles or any genetic or laboratory data. A subgroup analysis of patients with lung cancer based on tissue histology and staging might help to develop a prediction model tailored to different risk groups. Second, the data set did not contain any information on smoking status, which is clearly an important risk factor for lung cancer development. This limitation restricted the external validation and the comparisons that could be made between our model and those described in earlier published studies. We believe that self-reported information, such as family history, smoking/cessation history, and duration of symptoms, is valuable for lung cancer prediction and could further improve prediction accuracy. In our study, a history of lung diseases (eg, COPD and emphysema) was used as a proxy for a smoking history; our model showed excellent discriminative power in this subgroup. Finally, the NHIRD includes primarily Taiwanese people; as such, the target population was fairly homogeneous, with limited ethnic diversity. The identified risk factors may not apply to populations of other ethnicities.
Nonetheless, the methodology used here could be easily applied to other medical databases with more diverse patient populations.
Our CNN model exhibited robust performance with respect to the 1-year prospective prediction of the risk of developing lung cancer. As our model included sequential data on clinical diagnoses and medication history, it was capable of capturing features associated with evolving clinical conditions and as such was able to identify patients at higher risk of developing lung cancer. With appropriate ethical regulation, this model may be deployed as a means to analyze medical databases, thus paving the way for efficient population-based screening and digital precision medicine. A future randomized controlled trial will be required to explore the clinical benefit of this model in diverse populations.
Supplementary tables and figures.
ATC: Anatomical Therapeutic Chemical
AUC: area under the receiver operating characteristic curve
CNN: convolutional neural network
COPD: chronic obstructive pulmonary disease
CT: computed tomography
EMR: electronic medical record
ICD-9-CM: International Classification of Diseases, Ninth Revision, Clinical Modification
LDCT: low-dose computed tomography
MOE: Ministry of Education
NHIRD: National Health Insurance Research Database
NLST: National Lung Cancer Screening Trial
PPV: positive predictive value
t-SNE: t-distributed stochastic neighbor embedding
WHO: World Health Organization
XGBoost: extreme gradient boosting
This research was funded in part by Ministry of Education (MOE) grants MOE 109-6604-001-400 and DP2-110-21121-01-A-01.
MCHY contributed to the data analysis, model construction, interpretation of results, drafting of the manuscript, and literature review. YHW and HCY contributed to the data curation and data preprocessing. KJB contributed to the investigation and the interpretation of the results. HHW contributed to the interpretation of results, conceptualization, supervision, and manuscript editing. YCL contributed to the conceptualization, supervision, manuscript editing, and interpretation of the results. HHW and YCL contributed equally to this article. The corresponding author, YCL, affirms that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned (and, if relevant, registered) have been explained.
None declared.