Artificial Intelligence Techniques That May Be Applied to Primary Care Data to Facilitate Earlier Diagnosis of Cancer: Systematic Review

Background: More than 17 million people worldwide, including 360,000 people in the United Kingdom, were diagnosed with cancer in 2018. Cancer prognosis and disease burden are highly dependent on the disease stage at diagnosis. Most people diagnosed with cancer first present in primary care settings, where improved assessment of the (often vague) presenting symptoms of cancer could lead to earlier detection and improved outcomes for patients. There is accumulating evidence that artificial intelligence (AI) can assist clinicians in making better clinical decisions in some areas of health care. Objective: This study aimed to systematically review AI techniques that may facilitate earlier diagnosis of cancer and could be applied to primary care electronic health record (EHR) data. The quality of the evidence, the phase of development the AI techniques have reached, the gaps that exist in the evidence, and the potential for use in primary care were evaluated. Methods: We searched MEDLINE, Embase, SCOPUS, and Web of Science databases from January 01, 2000, to June 11, 2019, using keywords related to AI, cancer, and early detection. Conclusions: Most of this research is at an early stage of development; the identified techniques require further evaluation in the populations they are intended for, along with evidence on implementation barriers and cost-effectiveness, before widespread adoption into routine primary care clinical practice can be recommended.


Background
Cancer control is a global health priority, with 17 million new cases diagnosed worldwide in 2018. In high-income countries such as the United Kingdom, approximately half the population over the age of 50 years will be diagnosed with cancer in their lifetime [1]. Although the National Health Service (NHS) currently spends approximately £1 billion (US $1.37 billion) on cancer diagnostics per year [2], the United Kingdom lags behind comparable European nations in its cancer survival rates [3].
In gatekeeper health care systems such as the United Kingdom, most people diagnosed with cancer first present in primary care [4], where general practitioners evaluate (often vague) presenting symptoms and decide on an appropriate management strategy, including investigations, specialist referral, or reassurance. More accurate assessment of these symptoms, especially for patients with multiple consultations, could lead to earlier diagnosis of cancer and improved outcomes for patients, including improved survival rates [5,6].
There is accumulating evidence that artificial intelligence (AI) can assist clinicians in making better clinical decisions, or even replace human judgment, in certain areas of health care. This is due to the increasing availability of health care data and the rapid development of big data analytic methods. There has been increasing interest in the application of AI in medical diagnosis, including machine learning and automated analysis approaches. Recent studies have applied AI to patient symptoms to improve diagnosis [7,8], to retinal images for the diagnosis of diabetic retinopathy [9], to mammography images for breast cancer diagnosis [10,11], to computed tomography (CT) scans for the diagnosis of intracranial hemorrhages [12], and to images of blood films for the diagnosis of acute lymphoblastic leukemia [13].
Few AI techniques are currently implemented in routine clinical care. This may be due to uncertainty over the suitability of current regulations to assess the safety and efficacy of AI systems [14-16], a lack of evidence about the cost-effectiveness and acceptability of AI systems [14], challenges to implementation into existing electronic health records (EHRs) and routine clinical care, and uncertainty over the ethics of using AI systems. A recent review of AI and primary care reported that research on AI for primary care is at an early stage of maturity [17], although research on AI-driven tools such as symptom checkers for patient and clinical users is more mature [18-21].
The CanTest framework [22] (Figure 1) establishes the developmental phases required to ensure that new diagnostic tests or technologies are fit for purpose when introduced into clinical practice. It provides a roadmap for developers and policy makers to bridge the gap from the development of a diagnostic test or technology to its successful implementation. We used this framework to guide the assessment of the studies identified in this review.

Objectives
Few studies of AI-based techniques for the early detection of cancer have been undertaken in primary care settings [17]. Therefore, the aim of this systematic review is to identify AI techniques that facilitate the early detection of cancer and could be applied to primary care EHR data. We also aim to summarize the diagnostic accuracy measures used to evaluate existing studies and evaluate the quality of the evidence, the phase of development the AI technologies have reached, the gaps that exist in the evidence, and the potential for use in primary care. As many commercial technological developments are not documented in academic publications, we also performed a parallel scoping review of commercially available AI-based technologies for the early detection of cancer that may be suitable for implementation in primary care settings.

Search Strategy and Selection Criteria
This study was conducted in accordance with PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-analysis) guidelines [23], and the protocol was registered with PROSPERO (an international prospective register of systematic reviews) before conducting the review (CRD42020176674) [24]. All aspects of the protocol were reviewed by the senior research team.
We included all primary research articles published in peer-reviewed journals, without language restrictions, from January 01, 2000, to June 11, 2019. Studies were included if they provided evidence around the accuracy, utility, acceptability, or cost-effectiveness of applying AI techniques to facilitate the early detection of cancer and could be applied to primary care EHRs (ie, to the types of data found in primary care EHRs) [22]. We included AI techniques based on any type of data that were relevant to primary care settings, including coded data and free text. We included all types of study design, as we anticipated that there would be few relevant randomized controlled trials. We kept our search terms broad to not miss relevant studies and carefully considered evidence from any health care system to assess whether the evidence could be applied to primary care settings.
As our aim is to identify AI techniques that would be applicable in primary care clinical settings, we excluded studies that incorporated data not typically available in primary care EHRs in the early diagnostic stages (eg, histopathology images, magnetic resonance imaging, or CT scan images). We also excluded studies that only described the development of an AI technique without any testing or evaluation data, studies that did not incorporate an element of machine learning (ie, with training and testing or validation steps), studies that used AI techniques for biomarker discovery alone, and studies that were based on sample sizes of less than 50 cases or controls. Machine learning techniques and neural networks have been described since the 1960s [25,26]; however, they were initially limited by computing power and data availability. We chose to start our search in 2000, as this was when the earliest research describing the new wave of machine learning techniques emerged [27].
We searched MEDLINE, Embase, SCOPUS, and Web of Science bibliographic databases, using keywords related to AI, cancer, and early detection. We extended these systematic searches through manual searching of the reference lists of the included studies. We contacted study authors, where required. Where studies were not published in English, we identified suitably qualified native speakers to help assess these studies. We performed a parallel scoping review to look for commercially developed AI technologies that the systematic searches could not identify because they were unpublished and had not been scientifically evaluated. This included manually searching commercial research archives and networks (eg, arXiv [28], Google [29], Microsoft [30], and IBM [31]), reviewing the computer-based technologies identified in 3 recent reviews [19-21], and manually searching for further technologies mentioned in the text or references of the studies and websites included in these reviews.
Following duplicate removal, 1 author (OJ) screened titles and abstracts to identify studies that fit the inclusion criteria. Of the titles and abstracts, 17.42% (1838/10,456) were checked by 2 other authors (SS and NC); interrater reliability was excellent at 96.24% (1769/1838). Any disagreements were discussed by the core research team (OJ, SS, NC, and FW), and a consensus was reached. Three reviewers (OJ, SS, and NC) independently assessed the full-text articles for inclusion in the review. Any disagreements were resolved by a consensus-based decision.
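The interrater reliability reported here is raw percent agreement: the proportion of dual-screened records on which both screeners made the same inclusion decision. A minimal sketch of that calculation (the screening decisions below are hypothetical, not the review's actual records):

```python
# Illustrative calculation of raw interrater agreement for dual screening.
# The decision lists are hypothetical examples, not data from this review.

def percent_agreement(decisions_a, decisions_b):
    """Proportion of dual-screened records on which both screeners agreed
    (True = include, False = exclude)."""
    if len(decisions_a) != len(decisions_b):
        raise ValueError("both screeners must rate the same set of records")
    matches = sum(1 for a, b in zip(decisions_a, decisions_b) if a == b)
    return matches / len(decisions_a)

# Hypothetical decisions by two screeners on 5 records:
screener_1 = [True, True, False, False, True]
screener_2 = [True, False, False, False, True]
agreement = percent_agreement(screener_1, screener_2)  # 4 of 5 decisions match
```

Percent agreement does not correct for chance agreement (unlike Cohen's kappa), but with a low inclusion rate at title-and-abstract screening it is a common and easily interpreted summary.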

Data Analysis
Data extraction was undertaken independently by at least two reviewers (OJ, SS, and NC) into a predesigned data extraction spreadsheet. The research team met regularly to reach consensus by discussing and resolving any differences in data extraction. One author (OJ) amalgamated the data extraction spreadsheets, summarizing the data where possible.
The main summary measures collected included sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), area under the receiver operating characteristic (AUROC) curve, and any other diagnostic accuracy measures of the AI techniques. Secondary outcomes include the types of AI used, the type of data used to train and test the algorithms, and how these algorithms were evaluated. We also collected data, where identified, on cost-effectiveness and patient or clinician acceptability.
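The summary measures above can be computed from a binary confusion matrix, with the AUROC obtained from ranked risk scores. The sketch below is our own illustration of these standard definitions (the helper names and data are hypothetical, not taken from the included studies); the AUROC uses the rank-sum (Mann-Whitney) formulation, which equals the probability that a randomly chosen case receives a higher score than a randomly chosen control.

```python
# Illustrative computation of the diagnostic accuracy measures collected in
# this review. Labels: 1 = cancer (case), 0 = no cancer (control).

def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def diagnostic_measures(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

def auroc(y_true, scores):
    """AUROC via the rank-sum formulation: the probability that a randomly
    chosen case scores higher than a randomly chosen control (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Note that sensitivity, specificity, and AUROC are prevalence-independent, whereas PPV and NPV depend on the case mix of the evaluation data set, which matters when comparing studies conducted in specialist versus primary care populations.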
Risk of bias assessment was undertaken for all full-text papers by 2 independent researchers (OJ and NC) using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) critical appraisal tool [32]. OJ assessed all studies, and 50% (40/79) of them were cross-checked by NC. Any disagreements in the assessment were resolved by consensus discussion.
The studies identified were heterogeneous, employing various AI techniques and using different outcome measures for evaluation. Hence, a meta-analysis of the data was not possible, and we chose to use a narrative synthesis approach, following established guidance on its methodology [33]. We aimed to summarize the findings of the identified studies using primarily a textual approach, while also providing an overview of the quantitative outcome measures used in the studies. Once data extraction was completed, we explored the relationships that emerged within the data.
Full details of our review question, search strategy, inclusion or exclusion criteria, and data extraction methodology are described in Multimedia Appendices 1 [1-5,7-9,11-13,34-38] and 2, and the full list of excluded studies is provided in Multimedia Appendix 3 [34, …].
Neural networks were the dominant technique employed (n=10) [39-42,44-47,50,51], with many neural network subtypes mentioned. The study by Miotto et al [50] was the only study to include a processed form of the free text notes in the data used by the AI technique, although the work described by Kop et al [49] was developed in a subsequent study to include clinical free text data [115].
Most of the studies (n=12) included blood test results, all suitable for use in primary care settings. Age was also commonly included (n=12). Other variables used were sex (n=10), demographics (n=5), symptoms (n=7), comorbidities (n=8), lifestyle history (n=7), examination findings (n=6), medication or prescription history (n=3), spirometry results (n=2), urine dipstick results (n=1), fecal immunochemical test results (n=1), x-ray text reports (n=1), and referrals (n=1).

Table 3 shows the study designs and populations. Most studies used data sets originating from specialist care settings (n=7) [39,40,42-44,46,51], with only 3 studies using solely primary care patient data [41,49,52]. Kinar et al [48] included a follow-up validation study based on The Health Improvement Network (THIN) database, also using primary care data. Several studies used a mixture of primary and secondary care patient data (n=5) [34,47,48,50,53]. Almost all the studies used different data sets, with the exception of the Maccabi Health Services EHR, which was used in 2 studies [48,53]. The data set sizes ranged from 193 to 2,225,249 patients, with a mean of 241,585 (SD 555,953), a median of 3,150, and an IQR of 267,237 patients; the wide range is primarily due to the large data set used by Birks et al [52].

Of the 13 development studies, 3 provided no information on the control population used [39,46,51]. Five did not provide full information on how they partitioned their data set for the training and testing of the algorithm [39,41,43,47,49]. Five studies appeared to have independent training and testing data sets, with most split in ratios ranging from 60:40 to 70:30 [40,44-46,50].
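The train/test partitioning these development studies describe can be sketched as follows. This is a minimal illustration of a stratified 70:30 split, not code from any included study; the `stratified_split` helper and the record format are hypothetical, and stratification by outcome (so cases and controls appear in both partitions at the same rate) is one common choice the studies may or may not have made.

```python
# Hypothetical sketch of a 70:30 train/test partition stratified by outcome.
import random

def stratified_split(records, label_key, train_frac=0.7, seed=42):
    """Split records into (train, test), preserving the label ratio in both."""
    rng = random.Random(seed)  # fixed seed for a reproducible partition
    by_label = {}
    for r in records:
        by_label.setdefault(r[label_key], []).append(r)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        cut = int(round(len(group) * train_frac))
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Hypothetical EHR-derived records: 1000 patients, 10% cancer cases.
records = [{"id": i, "cancer": int(i % 10 == 0)} for i in range(1000)]
train, test = stratified_split(records, "cancer")
```

As the Conclusions note, validation on a genuinely independent data set is preferable to any split of a single data set, since a split cannot reveal how performance degrades under distributional shift between populations.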

[Table rows not fully recoverable from extraction; entries included Symptify (United States; Symptify website [134]) and Symptomate (Poland) [136].]

a AI: artificial intelligence.
b Not applicable or no data.
c Study excluded for the reason specified in the column label.
d N/C: not clear.
e These studies met the inclusion criteria of the systematic review and were therefore included.
f Edwards et al [133] suggests that this Egton Medical Information Systems (EMIS) application is powered by the eConsult system.
g Carter et al [177] suggests that this is the group who developed webGP.
h Several published studies are linked in the research section of the website; none involved use of the differential diagnosis or decision support tools. Some case studies audited the use of these tools.

Principal Findings
We identified 16 studies reporting AI techniques that could facilitate the early detection of cancer and could be applied to the types of data found in primary care EHRs. However, heterogeneity in AI modalities, data set characteristics, and outcome measures, together with variation in the conduct and quality of these studies, meant that we were unable to draw strong conclusions about the utility of these techniques in primary care settings. There was a notable paucity of evidence on performance using primary care data. Coupled with the lack of evidence on implementation barriers or cost-effectiveness, this may help explain why AI techniques have not been adopted widely into primary care clinical practice to date. The study by Kinar et al [48] and its subsequent validation in independent data sets [34,52,53], including primary care data sets, is a valuable example of a staged evaluation of an AI technique from early development, via validation data sets, to evaluation in the population for intended use [22]. The work by Kop and collaborators [49,115,184] also represents a good example of the staged development of an AI technique, with sequential peer-reviewed, published evaluations at each stage.
We also identified 21 commercial AI technologies, many of which have not been evaluated and reported in peer-reviewed, published studies. Many other technologies that were patient-facing and designed for the triage of symptoms were identified but had not been applied to EHRs. Eight of these technologies appeared to be based on newer machine learning AI techniques, while the majority appeared to be driven by knowledge-based decision tree algorithms. Only one of the identified technologies had been evaluated specifically for cancer, although it may be more efficacious for these technologies to be very general in scope and to be widely used, rather than to have a narrow focus on cancer alone. With wider adoption, these technologies have a greater potential for raising patient and clinician awareness of cancer. However, it remains important to fully understand their diagnostic accuracy and safety, including for the triage of potential cancer symptoms. AI technologies applied to EHRs are potentially useful for primary care clinicians; however, they need to be designed in a way that is appropriate for the type and origin of the data found in primary care EHRs and to have been thoroughly and transparently evaluated in the population the technology is intended for.

Strengths and Limitations
The strengths of this systematic review include the following: a broad and inclusive search strategy to avoid missing studies; guidance of an international expert panel in the development of the protocol and search strategy; independent screening, quality assessment, and data extraction processes; adherence to PRISMA guidance; and a parallel scoping review for commercial AI technologies. However, as only a few heterogeneous studies were identified, it was not possible to synthesize the data and evaluate the utility of these AI techniques. Furthermore, only one commercially available AI technology was identified via the systematic review, and many of the technologies identified in the parallel scoping review lacked sufficient published detail and evidence for their accuracy or safety. This is a rapidly evolving research area, which will require further review over time.

Conclusions
Worldwide, there is a great deal of interest in AI techniques and their potential in medicine, not least in the United Kingdom, where politicians and NHS leaders have publicly prioritized the incorporation of AI into clinical settings. Our findings support those of Kueper et al [17], namely, that although some AI techniques have good initial validation reports, they have not yet been through the steps for full application in clinical practice. Validation using independent data is preferable to splitting a single data set [185] and could be the next step in the development of many AI techniques identified in this review. Much of the research is at an early stage, with variable reporting and conduct, and requires further validation in prospective clinical settings and assessment of cost-effectiveness after clinical implementation before it can be incorporated into daily practice safely and effectively [186].
Consensus is required on how AI techniques designed for clinical use should be developed and validated to ensure their safety for patients and clinicians in their intended settings. Good internal and external validity is required in these experiments to avoid bias, most notably spectrum bias [187] and distributional shift [16], and to ensure that the appropriate data are used to develop the AI technique in keeping with its anticipated clinical setting and diagnostic function. The CanTest framework provides an outline for further studies aiming to develop this evidence base for AI techniques in clinical settings; to prove their safety and efficacy to commissioners, clinicians, and patients; and to enable them to be implemented in clinical practice [22]. Prospective evaluation in the clinical setting for which the AI technique is intended is essential: AI techniques aimed at primary care must be evaluated in primary care settings, where cancer prevalence is low compared with specialist settings, to accurately estimate their future performance [187,188]. Further research around the acceptability of AI techniques for patients and clinicians and their cost-effectiveness will also be important to facilitate rapid implementation. Once these AI techniques are ready for implementation, they will require careful design to ensure effective integration into health information systems [189]. Data governance and protection must also be addressed, as they may present significant barriers to the implementation of these technologies [190,191].
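The reason evaluation must happen at the prevalence of the intended setting can be made concrete with Bayes' theorem: holding sensitivity and specificity fixed, the positive predictive value falls sharply as disease prevalence drops from specialist-clinic to primary-care levels. The figures below are hypothetical, chosen only to illustrate the effect, not drawn from any study in this review.

```python
# Why a test validated in a high-prevalence specialist population can
# mislead in primary care: PPV depends on prevalence, even when
# sensitivity and specificity are unchanged. All numbers are hypothetical.

def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' theorem: P(disease | positive)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# The same hypothetical test (90% sensitive, 90% specific) in two settings:
ppv_specialist = ppv(0.9, 0.9, 0.20)  # cancer common among referred patients
ppv_primary = ppv(0.9, 0.9, 0.01)     # cancer rare among primary care attenders
```

Under these assumed figures, most positive results in the low-prevalence setting are false positives, which is the practical consequence of spectrum bias that prospective primary care evaluation is designed to expose.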
In conclusion, AI techniques have the potential to aid the interpretation of patient-reported symptoms and clinical signs and to support clinical management, doctor-patient communication, and informed decision making. Ultimately, in the context of early cancer detection, these techniques may help reduce missed diagnostic opportunities and improve safety netting. However, although there are a few good examples of staged validation of these AI techniques, most of the research is at an early stage. We found numerous examples of the implementation of AI technologies without any or sufficient evidence for their accuracy or safety. Further research is required to build up the evidence base for AI techniques applied to EHRs and to reassure commissioners, clinicians, and patients that they are safe and effective enough to be incorporated into routine clinical practice.