This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Artificial intelligence (AI) applications are growing at an unprecedented pace in health care, spanning disease diagnosis, triage or screening, risk analysis, surgical operations, and more. Despite a great deal of research on the development and validation of health care AI, only a few applications have actually been implemented at the frontlines of clinical practice.
The objective of this study was to systematically review AI applications that have been implemented in real-life clinical practice.
We conducted a literature search in PubMed, Embase, Cochrane Central, and CINAHL to identify relevant articles published between January 2010 and May 2020. We also hand searched premier computer science journals and conferences as well as registered clinical trials. Studies were included if they reported AI applications that had been implemented in real-world clinical settings.
We identified 51 relevant studies that reported the implementation and evaluation of AI applications in clinical practice, of which 13 adopted a randomized controlled trial design and eight adopted an experimental design. The AI applications targeted various clinical tasks, such as screening or triage (n=16), disease diagnosis (n=16), risk analysis (n=14), and treatment (n=7). The most commonly addressed diseases and conditions were sepsis (n=6), breast cancer (n=5), diabetic retinopathy (n=4), and polyp and adenoma (n=4). Regarding the evaluation outcomes, we found that 26 studies examined the performance of AI applications in clinical settings, 33 studies examined the effect of AI applications on clinician outcomes, 14 studies examined the effect on patient outcomes, and one study examined the economic impact associated with AI implementation.
This review indicates that research on the clinical implementation of AI applications is still at an early stage despite the great potential. More research is needed to assess the benefits and challenges of clinical AI applications through more rigorous methodology.
Artificial intelligence (AI) has greatly expanded in health care in the past decade. In particular, AI applications have been applied to uncover information from clinical data and assist health care providers in a wide range of clinical tasks, such as disease diagnosis, triage or screening, risk analysis, and surgical operations [
The term “AI” was coined by McCarthy in the 1950s and refers to a branch of computer science wherein algorithms are developed to emulate human cognitive functions, such as learning, reasoning, and problem solving [
Researchers have devoted a great deal of effort to the development of health care AI applications. The number of related articles in the Google Scholar database has grown exponentially since 2000. However, their implementation in real-life clinical practice is not widespread [
To the best of our knowledge, this review is the first to systematically examine the role of AI applications in real-life clinical environments. We note that many reviews have been carried out in the area of health care AI. One stream of reviews provided an overview of the current status of AI technology in specific clinical domains, such as breast cancer diagnosis [
On the other hand, we note that several viewpoint articles have provided a general outlook of health care AI [
The objective of this systematic review was to identify and summarize the existing research on AI applications that have been implemented in real-life clinical practice. This helps us better understand the benefits and challenges associated with AI implementation in routine care settings, such as augmenting clinical decision-making capacity, improving care processes and patient outcomes, and reducing health care costs. Specifically, we synthesize relevant studies based on (1) study characteristics, (2) AI application characteristics, and (3) evaluation outcomes and key findings. Considering the research-practice gap, we also provide suggestions for future research that examines and assesses the implementation of AI in clinical practice.
The systematic review was conducted following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [
We used two groups of keywords to identify terms in the titles, abstracts, and keywords of the publications. The first group comprised AI-related terms, including “artificial intelligence,” “machine learning,” and “deep learning.” It is worth noting that AI is a broad umbrella term that also covers specific techniques, such as neural networks, support vector machines, decision trees, and natural language processing (NLP). However, studies using these techniques are highly likely to use “artificial intelligence” or “machine learning” in their abstracts or keywords [
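The combination of the two keyword groups can be sketched as a Boolean query: terms within a group are joined with OR, and the groups are joined with AND. The following minimal sketch illustrates that structure; the second group's terms here are hypothetical placeholders, as the full search strategy appears in the appendix.

```python
# Illustrative sketch of how a systematic-review search string is assembled.
# The second keyword group below is hypothetical; the actual strategy is in the appendix.
ai_terms = ['"artificial intelligence"', '"machine learning"', '"deep learning"']
setting_terms = ['"clinical practice"', '"implementation"', '"routine care"']  # placeholder terms

def build_query(group1, group2):
    """Join each group with OR, then combine the two groups with AND."""
    return f'({" OR ".join(group1)}) AND ({" OR ".join(group2)})'

print(build_query(ai_terms, setting_terms))
```

Running this prints a single query string that a bibliographic database such as PubMed would interpret as “any AI term AND any setting term.”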
We imported all identified articles into EndNote X9 (Thomson Reuters) for citation management. After removing duplicates, two researchers (JY and KYN) independently screened the titles and abstracts of the identified articles to determine their eligibility. Disagreements were resolved by discussion between the authors until consensus was reached. The inclusion criteria were as follows: (1) the study implemented an AI application with patients or health care providers in a real-life clinical setting and (2) the AI application provided decision support by emulating the clinical decision-making processes of health care providers (eg, medical image interpretation and clinical risk assessment). Medical hardware devices, such as X-ray machines, ultrasound machines, surgery robots, and rehabilitation robots, were outside our scope.
The exclusion criteria were as follows: (1) the study discussed the development and validation of clinical AI algorithms without actual implementation; (2) the AI application provided automation (eg, automated insulin delivery and monitoring) rather than decision support; and (3) the AI application targeted nonclinical tasks, such as biomedical research, operational tasks, and epidemiological tasks. We also excluded conference abstracts, reviews, commentaries, simulation papers, and ongoing studies.
Following article selection, we created a data-charting form to extract information from the included articles in the following aspects: (1) study characteristics, (2) AI application characteristics, and (3) evaluation outcomes and key findings (
Author, year
Study design
Involved patient(s) and health care provider(s)
Involved hospital(s) and country of the study
Application description
AI techniques used (eg, neural networks, random forests, and natural language processing)
Targeted clinical tasks
Targeted disease domains and conditions
Performance of AI applications
Clinician outcomes
Patient outcomes
Cost-effectiveness
Our initial search in June 2020 returned a total of 17,945 journal articles (6830 from PubMed, 9124 from Embase, 839 from CINAHL, and 1152 from Cochrane Central) (
Flow diagram of the literature search based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement.
Regarding study design, the 51 studies included 20 observational studies (17 prospective studies and three retrospective studies), 13 randomized controlled trials (RCTs), eight experimental studies, four before-and-after studies, three surveys, one randomized crossover trial, one nonrandomized trial, and one structured interview. It is important to note that observational studies can be categorized into prospective and retrospective studies based on the timing of data collection. In prospective studies, researchers design the research and plan the data collection procedures before any of the subjects have the disease or develop other outcomes of interest. In retrospective studies, researchers collect existing data on current and past subjects, that is, subjects may have the disease or develop other outcomes of interest before researchers initiate research design and data collection.
Of the 51 studies, 29 (57%) explicitly mentioned the involved patients, two of which had a sample size smaller than 30. On the other hand, 28 (55%) studies provided information about the involved health care providers, of which 17 studies had 10 or fewer providers.
Additionally, 46 (90%) studies mentioned the involved hospitals or clinics (
Characteristics of the included studies.
Author, year | Study design | Sample characteristics | Hospital (country) | Evaluation outcomes |
Abràmoff et al, 2018 [ | Observational study (prospective) | 819 patients | 10 primary care clinics (United States) | APa (sensitivity, specificity, imageability rate) |
Aoki et al, 2020 [ | Experimental study (cross-over design) | 6 physicians | The University of Tokyo Hospital (Japan) | COb (reading time, mucosal break detection rate) |
Arbabshirani et al, 2018 [ | Observational study (prospective) | 347 routine head CTc scans of patients | Geisinger Health System (United States) | AP (AUCd, accuracy, sensitivity, specificity) |
Bailey et al, 2013 [ | Crossover RCTe | 20,031 patients | Barnes-Jewish Hospital (United States) | POf (ICUg transfer, hospital mortality, hospital LOSh) |
Barinov et al, 2019 [ | Experiment (within subjects) | 3 radiologists | NRi | AP (AUC) |
Beaudoin et al, 2016 [ | Observational study (prospective) | 350 patients (515 prescriptions) | Centre hospitalier universitaire de Sherbrooke (Canada) | AP (number of triggered recommendations, precision, recall, accuracy) |
Bien et al, 2018 [ | Experimental study (within subjects) | 9 clinical experts | Stanford University Medical Center (United States) | AP (AUC) |
Brennan et al, 2019 [ | Nonrandomized trial | 20 physicians | An academic quaternary care institution (United States) | AP (AUC) |
Chen et al, 2020 [ | RCT | 437 patients | Renmin Hospital, Wuhan University (China) | CO (blind spot rate) |
Connell et al, 2019 [ | Before-after study | 2642 patients | Royal Free Hospital, Barnet General Hospital (United Kingdom) | PO (renal recovery rate, other clinical outcomes, care process) |
Eshel et al, 2017 [ | Observational study (prospective) | 6 expert microscopists | Apollo Hospital, Chennai (India); Aga Khan University Hospital (Kenya) | AP (sensitivity, specificity, species identification accuracy, device parasite count) |
Giannini et al, 2019 [ | Before-after study | 22,280 patients in the silent period, 32,184 patients in the alert period | 3 urban acute hospitals under University of Pennsylvania Health System (United States) | AP (sensitivity, specificity) |
Ginestra et al, 2019 [ | Survey | 43 nurses and 44 health care providers | A tertiary teaching hospital in Philadelphia (United States) | CO (nurse and provider perceptions) |
Gómez-Vallejo et al, 2016 [ | Observational study (retrospective) | 1800 patients (2569 samples) | A Spanish National Health System hospital (Spain) | AP (accuracy) |
Grunwald et al, 2016 [ | Observational study (retrospective) | 15 patients, 3 neuroradiologists | A comprehensive stroke center (Germany) | AP (e-ASPECTS performance) |
Kanagasingam et al, 2018 [ | Observational study (prospective) | 193 patients, 4 physicians | A primary care practice in Midland (Australia) | AP (sensitivity, specificity, PPVj, NPVk) |
Keel et al, 2018 [ | Survey | 96 patients | St Vincent’s Hospital, University Hospital Geelong (Australia) | AP (sensitivity and specificity, assessment time) |
Kiani et al, 2020 [ | Experimental study (within subjects) | 11 pathologists | Stanford University Medical Center (United States) | AP (accuracy) |
Lagani et al, 2015 [ | Observational study (prospective) | 2 health care providers | Chorleywood Health Centre (United Kingdom) | AP (system performance) |
Lin et al, 2019 [ | RCT | 350 patients | 5 ophthalmic clinics (China) | AP (accuracy, PPV, NPV) |
Lindsey et al, 2018 [ | Experimental study (within subjects) | 40 practicing emergency clinicians | Hospital for Special Surgery (United States) | AP (AUC) |
Liu et al, 2020 [ | RCT | 1026 patients | No. 988 Hospital of Joint Logistic Support Force of PLA (China) | CO (ADRl, PDRm, number of detected adenomas and polyps) |
Mango et al, 2020 [ | Experimental study (within subjects) | 15 physicians | 13 different medical centers (United States) | AP (AUC, sensitivity, specificity) |
Martin et al, 2012 [ | Observational study (prospective) | 214 patients | 13 different medical centers (United States) | AP (sensitivity, PPV) |
McCoy and Das, 2017 [ | Before-after study | 1328 patients | Cape Regional Medical Center (United States) | PO (hospital mortality, hospital LOS, readmission rate) |
McNamara et al, 2019 [ | Observational study (prospective) | 3 breast cancer experts | John Theurer Cancer Center (United States) | CO (decision making) |
Mori et al, 2018 [ | Observational study (prospective) | 791 patients, 23 endoscopists | Showa University Northern Yokohama Hospital (Japan) | AP (NPV) |
Nagaratnam et al, 2020 [ | Observational study (retrospective) | 1 patient | Royal Berkshire Hospital (United Kingdom) | PO (patient care and clinical outcomes) |
Natarajan et al, 2019 [ | Observational study (prospective) | 213 patients | Dispensaries under Municipal Corporation of Greater Mumbai (India) | AP (sensitivity, specificity) |
Nicolae et al, 2020 [ | RCT | 41 patients | Sunnybrook Odette Cancer Centre (Canada) | AP (day 30 dosimetry) |
Park et al, 2019 [ | Experimental study (within subjects) | 8 clinicians | Stanford University Medical Center (United States) | CO (specificity, sensitivity, accuracy, interrater agreement, time to diagnosis) |
Romero-Brufau et al, 2020 [ | Pre-post survey | 81 clinical staff | 3 primary-care clinics in Southwest Wisconsin (United States) | CO (attitudes about AIo in the workplace) |
Rostill et al, 2018 [ | RCT | 204 patients, 204 caregivers | NHS, Surrey and Hampshire (United Kingdom) | CO (system evaluations) |
Segal et al, 2014 [ | Observational study (prospective) | 16 pediatric neurologists | Boston Children’s Hospital (United States) | CO (diagnostic errors, diagnosis relevance, number of workup items) |
Segal et al, 2016 [ | Observational study (prospective) | 26 clinicians | Boston Children’s Hospital (United States) | CO (diagnostic errors) |
Segal et al, 2017 [ | Structured interviews | 10 medical specialists | Geisinger Health System and Intermountain Healthcare (United States) | CO (system perceptions) |
Segal et al, 2019 [ | Observational study (prospective) | 3160 patients (315 prescription alerts) | Sheba Medical Center (Israel) | AP (accuracy, clinical validity, and usefulness) |
Shimabukuro et al, 2017 [ | RCT | 142 patients | University of California San Francisco Medical Center (United States) | PO (LOS, in-hospital mortality) |
Sim et al, 2020 [ | Observational study (prospective) | 12 radiologists | 4 medical centers (United States and South Korea) | AP (sensitivity, FPPIp) |
Steiner et al, 2018 [ | Experimental study (within subjects) | 6 anatomic pathologists | NR | CO (sensitivity, average review per image, interpretation difficulty) |
Su et al, 2020 [ | RCT | 623 patients, 6 endoscopists | Qilu Hospital of Shandong University (China) | CO (ADR, PDR, number of adenomas and polyps, withdrawal time, adequate bowel preparation rate) |
Titano et al, 2018 [ | RCT | 2 radiologists | NR | CO (time to diagnosis, queue of urgent cases) |
Vandenberghe et al, 2017 [ | Observational study (prospective) | 1 pathologist and 2 HER2 raters | NR | CO (decision concordance, decision modification) |
Voerman et al, 2019 [ | Before-after study | NR | Five Rivers Medical Center, Pocahontas (United States) | CEq (average total costs per patient) |
Wang et al, 2019 [ | RCT | 1058 patients, 8 physicians | Sichuan Provincial People’s Hospital (China) | CO (ADR, PDR, number of adenomas per patient) |
Wang et al, 2019 [ | RCT | 75 patients | 4 primary care clinics affiliated with Brigham and Women’s Hospital (United States) | CO (anticoagulation prescriptions) |
Wang et al, 2020 [ | RCT | 962 patients | Caotang branch hospital of Sichuan Provincial People’s Hospital (China) | CO (ADR, PDR, number of adenomas and polyps per colonoscopy) |
Wijnberge et al, 2020 [ | RCT | 68 patients | Amsterdam UMC (Netherlands) | PO (median time-weighted average of hypotension, median time of hypotension, treatment, time to intervention, adverse events) |
Wu et al, 2019 [ | Observational study (prospective) | 3600 residents, 3 ophthalmologists | Community healthcare centers (China) | AP (AUC) |
Wu et al, 2019 [ | RCT | 303 patients, 6 endoscopists | Renmin Hospital of Wuhan University (China) | AP (accuracy, completeness of photo documentation) |
Yoo et al, 2018 [ | Observational study (prospective) | 50 patients, 1 radiologist | NR (Korea) | AP (sensitivity, specificity, PPV, NPV, accuracy) |
aAP: application performance.
bCO: clinician outcomes.
cCT: computed tomography.
dAUC: area under the curve.
eRCT: randomized controlled trial.
fPO: patient outcomes.
gICU: intensive care unit.
hLOS: length of stay.
iNR: not reported.
jPPV: positive-predictive value.
kNPV: negative-predictive value.
lADR: adenoma detection rate.
mPDR: polyp detection rate.
nACSC: ambulatory care sensitive admissions.
oAI: artificial intelligence.
pFPPI: false positives per image.
qCE: cost-effectiveness.
Distribution of the included articles from 2010 to 2020.
Country distribution of the involved hospitals.
Considering the heterogeneity of study types included in the review, we only assessed the risk of bias of 13 RCTs using the Cochrane Collaboration Risk of Bias tool (
Among the 51 studies, two did not disclose any information regarding the AI techniques used. Among the remaining 49 studies, the most popular ML technique was neural networks (n=22), followed by random forests (n=3), Bayesian pattern matching (n=3), support vector machine (n=2), decision tree (n=2), and deep reinforcement learning (n=2). We also found that the included AI applications mainly provided decision support in the following four categories of clinical tasks: disease screening or triage (n=16), disease diagnosis (n=16), risk analysis (n=14), and treatment (n=7). Further, AI applications in 46 (94%) studies targeted one or more specific diseases and conditions. The most prevalent diseases and conditions were sepsis (n=6), breast cancer (n=5), diabetic retinopathy (n=4), polyp and adenoma (n=4), cataracts (n=2), and stroke (n=2). Details of AI application characteristics are provided in
We categorized the evaluation outcomes of the reviewed studies into the following four types: performance of AI applications, clinician outcomes, patient outcomes, and cost-effectiveness, as can be seen in
Twenty-six studies evaluated the performance of AI applications in real-life clinical settings [
In contrast, two studies found that AI applications failed to outperform health care providers and needed further improvement [
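The application-performance metrics recurring across these studies (sensitivity, specificity, PPV, NPV, and accuracy) are all derived from a 2×2 confusion matrix. As a reference point, here is a minimal sketch with a hypothetical confusion matrix; the counts are illustrative and do not come from any reviewed study.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute the standard diagnostic-performance metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),            # true-positive rate (recall)
        "specificity": tn / (tn + fp),            # true-negative rate
        "ppv": tp / (tp + fp),                    # positive-predictive value (precision)
        "npv": tn / (tn + fn),                    # negative-predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Hypothetical example: 80 true positives, 10 false positives,
# 20 false negatives, 90 true negatives.
metrics = diagnostic_metrics(80, 10, 20, 90)
print(metrics)  # sensitivity 0.80, specificity 0.90, accuracy 0.85
```

Note that PPV and NPV, unlike sensitivity and specificity, depend on disease prevalence in the evaluated population, which is one reason real-world performance can diverge from retrospective validation results.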
Thirty-three studies examined the effect of AI applications on clinician outcomes, that is, clinician decision making, clinician workflow and efficiency, and clinician evaluations and acceptance of AI applications [
AI applications have the potential to provide clinical decision support. From our review, 16 studies demonstrated that AI applications could enhance clinical decision-making capacity [
Seven studies focused on clinician workflow and efficiency [
Finally, clinician perceptions and acceptance of AI applications were examined in seven studies [
Fourteen studies reported patient outcomes [
Three studies examined how patients evaluated AI applications, and all of them reported positive results [
The economic impact of AI implementation in clinical practice was addressed in only one study [
AI applications have huge potential to augment clinician decision making, improve clinical care processes and patient outcomes, and reduce health care costs. Our review seeks to identify and summarize the existing studies on AI applications that have been implemented in real-life clinical practice. It yields the following interesting findings.
First, we note that the number of included studies was surprisingly small considering the tremendous number of studies on health care AI. In particular, most of the health care AI studies were proof-of-concept studies that focused on AI algorithm development and validation using retrospective clinical data sets. In contrast, only a handful of studies implemented and evaluated AI in a clinical environment. To ensure safe adoption, however, an AI application should provide solid scientific evidence for its effectiveness relative to the standard of care. Therefore, we urge the health care AI research community to work closely with health care providers and institutions to demonstrate the potential of AI in real-life clinical settings.
Second, more than two-thirds of the included articles were from developed economies, of which more than half were from the United States, suggesting that developed countries are at the forefront of health care AI development and deployment. This is consistent with the fact that top health AI companies and start-ups (eg, Google Health, IBM Watson Health, and Babylon Health) are mainly located in the United States and Europe. This finding should be interpreted with caution because we excluded non-English–written articles, even though our search had identified 890 non-English publications. We did not include these non-English articles because it is difficult to conduct an unbiased analysis owing to translation difficulty and variation. The imbalanced distribution of articles by country or economic development status could be attributed to the fact that researchers from low-income countries have a very low publication rate.
However, it is worth noting that 8 (16%) of our articles were from China, suggesting that China has been extensively applying health care AI and conducting health care AI research. Indeed, hospitals, technology companies, and the Chinese government have been driving clinical AI deployment with the aim to alleviate doctor shortages, relieve medical resource inequality, and reduce health care costs [
Third, the quality of research on clinical AI evaluation needs to be improved in the future. Our review revealed that only 13 (26%) studies were RCTs and most of them suffered from moderate to high risk of bias. Eight studies were experimental studies, and all of them adopted a cross-over design or within-subjects design and were hence susceptible to confounding effects. With respect to sample information, only 8 (16%) studies provided information on both patients and health care providers, and 14 (28%) studies used a sample size smaller than 20 (
Fourth, our analysis indicated that AI applications could provide effective decision support, albeit in certain contexts. For instance, the augmenting role of AI in clinical decision-making capacity can be affected by the level of expertise. In particular, two studies suggested that junior physicians were more likely to benefit from AI than senior physicians because they had a higher tendency to reconsider and modify their clinical decisions when encountering disconfirming AI suggestions [
With respect to AI acceptance, we observed that health care providers expressed negative feelings toward AI in two studies [
Fifth, most of the included studies on patient outcomes did not examine the clinical processes and interventions in detail. However, AI applications without appropriate and useful interventions may be ineffective at improving patient outcomes. For example, Bailey et al [
Moreover, three of the included studies suggested that patients and their families were highly satisfied with health care AI owing to its convenience and efficiency [
Finally, according to an Accenture survey, more than half of health care institutions are optimistic that AI will reduce costs and improve revenue despite the high initial costs associated with AI implementation [
This review has several limitations. First, we only included peer-reviewed English-written journal articles. It is plausible that some relevant articles were written in other languages or published in conferences, workshops, and news reports. As noted earlier, this may partly explain the imbalanced country distribution of the reviewed articles. Second, we did not include articles published before 2010 because AI only started to make inroads in the clinical field in the last decade, as evident in our search results. Third, we only reviewed premier computer science conferences and journals without comprehensively examining engineering and computer science databases. This should be less of a concern here because we found that computer science conferences and journals mainly focus on the training and validation of novel AI algorithms without actual deployment. Still, future research can expand the search scope to gain deeper insights into state-of-the-art clinical AI algorithms.
Another concern is that some AI applications may have been implemented in real-world clinical practice without any openly accessible publications. For example, IDx-DR, the first FDA-approved AI system, has been implemented in more than 20 health care institutions such as University of Iowa Health Care [
AI applications have tremendous potential to improve patient outcomes and care processes. Based on the literature presented in this review, there is great interest in developing AI tools to support clinical workflows, with increasing high-quality evidence being generated. However, there is currently insufficient level 1 evidence to advocate the routine use of health care AI for decision support, which hinders the growth of health care AI and presents potential risks to patient safety. We thus conclude that it is important to conduct robust RCTs that benchmark AI-aided care processes and outcomes against current best practice. A rigorous, robust, and comprehensive evaluation of health care AI will help it move from theory to clinical practice.
Search strategy.
Quality assessments of randomized controlled trials based on the Cochrane Collaboration Risk of Bias Tool.
Artificial intelligence application characteristics.
Evaluation outcomes and main results of the included studies.
AI: artificial intelligence
DL: deep learning
FDA: Food and Drug Administration
ICU: intensive care unit
ML: machine learning
NLP: natural language processing
NPV: negative-predictive value
PPV: positive-predictive value
RCT: randomized controlled trial
This study was supported by NSCP (grant no. N-171-000-499-001) from the National University of Singapore. It was also supported by Dean Strategic Fund-Health Informatics (grant no. C-251-000-068-001) from the National University of Singapore.
None declared.