This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Artificial intelligence (AI) is rapidly expanding in medicine despite a lack of consensus on its application and evaluation.
We sought to identify current frameworks guiding the application and evaluation of AI for predictive analytics in medicine and to describe the content of these frameworks. We also assessed which stages along the AI translational spectrum (ie, AI development, reporting, evaluation, implementation, and surveillance) the content of each framework addresses.
We performed a literature review of frameworks regarding the oversight of AI in medicine. The search included key topics such as “artificial intelligence,” “machine learning,” “guidance as topic,” and “translational science,” and spanned the time period 2014-2022. Documents were included if they provided generalizable guidance regarding the use or evaluation of AI in medicine. Included frameworks are summarized descriptively and were subjected to content analysis. A novel evaluation matrix was developed and applied to appraise the frameworks’ coverage of content areas across translational stages.
Fourteen frameworks are featured in the review: six provide descriptive guidance and eight provide reporting checklists for medical applications of AI. Content analysis revealed five considerations related to the oversight of AI in medicine across frameworks: transparency, reproducibility, ethics, effectiveness, and engagement. All frameworks include discussions regarding transparency, reproducibility, ethics, and effectiveness, while only half of the frameworks discuss engagement. The evaluation matrix revealed that frameworks were most likely to report AI considerations for the translational stage of development and were least likely to report considerations for the translational stage of surveillance.
Existing frameworks for the application and evaluation of AI in medicine offer notably little input on the role of engagement in oversight and on the translational stage of surveillance. Identifying and optimizing strategies for engagement are essential to ensure that AI can meaningfully benefit patients and other end users.
Artificial intelligence (AI) allows computers to accomplish tasks that normally require the use of human intelligence. Creating AI, or an AI computer system, begins when developers feed the system existing data and allow it to “learn.” This learning experience enables AI to understand, infer, communicate, and make decisions similar to, or better than, humans [
Numerous concerns have been raised regarding a lack of oversight for the rapid development and expansion of AI in medicine. Commentators have drawn attention to the potential weaknesses and limitations of AI in medicine, including challenges spanning ethical, legal, regulatory, methodological, and technical domains [
Translational science is the study of how to turn concepts, observations, or theories into actions and interventions by following defined stages of research and development. This is done to improve the health of individuals and society [
Developing robust guidance for the oversight of AI along its translational pathway is essential to facilitating its clinical impact [
Identifying considerations for the oversight of AI across the translational spectrum is essential to increasing the utility of AI in medicine. In this study, we explored and characterized existing frameworks regarding the oversight of AI in medicine. We then identified specific considerations raised in these frameworks and mapped them to different stages of the translational process for AI.
We performed a literature review to identify guidance on the use of predictive analytic AI in medicine. The search spanned the PubMed, Web of Science, and Embase databases, and also included a grey literature search of Google. Key terms for searching included “artificial intelligence,” “machine learning,” “guidance as topic,” and “translational science.” Documents were included if they provided generalized guidance (ie, were a framework) on applying or evaluating AI in medicine. Documents that described specific AI applications without offering overarching guidance on the use of AI were excluded. The reference lists of included frameworks were screened for additional relevant sources. Frameworks were not restricted to the use of AI in any specific condition or medical setting. The time period of the review was January 2014 to May 2022; 2014 was selected as the cut-off point, as this was the year when regulatory agencies in the United States and Europe began using the authorization designation of “software as a medical device,” which includes regulation over AI.
A structured abstraction process was used to collect general information about each framework, including title, author/affiliation, year, summary, and intended audience. Frameworks were analyzed using content analysis, which is an approach for exploring themes or patterns from textual documents [
Data were visualized using several approaches. First, we used spider plots to visualize, for each individual framework, how many stages of translation were discussed in relation to each of the five domains. Second, we applied a heatmap to depict the number of frameworks discussing a given domain across each translational stage. The heatmap cross-walked the domains across the five stages of translation.
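The two visualization approaches described above can be sketched with matplotlib. The coverage counts and the per-framework profile below are illustrative placeholders, not the study's data; the file names and the choice of colormap are likewise assumptions for the sketch.

```python
# Illustrative sketch (not the study's code or data) of the two visualization
# approaches: a heatmap of domain-by-stage coverage counts across frameworks,
# and a spider (radar) plot of stages covered per domain for one framework.
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

domains = ["Transparency", "Reproducibility", "Ethics", "Effectiveness", "Engagement"]
stages = ["Development", "Validation", "Reporting", "Implementation", "Surveillance"]

# Hypothetical number of frameworks (of 14) covering each domain-stage pair
counts = np.array([
    [13, 13, 13, 11, 4],   # Transparency
    [11,  9, 10,  9, 1],   # Reproducibility
    [11, 12, 12, 10, 2],   # Ethics
    [14, 13, 13, 11, 6],   # Effectiveness
    [ 4,  0,  3,  2, 2],   # Engagement
])

# Heatmap: darker cells indicate more frameworks offering guidance
fig, ax = plt.subplots()
im = ax.imshow(counts, cmap="Blues", vmin=0, vmax=14)
ax.set_xticks(range(len(stages)), labels=stages, rotation=45, ha="right")
ax.set_yticks(range(len(domains)), labels=domains)
fig.colorbar(im, label="Number of frameworks")
fig.savefig("heatmap.png", bbox_inches="tight")

# Spider plot: number of stages (0-5) addressed per domain, one framework
stages_covered = np.array([5.0, 3.0, 5.0, 5.0, 2.0])  # hypothetical profile
theta = np.linspace(0, 2 * np.pi, len(domains), endpoint=False)
fig2, ax2 = plt.subplots(subplot_kw={"projection": "polar"})
ax2.plot(np.append(theta, theta[0]), np.append(stages_covered, stages_covered[0]))
ax2.set_xticks(theta, labels=domains)
ax2.set_ylim(0, 5)
fig2.savefig("spider.png", bbox_inches="tight")
```

Closing each radar trace requires repeating the first point, hence the `np.append` calls that wrap the polygon back to its starting angle.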
A total of 14 documents were included in the review; they are summarized in the table below.
Summary of frameworks for the use of artificial intelligence (AI) in medicine.
Frameworks | Summary | Audience
Descriptive frameworks
AI in Healthcarea | Describes general challenges and opportunities associated with the use of AI in medicine | AI developers, clinicians, patients, policymakers
Clinician Checklist | Describes recommendations on evaluating the suitability of AI applications for clinical settings | Clinicians
Ethical Considerations | Describes a roadmap for considering ethical aspects of AI with health care applications | AI developers, investigators, clinicians, policymakers
Evaluating AI | Describes an evaluation framework for the application of AI in medicine | Investigators, health care organizations
Users' Guide | Describes an approach for assessing published literature using AI for medical diagnoses | Clinicians
Reporting and Implementing Interventions | Describes barriers to the implementation of AI in medicine and provides solutions to address them | Health care organizations
Reporting frameworks
20 Critical Questions | Proposes 20 questions for evaluating the development and use of AI in research (20 reporting items) | Investigators, clinicians, patients, policymakers
Comprehensive Checklist | Proposes a comprehensive checklist for the self-assessment and evaluation of medical papers (30 reporting items) | Investigators, editors and peer reviewers
CONSORTb-AIa | Provides reporting guidelines for clinical trials evaluating interventions with an AI component (25 core and 15 AI-specific reporting items) | AI developers, investigators
CAIRc Checklist | Provides guidelines and an associated checklist for the reporting of AI research to clinicians (15 reporting items) | Investigators, developers, clinicians
DECIDE-AIa | Provides reporting guidelines for evaluations of early-stage clinical decision support systems developed using AI (10 generic and 17 AI-specific reporting items) | Investigators, clinicians, patients, policymakers
Guidelines for Developing and Reporting | Provides guidelines for applying and reporting AI model specifications/results in biomedical research (12 reporting items) | AI developers, investigators
MINIMARd | Provides minimum reporting standards for AI in health care (16 reporting items) | AI developers, investigators
SPIRITe-AIa | Provides guidelines for clinical trial protocols evaluating interventions with an AI component (25 core and 15 AI-specific reporting items) | AI developers, investigators
aPublication associated with a professional organization; AI in Healthcare=National Academy of Medicine; CONSORT-AI=CONSORT Group; DECIDE-AI=DECIDE-AI Expert Group; SPIRIT-AI=SPIRIT Group.
bCONSORT: Consolidated Standards of Reporting Trials.
cCAIR: Clinical AI Research.
dMINIMAR: Minimum Information for Medical AI Reporting.
eSPIRIT: Standard Protocol Items: Recommendations for Interventional Trials.
Matheny and colleagues [
Scott and colleagues [
Char and colleagues [
Park and colleagues [
Liu and colleagues [
After presenting clinical examples of beneficial AI use, Bates and colleagues [
Vollmer and colleagues [
Cabitza and Campagner [
Liu and colleagues [
Olczak and colleagues [
Vasey and colleagues [
Luo and colleagues [
Hernandez-Boussard and colleagues [
Rivera and colleagues [
We identified five domains through the content analysis: transparency, reproducibility, ethics, effectiveness, and engagement. These domains are described in turn below.
Coverage of frameworks across content domains and translational stages.
Domain and stage | AIa in health care | Clinician Checklist | Ethical Considerations | Evaluating AI | Users' Guide | Reporting and Implementing Interventions | 20 Critical Questions | Comprehensive Checklist | CONSORTb-AI | CAIRc Checklist | DECIDE-AI | Guidelines for Developing and Reporting | MINIMARd | SPIRITe-AI
(The first six columns are descriptive frameworks; the remaining eight are reporting frameworks.)
Transparency
Development | ✓ | ✓ | ✓ |  | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Validation |  | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Reporting | ✓ | ✓ | ✓ |  | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Implementation |  | ✓ |  | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |  | ✓
Surveillance | ✓ |  |  | ✓ |  |  | ✓ | ✓ |  |  |  |  |  |
Reproducibility
Development |  |  | ✓ |  | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Validation | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |  |  |  | ✓ |  | ✓ |
Reporting |  | ✓ |  |  |  | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Implementation |  | ✓ |  |  | ✓ | ✓ | ✓ | ✓ | ✓ |  | ✓ | ✓ |  | ✓
Surveillance |  |  |  |  |  |  | ✓ |  |  |  |  |  |  |
Ethics
Development | ✓ | ✓ | ✓ | ✓ |  | ✓ | ✓ |  |  | ✓ | ✓ | ✓ | ✓ | ✓
Validation | ✓ | ✓ | ✓ | ✓ | ✓ |  | ✓ |  | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Reporting | ✓ | ✓ | ✓ |  |  | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Implementation | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |  | ✓ |  |  |  | ✓
Surveillance |  |  | ✓ | ✓ |  |  |  |  |  |  |  |  |  |
Effectiveness
Development | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Validation |  | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Reporting | ✓ | ✓ | ✓ |  | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Implementation | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |  | ✓ | ✓ | ✓ |  |
Surveillance | ✓ |  | ✓ | ✓ | ✓ |  | ✓ | ✓ |  |  |  |  |  |
Engagement
Development | ✓ |  | ✓ |  |  |  | ✓ |  |  |  | ✓ |  |  |
Validation |  |  |  |  |  |  |  |  |  |  |  |  |  |
Reporting |  |  | ✓ |  |  |  |  |  |  |  | ✓ |  |  | ✓
Implementation |  |  |  |  |  |  | ✓ | ✓ |  |  |  |  |  |
Surveillance | ✓ |  | ✓ |  |  |  |  |  |  |  |  |  |  |
aAI: artificial intelligence.
bCONSORT: Consolidated Standards of Reporting Trials.
cCAIR: Clinical AI Research.
dMINIMAR: Minimum Information for Medical AI Reporting.
eSPIRIT: Standard Protocol Items: Recommendations for Interventional Trials.
Coverage of frameworks across content domains. AI: artificial intelligence; CAIR: Clinical AI Research; CONSORT: Consolidated Standards of Reporting Trials; MINIMAR: Minimum Information for Medical AI Reporting; SPIRIT: Standard Protocol Items: Recommendations for Interventional Trials.
Heatmap of the frameworks' coverage across the five stages of translation. Darker boxes indicate areas where more frameworks offered guidance, whereas lighter boxes indicate areas where fewer frameworks offered guidance.
Transparency describes how openly and thoroughly information is disclosed to the public and the scientific community [
All but one framework (
Reproducibility describes how likely it is that others could develop or apply an AI tool with similar results. Reproducibility is a basic tenet of good scientific practice [
All frameworks commented on reproducibility. Only one (
Ethics considers values such as benevolence, fairness, respect for autonomy, and privacy. Such values are essential to avoiding harm and ensuring societal benefit in AI use [
Only one framework commented on ethical considerations across all stages of translation (
Effectiveness describes the success and efficiency of models and methods when they are applied in a given context. Effectiveness is concerned with matters such as data quality and model fit during the development of AI models [
Four frameworks (
Engagement explores to what extent the opinions and values of patients and other end users or stakeholders are collected and accounted for in decision-making. The degree of engagement can range from consultation (lowest level) to partnership and shared leadership [
No frameworks considered engagement across all five stages. Engagement was discussed in relation to development by four frameworks (
Frameworks for applying and evaluating AI in medicine are rapidly emerging and address important considerations for the oversight of AI, such as transparency, reproducibility, ethics, and effectiveness. Guidance on integrating stakeholder engagement to inform AI, however, is not a current strength of these frameworks: compared with other considerations, the frameworks in this review were least likely to provide guidance on using engagement to inform the translation of AI. This relative paucity of guidance on engagement reflects the larger AI landscape, which does not actively engage diverse end users in the translation of AI. For many stakeholders, AI remains a black box [
More than half of the frameworks provided reporting guidance on the use of AI in medicine. Additionally, nearly all frameworks in this review were published in 2019 or later. Given the rapid expansion of the field, it is essential to assess the consistency of recommendations across reporting frameworks to build shared understanding.
A near-miss in this review was the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) Statement [
The content domains and stages of translation that we have considered are far from exhaustive, and there are many other features and specific stages of AI development, application, and evaluation that are worthy of discussion. For instance, as the scope of AI in medicine expands, it will require broadened evaluation: there have been few economic evaluations of AI tools in medicine, which may be a barrier to their implementation [
None of the frameworks included in this review used an explicit translational science lens to provide guidance across the AI life cycle. Resources that detail considerations for AI application and evaluation at each stage of the translational process would be helpful for those seeking to develop AI with meaningful medical applications. Such resources could include patient/community-centered educational materials about the value of AI, a framework to optimize the patient-centered translation of AI predictive analytics into clinical decision-making, and critical appraisal tools for comparing different applications of AI to inform medical decision-making.
There was a paucity of guidance regarding the surveillance of AI in medicine. Although some research has described the use of AI primarily to inform public health surveillance [
The framework evaluations were not intended to reflect the
The field of AI in medicine could stand to learn from the clearer methodological standards and best practices already established in fields such as patient-centered outcomes research (PCOR) [
There is a growing literature offering input on the oversight of AI in medicine, with more guidance from regulatory bodies such as the US FDA forthcoming. Although existing frameworks provide general coverage of considerations for the oversight of AI in medicine, they fall short in their ability to offer input on the use of engagement in the development of AI, as well as in providing recommendations for the specific translational stage of surveilling AI. Frameworks should emphasize engaging patients, clinicians, and other end users in the development, use, and evaluation of AI in medicine.
artificial intelligence
Consolidated Standards of Reporting Trials
Food and Drug Administration
patient-centered outcomes research
Standard Protocol Items: Recommendations for Interventional Trials
Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis
The project described was supported by Award Number UL1TR002733 from the National Center For Advancing Translational Sciences. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Center For Advancing Translational Sciences or the National Institutes of Health.
NLC, SBB, and JFPB participated in study design. NLC, ME, JP, and JFPB participated in data collection, analysis, and in identification of data. All authors participated in writing of the report. All authors have reviewed and approved submission of this article.
None declared.