This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Clinical decision support systems are designed to utilize medical data, knowledge, and analysis engines to generate patient-specific assessments or recommendations for health professionals in order to assist decision making. Artificial intelligence–enabled clinical decision support systems aid the decision-making process through an intelligent component. Well-defined evaluation methods are essential to ensure that these systems integrate seamlessly into, and contribute to, clinical practice.
The purpose of this study was to develop and validate a measurement instrument and test the interrelationships of evaluation variables for an artificial intelligence–enabled clinical decision support system evaluation framework.
An artificial intelligence–enabled clinical decision support system evaluation framework consisting of 6 variables was developed. A Delphi process was conducted to develop the measurement instrument items, and cognitive interviews and pretesting were performed to refine the questions. Web-based survey responses were collected from 156 respondents and analyzed to remove irrelevant questions from the measurement instrument, to test dimensional structure, and to assess reliability and validity; this yielded a 28-item measurement instrument. The interrelationships of relevant variables were tested and verified using path analysis.
The Cronbach α of the measurement instrument was 0.963, and its content validity was 0.943. Values of average variance extracted ranged from 0.582 to 0.756, and values of the heterotrait-monotrait ratio ranged from 0.376 to 0.896. The final model had a good fit (comparative fit index 0.991; goodness-of-fit index 0.957; root mean square error of approximation 0.052; standardized root mean square residual 0.028).
User acceptance is the central dimension of artificial intelligence–enabled clinical decision support system success. Acceptance was directly influenced by perceived ease of use, information quality, service quality, and perceived benefit. Acceptance was also indirectly influenced by system quality and information quality through perceived ease of use. User acceptance and perceived benefit were interrelated.
Clinical decision support systems are computer-based enterprise systems designed to utilize massive amounts of data, medical knowledge, and analysis engines, and to generate patient-specific assessments or recommendations for health professionals in order to assist clinical decision making through human–computer interaction [
AI-enabled clinical decision support systems include an intelligent component [
Diagnostics are a primary use case of AI-enabled clinical decision support systems, and these systems have been applied in the field of rare disease diagnosis [
The greatest benefits of AI-enabled clinical decision support systems reside in their ability to learn from real-world use and experience (ie, training) and their capabilities for improving their performance (ie, adaptation) [
A comprehensive evaluation framework with common elements and interoperability is necessary to serve as a reference for AI-enabled clinical decision support system design and evaluation, with a focus on cross-disciplinary communication and collaboration, and there is a pressing need to develop robust methodologies and empirically based tools for such evaluation. This need is driven by the uncertain added value of AI-enabled clinical decision support system implementations, the lack of attention paid to evaluation, and the potential benefits of comprehensive evaluations.
First, the added value of AI-enabled clinical decision support system implementations in a clinical setting is not firmly established, though evidence exists that such implementations offer potential benefit to patients, clinicians, and health care in general [
The approach to AI-enabled clinical decision support system evaluation is influenced by a sociotechnical regime, which informs and guides the development of the robust and focused evaluation method of this study. It has increasingly been acknowledged that evaluations of such systems are based on a sociological understanding of the complex practices in which the information technologies are to function [
A well-defined success measure, based on users’ perspectives, that specifies aspects of AI-enabled clinical decision support systems that determine their success [
A comprehensive evaluation methodology involves a multidisciplinary process and diverse stakeholder involvement; applied to AI-enabled clinical decision support system evaluation, this means a mixed methodology based not only on tenets of medicine and information technology but also on social and cognitive psychology [
AI-enabled clinical decision support systems interface with a diverse set of clinical and nonclinical users and stakeholders whose inputs are integral to the evaluation process. Health care enterprises are multiprofessional organizations that often include dual hierarchical structures involving clinical practitioners and managers [
We aimed to address the gap in evaluation knowledge and methodologies by identifying which variables influence AI-enabled clinical decision support system success and using these variables to develop a parsimonious evaluation framework. Specifically, we (1) proposed an evaluation framework with 6 variables and hypotheses about interrelationships between the 6 variables based on the literature review, (2) developed and validated an instrument using the 6 variables for assessing the success of diagnostic AI-enabled clinical decision support systems, and (3) tested the hypotheses using path analysis with latent variables in a structural equation model.
This study was approved by the Ethics Review Committee, Children’s Hospital of Shanghai/Shanghai Children’s Hospital, Shanghai Jiao Tong University (file number 2020R050-E01).
Our study combined qualitative and quantitative methodologies to validate a proposed evaluation framework, which consisted of a model with hypotheses and 6 variables. A Chinese-language measurement instrument was developed to measure and quantify the 6 variables, following an established instrument development paradigm. A literature review and a Delphi process were conducted to develop the measurement instrument items, which were then refined and tested through cognitive interviews, a pretest, and a web-based survey. Exploratory factor analysis was used to construct the constituent questions of the measurement instrument, reliability and validity tests were performed, and the interrelationships of the variables were tested and verified.
Evaluation methodologies are informed by a rich corpus of theory, which provides a robust foundation for designing an AI-enabled clinical decision support system evaluation framework. In this study and in previous review work [
An updated model of information systems success that captures multidimensionality and interdependency was proposed by DeLone and McLean in 2003 [
A set of evaluation model variables and a candidate set of medical AI and clinical decision support system evaluation items were collected through a literature review [
The candidate set of evaluation items was examined and finalized using a Delphi process. Delphi is a structured group communication process, designed to obtain a consensus of opinion from a group of experts [
Snowball sampling was used to identify a group of experts. Expert selection criteria were (1) clinical practitioners who had worked in a medical specialty for at least 10 years, preferably held a PhD (minimum postgraduate qualification), held a professional title at the advanced level or above, had an appointment or affiliation with a professional organization, and had more than 1 year of practical experience with AI-enabled clinical decision support systems; (2) hospital chief information officers who had worked in an information system specialty for at least 10 years, held a postgraduate qualification, held a midlevel professional title or above, and had an appointment or affiliation with a professional information system organization; or (3) information technology engineers in medical information system enterprises who had worked on AI or clinical decision support systems for at least 5 years, held a postgraduate qualification, and held a midlevel position title or above.
In addition to these selection criteria, a measure of degree of expert authority was introduced to add or remove experts from each round of the Delphi process. The degree of expert authority
The experts were invited to participate in the modified Delphi process via email. Those who accepted were sent an email with a link to the round 1 consultation. Experts were required to provide a relevance score for each item in the candidate set using a 4-point Likert scale (1=not relevant, 2=relevant but requires major revision, 3=relevant but requires minor revision, 4=very relevant and requires no revision). Experts were given 2 weeks to complete each round. A reminder was sent 2 days before the deadline to those who had not completed the survey. The 2-round Delphi process was carried out from May to July 2020.
The content validity was assessed in the last round of the Delphi process. Item-content validity was calculated as the percentage of expert ratings ≥3; if item-content validity was ≥0.8 (ie, expert endorsement), the item was retained. The mean item-content validity across all items retained from the last round, representing the content validity of the measurement instrument, was then computed. At the end of this step, the set of evaluation items for the measurement instrument was finalized. The final set consisted of 29 evaluation items.
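The content validity computation described above can be sketched as follows; the expert ratings here are illustrative, not the panel's actual scores:

```python
# Item-content validity (I-CVI) is the share of experts rating an item >= 3 on
# the 4-point relevance scale; items with I-CVI >= 0.8 are retained, and the
# instrument-level content validity is the mean I-CVI of the retained items.

def item_cvi(ratings):
    """Fraction of expert ratings of 3 or 4 on the 4-point relevance scale."""
    return sum(1 for r in ratings if r >= 3) / len(ratings)

def content_validity(all_ratings, retain_threshold=0.8):
    """Return (indices of retained items, scale-level CVI over retained items)."""
    icvis = [item_cvi(r) for r in all_ratings]
    retained = [i for i, v in enumerate(icvis) if v >= retain_threshold]
    scale_cvi = sum(icvis[i] for i in retained) / len(retained)
    return retained, scale_cvi

# Illustrative example: 3 candidate items rated by 10 experts
ratings = [
    [4, 4, 3, 4, 4, 3, 4, 4, 4, 3],  # I-CVI = 1.0 -> retained
    [4, 3, 2, 4, 3, 4, 4, 3, 4, 4],  # I-CVI = 0.9 -> retained
    [2, 1, 3, 2, 4, 2, 3, 1, 2, 2],  # I-CVI = 0.3 -> dropped
]
retained, scale_cvi = content_validity(ratings)
```

With these illustrative ratings, the third item falls below the 0.8 endorsement threshold and is dropped, and the scale-level content validity is the mean of the two retained I-CVIs.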
The measurement instrument consisted of the set of evaluation items measured by a web-based survey. A draft set of survey questions was refined through cognitive interviews and a pretest. Interviewees (n=5), who were postgraduates majoring in health informatics or end-users of AI-enabled clinical decision support systems (ie, clinicians), were asked to verbalize the mental process entailed in providing answers. The pretest included 20 end-users. The interviews and pretest were conducted in July 2020 and aimed to assess the extent to which the survey questions reflected the domain of interest and produced valid measurements. Responses used a Likert scale from 1 (strongly disagree) to 7 (strongly agree). The wording of the questions was subsequently modified based on the feedback from the respondents. The web-based survey opened in July and closed in September 2020.
The evaluation entities chosen in this study were AI-enabled clinical decision support systems designed to support the risk assessment of venous thromboembolism among inpatients. We targeted systems that automatically capture electronic medical record data using natural language processing, support assessment of individual thrombosis risk (eg, the Caprini scale or Wells score), monitor users, and send users reminders to provide additional data.
Users of the target AI-enabled clinical decision support systems who had at least 1 month of user experience were included. Participants were a convenience sample based in 3 hospitals in Shanghai that had implemented venous thromboembolism risk assessment AI-enabled clinical decision support systems in clinical settings. We appointed an investigator at each hospital site who was responsible for stating the objective of the study, identifying target respondents, and monitoring the length of time it took participants to complete the survey. This was a voluntary survey. The investigators transmitted the electronic questionnaire link to the respondents through the WeChat communication app.
To ensure usability for exploratory factor analysis [
Quality control measures were implemented to ensure logical consistency, with completeness checks before the questionnaire was submitted. Before submitting, respondents could review or change their answers. To avoid duplicates caused by repeat submissions, respondents accessed the survey via a WeChat account. Submitted questionnaires meeting either of the following criteria were deleted: (1) completion time <100 seconds, or (2) contradictory answers to the following 2 questions: “How often do you use the AI-enabled clinical decision support systems?” versus “You use the AI-enabled clinical decision support systems frequently.” Finally, we asked the point-of-contact individuals in each hospital to send online notifications to survey respondents at least 3 times at regular intervals to improve the response rate.
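The exclusion rules above amount to a simple record filter; a minimal sketch, with hypothetical field names and cutoffs for the contradiction check (the actual questionnaire coding is not specified here):

```python
# Drop submissions completed in under 100 seconds, or whose reported frequency
# of use contradicts their agreement with "You use the system frequently".
# Field names and the ordinal cutoffs below are illustrative assumptions.

def is_contradictory(freq_report, freq_agreement):
    """Flag e.g. 'rarely use it' paired with strong agreement that use is frequent."""
    return freq_report <= 2 and freq_agreement >= 6

def clean_responses(responses):
    return [
        r for r in responses
        if r["fill_seconds"] >= 100
        and not is_contradictory(r["use_frequency"], r["use_frequency_agreement"])
    ]

responses = [
    {"fill_seconds": 240, "use_frequency": 5, "use_frequency_agreement": 6},  # kept
    {"fill_seconds": 80,  "use_frequency": 5, "use_frequency_agreement": 6},  # too fast
    {"fill_seconds": 300, "use_frequency": 1, "use_frequency_agreement": 7},  # contradictory
]
kept = clean_responses(responses)
```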
Statistical analyses were performed (SPSS Amos, version 21; IBM Corp) to (1) identify items of the measurement instrument unrelated to AI-enabled clinical decision support system success for deletion, (2) explore the latent constructs of the measurement instrument, and (3) evaluate the reliability and validity of the measurement instrument.
Critical ratio and significance were calculated using independent
The construct structure of the measurement instrument was tested using exploratory factor analysis. Principal component analysis was applied for factor extraction, and Promax rotation with Kaiser normalization was used to redefine the factors and improve their interpretability. Before extraction, we verified that the data set was suitable for exploratory factor analysis—the Bartlett test of sphericity should be statistically significant (
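The suitability check above can be sketched as follows; this is an illustrative implementation of the Bartlett test of sphericity run on synthetic correlated data, not the study's SPSS procedure:

```python
# Bartlett's test of sphericity checks whether the item correlation matrix
# departs enough from the identity matrix for factor analysis to be sensible:
# chi2 = -(n - 1 - (2p + 5)/6) * ln(det(R)), with df = p(p - 1)/2.
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(data):
    """data: (n_observations, n_items) array; returns (chi-square, df, p)."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    statistic = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(corr))
    df = p * (p - 1) / 2
    p_value = chi2.sf(statistic, df)
    return statistic, df, p_value

# Synthetic items sharing one latent factor -> sphericity should be rejected.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
items = latent + 0.5 * rng.normal(size=(200, 4))  # 4 items loading on 1 factor
stat, df, p = bartlett_sphericity(items)
```

A statistically significant result (very small p) indicates the correlations are strong enough to proceed with factor extraction.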
Cronbach α coefficients were calculated to assess internal consistencies of the scale and each subscale; values >.80 are preferred [
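The internal consistency statistic above has a simple closed form; a minimal illustrative sketch (not the study's SPSS output), assuming a respondents-by-items score matrix:

```python
# Cronbach alpha for a k-item scale:
# alpha = k/(k-1) * (1 - sum of item variances / variance of the total score).
import numpy as np

def cronbach_alpha(scores):
    """scores: (n_respondents, k_items) array of Likert responses."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)       # per-item sample variances
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Perfectly parallel items give alpha = 1; noisier items pull alpha below 1.
base = np.array([[1], [2], [3], [4], [5], [6], [7]], dtype=float)
perfect = np.hstack([base, base, base])  # 3 identical items
alpha = cronbach_alpha(perfect)
```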
Interrelationships between variables selected for the evaluation framework were hypothesized in a model (
Evaluation model hypotheses.
Of the 11 experts invited to participate (
Accepted items in the Delphi process.
| Variables and items | Item-content validity | Critical ratio^a | Item-scale correlation^a | Corrected item-to-total correlation | Cronbach α if item deleted |
| --- | --- | --- | --- | --- | --- |
| Learnability | 1.00 | 6.419 | 0.643 | 0.615 | .961 |
| Operability | 1.00 | 7.384 | 0.628 | 0.596 | .961 |
| User interface | 0.90 | 10.496 | 0.700 | 0.669 | .960 |
| Data entry | 1.00 | 10.530 | 0.655 | 0.622 | .961 |
| Advice display | 1.00 | 7.938 | 0.655 | 0.621 | .961 |
| Legibility | 1.00 | 7.836 | 0.666 | 0.641 | .961 |
| Response time | 1.00 | 7.826 | 0.606 | 0.565 | .961 |
| Stability | 1.00 | 7.949 | 0.541 | 0.498 | .962 |
| Security | 1.00 | 9.247 | 0.588 | 0.560 | .961 |
| Diagnostic performance | 1.00 | 11.346 | 0.746 | 0.726 | .960 |
| Changes in order behavior | 0.90 | 8.593 | 0.667 | 0.637 | .961 |
| Changes in diagnosis | 0.90 | 8.843 | 0.634 | 0.600 | .961 |
| Productivity | 1.00 | 11.112 | 0.726 | 0.699 | .960 |
| Effectiveness | 1.00 | 14.078 | 0.840 | 0.823 | .959 |
| Overall usefulness | 1.00 | 13.720 | 0.826 | 0.809 | .959 |
| Adherence to standards | 1.00 | 8.843 | 0.711 | 0.688 | .960 |
| Medical quality | 1.00 | 8.945 | 0.717 | 0.696 | .960 |
| User knowledge and skills | 0.80 | 8.366 | 0.715 | 0.692 | .960 |
| Change in clinical outcomes | 0.90 | 10.974 | 0.741 | 0.719 | .960 |
| Change in patient-reported outcomes | 0.80 | 10.769 | 0.716 | 0.692 | .960 |
| Operation and maintenance | 0.90 | 9.624 | 0.590 | 0.555 | .961 |
| Information updating to keep timeliness | 1.00 | 9.601 | 0.640 | 0.614 | .961 |
| Usage | 0.80 | 4.686 | 0.323^b | 0.282^b | .963^b |
| Expectations confirmation | 1.00 | 14.174 | 0.856 | 0.841 | .959 |
| Satisfaction of system quality | 0.80 | 12.248 | 0.816 | 0.798 | .959 |
| Satisfaction of information quality | 0.80 | 13.437 | 0.828 | 0.813 | .959 |
| Satisfaction of service quality | 0.80 | 11.031 | 0.737 | 0.714 | .960 |
| Overall satisfaction | 1.00 | 15.053 | 0.873 | 0.860 | .959 |
| Intention of use | 0.90 | 13.500 | 0.855 | 0.840 | .959 |
^a For all values in this column,
^b Based on this value, the item meets the standard for potential deletion.
Based on the feedback from the cognitive interviews and pretesting, we modified the wording of 4 items and added explanations to 2 items to make them easier to understand. This self-administered measurement instrument with 29 items was used to collect survey data.
Survey responses were collected from a total of 201 respondents (
One item—usage behavior—was deleted based on item-scale correlation, corrected item-to-total correlation, and effect on Cronbach-α-if-the-item-was-deleted criteria (
Exploratory factor analysis was deemed to be appropriate (Kaiser-Meyer-Olkin .923;
Principal component analysis results.
| Component | Extraction sums of squared loadings | Variance (%) | Cumulative variance (%) | Rotation sums of squared loadings |
| --- | --- | --- | --- | --- |
| Perceived ease of use | 14.447 | 51.596 | 51.596 | 11.354 |
| System quality | 2.504 | 8.941 | 60.537 | 9.824 |
| Information quality | 1.423 | 5.082 | 65.620 | 11.299 |
| Service quality | 1.212 | 4.328 | 69.948 | 5.687 |
| Decision change | 0.841 | 3.005 | 72.953 | 6.449 |
| Process change | 0.779 | 2.780 | 75.733 | 7.736 |
| Outcome change | 0.715 | 2.555 | 78.288 | 6.588 |
| Acceptance | 0.658 | 2.350 | 80.638 | 5.997 |
The 28-item scale appeared to be internally consistent (Cronbach α=.963). The Cronbach α for the 6 subscales ranged from .760 to .949. Content validity of the overall scale was 0.943. Values of average variance extracted ranged from .582 to .756 and met the >.50 restrictive criterion, which indicated acceptable convergent validity. The values of heterotrait-monotrait ratio ranged from 0.376 to 0.896 and met the <0.90 restrictive criterion, which indicated acceptable discriminant validity of constructs (
Internal consistency, convergent validity, and discriminant validity of constructs.
| Variables | HTMT: perceived ease of use | HTMT: system quality | HTMT: information quality | HTMT: service quality | HTMT: perceived benefit | HTMT: acceptance | Average variance extracted | Composite reliability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Perceived ease of use | 1 | 0.753 | 0.765 | 0.412 | 0.657 | 0.736 | .582 | .892 |
| System quality | 0.753 | 1 | 0.637 | 0.376 | 0.455 | 0.636 | .674 | .803 |
| Information quality | 0.765 | 0.637 | 1 | 0.721 | 0.729 | 0.767 | .620 | .760 |
| Service quality | 0.412 | 0.376 | 0.721 | 1 | 0.654 | 0.673 | .752 | .858 |
| Perceived benefit | 0.657 | 0.455 | 0.729 | 0.654 | 1 | 0.896 | .595 | .935 |
| Acceptance | 0.736 | 0.636 | 0.767 | 0.673 | 0.896 | 1 | .756 | .949 |
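The average variance extracted and composite reliability values reported above are standard functions of the standardized factor loadings; a minimal sketch with illustrative loadings (not the study's estimates):

```python
# For a construct with standardized loadings lambda_i:
#   AVE = mean(lambda_i^2)
#   CR  = (sum lambda_i)^2 / ((sum lambda_i)^2 + sum(1 - lambda_i^2))

def average_variance_extracted(loadings):
    return sum(l * l for l in loadings) / len(loadings)

def composite_reliability(loadings):
    s = sum(loadings)
    error = sum(1 - l * l for l in loadings)  # standardized error variances
    return s * s / (s * s + error)

# Illustrative standardized loadings for a 4-item construct
loadings = [0.8, 0.75, 0.7, 0.85]
ave_value = average_variance_extracted(loadings)
cr_value = composite_reliability(loadings)
```

With these illustrative loadings the construct clears the conventional AVE >0.50 threshold, mirroring the convergent validity criterion applied in the study.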
The chi-square of the hypothesized model was significant (
The chi-square of the revised model was not significant (
Final evaluation model (comparative fit index 0.991; goodness-of-fit index 0.957; root mean square error of approximation 0.052; standardized root mean square residual 0.028).
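The root mean square error of approximation reported for the final model is a function of the model chi-square, its degrees of freedom, and the sample size; a minimal sketch of the standard formula with illustrative inputs (not reproduced from the study's Amos output):

```python
# RMSEA = sqrt(max(chi2 - df, 0) / (df * (n - 1)))
import math

def rmsea(chi_square, df, n):
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))

# A model whose chi-square is close to its degrees of freedom yields a small
# RMSEA; n = 156 matches this study's sample, the other inputs are illustrative.
value = rmsea(chi_square=25.0, df=18, n=156)
perfect = rmsea(chi_square=10.0, df=18, n=156)  # chi2 < df -> RMSEA = 0
```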
Parameter estimation for path coefficients.
| Pathway | Regression weight | Standardized regression weight | Standard error | Critical ratio | P value |
| --- | --- | --- | --- | --- | --- |
| Perceived ease of use ← System quality | 0.292 | 0.446 | 0.041 | 7.139 | <.001 |
| Perceived ease of use ← Information quality | 0.378 | 0.405 | 0.058 | 6.484 | <.001 |
| Acceptance ← Information quality | 0.117 | 0.099 | 0.057 | 2.070 | .04 |
| Acceptance ← Service quality | 0.235 | 0.232 | 0.052 | 4.525 | <.001 |
| Acceptance ← Perceived ease of use | 0.413 | 0.325 | 0.084 | 4.933 | <.001 |
| Expectations confirmation ← Acceptance | 1 | 0.866 | N/A^a | N/A | N/A |
| User satisfaction ← Acceptance | 0.522 | 0.536 | 0.072 | 7.241 | <.001 |
| Intention of use ← Acceptance | 0.981 | 0.893 | 0.062 | 15.804 | <.001 |
| Decision change ← Benefit | 1 | 0.595 | N/A | N/A | N/A |
| Process change ← Benefit | 1.274 | 0.923 | 0.161 | 7.935 | <.001 |
| Outcome change ← Benefit | 1.182 | 0.788 | 0.157 | 7.507 | <.001 |
| Benefit ← Acceptance | 0.599 | 0.925 | 0.078 | 7.657 | <.001 |
| Acceptance ← Benefit | 0.599 | 0.388 | 0.078 | 7.657 | <.001 |

^a N/A: not applicable.
Squared multiple correlations.
| Variables | Estimate |
| --- | --- |
| Perceived ease of use | 0.538 |
| Benefit | 0.932 |
| Outcome change | 0.621 |
| Process change | 0.851 |
| Decision change | 0.491 |
| Acceptance | 0.89 |
| Expectations confirmation | 0.75 |
| Intention of use | 0.797 |
| User satisfaction | 0.853 |
User acceptance was established as central to AI-enabled clinical decision support system success in the evaluation framework. A 28-item measurement instrument was evaluated, yielding an instrument that quantifies 6 variables:
User acceptance is the traditional focus of evaluation in determining the success of an information system [
In this study, perceived ease of use encompassed human–computer interaction (eg, user interface, data entry, information display, legibility, response time), ease of learning, and workflow integration [
We recommend using
Process change, which is similar to perceived usefulness [
Outcome measures tended to be complicated indicators of AI-enabled clinical decision support system success, which often failed to be objective in clinical settings [
This study is an innovative pilot examination of an evaluation framework in relation to AI-enabled clinical decision support system success. This evaluation framework is widely applicable, with a broad scope in clinically common and multidisciplinary interoperable scenarios. In order to test the validity of the variables and the hypotheses about their relationships, an empirical methodology was needed. Specifically, the items of the measurement instrument were developed targeting diagnostic AI-enabled clinical decision support systems, and systems designed to support the risk assessment of venous thromboembolism among inpatients were the focus. Thus, one potential limitation may arise from this narrow focus. A future expanded evaluation framework would require validation among diverse populations and across AI-enabled clinical decision support systems with diverse functions.
This study offers unique insight into AI-enabled clinical decision support system evaluation from a user-centric perspective, and the evaluation framework can help stakeholders understand user acceptance of AI-enabled clinical decision support system products with various functionalities. Given the commonality and interoperability of this evaluation framework, it is widely applicable across different implementations; that is, this framework can be used to evaluate the success of various AI-enabled clinical decision support systems.
From a theoretical point of view, this framework can serve as an evaluation approach that helps describe and understand AI-enabled clinical decision support system success through a user acceptance–centric evaluation process. There are also practical implications in terms of how this evaluation framework is applied in clinical settings. The 28-item diagnostic AI-enabled clinical decision support system success measurement instrument, divided into 6 model variables, showed good psychometric qualities. The measurement instrument can be a useful resource for health care organizations or academic institutions designing and conducting evaluation projects on specific AI-enabled clinical decision support systems. At the same time, if the measurement instrument is to be used for AI-enabled clinical decision support system products with different functionalities in a specific scenario, item modifications, cross-cultural adaptation, and reliability and validity testing (in accordance with scale development guidelines [
Evaluation target of model variables.
Characteristics of the Delphi expert panel.
Sociodemographic characteristics of respondents.
Structure matrix of measurement instrument.
Component correlation matrix.
Standardized factor loading of the measurement instrument.
Parameter estimation of error in measurement.
Standardized total effects.
Standardized direct effects.
Standardized indirect effects.
AI: artificial intelligence
This work was supported by the Doctoral Innovation Fund in Shanghai Jiao Tong University School of Medicine 2019 [BXJ201906]; the Shanghai Municipal Education Commission-Gaoyuan Nursing Grant Support [Hlgy1906dxk]; and the Shanghai Municipal Commission of Health and Family Planning (Grant No. 2018ZHYL0223).
None declared.