This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Artificial intelligence (AI) methods can potentially be used to relieve the pressure that the COVID-19 pandemic has exerted on public health. In cases of medical resource shortages caused by the pandemic, changes in people’s preferences for AI clinicians and traditional clinicians are worth exploring.
We aimed to quantify and compare people’s preferences for AI clinicians and traditional clinicians before and during the COVID-19 pandemic, and to assess whether people’s preferences were affected by the pressure of the pandemic.
We used the propensity score matching method to match two different groups of respondents with similar demographic characteristics. Respondents were recruited in 2017 and 2020. A total of 1845 respondents (2017: n=1317; 2020: n=528) completed the questionnaire and were included in the analysis. Multinomial logit models and latent class models were used to assess people’s preferences for different diagnosis methods.
In total, 84.7% (1115/1317) of respondents in the 2017 group and 91.3% (482/528) of respondents in the 2020 group were confident that AI diagnosis methods would outperform human clinician diagnosis methods in the future. Both groups of matched respondents believed that the most important attribute of diagnosis was accuracy, and they preferred to receive combined diagnoses from both AI and human clinicians (2017: odds ratio [OR] 1.645, 95% CI 1.535-1.763;
Individuals’ preferences for receiving clinical diagnoses from AI and human clinicians were generally unaffected by the pandemic. Respondents believed that accuracy and expense were the most important attributes of diagnosis. These findings can be used to guide policies that are relevant to the development of AI-based health care.
Artificial intelligence (AI) technology, which is also called machine intelligence technology, has been used in various fields, such as automation, language, image understanding and analysis, and genetic algorithm research. AI technology can outperform humans at particular tasks and has the potential to replace several traditional human occupations. This is the result of continuous advances in medicine, neuroscience, robotics, and statistics. In the medical and health care field [
As of November 13, 2020, the novel COVID-19 disease has spread in over 217 countries [
The combination of AI technology and human clinician–operated convolutional neural networks [
This study aimed to compare people’s preferences for AI diagnoses and traditional diagnoses (ie, human clinicians’ diagnoses) before and during the COVID-19 pandemic. We assessed two groups of respondents with similar demographic characteristics. We recruited one group in 2017 and the other group in 2020 to learn whether people’s preferences for AI and traditional human clinicians were affected by the pressure of the COVID-19 pandemic. We performed propensity score matching (PSM) to match the two groups. We also conducted a discrete choice experiment (DCE) to quantify and measure people’s preferences for different diagnosis methods and identify factors that disrupted and impacted people’s decision-making behaviors.
We designed a web-based questionnaire to collect participants’ demographic information and investigate patients’ preferences for different diagnosis strategies (
We used the PSM method to match two different groups of respondents (ie, the 2017 group and the 2020 group) with similar demographic characteristics. In addition, we used multinomial logit (MNL) models [
Individuals could choose different levels of health care services for each diagnosis attribute. Patients from the outpatient queues of The First Affiliated Hospital of Jinan University (Guangzhou Overseas Chinese Hospital) and The First Affiliated Hospital of Sun Yat-sen University were randomly selected for this study. Each patient was prompted to hypothesize which diagnosis methods or attributes had a large impact on their decision (ie, the methods/attributes that were of prominent importance to each participant).
After assessing patients’ hypotheses and related literature [
Diagnosis method
Description: the diagnosis method that patients prefer
Levels: clinician diagnosis, artificial intelligence and clinician diagnosis, and artificial intelligence diagnosis
Outpatient waiting time
Description: the amount of time that patients wait in a queue before the diagnosis process
Levels: 0 minutes, 20 minutes, 40 minutes, 60 minutes, 80 minutes, and 100 minutes
Diagnosis time
Description: the amount of time before a patient obtains a diagnosis
Levels: 0 minutes, 15 minutes, and 30 minutes
Diagnosis accuracy
Description: the rate of correct diagnosis
Levels: 60%, 70%, 80%, 90%, and 100%
Follow-up after diagnosis
Description: case tracking and follow-ups after diagnosis
Levels: Yes and no
Diagnosis expenses
Description: the cost of diagnosis
Levels: ¥0, ¥50, ¥100, ¥150, ¥200, and ¥250 (a currency exchange rate of ¥1=US $0.16 is applicable)
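To give a sense of why a fractional rather than full factorial design is typically needed for attribute sets like this one, the following minimal sketch enumerates every possible profile from the levels listed above. The dictionary keys are illustrative labels chosen for this example, not identifiers from the study instrument.

```python
from itertools import product

# Attribute levels as listed above; the key names are illustrative labels.
ATTRIBUTES = {
    "diagnosis_method": ["clinician", "AI and clinician", "AI"],
    "outpatient_waiting_min": [0, 20, 40, 60, 80, 100],
    "diagnosis_time_min": [0, 15, 30],
    "accuracy_pct": [60, 70, 80, 90, 100],
    "follow_up": ["yes", "no"],
    "expense_yuan": [0, 50, 100, 150, 200, 250],
}


def full_factorial(attributes):
    """Return every combination of attribute levels as a list of dicts."""
    names = list(attributes)
    return [dict(zip(names, combo)) for combo in product(*attributes.values())]


profiles = full_factorial(ATTRIBUTES)
print(len(profiles))  # 3 * 6 * 3 * 5 * 2 * 6 = 3240 candidate profiles
```

With 3240 candidate profiles, showing every combination to a respondent is impractical, which is the usual motivation for selecting a small, balanced subset of profiles.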
With regard to the design of our DCE instrument, we used the fractional factorial design method [
The DCE questionnaire contained 2 parts. The first part required the respondents to fill in their demographic information, such as age (ie, 18-20, 21-25, 26-30, 31-35, 36-40, 41-45, 46-50, 51-55, 56-60, 61-65, 66-70, 71-75, 76-80, and 81-85 years), sex (ie, male or female), and educational level (ie, primary school student, primary school graduate, middle school student, middle school graduate, high school student, high school graduate, undergraduate, bachelor’s degree, graduate student, master’s degree, postgraduate student, and doctorate degree). The second part required the respondents to consider seven different scenarios. For each scenario, respondents were to imagine that they were in an outpatient queue waiting for a diagnosis. They were then asked to choose a preferred diagnosis strategy. At the end of the questionnaire, respondents were required to estimate the number of years (ie, 5 years, 10 years, 15 years, 20 years, 30 years, 40 years, or never) it would take for AI clinicians to surpass human clinicians. The scenarios and the options for the different types of clinicians are presented in
In October 2017 and August 2020, we sent our website link to people of different age groups by using various social media platforms, such as WeChat (Tencent Inc) and QQ (Tencent Inc). People could use the link to access the DCE questionnaire, which was the same for each participant. To increase the response rate, we provided incentives (ie, a lottery for a Fitbit watch and cash prizes) for completing the questionnaire.
At the beginning of the questionnaire, we provided a brief background on the applications of AI in medicine. This included information on the potential advantages and disadvantages of AI clinicians and traditional clinicians, and the purpose of our DCE. The questionnaire took only 5-10 minutes to complete. Respondents had to click the “Agree to take the survey” button to start filling out the questionnaire. After clicking the button, they were notified that they had willingly chosen to participate in this study and that their privacy was protected by the law.
PSM is a regression-based method for identifying treatment group and control group patients with similar baseline characteristics. This method is widely used in the study of impact factors and causal effects, such as those in medical treatments, policy decisions, or case studies. PSM involves the following five steps [
Although there are various matching algorithms [
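As an illustration of the general idea only, the sketch below shows one common PSM variant: greedy 1:1 nearest-neighbor matching without replacement on an estimated propensity score. It is not the procedure run in Stata for this study. The toy data, the caliper value, the function names, and the choice to estimate scores from covariate-cell frequencies (instead of fitting a logistic regression) are all assumptions made for this example.

```python
from collections import defaultdict


def propensity_scores(rows):
    """Estimate P(group == 1 | covariates) from covariate-cell frequencies.

    Assumption: with a few binary covariates this stands in for a logistic
    regression; a real analysis would normally fit a regression model.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [n in group 0, n in group 1]
    for covariates, group in rows:
        counts[covariates][group] += 1
    return [counts[cov][1] / sum(counts[cov]) for cov, _ in rows]


def match_nearest(rows, scores, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching without replacement."""
    treated = [i for i, (_, g) in enumerate(rows) if g == 1]
    controls = {i for i, (_, g) in enumerate(rows) if g == 0}
    pairs = []
    for t in treated:
        if not controls:
            break
        # Closest remaining control; ties are broken by the lower index.
        best = min(controls, key=lambda c: (abs(scores[c] - scores[t]), c))
        if abs(scores[best] - scores[t]) <= caliper:
            pairs.append((t, best))
            controls.remove(best)
    return pairs


# Hypothetical toy data: (sex, aged >= 35, bachelor's degree or higher),
# followed by the group indicator (0 = 2017 group, 1 = 2020 group).
rows = [
    ((0, 0, 0), 0), ((0, 0, 0), 0), ((1, 0, 0), 0), ((1, 1, 1), 0),
    ((0, 0, 0), 1), ((1, 1, 1), 1),
]
scores = propensity_scores(rows)
pairs = match_nearest(rows, scores)
print(pairs)  # each pair is (index of 2020 respondent, index of matched 2017 respondent)
```

Matching without replacement keeps the two matched groups the same size, which is consistent with the equal group sizes reported for the matched samples, although the matching algorithm actually used is not shown here.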
There are various analysis models that can be used to conduct DCE-related statistical analyses, such as random effects binary probit and logit models, MNL models, and mixed logit models [
We used the MNL model to analyze people’s preferences for different attribute levels. Our independent variables only accounted for attributes that were related to health care plans; they did not account for any information that was related to participants. The MNL model was used to analyze respondents’ health care plans, which were chosen based on the relative importance of the plans’ attributes and the “none” option. The coded value of each participant’s chosen health care plan was calculated based on the participant’s coded responses to questions about queuing times, diagnosis times, and diagnostic costs. We estimated the MNL model by using a maximum likelihood approach.
The results from the MNL model were determined by the options for health care plans, as the data for this attribute were grouped before analysis. In the MNL model, “effect” is synonymous with “utility.” Therefore, a positive MNL model coefficient indicated that individuals preferred that level of service over other levels of the same attribute. The MNL model in this study was similar to a logistic regression model. Observations in the MNL model were grouped into blocks that corresponded with the same individual, and observations within a block were correlated. Instead of having 1 line per individual, as in the classical logit model, the MNL data had 1 line per attribute level of interest for each individual. For example, in this study, we analyzed three types of diagnoses (ie, clinician diagnoses, AI and clinician diagnoses, and AI diagnoses), and each type had its own characteristics. However, an individual could only choose 1 of the 3 types of diagnoses. As per the characteristics of the MNL model, all three options were presented to each respondent, and all respondents could choose their preferred option. We reported the odds ratios (ORs) of respondents’ preferences for different attribute levels.
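The way an MNL model turns utilities into choice probabilities can be sketched as follows. This is a generic illustration of the standard MNL formula, assuming an additive utility over attribute levels. The part-worth values and the names `part_worths`, `utility`, and `choice_probabilities` are hypothetical and are not estimates from this study.

```python
import math


def utility(profile, part_worths):
    """Additive utility: the sum of the part-worths of a profile's levels."""
    return sum(part_worths[attr][level] for attr, level in profile.items())


def choice_probabilities(utilities):
    """Multinomial logit: P(j) = exp(V_j) / sum_k exp(V_k)."""
    m = max(utilities)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in utilities]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical part-worths for two attributes (illustration only).
part_worths = {
    "method": {"clinician": -0.1, "AI and clinician": 0.3, "AI": -0.2},
    "accuracy": {80: 0.0, 90: 0.4},
}

# One choice scenario with three alternatives, exactly one of which is chosen.
alternatives = [
    {"method": "clinician", "accuracy": 80},
    {"method": "AI and clinician", "accuracy": 90},
    {"method": "AI", "accuracy": 90},
]
probs = choice_probabilities([utility(a, part_worths) for a in alternatives])
print([round(p, 3) for p in probs])
```

The probabilities sum to 1 across the alternatives in a scenario, which reflects the constraint that each respondent picks exactly one option.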
We used an LCM [
Willingness to pay (WTP) is a metric for measuring how much money an individual is willing to give up to obtain one diagnosis attribute level rather than another (ie, the reference attribute level). We analyzed participants’ WTP to identify homogeneity and heterogeneity in participants’ preferences.
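A common textbook way to derive WTP from choice model coefficients is to divide an attribute coefficient by the negative of a linear cost coefficient. The sketch below illustrates that ratio with hypothetical coefficients. The exact computation and sign convention used by the software in this study are not stated here, so both should be treated as assumptions. Only the exchange rate (¥1=US $0.16) is taken from the article.

```python
YUAN_TO_USD = 0.16  # exchange rate stated in the article


def willingness_to_pay(beta_attribute, beta_cost_per_yuan):
    """Textbook WTP ratio: -beta_attribute / beta_cost.

    Assumes cost enters utility linearly with a negative coefficient.
    Sign conventions differ between tools, so check before comparing.
    """
    if beta_cost_per_yuan >= 0:
        raise ValueError("the cost coefficient is expected to be negative")
    return -beta_attribute / beta_cost_per_yuan


def to_usd(amount_yuan):
    """Convert yuan to US dollars, rounded to cents."""
    return round(amount_yuan * YUAN_TO_USD, 2)


# Hypothetical coefficients (not estimates from this study).
wtp_yuan = willingness_to_pay(beta_attribute=0.5, beta_cost_per_yuan=-0.01)
print(wtp_yuan, to_usd(wtp_yuan))
```

Under this convention, a positive value means that respondents would pay to obtain the level, and a negative value means that they would need compensation to accept it.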
Propensity score matching was conducted with Stata 16 (StataCorp LLC), and the MNL model and LCMs were created with Lighthouse Studio version 9.8.1 (Sawtooth Software).
Of the 1520 individuals who visited our DCE website in 2017, 1317 (86.6%) completed the questionnaire and were included in the analysis. Of these 1317 respondents, 1317 (100%) were aged 18-85 years, 731 (55.5%) were female, and 1115 (84.7%) believed that AI clinicians would surpass or replace human clinicians.
Of the 874 individuals who visited our new DCE website in 2020, 528 (60.4%) completed the questionnaire. Of these 528 participants, 272 (51.5%) were female and 482 (91.3%) were confident that AI diagnoses were better than traditional diagnoses.
Of the 1317 respondents who were recruited in 2017, 528 (40.1%) were matched (ie, via PSM) to the 528 respondents who were recruited in 2020. The PSM procedure is presented in
Propensity score matching procedure.
Demographic characteristics of nonmatched and propensity score–matched respondents.
Baseline matching characteristics | Nonmatched respondents: 2017 group (n=1317), n (%) | Nonmatched respondents: 2020 group (n=528), n (%) | Nonmatched respondents: P value | Propensity score–matched respondents: 2017 group (n=528), n (%) | Propensity score–matched respondents: 2020 group (n=528), n (%) | Propensity score–matched respondents: P value
Sex | | | <.001 | | | .97
Male | 586 (44.5) | 256 (48.48) | | 250 (47.35) | 256 (48.48) |
Female | 731 (55.5) | 272 (51.52) | | 278 (52.65) | 272 (51.52) |
Age (years) | | | <.001 | | | .69
<35 | 1106 (83.98) | 348 (65.91) | | 379 (71.78) | 348 (65.91) |
≥35 | 211 (16.02) | 180 (34.09) | | 149 (28.22) | 180 (34.09) |
Educational level | | | <.001 | | | .13
Primary school graduate to undergraduate | 1033 (78.44) | 336 (63.64) | | 385 (72.92) | 336 (63.64) |
Bachelor’s degree to doctorate degree | 284 (21.56) | 192 (36.36) | | 143 (27.08) | 192 (36.36) |
General results of the multinomial logit model. Data on propensity score–matched respondents’ preferences for diagnosis attributes in 2017 and 2020 are reported (N=528).
Attributes and levels | 2017 group: effect coefficient | 2017 group: P value | 2017 group: odds ratio (95% CI) | 2020 group: effect coefficient | 2020 group: P value | 2020 group: odds ratio (95% CI)
Diagnosis method
Clinician | −0.15 | <.001 | Reference | −0.05 | .12 | Reference
Artificial intelligence and clinician | 0.35 | <.001 | 1.64 (1.535-1.763) | 0.36 | <.001 | 1.51 (1.413-1.621)
Artificial intelligence | −0.20 | <.001 | 0.95 (0.885-1.016) | −0.31 | <.001 | 0.78 (0.725-0.833)
Outpatient waiting time (minutes)
0 | 0.31 | <.001 | Reference | 0.15 | .01 | Reference
20 | 0.12 | .03 | 0.82 (0.741-0.914) | 0.26 | <.001 | 1.12 (1.013-1.245)
40 | −0.03 | .57 | 0.71 (0.639-0.789) | −0.02 | .72 | 0.85 (0.762-0.942)
60 | −0.08 | .12 | 0.67 (0.606-0.748) | −0.20 | <.001 | 0.71 (0.640-0.788)
80 | −0.31 | <.001 | 0.54 (0.482-0.595) | −0.20 | <.001 | 0.71 (0.640-0.789)
Diagnosis time (minutes)
0 | 0.05 | .19 | Reference | −0.02 | .57 | Reference
15 | −0.07 | .06 | 0.89 (0.834-0.957) | −0.01 | .83 | 1.01 (0.946-1.084)
30 | 0.02 | .53 | 0.98 (0.912-1.046) | 0.03 | .43 | 1.05 (0.980-1.122)
Diagnosis accuracy (%)
60 | −0.83 | <.001 | Reference | −0.83 | <.001 | Reference
70 | −0.35 | <.001 | 1.62 (1.458-1.802) | −0.41 | <.001 | 1.52 (1.365-1.684)
80 | 0.07 | .16 | 2.47 (2.235-2.737) | −0.02 | .72 | 2.25 (2.033-2.487)
90 | 0.32 | <.001 | 3.18 (2.867-3.526) | 0.43 | <.001 | 3.51 (3.169-3.891)
100 | 0.79 | <.001 | 5.04 (4.534-5.609) | 0.83 | <.001 | 5.26 (4.734-5.852)
Follow-up after diagnosis
Yes | 0.20 | <.001 | Reference | 0.19 | <.001 | Reference
No | −0.20 | <.001 | 0.67 (0.620-0.698) | −0.19 | <.001 | 0.69 (0.656-0.715)
Diagnosis expenses (¥; see footnote a)
0 | 0.42 | <.001 | Reference | 0.36 | <.001 | Reference
50 | 0.28 | <.001 | 0.87 (0.769-0.976) | 0.23 | <.001 | 0.88 (0.782-0.989)
100 | −0.01 | .82 | 0.65 (0.576-0.730) | 0.18 | <.001 | 0.83 (0.738-0.935)
150 | 0.03 | .66 | 0.67 (0.599-0.760) | −0.06 | .30 | 0.65 (0.580-0.736)
200 | −0.24 | <.001 | 0.52 (0.459-0.585) | −0.19 | <.001 | 0.58 (0.510-0.648)
250 | −0.47 | <.001 | 0.41 (0.363-0.465) | −0.52 | <.001 | 0.41 (0.366-0.468)
aA currency exchange rate of ¥1=US $0.16 is applicable.
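The reported ORs appear to be consistent with exponentiating the difference between a level’s effect coefficient and the reference level’s coefficient. The sketch below checks this reading against a few rows of the table above. This relationship is an inference from the reported numbers, not a stated method, and the coefficients are rounded to 2 decimal places, so only approximate agreement should be expected.

```python
import math


def odds_ratio(beta_level, beta_reference):
    """OR of a level versus the reference level: exp(beta_level - beta_reference)."""
    return math.exp(beta_level - beta_reference)


# 2017 group, diagnosis method (reference level: clinician, coefficient -0.15).
or_combined_2017 = odds_ratio(0.35, -0.15)  # reported OR: 1.64
or_ai_2017 = odds_ratio(-0.20, -0.15)       # reported OR: 0.95

# 2020 group, diagnosis accuracy of 100% (reference level: 60%, coefficient -0.83).
or_accuracy_2020 = odds_ratio(0.83, -0.83)  # reported OR: 5.26

for name, value in [
    ("AI and clinician, 2017", or_combined_2017),
    ("AI only, 2017", or_ai_2017),
    ("100% accuracy, 2020", or_accuracy_2020),
]:
    print(f"{name}: {value:.2f}")
```

Because each OR is relative to the reference level, an OR above 1 indicates that the level was preferred to the reference, and an OR below 1 indicates the opposite.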
General estimated weighted importance of diagnosis attributes in 2017 and 2020.
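One widely used way to compute the relative importance of attributes in conjoint-style analyses is to divide each attribute’s range of effect coefficients by the sum of all ranges. The sketch below applies that convention to the rounded 2017 effect coefficients from the multinomial logit table. Whether the figure was produced in exactly this way is an assumption, and the table lists only five waiting-time levels, so the resulting shares are indicative only.

```python
def relative_importance(coefficients):
    """Share of each attribute's coefficient range in the total of all ranges."""
    ranges = {attr: max(vals) - min(vals) for attr, vals in coefficients.items()}
    total = sum(ranges.values())
    return {attr: r / total for attr, r in ranges.items()}


# Rounded 2017 effect coefficients, copied from the multinomial logit table.
coef_2017 = {
    "diagnosis method": [-0.15, 0.35, -0.20],
    "outpatient waiting time": [0.31, 0.12, -0.03, -0.08, -0.31],
    "diagnosis time": [0.05, -0.07, 0.02],
    "diagnosis accuracy": [-0.83, -0.35, 0.07, 0.32, 0.79],
    "follow-up after diagnosis": [0.20, -0.20],
    "diagnosis expenses": [0.42, 0.28, -0.01, 0.03, -0.24, -0.47],
}

importance = relative_importance(coef_2017)
ranked = sorted(importance, key=importance.get, reverse=True)
for attr in ranked:
    print(f"{attr}: {importance[attr]:.1%}")
```

Under this convention, diagnosis accuracy has the largest share and diagnosis expenses the second largest, which agrees with the ordering described in the text.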
In 2017, respondents were willing to pay ¥13.99 to receive combined diagnoses from AI and human clinicians. Additionally, people were not willing to pay for longer outpatient waiting times, but they were willing to pay for higher diagnosis accuracy (ie, ¥1.60 per 1% increase in accuracy). In 2020, respondents were willing to pay ¥0.79 to receive combined diagnoses from AI and human clinicians instead of clinician-only diagnoses. Compared to respondents’ WTP for certain diagnosis methods in 2017, respondents’ WTP in 2020 was lower. Furthermore, similar to the 2017 group, respondents in the 2020 group were also not willing to pay for longer outpatient waiting times. However, they were willing to pay for higher diagnosis accuracy.
After comparing the Akaike information criteria, Bayesian information criteria, and Akaike/Bayesian information criteria of the various potential classes, we chose three classes that were the most appropriate for the matched respondents in the 2017 and 2020 groups. The proportions of matched respondents from the 2017 group in each of the three classes were 43.2% (class 1: 228/528), 42.2% (class 2: 223/528) and 14.6% (class 3: 77/528). The proportions of matched respondents from the 2020 group in each of the three classes were 44.8% (class 1: 237/528), 48.2% (class 2: 254/528) and 7% (class 3: 37/528).
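Selecting the number of latent classes by information criteria can be sketched as follows. The formulas for the Akaike and Bayesian information criteria are standard, but the log-likelihoods, parameter counts, and the choice of 528 × 7 choice observations as the sample size are hypothetical values picked for illustration. They are not results from this study.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical fits: number of classes -> (log-likelihood, number of parameters).
candidates = {2: (-3500.0, 37), 3: (-3400.0, 56), 4: (-3390.0, 75)}
n_obs = 528 * 7  # assumption: 528 respondents who each answered 7 scenarios

aic_by_class = {k: aic(ll, p) for k, (ll, p) in candidates.items()}
bic_by_class = {k: bic(ll, p, n_obs) for k, (ll, p) in candidates.items()}

best_aic = min(aic_by_class, key=aic_by_class.get)
best_bic = min(bic_by_class, key=bic_by_class.get)
print(best_aic, best_bic)
```

Lower values indicate a better trade-off between fit and complexity. When the criteria disagree, the Bayesian criterion tends to favor fewer classes because its penalty grows with the sample size.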
With regard to class 1 (n=228),
Weighted importance of diagnosis attributes in 2017 and 2020, as determined by the latent class model.
According to our ORs for classes 1 and 2, the respondents in the 2017 group (Table S1 in
In classes 1 and 2, the respondents from the 2020 group (Table S2 in
Preference weights stratified by year (ie, 2017 and 2020) and class (ie, classes 1, 2, and 3), as determined by the latent class model.
We found that respondents’ WTP was highly consistent with the corresponding ORs of each attribute. In classes 1 and 2, the respondents from the 2017 group (
In classes 1 and 2, the respondents from the 2020 group (
Respondents’ WTPa in 2017.b
Attribute | Overall WTP (N=528), ¥ (US $) | WTP in class 1 (n=228), ¥ (US $) | WTP in class 2 (n=223), ¥ (US $) | WTP in class 3 (n=77), ¥ (US $)
Diagnosis method
Artificial intelligence and clinician | −13.99 (−2.24) | −3.03 (−0.48) | −0.22 (−0.04) | 0.31 (0.05)
Artificial intelligence | 1.50 (0.24) | −0.52 (−0.08) | 0.25 (0.04) | 1.22 (0.20)
Outpatient waiting time | 8.92 (1.43) | 0.62 (0.10) | 0.96 (0.15) | 0.53 (0.09)
Diagnosis time | −0.57 (−0.09) | 0.07 (0.01) | 0.07 (0.01) | −0.44 (−0.07)
Diagnosis accuracy | −1.14 (−0.18) | −0.44 (−0.07) | −2.85 (−0.46) | −1.20 (−0.19)
Follow-up after diagnosis | 11.32 (1.81) | 1.22 (0.20) | 0.95 (0.15) | 0.62 (0.10)
Diagnosis expenses | Reference | Reference | Reference | Reference
aWTP: willingness to pay.
bNegative currency values refer to the amount that respondents were willing to pay for another level.
Respondents’ WTPa in 2020.b
Attribute | Overall WTP (N=528), ¥ (US $) | WTP in class 1 (n=237), ¥ (US $) | WTP in class 2 (n=254), ¥ (US $) | WTP in class 3 (n=37), ¥ (US $)
Diagnosis method
Artificial intelligence and clinician | −0.79 (−0.13) | −0.17 (−0.03) | −1.33 (−0.21) | −1.31 (−0.21)
Artificial intelligence | 0.48 (0.07) | 0.54 (0.09) | 0.42 (0.07) | −1.62 (−0.26)
Outpatient waiting time | 0.38 (0.06) | 0.70 (0.11) | 0.19 (0.03) | 0.61 (0.10)
Diagnosis time | −0.05 (−0.01) | −0.04 (−0.01) | 0.004 (0.001) | 0.06 (0.01)
Diagnosis accuracy | −1.60 (−0.26) | −3 (−0.48) | −0.44 (−0.07) | −5.65 (−0.90)
Follow-up after diagnosis | 0.73 (0.12) | 1.46 (0.23) | 0.25 (0.04) | 2.31 (0.37)
Diagnosis expenses | Reference | Reference | Reference | Reference
aWTP: willingness to pay.
bNegative currency values refer to the amount that respondents were willing to pay for another level.
According to the LCM, which stratified data according to sex, male respondents in the 2017 group (
Weighted importance of diagnosis attributes in 2017 and 2020, as determined by the latent class model, which stratified data according to sex (ie, male and female respondents).
In this study, we collected information on people’s preferences for AI-based diagnosis by analyzing two different groups of individuals who were recruited in 2017 and 2020 (ie, before and during the COVID-19 pandemic). We used the PSM method to match two groups of respondents with similar demographic characteristics (ie, age, sex, and educational level). After comparing the demographically similar respondents in the 2017 and 2020 groups, we did not find any substantial differences in respondents’ preferences. Diagnosis accuracy and diagnosis expenses were the most important factors that influenced respondents’ preferences.
The success of a DCE questionnaire always depends on the response rate. In other words, people who actively click the website link and complete the questionnaire are essential for expanding sample sizes and the scope of a study. By using the PSM method, we were able to easily assess whether people’s preferences during normal times changed during unusual times (ie, the COVID-19 pandemic).
In this study, we used two different models: the MNL model and the LCM. Both models have various advantages and drawbacks with regard to quantifying respondents’ preferences. According to the general logit model for the propensity score–matched respondents, respondents in both groups consistently believed that accuracy was the most important diagnosis attribute, regardless of their preferences for diagnosis methods. Moreover, diagnosis expense was an important factor that influenced respondents’ decisions in both 2017 and 2020. Respondents believed that this attribute was the second most important attribute. The limited accessibility and availability of medical resources are major problems in China, especially in rural areas. These problems are the result of insufficient medical insurance distribution [
We found that people’s preferences for different diagnoses were largely similar. This indicates that people’s decisions and their preferences for different diagnoses are not considerably affected by pandemic-related factors. However, according to our LCM, there was slight heterogeneity in the preferences of different groups of respondents (eg, male and female respondents). This heterogeneity was not observed in the logit model. Although the weighted importance of accuracy remained consistent across all classes, it might not be the most important factor that affects people’s decisions. In class 1, the respondents from the 2017 and 2020 groups believed that diagnosis expense was the most important factor that affected their decisions, followed by diagnosis method. Based on the LCM results, male respondents in the 2017 and 2020 groups believed that diagnosis accuracy was the most important attribute to consider when choosing a diagnosis strategy.
With regard to attribute levels, we found that respondents typically preferred to receive a combined diagnosis from both AI and human clinicians over a diagnosis from a single source (ie, AI diagnoses or human clinician diagnoses). This is understandable, since respondents typically believed that diagnosis accuracy could be improved by combining different modes of diagnosis. Additionally, it should be noted that several respondents preferred longer diagnosis and outpatient queuing times. Although no studies have reported that diagnosis time and outpatient time correlate with diagnosis accuracy, it is possible that some patients prefer waiting for a doctor over receiving a quicker diagnosis, as they may believe that waiting results in more accurate diagnoses. The low accessibility and high price of AI services are important issues, especially in rural or low-income areas. Therefore, before pricing an AI technology–based service, it is advisable to survey residents and analyze their disposable income. With regard to residents in rural areas, governments should consider adding AI diagnoses to health insurance plans or related subsidy projects. Accuracy should also be considered, since companies should only promote and advertise products and services with high accuracy. When an AI technology–based service enters the market, relevant users should consider combining AI technology with human wisdom during the early stage of market penetration. Therefore, in the future, AI diagnosis technology developers should focus on improving diagnosis accuracy and reducing the cost of diagnoses to make such technology accessible to a wide range of patients.
Our study has several shortcomings and limitations, especially with regard to our data collection process. It was clear that our small sample size limited the power of our analyses. Additionally, our sample might not be representative of the entire Chinese population. Furthermore, the deployment/distribution of AI technology–based medical services is limited, especially in rural areas [
Our study shows that respondents’ preferences for AI clinicians in 2017 did not substantially differ from those in 2020. This suggests that people’s preferences for AI diagnoses and clinical diagnoses were largely unaffected by the COVID-19 pandemic. However, preferences for high diagnostic accuracy and low diagnosis expenses were evident, regardless of people’s preferences for diagnosis methods, waiting times, and follow-up services.
In summary, affordability and accuracy are the two principal factors that should be considered when promoting AI-based health care. The combination of AI-based and professional health care will be more easily accepted by the general public as AI technology develops.
Survey introduction.
Supplementary questionnaire.
Propensity score matching method.
Random utility model.
Supplementary tables.
AI: artificial intelligence
DCE: discrete choice experiment
LCM: latent class model
MNL: multinomial logit
OR: odds ratio
PSM: propensity score matching
WTP: willingness to pay
None declared.