This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
The unannounced standardized patient (USP) is the gold standard for primary health care (PHC) quality assessment, but its use is restricted by high human and resource costs. The virtual patient (VP) is a valid, low-cost software option for simulating clinical scenarios and is widely used in medical education. It is unclear whether VPs can be used to assess the quality of PHC.
This study aimed to examine the agreement between VP and USP assessments of PHC quality and to identify factors influencing the VP-USP agreement.
Eleven matched VP and USP case designs were developed based on clinical guidelines and were implemented in a convenience sample of urban PHC facilities in the capital cities of the 7 study provinces. A total of 720 USP visits were conducted, during which on-duty PHC providers who met the inclusion criteria were randomly selected by the USPs. The same providers underwent a VP assessment using the same case condition at least a week later. The VP-USP agreement was measured by the concordance correlation coefficient (CCC) for continuous scores and the weighted kappa for the ordinal diagnosis ratings.
Only 146 VP scores were matched with the corresponding USP scores. The CCC for medical history was 0.37 (95% CI 0.24-0.49); for physical examination, 0.27 (95% CI 0.12-0.42); for laboratory and imaging tests, –0.03 (95% CI –0.20 to 0.14); and for treatment, 0.22 (95% CI 0.07-0.37). The weighted
There was low agreement between VPs and USPs in PHC quality assessment. This may reflect the “know-do” gap. VP test results were also influenced by different case conditions, interactive design, and usability. Modifications to VPs and the reasons for the low VP-USP agreement require further study.
Improving primary health care (PHC) services is one approach to increasing universal health coverage [
The unannounced standardized patient (USP) is regarded as the gold standard to assess the quality of PHC services [
It is unknown whether assessments of quality based on VP agree with those based on USP. Prior studies mainly applied VP in medical education [
The current study belongs to a family of studies of PHC quality assessment in China based on the multicenter, nationwide ACACIA (Health Care Quality Cohort in China) study [
This multicenter, cross-sectional pilot study is part of the ACACIA family of studies. The ACACIA protocol has been published previously [
Altogether, 720 USP visits were conducted. On-duty PHC providers who met our criteria were randomly selected for USP visits. The PHC providers who were visited by the USPs received a VP assessment of the same cases at least a week later to prevent the practice effect [
Ethical approval was obtained from the Ethics Committee of Sun Yat-sen University (2017-007), and all PHC providers participating in the VP tests provided informed consent.
The USPs and VPs shared identical case designs to ensure consistency and simplify the development process. The selection and development process for these case designs was reported previously [
The USP actors all received at least one week of competency-based online-offline training and were assessed by specialists who were not members of the research team [
The VP was hosted on an online platform that could be accessed via a mobile phone or computer. The 5 VP modules used 3 different interface designs. For the medical history and diagnosis modules, the PHC providers were required to search for keywords with at least 2 characters to trigger relevant inquiries for selection. The physical examination module displayed all possible options. In the laboratory and imaging test module and the treatment module, some general options (eg, ordering blood tests or electrocardiograms) could be chosen directly, while specific options were made available after searching for keywords. All actions were recorded and uploaded online automatically.
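The keyword-triggered inquiry lookup described above can be sketched in a few lines. This is a hypothetical illustration, assuming a simple case-insensitive substring match over a small inquiry list; the function name, inquiry texts, and matching rule are ours, not the platform's actual implementation.

```python
# Hypothetical sketch of the medical history module's keyword search:
# queries of at least 2 characters return the matching inquiry options.
# The inquiry list and substring rule are illustrative assumptions.

INQUIRIES = [
    "duration of pain",
    "location of pain",
    "radiation of pain",
    "history of hypertension",
    "current medications",
]

def search_inquiries(keyword: str) -> list[str]:
    """Return inquiry options containing the keyword (case-insensitive).

    Queries shorter than 2 characters trigger nothing, mirroring the
    2-character minimum described for the medical history module.
    """
    if len(keyword) < 2:
        return []
    kw = keyword.lower()
    return [q for q in INQUIRIES if kw in q.lower()]
```

A query such as `search_inquiries("pain")` would surface all pain-related inquiries, whereas a single character returns nothing.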
For the field testing, PHC providers who agreed to participate in the VP tests received the VP installation package for their mobile phone or personal computer, along with a user demonstration video. For each PHC provider, the cases for the VP test were the same as those for the USP test. The VPs were masked to avoid bias from providers recognizing the tested cases. The VP tests included a training case that allowed the PHC providers to familiarize themselves with the operation of the system. No time limits were imposed on any of the VP tests, to avoid underestimating performance because of a lack of proficiency. To facilitate the use of the VPs, some tests were organized on-site, which may have produced results that differed from those of tests completed independently. Thus, the location and manner of the tests, the number of VP tests assigned to each PHC provider, and the age and sex of the providers were recorded for analysis.
The F1 score, recall, and precision were used as continuous measures of performance for the physical examination, laboratory and imaging test, and treatment modules [
In the equation and in
The results of the diagnoses were classified as ordinal variables in line with clinical guidelines and were rated as completely correct, partly correct, or incorrect.
Explanation of the relationship between test results and case design for virtual patients and unannounced standardized patients. Recall is the number of performed necessary actions divided by the number of necessary actions, while precision is the number of performed necessary actions divided by the number of performed actions.
|  | Performed actions | Unperformed actions |
| --- | --- | --- |
| Necessary actions | Number of performed necessary actions | Number of missing necessary actions |
| Unnecessary actions | Number of performed unnecessary actions | N/Aa |

aN/A: not applicable.
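The counts in the table map directly onto recall, precision, and the F1 score. As a minimal sketch, assuming each module's actions are represented as sets of labels (the function name and set representation are ours, not the study platform's):

```python
def action_scores(performed: set[str], necessary: set[str]) -> tuple[float, float, float]:
    """Recall, precision, and F1 score for one test module.

    recall    = performed necessary actions / necessary actions
    precision = performed necessary actions / performed actions
    F1        = harmonic mean of recall and precision
    """
    hits = len(performed & necessary)  # number of performed necessary actions
    recall = hits / len(necessary) if necessary else 0.0
    precision = hits / len(performed) if performed else 0.0
    if recall + precision == 0.0:
        return 0.0, 0.0, 0.0
    f1 = 2 * recall * precision / (recall + precision)
    return recall, precision, f1
```

For example, a provider who performs 2 of 3 necessary actions plus 1 unnecessary action scores 2/3 on recall, 2/3 on precision, and therefore 2/3 on F1.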
Characteristics of the PHC providers and VP test information are shown as the mean (SD) for continuous variables and percentages for categorical variables. CCC, which reflects the criterion validity of the VP tests [
Multiple linear regression was used to identify factors influencing VP-USP agreement. Using the VP tests as the dependent variable and USP tests as the independent variable, several multiple linear regression models were established, and the models were stepwise adjusted according to cases, characteristics of the PHC providers (ie, age, sex, and city), and test conditions (ie, test deployment and number of tests). Significant covariates in these models were controlled jointly in a fully adjusted model. Partial regression coefficients of the USP tests are reported. Statistical analysis was carried out using the R (version 4.0.5; R Foundation for Statistical Computing) packages stats (version 4.0.5), DescTools (version 0.99.43), and psych (version 2.1.9).
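Lin's CCC rewards both correlation and equality of means and variances between paired scores. The following is a simplified from-scratch sketch, not the DescTools implementation used in the analysis, assuming population (1/n) moments:

```python
def lin_ccc(x: list[float], y: list[float]) -> float:
    """Lin's concordance correlation coefficient for paired scores.

    CCC = 2*s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2),
    using population (1/n) variances and covariance. CCC reaches 1 only
    when the pairs fall exactly on the 45-degree line y = x.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx2 = sum((a - mx) ** 2 for a in x) / n
    sy2 = sum((b - my) ** 2 for b in y) / n
    return 2 * sxy / (sx2 + sy2 + (mx - my) ** 2)
```

Unlike the Pearson correlation, a constant offset lowers the CCC: `lin_ccc([1, 2, 3], [2, 3, 4])` is 4/7 even though the two series are perfectly correlated.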
The recruitment process is shown in
The characteristics of the PHC providers included in the analysis were as follows: over 80% (67/80) were between 30 and 50 years old and most were male (48/80, 60%). About 40% (35/80) of the PHC providers worked in Guangzhou. Test deployment type for the VP tests was mainly field-testing (59/80, 74%) and more than half (42/80, 53%) of the PHC providers were tested by a single case. The average VP test time was 13.49 (SD 9.33) minutes. The most frequently tested case design was low back pain, with 25 tests. The least frequently tested cases were asthma, gastritis, migraine, and postpartum depression, with 7 tests each. Details are shown in
Flow chart of the recruitment process. USP: unannounced standardized patient; VP: virtual patient.
Characteristics of primary health care providers (N=80).
| Characteristics | Categories | Values, n (%) |
| --- | --- | --- |
| Age (years) | <30 | 5 (6) |
|  | 30-50 | 67 (83) |
|  | ≥50 | 8 (10) |
| Sex | Male | 48 (60) |
|  | Female | 32 (40) |
| City | Changsha | 12 (15) |
|  | Xi’an | 7 (9) |
|  | Guangzhou | 35 (44) |
|  | Lanzhou | 8 (10) |
|  | Hohhot | 11 (14) |
|  | Guiyang | 5 (6) |
|  | Chengdu | 2 (3) |
| Case design (n=146 tests) | Angina | 9 (6) |
|  | Asthma | 7 (5) |
|  | Gastritis | 7 (5) |
|  | Cold | 16 (11) |
|  | Type 2 diabetes | 18 (12) |
|  | Diarrhea | 13 (9) |
|  | Hypertension | 13 (9) |
|  | Low back pain | 25 (17) |
|  | Migraine | 7 (5) |
|  | Postpartum depression | 7 (5) |
|  | Stress urinary incontinence | 24 (16) |
Virtual patient test situations (N=80).
| Categories | Values, n (%) |
| --- | --- |
| Test deployment |  |
| Field-testing | 59 (74) |
| Self-testing | 21 (26) |
| Number of VP tests per provider |  |
| 1 | 42 (53) |
| 2 | 22 (28) |
| ≥3 | 16 (20) |
Test outcomes and CCCs for the medical history, physical examination, laboratory and imaging tests, and treatment modules are listed in
Test outcomes and concordance correlation coefficients. Precision and F1 score could not be calculated for unannounced standardized patients for medical history due to missing consultation records.
| Test modules | Precision (SD), USPsa | Precision (SD), VPsb | Precision CCCc (95% CI) | Recall (SD), USPs | Recall (SD), VPs | Recall CCC (95% CI) | F1 score (SD), USPs | F1 score (SD), VPs | F1 score CCC (95% CI) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Medical history | —d | 0.51 (0.35) | — | 0.19 (0.15) | 0.13 (0.13) | 0.37 (0.24 to 0.49) | — | 0.18 (0.16) | — |
| Physical examination | 0.47 (0.50) | 0.25 (0.30) | 0.13 (0.01 to 0.26) | 0.11 (0.15) | 0.34 (0.31) | 0.04 (–0.05 to 0.13) | 0.17 (0.21) | 0.20 (0.19) | 0.27 (0.12 to 0.42) |
| Laboratory and imaging tests | 0.47 (0.49) | 0.45 (0.34) | 0.21 (0.03 to 0.38) | 0.18 (0.20) | 0.57 (0.36) | –0.06 (–0.15 to 0.03) | 0.25 (0.27) | 0.43 (0.27) | –0.03 (–0.20 to 0.14) |
| Treatment | 0.77 (0.41) | 0.45 (0.48) | 0.07 (–0.06 to 0.20) | 0.21 (0.18) | 0.20 (0.25) | 0.24 (0.10 to 0.39) | 0.31 (0.23) | 0.26 (0.31) | 0.22 (0.07 to 0.37) |
aUSP: unannounced standardized patient.
bVP: virtual patient.
cCCC: concordance correlation coefficient.
dNot available.
To explore factors that affected VP-USP agreement, we used multiple linear regression. For medical history, there was a significant correlation between VP and USP scores that remained stable after adjustment (ranging from 0.32 to 0.34,
Using stepwise variable selection in the fully adjusted model, all the correlations between VP and USP scores became weaker after adjustment. The partial regression coefficients were 0.314 (95% CI 0.183-0.445) for recall for the USPs for medical history; 0.071 (95% CI –0.090 to 0.023) for the F1 score for physical examination; –0.025 (95% CI –0.169 to 0.118) for the F1 score for laboratory and imaging tests; and 0.045 (95% CI –0.133 to 0.223) for the F1 score for treatment. Furthermore, for medical history, female sex (versus male) and working in Changsha or Lanzhou (versus Guangzhou) were negatively associated with recall for the VPs, while test time was positively associated with recall for the VPs. The F1 scores for the physical examination module and the laboratory and imaging test module were associated only with case design. The F1 score for treatment was associated only with case design and the cities where the PHC providers worked. Combining the results of these models revealed that the major influencing factors were case design and city. Details are shown in
The association between assessments using virtual patients and unannounced standardized patients using stepwise regression for each module.
| Test modules and variables | β (95% CI) | P value | Standardized β |
| --- | --- | --- | --- |
| **Medical history (recall for VPs)** |  |  |  |
| Recall for USPsa | .314 (.183 to .445) | <.001 | .351 |
| Female sex | –.049 (–.089 to –.009) | .02 | –.366 |
| City: Guangzhou | 0 (ref) |  |  |
| City: Changsha | –.067 (–.126 to –.009) | .03 | –.506 |
| City: Lanzhou | –.065 (–.129 to –.001) | .049 | –.489 |
| Test time | .002 (.001 to .004) | .006 | .205 |
| **Physical examination (F1 score for VPs)** |  |  |  |
| F1 score for USPs | .071 (–.090 to .023) | .39 | .080 |
| Case: Low back pain | 0 (ref) |  |  |
| Case: Cold | .203 (.100 to .306) | <.001 | 1.086 |
| Case: Gastritis | .169 (.028 to .311) | .02 | .907 |
| **Laboratory and imaging tests (F1 score for VPs)** |  |  |  |
| F1 score for USPs | –.025 (–.169 to .118) | .74 | –.025 |
| Case: Low back pain | 0 (ref) |  |  |
| Case: Cold | .206 (.074 to .337) | .003 | .768 |
| Case: Stress urinary incontinence | –.269 (–.401 to –.137) | <.001 | –1.005 |
| Case: Type 2 diabetes | –.289 (–.416 to –.161) | <.001 | –1.079 |
| Case: Gastritis | –.386 (–.543 to –.228) | .003 | –1.440 |
| **Treatment (F1 score for VPs)** |  |  |  |
| F1 score for USPs | .045 (–.133 to .223) | .62 | .034 |
| Case: Low back pain | 0 (ref) |  |  |
| Case: Cold | .432 (.300 to .563) | <.001 | 1.404 |
| Case: Hypertension | .419 (.283 to .555) | <.001 | 1.363 |
| Case: Type 2 diabetes | .350 (.223 to .477) | <.001 | 1.138 |
| Case: Postpartum depression | .173 (.003 to .344) | .05 | .564 |
| Case: Gastritis | –.176 (–.347 to –.006) | .05 | –.573 |
| Case: Stress urinary incontinence | –.189 (–.301 to –.078) | .001 | –.615 |
| Case: Migraine | –.198 (–.374 to –.023) | .03 | –.645 |
| City: Guangzhou | 0 (ref) |  |  |
| City: Lanzhou | –.190 (–.299 to –.082) | <.001 | –.619 |
| City: Xi’an | –.178 (–.313 to –.042) | .01 | –.577 |
aUSP: unannounced standardized patient.
Our study examined the agreement between using VPs and USPs to assess the quality of PHC in China. We found that the agreement between VP and USP results was low in general, which may result from the “know-do” gap. The VP tests might also have been influenced by different case conditions, different interface designs of the VPs, and the usability of the VPs.
We found that the agreement between VP and USP scores was low in our study sample. The USP scores were low in terms of recall, indicating that our study participants performed only some of the necessary actions, especially for the physical examination module and the laboratory and imaging test module. This suggests that PHC providers only partially performed the guideline-recommended checklist items in actual practice, which might be the result of a lack of incentives or limited time and resources [
Further analysis using multiple linear regression suggested that VP-based performance varied with case design, indicating that PHC providers’ competency differed with clinical case design. Specifically, variations for type 2 diabetes and gastritis were observed for the test modules and low scores were observed for laboratory and imaging tests for both case designs, while higher scores were seen for treatment for type 2 diabetes than for gastritis. This finding may indicate that PHC providers were more familiar with type 2 diabetes, which is commonly seen in PHC and has a distinctive medical history and physical signs. As a result, laboratory and imaging tests for type 2 diabetes were more likely to be omitted, while appropriate treatment was more likely to be conducted in the VP tests. In contrast, PHC providers might prefer to conduct simple physical examinations for the symptoms of abdominal pain, but they were reluctant to conduct the complex laboratory and imaging testing and treatment that should be offered in accordance with the clinical guidelines for gastritis [
Furthermore, the VP interface and usability also influenced the VP testing. Broadly, 2 types of interface were used in the VP testing: keyword searching for the medical history consultation and multiple-choice selection for the physical examination (the other modules used a mixture of both formats). Specifically, our results showed that the recall score for medical history for the VP testing was two-thirds of the score for USP testing, while the corresponding VP score for physical examination was more than twice the USP score. These findings indicate that multiple choices might provide more hints, allowing PHC providers to guess a correct action more easily [
The study had several limitations. First, as a purposive sampling approach was used, our research sample may be more likely to have included PHC providers who were receptive to technological innovations, and the extent to which our findings apply to providers who are less receptive needs verification. Although our user experience analysis showed promising results, only a few participants answered the user experience questionnaire. Second, due to substantial missing data for the sociodemographic characteristics of the PHC providers, we failed to identify any remaining influence of these factors on the agreement between VP and USP scores. Third, although we found that the VP interface was a key factor influencing the VP testing, we did not perform a direct comparison of different interfaces with the same disease module. Last, we used a summary score for each module to indicate the providers’ performance, assuming that individual consultation or action items had equal importance. Nevertheless, a hierarchic order may exist among consultation and action items that is specific to the disease conditions under consideration, such that a weighted score may have been better suited for quantifying the providers’ performance [
Our findings highlight the need for further modifications to the VP platform. To improve the design of the VPs to bring them as close as possible to real clinical conditions, strict testing time limits should be implemented to enhance the sense of time pressure. Besides this, the interactive design of the VPs should opt for keyword searching over multiple choices to minimize hints. The creation of clinical settings and the application of keyword searching can be enhanced via advanced technologies, such as virtual simulation, voice input, and fuzzy retrieval [
Moreover, to facilitate the implementation of VPs, an add-on program to widely used social software such as WeChat would be preferable to a separate application that requires installation. A short demonstration (of less than 5 minutes) of the main action steps of the VP should be embedded in the program and shown as a mandatory preview for first-time users. If needed, initial training with VP cases should also be provided, so that users can become familiar with the platform.
To better understand the agreement between VP and USP testing, future studies would benefit from systematically collecting information on potential factors contributing to differences between VPs and USPs, using both quantitative and qualitative approaches. For instance, the preferences of PHC providers for different VP interfaces could be examined with questionnaires or interviews to assess differences in perceived authenticity, cognitive load, and motivation [
The agreement between VP and USP testing for PHC quality assessment was low. This low agreement may mainly reflect the “know-do” gap, while the VP test results were also influenced by different case conditions, interface design, and usability. To improve VP usability in the resource-limited settings found in PHC, VPs should be modified to be more user-centered, paying attention to the balance between enhancing usability and avoiding hints. Factors influencing the agreement between VP and USP testing need further study.
Case development, modification, and validity.
Example of REDCap.
Classification of diagnosis.
Regression results for different adjustments.
CCC: concordance correlation coefficient
PHC: primary health care
USP: unannounced standardized patient
VP: virtual patient
The project received support from the China Medical Board (18-300), the National Natural Science Foundation of China (71974211), the Swiss Agency for Development and Cooperation (81067392), and the National Key R&D Program of China (2021ZD0113401). The authors thank the unannounced standardized patient team for the case designs and field tests and JL for proofreading the manuscript. The funders of this study had no role in study design, data collection, data analysis, data interpretation, or writing of the report.
Owing to data safety and property rights considerations, the data sets generated and analyzed during the current study are not publicly accessible but are available from the corresponding author upon reasonable request.
JL, YC, HL, YL, XW, and DX co-designed the study. JL, DX, and YC requested the funding. JL, YC, and DX take responsibility for project administration. All authors conducted or managed the research and investigation process. MZ, YC, and JL took part in virtual patient development. MZ, YC, and JL designed the methodology. MZ, JL, YC, and JC carried out or supported the data analysis, including the statistical analyses. YC, MZ, JL, and JC take responsibility for the integrity of the data and the accuracy of the data analysis, with each author having been responsible for specific parts of the raw data set. MZ wrote the manuscript. QH and JL reviewed the manuscript. All authors critically reviewed the manuscript and contributed to the interpretation of the results. All authors had access to the data, critically reviewed the paper before publication, and take final responsibility for the decision to submit the research for publication.
None declared.