This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Symptom checkers (SCs) are tools developed to provide clinical decision support to laypersons. Apart from suggesting probable diagnoses, they commonly advise when users should seek care (
This study aims to compare triage accuracy between SCs and their potential users (ie, laypersons).
On Amazon Mechanical Turk, we recruited 91 adults from the United States who had no professional medical background. In a web-based survey, the participants evaluated 45 fictitious clinical case vignettes. Data for 15 SCs that had processed the same vignettes were obtained from a previous study. As main outcome measures, we assessed the accuracy of the triage assessments made by participants and SCs for each of the three triage levels (ie,
The mean overall triage accuracy was similar for participants (60.9%, SD 6.8%; 95% CI 59.5%-62.3%) and SCs (58%, SD 12.8%). Most participants outperformed all but 5 SCs. On average, SCs more reliably detected emergencies (80.6%, SD 17.9%) than laypersons did (67.5%, SD 16.4%; 95% CI 64.1%-70.8%). Although both SCs and participants struggled with cases requiring self-care (the least urgent triage category), SCs more often wrongly classified these cases as emergencies (43/174, 24.7%) compared with laypersons (56/1365, 4.10%).
Most SCs had no greater triage capability than an average layperson, although the triage accuracy of the five best SCs was superior to the accuracy of most participants. SCs might improve early detection of emergencies but might also needlessly increase resource utilization in health care. Laypersons sometimes require support in deciding when to rely on self-care, but it is in that very situation that SCs perform the worst. Further research is needed to determine how to best combine the strengths of humans and SCs.
Patients obtain health-related information from health care professionals, but more frequently, information for patients is provided in print; on the web; and, most recently, via smartphone apps. Patients not only use these resources to supplement information received from health care professionals but also as a decision-support tool to advise them on whether and where to seek adequate health care, especially as health care pathways grow more complex. Symptom checkers (SCs) are tools developed to provide support to laypersons. Users can enter their complaints and, with some SCs, demographic or health-related information (eg, age, sex, and past medical history) to obtain advice on the urgency of their complaints (
Despite their popularity, there is no established framework to evaluate the performance of SCs [
A study showing that laypersons are just as capable of predicting criminal recidivism as a complex commercial algorithm [
In addition to advising the individual user, SCs are also said to have the potential to reduce the burden on health care services. Unfortunately, not only has this potential benefit not materialized yet [
The purpose of this study is to benchmark the triage capability of SCs against that of their potential users, that is, laypersons.
This study was approved by the Ethics Committee of the Department of Psychology and Ergonomics (Institut für Psychologie und Arbeitswissenschaft) at Technische Universität Berlin (tracking number: FEU_03_20180615). Participants volunteered to participate in the survey, and informed consent was required.
This investigation builds on a prior study by Semigran et al [
All participants were US residents, at least 18 years of age, and had no professional medical background. Our investigation was limited to US residents, as the triage level definitions and the gold standard solutions assigned to the case vignettes by Semigran et al [
We created an online survey with UNIPARK (QuestBack GmbH) [
Participants were asked to classify each vignette into 1 of 3 triage categories, as defined by Semigran et al [
The 45 standardized case vignettes included 15 cases for each triage level. The vignettes, as chosen by Semigran et al [
For the purpose of our study, the vignettes were adapted to make them more comprehensible to lay individuals. First, we transformed the bullet points into complete sentences. Second, we paraphrased technical terms. For example, we replaced “rhinorrhea” with “runny nose” and “tender” with “painful to the touch.” In very few cases, explanations required elaboration. Our overall aim was to provide participants with the same information used by Semigran et al [
Understandability and paraphrasing were cross-validated by two native English speakers: one was a medical professional (RM) and the other was without a professional medical background (MALS). The adapted vignettes are shown in
We recruited the participants through Amazon Web Service
Two methods were employed to ensure that the participants paid close attention to the survey questions. First, we added 5 attention checks to the set of 45 case vignettes. These attention checks were formatted similarly to the case vignettes but included prompts to choose specific answer options. Participants were excluded from the analysis if they answered any of the 5 attention checks incorrectly. Second, upon completion of the survey, participants were asked to affirm that they were attentive and honest to improve the reliability of our data, as suggested in a reliability analysis on MTurk data [
The survey on MTurk was published on 3 different days (March 21, 2020, at 2 PM Pacific Daylight Time [PDT]; March 22, 2020, at 1:45 PM PDT; and March 29, 2020, at 1 PM PDT). By selecting weekend days and early afternoon times (PDT), we attempted to reach an MTurk population as diverse as possible, following a 2017 study on the intertemporal variation of the MTurk population [
The sample size was ultimately determined by the availability of funds and by the number of participants who performed well enough to earn a bonus.
Data were cleaned and explored using
Following Semigran et al [
To assess the effects of demographic variables (age, sex, and educational level), a logistic regression was performed with the correct triage of a case vignette as a dependent variable. We calculated 95% CIs for the marginal probabilities of the fixed effects using the Wald method to assess whether demographic variables had a significant effect on participants’ accuracy. The α level was set at .05.
For the comparison of SCs and participants, we compared participants with (1) all rated SCs in aggregate and (2) each individual SC.
The performance of the SCs was obtained from the appendix of the audit study by Semigran et al [
Following Semigran et al [
The risk aversion of the SCs and the participants was determined using the ratio of overtriaged vignettes to undertriaged vignettes. We deemed a ratio greater than 1:1, that is, more case vignettes overtriaged than undertriaged, as risk averse. To determine what type of triage mistakes were most likely to occur, we calculated the proportion of triage recommendations given in each triage category by SCs and by participants (eg, the proportion of evaluations in which participants recommended emergency care when self-care was appropriate or the proportion of evaluations in which SCs recommended nonemergency care when emergency care would have been the correct solution) and compared these proportions using the Pearson
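The overtriage-to-undertriage ratio described above can be sketched in a few lines. This is a minimal illustration, not the authors' analysis code: the level labels, the function names `classify` and `risk_profile`, and the representation of an evaluation as a (recommended, gold standard) pair are all assumptions made for the example.

```python
from collections import Counter

# Triage levels ordered from least to most urgent (labels are illustrative).
URGENCY = {"self-care": 0, "nonemergency": 1, "emergency": 2}


def classify(recommended, gold):
    """Label one evaluation as correct, overtriage, or undertriage."""
    diff = URGENCY[recommended] - URGENCY[gold]
    if diff > 0:
        return "overtriage"   # more urgent than the gold standard
    if diff < 0:
        return "undertriage"  # less urgent than the gold standard
    return "correct"


def risk_profile(evaluations):
    """Summarize a list of (recommended, gold) pairs.

    A rater counts as risk averse when overtriage errors outnumber
    undertriage errors, i.e. the ratio exceeds 1:1.
    """
    counts = Counter(classify(rec, gold) for rec, gold in evaluations)
    over, under = counts["overtriage"], counts["undertriage"]
    # The ratio is undefined when there are no undertriage errors.
    ratio = over / under if under else None
    return {
        "correct": counts["correct"],
        "overtriage": over,
        "undertriage": under,
        "ratio": ratio,
        "risk_averse": over > under,
    }


example = [
    ("emergency", "self-care"),        # overtriage
    ("emergency", "nonemergency"),     # overtriage
    ("self-care", "nonemergency"),     # undertriage
    ("nonemergency", "nonemergency"),  # correct
]
profile = risk_profile(example)
```

Applied to the aggregate SC counts reported in the Results (182 overtriaged and 49 undertriaged evaluations), the same ratio is roughly 3.7:1.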
To analyze whether SCs and participants were challenged by the same case vignettes, the degree of difficulty of a case was calculated using the proportion of SCs and participants correctly triaging it. For example, if a case vignette was solved correctly by every SC, the vignette’s degree of difficulty for SCs was 100%. SCs that did not evaluate the respective case vignette for technical reasons were not included in the denominator. A linear correlation analysis was then conducted to determine the relationship between case difficulty for SCs and case difficulty for participants.
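As a rough sketch of this calculation, the snippet below computes the share of raters who solved a vignette correctly, leaving out raters that did not evaluate it, and then a Pearson correlation coefficient between two difficulty series. The paper only states that a linear correlation analysis was conducted, so the use of Pearson's r, the function names, and the encoding of a missing evaluation as `None` are assumptions.

```python
from math import sqrt


def case_difficulty(results):
    """Percentage of raters who triaged one vignette correctly.

    `results` holds True/False per rater, or None when the rater (eg, an
    SC) did not evaluate the vignette; None entries are left out of the
    denominator. Returns None if nobody evaluated the vignette.
    """
    rated = [r for r in results if r is not None]
    if not rated:
        return None
    return 100.0 * sum(rated) / len(rated)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Toy data: three vignettes, difficulty per group in percent.
sc_difficulty = [case_difficulty(v) for v in (
    [True, True, None, False],
    [True, True, True, True],
    [False, None, False, True],
)]
participant_difficulty = [60.0, 95.0, 20.0]
r = pearson_r(sc_difficulty, participant_difficulty)
```

Note that, under the definition used here, a higher percentage means the vignette was solved correctly more often.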
As users are likely to use only one or very few SCs, an aggregated analysis alone provides no basis for recommending for or against the use of SCs. Therefore, additional analyses compared the performance of the participant group with each SC. Considering that most SCs did not evaluate every case vignette (due to technical reasons, see the study by Semigran et al [
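One way to picture the per-SC comparison is to score each participant only on the vignettes that the respective SC actually evaluated and then count how many participants exceed the SC's accuracy. The sketch below assumes that design; the data layout (dictionaries keyed by vignette ID), the function names, and the strict "greater than" rule for outperforming are illustrative choices, not details confirmed by the paper.

```python
def accuracy(correct_by_vignette, vignette_ids):
    """Share of the given vignettes that were triaged correctly."""
    hits = [correct_by_vignette[v] for v in vignette_ids]
    return sum(hits) / len(hits)


def compare_with_sc(sc_results, participants):
    """Compare one SC with all participants on the SC's own vignettes.

    sc_results:   dict vignette_id -> bool, only for vignettes the SC evaluated
    participants: list of dict vignette_id -> bool covering all vignettes
    Returns the SC accuracy, the participants' accuracies on the same
    subset, and the share of participants who outperform the SC.
    """
    evaluated = list(sc_results)
    sc_accuracy = accuracy(sc_results, evaluated)
    participant_accuracies = [accuracy(p, evaluated) for p in participants]
    outperforming = sum(a > sc_accuracy for a in participant_accuracies)
    return sc_accuracy, participant_accuracies, outperforming / len(participants)


# Toy data: the SC evaluated vignettes 1-4 but not vignette 5.
sc = {1: True, 2: False, 3: True, 4: False}
people = [
    {1: True, 2: True, 3: True, 4: True, 5: True},
    {1: True, 2: False, 3: False, 4: False, 5: True},
    {1: True, 2: True, 3: False, 4: False, 5: False},
]
sc_acc, people_acc, share_better = compare_with_sc(sc, people)
```

Restricting both sides to the same vignettes keeps the comparison fair when an SC could not process some cases.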
Our survey was accessed 142 times during the 3 days on which it was available. In total, 51 participants were excluded, either for failing attention checks (n=41) or for not fulfilling the eligibility criteria (n=10). All the remaining participants affirmed that they had paid close attention during the survey and answered honestly. This yielded a total of 91 participants, each having assessed all 45 case vignettes, which totaled 4095 case evaluations by participants, 1365 for each triage level (
The median time for completion of the survey (excluding the time for obtaining informed consent) was 20 minutes and 12 seconds (first quartile: 15 minutes and 43 seconds; third quartile: 27 minutes and 23 seconds). There was no significant difference in the participants’ mean accuracy between the 3 sampling days. We detected no statistically significant influence of demographic variables on participants’ triage accuracy.
Participant characteristics (N=91).
Characteristics | Values, n (%) unless otherwise noted
Age (years), median (range) | 37 (20-73)
Female | 36 (40)
Male | 55 (60)
Non–high school graduate | 0 (0)
High school graduate | 18 (20)
Some college | 33 (36)
Bachelor’s degree | 36 (40)
Graduate degree | 4 (4)
Recently^a consulted an SC | 20 (22)
Recently^a faced triage decision | 23 (25)
Neither faced triage decision nor consulted an SC recently^a | 62 (69)
No training | 80 (88)
Basic first aid training | 11 (12)
^a Recent was defined as “in the last 6 months.”
Overall, the participants triaged 3 out of 5 case vignettes correctly (2462/4065, 60.57%), and most participants qualified for the bonus payment (56/91, 62%). Their mean accuracy varied with triage level: it was roughly balanced for emergency and nonemergency situations (67.5% and 68.4%, respectively) but dropped below 50% for self-care vignettes. Of the 39.43% (1603/4065) of incorrect assessments, the majority (956/4065, 23.52%) were
As most SCs were unable to evaluate at least one of the case vignettes, the 15 SCs assessing the 45 case vignettes yielded only 532 case evaluations (see the study by Semigran et al [
At the aggregate level, SCs (58.0%; SD 12.8%) and participants (60.9%; SD 6.8%) showed very similar mean accuracies (
Mean triage accuracy of symptom checkers and participants.
Triage level | All 15 SCs^a, percent triage accuracy, mean (SD) | Subset of 11 SCs^b, percent triage accuracy, mean (SD) | Participants^c, percent triage accuracy, mean (SD) | Participants, 95% CI
Emergency cases | 80.6 (17.9) | 79.8 (17.2) | 67.5 (16.4) | 64.1-70.8
Nonemergency cases | 58.5 (29.1) | 61.6 (27.8) | 68.4 (13.8) | 65.6-71.2
Self-care cases | 30.6 (25.7) | 41.8 (20.3) | 46.7 (15.9) | 43.4-49.8
Overall | 58.0 (12.8) | 61.6 (11.0) | 60.9 (6.8) | 59.5-62.3
^a SC: symptom checker.
^b For the subset of 11 SCs, SCs never recommending self-care or always recommending emergency care by design were excluded.
^c For the participant sample, 95% CIs were calculated using bootstrapping.
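The table notes that the participants' 95% CIs were obtained by bootstrapping, without naming the variant. As a hedged illustration, the sketch below uses a simple percentile bootstrap of the mean; the function name, the number of resamples, the fixed seed, and the toy accuracy values are all assumptions for the example and do not reproduce the study's data.

```python
import random


def bootstrap_ci_mean(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean.

    Resamples `values` with replacement, computes the mean of each
    resample, and returns the alpha/2 and 1 - alpha/2 percentiles of
    those means. The seed only makes the example reproducible.
    """
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy per-participant accuracies (hypothetical values).
accuracies = [0.50, 0.60, 0.70, 0.55, 0.65, 0.62, 0.58, 0.61]
low, high = bootstrap_ci_mean(accuracies)
```

With one accuracy value per participant, the returned pair plays the role of the 95% CI columns in the tables.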
The SCs were risk averse and overtriaged in more than a third of the evaluations (182/532, 34.2%), whereas undertriaging occurred in only 9.2% (49/532). Although participants also tended to be risk averse, this tendency was less pronounced (
Triage evaluations by participants and SCs and triage level. “11 SCs” refers to the SC sample after exclusion of SCs that never recommend self-care (the least urgent triage level). SC: symptom checker.
How challenging a case vignette was for SCs and participants varied widely: 3 vignettes were solved correctly by every SC and 1 vignette by none. Similarly, 4 vignettes were solved correctly by more than 90% of the participants and 2 by less than 10%. At every triage level, a broad variation in the degree of difficulty among case vignettes was observed. A very weak or no relationship could be detected for SCs and participants regarding case difficulty within each triage level (
Distribution of case difficulty for participants and SCs. Case difficulty is defined as the proportion of the group (SC or participants) evaluating the respective case correctly. The dashed line models a linear relationship. SC: symptom checker.
As previously mentioned, an aggregated analysis of SCs is less meaningful than a direct comparison between the participant population and each SC, as users are likely to consult only one or very few SCs. The overall trend shows that the accuracy of both participants and SCs decreases for self-care vignettes (
A total of 5 SCs (
All but 2 SCs (
Accuracy of SCs and participants by triage level: emergency (Em), nonemergency, and self-care (S-c). The accuracy of individual participants is indicated with blue dots. The aggregate accuracies of participants are shown as box plots. Em: emergency; SC: symptom checker; S-c: self-care.
Comparison of accuracy between symptom checkers and participants.
SC^a,b name (n = vignettes evaluated) | SC accuracy^c, n (%) | Participants’ percent accuracy^d,e, mean (SD) | Participants’ 95% CI | Percentage of participants outperforming the SC (95% CI)^d,e
HMS^f Family Health Guide, n=40 | 32 (80) | 59.5 (7.1) | 58.0-60.9 | 0 (0-0)
Healthy Children, n=15 | 11 (73) | 49.9 (10.1) | 47.7-52.1 | 1.1 (0-3.3)
Steps2Care, n=42 | 30 (71) | 59.7 (7.2) | 58.2-61.1 | 1.1 (0-3.3)
Symptify, n=40 | 28 (70) | 60.2 (7.2) | 58.2-61.7 | 5.5 (1.1-11.0)
Symptomate^g, n=14 | 9 (64) | 60.9 (11.6) | 58.6-63.2 | 26.4 (17.6-35.2)
Drugs.com, n=42 | 25 (59) | 60.6 (6.5) | 59.3-61.9 | 51.6 (41.8-61.5)
FreeMD, n=44 | 26 (59) | 60.2 (6.7) | 58.9-61.6 | 56.0 (45.1-65.9)
Doctor Diagnose, n=16 | 10 (62) | 69.5 (10.9) | 67.3-71.7 | 63.7 (53.8-73.6)
Family Doctor, n=41 | 22 (53) | 58.1 (7.0) | 56.7-59.6 | 68.1 (58.2-78.0)
Early Doc, n=17 | 9 (52) | 63.4 (11.4) | 61.1-65.7 | 76.9 (68.1-85.7)
Isabel^g, n=45 | 23 (51) | 60.9 (6.8) | 59.4-62.2 | 89 (82.4-94.5)
NHS^h, n=44 | 23 (52) | 62.0 (6.9) | 60.9-63.4 | 89 (82.4-94.5)
Symcat^g, n=45 | 20 (44) | 60.9 (6.8) | 59.5-62.2 | 97.8 (94.5-100)
Healthwise, n=44 | 19 (43) | 61.2 (7) | 59.7-62.6 | 98.9 (96.7-100)
iTriage^g,i, n=43 | 14 (32) | 60.5 (6.9) | 59.1-61.9 | 100 (100-100)
^a SC: symptom checker.
^b SCs are listed in order by the proportion of participants outperforming them.
^c Most SCs did not evaluate every case vignette. Their accuracy is given as the proportion of correctly solved vignettes of the total vignettes that they evaluated.
^d The participants’ accuracy is based on their assessment of the same case vignettes assessed by the respective SC.
^e For the participant sample, 95% CIs were calculated using bootstrapping.
^f HMS: Harvard Medical School.
^g Four SCs were apparently designed never to recommend self-care.
^h NHS: National Health Service.
^i One SC advised seeking emergency care for all case vignettes.
Comparison of the overtriage inclination of symptom checkers (SCs) and participants. The dashed line shows where the proportions of overtriage and undertriage errors are equal. Proximity to the lower left corner indicates high triage accuracy. The red dot marks the respective symptom checker. The faded blue dots refer to the performance of individual participants. The larger blue dot marks their average performance. The SCs are ordered from left to right and top to bottom by the proportion of participants outperforming them, with the lowest proportion at the top left and the highest proportion at the bottom right.
Our study suggests that an average SC has no greater overall triage accuracy than an average user. However, this does not imply that SCs are not useful. Specifically, our data confirm a prior study showing that the lay population has difficulties reliably identifying medical emergencies [
Most SCs tended to overtriage. From a clinical and legal perspective, it can make sense to accept the resulting inflated cost of false alarms to avoid potentially missing an emergency (
Studies on the effects of SC advice on users are scarce. Therefore, general recommendations on whether laypersons should use SCs cannot be formulated as yet. On the basis of a detailed analysis of the performance variation among SCs and human decision makers, we showed that the five best SCs that Semigran et al [
Finally, SCs and participants struggled with different kinds of case vignettes, that is, SCs performed poorly in some clinical situations, whereas in others, their performance was superior to that of their users. For example, the 15 pediatric cases evaluated by the SC
Compared with the general population of the United States [
The data on SCs date back to a study published in 2015 [
As we built our study on the materials of the Semigran et al study [
In general, assessing triage capability with case vignettes has limited validity. This limitation is arguably greater for human participants than for SCs. Although SCs assess a case with a set algorithm and are therefore dependent only on input, contextual (social, emotional, etc) factors play a greater role in human decision making. In a real-life setting, humans might also notice and process more or less information than presented in a case vignette. In addition, reading “back pain” in a dry case vignette is surely a different matter than experiencing it. Thus, our results might be more valid for situations where SC users utilize the tool to triage someone other than themselves. Research shows that this is common practice, as up to 50% of online health information seekers do so on behalf of someone else [
Prior publications have emphasized the need for a framework within which the safety and usefulness of SCs should be analyzed. Assessing the average performance of SCs, as has often been done, fosters few actionable recommendations. Given the high performance variability among SCs, we consider benchmarking with case vignettes a valuable first step in identifying the best SCs, which could then be tested extensively against relevant competitors.
Although comparing SCs’ triage capability against that of health care professionals is certainly useful [
Following this approach, our study suggests that the lay population would benefit from some SCs to some extent. Although SCs detect emergencies more reliably than the average user, they are more risk averse than the general population and recommend emergency care more often than is actually necessary. This is a cause for concern, as it might unnecessarily increase the burden on already overwhelmed health care services. Thus, advice on when not to seek emergency care would be the most useful feature of SCs, but it is precisely in that situation that they performed the worst. Further research should investigate which user groups benefit the most from using SCs and whether it is possible to identify the characteristics of scenarios where laypersons are superior to SCs in assessing triage levels. The detailed analyses presented in this paper provide a first step toward a framework for comparatively assessing the respective weaknesses and strengths of both SCs and human decision makers to be able to determine when humans should rely on SCs rather than on their gut feeling and vice versa.
Adapted case vignettes and case difficulty level.
HMS: Harvard Medical School
MTurk: Mechanical Turk
PDT: Pacific Daylight Time
SC: symptom checker
The authors express their gratitude to the participants, to Felix Grün for his support in designing the questionnaire and for his valuable feedback, to Eike Richter for his advice on statistical methods, and to Frances Lorié for proofreading the manuscript. The project was funded by the home institutions of the last authors (MF and FB). No external funding was required for this study. The authors acknowledge support from the German Research Foundation (DFG) and the Open Access Publication Fund of Charité – Universitätsmedizin Berlin.
MS conceived the study, created the questionnaire, designed and conducted the analyses, and wrote the first draft of the paper. MALS assisted with case vignette adaptations. RM assisted with case vignette adaptations and manuscript development. FB and MF provided critical input and advised on the study and questionnaire design, analysis methods, and drafts of the paper. FB and MF contributed equally and share the last authorship. All authors accept full responsibility for the final version of the paper.
The lead author affirms that this manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned (and, if relevant, registered) have been explained.
All authors have completed the International Committee of Medical Journal Editors uniform disclosure form and declare no support from any organization for the submitted work; no financial relationships with any organizations that might have an interest in the submitted work in the previous 3 years; and no other relationships or activities that could appear to have influenced the submitted work.