This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Dialog agents (chatbots) have a long history of application in health care, where they have been used for tasks such as supporting patient self-management and providing counseling. Their use is expected to grow with increasing demands on health systems and improving artificial intelligence (AI) capability. Approaches to the evaluation of health care chatbots, however, appear to be diverse and haphazard, resulting in a potential barrier to the advancement of the field.
This study aims to identify the technical (nonclinical) metrics used by previous studies to evaluate health care chatbots.
Studies were identified by searching 7 bibliographic databases (eg, MEDLINE and PsycINFO) in addition to conducting backward and forward reference list checking of the included studies and relevant reviews. The studies were independently selected by two reviewers who then extracted data from the included studies. Extracted data were synthesized narratively by grouping the identified metrics into categories based on the aspect of chatbots that the metrics evaluated.
Of the 1498 citations retrieved, 65 studies were included in this review. Chatbots were evaluated using 27 technical metrics, which were related to chatbots as a whole (eg, usability, classifier performance, speed), response generation (eg, comprehensibility, realism, repetitiveness), response understanding (eg, chatbot understanding as assessed by users, word error rate, concept error rate), and esthetics (eg, appearance of the virtual agent, background color, and content).
The technical metrics of health chatbot studies were diverse, with survey designs and global usability metrics dominating. The lack of standardization and paucity of objective measures make it difficult to compare the performance of health chatbots and could inhibit advancement of the field. We suggest that researchers more frequently include metrics computed from conversation logs. In addition, we recommend the development of a framework of technical metrics with recommendations for specific circumstances for their inclusion in chatbot studies.
The potential of human-computer dialog to provide health care benefits has been perceived for many decades. In 1966, Weizenbaum’s ELIZA system caught the public imagination with its imitation of a psychotherapist through the relatively simple linguistic token manipulation possible at the time [
With the advent of smartphones, the distribution of highly interactive chatbots has been greatly facilitated, particularly with the ubiquitous use of app stores and wide installation of chat apps that can include chatbots, notably Facebook Messenger. Chatbots, as with other electronic health (eHealth) interventions, offer scalability and 24-hour availability to plug gaps in unmet health needs. For example, Woebot delivers cognitive behavior therapy and has been tested with students with depression [
To be an evidence-based discipline requires measurement of performance. The impact of health chatbots on clinical outcomes is the ultimate measure of success. For example, did the condition (eg, depression, diabetes) improve to a statistically significant degree on an accepted measure (eg, PHQ-9 [
As an alternative and useful precursor to clinical outcome metrics, technical metrics concern the performance of the chatbot itself (eg, did participants feel that it was usable, give appropriate responses, and understand their input?). Appropriateness refers to the relevance of the provided information in addressing the problem prompted [
Previously, we had introduced a framework for evaluation measures of health chatbots to provide guidance to developers [
To achieve the aforementioned objective, a scoping review was conducted. To conduct a transparent and replicable review, we followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) Extension for Scoping Reviews (PRISMA-ScR) guidelines [
For this review, we searched the following bibliographic databases from November 1 to 3, 2019: MEDLINE (via EBSCO), EMBASE (Excerpta Medica Database; via Ovid), PsycINFO (via Ovid), CINAHL (Cumulative Index of Nursing and Allied Health Literature; via EBSCO), IEEE (Institute of Electrical and Electronics Engineers) Xplore, ACM (Association for Computing Machinery) Digital Library, and Google Scholar. We screened only the first 100 hits retrieved by searching Google Scholar, as it usually retrieves several thousand references ordered by their relevance to the search topic. We checked the reference lists of the included studies to identify further studies relevant to the current review (ie, backward reference list checking). Additionally, we used the “cited by” function available in Google Scholar to find and screen studies that cited the included studies (ie, forward reference list checking).
The search terms were derived from previously published literature and the opinions of informatics experts. For health-related databases, we used search terms related to the intervention of interest (eg, chatbot, conversational agent, and chat-bot). In addition to intervention-related terms, we used terms related to the context (eg, health, disease, and medical) for non–health-related databases (eg, IEEE and ACM digital library).
The intervention of interest in this review was chatbots that are aimed at delivering health care services to patients. Chatbots implemented in stand-alone software or web-based platforms were included. However, we excluded chatbots operated by a human (Wizard-of-Oz) or integrated into robotics, serious games, SMS text messaging, or telephone systems. To be included, studies had to report a technical evaluation of a chatbot (eg, usability, classifier performance, and word error rate). We included peer-reviewed articles, dissertations, and conference proceedings, and we excluded reviews, proposals, editorials, and conference abstracts. This review included studies written in the English language only. No restrictions were considered regarding the study design, study setting, year of publication, and country of publication.
Authors MA and ZS independently screened the titles and abstracts of all retrieved references and then independently read the full texts of the included studies. Any disagreements between the two reviewers were resolved by AA. We assessed intercoder agreement by calculating the Cohen kappa (κ), which was 0.82 for screening titles and abstracts and 0.91 for reading full texts, indicating very good agreement [
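As a minimal illustration of this calculation (not the authors' code), the Cohen kappa for two reviewers' include/exclude decisions can be computed as follows; the screening decisions shown are hypothetical.

```python
# Minimal sketch: Cohen kappa for two reviewers' screening decisions.
# The decision lists below are hypothetical, for illustration only.
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen kappa for two raters labeling the same items (nominal labels)."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical screening decisions for 8 titles/abstracts
reviewer_1 = ["include", "exclude", "exclude", "include", "exclude", "include", "exclude", "exclude"]
reviewer_2 = ["include", "exclude", "include", "include", "exclude", "include", "exclude", "exclude"]
print(round(cohen_kappa(reviewer_1, reviewer_2), 2))  # 0.75 on this toy data
```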
To conduct a reliable and accurate extraction of data from the included studies, a data extraction form was developed and piloted using 8 included studies (
A narrative approach was used to synthesize the extracted data. After identifying all technical metrics used by the included studies to evaluate chatbots, we divided them into 4 categories based on the aspect of chatbots that the metrics evaluate. The 4 categories were formed after a discussion by the authors in which consensus was reached. For each metric, we identified how the studies measured it. Data synthesis was managed using Microsoft Excel (Microsoft Corporation).
By searching the 7 electronic databases, 1498 citations were retrieved. After removing 199 (13.3%) duplicates of these citations, 1299 (86.7%) titles and abstracts were screened. The screening process resulted in excluding 1113 (74.3%) titles and abstracts due to several reasons detailed in
Flowchart of the study selection process.
Characteristics of the included studies are detailed in
The sample size was reported in 61 studies, and 38 studies (62%) had 50 or fewer participants. In 44 studies, the age of participants was reported; the mean age of participants was 39 years, with a range of 13-79 years. Sex of participants was reported in 54 studies, where the mean percentage of males was 48.1%. Of the 62 studies that reported participants’ health conditions, 34 (54.8%) studies recruited participants from a clinical population (ie, those with health issues). Participants were recruited from clinical settings (n=30, 49.2%), community (n=20, 32.8%), and educational settings (n=18, 29.5%). Metadata and population characteristics of each included study are presented in
Chatbots were used for self-management (n=17, 26.2%), therapeutic purposes (n=12, 18.5%), counselling (n=12, 18.5%), education (n=10, 15.4%), screening (n=9, 13.8%), training (n=7, 10.8%), and diagnosing (n=3, 4.6%). Chatbots were implemented in stand-alone software in about 62% (n=40) of studies and in web-based platforms in the remaining studies (n=25, 39%). Chatbot responses were generated based on predefined rules, machine learning approaches, or both (hybrid) in 82% (n=53), 17% (n=11), and 2% (n=1) of the included studies, respectively. In the majority of studies (n=58, 89%), chatbots led the dialogue. In about 62% (n=40) of studies, users interacted with chatbots only by typing their utterances (text). The most common modalities used by chatbots to interact with users were a combination of text, voice, and nonverbal language (ie, facial expression and body language; n=21, 32%), text only (n=20, 31%), and a combination of voice and nonverbal language (n=19, 29%). The most commonly targeted health conditions were any health condition (n=20, 31%) and depression (n=15, 23%).
Characteristics of the included studies (N=65).
| Parameters and characteristics | Studies, n (%)a |
| --- | --- |
| **Study design** | |
| Survey | 41 (63) |
| Quasi-experiment | 11 (17) |
| Randomized controlled trial | 13 (20) |
| **Type of publication** | |
| Journal article | 37 (57) |
| Conference proceeding | 25 (39) |
| Thesis | 3 (5) |
| **Country of publication** | |
| United States | 33 (51) |
| France | 5 (8) |
| Netherlands | 3 (5) |
| Japan | 3 (5) |
| Australia | 3 (5) |
| Italy | 2 (3) |
| Switzerland and Netherlands | 2 (3) |
| Finland | 1 (2) |
| Sweden | 1 (2) |
| Turkey | 1 (2) |
| United Kingdom | 1 (2) |
| Switzerland and Germany | 1 (2) |
| Mexico | 1 (2) |
| Spain | 1 (2) |
| Global population | 1 (2) |
| Romania, Spain, and Scotland | 1 (2) |
| Philippines | 1 (2) |
| Switzerland | 1 (2) |
| New Zealand | 1 (2) |
| Spain and New Zealand | 1 (2) |
| South Africa | 1 (2) |
| **Year of publication** | |
| Before 2010 | 3 (5) |
| 2010-2014 | 17 (26) |
| 2015-2019 | 45 (70) |
| **Population characteristics** | |
| **Sample sizeb** | |
| ≤50 | 38 (62) |
| 51-100 | 11 (18) |
| 101-200 | 9 (15) |
| >200 | 3 (5) |
| **Agec (years)** | |
| Mean (range) | 39 (13-79) |
| **Sexe (%)** | |
| Male | 48.1 |
| **Sample typef** | |
| Clinical sample | 34 (55) |
| Nonclinical sample | 28 (45) |
| **Settingg** | |
| Clinical | 30 (50) |
| Community | 20 (33) |
| Educational | 18 (30) |
| **Chatbot characteristics** | |
| **Purpose of chatboti** | |
| Self-management | 17 (26) |
| Therapy | 12 (19) |
| Counselling | 12 (19) |
| Education | 10 (15) |
| Screening | 9 (14) |
| Training | 7 (11) |
| Diagnosing | 3 (5) |
| **Platform** | |
| Stand-alone software | 40 (62) |
| Web-based | 25 (39) |
| **Response generation approach** | |
| Rule-based | 53 (82) |
| Artificial intelligence | 11 (17) |
| Hybrid | 1 (2) |
| **Dialogue initiative** | |
| Chatbot | 58 (89) |
| Users | 4 (6) |
| Both | 3 (5) |
| **Input modality (user to chatbot)** | |
| Text | 40 (62) |
| Voice | 9 (14) |
| Voice and nonverbal | 8 (12) |
| Text and voice | 6 (9) |
| Text and nonverbal | 2 (3) |
| **Output modality (chatbot to user)** | |
| Text, voice, and nonverbal | 21 (32) |
| Text | 20 (31) |
| Voice and nonverbal | 19 (29) |
| Text and voice | 4 (6) |
| Voice | 1 (2) |
| **Targeted health conditionsj** | |
| Any health condition | 20 (31) |
| Depression | 15 (23) |
| Autism | 5 (8) |
| Anxiety | 5 (8) |
| Substance use disorder | 5 (8) |
| Posttraumatic stress disorder | 5 (8) |
| Mental disorders | 3 (5) |
| Sexually transmitted diseases | 3 (5) |
| Sleep disorders | 2 (3) |
| Diabetes | 2 (3) |
| Alzheimer disease | 1 (2) |
| Asthma | 1 (2) |
| Cervical cancer | 1 (2) |
| Dementia | 1 (2) |
| Schizophrenia | 1 (2) |
| Stress | 1 (2) |
| Genetic variants | 1 (2) |
| Cognitive impairment | 1 (2) |
| Atrial fibrillation | 1 (2) |
aPercentages were rounded and may not sum to 100.
bSample size was reported in 61 studies.
cMean age was reported in 44 studies.
dN/A: not applicable.
eSex was reported in 54 studies.
fSample type was reported in 62 studies.
gSetting was reported in 61 studies.
hNumbers do not add up as several chatbots focused on more than one health condition.
iNumbers do not add up as several chatbots have more than one purpose.
jNumbers do not add up as several chatbots target more than one health condition.
The included studies evaluated chatbots using many technical metrics, which were categorized into 4 main groups: metrics related to chatbots as a whole (global metrics), metrics related to response generation, metrics related to response understanding, and metrics related to esthetics. More details about metrics are presented in the following sections.
The included studies evaluated chatbots as a whole using the following metrics: usability, classifier performance, speed, technical issues, intelligence, task completion rate, dialogue efficiency, dialogue handling, context awareness, and error management.
Usability of chatbots was assessed in 37 (56.9%) studies [
Classifier performance of chatbots was evaluated in 8 (12.3%) studies [
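As an illustrative sketch only (the included studies did not necessarily use this tooling), classifier performance against expert-labeled test utterances could be summarized with standard measures such as precision, recall, F1 score, and area under the curve, for example using scikit-learn; all data below are hypothetical.

```python
# Illustrative sketch (not from the reviewed studies): evaluating a chatbot's
# intent/triage classifier against expert-labeled test utterances.
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

y_true   = [1, 0, 1, 1, 0, 0, 1, 0]                  # expert labels (1 = condition present)
y_pred   = [1, 0, 1, 0, 0, 1, 1, 0]                  # classifier's hard decisions
y_scores = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]  # classifier's predicted probabilities

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary")
auc = roc_auc_score(y_true, y_scores)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} AUC={auc:.2f}")
```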
Technical issues (eg, errors/glitches) in chatbots were examined in 4 studies (6.2%) [
Of the reviewed studies, 2 (3.1%) studies examined chatbot flexibility in dialogue handling (eg, its ability to maintain a conversation and deal with users’ generic questions or responses that require more, less, or different information than was requested) using interviews [
The following metrics were utilized by the included studies to evaluate response generation by chatbots: appropriateness of responses, comprehensibility, realism, speed of response, empathy, repetitiveness, clarity of speech, and linguistic accuracy.
Of the reviewed studies, 15 (23.1%) examined the appropriateness and adequacy of verbal [
Comprehensibility of responses, which refers to the degree to which a chatbot generates responses understandable by users, was evaluated by 14 (21.5%) studies [
In total, 14 (21.5%) studies assessed how human-like chatbots are (realism) [
Altogether, 11 (16.9%) studies assessed the speed of a chatbot’s responses [
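Where timestamped conversation logs are available, response speed can also be measured objectively rather than by user report. The sketch below is our own illustration and assumes a simple log format of (timestamp, speaker) events; none of the included studies necessarily logged data this way.

```python
# Illustrative sketch: mean chatbot response latency from timestamped logs.
# The log format (ISO timestamp, speaker) is an assumption for illustration.
from datetime import datetime
from statistics import mean

def mean_response_latency(events):
    """events: chronologically ordered (timestamp_iso, speaker) pairs for one session."""
    latencies = []
    for (t_prev, who_prev), (t_next, who_next) in zip(events, events[1:]):
        if who_prev == "user" and who_next == "bot":
            delta = datetime.fromisoformat(t_next) - datetime.fromisoformat(t_prev)
            latencies.append(delta.total_seconds())
    return mean(latencies) if latencies else None

events = [("2019-11-01T10:00:00", "user"), ("2019-11-01T10:00:02", "bot"),
          ("2019-11-01T10:00:30", "user"), ("2019-11-01T10:00:33", "bot")]
print(mean_response_latency(events))  # 2.5 seconds on this toy session
```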
Repetitiveness of a chatbot’s responses was examined in 9 (13.8%) studies [
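Repetitiveness can likewise be approximated from conversation logs. The following is one possible proxy of our own devising (not a metric defined by the included studies): the share of chatbot turns in a session that exactly duplicate an earlier chatbot turn.

```python
# Illustrative log-based proxy for repetitiveness: proportion of chatbot turns
# within a session that duplicate an earlier chatbot turn (toy data below).
def repetition_rate(chatbot_turns):
    seen, repeats = set(), 0
    for turn in chatbot_turns:
        normalized = turn.strip().lower()
        if normalized in seen:
            repeats += 1
        seen.add(normalized)
    return repeats / len(chatbot_turns) if chatbot_turns else 0.0

session = ["How are you feeling today?", "Tell me more.", "Tell me more.",
           "That sounds difficult.", "Tell me more."]
print(repetition_rate(session))  # 0.4 on this toy session
```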
The included studies evaluated chatbot understanding of users’ responses using the following metrics: understanding as assessed by users, word error rate, concept error rate, and attention estimator errors.
Chatbot understanding, which refers to a chatbot’s ability to adequately understand the verbal and nonverbal responses of users, was assessed by 20 (30.8%) studies [
Word error rate, which assesses the performance of speech recognition in chatbots, was examined in 2 (3.1%) studies using conversational logs [
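As a worked illustration (independent of the included studies' own tooling), word error rate can be computed as the word-level edit distance between a reference transcript and the recognized transcript, divided by the number of words in the reference; the transcripts below are hypothetical.

```python
# Minimal word error rate sketch: Levenshtein distance over words,
# normalized by the reference length. Transcripts are illustrative.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn first i reference words into first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("i feel very anxious today", "i feel vary anxious"))  # 0.4
```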
The included studies evaluated the esthetics of chatbots using the following metrics: appearance of the virtual agent, background color and content, font type and size, button color, shape, icon, and background color contrast.
In total, 5 (7.7%) studies assessed the appearance of the virtual agent using a single question in a self-administered questionnaire [
It became clear that there is currently no standard method in use for evaluating health chatbots. Most aspects are studied using self-administered questionnaires or user interviews, whereas objective quantitative metrics (eg, response speed, word error rate, concept error rate, dialogue efficiency, attention estimation, and task completion computed from conversation logs) were used comparatively rarely. Different studies assessed different aspects of chatbots, complicating direct comparison. Although some of this variation may be due to the individual characteristics of chatbot implementations and their distinct use cases, it is difficult to see why metrics such as appropriateness of responses, comprehensibility, realism, speed of response, empathy, and repetitiveness should each be applicable to only a small proportion of studies. We thus suggest continuing research and development toward an evaluation framework of technical metrics, with recommendations for the specific circumstances in which each should be included in chatbot studies.
Jadeja et al [
We found usability to be the most commonly assessed aspect of health chatbots. The system usability scale (SUS [
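For reference, SUS responses are conventionally converted to a 0-100 score by rescaling the 10 Likert items (odd items are positively worded, even items negatively worded) and multiplying the sum by 2.5; the responses in the sketch below are hypothetical.

```python
# Standard SUS scoring (0-100 scale) for one respondent's 10 answers
# on a 1-5 Likert scale; the responses below are hypothetical.
def sus_score(responses):
    assert len(responses) == 10
    odd  = sum(r - 1 for r in responses[0::2])  # items 1,3,5,7,9 (positively worded)
    even = sum(5 - r for r in responses[1::2])  # items 2,4,6,8,10 (negatively worded)
    return (odd + even) * 2.5

print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 3]))  # 80.0
```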
Conversational-turns per session (CPS) has been suggested as a success metric for social chatbots as exemplified by XiaoIce [
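Computing CPS from conversation logs is straightforward; the sketch below assumes a simple log of (session ID, speaker, utterance) tuples and counts every utterance as a turn, although turn definitions vary across studies.

```python
# Illustrative sketch: average conversational-turns per session (CPS)
# from a conversation log; the log format is an assumption for illustration.
from collections import Counter
from statistics import mean

def average_cps(turn_log):
    """turn_log: iterable of (session_id, speaker, utterance) tuples."""
    turns_per_session = Counter(session_id for session_id, _, _ in turn_log)
    return mean(turns_per_session.values())

log = [("s1", "user", "hi"), ("s1", "bot", "hello"), ("s1", "user", "i feel low"),
       ("s1", "bot", "tell me more"), ("s2", "user", "hi"), ("s2", "bot", "hello")]
print(average_cps(log))  # 3 turns per session on average (4 in s1, 2 in s2)
```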
A further area for standardization would be in the quality of responses. We observed response generation to be widely measured but in very diverse ways. Emergence of standard measures for response generation and understanding would greatly advance the comparability of studies. Development of validated instruments in this area would be a useful contribution to chatbot research.
We commend the inclusion of classifier performance in health chatbot studies where this is applicable and practical to assess. It could be less meaningful to compare raw performance (eg, as area under the curve) across domains due to differences in difficulty; ideally, chatbot performance would be compared to the performance of a human expert for the task at hand. Further, we perceive the opportunity for a progression of performance measures in health chatbot studies as a given product gains maturity. Good early-stage metrics would be those that assess response quality and response understanding to establish that the product is working well. Subsequent experiments can advance the assessment of self-reported usability and metrics of social engagement. Where applicable, classifier performance can round out technical performance evaluation to establish whether trials to assess clinical outcomes are warranted.
This study is the first review to summarize the technical metrics used by previous studies to evaluate health care chatbots, helping readers explore how chatbots have been evaluated in health care. Given that this review was executed and reported in line with PRISMA-ScR guidelines [
To retrieve as many relevant studies as possible, the most commonly used databases in the fields of health and information technology were searched. Further, we searched Google Scholar and conducted backward and forward reference list checking to retrieve gray literature and minimize the risk of publication bias.
As two reviewers independently selected the studies and extracted the data, the selection bias in this review was minimal. This review can be considered a comprehensive review given that we did not apply restrictions regarding the study design, study setting, year of publication, and country of publication.
Laranjo et al [
This review focused on chatbots that are aimed at delivering health care services to patients and work on stand-alone software and web browsers; it excluded chatbots that used robotics, serious games, SMS text messaging, Wizard-of-Oz, and telephones. Thus, this review did not include many technical metrics used to evaluate chatbots for other users (eg, physicians, nurses, and caregivers), in other fields (eg, business and education), or with alternative modes of delivery (eg, SMS text messaging, Wizard-of-Oz, and telephones). The abovementioned restrictions were applied by previous reviews about chatbots as these features are not part of ordinary chatbots [
Due to practical constraints, we could not search interdisciplinary databases (eg, Web of Science and ProQuest), conduct a manual search, or contact experts. Further, the search in this review was restricted to English-language studies. Accordingly, it is likely that this review missed some studies.
From this review, we perceive the need for health chatbot evaluators to consider measurements across a range of aspects in any given study or study series, including usability, social experience, response quality, and, where applicable, classifier performance. The establishment of standard measures would greatly enhance comparability across studies with the SUS and CPS as leading candidates for usability and social experience, respectively. It would be ideal to develop guidelines for health chatbot evaluators indicating what should be measured and at what stages in product development. Development of validated measurement instruments in this domain is sparse and such instruments would benefit the field, especially for response quality metrics.
Search string.
Data extraction form.
Metadata and population characteristics of each included study.
Characteristics of the intervention in each included study.
ACM: Association for Computing Machinery
AI: artificial intelligence
CINAHL: Cumulative Index of Nursing and Allied Health Literature
CPS: conversational-turns per session
eHealth: electronic health
EMBASE: Excerpta Medica Database
IEEE: Institute of Electrical and Electronics Engineers
IR: information retrieval
PRISMA-ScR: Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews
UX: user experience
AA developed the protocol and conducted the search with guidance from and under the supervision of KD and MH. Study selection and data extraction were carried out independently by MA and ZS. AA solved any disagreements between the two reviewers. AA synthesized the data. AA and KD drafted the manuscript, and it was revised critically for important intellectual content by all authors. KD and JW reviewed the related literature and interpreted the results. All authors approved the manuscript for publication and agree to be accountable for all aspects of the work.
None declared.