This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
The diagnosis of pigmented skin lesions is error prone and requires domain-specific expertise, which is not readily available in many parts of the world. Collective intelligence could potentially decrease the error rates of nonexperts.
The aim of this study was to evaluate the feasibility and impact of collective intelligence for the detection of skin cancer.
We created a gamified study platform on a stack of established Web technologies and presented 4216 dermatoscopic images of the most common benign and malignant pigmented skin lesions to 1245 human raters with different levels of experience. Raters were recruited via scientific meetings, mailing lists, and social media posts. Education was self-declared, and domain-specific experience was tested by screening tests. In the target test, the readers had to assign 30 dermatoscopic images to 1 of the 7 disease categories. The readers could repeat the test with different lesions at their own discretion. Collective human intelligence was achieved by sampling answers from multiple readers. The disease category with the most votes was regarded as the collective vote per image.
We collected 111,019 single ratings, with a mean of 25.2 (SD 18.5) ratings per image. As single raters, nonexperts achieved a lower mean accuracy (58.6%) than experts (68.4%; mean difference=−9.4%; 95% CI −10.74% to −8.1%; P<.001). Acting as collectives of 8, however, nonexperts increased their mean accuracy to 68.5%, on par with single experts (68.4%).
A high number of raters can be attracted by elements of gamification and Web-based marketing via mailing lists and social media. Nonexperts increase their accuracy to expert level when acting as a collective, and faster answers correspond to higher accuracy. This information could be useful in a teledermatology setting.
Accurate diagnosis of pigmented skin lesions requires experience and depends on the availability of specifically trained physicians.
Beyond the help of computer algorithms, the use of collective intelligence could improve the accuracy of diagnosis to, or even beyond, the level of experts. Collective intelligence has emerged in the last two decades with the rise of interconnected groups of people and computers collectively solving difficult tasks.
To measure the effect of swarm intelligence on the diagnosis of pigmented skin lesions, we created a publicly available dataset of 10,015 dermatoscopic images and developed a Web-based study platform with elements of gamification. We also aimed to provide a publicly available benchmark dataset with human diagnoses and corresponding metadata that will be helpful for developing and testing future machine learning algorithms.
The aim of this study was to determine whether a collective of nonexperts can reach the accuracy of experts (ie, raters frequently consulted by experts and advanced users) in diagnosing skin cancer on dermatoscopic images, and to determine the collective size needed for this.
The Web-based platform DermaChallenge was built on a stack of established Web technologies and structured into gamified levels, as described below.
To participate in the study, participants had to register with a username, a valid email address, and a password. In addition, we asked participants their age (age groups spanning 10 years), gender, country, profession, and years of experience in dermatoscopy. Ordered options for the last item were (1) less than 1 year, (2) opportunistic use for more than 1 year, (3) regular use for 1 to 5 years, (4) regular use for more than 5 years, or (5) more than 10 years of experience. Groups 1 to 4 were regarded as nonexperts and group 5 as experts.
For gamification, the training platform is structured into stepwise levels with different tasks of varying degrees of difficulty. As the platform is available to any user with a valid email address, we needed to verify the plausibility of self-declared experience status. For this, the first 3 levels were designed as screening tests of domain-specific knowledge that raters had to complete before reaching the target test.
Raters were allowed to play more than 1 round per level. When a round was completed, raters were able to see their scoring rank in an online high-score list.
We used mailing lists, social media posts, and talks at scientific conferences to recruit participants. To compare recruitment strategies, we continuously monitored the number of new registrations and related them to specific recruitment events, if they occurred within 4 days after the event. To analyze registration and dropout, we categorized participants into (1) registered but not verified by email; (2) registered and verified but did not complete any level; (3) registered, verified, played, and completed 1 of the screening levels at least once; and (4) registered, verified, played, and completed all screening levels and the target level at least once. We analyzed engagement with the unbounded retention measure used in game analytics.
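As a sketch of this engagement metric, the following Python snippet computes N-day unbounded retention under the assumption that it is defined as the share of registered users who return N or more days after registration; the data layout and function name are illustrative, not taken from the study platform.

```python
from datetime import date

def unbounded_retention(users, n_days=30):
    """users: list of dicts with a 'registered' date and a list of 'visits'."""
    # A user counts as retained if any visit falls n_days or more after
    # registration (hence "unbounded": day N or any later day).
    retained = sum(
        any((visit - u["registered"]).days >= n_days for visit in u["visits"])
        for u in users
    )
    return retained / len(users) if users else 0.0

# Toy example: one user returns 45 days after registration, one does not
users = [
    {"registered": date(2018, 7, 1), "visits": [date(2018, 7, 2), date(2018, 8, 15)]},
    {"registered": date(2018, 7, 1), "visits": [date(2018, 7, 3)]},
]
print(unbounded_retention(users, n_days=30))  # 0.5
```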
We calculated baseline measures of accuracy (ie, correct specific prediction of a disease category, not just malignancy) for each dermatoscopic image and per disease category for single raters. To calculate the measures of accuracy for collective intelligence, we applied bootstrapping (random sampling with replacement), simulating multiple second opinions. We let the size of the collectives range from 3 to 8. Dermatoscopic images for which the number of raters was lower than the size of the virtual collective were excluded. The disease category with the most votes (ie, first-past-the-post voting) was regarded as the collective vote per dermatoscopic image; ties were broken at random. This procedure was repeated for the answers of nonexperts as well as for the answers given with high and low confidence.
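A minimal Python sketch of this bootstrapping procedure, assuming a simple list-based data layout (function names and the toy data are hypothetical):

```python
import random
from collections import Counter

def collective_vote(ratings, k, rng=random):
    """ratings: list of disease-category labels given to one image."""
    # Sample k single ratings with replacement (bootstrapping),
    # simulating k independent second opinions.
    sample = [rng.choice(ratings) for _ in range(k)]
    counts = Counter(sample)
    top = max(counts.values())
    winners = [cat for cat, n in counts.items() if n == top]
    return rng.choice(winners)  # first-past-the-post; ties broken at random

def collective_accuracy(images, k, n_boot=1000, rng=random):
    """images: list of (ratings, true_category) pairs; images with fewer
    ratings than the collective size are excluded, as in the study."""
    eligible = [(r, t) for r, t in images if len(r) >= k]
    correct = sum(
        collective_vote(r, k, rng) == t
        for _ in range(n_boot)
        for r, t in eligible
    )
    return correct / (n_boot * len(eligible))

# Toy example: one melanoma rated by 8 raters
images = [(["MEL", "MEL", "NV", "MEL", "BKL", "MEL", "MEL", "NV"], "MEL")]
print(collective_accuracy(images, k=8, n_boot=200))
```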
We used the answering time as a surrogate measure of the level of confidence for each image. To allow an unbiased comparison, we calculated the mean answering time for every rater individually. If a rater needed more time than his or her individual mean answering time to assign a disease category to a given image, the level of confidence for that answer was regarded as low. If, on the other hand, the answering time was less than or equal to the individual mean of this specific rater, the level of confidence was regarded as high. The confidence level was therefore not an absolute time threshold but a measure relative to each rater's general answering speed.
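A small sketch of this confidence labeling, assuming answers are stored as dictionaries with a rater ID and an answering time in seconds (field names are illustrative):

```python
from statistics import mean

def label_confidence(answers):
    """answers: list of dicts with 'rater' and 'seconds'; adds 'confidence'."""
    # Mean answering time per rater, so the split is relative to each
    # rater's own speed rather than an absolute threshold.
    times = {}
    for a in answers:
        times.setdefault(a["rater"], []).append(a["seconds"])
    rater_mean = {r: mean(ts) for r, ts in times.items()}
    for a in answers:
        a["confidence"] = "high" if a["seconds"] <= rater_mean[a["rater"]] else "low"
    return answers

answers = [{"rater": "r1", "seconds": 2.5}, {"rater": "r1", "seconds": 7.0}]
print([a["confidence"] for a in label_confidence(answers)])  # ['high', 'low']
```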
The primary outcome metric was the mean accuracy, which we defined as the arithmetic mean of the accuracies of every image within a rater group. All calculated accuracies per image were compared pairwise with the baseline accuracy of nonexperts. Point estimates for the difference in accuracy, 95% confidence intervals, and P values were calculated for each of these comparisons.
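Restated compactly, where N is the number of images rated by a group and a_i is the fraction of correct ratings that image i received from that group:

\[
a_i = \frac{\text{correct ratings for image } i}{\text{all ratings for image } i},
\qquad
\bar{a} = \frac{1}{N}\sum_{i=1}^{N} a_i
\]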
As some raters took the target level considerably more often than others, we restricted the number of rounds per rater to 30 to prevent bias. If a round was not completed, we included its answers only if more than 50% of the cases (ie, more than 15 of the 30 dermatoscopic images) were rated.
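A minimal sketch of these inclusion rules, assuming a dictionary that maps each rater to his or her rounds in chronological order (the layout is illustrative):

```python
def include_answers(rounds_by_rater, max_rounds=30, images_per_round=30):
    """rounds_by_rater: dict mapping a rater ID to that rater's rounds in
    chronological order; each round is a dict with a list of 'answers'."""
    kept = []
    for rounds in rounds_by_rater.values():
        for rnd in rounds[:max_rounds]:  # cap at 30 rounds per rater
            # Keep incomplete rounds only if more than half of the
            # 30 images were rated.
            if len(rnd["answers"]) > images_per_round // 2:
                kept.extend(rnd["answers"])
    return kept
```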
Classic measures of diagnostic value (sensitivities, specificities, and predictive values) were calculated per rater group according to standard formulas.
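The standard formulas in question, with malignant lesions counted as positives and TP, FP, TN, and FN denoting true positives, false positives, true negatives, and false negatives:

\[
\text{Sensitivity} = \frac{TP}{TP+FN}, \quad
\text{Specificity} = \frac{TN}{TN+FP}, \quad
\text{PPV} = \frac{TP}{TP+FP}, \quad
\text{NPV} = \frac{TN}{TN+FN}
\]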
Descriptive continuous values are presented as means with standard deviations; estimates are provided with 95% confidence intervals. We used paired t tests for the pairwise comparisons of accuracy.
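As an illustration, a minimal Python sketch of such a per-image paired comparison, using SciPy's paired t test and a t-distribution confidence interval for the mean difference (the arrays are made-up toy values, not study data):

```python
import numpy as np
from scipy import stats

# Toy per-image accuracies, aligned by image (not study data)
baseline = np.array([0.55, 0.60, 0.62, 0.48, 0.71])   # single nonexperts
candidate = np.array([0.64, 0.66, 0.70, 0.52, 0.77])  # eg, collectives of 8

diff = candidate - baseline
t_stat, p_value = stats.ttest_rel(candidate, baseline)  # paired t test

# 95% CI for the mean difference, from the t distribution
se = diff.std(ddof=1) / np.sqrt(len(diff))
ci_low, ci_high = stats.t.interval(0.95, df=len(diff) - 1, loc=diff.mean(), scale=se)
print(f"mean diff={diff.mean():.3f}, 95% CI ({ci_low:.3f}, {ci_high:.3f}), P={p_value:.4f}")
```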
The study was approved by the ethics review boards of the University of Queensland (protocol number 2017001223) and the Medical University of Vienna (protocol number 1804/2017). During registration, human raters provided written consent to the analysis of their anonymized ratings. A total of 4 participants requested that all their data be deleted; therefore, their ratings are not included in this study.
Of the 2497 individuals (1538/2497, 61.59% female) who registered between June 15, 2018, and June 14, 2019, 44.09% (1101/2497) were board-certified dermatologists, 25.55% (638/2497) were dermatology residents, and 16.58% (414/2497) were general practitioners. During these 365 days, the survey page was visited 21,948 times. The raters came from 5 continents (Africa, n=112; Asia, n=204; Europe, n=1260; Americas, n=594; and Australia/Oceania, n=327).
The raters used mobile phones in 56.80% (13,042/22,961), a desktop computer in 30.80% (7061/22,925), and a tablet in 6.70% (1546/23,074) of visits to the survey page. The mean time spent on the site per visit was 4 min 37 seconds (SD 3 min 2 seconds). Of the 2497 registered raters, 367 (14.69%) dropped out before playing at least one level, 1330 (53.26%) completed the screening levels and started playing the target level, and 1245 (49.85%) completed the target level at least once. The distribution of age, gender, continent of origin, and experience was similar between registered raters who finished the screening tests and played the target level and those who dropped out.

Peaks of registrations could be attributed to specific recruitment events. Most participants were recruited via social media (701/2497, 28.07%) or mailing lists (732/2497, 29.31%). Only 1.96% (49/2497) of the participants were recruited at scientific meetings; the remaining 40.64% (1015/2497) could not be attributed to a specific event. The highest number of participants recruited per day was 676, after a social media campaign. Without any social media marketing, the number of visitors ranged from 15 to 40 per day (at the time of submission). Participants with less than 1 year of experience had the lowest 30-day unbounded retention rate (21.9%), and participants with less than 3 years of experience had the highest (33.7%).
In the target level, we collected 111,019 single ratings, with a mean of 25.2 (SD 18.5) ratings per image. Only the 4216 images with 8 or more ratings were included in this analysis (AKIEC, n=327; BCC, n=514; BKL, n=1099; DF, n=115; MEL, n=1113; NV, n=907; and VASC, n=142). At the nonexpert level, data of 1208 participants, 4216 different images, 4102 rounds, and 101,271 ratings were included. At the expert level, data of 37 participants, 2609 different images, 193 rounds, and 4762 ratings were included.
Comparison of mean accuracy of single nonexperts to mean accuracy of different collective sizes and confidence levels.
Experience | Collective size | Confidencea | Mean accuracy, % | Mean difference (95% CI) | P value
Nonexperts | —b | All | 58.60 | Reference | Reference
Nonexperts | 4 | All | 64.93 | 6.33 (6.09 to 6.57) | <.001 |
Nonexperts | 8 | All | 68.51 | 9.91 (9.52 to 10.29) | <.001 |
Nonexperts | — | Low | 51.90 | −6.20 (−6.77 to −5.64) | <.001 |
Nonexperts | 4 | Low | 56.01 | −2.10 (−2.72 to −1.47) | <.001 |
Nonexperts | 8 | Low | 59.27 | 1.16 (0.45 to 1.88) | .007 |
Nonexperts | — | High | 61.40 | 2.77 (2.44 to 3.09) | <.001 |
Nonexperts | 4 | High | 65.85 | 7.22 (6.77 to 7.66) | <.001 |
Nonexperts | 8 | High | 68.40 | 9.77 (9.21 to 10.32) | <.001 |
Experts | — | All | 68.36 | 9.43 (8.11 to 10.74) | <.001 |
Experts | — | Low | 55.61 | 4.67 (2.27 to 7.06) | <.001 |
Experts | — | High | 74.08 | 11.91 (10.43 to 13.38) | <.001 |
aConfidence groups denote whether all answers of raters were measured (All) or only answers given with low or high confidence.
bNo collective size.
Nonexperts with low confidence had the lowest mean accuracy (51.9%, SD 28.9), whereas confident experts had the highest mean accuracy (74.1%, SD 41.7). A comparison of nonexpert collectives with single raters per disease category is shown in the figure below.
Mean accuracy (dots) per disease category of nonexpert collectives (ranging from 3 to 8) compared with the mean sensitivity of single experts, single experts with high confidence, and single nonexperts. AKIEC: actinic keratosis/intraepithelial carcinoma; BCC: basal cell carcinoma; BKL: benign keratinocytic lesions; DF: dermatofibroma; MEL: melanoma; NV: nevus; VASC: vascular lesions.
The mean time to answer a case was below 5 seconds for both nonexperts (4.7 seconds, SD 4.05) and experts (3.9 seconds, SD 3.47), below 3 seconds in case of high confidence (2.8 seconds, SD 1.75 and 2.3 seconds, SD 1.29, respectively), and above 7 seconds in case of low confidence (7.0 seconds, SD 4.74 and 7.0 seconds, SD 4.22, respectively). The sensitivity for malignant cases increased for both nonexperts and experts when answers were given with high rather than low confidence (nonexperts: 66.3% with low vs 77.6% with high confidence; experts: 64.6% vs 79.4%; see the table below).
Diagnostic values measuring detection of malignant skin lesions for different confidence levels.
Experience | Collective size | Confidencea | Sensitivity, % (95% CI) | Specificity, % (95% CI) | Positive predictive values, % (95% CI) | Negative predictive values, % (95% CI) |
Nonexpert | —b | All | 73.1 (72.7 to 73.5) | 77.4 (77.0 to 77.7) | 75.1 (74.7 to 75.5) | 75.5 (75.1 to 75.9) |
Nonexpert | 4 | All | 74.6 (73.9 to 75.2) | 78.0 (77.5 to 78.5) | 74.5 (73.9 to 75.2) | 78.0 (77.5 to 78.6) |
Nonexpert | 8 | All | 76.9 (76.3 to 77.5) | 80.1 (79.6 to 80.6) | 77.0 (76.4 to 77.6) | 80.1 (79.5 to 80.6) |
Nonexpert | — | Low | 66.3 (65.6 to 67.0) | 69.7 (69.1 to 70.4) | 69.4 (68.7 to 70.1) | 66.7 (66.0 to 67.3) |
Nonexpert | 4 | Low | 68.7 (68.0 to 69.4) | 74.6 (74.0 to 75.2) | 70.3 (69.7 to 71.0) | 73.1 (72.5 to 73.7) |
Nonexpert | 8 | Low | 70.9 (70.3 to 71.6) | 76.5 (75.9 to 77.0) | 72.5 (71.9 to 73.2) | 75.0 (74.4 to 75.6) |
Nonexpert | — | High | 77.6 (77.1 to 78.0) | 81.6 (81.2 to 82.0) | 78.7 (78.3 to 79.2) | 80.5 (80.1 to 81.0) |
Nonexpert | 4 | High | 76.9 (76.3 to 77.5) | 77.6 (77.0 to 78.1) | 74.8 (74.2 to 75.4) | 79.5 (79.0 to 80.0) |
Nonexpert | 8 | High | 78.3 (77.7 to 78.8) | 79.0 (78.4 to 79.5) | 76.3 (75.7 to 76.9) | 80.8 (80.2 to 81.3) |
Expert | — | All | 74.0 (72.2 to 75.8) | 85.8 (84.4 to 87.2) | 83.1 (81.4 to 84.7) | 77.8 (76.2 to 79.4) |
Expert | — | Low | 64.6 (61.3 to 67.8) | 77.4 (74.3 to 80.3) | 75.6 (72.2 to 78.7) | 66.9 (63.7 to 70.0) |
Expert | — | High | 79.4 (77.2 to 81.4) | 89.7 (88.2 to 91.1) | 87.1 (85.2 to 88.9) | 83.3 (81.5 to 85.0) |
aConfidence groups denote whether all answers of raters were measured (All) or only answers given with low or high confidence.
bNo collective size.
In this study, we showed that collective human intelligence increases the accuracy of nonexperts for the diagnosis of pigmented skin lesions. As collectives, nonexperts reached expert-level accuracies. Although experts were significantly more accurate than nonexperts in general, this difference vanished when average experts were compared with collectives of 8 nonexperts (mean accuracy 68.4% vs 68.5%).
We also found that not all ratings from nonexperts were equally helpful. Ratings given with low confidence, defined as answers slower than the mean answering time of a rater, did not increase the accuracy of the collectives of nonexperts. Such ratings even reduced the mean accuracy of small collectives (eg, collectives of 4 low-confidence nonexperts reached 56.0%, below the 58.6% baseline of single nonexperts).
In practice, collective intelligence models, as simulated herein, could be harnessed in different ways to obtain second opinions in difficult cases. Although our method is mainly suitable for store-and-forward approaches, one could also think of real-time simultaneous interaction among readers to possibly further increase accuracy.
Our results are promising with regard to the detection of malignant skin lesions. A collective of 8 confident raters was able to raise single nonexperts’ sensitivity from 73.1% to 78.3% and specificity from 77.4% to 79.0%. Interestingly, although the mean specific accuracy of 8 confident nonexperts was at the level of experts (+0.34% difference), their operating point was shifted in favor of sensitivity over specificity. Such a nonexpert collective would therefore detect more malignant skin lesions at the cost of more interventions. This, however, may be mitigated by a second line of assessments.
Diagnostic accuracy will not be the only consideration in a potential implementation of collective ratings in practice. Although the availability of nonexperts is higher than that of experts, the more nonexperts are involved, the higher the costs and the longer it takes to obtain a collective vote. With regard to the optimal number of nonexperts, the benefits, such as the gain in accuracy for each additional rater, will have to be weighed against these costs. For example, 4 confident nonexperts increase sensitivity substantially in comparison with a single unconfident nonexpert (from 66.3% to 76.9%), but the additional gain achieved by 8 confident nonexperts is only marginal (78.3%).
In this study, only dermatoscopic images of pigmented skin lesions were included; however, we expect that similar improvements are possible for nonpigmented tumors and inflammatory diseases.
We also demonstrated that a high number of raters can be attracted by online marketing and by including elements of gamification. Readers with little experience had a lower unbounded retention rate, which could probably be improved by adding further elements of gamification such as avatars and progress bars (ie, exploiting the Zeigarnik effect).
AKIEC: actinic keratosis/intraepithelial carcinoma
BCC: basal cell carcinoma
BKL: benign keratinocytic lesions
DF: dermatofibroma
HAM10000: Human Against Machine with 10000 training images
MEL: melanoma
NV: nevus
VASC: vascular lesions
None declared.