This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Crowdsourcing services, such as Amazon Mechanical Turk (AMT), allow researchers to use the collective intelligence of a wide range of web users for labor-intensive tasks. As the manual verification of the quality of the collected results is difficult because of the large volume of data and the quick turnaround time of the process, many questions remain to be explored regarding the reliability of these resources for developing digital public health systems.
This study aims to explore and evaluate the application of crowdsourcing, generally, and AMT, specifically, for developing digital public health surveillance systems.
We collected 296,166 crowd-generated labels for 98,722 tweets, labeled by 610 AMT workers, to develop machine learning (ML) models for detecting behaviors related to physical activity, sedentary behavior, and sleep quality among Twitter users. To infer the ground truth labels and explore the quality of these labels, we studied 4 statistical consensus methods that are agnostic of task features and only focus on worker labeling behavior. Moreover, to model the meta-information associated with each labeling task and leverage the potential of context-sensitive data in the truth inference process, we developed 7 ML models, including traditional classifiers (offline and active), a deep learning–based classification model, and a hybrid convolutional neural network model.
Although most crowdsourcing-based studies in public health have equated majority vote with quality, the results of our study using a truth set of 9000 manually labeled tweets showed that consensus-based inference models mask underlying uncertainty in data and overlook the importance of task meta-information. Our evaluations across the 3 data sets (physical activity, sedentary behavior, and sleep quality) showed that truth inference is a context-sensitive process, and none of the methods studied in this paper were consistently superior to others in predicting the truth label. We also found that the performance of the ML models trained on crowd-labeled data was sensitive to the quality of these labels, and poor-quality labels led to incorrect assessment of these models. Finally, we have provided a set of practical recommendations to improve the quality and reliability of crowdsourced data.
Our findings indicate the importance of the quality of crowd-generated labels in developing ML models designed for decision-making purposes, such as public health surveillance decisions. A combination of inference models outlined and analyzed in this study could be used to quantitatively measure and improve the quality of crowd-generated labels for training ML models.
In recent years, social media data have been extensively used in different areas of public health [
Although linguistic annotation is crucial for developing machine learning (ML) and natural language processing (NLP) models, manual labeling of a large volume of data is a notorious problem because of its high cost and labor-intensive nature. In recent years, this problem has been tackled using crowdsourcing technologies such as Amazon Mechanical Turk (AMT) [
However, because of the uncertain quality of AMT workers with unknown expertise, their labels are sometimes unreliable, forcing researchers and practitioners to collect information redundantly, which poses new challenges in the field. Given that in large-scale crowdsourcing tasks the same workers cannot label all the examples, measuring interannotator agreement and managing the quality of workers differ from doing so with a team of in-house expert workers. Despite the growing popularity of AMT for developing ML models in public health research, the reliability and validity of this service have not yet been investigated. Several public health studies have used AMT for training data-driven ML models without external gold standard comparisons [
Similarly, to characterize sleep quality using Twitter, McIver et al [
The primary aim of this study is to evaluate the application of AMT for training data-driven ML models by analyzing the quality of crowd-generated labels. As the quality of crowd-generated labels, regardless of the type of the task being studied, is critical to the robustness of ML models trained based on these labels, we created a gold standard data set of labels and applied several statistical and ML-based models to assess the reliability of using the crowd-labeling task from different perspectives (eg, process, design, and inference). To interpret the results of our quality assessment and explore the effect of noisy labels on the applicability of inference models in dealing with these labels, our approach involved evaluating the performance of 4 consensus methods, which do not involve task features in their truth inference, and exploring their feasibility in improving the quality of crowd-labeled data. As these methods are modeled purely as a function of worker behaviors concerning labeling tasks, they cannot leverage the value of context-sensitive information (ie, the task’s meta-information) in their inference decisions. Thus, we collected additional features for our labeling data set and developed 7 ML models, including a deep learning (DL) model and a hybrid convolutional neural network (CNN) architecture to couple worker behaviors with the task’s meta-information when inferring the truth label. To detect and correct noisy labels, we also developed 5 pool-based active learners to iteratively detect the most informative samples (ie, samples with more uncertainty) and remove them from the validation set. Finally, we used SHAP (Shapley Additive Explanations) [
The crowdsourcing tasks, referred to as human intelligence tasks (HITs) by AMT, were designed to collect 5 labels based on 2 conditions, self-reported and recent PASS experience, to develop binary and multiclass classification models that can detect PASS-related behavior in Twitter users. The labels of the multiclass prediction models were defined as 11, 10, 01, and 00, based on the value of each condition (Figure S1 in
A sample labeling task (ie, human intelligence task [HIT]) for sedentary behavior. Each HIT contains 4 questions (section 1), and each asks if the presented tweet is a self-reported physical activity, sedentary behavior, or sleep quality–related behavior (section 2). The fourth question is an easy, qualification question that was used to check the quality of the worker (section 3).
We implemented a pipeline to create the HITs, post them on AMT, collect the labels through a quality check process, approve or reject the HITs, and store the results. To minimize noisy and low-quality data, we added a qualification requirement to our tasks and granted labeling access to workers who had demonstrated a high degree of success in performing a wide range of HITs across AMT (ie, master qualification). In addition, we added a simple qualification question to each HIT to detect spammers or irresponsible workers. Each HIT contained 4 questions, including the qualification question, and was assigned to 3 workers (
We collected data for this study from Twitter using the Twitter livestream application programming interface (API) for the period between November 28, 2018, and June 30, 2020. The data set was filtered to include only Canadian tweets relevant to PASS. A total of 103,911 tweets were selected from 22,729,110 Canadian tweets using keywords and regular expressions related to PASS categories. Each of these 103,911 tweets was labeled by 3 AMT workers, from which 98,722 tweets received 3 valid labels, with almost half of them related to physical activity.
The demographic variables of age and gender and the information about the source of each tweet (eg, organization vs real users) were not available within the data set collected from Twitter. We estimated these variables for each tweet using the M3 inference package in Python [
We have made the Twitter data set used in this study publicly available [
Tweets have a bounding box of coordinates, which enables spatial mapping to their respective city locations. As the Twitter API returns datetime values in Coordinated Universal Time, we used a time zone finder in Python and adjusted the time of each tweet based on its spatial data. Given that the time of day, month, and weekday can be influential factors in tweeting about each of the PASS categories, and to better use the datetime data (%a %b %d %H:%M:%S %Y), we extracted the weekday (%a), month (%b), and hour (%H) fields and stored them as separate features.
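As a rough illustration of this step, the sketch below localizes a tweet's UTC timestamp with the timezonefinder package and splits out the weekday, month, and hour fields; the field names and the sample coordinates are assumptions for illustration, not the study's exact code.

```python
# Illustrative sketch (not the study's code): adjust a tweet's UTC timestamp to
# its local time zone using its bounding-box coordinates, then extract the
# weekday (%a), month (%b), and hour (%H) features described above.
from datetime import datetime
import pytz
from timezonefinder import TimezoneFinder

tz_finder = TimezoneFinder()

def localize_and_split(created_at: str, lat: float, lon: float) -> dict:
    # Twitter's API reports created_at in UTC, eg, "Fri Jun 12 20:19:24 +0000 2020"
    utc_dt = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
    tz_name = tz_finder.timezone_at(lng=lon, lat=lat)  # eg, "America/Toronto"
    local_dt = utc_dt.astimezone(pytz.timezone(tz_name)) if tz_name else utc_dt
    return {
        "weekday": local_dt.strftime("%a"),
        "month": local_dt.strftime("%b"),
        "hour": local_dt.hour,
    }

# Toronto coordinates used purely as an example
print(localize_and_split("Fri Jun 12 20:19:24 +0000 2020", lat=43.65, lon=-79.38))
```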
We cleaned the text column by eliminating all special characters (eg, #, &, and @), punctuation, weblinks, and numbers. We also replaced common contractions with their uncontracted forms; for example,
To develop the ML models, all categorical data were encoded into dummy variables using one-hot encoding, and as we only approved HITs with complete answers, this data set did not contain any missing data.
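A minimal sketch of the cleaning and encoding steps described above is given below; the contraction list, column names, and example tweet are placeholders rather than the study's actual preprocessing pipeline.

```python
# Illustrative preprocessing sketch: regex-based tweet cleaning plus one-hot
# encoding of the categorical meta-features (column names are assumptions).
import re
import pandas as pd

CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am"}  # sample subset

def clean_tweet(text: str) -> str:
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)    # weblinks
    for short, full in CONTRACTIONS.items():     # common contractions
        text = text.replace(short, full)
    text = re.sub(r"[#&@]", " ", text)           # special characters
    text = re.sub(r"[^\w\s]|\d", " ", text)      # punctuation and numbers
    return re.sub(r"\s+", " ", text).strip()

df = pd.DataFrame({
    "text": ["Can't sleep again #insomnia https://t.co/x"],
    "weekday": ["Fri"], "gender": ["male"], "source": ["user"],
})
df["text"] = df["text"].map(clean_tweet)
encoded = pd.get_dummies(df, columns=["weekday", "gender", "source"])
print(encoded.head())
```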
To measure the consistency of answers given by the workers, we calculated label consistency (LC) as the average entropy of the collected labels for each PASS category [
where |s| denotes the size of the surveillance category.
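Because the LC equation itself is not reproduced here, the following sketch only illustrates one plausible reading of it: LC computed as 1 minus the average normalized entropy of each tweet's collected labels, so that values close to 1 indicate higher worker agreement.

```python
# Hedged sketch of a label consistency (LC) score: the exact formula in the
# paper is not reproduced here, so this assumes LC = 1 - mean normalized
# entropy of each tweet's label distribution (1 = perfect agreement).
import math
from collections import Counter

def label_consistency(labels_per_tweet, n_classes):
    entropies = []
    for labels in labels_per_tweet:
        counts = Counter(labels)
        probs = [c / len(labels) for c in counts.values()]
        h = -sum(p * math.log(p, n_classes) for p in probs)  # normalized entropy
        entropies.append(h)
    return 1 - sum(entropies) / len(entropies)

# Binary example: 3 workers per tweet
print(label_consistency([["1", "1", "0"], ["0", "0", "0"]], n_classes=2))
```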
To investigate the viability of unsupervised inference models in predicting truth labels from crowd-labeled data and compare it with that of supervised predictive models, we used a random sample of our data set as a ground truth set (ie, 9000 tweets: 4000 tweets for physical activity, 3000 tweets for sleep quality, and 2000 tweets for sedentary behavior). In total, 6 data scientists manually labeled this sample, and the entire labeled data set was reviewed manually and relabeled by an experienced in-house domain expert in both ML and public health surveillance. The disagreements between this data set and the crowd-labeled data set were manually checked to exclude any labeling bias that could impact the results of this study.
The majority voting (MV) approach estimates the ground truth as the label chosen by the largest number of workers. For example, defining the estimated label as
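As a point of reference, a minimal majority-vote baseline over the 3 redundant labels per HIT can be sketched as follows; the tie-breaking rule here is an arbitrary assumption.

```python
# Minimal majority-voting baseline for 3 redundant labels per HIT; ties are
# broken arbitrarily here, which is one assumption a real pipeline must revisit.
from collections import Counter

def majority_vote(labels):
    return Counter(labels).most_common(1)[0][0]

print(majority_vote(["11", "11", "00"]))  # -> "11"
```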
The David and Skene (DS) [
As not all workers need to label all the tasks, and a worker may label the same task more than once, sparsity can be a problem in large-scale labeling tasks when using the DS approach [
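For readers unfamiliar with the approach, the sketch below illustrates a DS-style expectation-maximization loop in which each worker is assigned a confusion matrix and the posteriors over true labels are refined iteratively; it is a compact illustration of the idea rather than the implementation evaluated in this study.

```python
# Compact sketch of a Dawid-Skene-style EM procedure: each worker gets a
# confusion matrix, and posteriors over true labels are refined iteratively.
# Illustration only, not the implementation used in the study.
import numpy as np

def dawid_skene(labels, n_workers, n_items, n_classes, n_iter=50):
    """labels: list of (item, worker, label) triples."""
    # Initialization: majority-vote posteriors
    post = np.zeros((n_items, n_classes))
    for i, _, l in labels:
        post[i, l] += 1
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and per-worker confusion matrices
        prior = post.mean(axis=0)
        conf = np.full((n_workers, n_classes, n_classes), 1e-6)
        for i, w, l in labels:
            conf[w, :, l] += post[i]
        conf /= conf.sum(axis=2, keepdims=True)
        # E-step: recompute posteriors from priors and confusion matrices
        post = np.tile(prior, (n_items, 1))
        for i, w, l in labels:
            post[i] *= conf[w, :, l]
        post /= post.sum(axis=1, keepdims=True)
    return post.argmax(axis=1), post

# Toy example: (item, worker, label) triples for 2 items and 3 workers
triples = [(0, 0, 1), (0, 1, 1), (0, 2, 0),
           (1, 0, 0), (1, 1, 0), (1, 2, 0)]
print(dawid_skene(triples, n_workers=3, n_items=2, n_classes=2)[0])
```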
The generative model of labels, abilities, and difficulties (GLAD) [
Similar to DS, Raykar algorithm (RY) [
As the meta-information associated with each task may reveal its underlying complexity and thus help model worker behaviors, we developed a set of ML models to involve this metadata in the inference process. Models were trained based on quintuple
To mitigate the risk of biased results caused by a specific learning algorithm and overcome the overfitting problem, we developed and evaluated 5 standard ML classifiers with different architectures, including generalized linear (logistic regression [LR]), kernel-based (support vector machines [SVM]), decision-tree–based (random forest and XGBoost), and sample-based (K-nearest neighbors [KNN]) classifiers. Moreover, to incorporate textual features into our analysis, we developed a hybrid DL architecture in which a CNN based on long short-term memory (LSTM) learns textual data
The pipeline of the deep learning model used to predict labels using both textual information and meta-information. LSTM: long short-term memory.
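A hedged Keras sketch of a two-stream architecture in this spirit, one CNN-LSTM stream for the tweet text and one dense stream for the meta-information, is given below; the vocabulary size, sequence length, and layer widths are illustrative assumptions rather than the study's configuration.

```python
# Hedged Keras sketch of a two-stream model: a CNN-LSTM stream for the tweet
# text concatenated with a dense stream for the meta-features. Layer sizes and
# vocabulary are assumptions, not the study's exact architecture.
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE, MAX_LEN, N_META, N_CLASSES = 20000, 50, 30, 4

text_in = layers.Input(shape=(MAX_LEN,), name="text_tokens")
x = layers.Embedding(VOCAB_SIZE, 128)(text_in)
x = layers.Conv1D(64, 5, activation="relu")(x)
x = layers.MaxPooling1D(2)(x)
x = layers.LSTM(64)(x)

meta_in = layers.Input(shape=(N_META,), name="meta_features")
m = layers.Dense(32, activation="relu")(meta_in)

merged = layers.concatenate([x, m])
merged = layers.Dense(64, activation="relu")(merged)
out = layers.Dense(N_CLASSES, activation="softmax")(merged)

model = Model(inputs=[text_in, meta_in], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```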
To counter the bias caused by class imbalance, for both multiclass and binary classification tasks, we used the class-weight approach to incorporate the weight of each class into the cost function by assigning higher weights to minority classes and lower weights to the majority classes. We also used the SMOTE (Synthetic Minority Oversampling Technique-Nominal Continuous) [
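The snippet below sketches both strategies, class weighting and SMOTE-NC oversampling, on a toy data set; the feature layout and hyperparameters are assumptions for illustration.

```python
# Sketch of the two imbalance-handling strategies mentioned above: class
# weights passed to the cost function, and SMOTE-NC oversampling for data that
# mix nominal and continuous features (feature indices are illustrative).
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from imblearn.over_sampling import SMOTENC

y = np.array([0] * 8 + [1] * 4)                              # imbalanced toy labels
X = np.column_stack([np.random.rand(12), np.random.randint(0, 3, 12)])

# 1) Class weights: higher weight for the minority class, eg, pass to model.fit(...)
weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
class_weight = dict(zip(np.unique(y), weights))

# 2) SMOTE-NC: column 1 is categorical in this toy example
X_res, y_res = SMOTENC(categorical_features=[1], k_neighbors=2,
                       random_state=42).fit_resample(X, y)
print(class_weight, np.bincount(y_res))
```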
As the main goal of both supervised and unsupervised label inference models was to minimize the number of false-negative and false-positive inferences, to evaluate the models developed in this study, we used precision, recall, F1, and precision-recall area under the curve (AUCPR) metrics.
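These metrics can be computed, for example, with scikit-learn as follows; the weighted averaging for the multiclass case and the toy predictions are assumptions.

```python
# Evaluation metrics used in this study, computed with scikit-learn; the
# averaging choice is an assumption for the multiclass case.
from sklearn.metrics import precision_recall_fscore_support, average_precision_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
y_score = [0.9, 0.2, 0.4, 0.8, 0.3, 0.7]   # predicted probability of class 1

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
aucpr = average_precision_score(y_true, y_score)  # precision-recall AUC
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f} AUCPR={aucpr:.2f}")
```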
All the computations and predictive models were implemented using Python 3.7 with TensorFlow 2.0 [
In total, 610 unique workers participated in our data labeling tasks and completed 103,911 HITs, from which 5189 HITs were removed as they did not receive 3 valid answers. We approved 98,722 tasks for further analysis. Most workers (530/610, 86.9%) completed <100 HITs, of which 164 completed only 1 HIT. Among the workers who completed >5000 HITs, 1 worker completed 21,801 HITs and 3 workers completed between 5000 and 10,000 HITs (
The number of workers who completed different numbers of human intelligence tasks (HITs). Most workers completed a relatively small number of HITs.
Details of the collected labels and label consistency (LC) score for each of the physical activity, sleep quality, and sedentary behavior categories. LC ranges from 0 to 1, and the values close to 1 show more consistency among workers’ input.
Type | Tweets, n (%) | LCmulti | LCbinary | Workers, n (%) |
Physical activity | 48,576 (49.2) | 0.54 | 0.75 | 232 (38) |
Sedentary behavior | 17,367 (17.6) | 0.55 | 0.74 | 157 (25.7) |
Sleep quality | 32,779 (33.2) | 0.58 | 0.77 | 221 (36.2) |
Total | 98,722 (100) | 0.56 | 0.75 | 610 (100) |
Characteristics of the ground truth data set used to develop and evaluate the supervised and unsupervised inference models.

Variable | Physical activity (n=4000) | Sedentary behavior (n=2000) | Sleep quality (n=3000)
Binary label, n (%)
  Yes | 1629 (40.73) | 726 (36.3) | 1063 (35.43)
  No | 2371 (59.28) | 1274 (63.7) | 1937 (64.57)
Multiclass label, n (%)
  YYa | 1629 (40.73) | 726 (36.3) | 1063 (35.43)
  YNb | 550 (13.75) | 395 (19.75) | 862 (28.73)
  NYc | 179 (4.48) | 19 (0.95) | 52 (1.73)
  NNd | 1642 (41.05) | 860 (43) | 1023 (34.1)
Gender, n (%)
  Female | 1131 (28.28) | 576 (28.80) | 469 (15.63)
  Male | 1980 (49.50) | 906 (45.30) | 490 (16.34)
  Unknown | 889 (22.22) | 518 (25.90) | 2041 (68.03)
Age (years), n (%)
  ≤18 | 204 (5.10) | 170 (8.50) | 150 (5)
  19-29 | 743 (18.58) | 475 (23.75) | 331 (11.03)
  30-39 | 897 (22.42) | 365 (18.25) | 249 (8.30)
  ≥40 | 1267 (31.68) | 472 (23.60) | 229 (7.64)
  Unknown | 889 (22.22) | 518 (25.90) | 2041 (68.03)
Weekday, n (%)
  Sunday | 664 (16.60) | 325 (16.25) | 440 (14.66)
  Monday | 595 (14.88) | 307 (15.35) | 440 (14.66)
  Tuesday | 493 (12.32) | 245 (12.25) | 435 (14.50)
  Wednesday | 504 (12.60) | 278 (13.90) | 393 (13.10)
  Thursday | 525 (13.12) | 270 (13.50) | 416 (13.86)
  Friday | 531 (13.28) | 274 (13.70) | 421 (14.03)
  Saturday | 668 (16.70) | 283 (14.15) | 433 (14.43)
  Unknown | 20 (0.50) | 18 (0.90) | 22 (0.76)
Time (24 hours), Q1-Q3 | 10-19 | 10-19 | 5-18
Month (range) | February to July | April to September | January to August
Source, n (%)
  Organization | 563 (14.08) | 179 (8.95) | 97 (3.23)
  Users | 3437 (85.93) | 1821 (91.05) | 2903 (96.77)
aYY: self-reported and recent physical activity, sedentary behavior, and sleep quality experience.
bYN: self-reported but not recent physical activity, sedentary behavior, and sleep quality experience.
cNY: not self-reported but recent physical activity, sedentary behavior, and sleep quality experience.
dNN: neither self-reported nor recent physical activity, sedentary behavior, and sleep quality experience.
Performance of the truth inference methods using a ground truth data set of 9000 labeled tweets: 4000 physical activity, 2000 sedentary behavior, and 3000 sleep quality tweets. The top 4 rows of each PASS (physical activity, sedentary behavior, and sleep quality) category represent the results of the applied unsupervised truth inference models.
Tweets and method | Precision, multiclass (%) | Precision, binary (%) | Recall, multiclass (%) | Recall, binary (%) | F1, multiclass (%) | F1, binary (%) | AUCPRa, multiclass (%) | AUCPRa, binary (%)
Physical activity (n=4000)
  MVb | 72 | 85 | 70 | — | 71 | 84 | 56 | 85
  DSd | 74 | 85 | 68 | — | 70 | 84 | 54 | 85
  GLADe | 73 | 84 | 70 | 84 | 71 | 83 | 57 | 84
  RYf | 74 | 85 | 68 | — | 70 | 84 | 54 | 84
  LRg | 74 | 85 | — | — | — | — | — | 87
  KNNh | 74 | 85 | 74 | — | 73 | 84 | 60 | —
  SVMi | 72 | — | 73 | — | 73 | — | — | —
  RFj | 73 | 85 | 74 | 84 | 73 | — | 60 | 87
  XGBoost | 72 | 81 | 72 | 81 | 71 | 81 | 58 | 83
  DLmetak | — | 84 | 68 | 84 | 73 | 84 | 60 | 78
  DLtext_and_meta | 78 | 84 | 70 | 84 | 73 | 84 | 60 | 78
Sedentary behavior (n=2000)
  MV | 71 | 82 | 68 | 82 | 68 | 82 | 54 | 80
  DS | 70 | 81 | 62 | 81 | 65 | 81 | 48 | 79
  GLAD | 71 | 79 | 68 | 79 | 68 | 79 | 54 | 77
  RY | 70 | 81 | 62 | 81 | 65 | 81 | 48 | 79
  LR | 72 | — | — | — | 70 | — | — | —
  KNN | 71 | 82 | 71 | 82 | 67 | 82 | 56 | 80
  SVM | 73 | — | — | — | 70 | — | — | —
  RF | 72 | — | — | 82 | 69 | — | 57 | —
  XGBoost | 68 | 82 | 69 | 82 | 67 | 82 | 54 | 80
  DLmeta | — | 80 | 65 | 80 | — | 80 | 56 | 73
  DLtext/meta | — | 80 | 65 | 80 | — | 80 | 56 | 75
Sleep quality (n=3000)
  MV | 78 | — | 74 | — | 75 | — | 61 | 87
  DS | 80 | — | 74 | — | — | — | 62 | 87
  GLAD | 79 | 85 | 75 | 85 | 76 | 85 | 62 | 82
  RY | 80 | — | 74 | — | 76 | — | 62 | 87
  LR | 76 | 88 | — | 87 | — | 88 | 64 | 88
  KNN | 76 | — | — | — | — | — | 63 | —
  SVM | 76 | 88 | — | 88 | — | 88 | 64 | 88
  RF | 75 | — | 76 | — | 76 | — | 63 | —
  XGBoost | 72 | 87 | 72 | — | 72 | 87 | 58 | 87
  DLmeta | — | 86 | 72 | 86 | 76 | 86 | 63 | 81
  DLtext/meta | 80 | 87 | 72 | 87 | 76 | 87 | — | 82
aAUCPR: precision-recall area under the curve.
bMV: majority voting.
cItalicization indicates best performance for the metric and each PASS (physical activity, sedentary behavior, and sleep quality) category.
dDS: David and Skene.
eGLAD: generative model of labels, abilities, and difficulties.
fRY: Raykar algorithm.
gLR: logistic regression.
hKNN: K-nearest neighbors.
iSVM: support vector machine.
jRF: random forest.
kDL: deep learning.
Across all data sets, the supervised models consistently performed better than the unsupervised methods. This highlights the value of the context-sensitive information that was used as meta-information when training the supervised models. However, on the sleep quality data set, which has the same features and level of complexity as the physical activity and sedentary behavior data sets, MV appears sufficient for the binary inference task, with the supervised models providing little or no improvement.
The hybrid CNN architecture did not provide any gain over either the unsupervised inference models or the supervised predictive models (ie, LR, KNN, SVM, RF, XGBoost, and DLmeta) and, in some cases, underperformed them. It is possible that the LSTM stream could not capture the underlying dynamics of the features because of the inconsistencies between the poorly labeled tasks and the textual features.
To further explore the feasibility of correcting mislabeled samples, we used pool-based active learning [
Our results show that, during the learning process, the accuracy of the classifiers generally increased, slightly degraded at some iterations, and stabilized around iteration 60 for KNN and iteration 20 for other classifiers (
Incremental classification accuracy using pool-based active learning. KNN: K-nearest neighbors; LR: logistic regression; RF: random forest; SVM: support vector machine; XGB: XGBoost.
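A minimal pool-based active-learning loop with uncertainty (least-confident) sampling, in the spirit of the procedure above, might look like the following; the classifier, batch size, and synthetic data are illustrative assumptions rather than the study's configuration.

```python
# Minimal pool-based active-learning loop with uncertainty sampling.
# Batch size, classifier, and stopping rule are assumptions for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
labeled = np.zeros(len(X), dtype=bool)
labeled[np.random.RandomState(0).choice(len(X), 20, replace=False)] = True

clf = LogisticRegression(max_iter=1000)
for iteration in range(30):
    clf.fit(X[labeled], y[labeled])
    pool = np.where(~labeled)[0]
    if len(pool) == 0:
        break
    probs = clf.predict_proba(X[pool])
    uncertainty = 1 - probs.max(axis=1)          # least-confident sampling
    query = pool[np.argsort(uncertainty)[-10:]]  # 10 most uncertain samples
    labeled[query] = True                        # "oracle" provides their labels
print(f"Accuracy after querying: {clf.score(X, y):.3f}")
```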
We start this section with some practical recommendations and guidelines on the use of AMT specifically, and crowdsourcing generally, for developing ML-based public health surveillance systems. Even under the assumption that more advanced artificial intelligence models, including models pretrained on general-scope data sets and transfer-learning techniques, can cope with the poor quality of crowd-generated labels, the guidelines provided in this study can still improve the implementation, design, and qualification of the crowd-labeling process as well as the label inference process. These guidelines are supported by the results described earlier and by the findings and further analysis discussed in the rest of this section.
First, although the demographics of AMT workers are not available, we can still implement the crowdsourcing process in a way that accommodates a greater diversity of workers. A longitudinal labeling process, rather than one-time labeling, allows researchers to monitor the quality of the collected data over time and mitigates the impact of spammers, irresponsible workers, and workers who are biased or mistake-prone. Second, the overall quality of AMT workers can be context-sensitive and vary based on the type of labeling task. For example, workers' familiarity with the context of the tasks in the sleep quality data set, in contrast to the broad context of the physical activity and sedentary behavior concepts, resulted in higher data quality. Researchers should also be aware of the exclusion rate (eg, 5189/103,911, 4.99% in this study) and need to consider it when planning their study's budget and design. Third, our study results show that consensus-based inference models that do not consider the task's features may not always be efficient for integrating crowdsourced labels and can thus negatively impact the performance of ML models. Fourth, in addition to qualification requirements that filter crowdsourcing participants, sound and illustrative instructions are a less direct way to increase data quality. During the course of this project, we received nearly 70 emails from AMT workers, most of which asked about scenarios that were already covered in the instructions. This implies that the instructions changed their default understanding of the tasks, thereby improving the quality of the labels. Finally, when controlling the quality of workers using a qualification question, we recommend not informing workers that this technique is being used, as they might otherwise identify these questions because of their simplicity.
Despite all the alternative models developed in this study to improve the inference accuracy, there were still considerable discrepancies between workers and the truth labels. These disagreements may be attributable to the underlying uncertainty in the data. Although reducing uncertainty by collecting more labels from more workers might simplify the process of label inference, it limits the learning ability of ML models in modeling the inherent uncertainty of data and prevents them from recovering from the mistakes made early during the inference process [
We observed from our inference results that, regardless of the type of the classification task, none of the 11 methods outperformed other methods across all data sets (
Compared with supervised models that require a large volume of labeled data to integrate crowd-generated labels, using unsupervised inference models is simple and straightforward. However, this simplicity is gained through the cost of throwing away the contextual characteristics of tasks, which may sacrifice quality in context-sensitive scenarios. For example, the time that a tweet is posted during a day can contribute to the decision about its relevance to physical activity or sleep quality contexts. The importance of these characteristics was far more pronounced in the multiclass inference tasks than in the binary tasks (
In this study, we used 2 levels of quality control: (1) through the task assignment process, by accepting only workers with a master qualification, and (2) through the design and implementation of the tasks, by adding a qualification question to our HITs and iteratively observing workers' performance based on their answers to this question. Our results show that even though these requirements improved the quality of crowd-generated labels to a great extent, 12.45% (498/4000), 13.3% (266/2000), and 7.7% (231/3000) of physical activity, sedentary behavior, and sleep quality tweets, respectively, were still mislabeled by all 3 workers, regardless of their context or complexity level, indicating the need for further quality assessment of crowdsourced data. These mislabeled samples were not misclassified because of sample uncertainty or difficulty, and our further analysis shows that they were not informative enough (ie, in terms of prediction scores) to improve the performance of predictive models through the iterative process of active learning (Figure S4 in
To further investigate the reliability of using crowdsourcing for developing ML models, we used bidirectional encoder representations from transformers [
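As one hedged example of this kind of comparison, a BERT text classifier can be fine-tuned on the tweet text with the Hugging Face transformers library as sketched below; the checkpoint, hyperparameters, and toy examples are assumptions and not the study's setup.

```python
# Hedged sketch of fine-tuning a BERT classifier on tweet text; the checkpoint
# and hyperparameters below are illustrative assumptions.
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

texts = ["I went for a run this morning", "Can't sleep again tonight"]
labels = tf.constant([1, 0])

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer(texts, padding=True, truncation=True, max_length=64, return_tensors="tf")

model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.fit(dict(enc), labels, epochs=1, batch_size=2)
```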
To interpret the results of our predictive models in terms of the individual contribution of each feature to the prediction results, we used SHAP [
The estimated impact of each piece of meta-information on XGBoost when predicting the truth label. Age is in years. D&S: David and Skene; GLAD: generative model of labels, abilities, and difficulties; LFC: Learning from Crowds (Raykar algorithm); SHAP: Shapley additive explanations.
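The SHAP workflow for a tree-based model such as XGBoost can be reproduced roughly as follows; the synthetic features and their names are placeholders for the meta-information analyzed in this study.

```python
# Illustrative SHAP workflow for a tree-based model such as XGBoost; feature
# names are placeholders for the meta-information used in this study.
import numpy as np
import shap
import xgboost as xgb

rng = np.random.RandomState(0)
X = rng.rand(200, 4)                              # eg, hour, age, gender, worker label
y = (X[:, 3] > 0.5).astype(int)                   # toy target dominated by one feature
feature_names = ["hour", "age", "gender", "crowd_label"]

model = xgb.XGBClassifier(n_estimators=50, max_depth=3).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)            # per-sample, per-feature contributions
shap.summary_plot(shap_values, X, feature_names=feature_names, show=False)
print(np.abs(shap_values).mean(axis=0))           # global importance per feature
```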
We further used Shapley values to cluster our data set based on the explanation similarity of samples, using hierarchical agglomerative clustering (
Using the additive nature of Shapley values, we integrated all the local feature values for each data point and calculated the global contribution (
To triangulate the dominant impact of the crowdsourced labels, we excluded all the samples for which
This study has several limitations. First, the compensation paid to the workers could impact the quality of the collected labels, and consequently, the evaluation results of this study. Workers may show a higher quality in exchange for higher payments. To investigate this, during the course of the project, we increased HITs’ reward from US $0.03 to US $0.05 and did not notice any significant changes in quality. However, this is still debatable and requires further investigation.
Second, to develop the supervised models, we assumed that all the tasks share the same level of complexity, whereas in reality, some examples are more difficult than others. For example, labeling “I can’t sleep” as a self-reported sleep problem is more straightforward than labeling “I’m kind of envious of anyone who is able to fall asleep before 2am.” We attempted to address this by incorporating inherent task difficulties into the prediction models through the hybrid CNN model. However, the crowd-generated labels dominated the other features of our data set and had the greatest impact on the inference decisions. Building crowdsourcing models that are sensitive to the complexity of tasks and can allocate more resources (workers) to more difficult tasks is a worthwhile direction for future research.
Third, the way we designed and presented the HITs on AMT could impact the performance of workers in various ways. Considering the central role of people in maximizing the benefits of crowdsourcing services, human factors should be considered when designing crowdsourcing tasks [
Fourth, we defined workers’ qualifications based only on their historical performance in completing HITs across AMT (ie, master qualification). Although this provided some degree of quality control on the collected labels, alternative qualification requirements such as workers’ education, work background, and language could have also impacted our study results. To further study the role of qualification filtering, we pilot-tested the labeling process without any qualification requirements for 4500 physical activity tasks. These tasks were completed in <12 hours with a consistency score (
Fifth, various physical activities, based on their energy requirements in metabolic equivalents (METs), can be categorized into different movement behaviors, such as light (1.6-2.9 METs), moderate (3-5.9 METs), and vigorous (≥6 METs) [
Despite these limitations, our study is one of the first to rigorously investigate the challenges of using crowdsourcing to develop ML-based public health surveillance systems. Our findings support the argument that crowdsourcing, despite its low cost and short turnaround time, yields noisier data than in-house labeling. On the flip side, crowdsourcing can reduce annotation bias by involving a more diverse set of annotators [
The results of this study may inspire future research to investigate and evaluate the application of crowdsourcing for the development of ML-based digital public health surveillance systems deployed and used in national surveillance decision-making. As the potential for success of ML-based digital public health surveillance relies on robust and reliable data sets, a sensitivity analysis of health-related incidents detected by ML-based surveillance models trained on crowd-generated labels versus relevant national data sets is required to ascertain this potential. Moreover, to assess whether our conclusions are sensitive to the background and expertise of participants, further investigation is required using a cohort of experts who are familiar with the public health context under study. Likewise, to untangle the effect of task context from the quality of the crowd-generated labels, replicating the approach adopted in this study in other domains, including other public health domains, remains future work. Finally, as there is a chance that the quality of the crowd-generated labels is subject to the compensation amount, confounded by the socioeconomic characteristics of the participant cohort, future investigations are required to calibrate the results of this study considering these factors.
Additional figures that describe the Amazon Mechanical Turk labeling task, predictive model performance, and incorrectly labeled tweets in more detail.
AMT: Amazon Mechanical Turk
AUCPR: precision-recall area under the curve
CNN: convolutional neural network
DL: deep learning
DS: David and Skene
EM: expectation–maximization
GLAD: generative model of labels, abilities, and difficulties
HIT: Human Intelligence Task
KNN: K-nearest neighbors
LR: logistic regression
LSTM: long short-term memory
MET: metabolic equivalent
ML: machine learning
MV: majority voting
NLP: natural language processing
PASS: physical activity, sedentary behavior, and sleep quality
ReLU: Rectified Linear Unit
RY: Raykar algorithm
SHAP: Shapley Additive Explanations
SMOTE: Synthetic Minority Oversampling Technique-Nominal Continuous
SVM: support vector machine
This work was supported by a postdoctoral scholarship from the Libin Cardiovascular Institute and the Cumming School of Medicine, University of Calgary. This work was also supported by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada (RGPIN-2014-04743). The Public Health Agency of Canada funded the Amazon Mechanical Turk costs. The funders of the study had no role in the study design, data collection and analysis, interpretation of results, and preparation of the manuscript.
ZSHA was responsible for data collection and curation, model development, data analysis, and visualization, and wrote the paper. GPB and WT reviewed the paper and provided comments. JL conceived and designed the study and revised the manuscript.
None declared.