Web 2.0-Based Crowdsourcing for High-Quality Gold Standard Development in Clinical Natural Language Processing
Haijun Zhai*, PhD; Todd Lingren*, MA; Louise Deleger, PhD; Qi Li, PhD; Megan Kaiser, BA; Laura Stoutenborough, BSN; Imre Solti, MD, PhD
Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH, United States
*these authors contributed equally
Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center
3333 Burnet Avenue
Cincinnati, OH, 45229
Phone: 1 513 636 1020
Fax: 1 513 636 1020
Background: A high-quality gold standard is vital for supervised, machine learning-based, clinical natural language processing (NLP) systems. In clinical NLP projects, expert annotators traditionally create the gold standard. However, traditional annotation is expensive and time-consuming. To reduce the cost of annotation, general NLP projects have turned to crowdsourcing based on Web 2.0 technology, which involves submitting smaller subtasks to a coordinated marketplace of workers on the Internet. Many studies have been conducted in the area of crowdsourcing, but only a few have focused on tasks in the general NLP field and only a handful in the biomedical domain, usually based upon very small pilot sample sizes. In addition, the quality of the crowdsourced biomedical NLP corpora was never exceptional when compared to traditionally-developed gold standards. The previously reported results on the medical named entity annotation task showed a 0.68 F-measure based agreement between crowdsourced and traditionally-developed corpora.
Objective: Building upon previous work from the general crowdsourcing research, this study investigated the usability of crowdsourcing in the clinical NLP domain with special emphasis on achieving high agreement between crowdsourced and traditionally-developed corpora.
Methods: To build the gold standard for evaluating the crowdsourcing workers’ performance, 1042 clinical trial announcements (CTAs) from the ClinicalTrials.gov website were randomly selected and double annotated for medication names, medication types, and linked attributes. For the experiments, we used CrowdFlower, an Amazon Mechanical Turk-based crowdsourcing platform. We calculated sensitivity, precision, and F-measure to evaluate the quality of the crowd’s work and tested the statistical significance (P<.001, chi-square test) to detect differences between the crowdsourced and traditionally-developed annotations.
Results: The agreement between the crowd’s annotations and the traditionally-generated corpora was high for (1) annotations (0.87, F-measure for medication names; 0.73, medication types) and (2) correction of previous annotations (0.90, medication names; 0.76, medication types), and excellent for (3) linking medications with their attributes (0.96). Simple voting provided the best judgment aggregation approach. There was no statistically significant difference between the crowd and traditionally-generated corpora. Our results showed a 27.9% improvement over previously reported results on the medication named entity annotation task.
(J Med Internet Res 2013;15(4):e73)
clinical informatics; natural language processing; named entity; reference standards; crowdsourcing; user computer interface; quality control
One of the key components of supervised machine learning-based clinical natural language processing (NLP) systems is the high-quality gold standard used for training and testing. In clinical NLP projects, expert annotators are traditionally asked to double annotate the text for the purposes of the gold standard. Expert annotators could be clinicians or extensively trained laypeople. Unless the expert annotators are volunteers, they are costly to pay, and it is usually not easy to build a sufficiently large group of expert annotators locally or, consequently, to assemble a contingent of annotators quickly. To reduce the cost of expert human annotation, many projects in general NLP have turned to crowdsourcing, which involves submitting a large number of smaller subtasks to a coordinated marketplace of workers on the Internet. These workers (called turkers) are paid small amounts (usually a few cents) for each task, sometimes resulting in considerable overall savings over the traditional expert annotator model. The trade-off is usually between the accuracy of the annotation result and the cost savings. Because anonymous turkers from all over the world have different levels of proficiency in the task and are not trained to accomplish the task, efficient quality control and judgment voting methods are required to generate good results.
Many studies have been conducted in the area of crowdsourcing tasks. In 2008, Snow et al were the first to explore the feasibility of crowdsourcing in NLP. Five NLP tasks were published to turkers on Amazon Mechanical Turk (AMT). Their results indicated that non-expert labellers could produce high-quality annotations. Since then, data created by crowdsourcing have been widely studied in different research areas. Lawson et al described how using a competitive payment system and interannotator agreement improved the quality of named entity annotations on AMT. Unlike traditional named entity experiments, Finin et al presented their experience of leveraging AMT and CrowdFlower to annotate named entities in Twitter data. It was the first work on named entity recognition in the new domains of Facebook and Twitter. Meanwhile, several studies attempted to use crowdsourcing to create data for machine translation systems. Ambati and Vogel explored the effectiveness of using AMT to do sentence translation for creating parallel corpora. Denkowski et al [8-10] attempted to generate annotated data in a variety of languages. In addition, crowdsourcing was also applied to transcription [11-13], part-of-speech tagging, and other tasks [15,16].
In a recent publication of the Journal of Medical Internet Research, Turner et al reported on the use of crowdsourcing to collect feedback on the design of health promotion messages for oral health. Luengo-Oroz et al evaluated the feasibility of crowdsourcing to conduct malaria image analysis. Gathering a large number of high-quality annotations is a critical challenge in biomedical NLP, which was presented in detail in the editorial of Chapman et al. As demonstrated by studies in the general NLP field, crowdsourcing is a decidedly promising solution for this research area. However, in contrast to the general NLP domain, there are only a few studies involving crowdsourcing in biomedical NLP and almost none for clinical NLP. Most recently, Burger et al performed a task of extracting gene-mutation relations in Medical Literature Analysis and Retrieval System Online (MEDLINE) abstracts on AMT. In their work, candidate mutations were extracted from 250 MEDLINE abstracts using the Extractor of Mutations (EMU) and presented together with the curated gene lists from the National Center for Biotechnology Information (NCBI). Using a customized interface, it was feasible for turkers to apply their judgments. They reported that the weighted accuracy was 82%. This work was somewhat similar to our linking of medications and their attributes, but it focused on a very specific gene-mutation domain. Norman et al investigated leveraging crowdsourcing to facilitate the discovery of new medicines.
Improving the quality of judgments is one of the most important issues in crowdsourcing, especially for tasks without strong quality control. A variety of methods have been proposed to assess the quality of judgments from turkers. Kumar and Lease [23,24] presented a weighted voting method based on turkers’ accuracies, which can be estimated by taking the full set of labels into account. Jung and Lease conducted a large-scale consensus study on relevance judgments between query/document pairs for Web search on the ClueWeb09 dataset. In their work, approximately 20,000 labels were collected from 766 Mechanical Turk workers. They reported that computing the Z-score could filter noisy labels and achieve a significant improvement in comparison to a majority vote baseline. Based on the previous work, a semi-supervised approach was proposed to maximize the benefit from consensus, with consensus labels from both labelled and unlabelled examples. As these studies indicated, though much progress has been made, quality control and aggregating judgments are still the major challenges of crowdsourcing. The highest reported performance of medication named entity annotation from earlier crowdsourcing attempts in the biomedical NLP domain was 0.68 (F-measure) for agreement between traditional and crowd-generated corpora [21,22].
In our research, we applied strict quality control to select qualified turkers and investigated multiple approaches to aggregating judgments. The goals of our study were to build upon previous work from the general crowdsourcing research and to evaluate the usability of the crowdsourcing approach in the clinical NLP domain. This will help us automate clinical trial eligibility screening. The clinical NLP tasks that we used for the purpose of evaluation were medical named entity recognition and entity linking in a clinical trial announcement (CTA) corpus. The entities involved were medication names and medication types, as well as their attributes. During our research, we first studied the turkers’ performance in annotating medical named entities on a large-scale data set. Second, we proposed to use crowdsourcing to link named entities and their attributes, in which the entities and attributes were pre-annotated in the text and the crowdsourcing task was to identify entity/attribute pairs that are associated in the text. Third, we attempted to find a new solution to produce a more robust, manually-created gold standard (ie, correction) by investigating whether an iterative model of crowdsourcing tasks can correct errors from previous generations of tasks. Finally, we studied 3 methods to aggregate multiple annotations of the same text to generate a better gold standard.
Our research contributed to the field of clinical NLP by: (1) evaluating the usability of crowdsourcing in the clinical NLP domain, (2) publicly releasing the user interface software that is necessary for crowdsourced, named-entity annotation, and (3) implementing a 4-component quality control strategy to improve the crowd-generated annotation, including an introductory quiz to filter out automated scripts, geographical constraints for turkers, training of turkers for the task, and continuous performance monitoring. We will release the annotated corpora in December 2013 when our NIH grant funding concludes.
Definition of Annotated Named Entities and Linkages
This section presents the definitions and examples of medication entities (medication names and medication types) and medication-attribute linkages annotated in this work.
Medication names are specific names of drugs, biological substances, and treatments. Some examples of medication names are ibuprofen, phosphonoacetic acid, vancomycin, and ganciclovir.
Medication types refer to classes of drugs (eg, antibiotics, anti-inflammatory drugs, benzodiazepines), types of drug therapy (eg, chemotherapy), and general references to medications (eg, “study drug”, “other drugs”, “medication”).
Attributes define how much, how often, and in what form medications or medication types are taken. We distinguished between the following categories of attributes (based on the schema of the SHARPn project):
- Date: indicating all dates associated with the medication (eg, start dates, concluding dates)
- Strength: indicating the strength number and units of the prescribed drug
- Dosage: indicating the amount of each medication used by the patient and type of dose it is (eg, high dose, low dose, stable dose)
- Form: indicating the shape or configuration of the medication (eg, tablet, capsule, liquid, injection, infusion)
- Frequency: indicating how often each dose of the medication should be taken
- Duration: indicating how long the patient is expected to take the drug
- Route: indicating route or method of the medication (eg, intravenous, oral, chew, topical)
- Status change: indicating whether the medication is currently being taken or not (eg, active, inactive, hold, incomplete, started, discontinued, increased, decreased, no change)
- Modifier: indicating mentions that could exist under certain circumstances (eg, conditional modifier), develop or alter a mention (eg, course modifier), or generic modifier (eg, conventional)
The linking task associates attributes with their corresponding medication entities, assuming medications and attributes have already been pre-annotated. The following sentence demonstrates the linking task: “Advair 250/50 diskus 1 puff and Singulair 5mg chewable 1 tablet once a day”. In this sentence, “Advair” and “Singulair” are the medication names, and “250/50, diskus, 1, puff, 5mg, chewable, 1, tablet, and once a day” are the attributes. Here, “250/50, diskus, 1, and puff” are the attributes of Advair and “5mg, chewable, 1, tablet, and once a day” are the attributes of Singulair, as shown in Figure 1.
|Figure 1. Example of linkages between medications and their attributes.|
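The linkages in the Advair/Singulair example can be written down as a small data structure. The following is a minimal sketch only: the dictionary layout and the helper name `linked_pairs` are hypothetical and are not taken from the study's software.

```python
# Hypothetical representation of the linking example from the text.
# Field names and layout are illustrative assumptions.
sentence = "Advair 250/50 diskus 1 puff and Singulair 5mg chewable 1 tablet once a day"

# Each medication name maps to the attribute mentions linked to it.
links = {
    "Advair": ["250/50", "diskus", "1", "puff"],
    "Singulair": ["5mg", "chewable", "1", "tablet", "once a day"],
}


def linked_pairs(links):
    """Flatten the mapping into (medication, attribute) pairs."""
    return [(med, attr) for med, attrs in links.items() for attr in attrs]


pairs = linked_pairs(links)
for medication, attribute in pairs:
    print(f"{medication} <- {attribute}")
```

Surface strings are used here for readability. Because the attribute “1” occurs twice in the sentence, a real annotation format would presumably identify mentions by character offsets rather than by text.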
Gold Standard to Evaluate Turker Performance
In one of our previous projects, CTAs were annotated for medication extraction. In this paper, we present the most important features of the gold standard used in the study. Details of the corpora and the process of the gold standard development were thoroughly described in a separate manuscript that was published in the 2012 AMIA Annual Conference Proceedings. The corpus was double annotated for medication names, types, and attributes by two annotators (college graduates with bachelor’s degrees) to create a gold standard, at a cost of 20 days per annotator for annotation of medication names and types and an additional 20 days per annotator for the attributes. Additionally, each attribute was linked to its respective medication name or medication type.
The CTA corpus was composed of 3000 CTAs randomly selected from the ClinicalTrials.gov website (105,598 documents as of March 2011). We annotated only the eligibility criteria sections of the trial announcements. Table 1 shows the descriptive statistics of the corpus (number of documents and number of annotations in the traditional gold standard). In this study, we used crowdsourcing to annotate only medication names and medication types. The linking crowdsourcing experiment utilized pre-annotated text: medication names, medication types, and attributes.
|Table 1. CTA corpus statistics.|
Because the CTAs were longer than the text of many crowdsourcing tasks, and considering the difficulty of clinical NLP annotations, we decided to break up the CTAs into smaller paragraph-length sizes for the tasks. Based on a tokenizer we wrote to count discrete basic units, the average token count in a CTA document was 212. In the paragraph-size tasks, we split the CTAs into paragraphs with at least 50 tokens, preserving the original format and the integrity of the CTA file (no paragraphs spanned into different CTAs). This resulted in 9773 paragraphs or “units”.
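The splitting step can be illustrated with a short sketch. The study's tokenizer and exact splitting rules are not given here, so the code below makes several assumptions: tokens are whitespace-separated, paragraphs are separated by blank lines, consecutive paragraphs are merged until a unit reaches the minimum size, and a short trailing remainder is attached to the preceding unit. The function name `split_into_units` is hypothetical.

```python
def split_into_units(cta_text, min_tokens=50):
    """Group consecutive paragraphs of ONE CTA into units of >= min_tokens tokens.

    Assumptions (not from the paper): whitespace tokenization and
    blank-line paragraph boundaries. Units never span two CTAs because
    the function is called once per CTA.
    """
    paragraphs = [p for p in cta_text.split("\n\n") if p.strip()]
    units, current, count = [], [], 0
    for para in paragraphs:
        current.append(para)
        count += len(para.split())
        if count >= min_tokens:
            units.append("\n\n".join(current))
            current, count = [], 0
    if current:
        leftover = "\n\n".join(current)
        if units:
            # Attach a short tail to the previous unit instead of
            # creating an undersized unit.
            units[-1] = units[-1] + "\n\n" + leftover
        else:
            # The whole CTA is shorter than min_tokens.
            units.append(leftover)
    return units
```

Joining paragraphs with their original separator keeps the formatting of the CTA intact, which matches the stated goal of preserving the original format.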
Crowdsourcing User Interface
In addition to the customizable GUI, another key benefit of using the CrowdFlower (CF) crowdsourcing platform over directly accessing AMT is its strict quality control measures. CF provides an interface for creating and editing “gold standard answers” for quality control. “Gold standard answers” are randomly included (without the turkers being aware of their presence) in the submitted data, and a turker is required to meet a minimum threshold of accuracy in these “gold” examples in order to continue submitting tasks. When a turker meets this threshold, he/she is deemed “trusted”. Only the “trusted” turkers’ data are collected to establish final judgments. If an example has 3 medications in the unit and the turker annotates only 2 correctly, the system scores the judgment as incorrect because there are no partial scores in determining a turker’s trust status within a particular unit of annotated text.
In pilots, we experimented with different thresholds. Lower thresholds resulted in lower agreement of the turker-annotated corpora with the gold standard corpora. Higher thresholds prevented the successful completion of the task by eliminating too many turkers. Because of the complexity of the task, we experimented with a trust-based threshold and found 50% (on the unit level) to be the most feasible threshold. A turker presented with “gold standard” examples had to accurately annotate 50% of the unit-based responses. That is, if the turker annotated 4 units of “gold” examples, at least 2 (of the 4) had to be exact matches for him/her to establish trustworthiness. The 50% threshold was evaluated on the unit level and not on the named entity level. That is, all of a unit’s annotations, or judgments, had to match the “gold standard” answers exactly, irrespective of the number of named entities per unit.
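The unit-level, all-or-nothing trust rule described above can be sketched as follows. This is an illustration of the rule as stated in the text, not CF's actual implementation; the function names and the representation of an entity as a `(start, end, label)` tuple are assumptions.

```python
def unit_is_correct(turker_entities, gold_entities):
    """A gold unit counts only if the full annotation set matches exactly."""
    return set(turker_entities) == set(gold_entities)


def trust_score(responses):
    """Fraction of gold units answered with an exact match.

    responses: list of (turker_entities, gold_entities) pairs, one per
    gold unit the turker has seen.
    """
    if not responses:
        return 0.0
    correct = sum(unit_is_correct(t, g) for t, g in responses)
    return correct / len(responses)


def is_trusted(responses, threshold=0.5):
    """Apply the 50% unit-level threshold reported in the text."""
    return trust_score(responses) >= threshold


# Hypothetical gold unit with 3 medications; entities are (start, end, label).
gold = [(0, 6, "name"), (10, 19, "name"), (25, 36, "type")]
partial = gold[:2]  # only 2 of the 3 annotated: no partial credit

responses = [(gold, gold), (gold, gold), (partial, gold), ([], gold)]
print(trust_score(responses), is_trusted(responses))
```

With 2 exact matches out of 4 gold units, the turker sits exactly at the threshold and remains trusted, mirroring the "at least 2 of the 4" example.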
We also found that the training mode of CF was very helpful in winnowing the pool of turker candidates to only the highest quality annotators. In training mode, the turkers were directed to several training examples first. All of the training examples were gold standard examples, and the turker had to complete 4 examples correctly to proceed to the production annotation task. Based on these interfaces, we implemented our quality control strategy. In all our tasks, 20% of the total number of units submitted for judgment were uploaded and set up as “gold” units. That is, 20% of the annotated units were gold standard units with which the CF system could continuously gauge the trustworthiness of the turker. If a turker’s trustworthiness slipped below 50%, the turker was warned. If his/her performance did not improve during the next two gold tests, then the turker’s entire output was excluded from the collected data and the system subsequently blocked the turker from submitting any further judgments.
In the in-house experts’ generated gold standard, approximately 30% of the CTAs had no medications or medication types. Due to the splitting of the CTAs into smaller units, however, the empty percentage grew to 42%. Several initial pilot experiments were conducted regarding the study’s design features, including training mode, trusted-turker accuracy threshold, and whether empty tasks were included or not. We tested the performance of excluding empty units (where the data included at least one entity from the in-house gold standard in every unit and a turker had to mark at least one entity to submit) and including empty units (ie, units that have no entities from the in-house gold standard). To mirror the original task given to the traditional annotators and to keep the annotated sample representative of the full CTA corpus, we kept the empty units at 30% of the crowdsourced units.
During the pilot annotations, we had difficulty with a large number of untrusted turkers and judgments coming from Asia, so we restricted the project to turkers from Australia, Canada, the United Kingdom, and the United States. We also requested 5 judgments per unit (from 5 different turkers) in order to allow flexibility with voting measures and methods. In addition to the training mode, a qualification quiz was presented to each turker the first time they signed up for our tasks. They had to read and understand the instructions and answer a short quiz (3 multiple choice questions) in order to gain access to the job. The quiz blocked “robot scripts” from participating in our tasks.
The “word selection” function supported double-clicking to select a single word, automatic word-extending and invalid character-shrinking to improve the accuracy and efficiency of the turkers’ operations. Two buttons (“extend highlight” and “shrink highlight”) were provided to extend and shrink the highlighted (annotated) area on the right hand side by one character at a time. After selecting one word or more, a menu with two options (“Medication Name” and “Medication Type”) popped up for the turkers to select the target annotation type. After the turkers clicked the option, the selected word(s) was highlighted by a corresponding color (eg, green was for “Medication Type,” as shown in Figure 3) and all the current annotation information was displayed in the table named “Annotated Entity List”. If the turkers needed to remove annotations already highlighted with a label or if they wanted to change the label of the highlighted word, they had to left-click on the highlighted word and click “OK” to confirm their choice to remove the annotation from the table at the bottom of the page. It should be noted that entities comprised of discontinuous tokens could be highlighted as a single entity by concurrently pressing down CTRL.
The interface for the correction task (correcting previously annotated entities) was similar to the annotation task with the only difference being that some words were pre-annotated (highlighted). The offsets associated with these highlighted words were prefilled into the unit judgment table.
|Figure 3. Medication named entity recognition task interface.|
|Figure 4. Linking task interface.|
After the initial, smaller pilot experiments, we selected a larger number of units for the complete named-entity recognition task. In an earlier unpublished project to develop a machine learning-based medication entity-extraction pipeline, we determined that 1042 CTAs were necessary for the training set to achieve F-measures higher than 0.80 (0.86 for medication name and 0.82 for medication type, using the Conditional Random Fields algorithm for information extraction). We used this empirically determined corpus size of 1042 CTAs, corresponding to 3400 units split as described in the previous section, for both the medical named-entity recognition and entity-linking jobs. Several samples annotated by turkers and their corresponding gold standard are presented in Multimedia Appendix 2.
Based on the pilot medication named-entity annotation experiments, the correction experiment was performed by taking a smaller data set with 200 units and its corresponding 1000 judgments (5 judgments for each unit) and submitting the unique judgments to another crowdsourcing job. The previous experiment had 735 unique judgments (out of 1000). If a particular unit had 3 unique judgments and two additional duplicate judgments, we resubmitted only the 3 unique judgments for the correction job. A judgment was defined by the response of a turker to a unit, covering all of the entities annotated for that unit. In this example, the original job had 5 judgments for the unit and the correction job had 15 judgments for that same unit (3 unique judgments submitted for 5 correction judgments each). For each correction judgment, a turker had the opportunity to remove annotations, add additional annotations, or provide no change to that unit.
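The selection of unique judgments for resubmission can be sketched as below. This is a minimal illustration under the assumption that a judgment is the set of entities one turker annotated in a unit, so two judgments are duplicates when those sets are equal; the function name and the entity tuples are hypothetical.

```python
def unique_judgments(judgments):
    """Return the distinct judgments for one unit, in first-seen order.

    judgments: list of iterables of entities, one per turker response.
    """
    seen, unique = set(), []
    for judgment in judgments:
        key = frozenset(judgment)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Hypothetical unit with 5 judgments: 3 distinct plus 2 duplicates.
a = [("aspirin", "name")]
b = [("aspirin", "name"), ("antibiotics", "type")]
c = []
unit_judgments = [a, b, a, c, b]

to_resubmit = unique_judgments(unit_judgments)
correction_judgments = len(to_resubmit) * 5  # 5 correction judgments each
print(len(to_resubmit), correction_judgments)
```

For this unit, 3 unique judgments are resubmitted and collect 15 correction judgments in total, matching the worked example in the text.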
In this paper, standard named-entity recognition and classification measurements were adopted to evaluate the performance of the experiments, including Precision (P), Recall (R), and F-measure (F), which are defined in the Multimedia Appendix 3.
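For reference, the standard definitions of these measures can be sketched in a few lines. The exact formulas used in the study are in Multimedia Appendix 3, which is not reproduced here; the code below assumes exact-match comparison of entities and the usual balanced F-measure, and the function name is hypothetical.

```python
def precision_recall_f(predicted, gold):
    """Exact-match precision, recall, and balanced F-measure over entity sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical entities identified by (unit_id, start, end, label).
gold = {(1, 0, 7, "name"), (1, 12, 20, "type"), (2, 5, 9, "name"),
        (2, 30, 41, "name"), (3, 0, 4, "type")}
predicted = {(1, 0, 7, "name"), (1, 12, 20, "type"), (2, 5, 9, "name"),
             (3, 8, 15, "name")}

p, r, f = precision_recall_f(predicted, gold)
print(p, r, round(f, 3))
```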
One of the aims of this study was to evaluate different methods of voting on judgments from crowdsourced outputs. Because these are named-entity and linking tasks, the calculation is on the entity and linking level. We experimented with 3 voting methods for the medication name and medication-type entity recognition and the medication attribute linking tasks.
We investigated 3 voting methods: simple voting (simple), trusted score weighted voting (trust), and turker experience weighted voting (experience). All voting was performed at the entity and linking level (micro average), regardless of the number of entities and linkages in a given task unit. Equations (1), (2), and (3) shown in Multimedia Appendix 4 describe the formulas we used for the 3 voting methods. Let e be the number of votes for a particular named entity and let J be the number of judgments (number of turkers who submitted responses) in this unit. Let t_i be the trust score of turker i who annotated the entity. Let u_i be the total number of judgments turker i performed and let m be the maximum number of judgments the most prolific turker performed. For simple voting presented in Equation 1, if there were 2 or more annotations (out of 5 judgments/responses) for a particular entity, it was selected for the adjudicated judgment.
Equation 2 gives the trusted score voting, which weighs a particular turker’s entity vote with their trust score (a trust score of 75% provided a 0.75 vote per entity and the maximum trust score of 100% provided a single simple vote). As presented in Equation 3, turker experience voting weighted each entity vote by the experience score of the turker. The experience score is the number of judgments performed by a turker relative to the maximum number of judgments the most prolific turker performed in that experiment. For example, in one job, a turker submitted 163 judgments, which was the most of any turker in that job. That turker’s weight for all of his/her entity votes became 1 and the experience score for all other turkers became u/163. Note that the intention of the division in the 3 equations was to normalize the scores to the range of 0 to 1. As presented in Figure 5, there was high variance in the accomplished number of jobs between turkers. The point of the logarithm in Equation 3 was to scale the difference.
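The 3 voting schemes can be sketched as follows. The actual equations are in Multimedia Appendix 4 and are not reproduced here, so this is an approximation built only from the prose: each score is normalized by the number of judgments J and compared with a threshold (0.4 corresponds to 2 of 5 judgments). The experience weight uses the linear ratio u/m given in the worked example; the logarithmic scaling mentioned for Equation 3 is deliberately left out because its exact form is not stated. All function names are hypothetical.

```python
def simple_vote(votes, num_judgments, threshold=0.4):
    """Accept an entity if e / J reaches the threshold (2 of 5 by default)."""
    return votes / num_judgments >= threshold


def trust_vote(trust_scores, num_judgments, threshold=0.4):
    """Each vote counts as the voter's trust score t_i (between 0 and 1)."""
    return sum(trust_scores) / num_judgments >= threshold


def experience_vote(judgment_counts, max_judgments, num_judgments, threshold=0.4):
    """Each vote counts as u_i / m; the log scaling of Equation 3 is omitted."""
    weights = [u / max_judgments for u in judgment_counts]
    return sum(weights) / num_judgments >= threshold


# One entity annotated by 2 of the 5 turkers who judged the unit.
print(simple_vote(2, 5))                       # 2/5 meets the 0.4 threshold
print(trust_vote([0.75, 1.0], 5))              # 1.75/5 = 0.35, below 0.4
print(experience_vote([163, 20], 163, 5))      # dominated by one prolific turker
```

The example shows why the weighted schemes are stricter at the same threshold: two votes from turkers with less than full weight no longer reach 0.4, so the threshold would need separate tuning per method.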
The F-measures were calculated using the 3 voting methods on the original judgments (with each unit having 5 judgments) as correction baselines, presented in Table 6. These were then compared to the subsequent correction results computed by the 3 voting methods over all correction judgments, presented in Table 5. In order to further show the impact of correction, another measure, which we described as a response-level entity vote, is presented in Figure 6. We counted whether the F-measure of the correction judgments improved upon the F-measure of the original judgment.
|Figure 5. The distributions of turkers’ experience for medical named-entity task, correction task and linking task (X axis denotes number of jobs, Y axis indicates number of turkers).|
|Figure 6. Improvement chart for correction task.|
Statistical Significance Test of Turker Performance
In order to analyze the differences between the corpus created by the turkers and the corpus created by in-house expert annotators, a statistical assessment method (named “pooling chi-square test”) was proposed to calculate the P values. In this method, the voting results from turkers were pooled together with the corpus that was annotated by experts. These pooled results were then tested against the original voting results. The hypothesis was that turkers, given sufficient training and aggregation of multiple results, could perform as well as experts. If this hypothesis was true, then pooling the results was not expected to change the original CF voting results. Specifically, the null hypothesis H0 was that the experts’ output did not change the quality of the turkers’ annotations (reflected by the number of unique entities annotated correctly and incorrectly). If the P value was less than the designated threshold (0.05), it meant that the experts’ output significantly affected the quality of the turkers’ results. In other words, the turkers did not perform as well as experts. If the P value was higher than the predetermined threshold, then we could not reject the hypothesis. Therefore, we could infer that there was no evidence for statistically significant differences between the turkers’ and experts’ annotations.
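A chi-square comparison of this kind can be sketched on a 2x2 table of correct and incorrect entity counts. This is an illustration only: how the pooled counts were formed in the study is not specified here, the counts below are invented, and the use of an uncorrected Pearson statistic is an assumption. For 1 degree of freedom, the P value equals erfc(sqrt(chi2/2)), which lets the sketch stay within the Python standard library.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Row 1: original voting results (correct, incorrect).
    Row 2: pooled results (correct, incorrect).
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a nonzero total")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts, for illustration only.
stat_same, p_same = chi_square_2x2(90, 10, 90, 10)   # identical proportions
stat_diff, p_diff = chi_square_2x2(90, 10, 50, 50)   # clearly different
print(stat_same, p_same)
print(round(stat_diff, 3), p_diff < 0.05)
```

When pooling leaves the correct/incorrect proportions unchanged, the statistic is 0 and the P value is 1, which corresponds to the case where the null hypothesis cannot be rejected.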
Information on Turkers
Table 2 shows information on turkers participating in our 3 tasks. A total of 156, 86, and 46 turkers completed the medical named-entity task, correction task, and linking task, respectively. Figure 5 presents the distribution of turkers by the number of performed jobs for the 3 tasks. We found that the top 5 most prolific turkers completed 39.9% (6778/17,000) of the medical named-entity jobs, 44.0% (1616/3675) of the correction jobs, and 45.4% (7716/17,000) of the linking jobs. Figure 7 shows the distribution of the turkers’ F-measures for the 3 tasks. F-measures of greater than 0.6 were achieved by over 83% of turkers for the medical named-entity task, over 88% of turkers for the correction task, and 100% of turkers for the linking task. Table 3 presents the cost and completion time of the 3 tasks. The payment of 3.84 cents per judgment included 3 cents paid to turkers and 0.84 cents charged by CF. Table 3 also presents the time required for the in-house annotators to complete the same tasks. Additionally, the time to receive results from in-house annotation is around 5 times longer than for crowdsourcing due to the parallel nature of the crowdsourcing task and the traditional work hours (eg, Monday to Friday, 9am-5pm). The 133 hours represented by the total in-house annotation were the total work hours. The total elapsed time was 10 days (8 work days plus 2 weekend days).
Results of Medical Named-Entities Annotation Task
Table 4 shows the results of the turkers’ medical named-entity annotation with the 3 voting methods that were implemented. It shows the turkers’ generated corpus’ agreement with the in-house experts’ generated gold standard at various threshold levels.
|Table 2. Information on turkers participating in the 3 tasks.|
|Table 3. Cost and time of the 3 tasks.|
|Figure 7. The distribution of turkers’ F-measure for medical named-entity task, correction task and linking task (X axis denotes F-measure, Y axis indicates number of turkers).|
|Table 4. Results of medical named entity annotation (the pre-determined threshold and its corresponding P, R, and F for each column are italicized).|
Results of Correction Task
The results of the correction task, its corresponding correction baseline, and the results of combined judgments are presented in Tables 5 and 6, respectively. In the correction task, the turker-expert agreement F-measures for medication name and medication type reached 0.900 and 0.760 by simple vote, respectively. Compared with the F-measures of the corresponding correction baseline, relative improvements of 2.62% (medication name Baseline_F-measure = 0.877, After_Correction_F-measure = 0.900; computed by (After_Correction_F-measure - Baseline_F-measure)/ Baseline_F-measure * 100) and 10.79% (medication type Baseline_F-measure = 0.686, After_Correction_F-measure = 0.760; computed by the same formula) were gained (Tables 5 and 6).
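The relative improvement formula quoted above is simple enough to check directly. The sketch below applies it to the F-measures reported in the text; the function name is hypothetical.

```python
def relative_improvement(baseline_f, after_f):
    """(After - Baseline) / Baseline * 100, as defined in the text."""
    return (after_f - baseline_f) / baseline_f * 100


# Correction task vs its baseline (values reported in the text).
name_gain = relative_improvement(0.877, 0.900)
type_gain = relative_improvement(0.686, 0.760)

# Current study vs the earlier crowdsourced result for medication names.
overall_gain = relative_improvement(0.68, 0.87)

print(f"{name_gain:.2f}% {type_gain:.2f}% {overall_gain:.1f}%")
```

The same formula yields the 27.9% improvement over earlier work (0.87 vs 0.68) cited in the abstract.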
|Table 5. Results of correction task with 200 units and 1000 judgments (the pre-determined threshold and its corresponding P, R, and F for each column are italicized).|
|Table 6. Baseline results of medical named entity annotation corresponding to the correction task (the pre-determined threshold and its corresponding P, R, and F for each column are italicized).|
Furthermore, we analyzed the practical significance of these improvements by calculating the F-measure of medication name and medication type for each unique judgment (the total number of unique judgments was 735) and its corresponding 5 correction judgments. Based on empirical evidence acquired in previous experiments, the F-measure was computed based on a simple vote with the threshold of 0.4. The results are shown in Figure 6. Improvement was seen for 50.5% (370/735) and 44.1% (324/735) of the judgments for medication name and medication type after the turkers’ correction, respectively. In contrast, 1.9% (14/735) and 6.9% (51/735) judgments became worse.
Result of Linking Task
Table 7 shows the results of the linking experiment. The non-expert annotators (turkers) performed very well, reaching an F-measure of 0.962. As the previous results also indicated, the simple voting method can obtain very good results when quality control is strict.
Table 7. Results of linking task (the pre-determined threshold and its corresponding precision [P], recall [R], and F-measure [F] for each column are italicized).
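The agreement figures reported in these tables are precision, recall, and F-measure of the crowd output against the expert gold standard (the exact definitions used in the study are given in Multimedia Appendix 3). The sketch below uses the standard balanced F-measure and assumes exact matching of annotations; the medication-attribute pairs are hypothetical examples, not data from the study.

```python
def precision_recall_f(crowd, gold):
    """Agreement of crowd annotations with gold annotations (both are sets)."""
    true_positives = len(crowd & gold)
    precision = true_positives / len(crowd) if crowd else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical medication-attribute links for one unit.
gold = {("aspirin", "81 mg"), ("aspirin", "daily"), ("metformin", "500 mg")}
crowd = {("aspirin", "81 mg"), ("metformin", "500 mg"), ("metformin", "daily")}

p, r, f = precision_recall_f(crowd, gold)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```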
Results of Statistical Significance Analysis
For all the results above, chi-square tests of statistical significance were conducted between the corpora created through CrowdFlower and the gold standard generated by the in-house annotators. At the P<.001 level, the tests showed no statistically significant difference between the best CrowdFlower-generated corpora and the corresponding in-house gold standard sets.
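For readers unfamiliar with the test, the sketch below computes a Pearson chi-square statistic and P value for a 2x2 contingency table. How the contingency tables were built in this study is not described in this section, so the layout (rows for the two corpora, columns for two annotation outcomes) and the counts are purely hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column total is zero")
    statistic = n * (a * d - b * c) ** 2 / denominator
    # With 1 degree of freedom, the upper tail equals erfc(sqrt(statistic / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: row 1 = crowdsourced corpus, row 2 = in-house corpus.
statistic, p_value = chi_square_2x2(450, 550, 470, 530)
significant = p_value < 0.001
print(f"chi2={statistic:.3f}, P={p_value:.3f}, significant at .001: {significant}")
```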
To our knowledge, the medical named-entity annotation task described in this work is the largest-scale crowdsourcing experiment in the clinical NLP research field. The results demonstrated that crowdsourcing is a feasible solution for creating a gold standard for medical named entities. Many works were described in the introduction section, but only one performed a similar crowdsourced medical named entity annotation and is directly comparable to our current study. All other works focused on different corpora and entity types and cannot be compared directly with this work. We improved upon the previously reported results on the medical named entity annotation task [21,22] by more than 27.9% in F-measure (F-measure_Current_Study = 0.87 vs F-measure_Earlier_Work = 0.68 for agreement between the crowdsourced and traditionally developed corpora; computed by (F-measure_Current_Study - F-measure_Earlier_Work) / F-measure_Earlier_Work * 100). This experiment also showed that crowdsourcing performance for medication name annotation is much better than that for medication type. This is similar to the in-house results with trained, expert annotators. We attribute this phenomenon to the clarity of the task for medication name annotation. In other words, the definition and the gold standard answers of medication names are easier to understand and to capture than those of medication type. In the future, we plan to use a more easily interpretable definition of medication types to improve performance. We also plan to use crowdsourcing to annotate attributes, such as date and dosage, as listed in Table 1.
Based on our experiments, we found that it was easy to find a large number of turkers by crowdsourcing. Around 10% (156/1144, 86/678, 46/644 for medication name entity, linking and correction tasks, respectively) of the turkers passed our quality control test (see Table 2). Among those turkers, around 10% (14/156, 11/86, 7/46) of them contributed over 50% (10,521/17,000, 10,900/17,000, 1907/3675) of the jobs.
As shown in Table 4, the non-expert annotators produced very high-quality annotations, and the results indicated that the simple voting method could obtain very good results, provided the quality control is strict. In our previous work, we reported inter-annotator agreement (IAA) F-measures of 94.2% and 88.2% for medication names and medication types, respectively. Additionally, what could conceivably have been weeks’ worth of in-house annotation work was achieved in less than a day of crowdsourcing effort.
Our previous study conducted experiments by implementing a rule-based linking system. The result (around 0.72 F-measure) showed that manual annotation is definitely needed to develop an effective training set for a machine learning-based linking system. The presented linking experiment is the first work known to us that attempted to link medications to their corresponding attributes with crowdsourcing. The results indicated that linking is not a difficult task and that the data created are of sufficient quality for real applications. Based upon this experiment, we plan to create a larger-scale data set using crowdsourcing and to apply it to clinical NLP tasks. We will further evaluate the performance of linking by implementing our linking strategy for other clinical named entities. The results of the linking task are excellent, with a near 100% (N=3400) agreement between crowd and traditionally developed corpora.
As shown in Table 3, the linking task took much less time than the other two tasks, most likely because it is much easier. The time per judgment for the medical named entity annotation task was much lower than that of the correction task (12.07 vs 37.22 seconds, respectively). The reason is that the medical named entity annotation task had more participating turkers (156 vs 86). We can conclude that the difficulty of the tasks and the number of participating turkers strongly affect the completion time of the tasks. In contrast to traditional annotation, crowdsourcing achieved 55.5% time (71/128 hours) and 75.0% cost ($1958/$2611) savings for medical named entity annotation. For the linking task, 38.6% time (17/44 hours) and 27.2% cost ($244/$897) savings were seen when using crowdsourcing.
To our knowledge, we were the first to conduct clinical NLP correction experiments. The results of that experiment are quite encouraging. Our correction F-measure was 0.90 (medication names) and even the worst final F-measure improved by more than 10% after the corresponding voting (medication types). We believe that this experiment showed another feasible and efficient way to improve the output of crowdsourcing. We designed an efficient strategy to perform correction. Future work will focus on determining the number of iterative cycles to achieve the best results.
As was mentioned in the previous sections, creating a smaller batch of gold standard data (in-house with expert annotators) is a critically important step for crowdsourcing quality control. This in-house gold standard can be used later to: (1) train turkers, (2) perform quality control, and (3) determine thresholds to aggregate judgments. In this study, we also modified the gold standard management interfaces of CF to perform turker training and quality control by setting gold standard answers. There is room for further research in different methods to train turkers and to experiment with quality control thresholds.
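One way to picture gold-standard-based quality control is to score each turker on the expert-annotated units and keep only those above a minimum accuracy. The sketch below is a generic illustration under that assumption: the function names, the data layout, and the 0.7 cutoff are hypothetical and do not describe CrowdFlower's actual mechanism or the settings used in this study.

```python
def gold_accuracy(answers, gold_answers):
    """Fraction of gold units that the turker answered exactly as the experts did."""
    scored = [unit for unit in answers if unit in gold_answers]
    if not scored:
        return 0.0
    correct = sum(answers[unit] == gold_answers[unit] for unit in scored)
    return correct / len(scored)


def select_turkers(all_answers, gold_answers, min_accuracy=0.7):
    """Return the turkers whose accuracy on gold units reaches `min_accuracy`."""
    return sorted(
        turker
        for turker, answers in all_answers.items()
        if gold_accuracy(answers, gold_answers) >= min_accuracy
    )


# Hypothetical gold units (unit id -> expert answer) and turker answers.
gold_answers = {"u1": "A", "u2": "B", "u3": "C", "u4": "A"}
all_answers = {
    "t1": {"u1": "A", "u2": "B", "u3": "C", "u4": "A", "u9": "B"},  # 4/4 on gold
    "t2": {"u1": "A", "u2": "C", "u3": "A", "u4": "A"},             # 2/4 on gold
    "t3": {"u1": "A", "u2": "B", "u3": "C", "u4": "B"},             # 3/4 on gold
}
trusted = select_turkers(all_answers, gold_answers)
print(trusted)
```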
Finally, 3 different voting methods were investigated to aggregate judgments. The results showed that it is quite possible to acquire a high-quality annotated corpus by implementing simple voting under the condition of strict quality control. In pilots, we experimented with different voting thresholds. Lower thresholds resulted in lower agreement of the turker-annotated corpora with the gold standard corpora. Higher thresholds prevented the successful completion of the task by eliminating too many turkers. The thresholds used in the paper (eg, 2 judgments out of 5 or 0.4) were set empirically based on our pilot experiments and earlier related work [32,33]. For the judgment-based voting (eg, trust-based and experience-based voting) more complicated voting methods could be implemented and compared.
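To give a flavor of how a judgment-based method differs from counting votes equally, the sketch below weights each judgment by a per-turker trust score before applying the threshold. It is not the trust-based method evaluated in this study (those equations are in Multimedia Appendix 4); the weighting scheme, trust values, and annotations are hypothetical.

```python
from collections import defaultdict


def weighted_vote(judgments, threshold=0.4):
    """Keep annotations whose share of the total trust reaches `threshold`.

    `judgments` is a list of (trust, annotations) pairs, with trust in [0, 1].
    """
    total_trust = sum(trust for trust, _ in judgments)
    if total_trust == 0:
        return set()
    weight = defaultdict(float)
    for trust, annotations in judgments:
        for ann in annotations:
            weight[ann] += trust
    return {ann for ann, w in weight.items() if w / total_trust >= threshold}


# Hypothetical judgments for one unit: (turker trust, annotated strings).
judgments = [
    (0.9, {"aspirin"}),
    (0.8, {"aspirin", "daily"}),
    (0.5, {"daily"}),
    (0.6, set()),
    (0.7, {"ibuprofen"}),
]
kept = weighted_vote(judgments)
print(sorted(kept))
```

Here "daily" appears in 2 of the 5 judgments, which would meet an unweighted 0.4 threshold, but it carries only about 37% of the total trust and is therefore dropped.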
A potential limitation of this study was that the proportion of empty units in our experimental corpus (30%) was lower than that in the general population of CTA documents (42%). On the other hand, our pilot experiments showed that the proportion of empty units did not influence the performance of the turkers. A second potential limitation was that we tested only 3 voting methods. We plan to address this limitation in our future work.
In this study, we evaluated the feasibility of crowdsourcing for creating gold standard data for clinical NLP tasks. Although direct comparison with all related work in the literature was not possible because of corpora and entity type differences, by implementing strict quality control for turker selection and by continuously monitoring the turkers’ performance, we improved upon the directly comparable results in the literature by more than 27.9% for the named-entity annotation task. Three major experiments were conducted: (1) named-entity annotation, (2) entity linking, and (3) annotation correction. In addition, 3 voting methods were studied. To our knowledge, we were the first to investigate the feasibility of crowdsourcing for clinical named-entity annotation on a large-scale corpus. Similarly, we are not aware of a competing work in the clinical NLP domain that proposed to use crowdsourcing to create an entity-linking gold standard for information extraction on the scale of our experiments. Furthermore, we proposed a successful correction strategy that applied crowdsourcing to crowdsourcing results to improve the quality of the annotated corpus. We found that a high-quality clinical NLP gold standard could be obtained by a simple voting method if strict quality control is implemented.
The work presented was partially supported by NIH grants 5R00LM010227-04, 1R21HD072883-01, and 1U01HG006828-01.
Conflicts of Interest
Multimedia Appendix 1
GUI source code.[PDF File (Adobe PDF File), 528KB]
Multimedia Appendix 2
Samples annotated by turkers and their corresponding gold standard.[XLS File (Microsoft Excel File), 6MB]
Multimedia Appendix 3
Definitions of precision, recall, and F-measure.[PDF File (Adobe PDF File), 97KB]
Multimedia Appendix 4
Voting method equations.[PDF File (Adobe PDF File), 99KB]
- Chapman WW, Nadkarni PM, Hirschman L, D'Avolio LW, Savova GK, Uzuner O. Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions. J Am Med Inform Assoc 2011;18(5):540-543.
- Snow R, O'Connor B, Jurafsky D, Ng AY. Cheap and fast-but is it good?: evaluating non-expert annotations for natural language tasks. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics; 2008 Presented at: Empirical Methods in Natural Language Processing; 2008; Honolulu, Hawaii p. 254-263.
- Amazon Mechanical Turk. Seattle, WA: Amazon URL: https://www.mturk.com:443/mturk/welcome [accessed 2012-10-10]
- Lawson N, Eustice K, Perkowitz M, Yildiz M. Annotating large email datasets for named entity recognition with Mechanical Turk. Stroudsburg, PA: Association for Computational Linguistics; 2010 Presented at: NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk; 2010; Los Angeles, CA p. 71-79.
- Finin T, Murnane W, Karandikar A, Keller N, Martineau J. Annotating Named Entities in Twitter Data with Crowdsourcing. Stroudsburg, PA: Association for Computational Linguistics; 2010 Presented at: NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk; 2010; Los Angeles, CA p. 80-88.
- Crowdflower. URL: http://crowdflower.com/ [accessed 2012-10-10]
- Ambati V, Vogel S. Can crowds build parallel corpora for machine translation systems? Stroudsburg, PA: Association for Computational Linguistics; 2010 Presented at: NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk; 2010; Los Angeles, CA p. 62-65.
- Denkowski M, Al-Haj H, Lavie A. Turker-assisted paraphrasing for English-Arabic machine translation. Stroudsburg, PA: Association for Computational Linguistics; 2010 Presented at: NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk; 2010; Los Angeles, CA p. 66-70.
- Gao Q, Vogel S. Semi-supervised word alignment with mechanical turk. Stroudsburg, PA: Association for Computational Linguistics; 2010 Presented at: NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk; 2010; Los Angeles, CA p. 30-34.
- Bloodgood M, Callison-Burch C. Using mechanical turk to build machine translation evaluation sets. Stroudsburg, PA: Association for Computational Linguistics; 2010 Presented at: NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk; 2010; Los Angeles, CA p. 208-211.
- Audhkhasi K, Georgiou PG, Narayanan SS. Analyzing quality of crowd-sourced speech transcriptions of noisy audio for acoustic model adaptation. IEEE; 2012 Mar 25 Presented at: International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2012; Kyoto, Japan p. 4137-4170.
- Evanini K, Higgins D, Zechner K. Using amazon mechanical turk for transcription of nonnative speech. Stroudsburg, PA: Association for Computational Linguistics; 2010 Presented at: NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk; 2010; Los Angeles, CA p. 53-56.
- Lee CY, Glass J. A transcription task for crowdsourcing with automatic quality control. ISCA; 2011 Presented at: Interspeech; 2011; Florence p. 3041.
- Gimpel K, Schneider N, O'Connor B, Das D, Mills D, Eisenstein J, et al. Part-of-speech tagging for Twitter: annotation, features, and experiments. Stroudsburg, PA: Association for Computational Linguistics; 2011 Presented at: 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies; 2011; Portland, Oregon p. 42-47.
- Yano T, Resnik P, Smith NA. Shedding (a thousand points of) light on biased language. Stroudsburg, PA: Association for Computational Linguistics; 2010 Presented at: NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk; 2010; Los Angeles, CA p. 152-158.
- Jha M, Andreas J, Thadani K, Rosenthal S, McKeown K. Corpus creation for new genres: a crowdsourced approach to PP attachment. Stroudsburg, PA: Association for Computational Linguistics; 2010 Presented at: NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk; 2010; Los Angeles, California p. 13-20.
- Turner AM, Kirchhoff K, Capurro D. Using crowdsourcing technology for testing multilingual public health promotion materials. J Med Internet Res 2012;14(3):e79.
- Luengo-Oroz MA, Arranz A, Frean J. Crowdsourcing malaria parasite quantification: an online game for analyzing images of infected thick blood smears. J Med Internet Res 2012;14(6):e167.
- Burger JD, Doughty E, Bayer S, Tresner-Kirsch D, Wellner B, Aberdeen J, et al. Validating candidate gene-mutation relations in MEDLINE abstracts via crowdsourcing. In: Bodenreider O, Rance B, editors. Data Integration in the Life Sciences. Lecture Notes in Computer Science, Volume 7348/2012. Berlin: Springer; 2012:83-91.
- Norman TC, Bountra C, Edwards AM, Yamamoto KR, Friend SH. Leveraging crowdsourcing to facilitate the discovery of new medicines. Sci Transl Med 2011 Jun 22;3(88):88mr1.
- Yetisgen-Yildiz M, Solti I, Xia F. Using Amazon's mechanical turk for annotating medical named entities. In: AMIA Annu Symp Proc. 2010 Presented at: AMIA 2010 Annual Symposium; 2010; Washington, DC p. 1316.
- Yetisgen-Yildiz M, Solti I, Xia F, Halgrim SR. Preliminary experiments with Amazon's mechanical turk for annotating medical named entities. Stroudsburg, PA: Association for Computational Linguistics; 2010 Presented at: NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk; 2010; Los Angeles, California p. 180-183.
- Kumar A, Lease M. Learning to rank from a noisy crowd. New York, NY: ACM; 2011 Presented at: 34th International ACM SIGIR Conference on Research and Development in Information Retrieval; 2011; Beijing, China p. 1221-1222.
- Kumar A, Lease M. Modelling annotator accuracies for supervised learning. New York, NY: Association for Computing Machinery; 2011 Presented at: WSDM Workshop on Crowdsourcing for Search and Data Mining (WSDM); 2011; Hong Kong p. 19-22.
- Jung HJ, Lease M. Improving consensus accuracy via Z-score and weighted voting. In: Human Computation: Papers from the 2011 AAAI Workshop (WS-11-11). Menlo Park, California: The AAAI Press; 2011 Presented at: 3rd Human Computation Workshop (HCOMP); 2011; San Francisco, CA p. 88-90.
- Callan J. The ClueWeb09 Dataset. 2009. URL: http://lemurproject.org/clueweb09/ [accessed 2013-03-21]
- Tang W, Lease M. Semi-supervised consensus labeling for crowdsourcing. 2011 Presented at: SIGIR Workshop on Crowdsourcing for Information Retrieval; 2011; Beijing, China p. 66-75.
- Albright D, Lanfranchi A, Fredriksen A, Styler WF, Warner C, Hwang JD, et al. Towards comprehensive syntactic and semantic annotations of the clinical narrative. J Am Med Inform Assoc 2013 Jan 25:1-9.
- Deleger L, Li Q, Lingren T, Kaiser M, Molnar K, Stoutenborough L, et al. Building Gold Standard Corpora for Medical Natural Language Processing Tasks. 2012. Presented at: American Medical Informatics Association 2012 Annual Symposium; 2012; Chicago, IL p. 144-153.
- Ogren P. Knowtator: a protégé plug-in for annotated corpus construction. In: Proceedings of the 2006 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume: demonstrations. Stroudsburg, PA: Association for Computational Linguistics; 2006 Presented at: NAACL-Demonstrations '06; 2006; New York, New York p. 273-275.
- Li Q, Zhai H, Deleger L, Lingren T, Kaiser M, Stoutenborough L, et al. A sequence labeling approach to link medications and their attributes in clinical notes and clinical trial announcements for information extraction. Journal of the American Medical Informatics Association 2012:1-7.
- Lawson N, Eustice K, Perkowitz M, Yetisgen-Yildiz M. Annotating large email datasets for named entity recognition with mechanical turk. Stroudsburg, PA: Association for Computational Linguistics; 2010 Presented at: NAACL HLT Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk; 2010; Los Angeles p. 71-79.
- Demartini G, Difallah DE, Cudré-Mauroux P. ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. New York, NY: ACM; 2012 Presented at: 21st International Conference on World Wide Web (WWW '12); 2012; Lyon, France p. 469-478.
- Solti lab code page. URL: https://code.google.com/p/soltilab/ [accessed 2013-03-21]
|AMT: Amazon Mechanical Turk|
|CML: CrowdFlower Markup Language|
|CSS: Cascading Style Sheets|
|CTA: clinical trial announcement|
|EMU: Extractor of Mutations|
|GUI: graphical user interface|
|IAA: inter-annotator agreement|
|MEDLINE: Medical Literature Analysis and Retrieval System Online|
|NCBI: National Center for Biotechnology Information|
|NLP: natural language processing|
|Edited by G Eysenbach; submitted 06.11.12; peer-reviewed by M Luengo-Oroz, H Xu, L Hirschman; comments to author 25.11.12; revised version received 12.12.12; accepted 08.03.13; published 02.04.13|
Please cite as:
Zhai H, Lingren T, Deleger L, Li Q, Kaiser M, Stoutenborough L, Solti I
Web 2.0-Based Crowdsourcing for High-Quality Gold Standard Development in Clinical Natural Language Processing
J Med Internet Res 2013;15(4):e73
Copyright©Haijun Zhai, Todd Lingren, Louise Deleger, Qi Li, Megan Kaiser, Laura Stoutenborough, Imre Solti. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 02.04.2013.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.