This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
In recent years, there has been a growth in work on the use of information extraction technologies for tracking disease outbreaks from online news texts, yet publicly available evaluation standards (and associated resources) for this new area of research have been noticeably lacking.
This study seeks to create a “gold standard” data set against which to test how accurately disease outbreak information extraction systems can identify the semantics of disease outbreak events. Additionally, we hope that the provision of an annotation scheme (and associated corpus) to the community will encourage open evaluation in this new and growing application area.
We developed an annotation scheme for identifying infectious disease outbreak events in news texts. An event, in the context of our annotation scheme, consists minimally of geographical (eg, country and province) and disease name information. However, the scheme also allows for the rich encoding of other domain-salient concepts (eg, international travel, species, and food contamination).
The work resulted in a 200-document corpus of event-annotated disease outbreak reports that can be used to evaluate the accuracy of event detection algorithms (in this case, for the BioCaster biosurveillance online news information extraction system). In the 200 documents, 394 distinct events were identified (mean 1.97 events per document, range 0-25 events per document). We also provide a download script and graphical user interface (GUI)-based event browsing software to facilitate corpus exploration.
In summary, we present an annotation scheme and corpus that can be used in the evaluation of disease outbreak event extraction algorithms. The annotation scheme and corpus were designed both with the particular evaluation requirements of the BioCaster system in mind as well as the wider need for further evaluation resources in this growing research area.
The need for computational tools for the tracking of emerging disease outbreaks from text has become increasingly important in recent years [
We believe that a focus on event extraction offers advantages over methods based solely on information retrieval (IR). Traditional IR systems allow us to identify reports based on the presence or absence of disease terms, whereas event-based IE approaches enable us to dig deep into a report’s semantics. The mere presence of a disease term in a text should not necessarily lead us to the conclusion that the report contains pressing information about an outbreak. Indeed, Steinberger et al estimated that 63% of documents selected using traditional IR techniques do not contain outbreak events [
The event annotation scheme aims to identify each infectious disease outbreak event in a given text with its associated disease, time, location (at various levels of granularity), and other relevant information. An annotated corpus is necessary in order to evaluate the performance of the current BioCaster IE system and also serves as a test bed for the development of new biosurveillance-specific IE algorithms and techniques. Further, the provision of a reusable resource facilitates further work on disease event extraction and encourages the development of the field, as it has been shown that the provision of such resources (often in conjunction with organized “challenge evaluations” similar to, for instance, the Text Retrieval Conference (TREC) Genomics Track [
Previous work on evaluation for disease outbreak report IE systems has focused on disparate aspects of performance. For example, Blench [
The structure of the paper is as follows. First, we describe the event annotation scheme we developed. We then set out agreement statistics. Finally, we present a description of the corpus and associated software.
Each document is associated with zero or more event frames reflecting the number of outbreak events described in the text (a full description of the annotation scheme and all associated software can be downloaded from the project Google Code site [
The following are the entity properties (which are filled by named entities) and their definitions:
HAS_DISEASE: disease that caused the outbreak (eg, Ebola)
HAS_LOCATION.COUNTRY: country where the outbreak occurred (eg, United States, Indonesia)
HAS_LOCATION.PROVINCE: province in which the outbreak occurred (eg, Kanagawa, New Hampshire)
HAS_LOCATION.OTHER: other geographical location (eg, Balkans, New England)
HAS_AGENT: agent (pathogen) of the disease (eg, HIV)
The following are the “fixed” slots (which are inferred from the text and take prespecified values) and their definitions:
HAS_SPECIES: human or non_human
TIME.relative: historical (more than three months ago), recent_past (between two weeks and three months ago), present (within the last two weeks), and hypothetical
ZOONOSIS: has species transfer occurred? (Boolean)
DRUG_RESISTANCE: is the disease drug resistant? (Boolean)
NEW_TYPE_AGENT: is the disease a new strain? (Boolean)
ACCIDENTAL_RELEASE: has the disease been released accidentally? (Boolean)
INTERNATIONAL_TRAVEL: is international travel involved? (Boolean)
FOOD_CONTAMINATION: is the outbreak caused by contaminated food or water? (Boolean)
HOSPITAL_WORKER: are any victims hospital workers? (Boolean)
FARM_WORKER: are any victims farm workers? (Boolean)
MALFORMED_PRODUCT: are contaminated blood products or vaccines implicated? (Boolean)
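The slots listed above can be pictured as a single record per event. The following is a minimal sketch in Python, not the BioCaster file format: the class name `EventFrame`, the field names, the default values, and the `is_minimal_event` helper are all assumptions introduced here for illustration. The helper encodes only the stated minimum for an event (a disease name plus some geographical information).

```python
from dataclasses import dataclass, field
from typing import List

# Prespecified values for the two non-Boolean fixed slots, as listed in the scheme.
SPECIES_VALUES = {"human", "non_human"}
TIME_VALUES = {"historical", "recent_past", "present", "hypothetical"}


@dataclass
class EventFrame:
    """Hypothetical container for one outbreak event (illustration only)."""

    # Entity properties: filled by named entities; synonyms may be listed together.
    has_disease: List[str] = field(default_factory=list)
    has_location_country: List[str] = field(default_factory=list)
    has_location_province: List[str] = field(default_factory=list)
    has_location_other: List[str] = field(default_factory=list)
    has_agent: List[str] = field(default_factory=list)

    # Non-Boolean fixed slots (the defaults are an assumption of this sketch).
    has_species: str = "human"
    time_relative: str = "present"

    # Boolean fixed slots (defaulting to False is an assumption of this sketch).
    zoonosis: bool = False
    drug_resistance: bool = False
    new_type_agent: bool = False
    accidental_release: bool = False
    international_travel: bool = False
    food_contamination: bool = False
    hospital_worker: bool = False
    farm_worker: bool = False
    malformed_product: bool = False

    def __post_init__(self) -> None:
        # Reject values outside the prespecified sets.
        if self.has_species not in SPECIES_VALUES:
            raise ValueError(f"unknown HAS_SPECIES value: {self.has_species!r}")
        if self.time_relative not in TIME_VALUES:
            raise ValueError(f"unknown TIME.relative value: {self.time_relative!r}")

    def is_minimal_event(self) -> bool:
        """True if the frame has a disease name and at least one location."""
        has_place = bool(
            self.has_location_country
            or self.has_location_province
            or self.has_location_other
        )
        return bool(self.has_disease) and has_place
```

For example, `EventFrame(has_disease=["avian influenza", "H5N1"], has_location_country=["Indonesia"], has_species="non_human")` would describe an illustrative animal outbreak event that satisfies the minimal requirement.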
A working group consisting of the current paper’s authors developed the annotation scheme over a period of several months guided by the World Health Organization International Health Regulations [
Worked example of event frame construction from raw text. Note that this paper focuses on the construction of event frames from documents already tagged for named entities. The named entity tagging process is described by Kawazoe et al [
In developing our guidelines, we took inspiration from the World Health Organization's International Health Regulations Annex 2 decision instrument [
To gain insight into how consistently the scheme could be applied and to help pinpoint areas of systematic annotator error, we conducted a 100-document interannotator agreement study. We recruited and trained one annotator and compared that individual’s annotations with those of an annotator who was involved in the original annotation scheme design process.
Following the recommendations of Wilbur et al [
We found that the two annotators agreed on the number of disease outbreak events 67% of the time. However, calculating agreement at the level of individual properties (eg, TIME.relative) was not as straightforward as calculating event number agreement for the following three reasons: (1) Annotators could identify a differing number of events for a document. (2) Unless both annotators produced just 1 (or zero) event frames, we were faced with the problem of aligning events. (3) The annotation scheme allowed for an arbitrary number of property values, reflecting synonymous or near synonymous terms in the source document. For example, it was not unusual to see a property/value pairing such as HAS_DISEASE=“bird flu|H5N1|avian influenza.”
Therefore, we concentrated our analysis on those 42 documents where only one event was identified per annotator, thus allowing for a direct comparison. These data are partially summarized in
The fixed slot properties, TIME.relative and HAS_SPECIES, are not Boolean and therefore are not represented in
The entity properties (eg, HAS_LOCATION.PROVINCE, HAS_DISEASE) were filled by tagged entities in the text. Agreement for HAS_DISEASE was 100% and for HAS_LOCATION.COUNTRY was 97.7%.
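Because entity properties may hold several synonymous values separated by a vertical bar (eg, "bird flu|H5N1|avian influenza"), comparing two annotators' fillers requires some matching rule. The sketch below assumes one possible rule, namely that two fillers agree when they share at least one term after case normalization. This rule and the function names `parse_values` and `values_agree` are assumptions for illustration and are not necessarily the criterion used in the agreement study.

```python
from typing import Set


def parse_values(raw: str) -> Set[str]:
    """Split a pipe-separated property value into a set of normalized terms."""
    return {term.strip().lower() for term in raw.split("|") if term.strip()}


def values_agree(raw_a: str, raw_b: str) -> bool:
    """Assumed criterion: fillers agree if they share at least one term."""
    return bool(parse_values(raw_a) & parse_values(raw_b))
```

Under this rule, `values_agree("bird flu|H5N1|avian influenza", "H5N1")` is true, whereas a stricter rule (eg, requiring identical sets) would count the same pair as a disagreement.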
Agreement for 42 documents with precisely one event per annotator (note that only Boolean fixed slot properties are shown)
Agreement for Fixed Slot Properties in Each of 42 Documents
Property | Annotator 1 (true) | Annotator 1 (false) | Annotator 2 (true) | Annotator 2 (false) | Agreement (%) |
DRUG_RESISTANCE | 0 | 42 | 0 | 42 | 100.0 |
FARM_WORKER | 0 | 42 | 0 | 42 | 100.0 |
FOOD_CONTAMINATION | 5 | 37 | 13 | 29 | 71.4 |
HOSPITAL_WORKER | 0 | 42 | 1 | 41 | 97.6 |
INTERNATIONAL_TRAVEL | 0 | 42 | 0 | 42 | 100.0 |
PRODUCT_MALFORMATION | 0 | 42 | 0 | 42 | 100.0 |
ZOONOSIS | 7 | 35 | 12 | 30 | 83.0 |
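The agreement column can be read as simple percentage agreement: the share of the 42 documents in which both annotators assigned the same Boolean value to a property. The following is a minimal sketch of that calculation, with the function name `percent_agreement` introduced here for illustration. For instance, the 71.4% reported for FOOD_CONTAMINATION corresponds to 30 matching documents out of 42.

```python
from typing import Sequence


def percent_agreement(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Percentage of items on which two annotators gave the same value."""
    if len(a) != len(b):
        raise ValueError("both annotators must label the same documents")
    if not a:
        raise ValueError("at least one document is required")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# HOSPITAL_WORKER row: annotator 1 marked no document true and annotator 2
# marked exactly one true, so the two annotators differ on one document.
annotator_1 = [False] * 42
annotator_2 = [True] + [False] * 41
hospital_worker_agreement = round(percent_agreement(annotator_1, annotator_2), 1)
```

Note that the per-annotator true and false counts alone do not generally determine agreement. The document-level pairing of labels is needed, which is why the function takes two aligned sequences.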
On detailed examination of the data, a systematic problem concerning event granularity emerged, accounting for the relatively low 67% event agreement rate. Our analysis showed that the issue of suspected zoonosis (ie, unconfirmed zoonosis or where zoonosis is presented as one possible explanation for a human disease) was central here. One annotator produced two events (one human, one non_human), while the other annotator only produced one event (human), ignoring the suspicion of, or speculation about, zoonosis.
We can distinguish annotator disagreement arising from ambiguity in the annotation guidelines from straightforward annotator mistakes. For instance, there are several examples where the temporal categories, present (within two weeks of the document time stamp) and recent_past (more than two weeks, but less than three months from the document time stamp), were confused.
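The temporal categories can be expressed as thresholds on the interval between the event and the document time stamp. The sketch below is illustrative only: the function name `classify_time`, the treatment of "two weeks" as 14 days and "three months" as 90 days, the handling of boundary days, and the explicit `hypothetical` flag are all assumptions, since the scheme defines the categories in prose.

```python
from datetime import date

PRESENT_DAYS = 14       # assumption: "two weeks" read as 14 days
RECENT_PAST_DAYS = 90   # assumption: "three months" read as 90 days


def classify_time(event_date: date, document_date: date,
                  hypothetical: bool = False) -> str:
    """Map an event date to a TIME.relative value, relative to the time stamp."""
    if hypothetical:
        return "hypothetical"
    age_days = (document_date - event_date).days
    if age_days < 0:
        # Assumption: a dated event cannot postdate the report describing it.
        raise ValueError("event date is later than the document time stamp")
    if age_days <= PRESENT_DAYS:
        return "present"
    if age_days <= RECENT_PAST_DAYS:
        return "recent_past"
    return "historical"
```

Making the thresholds explicit in this way shows where the present and recent_past categories meet, which is the boundary on which the annotators were observed to differ.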
For those properties that require an annotator to infer a category from the document (TIME.relative, ZOONOSIS, HAS_SPECIES, INTERNATIONAL_TRAVEL, DRUG_RESISTANCE, FOOD_CONTAMINATION, HOSPITAL_WORKER, FARM_WORKER, and PRODUCT_MALFORMATION), there is scope for incorrect inference. For example, several of the documents in the agreement study data set concern cholera. While cholera is spread primarily through water contamination (ie, FOOD_CONTAMINATION), this is not stated explicitly in the text. Only one of the annotators marked these documents as true for FOOD_CONTAMINATION, suggesting that the annotator who marked the property false was unaware of the primary transmission route for cholera.
The corpus consists of 200 documents (all in English) and their associated event frames, with documents gathered from a variety of sources (see
Corpus document sources (200 documents)
Document Source | Number of Documents | % of 200 |
ProMed-Mail | 43 | 21.5 |
Reuters | 16 | 8.0 |
BBC | 16 | 8.0 |
WHO | 41 | 20.5 |
CBS | 13 | 6.5 |
CBC | 17 | 8.5 |
Vietnam-net | 12 | 6.0 |
Hindustan Times | 18 | 9.0 |
The Nation (Thailand) | 9 | 4.5 |
All Africa | 5 | 2.5 |
Xinhua (China) | 5 | 2.5 |
Antara (Indonesia) | 5 | 2.5 |
Of the 394 annotated events in the corpus, 75.4% describe human (rather than animal) disease events (see
Event statistics (total number of events is 394)
Type of Event | Number of Events | % of 394 |
Events involving humans | 297 | 75.4 |
Events involving food contamination | 35 | 8.9 |
Events involving hospital workers | 3 | 0.8 |
Events involving malformed products | 2 | 0.5 |
Events classified as present | 321 | 81.5 |
Events classified as historical | 49 | 12.4 |
Events classified as recent_past | 11 | 2.8 |
Events classified as hypothetical | 13 | 3.3 |
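The percentages in the table above are shares of the 394 annotated events. As a minimal sketch of how such a summary could be derived from a list of slot values, the following tallies TIME.relative values and converts the counts to percentages. The function name `share_table` is an assumption for illustration, and the input list is reconstructed from the counts reported in the table rather than read from the corpus files.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def share_table(values: Iterable[str]) -> Dict[str, Tuple[int, float]]:
    """Return {value: (count, percentage of total)} for a list of slot values."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        value: (count, round(100.0 * count / total, 1))
        for value, count in counts.items()
    }


# TIME.relative values rebuilt from the reported counts (321 + 49 + 11 + 13 = 394).
time_values = (
    ["present"] * 321
    + ["historical"] * 49
    + ["recent_past"] * 11
    + ["hypothetical"] * 13
)
time_shares = share_table(time_values)
```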
Distribution of disease events in our corpus by country (only countries with 2 or more events shown) (Map produced by GPS visualizer)
While we hope that the event frame may form part of the foundation for a future standard, we recognize that there are challenges in achieving this goal (see
The current event frames may not be suitable for all needs. For some users, the knowledge required by event frames may be superfluous (eg, a system that is solely concerned with identifying cholera outbreaks has no need for zoonosis information). For other users, the event frame may not encode enough information (eg, an event's certainty or uncertainty, which is not represented in our event frame, may be important for system designers). Indeed, it is conceivable that some users may suffer from both these problems. Nevertheless, we believe that our event scheme provides a foundation for potential future standards developments.
The current agreement level for number of events (67%) is not high. However, this result masks the fact that agreement for important entity properties such as HAS_LOCATION.COUNTRY and HAS_DISEASE is almost perfect.
The current scheme was designed for news text. It is not clear how well the scheme would extend to other, less formal genres that may contain information of interest (eg, blog postings and message boards).
Linux BioCaster corpus event frame browsing tool [
In summary, we present an annotation scheme and corpus that can be used in the evaluation of disease outbreak event extraction algorithms. The annotation scheme and corpus are presented to the research community in the belief that such resources can help in the formation of an emerging standard for this rapidly growing research area.
This work was partially funded by a Japanese Society for the Promotion of Science postdoctoral fellowship (author MC).
None declared
GPHIN: Global Public Health Intelligence Network
GUI: graphical user interface
IE: information extraction
IR: information retrieval
MUC: Message Understanding Conference
NER: named entity recognition
PULS: Pattern-based Understanding and Learning System
TREC: Text Retrieval Conference
WHO: World Health Organization
XML: Extensible Markup Language