Developing a Disease Outbreak Event Corpus
Background: In recent years, there has been a growth in work on the use of information extraction technologies for tracking disease outbreaks from online news texts, yet publicly available evaluation standards (and associated resources) for this new area of research have been noticeably lacking.
Objective: This study seeks to create a “gold standard” data set against which to test how accurately disease outbreak information extraction systems can identify the semantics of disease outbreak events. Additionally, we hope that the provision of an annotation scheme (and associated corpus) to the community will encourage open evaluation in this new and growing application area.
Methods: We developed an annotation scheme for identifying infectious disease outbreak events in news texts. An event, in the context of our annotation scheme, consists minimally of geographical (eg, country and province) and disease name information. However, the scheme also allows for the rich encoding of other domain salient concepts (eg, international travel, species, and food contamination).
Results: The work resulted in a 200-document corpus of event-annotated disease outbreak reports that can be used to evaluate the accuracy of event detection algorithms (in this case, for the BioCaster biosurveillance online news information extraction system). In the 200 documents, 394 distinct events were identified (mean 1.97 events per document, range 0-25 events per document). We also provide a download script and graphical user interface (GUI)-based event browsing software to facilitate corpus exploration.
Conclusion: In summary, we present an annotation scheme and corpus that can be used in the evaluation of disease outbreak event extraction algorithms. The annotation scheme and corpus were designed both with the particular evaluation requirements of the BioCaster system in mind and with the wider need for further evaluation resources in this growing research area.
(J Med Internet Res 2010;12(3):e43)
- disease outbreaks;
- natural language processing;
- text mining;
- information extraction;
- public health informatics
The need for computational tools for the tracking of emerging disease outbreaks from text has become increasingly important in recent years [, ], leading to the development of various machine-aided surveillance systems (eg, Global Public Health Intelligence Network (GPHIN) [ ], HealthMap [ ], BioCaster [ ], MedISys [ ], Pattern-based Understanding and Learning System (PULS) [ ], and EpiSPIDER [ ]). One way to evaluate the semantics of such a system is to construct an event frame (ie, template), which is then associated with each outbreak event in a sample of news documents (the nature and scope of reportable events varies according to the case definition of each system). This paper reports on such a data set (an annotation scheme and corpus [ ]) developed for disease outbreak event detection in the context of the BioCaster biosurveillance online news information extraction (IE) system [ , ].
We believe that a focus on event extraction offers additional advantages over methods based solely on information retrieval (IR). Traditional IR systems allow us to identify reports based on the presence or absence of disease terms, whereas event-based IE approaches enable us to dig deep into a report’s semantics. The mere presence of a disease term in a text should not necessarily lead us to the conclusion that the report contains pressing information about an outbreak. Indeed, Steinberger et al estimated that 63% of documents selected using traditional IR techniques do not contain outbreak events. For example, vaccination campaigns, medical research results, and public health advice often occur in news texts and are likely to generate false positives if we rely solely on IR to identify documents of interest. An event-based strategy facilitates the exclusion of nonrelevant documents from further processing and could form the basis of more sophisticated text mining and visualization while providing richer outbreak data for end users. Note that the event-based approach suggested here requires antecedent document selection and named entity recognition (NER) modules (ie, a pipeline with a document selection module inputting relevant documents to an NER module before this output is piped to an event extraction module). In the case of the BioCaster system, the document selection module has a particularly important “gate-keeping” function as the system accepts input from over 1700 RSS feeds, far too many documents to subject to the computationally intensive NER and event extraction processes [ ].
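The ordering of the three modules described above can be sketched as follows. This is an illustration only, not the BioCaster implementation: the function `run_pipeline` and the three toy stand-in stages (a keyword gate, a two-word disease lexicon, and a one-property event builder) are all hypothetical names introduced here to show where the inexpensive gate-keeping step sits relative to the more expensive NER and event extraction steps.

```python
from typing import Any, Callable, Dict, Iterable, List

Document = str
TaggedDocument = Dict[str, Any]
Event = Dict[str, str]


def run_pipeline(
    documents: Iterable[Document],
    is_relevant: Callable[[Document], bool],
    tag_entities: Callable[[Document], TaggedDocument],
    extract_events: Callable[[TaggedDocument], List[Event]],
) -> List[List[Event]]:
    """Run document selection, then NER, then event extraction, in that order."""
    per_document_events = []
    for doc in documents:
        # Gate-keeping: rejected documents never reach the costly stages.
        if not is_relevant(doc):
            continue
        tagged = tag_entities(doc)
        per_document_events.append(extract_events(tagged))
    return per_document_events


# Toy stand-ins (hypothetical), used only to make the sketch runnable.
def toy_is_relevant(doc: Document) -> bool:
    return "outbreak" in doc.lower()


def toy_tag_entities(doc: Document) -> TaggedDocument:
    known_diseases = ("cholera", "ebola")
    found = [d for d in known_diseases if d in doc.lower()]
    return {"text": doc, "diseases": found}


def toy_extract_events(tagged: TaggedDocument) -> List[Event]:
    return [{"HAS_DISEASE": d} for d in tagged["diseases"]]
```

In this toy setting, a report about a vaccination campaign that mentions a disease name but no outbreak is dropped at the first stage and produces no event frames.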
The event annotation scheme aims to identify each infectious disease outbreak event in a given text with its associated disease, time, location (at various levels of granularity), and other relevant information. An annotated corpus is necessary in order to evaluate the performance of the current BioCaster IE system and also serves as a test bed for the development of new biosurveillance-specific IE algorithms and techniques. Further, the provision of a reusable resource facilitates further work on disease event extraction and encourages the development of the field, as it has been shown that the provision of such resources (often in conjunction with organized “challenge evaluations” similar to, for instance, the Text Retrieval Conference (TREC) Genomics Track ) has increased research momentum for other IE tasks [ ].
Previous work on evaluation for disease outbreak report IE systems has focused on disparate aspects of performance. For example, Blench  found that the GPHIN system identified 56% of the outbreaks verified by the World Health Organization (WHO) over a three-year period, while Freifeld et al [ ] found that the HealthMap system successfully classified 84% of reports by disease and location over a one-month period. Kawazoe et al [ ] reported that the BioCaster system’s NER module achieved an F-score of 76.97, while for the PULS system (which is an event extraction system that relies on input from the MedISys IR system), it is estimated that approximately 72% of the extracted events are correct [ ]. While this kind of evaluation work is important for system developers, the obvious difficulty in comparing reported results illustrates the need for a community-wide data set for algorithm testing.
The structure of the paper is as follows. First, we describe the event annotation scheme we developed. We then set out agreement statistics before finally presenting a description of the corpus and associated software.
Each document is associated with zero or more event frames reflecting the number of outbreak events described in the text (a full description of the annotation scheme and all associated software can be downloaded from the project Google Code site). The event frames are designed to capture the properties of outbreak reports that are of interest to public health experts and epidemiologists. Event frames are formatted in extensible markup language (XML) (see ) and consist of property names and their associated values derived from the document source (eg, HAS_DISEASE, “Ebola”). Reports have already been tagged for named entities such as person names, disease names, location names, and so on (twelve in total) using an ontology-based annotation scheme developed specifically for the disease outbreak domain [ ]. Property names are of two types. First, entity properties are filled with appropriate entities derived directly from the text of interest (entity properties are conceptually similar to Message Understanding Conference (MUC) style “string fills”). For example, the HAS_DISEASE property could only have the value “polio” if “polio” is tagged as an entity in the document. Second, fixed slots (equivalent to MUC-style “set fills”) take prespecified values of a restricted kind (normally simply Boolean true or false values), and, unlike entity values, are inferred from the document. For example, the INTERNATIONAL_TRAVEL property accepts only Boolean values.
The following are the entity properties (which are filled by named entities) and their definitions:
- HAS_DISEASE: disease that caused the outbreak (eg, Ebola)
- HAS_LOCATION.COUNTRY: country where the outbreak occurred (eg, United States, Indonesia)
- HAS_LOCATION.PROVINCE: province in which the outbreak occurred (eg, Kanagawa, New Hampshire)
- HAS_LOCATION.OTHER: other geographical location (eg, Balkans, New England)
- HAS_AGENT: agent (pathogen) of the disease (eg, HIV)
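The constraint that an entity property may only take a value that is tagged as an entity in the document can be illustrated with a small check. The sketch below is a simplification under stated assumptions: the frame is represented as a plain dictionary from property name to a list of values, tagged entities are represented as a set of surface strings (entity types are ignored), matching is case-insensitive, and the function name `unsupported_entity_fills` is hypothetical. None of these representation choices comes from the corpus software.

```python
from typing import Dict, List, Set, Tuple

# Entity property names as listed in the annotation scheme.
ENTITY_PROPERTIES = (
    "HAS_DISEASE",
    "HAS_LOCATION.COUNTRY",
    "HAS_LOCATION.PROVINCE",
    "HAS_LOCATION.OTHER",
    "HAS_AGENT",
)


def unsupported_entity_fills(
    frame: Dict[str, List[str]], tagged_entities: Set[str]
) -> List[Tuple[str, str]]:
    """Return (property, value) pairs whose value is not a tagged entity."""
    tagged_lower = {entity.lower() for entity in tagged_entities}
    problems = []
    for prop in ENTITY_PROPERTIES:
        for value in frame.get(prop, []):
            # Assumption: a fill is acceptable if its string was tagged at all.
            if value.lower() not in tagged_lower:
                problems.append((prop, value))
    return problems
```

An empty result means every entity fill in the frame is backed by a tagged entity; any returned pair points to a value that would have to be tagged in the text before it could be used.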
The following are the “fixed” slots (which are inferred from the text and take prespecified values) and their definitions:
- HAS_SPECIES: human or non_human
- TIME.relative: historical (more than three months ago), recent_past (between two weeks and three months ago), present (within the last two weeks), and hypothetical
- ZOONOSIS: has species transfer occurred? (Boolean)
- DRUG_RESISTANCE: is the disease drug resistant? (Boolean)
- NEW_TYPE_AGENT: is the disease a new strain? (Boolean)
- ACCIDENTAL_RELEASE: has the disease been released accidentally? (Boolean)
- INTERNATIONAL_TRAVEL: is international travel involved? (Boolean)
- FOOD_CONTAMINATION: is the outbreak caused by contaminated food or water? (Boolean)
- HOSPITAL_WORKER: are any victims hospital workers? (Boolean)
- FARM_WORKER: are any victims farm workers? (Boolean)
- MALFORMED_PRODUCT: are contaminated blood products or vaccines implicated? (Boolean)
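The TIME.relative thresholds above can be expressed as a simple bucketing rule. In the scheme these values are inferred from the text by an annotator, so the following is only a sketch of the category boundaries, not a description of how annotation was performed. It assumes that explicit report and event dates are available, approximates "two weeks" as 14 days and "three months" as 90 days, and treats the hypothetical category as a flag supplied by the caller; the function name `relative_time` and both constants are introduced here for illustration.

```python
from datetime import date

TWO_WEEKS_DAYS = 14      # "within the last two weeks"
THREE_MONTHS_DAYS = 90   # assumption: three months approximated as 90 days


def relative_time(report_date: date, event_date: date,
                  hypothetical: bool = False) -> str:
    """Map an event date to one of the four TIME.relative values."""
    if hypothetical:
        # Cannot be derived from dates; must be judged from the text.
        return "hypothetical"
    age_days = (report_date - event_date).days
    if age_days < 0:
        raise ValueError("event_date is later than report_date")
    if age_days <= TWO_WEEKS_DAYS:
        return "present"
    if age_days <= THREE_MONTHS_DAYS:
        return "recent_past"
    return "historical"
```

For instance, under these assumptions an event dated six days before the report falls in the present bucket, whereas one dated a year earlier is historical.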
A working group consisting of the current paper’s authors developed the annotation scheme over a period of several months guided by the World Health Organization International Health Regulations  (see ) and using advice provided by the National Institute of Infectious Diseases in Japan.
Agreement Study and Error Analysis
To gain insight into how consistently the scheme could be applied and to help pinpoint areas of systematic annotator error, we conducted a 100-document interannotator agreement study. We recruited and trained one annotator and compared that individual’s annotations with those of an annotator who was involved in the original annotation scheme design process.
Following the recommendations of Wilbur et al , we used percentage scores to assess agreement rather than the kappa statistic [ ]. While some researchers in annotation scheme design refrain from the use of agreement studies entirely (eg, [ ]), we felt that this exercise would help to draw out any systematic annotator difficulties and also facilitate the debugging of the annotation scheme and corpus.
We found that the two annotators agreed on the number of disease outbreak events 67% of the time. However, calculating agreement at the level of individual properties (eg, TIME.relative) was not as straightforward as calculating event number agreement for the following three reasons: (1) Annotators could identify a differing number of events for a document. (2) Unless both annotators produced just 1 (or zero) event frames, we were faced with the problem of aligning events. (3) The annotation scheme allowed for an arbitrary number of property values, reflecting synonymous or near synonymous terms in the source document. For example, it was not unusual to see a property/value pairing such as HAS_DISEASE=“bird flu|H5N1|avian influenza.”
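The third difficulty, multiple near-synonymous values in a single property, can be made concrete with a short sketch. It assumes that values are separated by the "|" character as in the example above, and it shows one possible (not the authors') way to compare two annotators' fills: normalize each fill to a set of lowercased strings and test whether the sets share at least one member. The names `split_values` and `values_overlap` are hypothetical.

```python
from typing import Set


def split_values(raw: str) -> Set[str]:
    """Split a pipe-separated property value into a set of normalized strings."""
    return {part.strip().lower() for part in raw.split("|") if part.strip()}


def values_overlap(fill_a: str, fill_b: str) -> bool:
    """True if the two fills share at least one normalized value."""
    return bool(split_values(fill_a) & split_values(fill_b))
```

A stricter alternative would require the two sets to be identical; which criterion is appropriate depends on how synonymous terms are meant to be scored.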
Therefore, we concentrated our analysis on those 42 documents where only one event was identified per annotator, thus allowing for a direct comparison. These data are partially summarized in, where it can be seen that the annotators agreed 100% of the time on DRUG_RESISTANCE, FARM_WORKER, INTERNATIONAL_TRAVEL, and MALFORMED_PRODUCT. Agreement was worst for FOOD_CONTAMINATION and ZOONOSIS. Major sources of disagreement are summarized in .
The fixed slot properties TIME.relative and HAS_SPECIES are not Boolean and therefore are not represented in. TIME.relative had four values (historical, recent_past, present, and hypothetical) and achieved an agreement score of 90.5% (with the most frequent value being “present”). HAS_SPECIES had two values (human and non_human) and achieved an agreement of 90.2%. More information about the annotation guidelines is available in [ ].
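The two kinds of percentage agreement discussed here, agreement on the number of events per document and agreement on a single property over documents where each annotator identified exactly one event, can be sketched as follows. The data layout (a dictionary from document identifier to a list of frame dictionaries) and the function names are assumptions made for this illustration; they are not taken from the corpus tools.

```python
from typing import Any, Dict, List

Frames = Dict[str, List[Dict[str, Any]]]  # doc id -> list of event frames


def event_count_agreement(a: Frames, b: Frames) -> float:
    """Percentage of shared documents with the same number of events."""
    shared = [doc for doc in a if doc in b]
    if not shared:
        raise ValueError("no documents annotated by both annotators")
    same = sum(len(a[doc]) == len(b[doc]) for doc in shared)
    return 100.0 * same / len(shared)


def property_agreement(a: Frames, b: Frames, prop: str) -> float:
    """Percentage agreement on one property, single-event documents only."""
    comparable = [
        doc for doc in a
        if doc in b and len(a[doc]) == 1 and len(b[doc]) == 1
    ]
    if not comparable:
        raise ValueError("no documents with exactly one event per annotator")
    agreed = sum(a[doc][0].get(prop) == b[doc][0].get(prop) for doc in comparable)
    return 100.0 * agreed / len(comparable)
```

Restricting the property comparison to single-event documents sidesteps the event alignment problem entirely, at the cost of discarding documents in which either annotator found zero or several events.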
The entity properties (eg, HAS_LOCATION.PROVINCE, HAS_DISEASE) were filled by tagged entities in the text. Agreement for HAS_DISEASE was 100% and for HAS_LOCATION.COUNTRY was 97.7%.
The corpus consists of 200 documents (all in English) and their associated event frames, with documents gathered from a variety of sources (see). The largest single source was ProMed-Mail [ ], an expert-curated infectious disease reporting service. Additionally, documents were not randomly sampled, but rather selected to represent diseases and geographical areas of interest to the researchers. Major international news providers are also represented (eg, CBC, Reuters, BBC) in addition to primarily Asian or Asia-Pacific news services (eg, Vietnam-net, Thailand’s The Nation). Documents range from 45 to 1487 words long, with a mean of 305.9 words (without markup). Document selection was performed by author MC (see corpus documentation [ ] for details).
Of the 394 annotated events in the corpus, 75.4% describe human (rather than animal) disease events (see). Most of the events identified (81.5%) have been classified as present outbreaks, although historical, recent past, and hypothetical events are also represented. To show the geographical range of the documents selected, the geographical distribution of events (by country) is shown in . Note that the map does not show the actual distribution of disease events, but rather the geographical distribution of disease events in our corpus.
While we hope that the event frame may form part of the foundation for a future standard, we recognize that there are challenges in achieving this goal (see). Further, due to copyright restrictions, we are unable to distribute the corpus directly. Instead, we have provided two methods for corpus access: a download script (a Perl script that downloads and cleans all the documents from their original source on the Web and then associates them with event frames) and a graphical user interface (GUI)-based event browser (see ) [ ]. Note that as of July 2009, only 176 of the original 200 documents were still available online.
In summary, we present an annotation scheme and corpus that can be used in the evaluation of disease outbreak event extraction algorithms. The annotation scheme and corpus are presented to the research community in the belief that such resources can help in the formation of an emerging standard for this rapidly growing research area.
This work was partially funded by a Japanese Society for the Promotion of Science postdoctoral fellowship (author MC).
Conflicts of Interest
- Eysenbach G. Infodemiology and infoveillance: framework for an emerging set of public health informatics methods to analyze search, communication and publication behavior on the Internet. J Med Internet Res 2009;11(1):e11
- Keller M, Blench M, Tolentino H, Freifeld CC, Mandl KD, Mawudeku A, et al. Use of unstructured event-based reports for global infectious disease surveillance. Emerg Infect Dis 2009 May;15(5):689-695
- Public Health Agency of Canada News Releases. 2004. The Global Public Health Intelligence Network (GPHIN). URL: http://www.phac-aspc.gc.ca/media/nr-rp/2004/2004_gphin-rmispbk-eng.php [accessed 2010-08-17]
- HealthMap. URL: http://www.healthmap.org/en/ [accessed 2010-09-03]
- BioCaster. URL: http://biocaster.nii.ac.jp/ [accessed 2010-09-03]
- MedISys. URL: http://medusa.jrc.it/medisys/homeedition/all/home.html [accessed 2010-09-03]
- Pattern-based Understanding and Learning System (PULS). URL: http://puls.cs.helsinki.fi/medical/ [accessed 2010-09-03]
- EpiSPIDER. URL: http://www.epispider.org/
- BioCaster event corpus and tools. URL: http://code.google.com/p/becorpus/ [accessed 2010-09-03]
- Collier N, Doan S, Kawazoe A, Matsuda-Goodwin R, Conway M, Tateno Y, et al. BioCaster: detecting public health rumors with a Web-based text mining system. Bioinformatics 2008 Dec 15;24(24):2940-2941
- Steinberger R, Fuart F, van der Goot E, Best C, von Etter P, Yangarber R. Text mining from the web for medical intelligence. In: Fogelman-Soulié F, Perrotta D, Piskorski J, Steinberger R, editors. Mining Massive Data Sets for Security. Amsterdam, Netherlands: IOS Press; 2008:295-310.
- TREC Genomics Track. URL: http://ir.ohsu.edu/genomics/ [accessed 2010-09-03]
- Hirschman L, Park JC, Tsujii J, Wong L, Wu CH. Accomplishments and challenges in literature data mining for biology. Bioinformatics 2002 Dec;18(12):1553-1561
- Blench M. Global public health intelligence network (GPHIN). Presented at: The eighth conference of the association for machine translation in the Americas; 2008 Oct 21-25; Waikiki, Hawaii. URL: http://www.amtaweb.org/papers/4.02_Blench2008.pdf
- Freifeld CC, Mandl KD, Reis BY, Brownstein JS. HealthMap: global infectious disease monitoring through automated classification and visualization of Internet media reports. J Am Med Inform Assoc 2008 Apr;15(2):150-157
- Kawazoe A, Jin L, Shigematsu M, Barrero R, Taniguchi K, Collier N. The development of a schema for the annotation of terms in the BioCaster disease detecting/tracking system. Presented at: KR-MED 2006: Biomedical ontology in action; 2006 Nov 8; Baltimore, Maryland. URL: http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-222/krmed2006-p09.pdf
- World Health Organization. International Health Regulations. 2nd edition. Geneva, Switzerland: World Health Organization; 2005. URL: http://whqlibdoc.who.int/publications/2008/9789241580410_eng.pdf [accessed 2010-09-03]
- Wilbur WJ, Rzhetsky A, Shatkay H. New directions in biomedical text annotation: definitions, guidelines and corpus construction. BMC Bioinformatics 2006;7:356
- Carletta J. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics 1996;22(2):249-254
- Pyysalo S, Ginter F, Heimonen J, Björne J, Boberg J, Järvinen J, et al. BioInfer: a corpus for information extraction in the biomedical domain. BMC Bioinformatics 2007;8:50
- ProMED-mail. URL: http://www.promedmail.org/pls/apex/f?p=2400:1000 [accessed 2010-08-17]
Abbreviations
GPHIN: Global Public Health Intelligence Network
GUI: graphical user interface
IE: information extraction
IR: information retrieval
MUC: Message Understanding Conference
NER: named entity recognition
PULS: Pattern-based Understanding and Learning System
TREC: Text Retrieval Conference
WHO: World Health Organization
XML: extensible markup language
Edited by G Eysenbach; submitted 28.07.09; peer-reviewed by L Hirschman, H Tolentino, C Freifeld; comments to author 18.11.09; revised version received 21.02.10; accepted 12.03.10; published 28.09.10
©Mike Conway, Ai Kawazoe, Hutchatai Chanlekha, Nigel Collier. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 28.09.2010
This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.