This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
The prevalence and value of patient-generated health text are increasing, but processing such text remains problematic. Although existing biomedical natural language processing (NLP) tools are appealing, most were developed to process clinician- or researcher-generated text, such as clinical notes or journal articles. Beyond having been built for different types of text, existing NLP tools face further challenges: constantly changing technologies, source vocabularies, and characteristics of text. These continuously evolving challenges call for low-cost, systematic assessment. However, the primarily accepted evaluation method in NLP, manual annotation, requires tremendous effort and time.
The primary objective of this study is to explore an alternative approach—using low-cost, automated methods to detect failures (eg, incorrect boundaries, missed terms, mismapped concepts) when processing patient-generated text with existing biomedical NLP tools. We first characterize common failures that NLP tools can make in processing online community text. We then demonstrate the feasibility of our automated approach in detecting these common failures using one of the most popular biomedical NLP tools, MetaMap.
Using 9657 posts from an online cancer community, we explored our automated failure detection approach in two steps: (1) to characterize the failure types, we first manually reviewed MetaMap’s commonly occurring failures, grouped the inaccurate mappings into failure types, and then identified causes of the failures through iterative rounds of manual review using open coding, and (2) to automatically detect these failure types, we then explored combinations of existing NLP techniques and dictionary-based matching for each failure cause. Finally, we manually evaluated the automatically detected failures.
From our manual review, we characterized three types of failure: (1) boundary failures, (2) missed term failures, and (3) word sense ambiguity failures. Within these three failure types, we discovered 12 causes of inaccurate mappings of concepts. Our automated methods flagged almost half of MetaMap’s 383,572 mappings as problematic. Word sense ambiguity failures were the most widely occurring, comprising 82.22% of failures. Boundary failures were the second most frequent, amounting to 15.90% of failures, while missed term failures were the least common, making up 1.88% of failures. The automated failure detection achieved precision, recall, accuracy, and F1 score of 83.00%, 92.57%, 88.17%, and 87.52%, respectively.
We illustrate the challenges of processing patient-generated online health community text and characterize failures of NLP tools on this patient-generated health text, demonstrating the feasibility of our low-cost approach to automatically detect those failures. Our approach shows the potential for scalable and effective solutions to automatically assess the constantly evolving NLP tools and source vocabularies to process patient-generated text.
The Internet pervades our everyday life, including health care [
One scalable approach to process text-based patient-generated data is natural language processing (NLP). An increasing number of researchers studying patient-generated text, such as in online health communities, have used statistical methods based on manually annotated datasets [
Existing biomedical NLP tools have the potential to be used immediately and promise to provide greater generalizability than statistical approaches while providing semantic connections. Researchers have developed various NLP techniques and applications in the biomedical domain. For example, the Clinical Text Analysis and Knowledge Extraction System (cTakes) [
One of the most widely regarded NLP applications in biomedicine is MetaMap [
However, MetaMap and many other traditional biomedical NLP tools were developed to process biomedical literature and clinical notes, rather than patient-generated text in online communities. One of the biggest challenges in applying these biomedical NLP tools to a different type of text is the difference in vocabulary. For example, Zeng et al recognize differences in the vocabulary used by patients and clinicians [
Recognizing the differences in vocabulary, a number of efforts to expand the UMLS to include patient-generated text have been reported [
The effort to process patient-generated text, such as email [
These prior studies show that many have worked to improve biomedical NLP tools to process patient-generated text. As NLP technologies and source vocabularies continue to evolve, we need easy, low-cost methods to systematically assess the performance of those tools. Traditionally in NLP, evaluations involve a great deal of manual effort, such as creating a manually annotated dataset. Moreover, a new evaluation for different types of text requires additional annotated datasets, thus maintenance can often be difficult. Recognizing the potential benefits of performing a low-cost assessment of NLP tools, we explore automated methods to detect failures without producing annotated datasets. Given MetaMap’s long history of use in biomedical contexts, its configurability, and its scalability, we apply our failure detection tool to MetaMap in processing patient-generated text from an online cancer community to demonstrate the feasibility of automatically detecting occurrence of failures. We first present the dataset and MetaMap configuration, followed by the specific methods and results for (1) characterizing failure types, (2) automated failure detection, and (3) manual performance evaluation of our automated failure detection approach.
Our dataset consists of community posts from the CancerConnect website, an online cancer community for cancer patients, their families, friends, and caregivers to exchange support and advice. The dataset consists of a total of 2010 unique user members and 9657 user member–generated posts from March 2010 to January 2013.
We processed the online community posts with MetaMap version 2011AA, configured to use its word sense disambiguation feature, and included only the top-ranked concept from the output. In the default setting, MetaMap suggests a number of candidate concepts with candidate scores indicating relationships among concepts found in the text. However, in real-world usage on large amounts of text, considering multiple suggestions for each processed term could be overwhelming to assess manually. Thus, we assessed only the top-scoring concept to simulate how MetaMap would be used in real-world settings. For generalizability, we used default settings for all other options. A single mapped term/concept served as the unit of analysis.
To characterize the types of failures, we assessed MetaMap’s output collaboratively through iterative rounds of manual review among the five authors. We reviewed the output following an open coding process [
From our manual review, we characterized three types of failure: (1) boundary failures, (2) missed term failures, and (3) word sense ambiguity failures. A boundary failure occurred when a single coherent term was incorrectly parsed into multiple incomplete terms. A missed term failure occurred when a relevant term had not been identified. A word sense ambiguity failure occurred when a relevant term was mapped to a wrong concept. Within these three failure types, we discovered 12 causes of failures. In the sections below, we describe each type of failure and then identify potential causes within each failure type.
Boundary failures, in which a single coherent term is incorrectly parsed into multiple terms, are well documented in biomedical NLP literature [
Our patient-generated text contained extensive descriptive phrases (eg, “feeling great”) and colloquial language (eg, “chemo brain”), contrasting with typical biomedical text that usually contained concepts from standard terminologies. Theoretically, boundary failures can result from standard medical terminologies. However, descriptive phrases and colloquial language highlight the parsing problem of biomedical NLP because colloquial language and descriptive phrases that patients use in online health communities cannot all be included in the UMLS. For instance, UMLS included “feeling sick” as a synonym of a concept, although a similar descriptive phrase “feeling great” was not included in the UMLS. Consequently in our analysis, “feeling sick” was recognized as one concept, while “feeling great” was parsed into two separate terms “Emotions” and “Large” delivering different interpretations than intended.
Boundary failures also occurred even when proper concepts were available in the UMLS. For instance, the colloquial term “chemo brain” was commonly used to describe the single concept of cognitive deterioration of cancer patients after chemotherapy. In our analysis, the term was recognized as two UMLS concepts, “chemotherapy” and “brain-body part”, even though the UMLS contained a concept for “chemo brain”. From our experience, we inferred that boundary failures in patient-generated text had two causes: the lack of concepts for colloquial language and descriptive phrases in the UMLS, and a parser built for standard medical terminologies.
Missed term failures occurred when a relevant term was not identified [
Community-specific nomenclature refers to members of a community using terms that either are commonly used in a different way elsewhere or not commonly used at all. In online communities, members frequently create their own nomenclature that, over time, can become vernacular that is well understood in the community [
In particular, community nomenclature regularly referred to relevant health-related content but resulted in three major challenges. First, many of the community-specific terms were not found in the UMLS. For instance, “PC” referred to “Prostate Cancer”; however, this acronym was not contained in UMLS. Second, community nomenclature was typically context and community-specific. For instance, the acronym “BC” was used for “before cancer”, “blood count”, or “breast cancer” depending on the context. This type of ambiguous usage was also seen with commonly accepted abbreviations. For instance, “rad” was a common abbreviation for “radiation therapy” in the cancer community, but “rad” could also be used for “radiation absorbed dose”, “reactive airway disease”, “reactive attachment disorder”, or “RRAD gene” depending on the community. Third, novel abbreviations and acronyms constantly showed up in our data, similar to what researchers of online communities found [
Previous research showed that patients made more medically related misspellings at a significantly higher rate compared to clinicians [
The most prevalent failure was word sense ambiguity, which occurred when a term was mapped to the wrong concept because the two concepts are spelled the same way, share the same acronym (eg, “apt”, an acronym used for appointment was mapped to organic chemical “4-azido-7-phenylpyrazolo-(1,5a)-1,3,5-triazine”), or were spelled the same as one of their acronyms (eg, a verb “aids” was mapped to “Acquired Immunodeficiency Syndrome”). This failure had been identified in previous research [
Frequent use of standard abbreviations and contractions was common in our online health communities. Online community members frequently used contractions such as “I’d” or abbreviations such as “i.e.” in their text. Although the use of these shortened forms was common in informal text, it could be a source of errors for many NLP tools. For example, MetaMap mapped “I’d” to “Incision and drainage” and “i.e.” to “Internal-External Locus of Control Scale” due to partial matches with synonyms. MetaMap was also inconsistent in mapping abbreviations. For instance, abbreviations for some US states were mapped correctly (eg, “AK” and “WA”), whereas others were often missed even though they were in the UMLS (eg, “CA” and “FL”) or were mismapped (eg, Virginia was mapped to “Alveolar gas volume” when written as “V.A.”).
Colloquial language, such as “hi”, was prevalent in our dataset and caused many failures. Although these terms are obvious to human readers, we found they were often mapped to incorrect terms in the UMLS. For instance, our previous example “hi”, rather than being left unmapped, was mapped to “Hawaii”, “ABCC8 gene”, or “AKAP4 gene” because “hi” was a synonym for all three concepts. In our analysis, this failure was found with many semantic types; however, terms mapped to the semantic type of “Gene or Genome” were particularly troublesome because of their unusual naming conventions.
Our online community posts often contained numbers that convey important information, such as a patient’s disease status (eg, “stage 3 breast cancer”). Other times, numbers conveyed more logistical information, such as time of day and dates, which were misinterpreted. For instance, in the phrase, “I got there at 4:12pm”, “12pm” was mapped to “Maxillary left first premolar mesial prosthesis” because it was a complete match for one of its synonyms in the UMLS. Numbers that were used to convey diagnostic information were crucial for the identity of many community members, and such information was often included in an automated signature line (eg, “stage 2 grade 3 triple negative breast cancer”) at the end of posts. Numbers indicating dates and times often resulted in false positives, whereas health status numbers often resulted in a different failure type (ie, boundary failure caused by splitting a phrase). We saw this type of failure across many different semantic types, including “Amino Acid, Peptide, or Protein”, “Finding”, “Gene or Genome”, “Intellectual Product”, “Medical Device”, “Quantitative Concept”, and “Research Activity”.
Online community members frequently mentioned URLs and email addresses in our dataset. They often pointed to websites that they found useful and gave out email addresses to start private conversations. Parts of email addresses and URLs were incorrectly mapped in our analysis. For instance, “net” at the end of an email address was often mapped to the “SPINK5 gene” because one of its synonyms was “nets”. Also, “en”, a language code that referred to the English language in URLs, was incorrectly mapped to “NT5E gene” because one of its synonyms was “eN”.
Internet slang and SMS language, such as “LOL” (ie, “laugh out loud” or “lots of love”) or “XOXO” (ie, hugs and kisses) are highly prevalent in online community text but not in typical biomedical texts. Although these terms should be obvious to human readers, our analysis showed that Internet slang and SMS language were often mapped to incorrect biomedical terms in the UMLS. In particular, Internet slang and SMS language were often mistaken for gene names, such as the mapping of “LOL” to the LOX1 gene and “XO” to the XDH gene. To manage the different variations of concepts, the UMLS included many synonyms of terms. Varieties of these synonyms overlapped with commonly used Internet slang and SMS language resulting in word sense ambiguity failure.
The use of names is also prevalent in online community posts, particularly when posts address specific individuals. Community members also often include their first names in a signature line and call out other members by first names or community handles. In our analysis, common first names were often mistaken for UMLS concepts, such as “Meg” being mistaken for “megestrol”, “Rebecca” for “becatecarin”, “Don” for “Diazooxonorleucine”, and “Candy” for “candy dosage form”. Each individual name was a complete match for one of the UMLS concepts. We identified these mismatches across multiple semantic types, including “Antibiotic”, “Biomedical or Dental Material”, “Clinical Attribute”, “Diagnostic Procedure”, “Disease or Syndrome”, “Finding”, “Hormone”, “Injury or Poisoning”, “Laboratory Procedure”, “Mental Process”, “Pathologic Function”, “Pharmacologic Substance”, and “Sign or Symptom”.
Patients share a wide variety of personal experiences in narrative form in online health communities. Thus, the use of the pronoun “I” is prevalent in community posts but is a source of misinterpretation. For example, over the course of the study we discovered that “I” is typically mapped to either “Blood group antibody I” or “Iodides”, which belong to “Amino Acid, Peptide, or Protein”, “Immunologic Factor”, or “Inorganic Chemical” semantic types.
One of the most fundamental components of NLP tools is a part-of-speech (POS) tagger, which marks up words with their corresponding POS (eg, verb, noun, preposition) in a phrase, sentence, or paragraph. POS taggers are commonly used in NLP and have many different applications, such as phrase parsers. In our analysis, we discovered that MetaMap uses a POS tagger called MedPost SKR (Semantic Knowledge Representation) [
Two great strengths of the UMLS are its broad coverage of concepts and its capacity to distinguish among concepts in fine detail. This ability to provide the precise meaning of concepts is valuable for many applications. However, this feature also became a source for inconsistent mappings despite similar usage of terms in our analysis. For instance, the term “stage” was mapped to multiple concepts in our dataset. Community members often used the term “stage” to describe their cancer status (eg,
Word sense ambiguity failures: inconsistent mappings of stage by MetaMap.
Sample sentence | Mapped term | UMLS concept | Concept unique identifiers | UMLS semantic type |
“My father was diagnosed with stage 2b pancreatic cancer” | stage 2b | Stage 2B | C0441769 | Classification |
“I'm stage 4 SLL and stage 2 CLL” | stage | Tumor stage | C1300072 | Clinical attribute |
“I was dx last year at age 46 with Stage 1” | Stage 1 | Stage level 1 | C0441766 | Intellectual product |
“Almost seven years ago I was diagnosed with stage 1 breast cancer at age 36 ½” | Stage breast cancer | malignant neoplasm of breast staging | C2216702 | Neoplastic process |
“My friend was just diagnosed with Stage IV cancer” | stage | Stage | C1306673 | Qualitative concept |
“My mom was diagnosed 11/07 with stage IV inoperable EC” | stage | Phase | C0205390 | Temporal concept |
To explore automated methods for detecting the three types of failures we identified, we created a tool that applies combinations of dictionary-based matching [
Our tool detected failures caused by incorrectly splitting a phrase through a comparison of MetaMap’s MedPost SKR parser [
Examples of splitting a phrase failure.
Sample sentence | Ideally mapped UMLS concept | First mapped term (UMLS concept name) | Second mapped term (UMLS concept name) |
“My mom had unknown primary and it was a PET scan that helped them find the primary.” | PET/CT scan | PET (Pet Animal) | Scan (Radionuclide Imaging) |
“It was removed and I have had stereotactic treatment along with 6 rounds of Taxol/Carbo completed in January 2012.” [sic] | Stereotactic Radiation Treatment | Stereotactic (Stereotactic) | Treatment (Therapeutic Aspects) |
“Had 25 internal rad treatments (along with cisplatin on day 1 and 25).” [sic] | Therapeutic Radiology Procedure | Rad (Radiation Absorbed Dose) | Treatments (Therapeutic Procedure) |
“I am Triple Negative BC and there are no follow-up treatments for us TN's.” | Triple Negative Breast Neoplasms | Triple (Triplicate) | Negative (Negative) |
“My doc thinks I will probably end up having a double mastectomy” | None available | Double (Double Value Type) | Mastectomy (Mastectomy) |
“I thought after 9 months my hair would be back but I have grown some type of hair that I am told is ‘chemo curls’.” | None available | Chemo (Chemotherapy Regimen) | Curls (Early Endosome) |
We identified two causes of missed term failures associated with processing patient-generated text. The following sections describe automatic detection of missed terms, specifically due to community-specific nomenclature and misspellings.
Our tool detected missed terms due to abbreviations and acronyms in four steps. First, it ran MetaMap on the original text and then counted the total number of mappings. Second, it extracted common abbreviations and acronyms and their definitions using a simple rule-based algorithm [
The simple rule-based algorithm by Schwartz and Hearst [
Our tool detected the prevalence of missed terms due to misspelling using three steps. First, it ran MetaMap on the original text and counted the total number of mappings. Second, it ran MetaMap again after correcting possible misspellings using Google’s query suggestion service [
We identified nine causes of word sense ambiguity failure associated with processing patient-generated text. In the following sections, we describe how to automatically detect the word sense ambiguity failures.
To detect word sense ambiguity failures due to abbreviations and contractions, we used an NLP tool called the Stanford POS Tagger [
Detecting word sense ambiguity failure caused by colloquial language is particularly challenging. We identified many of these failures by narrowing our focus to consider only the “gene or genome” semantic type because colloquial language failures were frequently mapped to this semantic type. Our tool automatically detected improperly mapped colloquial language by using an existing cancer gene dictionary—a list of genes known to be associated with cancer [
To automatically detect improperly mapped dates and times, we implemented a number of rule-based regular expressions to detect times and dates that were not mapped as “Quantitative Concept” semantic type concepts. “Quantitative Concept” is the most appropriate semantic type based on how patients typically used numbers in our dataset. This resulted in counting the numbers mapped to “Amino Acid, Peptide, or Protein”, “Finding”, “Gene or Genome”, “Intellectual Product”, “Medical Device”, and “Research Activity”.
In our approach, we recognized two types of date or time expression that are problematic for MetaMap. The first type was a time expression containing the term “pm”. The second type was a string of numbers that has been typically used to describe age, date, or time duration. For instance, “3/4” indicating March fourth was mapped to a concept describing distance vision: concept unique identifier (CUI) C0442757. We used specific regular expressions that focused on numbers with “am” or “pm”, as well as a string of numbers with or without non-alphanumeric characters in between numbers to identify dates, times, and other numbers that do not indicate disease status.
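The two kinds of expression described above can be sketched with a pair of regular expressions. This is a minimal illustration only: the article does not publish its exact patterns, so both patterns, the function names, and the sample semantic types in the usage lines are assumptions.

```python
import re

# Assumed pattern 1: a number (optionally h:mm) followed by "am" or "pm".
TIME_WITH_MERIDIEM = re.compile(r"\d{1,2}(?::\d{2})?\s?(?:am|pm)", re.IGNORECASE)

# Assumed pattern 2: a string of numbers, with or without a single
# non-alphanumeric character between consecutive numbers (eg, "3/4", "11/07").
NUMBER_STRING = re.compile(r"\d+(?:[^A-Za-z0-9\s]\d+)*")


def is_date_or_time(term):
    """Return True if the whole mapped term looks like a time, date, or bare number."""
    return bool(TIME_WITH_MERIDIEM.fullmatch(term) or NUMBER_STRING.fullmatch(term))


def flag_mismapped_numbers(mappings):
    """Keep (term, semantic_type) pairs where a date/time-like term was
    mapped to anything other than "Quantitative Concept"."""
    return [
        (term, semantic_type)
        for term, semantic_type in mappings
        if is_date_or_time(term) and semantic_type != "Quantitative Concept"
    ]


# Illustrative mappings; the semantic types here are made up for the example.
sample = [
    ("12pm", "Medical Device"),
    ("3/4", "Quantitative Concept"),
    ("chemo", "Finding"),
]
print(flag_mismapped_numbers(sample))
```

Because the patterns must match the whole mapped term, a phrase such as “stage 3” is not flagged, which keeps disease-status numbers out of this check.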
To detect failures involving email addresses and URLs, we used regular expressions to identify all email addresses and URLs and then counted the number of terms that were mapped from within them. In our approach, we used specific regular expressions matching “@” and a typical structure of domain name (ie, a dot character followed by 2-6 alphabetic or dot characters) for identifying email addresses and “http” or a typical structure of domain name for identifying URLs.
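A minimal sketch of this step follows, assuming that each mapping carries character offsets into the post. The exact expressions are not given in the article, so the pattern, the helper names, and the `(start, end, concept)` tuple layout are all hypothetical.

```python
import re

# Assumed domain tail: a dot followed by 2-6 alphabetic or dot characters,
# optionally followed by a path, at the end of the token.
DOMAIN_TAIL = re.compile(r"\.[A-Za-z.]{2,6}(?:/\S*)?$")
EDGE_PUNCTUATION = ".,;:!?()<>\"'"


def classify_token(token):
    """Label a whitespace-delimited token as "email", "url", or None."""
    token = token.strip(EDGE_PUNCTUATION)
    if "@" in token and DOMAIN_TAIL.search(token):
        return "email"
    if token.lower().startswith("http") or DOMAIN_TAIL.search(token):
        return "url"
    return None


def email_and_url_spans(text):
    """Return (start, end, kind) for every token classified as email or URL."""
    spans = []
    for match in re.finditer(r"\S+", text):
        kind = classify_token(match.group())
        if kind is not None:
            spans.append((match.start(), match.end(), kind))
    return spans


def count_mappings_inside(text, mappings):
    """Count mappings whose character span lies inside an email address or URL.

    mappings: iterable of (start, end, concept) tuples (hypothetical layout).
    """
    spans = email_and_url_spans(text)
    return sum(
        1
        for start, end, _concept in mappings
        if any(s <= start and end <= e for s, e, _kind in spans)
    )


text = "Write to jane@example.net or visit http://example.org/en/help today"
net = text.index("net")
en = text.index("/en/") + 1
mappings = [(net, net + 3, "SPINK5 gene"), (en, en + 2, "NT5E gene")]
print(count_mappings_inside(text, mappings))
```

A domain pattern this loose will also match some ordinary tokens that contain a dot, so a stricter expression may be preferable where false positives matter.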
We detected improperly mapped Internet slang and SMS language using a 3-step process. First, we identified an Internet dictionary with a list of chat acronyms and text shorthand [
To identify improperly mapped names, we first combined a number of name dictionaries that consist of first names [
We identified a number of cases where the pronoun “I” was improperly assumed to be an abbreviation, such as for Iodine, because the NLP tool did not consider the contextual knowledge from the term’s POS. One of the most fundamental components of NLP tools is a POS tagger, which marks up words with their corresponding POS (eg, verb, noun, preposition) in a phrase, sentence, or paragraph. “I” as an abbreviation for Iodine should be recognized as a noun, whereas “I” meaning the individual should be recognized as a pronoun by a POS tagger. Our tool used data derived from the Stanford POS Tagger [
To identify the improperly mapped terms without discriminating between verbs and nouns, we used POS information from the Stanford POS Tagger [
The word sense ambiguity failures described up to this point were cases where terms were consistently mapped improperly. For other word sense ambiguity failures, however, MetaMap mapped the same term inconsistently, sometimes correctly and sometimes incorrectly. The inconsistency resulted from poor performance by MetaMap’s word sense disambiguation feature, which was designed to select the best matching concepts out of many candidate concepts available in the UMLS. We detected inconsistent mappings by assuming that (1) patients used terms consistently, and (2) MetaMap selected the best matching concept the majority of the time. For instance, in our online cancer community dataset, we assumed that patients always used the term “blood test” to convey the “Hematologic Tests” concept (CUI: C0018941), which was how MetaMap interpreted this term two thirds of the time, rather than the less frequent mapping to the “Blood test device” concept (CUI: C0994779). Based on these assumptions, we detected inconsistent mappings in two steps. First, we created a term frequency table based on a term’s spelling and its CUI. Second, assuming the most frequently mapped CUI was the correct concept, we counted the number of cases where the term was mapped to less frequent CUIs.
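The two steps can be sketched as follows. This is an illustration under stated assumptions rather than the tool itself: the function name is hypothetical, terms are lowercased before counting (the article says only that the table is keyed on spelling and CUI), and ties between equally frequent CUIs are broken arbitrarily.

```python
from collections import Counter, defaultdict


def count_inconsistent_mappings(mappings):
    """mappings: iterable of (term, cui) pairs.

    Returns (majority, flagged), where majority maps each term to its most
    frequent CUI and flagged is the number of mappings to any other CUI.
    """
    # Step 1: term frequency table keyed on spelling, then on CUI.
    table = defaultdict(Counter)
    for term, cui in mappings:
        table[term.lower()][cui] += 1  # lowercasing is an assumption

    # Step 2: treat the most frequent CUI as correct and count the rest.
    majority = {}
    flagged = 0
    for term, cui_counts in table.items():
        top_cui, top_count = cui_counts.most_common(1)[0]
        majority[term] = top_cui
        flagged += sum(cui_counts.values()) - top_count
    return majority, flagged


# "blood test" mapped to Hematologic Tests two thirds of the time.
sample = [("blood test", "C0018941")] * 4 + [("blood test", "C0994779")] * 2
majority, flagged = count_inconsistent_mappings(sample)
print(majority["blood test"], flagged)
```

The count is only as good as the majority assumption: if MetaMap picks the wrong concept most of the time for a term, the correct minority mappings are the ones flagged.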
The automated methods detected that at least 49.12% (188,411/383,572) of MetaMap’s mappings for our dataset were problematic. Word sense ambiguity failures were the most widely occurring, comprising 82.22% among the total detected failures. Boundary failures were the second most frequent, amounting to 15.90% among the total detected failures, while missed term failures were the least common, making up 1.88% of the detected failures.
We found that word sense ambiguity failures were not mutually exclusive, and several cases had multiple causes. Thus, in
We manually evaluated the performance of our failure detection tool in two parts: overall performance evaluation and individual component level performance evaluation.
Detecting MetaMap’s failures on processing patient-generated text.
Failure type | Causes of failure | Count | Percentage of failure, % |
1. Boundary failures | 1.1 Splitting a phrase | 29,965 | 15.90 |
2. Missed term failures | 2.1 Community specific nomenclatures | 1167 | 0.62 |
2. Missed term failures | 2.2 Misspellings | 2375 | 1.26 |
3. Word sense ambiguity failures | 3.1 Abbreviations and contractions | 416 | 0.22 |
3. Word sense ambiguity failures | 3.2 Colloquial language | 4162 | 2.21 |
3. Word sense ambiguity failures | 3.3 Numbers | 143 | 0.08 |
3. Word sense ambiguity failures | 3.4 Email addresses and URLs | 1448 | 0.77 |
3. Word sense ambiguity failures | 3.5 Internet slang and SMS language | 3442 | 1.83 |
3. Word sense ambiguity failures | 3.6 Names | 10,061 | 5.34 |
3. Word sense ambiguity failures | 3.7 Narrative style of pronoun “I” | 61,119 | 32.44 |
3. Word sense ambiguity failures | 3.8 Mismapped verbs | 51,193 | 27.17 |
3. Word sense ambiguity failures | 3.9 Inconsistent mappings | 29,308 | 15.56 |
3. Word sense ambiguity failures | Total number of unique word sense ambiguity failures | 154,904 | 82.22 |
Total | Total number of unique failures | 188,411 | |
We randomly selected 50 cases (ie, mappings) that our tool identified as incorrect mappings from each of the 12 causes of failures, totaling 600 cases that served as positive cases. We then randomly selected another 600 cases from the rest of the mappings not detected as incorrect mappings according to our tool to serve as the negative cases. We then mixed up the selected 1200 cases and manually assessed the accuracy of mappings through a blind procedure.
We also measured individual performance on each of the 12 detection techniques. We used the previously selected 600 negative cases and each individual technique’s 50 positive cases to assess the performance. For boundary failure, we examined whether the mapped terms could deliver precise conceptual meaning independent of additional phrases. For missed term failure, we investigated whether the tool had accurately corrected the spellings and verified the results of the new mappings. For word sense ambiguity failures, we examined whether MetaMap appropriately mapped terms based on the rest of the context. The unit of analysis was a single mapping, and we evaluated our results using precision, recall, accuracy, and F1 score. Precision measures the proportion of predicted positive instances that are correct. Recall measures the proportion of positive instances that were predicted. Accuracy measures the percentage of correctly predicted instances among the total number of instances examined. F1 score is the harmonic mean of precision and recall, reflecting both performance and balance. In all measures, higher scores reflect better performance.
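For readers who want the four measures in computable form, the sketch below derives them from confusion-matrix counts. The function name is hypothetical, and the counts in the usage lines are illustrative: the article does not report them, and they were chosen only so that 600 flagged and 600 unflagged cases yield the overall percentages given in this article.

```python
def evaluate(tp, fp, fn, tn):
    """Return precision, recall, accuracy, and F1 (as fractions) from
    true positive, false positive, false negative, and true negative counts."""
    precision = tp / (tp + fp)          # correct among predicted positives
    recall = tp / (tp + fn)             # predicted among actual positives
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, accuracy, f1


# Illustrative counts (not reported in the article): 600 flagged mappings
# (tp + fp) and 600 unflagged mappings (fn + tn).
scores = evaluate(tp=498, fp=102, fn=40, tn=560)
print([round(100 * s, 2) for s in scores])
```

The function assumes that at least one case was predicted positive and at least one case is actually positive; otherwise the divisions are undefined.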
Performance (in %) of automatic failure detection and its individual component.
Failure type | Causes of failure | Precision | Recall | Accuracy | F1 score |
1. Boundary failures | 1.1 Splitting a phrase | 82.00 | 78.85 | 96.78 | 80.39 |
2. Missed term failures | 2.1 Community specific nomenclatures | 88.00 | 100.00 | 99.02 | 93.62 |
2. Missed term failures | 2.2 Misspellings | 80.00 | 93.02 | 97.88 | 86.02 |
3. Word sense ambiguity failures | 3.1 Abbreviations and contractions | 82.00 | 95.35 | 98.20 | 88.17 |
3. Word sense ambiguity failures | 3.2 Colloquial language | 100.00 | 100.00 | 100.00 | 100.00 |
3. Word sense ambiguity failures | 3.3 Numbers | 100.00 | 100.00 | 100.00 | 100.00 |
3. Word sense ambiguity failures | 3.4 Email addresses and URLs | 100.00 | 100.00 | 100.00 | 100.00 |
3. Word sense ambiguity failures | 3.5 Internet slang and SMS language | 100.00 | 100.00 | 100.00 | 100.00 |
3. Word sense ambiguity failures | 3.6 Names | 66.00 | 100.00 | 97.21 | 79.52 |
3. Word sense ambiguity failures | 3.7 Narrative style of pronoun “I” | 100.00 | 100.00 | 100.00 | 100.00 |
3. Word sense ambiguity failures | 3.8 Mismapped verbs | 32.00 | 100.00 | 94.43 | 48.48 |
3. Word sense ambiguity failures | 3.9 Inconsistent mappings | 66.00 | 53.23 | 92.80 | 58.93 |
Total | | 83.00 | 92.57 | 88.17 | 87.52 |
Our automatic failure detection tool identified 15.90% of the total failures as due to splitting a phrase. The performance evaluation of this task achieved precision, recall, accuracy, and F1 score of 82.00%, 78.85%, 96.78%, and 80.39%, respectively. It is important to note that a single concept can produce multiple split phrase failures. For instance, the phrase “stage 4 Melanoma” was mapped to three concepts: “stage”, “4”, and “Melanoma”. Two boundary failures occurred in this phrase. The first failure occurred between “stage” and “4”; the second failure occurred between “4” and “Melanoma”. By focusing on a pair of mapped terms at a time, we correctly identified two failures that occurred in the phrase “stage 4 Melanoma”. We considered only adjacent paired mappings because splitting a single coherent phrase into two or more UMLS concepts was clearly a more significant problem. However, split phrase failures could occur in non-paired mappings as well, and we are underestimating the prevalence of split phrases.
Less than 1% of failures were due to community-specific nomenclature, and the automatic detection system achieved precision, recall, accuracy, and F1 score of 88.00%, 100.00%, 99.02%, and 93.62%, respectively. It should be noted that we underestimated the number of missed terms because the algorithm [
We automatically assessed that misspellings were responsible for 1.26% of failures. However, we observed a few cases of incorrect assessment due to failures of Google’s query suggestion service. For instance, some medications were incorrectly recommended. “Donesaub”, a misspelling of “Denosumab”, was mapped to “dinosaur”. Furthermore, even with a correct recommendation, MetaMap did not always map to the right concept. For instance, “Wsihng” was correctly recommended to be “Wishing”, but MetaMap mapped it to “NCKIPSD gene”. Despite a few cases of incorrect assessment, the misspelling component performed relatively well, achieving precision, recall, accuracy, and F1 score of 80.00%, 93.02%, 97.88%, and 86.02%, respectively.
Improperly mapped abbreviations comprised less than 1% of failures. Although such failures were infrequent, the automatic detection system performed relatively well, achieving precision, recall, accuracy, and F1 score of 82.00%, 95.35%, 98.20%, and 88.17%, respectively.
Incorrectly mapped “gene or genome” semantic types comprised 2.21% of failures, and the automatic detection system achieved precision, recall, accuracy, and F1 score of 100.00%, 100.00%, 100.00%, and 100.00%, respectively. With this process, we also detected terms like “lord” and “wish”, which may not be perceived as colloquial language but were nevertheless improperly mapped to the “gene or genome” semantic type. It is also important to note that different disease-specific communities should utilize different gene dictionaries.
Our automatic failure detection tool identified less than 1% of failures as improperly mapped numbers. In the performance evaluation, this task achieved precision, recall, accuracy, and F1 score of 100.00%, 100.00%, 100.00%, and 100.00%, respectively. However, we are underestimating the prevalence of this failure: MetaMap improperly mapped more than half of the “Quantitative Concept” semantic type concepts in our dataset, but because a few cases were correctly mapped, we did not include this semantic type in the detection.
Improperly mapped email addresses or URLs comprised less than 1% of failures, and the automatic detection system achieved precision, recall, accuracy, and F1 score of 100.00%, 100.00%, 100.00%, and 100.00%, respectively. It is important to note that the basis for our manual assessments was how patients had intended to use the term. For instance, MetaMap mapped “org” at the end of a URL to “Professional Organization or Group” concept. Although assessment of such cases can be subjective, we followed the basic rule of reflecting patients’ intentions.
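As an illustration of how a mapping such as “org” could be recognized as part of a URL rather than a standalone word, the following Python sketch checks whether a mapped character span falls inside a URL or email address. The regular expressions, the function name `inside_url_or_email`, and the example post are all assumptions for illustration; they are deliberately simple and are not the patterns used in the study.

```python
import re

# Deliberately simple, hypothetical patterns; real posts may need stricter ones.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def inside_url_or_email(text: str, start: int, end: int) -> bool:
    """True if the character span [start, end) lies within a URL or email."""
    for pattern in (EMAIL_RE, URL_RE):
        for match in pattern.finditer(text):
            if match.start() <= start and end <= match.end():
                return True
    return False


post = "See http://www.example.org for details, or join an org near you."
first = post.index("org")               # the "org" at the end of the URL
second = post.index("org", first + 1)   # the standalone word "org"
print(inside_url_or_email(post, first, first + 3))    # True
print(inside_url_or_email(post, second, second + 3))  # False
```

Under this sketch, only the first “org” would be flagged as an improperly mapped URL fragment; the second would be left for the mapper to interpret.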
A total of 1.83% of failures resulted from Internet slang and SMS language terms. Like other dictionary-based matching techniques, our automatic detection system performed relatively well, accomplishing precision, recall, accuracy, and F1 score of 100.00%, 100.00%, 100.00%, and 100.00%, respectively.
We automatically assessed that names accounted for 5.34% of failures. However, the name dictionary matching did not perform as well as other dictionary-based matching components. We discovered that unique but popular names, such as “Sunday”, “Faith”, and “Hope” were incorrectly mapped when used as nouns in a sentence. The name dictionary component achieved precision, recall, accuracy, and F1 score of 66.00%, 100.00%, 97.21%, and 79.52%, respectively.
We found that 32.44% of failures resulted from the pronoun “I”. Although the use of the pronoun “I” could be considered a part of colloquial language
We automatically assessed that mismapped verbs accounted for 27.17% of failures; however, the mismapped verb detection component performed poorly, achieving precision, recall, accuracy, and F1 score of 32.00%, 100.00%, 94.43%, and 48.48%, respectively. We discovered that although the Stanford POS Tagger identified verbs correctly, we had made the false assumption that verbs did not belong to the entity part of the UMLS ontology. In fact, verbs like “lost” and “wait” belong to the “Functional Concept” semantic type, which is under the entity part of the UMLS tree. Thus, the mismapped verb detection component of our automatic failure detection tool incorrectly identified such verbs as failures.
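The effect of that assumption can be sketched in a few lines of Python. Everything here is hypothetical: the function name `flag_mismapped_verb`, the two-entry set of semantic types standing in for the entity branch of the UMLS tree, and the hand-written example mappings. Penn Treebank verb tags (those starting with “VB”) are assumed as the tagger output.

```python
# Hypothetical, hand-picked subset of semantic types under the UMLS "Entity" branch.
ENTITY_SEMANTIC_TYPES = {"Functional Concept", "Gene or Genome"}


def flag_mismapped_verb(pos_tag: str, semantic_type: str) -> bool:
    """Flawed rule: any verb mapped to an Entity-branch type counts as a failure."""
    return pos_tag.startswith("VB") and semantic_type in ENTITY_SEMANTIC_TYPES


# (term, POS tag, semantic type of the mapped concept); illustrative values only.
mappings = [
    ("go", "VB", "Gene or Genome"),         # genuine failure ("GORAB gene")
    ("lost", "VBD", "Functional Concept"),  # plausible mapping, still flagged
    ("wait", "VB", "Functional Concept"),   # plausible mapping, still flagged
]

for term, pos_tag, semantic_type in mappings:
    print(term, flag_mismapped_verb(pos_tag, semantic_type))
```

All three mappings are flagged, so every genuine failure is caught but legitimate “Functional Concept” verbs are flagged too, which is consistent with the pattern of perfect recall and low precision reported above.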
Our automatic failure detection tool identified 15.56% of the total failures as due to inconsistent mappings. In the performance evaluation, this task achieved precision, recall, accuracy, and F1 score of 66.00%, 53.23%, 92.80%, and 58.93%, respectively. We found two reasons for the relatively low precision. First, we did not account for cases where the most commonly mapped concept is not the correct mapping. For instance, in our dataset “radiation” was mapped to “radiotherapy research” (CUI: C1524021) two-thirds of the time when community members actually meant “therapeutic radiology procedure” (CUI: C1522449). As a result, our tool incorrectly treated the less frequent mappings as failures even when they were accurate. Second, we missed cases in which a correct mapping does not exist. For instance, the verb “go” was incorrectly but consistently mapped as “GORAB gene”. In our automated failure detection analysis, our tool overlooked terms like “go” that were consistently mismapped.
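Both limitations follow from treating the most frequent mapping of a term as the correct one. The Python sketch below assumes a simple majority rule of that kind; the function name `flag_minority_mappings` and the toy counts are hypothetical, chosen only to mirror the “radiation” and “go” examples above.

```python
from collections import Counter, defaultdict


def flag_minority_mappings(mappings):
    """Return (term, concept) pairs whose concept is not the term's most frequent one."""
    by_term = defaultdict(Counter)
    for term, concept in mappings:
        by_term[term][concept] += 1
    flagged = set()
    for term, counts in by_term.items():
        majority, _ = counts.most_common(1)[0]
        flagged.update((term, c) for c in counts if c != majority)
    return flagged


observed = (
    [("radiation", "C1524021")] * 2  # "radiotherapy research", two-thirds of the time
    + [("radiation", "C1522449")]    # "therapeutic radiology procedure" (intended)
    + [("go", "GORAB gene")] * 3     # consistently mapped, consistently wrong
)
print(flag_minority_mappings(observed))  # {('radiation', 'C1522449')}
```

The rule flags the intended concept for “radiation” (a false positive) and flags nothing for “go” (a missed failure), matching the two reasons given above.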
We characterized (1) boundary failures, (2) missed term failures, and (3) word ambiguity failures and discovered 12 causes for these failures in our manual review. We then used automated methods and detected that almost half of MetaMap’s 383,572 mappings were failures. Of these failures, 82.22% were word sense ambiguity failures, 15.90% were boundary failures, and 1.88% were missed term failures. The automated failure detection achieved precision, recall, accuracy, and F1 score of 83.00%, 92.57%, 88.17%, and 87.52%, respectively.
We first discuss challenges of using out-of-the-box biomedical NLP tools, such as MetaMap, to process patient-generated text. We then discuss the contributions and wider implications of our study for research activities that need to manage the constantly changing and overwhelming amount of patient-generated data. We end by summarizing our contributions to the medical Internet research community.
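The F1 scores reported in this section are consistent with the standard definition of F1 as the harmonic mean of precision and recall. The short Python check below recomputes F1 from the reported precision and recall values; the function name `f1_score` and the dictionary labels are illustrative, and the figures are copied from the text.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) as stated in the text.
reported = {
    "split phrases": (82.00, 78.85, 80.39),
    "community nomenclature": (88.00, 100.00, 93.62),
    "misspellings": (80.00, 93.02, 86.02),
    "abbreviations": (82.00, 95.35, 88.17),
    "names": (66.00, 100.00, 79.52),
    "verbs": (32.00, 100.00, 48.48),
    "inconsistent mappings": (66.00, 53.23, 58.93),
    "overall": (83.00, 92.57, 87.52),
}

for name, (p, r, f1) in reported.items():
    computed = 100 * f1_score(p / 100, r / 100)
    print(f"{name}: computed {computed:.2f}, reported {f1:.2f}")
```

Accuracy cannot be rechecked this way because it also depends on the number of true negatives, which requires the underlying counts.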
Some of these failures are already known problematic failures of MetaMap [
We focused our research on MetaMap; however, findings from our study can apply to other NLP tools in a similar manner. A few failure causes, such as inconsistency of the word sense disambiguation feature, pertain more to MetaMap than to other tools. However, any NLP tool that provides semantic connections requires a similar word sense disambiguation feature. Moreover, different NLP tools could excel in different areas, and our automated failure detection can cost-effectively highlight problematic areas. Similarly, our techniques for detecting failures could strengthen the performance of other NLP tools to process patient-generated text and more traditional types of text. For instance, the word sense ambiguity failure caused by neglecting POS information can also be problematic in different types of text, including biomedical literature. That failure might surface less frequently due to differences in sentence structure between the biomedical literature and patient-generated text. Nevertheless, it is a significant problem that applies to both types of text. Applying such POS information when mapping a term could increase the accuracy of the mappings from a variety of texts. Another example is the missed term failure caused by community nomenclature. MetaMap or other NLP tools will miss terms if particular synonyms are missing from the vocabulary source. Researchers could use the algorithm by Schwartz and Hearst [
The dictionary-based matching and NLP techniques used in our detection process were evaluated in previous studies [
Moreover, a number of updates were made for both the UMLS and MetaMap [
Although our study focused on online health community text, the insights inform efforts to apply NLP tools to process various types of patient-generated text, including blogs or online journals, which share similar narrative writing styles and colloquial language. Moreover, Facebook and email provide conversational interactions similar to the interaction in online health communities. Tweets about emergency responses [
Example failures that resulted from the application of MetaMap to process patient-generated text in an online health community (blue terms represent patient-generated text; black terms represent MetaMap’s interpretation; and red terms represent failure type).
Processing patient-generated text provides unique opportunities. However, this process is fraught with challenges. We identified three types of failures that biomedical NLP tools could produce when processing patient-generated text from an online health community. We further identified causes for each failure type, which became the basis for applying automated failure detection methods using pre-validated NLP and dictionary-based techniques. Using these techniques, we showed the feasibility of identifying common failures in processing patient-generated health text, at a low cost. The value of our approach lies in helping researchers and developers quickly assess the capability of NLP tools for processing patient-generated text.
CHV: Consumer Health Vocabulary
cTAKES: Clinical Text Analysis and Knowledge Extraction System
CUI: concept unique identifiers
EC: esophageal cancer
EMR: electronic medical records
LOL: laugh out loud or lots of love
MedLEE: Medical Language Extraction and Encoding System
NLM: National Library of Medicine
NLP: natural language processing
POS: part of speech
RAD: radiation absorbed dose
SAIDS: Simian Acquired Immunodeficiency Syndrome
SKR: semantic knowledge representation
SMS: short message service
UMLS: Unified Medical Language System
URL: Uniform Resource Locator
XOXO: hugs and kisses
This work was funded by NSF SHB 1117187 and NIH-NLM #5T15LM007442-10 BHI Training Program.
None declared.