The development of consumer health information applications such as health education websites has motivated research on consumer health vocabulary (CHV). Term identification is a critical task in vocabulary development. Because of the heterogeneity and ambiguity of consumer expressions, term identification for CHV is more challenging than for professional health vocabularies.
For the development of a CHV, we explored several term identification methods, including collaborative human review and automated term recognition methods.
A set of criteria was established to ensure consistency in the collaborative review, which analyzed 1893 strings. Using the results from the human review, we tested two automated methods: the C-value formula and a logistic regression model.
The study identified 753 consumer terms and found the logistic regression model to be highly effective for CHV term identification (area under the receiver operating characteristic curve = 95.5%).
The collaborative human review and logistic regression methods were effective for identifying terms for CHV development.
Two important steps in vocabulary development are (1) the identification of candidate strings (ie, words or phrases) in a domain and (2) the determination of which of these should be included in a vocabulary as “valid” terms, also called “termhood determination.” Health vocabulary development, which has a long history, requires significant effort for collecting candidate terms and determining termhood [
Research and development of controlled consumer health vocabularies (CHVs) is a relatively new endeavor in the health vocabulary field [
The general goal of our CHV research is to help overcome the vocabulary gap between consumers and health information provided by informatics applications. The specific aim of this paper is to elucidate term identification methods for CHVs. CHV research has largely been driven by the proliferation of health-related materials on the Web, the emergence of electronic personal health records, as well as the growing availability of various consumer health applications (eg, decision support tools). Over the past five years, researchers have found that consumer terms are not well covered by the existing health vocabularies, which mostly represent the language of health professionals [
Developing and validating a comprehensive CHV is challenging because “consumers” constitute a plethora of highly diverse groups. Further, individuals uniquely acquire health-related terms and concepts from formal and informal sources (eg, media exposure) and from personal experiences. Nevertheless, there is strong evidence of the stability of lay health language among particular populations, for specific tasks [
We have been working on an open access and collaborative (OAC) CHV project. The first step in creating the OAC CHV was to identify consumer terms since surface forms, represented as strings in written text, are more tractable than concepts (ie, underlying meanings) or semantic relations, both of which require in-depth understanding of term usage, rhetorical intent, and explanatory models. Because consumer terms are heterogeneous and even less well defined than professional terms [
1. CHVs consist of actual terms commonly used by consumers (in any particular discourse group).
2. CHV terms must allow for computer processing of consumer language.
Since many professional health vocabulary terms are already used by consumers, though in some cases with different or broader semantics (eg, “diabetes” for diabetes mellitus, types 1 and 2), we focused on consumer terms not yet represented in existing vocabularies (eg, “broken finger” for any type of fracture in the “distal,” “middle,” or “proximal phalanges”).
Because the number of candidate strings is often very large in any domain, researchers have explored the use of corpus-based automated term recognition (ATR) methods for extracting the most promising strings for human review from domain-specific documents [
In the biomedical domain, ATR methods have been applied to Medline literature [
In this study, we first identified CHV terms through collaborative review of strings derived from query logs of a consumer health site [
Our use of ATRs in this study differs from that in prior studies in the biomedical domain in two aspects: (1) short phrases from query logs were used as the text corpus rather than entire sentences from full-text sources, and (2) “new” CHV terms, not yet part of existing vocabularies, were identified rather than “pre-existing” terms such as UMLS terms.
The term identification study had three components:
1. Candidate string extraction from a query log data set of terms that could not be mapped to UMLS
2. Collaborative manual review of a subset of the candidate strings and identification of CHV terms
3. Application of ATR methods (the C-value formula and logistic regression models) to human-reviewed CHV terms
We obtained a set of query log files [
The preprocessed queries were then mapped to the 2004AA version of the UMLS Metathesaurus using lexical methods (ie, removing non-alphanumeric symbols, stemming, normalization, and truncation). Queries that did not map to the UMLS Metathesaurus were broken into n-grams. N-grams that matched terms in the Metathesaurus were removed, and the remaining n-grams were collected into sets by frequency and number of words.
We used n-gram analysis to find candidate terms from unmapped query strings. N-gram analysis uses the frequencies of n-grams (text fragments of n words) in a text sample to estimate the likelihood that a string is a potential term. In general, the more frequently an n-gram appears in text documents, the more likely it is to be a “useful” term.
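The n-gram step can be illustrated with a minimal sketch. It assumes whitespace tokenization and a plain set of already-mapped terms standing in for the Metathesaurus lookup; the function names and toy queries are hypothetical and are not taken from the study.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-word fragments of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def candidate_ngrams(queries, known_terms, max_n=4):
    """Count n-grams (n = 1..max_n) across queries, skipping known terms."""
    counts = Counter()
    for query in queries:
        tokens = query.lower().split()
        for n in range(1, max_n + 1):
            for gram in ngrams(tokens, n):
                if gram not in known_terms:
                    counts[gram] += 1
    return counts


# Toy data for illustration only (not from the study's query logs).
queries = ["sun poisoning rash", "sun poisoning treatment", "bypass surgery"]
known_terms = {"bypass surgery", "rash"}  # stand-in for Metathesaurus matches
counts = candidate_ngrams(queries, known_terms)
```

Sorting `counts` by frequency and by number of words would then give groupings of the kind described above, from which the most frequent n-grams can be passed on for review.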
Six researchers (first six of the authors) reviewed candidate strings (n-grams) collaboratively. First, each reviewer independently reviewed a subset of the n-grams (n = 1 to 4 and frequency > 50) and voted on whether they should be considered CHV terms. Unanimous votes for n-grams that were reviewed by at least three people were entered as “master” votes. Otherwise, termhood was discussed by the entire group until consensus was reached and a master vote was cast. To support reviewers from geographically distributed locations and to calculate votes, a specially designed Web-based application [
Application to support collaborative manual review of candidate strings
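The master-vote rule described above (a unanimous vote by at least three reviewers, otherwise group discussion) could be expressed roughly as follows. This is an illustrative sketch only; the function name and the use of `None` to mean "needs discussion" are assumptions, not a description of the Web-based application used in the study.

```python
def master_vote(votes, min_reviewers=3):
    """Return the unanimous vote if enough reviewers voted, else None.

    votes is a list of booleans (True = "is a CHV term").
    None signals that the string should go to group discussion.
    """
    if len(votes) >= min_reviewers and len(set(votes)) == 1:
        return votes[0]
    return None


# Hypothetical examples.
unanimous_yes = master_vote([True, True, True])    # unanimous, 3 reviewers
split_vote = master_vote([True, False, True])      # disagreement
too_few = master_vote([False, False])              # fewer than 3 reviewers
```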
Through several iterations of votes and discussion, we established the following review criteria:
CHV terms should be syntactic constituents or phrases such as a noun phrase or adjective phrase (eg, “bypass surgery” is a phrase, but “fever in” is not). Special attention should be given to noun phrases.
CHV terms should have independent semantics and should not only occur as a part of longer valid terms or as a part of wild card searches (eg, [chicken-, small-] “pox vaccine” is not considered a CHV term).
CHV terms should be specific to the medical domain (eg, “Google” and “Yahoo” are general words, not CHV terms).
CHV terms should function as semantic components in addition to functioning as syntactic components (eg, stop words “the” and “a” as well as empty verbs “make” and “take” are not considered CHV terms).
N-grams representing existing UMLS medical concepts are considered to be CHV terms, but CHV terms may represent non-UMLS concepts.
Eponymous forms of CHV terms are considered to be CHV terms (eg, “Parkinson’s”).
CHV terms may include spelling errors (eg, “Chron's disease”). These misspelled terms are given the label “disparaged.”
Terms with distinct clinical semantics (eg, “result”) are considered to be CHV terms, regardless of ambiguity and/or vagueness in other domains.
We singled out several types of terms for future investigation and assigned special labels to them:
meta: A term that is usually used to indicate the category/type of information sought or presented (eg, “picture,” “guideline,” and “tutorial”).
modifier: A term not typically used by itself, but for limiting or qualifying other terms (eg, “sexually” as in “sexually active”).
relation: A term not typically used by itself, but used to describe relations among concepts (eg, “caused by” and “results in”). We also include the unary relation “not” in this set.
Currently, we consider terms classified as meta and modifier to be CHV terms, but relations are not considered CHV terms.
Once these review criteria were established, researchers double-checked the previously cast master votes for compliance. A second round of discussion resulted in some adjustments to the votes.
We explored the use of two ATR methods to facilitate candidate selection for human review: (1) the C-value method (C loosely stands for “candidate collection”) and (2) logistic regression.
We applied the C-value method to the strings that had already been reviewed. First, the strings were parsed to filter out single-word strings and strings that were not noun phrases. The C-value was calculated using the formula [
(When
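Because the formula itself is not reproduced in this text, the following sketch uses the C-value as it is commonly published in the ATR literature (Frantzi and colleagues): for a candidate a of |a| words with frequency f(a), the score is log2|a| · f(a) when a is not nested in any longer candidate, and log2|a| · (f(a) minus the mean frequency of the longer candidates containing a) otherwise. That this is exactly the variant applied in the study is an assumption, and the frequencies below are invented for illustration.

```python
import math


def contains(long_words, short_words):
    """True if short_words occurs as a contiguous run inside long_words."""
    k = len(short_words)
    return any(
        long_words[i:i + k] == short_words
        for i in range(len(long_words) - k + 1)
    )


def c_value(term, freq):
    """C-value of a multiword candidate.

    freq maps every candidate string to its corpus frequency.
    """
    words = term.split()
    # Longer candidates in which `term` is nested.
    nesting = [
        other for other in freq
        if other != term and contains(other.split(), words)
    ]
    if not nesting:
        return math.log2(len(words)) * freq[term]
    nested_mean = sum(freq[o] for o in nesting) / len(nesting)
    return math.log2(len(words)) * (freq[term] - nested_mean)


# Invented frequencies, for illustration only.
freq = {
    "heart attack": 100,
    "heart attack symptoms": 30,
    "mild heart attack": 10,
}
score_nested = c_value("heart attack", freq)
score_top = c_value("heart attack symptoms", freq)
```

Note that log2 of 1 is zero, so under this form of the formula every single-word string scores 0, which is consistent with filtering out single-word strings before the calculation.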
To create the logistic regression model that predicts the termhood of a candidate string
1. part-of-speech (POS) tag (eg, noun or adjective) of the first word
2. POS tag of the last word
3. noun phrase status (ie, yes/no)
4. word count (ie, number of words in
5. number of distinct
6. number of repeated
7. percentage of distinct
8. percentage of repeated
9. number of distinct
10. number of repeated
11. percentage of distinct
12. percentage of repeated
13. frequency of
14. number of distinct host
15. average number of distinct queries containing
The frequency distribution of the POS tags (variables 1 and 2) required them to be collapsed into fewer categories for modeling. The original tags came from a Brill-style, rule-based POS tagger developed by Mark Hepple [
The continuous variables (variables 4 to 15) were dichotomized based on the median value. The dichotomized variables were used in the logistic regression to predict or explain the probability of having a term voted “yes” for termhood.
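A median split of this kind might look like the sketch below. It assumes that values strictly greater than the median are coded 1 (as the `_gt_median` suffix of the variable names in the regression formula suggests) and all others 0; the handling of ties and the toy values are assumptions.

```python
from statistics import median


def dichotomize(values):
    """Return 1 where a value exceeds the median of `values`, else 0."""
    cutoff = median(values)
    return [1 if v > cutoff else 0 for v in values]


# Toy word counts for five candidate strings (illustration only).
word_counts = [1, 2, 2, 3, 4]
flags = dichotomize(word_counts)
```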
The logistic regression model building was carried out by a stepwise procedure. After calculating the odds ratio estimates, most of the variables were dropped. The remaining variables 1, 2, 3, 6, 10, and 15 were represented in the regression formula as FirstPOS, LastPOS, np_value, repeat_sup_gt_median, repeat_sub_gt_median, and distinct_perhost_gt_median.
For both the C-value formula and the regression model, we calculated the sensitivity and specificity at different thresholds to create the receiver operating characteristic (ROC) curves. To estimate the area under the ROC curve for the logistic regression, we used the c-statistic [
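The evaluation step can be sketched as follows, assuming each candidate has a numeric score (a C-value or a predicted probability) and a binary master vote. The c-statistic is computed here in its standard pairwise form: the proportion of positive/negative pairs in which the positive scores higher, with ties counted as one half. The function names and the data are hypothetical.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs, one per distinct threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # A candidate is predicted to be a term if its score meets the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def c_statistic(scores, labels):
    """Share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Invented scores and master votes (1 = voted a CHV term).
scores = [0.9, 0.8, 0.7, 0.4, 0.3]
labels = [1, 1, 0, 1, 0]
curve = roc_points(scores, labels)
auc = c_statistic(scores, labels)
```

For a binary outcome, the c-statistic equals the area under the ROC curve traced by `roc_points`, so either quantity can be reported as the AUC.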
We identified 18454 candidate n-grams (n = 1 to 5); 7967 were reviewed by at least one reviewer, and 1893 distinct n-grams received master votes (
Number of n-grams with master votes and number of n-grams voted as CHV terms
N-gram length | N-grams with master votes | N-grams voted as CHV terms |
1-gram | 379 | 261 |
2-gram | 1101 | 303 |
3-gram | 356 | 154 |
4-gram | 57 | 35 |
Total | 1893 | 753 |
The logistic regression model
The logistic regression model is shown in
The ROC curves for C-value and the regression model are shown in
Curves for C-value and the regression model
This paper reports on several term identification methods for the OAC CHV project. We established a set of criteria and procedures to conduct a manual review, resulting in multiple reviewers reaching consensus on 1893 n-grams, including identification of 753 new terms for inclusion in the OAC CHV that were not in the 2004AC version of UMLS.
The OAC termhood criteria were established collaboratively, reflecting the reviewers’ backgrounds in several different fields: controlled vocabulary, health informatics, linguistics, cognitive science, and computer science. While the OAC termhood criteria could be further refined and termhood criteria for health vocabularies are often not published, we believe publishing such criteria could benefit vocabulary research. For instance, many articles evaluate vocabularies and study methods of mapping one vocabulary to another [
In CHV research, the termhood issue is of particular importance because there has been limited discussion and little consensus on what should be considered a consumer term. Is “sun poisoning” an acceptable term? How about “skin conditions”? As was pointed out in the Introduction, health professional vocabularies do not always agree on the termhood of a phrase. Consumer expressions, however, require more scrutiny because it is harder to determine their semantics and contexts of usage.
We tested two ATR methods (C-value and logistic regression) on the human-reviewed n-grams. The C-value was useful for determining termhood, though it had only moderate distinguishing power (AUC = 70.9%). The logistic regression model discriminated much better between terms and non-terms (AUC = 95.5%).
These results suggest that a specially fitted logistic regression model is better suited than the generic C-value method for the task of identifying CHV terms according to our criteria. The C-value method’s performance problem was partially caused by issues unique to this data set, among them the inclusion of infrequent misspellings and the high frequency of most candidates, which made frequency a less reliable predictor. The imperfection in noun-phrase parsing is not unexpected, though the relatively short query strings posed a greater challenge for parsing. Like many vocabularies, OAC includes strings that are single words and are not noun phrases, while C-value is typically calculated for multiword noun phrases.
The logistic regression model demonstrated excellent suitability for OAC termhood determination. It may have to be altered to be used with other corpora or for other types of vocabularies due to the particularities of query-based corpus attributes such as the short length of the documents. Nonetheless, training of predictive models for a particular corpus and vocabulary is a generalizable strategy. Although general principles exist, the determination of which strings are to be considered legitimate vocabulary terms often depends on the domain and the vocabulary developers’ criteria (eg, including verb phrases [
The regression model utilizes syntactic and nesting pattern features; both types of features are well-recognized termhood indicators. A concern often raised about CHV research is that the syntax and semantics of consumer phrases are too unruly to be represented in a computable vocabulary. The fact that many consumer phrases have common term characteristics suggests that they are tractable terms.
Our study has several limitations. Because consumer utterances are not as readily available as corpora of medical literature or clinical records, we used query logs that contained relatively few complete sentences, which resulted in many POS and noun phrase analysis errors. In addition, because of budget and logistical constraints, only researchers, not lay consumers, reviewed the candidate terms. However, the analysis was based on utterances from queries submitted by tens of thousands of consumers.
Based on the results of this study, we plan to apply the logistic regression model to the candidate n-grams and select those predicted to be terms for human review. We also plan to add the identified CHV terms to OAC. The authors associated with NLM are interested in investigating similar techniques to aid in identifying candidate terms for inclusion into the SPECIALIST Lexicon of the NLM, and for quality control.
We thank the National Library of Medicine (NLM) for sharing the MedlinePlus query log data. This work is supported by the National Institutes of Health (NIH) grant R01 LM07222 and by the Intramural Research Program of the NIH, NLM.
None declared.
ATR: automated term recognition
AUC: area under the curve
CHV: consumer health vocabulary
NIH: National Institutes of Health
NLM: National Library of Medicine
OAC: open access and collaborative
POS: part of speech
ROC: receiver operating characteristic
UMLS: Unified Medical Language System