This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Efficiently finding clinical examination studies—studies that quantify the value of symptoms and signs in the diagnosis of disease—is becoming increasingly difficult. Filters developed to retrieve studies of diagnosis from Medline lack specificity because they also retrieve large numbers of studies on the diagnostic value of imaging and laboratory tests.
The objective was to develop filters for retrieving clinical examination studies from Medline.
We developed filters in a training dataset and validated them in a testing database. We created the training database by hand searching 161 journals (n = 52,636 studies). We evaluated the recall and precision of 65 candidate single-term filters in identifying studies that reported the sensitivity and specificity of symptoms or signs in the training database. To identify best combinations of these search terms, we used recursive partitioning. The best-performing filters in the training database as well as 13 previously developed filters were evaluated in a testing database (n = 431,120 studies). We also examined the impact of examining reference lists of included articles on recall.
In the training database, the single-term filters with the highest recall (95%) and the highest precision (8.4%) were diagnosis[subheading] and “medical history taking”[MeSH], respectively. The multiple-term filter developed using recursive partitioning (the RP filter) had a recall of 100% and a precision of 0.89% in the training database. In the testing database, the Haynes-2004-Sensitive filter (recall 98%, precision 0.13%) and the RP filter (recall 89%, precision 0.26%) showed the best performance. The recall of these two filters increased to 99% and 94%, respectively, with review of the reference lists of the included articles.
Recursive partitioning appears to be a useful method of developing search filters. The empirical search filters proposed here can assist in the retrieval of clinical examination studies from Medline; however, because of the low precision of the search strategies, retrieving relevant studies remains challenging. Improving precision may require systematic changes in the tagging of articles by the National Library of Medicine.
In arriving at a diagnosis, clinicians often rely on clinical examination findings (ie, information from the patient’s history and/or physical examination) [
In many areas of medicine, filters have been developed to facilitate the search for relevant articles. Filters are pretested search strategies that help identify studies of a certain type from among all the other studies in Medline. Search filters that are optimized for the retrieval of studies of diagnosis, therapy, and clinical prediction rules are available [
The goal of this study was to develop and evaluate Medline filters that could facilitate retrieval of clinical examination studies.
The training and testing of the filters entailed 8 steps: (1) development of a training database, (2) identification of candidate single-term filters, (3) identification of single-term filters with the best performance in the training database, (4) identification of the multiple-term filter with the best performance in the training database using recursive partitioning, (5) development of a testing database, (6) evaluation of the performance of filters developed in this study in the testing database, (7) evaluation of the performance of previously developed filters in the testing database, and (8) examination of the impact of reviewing reference lists of included articles on recall. We performed our research using PubMed, the United States National Library of Medicine’s public search engine for accessing Medline.
We used the Clinical Hedges database, the methods of which have been previously described [
One investigator (author NS) initially reviewed the title and abstract (if an abstract was available) and full text, if necessary, of the 1347 studies and classified each article as a
We then recreated the Clinical Hedges dataset by entering the 161 journals in Medline and by restricting the publication year to 2000 (
Flow sheet describing development of the training database.
We generated a list of 65 candidate search terms in PubMed syntax with the help of two clinicians, three reference librarians, and a thorough review of the literature. The expert searchers independently reviewed our lists of candidate terms and suggested additional terms. We used terms pertaining to clinical examination and diagnosis as well as negated terms (eg, NOT MRI). (See
We evaluated each individual filter against the training database to determine its recall (proportion of the clinical examination articles that the filter detected), precision (proportion of articles retrieved that were relevant), F-measure (an overall measure combining recall and precision), “fallout” (the proportion of nonrelevant articles that were retrieved), and the number needed to read (the average number of articles the searcher will need to look at to find each relevant article) [
Because testing all combinations of single-term filters would have been prohibitive, we used recursive partitioning to develop the best multiple-term filter (hereinafter referred to as the recursive partitioning filter) [
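As a rough illustration of stepwise filter construction, the Python sketch below greedily ORs together candidate terms, keeping a term only when it improves the F-measure on a labeled corpus. This is a deliberate simplification of the CART-style recursive partitioning procedure the study actually used (it builds only OR filters, not an AND/OR tree), and the miniature corpus is invented for illustration.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall; 0 when nothing relevant is retrieved."""
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def build_or_filter(articles, candidate_terms):
    """Greedily OR together terms while the F-measure keeps improving.

    articles: list of (set_of_index_terms, is_relevant) pairs.
    Returns the selected term set and its F-measure on the corpus.
    """
    def score(selected):
        tp = fp = fn = 0
        for terms, relevant in articles:
            hit = bool(selected & terms)
            tp += hit and relevant
            fp += hit and not relevant
            fn += (not hit) and relevant
        return f_measure(tp, fp, fn)

    selected, best = set(), 0.0
    improved = True
    while improved:
        improved, best_term = False, None
        for term in candidate_terms:
            if term in selected:
                continue
            s = score(selected | {term})
            if s > best:
                best, best_term, improved = s, term, True
        if improved:
            selected.add(best_term)
    return selected, best

# Invented miniature corpus: 3 relevant and 2 nonrelevant articles.
corpus = [
    ({"diagnosis"}, True),
    ({"sensitivity"}, True),
    ({"diagnosis", "sensitivity"}, True),
    ({"mri"}, False),
    ({"therapy"}, False),
]
terms, f = build_or_filter(corpus, ["diagnosis", "sensitivity", "mri", "therapy"])
```

On this toy corpus the procedure selects "diagnosis" first, then adds "sensitivity" because the OR combination raises the F-measure; the nonrelevant-only terms are never added.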
To develop the testing database, we used the largest collection of systematic reviews on clinical examination in the literature: The Rational Clinical Examination series in the Journal of the American Medical Association (JAMA) [
Articles included in these 15 reviews were regarded as
Flow sheet describing development of the testing database.
For the 3 filters with the highest recall in the training database, we calculated the recall, precision, F-measure, and the number needed to read in the testing database. The calculations were based on the cells and formulas in the table below.
A 2 × 2 table created for each systematic review and formulas used^a

| | Articles Included in the Systematic Review | Articles Not Included in the Systematic Review |
| --- | --- | --- |
| Detected by filter | A | B |
| Missed by filter | C | D |

^a Recall = A/(A+C); Precision = A/(A+B); F-measure = 2 × precision × recall/(precision + recall); Number needed to read = 1/precision; Fallout = B/(B+D) [
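These formulas translate directly into code. The sketch below is a minimal Python helper (not from the paper) that computes the five performance measures from the four cells of the 2 × 2 table; the example counts are hypothetical.

```python
def filter_metrics(a, b, c, d):
    """Compute search-filter performance measures from a 2x2 table.

    a: relevant articles retrieved by the filter
    b: nonrelevant articles retrieved
    c: relevant articles missed
    d: nonrelevant articles not retrieved
    """
    recall = a / (a + c)
    precision = a / (a + b)
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) > 0 else 0.0)
    nnr = 1 / precision if precision > 0 else float("inf")  # number needed to read
    fallout = b / (b + d)
    return {"recall": recall, "precision": precision,
            "f_measure": f_measure, "nnr": nnr, "fallout": fallout}

# Hypothetical counts: 90 relevant retrieved, 900 nonrelevant retrieved,
# 10 relevant missed, 9000 nonrelevant correctly excluded.
m = filter_metrics(90, 900, 10, 9000)
```

With these hypothetical counts, recall is 90% and the searcher would need to read 11 articles to find each relevant one.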
The performance of 12 previously developed filters validated for retrieving articles on diagnosis [
Authors of systematic reviews often examine reference lists hoping to increase recall. We examined how this strategy would complement the use of filters in the area of clinical examination. Specifically, we examined whether checking the reference lists of included articles would allow use of a filter with a lower recall. Thus, we identified articles that were missed by the 2 filters with the highest recall and checked to see if these articles were included in the reference lists of the articles not missed by these filters.
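The reference-list check described above can be expressed as a small set computation. The sketch below is a hypothetical helper (not from the paper): it recovers missed articles that are cited by at least one retrieved relevant article and reports the augmented recall. The article IDs in the example are invented.

```python
def augmented_recall(relevant, retrieved, references):
    """Recall after also checking reference lists of retrieved relevant articles.

    relevant:   set of IDs of relevant articles
    retrieved:  set of IDs retrieved by the filter
    references: dict mapping an article ID to the set of IDs it cites
    """
    found = relevant & retrieved    # relevant articles the filter caught
    missed = relevant - retrieved   # relevant articles the filter missed
    # An article is recovered if any retrieved relevant article cites it.
    recovered = {m for m in missed
                 if any(m in references.get(hit, set()) for hit in found)}
    return len(found | recovered) / len(relevant)

# Invented example: the filter misses articles 3 and 4; article 1's
# reference list cites article 3, so hand-checking recovers it. Article 4
# is cited only by a nonrelevant retrieved article (10), so it stays missed.
r = augmented_recall(
    relevant={1, 2, 3, 4},
    retrieved={1, 2, 10},
    references={1: {3, 7}, 2: set(), 10: {4}},
)
```

Note that, following the method in the paper, only the reference lists of the *included* (relevant and retrieved) articles are checked.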
Filters with the best performance in the training database are shown in the table below.
Filters with the best recall (keeping fallout less than 50%), precision (keeping recall greater than 50%), and F-measure in the training database

| Filter | Performance | Recall (%) | Precision (%) | F-measure | NNR^a |
| --- | --- | --- | --- | --- | --- |
| Diagnosis[subheading] | Best recall | 95 | 0.35 | 0.71 | 279 |
| Medical history taking[MeSH] | Best precision and F-measure | 12 | 8.44 | 9.79 | 11.86 |
| Diagnosis[tw] OR "sensitivity and specificity"[MeSH] | Best recall (hereinafter Dx-high recall) | 100 | 0.52 | 1.04 | 191 |
| Predictive value of tests[mesh] OR specificity[TIAB] | Best precision and F-measure (hereinafter Dx-precise) | 67 | 1.95 | 3.78 | 51 |
| Clinical*[tw] OR symptom*[tw] OR exam*[tw] OR criteria[tw] OR tests[tw] OR test[tw] | Best recall (hereinafter CE-high recall) | 100 | 0.27 | 0.53 | 377 |
| Tests[tw] OR physical[tw] | Best precision and F-measure (hereinafter CE-precise) | 62 | 0.72 | 1.43 | 138 |
| (Diagnosis[tw] AND (specific*[tw] OR clinical*[tw] OR exam*[tw])) OR "sensitivity and specificity"[MeSH] | Best overall filter from recursive partitioning (hereinafter RP filter)^b | 100 | 0.89 | 1.76 | 113 |

^a Number needed to read
^b Filter developed using recursive partitioning (see “Methods” section)
The recursive partitioning tree is shown in the figure below.
Best multiple-term filter for retrieval of articles on clinical examination (CE) developed using recursive partitioning.
The recall, precision, F-measure, and the number needed to read for the filters developed in this study, as well as for the 13 previously developed filters and combinations of filters, are presented in the table below.
Performance of the search filters in the testing database sorted according to recall

| Filters or Filter Combinations | Recall (%) | Precision (%) | F-measure | NNR^a |
| --- | --- | --- | --- | --- |
| Individual filters | | | | |
| Haynes-2004-Sensitive [ | 98 | 0.13 | 0.26 | 778 |
| Vincent-2003 [ | 98 | 0.09 | 0.17 | 1154 |
| Bachmann-2002 [ | 96 | 0.11 | 0.22 | 906 |
| Haynes-1994-Sensitive [ | 95 | 0.16 | 0.31 | 641 |
| Dx-high recall^b | 95 | 0.12 | 0.25 | 804 |
| Van der Weijden-1997 [ | 95 | 0.07 | 0.13 | 1490 |
| CE-high recall^b | 91 | 0.08 | 0.15 | 1330 |
| Haynes-1994-Accurate [ | 91 | 0.07 | 0.14 | 1431 |
| RP-filter^b | 89 | 0.26 | 0.52 | 380 |
| Rational Clinical Exam [ | 73 | 0.30 | 0.61 | 328 |
| Deville-2002 [ | 71 | 0.40 | 0.80 | 249 |
| Haynes-2004-Accurate [ | 69 | 0.45 | 0.89 | 224 |
| Deville-2000-Accurate [ | 64 | 0.64 | 1.26 | 157 |
| Deville-2000-Sensitive [ | 64 | 0.60 | 1.19 | 167 |
| Haynes-1994-Specific [ | 51 | 0.72 | 1.42 | 139 |
| Haynes-2004-Specific [ | 36 | 1.01 | 1.97 | 99 |
| Filter combinations | | | | |
| Haynes-2004-Sensitive [ | 100 | 0.06 | 0.12 | 1613 |
| CE-high recall OR RP | 99 | 0.06 | 0.13 | 1572 |
| Haynes-2004-Sensitive [ | 98 | 0.11 | 0.22 | 890 |
| Haynes-2004-Sensitive [ | 95 | 0.13 | 0.25 | 790 |
| Haynes-2004-Sensitive [ | 88 | 0.19 | 0.39 | 515 |

^a NNR = number needed to read
^b The three filters with the highest recall in the training database
Overall, 4 of 188 relevant articles were missed by the Haynes-2004-Sensitive search strategy, and, of these, 2 were retrieved by reviewing the reference lists of the articles not missed by this strategy (increasing recall from 98% to 99%). Of the 19 articles missed by the recursive partitioning strategy, 8 were retrieved by reviewing the reference lists of the articles not missed by this strategy (increasing recall from 89% to 94%).
We quantified the recall and precision of filters that may be used to find articles on clinical examination in Medline. While the use of recursive partitioning may increase the precision of searching, all of the strategies we tested had very low precision (less than 2%).
For health care providers looking for information regarding the diagnostic accuracy of clinical examination findings, the RP filter appears to be the most reasonable choice. For example, suppose a clinician wants to determine the posttest probability of congestive heart failure among patients with a third heart sound. The search using the RP filter in PubMed would be (gallop OR S3 OR third heart sound) AND heart failure[MeSH] AND ((Diagnosis[tw] AND (specific*[tw] OR clinical*[tw] OR exam*[tw])) OR "sensitivity and specificity"[MeSH]). As of March 2011, this search yielded 68 articles, several of which directly related to the clinician’s question. Although not formally studied, the clinician could restrict the search to systematic reviews by adding the term “systematic[sb]”; this strategy yielded 1 relevant systematic review. While the NNRs reported in this study for the filters examined are very high (
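In practice, the clinical question is simply ANDed with the filter. The helper below is a hypothetical convenience function (not part of the paper) that builds the combined PubMed query string from a clinical question, optionally restricting results to systematic reviews with systematic[sb].

```python
# The RP filter exactly as reported in the paper.
RP_FILTER = ('(Diagnosis[tw] AND (specific*[tw] OR clinical*[tw] OR exam*[tw])) '
             'OR "sensitivity and specificity"[MeSH]')

def build_rp_query(clinical_question, systematic_reviews_only=False):
    """Combine a clinical question with the RP filter into one PubMed query.

    build_rp_query is a hypothetical helper; paste its output into PubMed.
    """
    query = f"({clinical_question}) AND ({RP_FILTER})"
    if systematic_reviews_only:
        query += " AND systematic[sb]"
    return query

# The third-heart-sound example from the text:
q = build_rp_query("(gallop OR S3 OR third heart sound) AND heart failure[MeSH]")
```

The resulting string is equivalent to the worked example in the text; the extra parentheses only make the grouping explicit.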
For the researcher who wants to undertake a systematic review, the Haynes-2004-Sensitive filter [
All of the filters we tested had very low precision in identifying clinical examination studies. Our findings are consistent with those published by Haynes and colleagues [
Comparison of the performance of filters for clinical examination, diagnosis, and treatment

| Filters | Recall (%) | Precision (%) | F-measure | NNR^a |
| --- | --- | --- | --- | --- |
| Clinical examination | | | | |
| Haynes-2004-Sensitive [ | 98 | 0.13 | 0.26 | 778 |
| Recursive partitioning | 89 | 0.26 | 0.52 | 380 |
| Diagnosis | | | | |
| Haynes-2004-Sensitive [ | 99 | 1.1 | 2.17 | 91 |
| Treatment | | | | |
| Haynes 2005 [ | 99 | 9.9 | 18.0 | 10 |
| Haynes 1994 [ | 99 | 22 | 36.0 | 4.5 |

^a NNR = number needed to read
^b Values are for the most-sensitive multi-term filter
There are several limitations to our study. The Hedges database [
A surprising result is that only 25% and 20% of the clinical examination studies in the training database were coded with the MeSH terms “physical examination” and “signs and symptoms”, respectively. This inconsistency in the assignment of these MeSH terms limits the performance of search filters on this topic.
We present a new method for the development of multi-term filters. The use of recursive partitioning in the development of filters is novel and seems particularly well suited to situations with many candidate terms. When the number of candidate terms is small, one could test every possible combination of terms against the dataset, but this becomes prohibitive as the number of terms grows. Recursive partitioning instead constructs a search filter in a stepwise fashion, allows filters that combine both AND and OR Boolean connectors, and identifies the combination of terms with the best balance of recall and precision.
Recursive partitioning offers an alternative method of developing filters: it not only allows for the development of filters with the best combination of recall and precision, but also for the development of filters that use both AND and OR Boolean connectors. Despite the advantages of recursive partitioning, the filters we developed for the retrieval of clinical examination studies had relatively low precision. We believe the National Library of Medicine should create a publication type for articles that quantify the sensitivity and specificity of the clinical examination. This new tag could improve retrieval of studies of clinical diagnosis.
We acknowledge Chen-Pin Wang, PhD, Department of Epidemiology and Biostatistics at the University of Texas Health Science Center at San Antonio, who helped us develop statistical strategies to address overfitting in the recursive partitioning model.
None declared
List of single-term filters.
List of filters evaluated in the testing corpus.