Understandability plays a key role in ensuring that people accessing health information are capable of gaining insights that can assist them with their health concerns and choices. Access to unclear or misleading information has been shown to negatively affect the health decisions of the general public.
The aim of this study was to investigate methods to estimate the understandability of health Web pages and use these to improve the retrieval of information for people seeking health advice on the Web.
Our investigation considered methods to automatically estimate the understandability of health information in Web pages, and it provided a thorough evaluation of these methods using human assessments as well as an analysis of preprocessing factors affecting understandability estimations and associated pitfalls. Furthermore, lessons learned for estimating Web page understandability were applied to the construction of retrieval methods, with specific attention to retrieving information understandable by the general public.
We found that machine learning techniques were more suitable for estimating health Web page understandability than the traditional readability formulae often used as guidelines and benchmarks by health information providers on the Web (the largest difference was observed for Pearson correlation: .602 using a gradient boosting regressor compared with .438 using the Simple Measure of Gobbledygook Index on the Conference and Labs of the Evaluation Forum eHealth 2015 collection).
The findings reported in this paper are important for specialized search services tailored to support the general public in seeking health advice on the Web, as they document and empirically validate state-of-the-art techniques and settings for this domain application.
Search engines are concerned with retrieving relevant information to support a user’s information-seeking task. Commonly, signals about the topicality or aboutness of a piece of information with respect to a query are used to estimate relevance, with other relevance dimensions such as understandability and trustworthiness [
A key problem when searching the Web for health information is that this can be too technical, unreliable, generally misleading, and can lead to unfounded escalations and poor decisions [
The use of general purpose Web search engines such as Google, Bing, and Baidu for seeking health advice has been largely analyzed, questioned, and criticized [
Ad hoc solutions to support the general public in searching and accessing health information on the Web have been implemented, typically supported by government initiatives or medical practitioner associations, for example, HealthOnNet.org (HON [
As an illustrative example, we analyzed the top 10 search results retrieved by HON on October 01, 2017 in answer to 300 health search queries generated by regular health consumers in health forums. These queries are part of the Conference and Labs of the Evaluation Forum (CLEF) 2016 electronic health (eHealth) collection [
In this paper, we aim to establish methods and best practices for developing search engines that retrieve relevant and understandable health information for the general public.
We propose and investigate methods for the estimation of the understandability of health information in Web pages: a large number of medically focused features are grouped in categories and their contribution to the understandability estimation task is carefully measured.
We further study the influence of HTML processing methods on these estimations and their pitfalls, extending our previous work that has shown how this often-ignored aspect greatly impacts effectiveness [
We further investigate how understandability estimations can be integrated into retrieval methods to enhance the quality of the retrieved health information, with particular attention to its understandability by the general public. New models are explored in this paper, also extending our previous work [
This paper makes concrete contributions to practice, as it informs health search engines specifically tailored to the general public (eg, the HON or HealthDirect services referred to above) about the best methods they should adopt. These are novel and significant contributions as no previous work has systematically analyzed the influence of the components in this study—we show that these greatly influence retrieval effectiveness and, thus, delivery of relevant and understandable health advice.
Understandability refers to the ease of comprehension of the information presented to a user. In other words, health information is understandable “when consumers of diverse backgrounds and varying levels of health literacy can process and explain key messages” [
Cumulative distribution of Dale-Chall Index (DCI) of search results. DCI measures the years of schooling required to understand a document. The dashed line is the 8th grade level which is the reading level of an average US resident. The distribution for HealthOnNet (HON) is similar to that of the baseline used in this paper (Best Match 25 [BM25]). Our best method (eXtreme Gradient Boosting [XGB]) reranks documents to provide more understandable results; its distribution is similar to that of an oracle system.
There is a large body of literature that has examined the understandability of Web health content when the information seeker is a member of the general public. For example, Becker reported that the majority of health websites are not well designed for the elderly [
Previous linguistics and information retrieval research has attempted to devise computational methods for the automatic estimation of text readability and understandability, and for the inclusion of these within search methods or their evaluation. Computational approaches to understandability estimations include (1)
Measures such as Coleman-Liau Index (CLI) [
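For reference, the standard published definitions of 2 of these formulae, SMOG and DCI (which feature prominently in our results), are:

\[ \mathrm{SMOG} = 1.0430\sqrt{\#\mathrm{polysyllables}\times\frac{30}{\#\mathrm{sentences}}} + 3.1291 \]

\[ \mathrm{DCI} = 0.1579\times\left(100\times\frac{\#\mathrm{difficult\ words}}{\#\mathrm{words}}\right) + 0.0496\times\frac{\#\mathrm{words}}{\#\mathrm{sentences}} \]

where a constant of 3.6365 is added to DCI when difficult words exceed 5% of all words; both formulae estimate the years of schooling needed to understand a text.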
The use of machine learning to estimate understandability forms an alternative approach. Earlier research explored the use of statistical natural language processing and language modeling [
The actual use of CHV or other terminologies such as the Medical Subject Headings (MeSH) belongs to the third category of approaches. The CHV is a prominent medical vocabulary dedicated to mapping layperson vocabulary to technical terms [
In this study, we investigated approaches to estimate understandability from each of these categories, measuring the influence of HTML preprocessing on automatic understandability methods and establishing best practices.
Previous work has attempted to use understandability estimations to improve search results in consumer health search, as well as to devise evaluation methods for retrieval systems that account for understandability along with topical relevance. Palotti et al have used learning to rank with standard retrieval features along with features based on RF and medical lexical aspects to determine understandability [
In this paper, we investigated methods to estimate Web page understandability, including the effect that HTML preprocessing pipelines and heuristics have, and their search effectiveness when employed within retrieval methods. To obtain both topical relevance and understandability assessments, we used the data from the CLEF 2015 and 2016 eHealth collections. The CLEF eHealth initiative is a research community–shared task aimed at creating resources for evaluating health search engines aimed at the general public [
The CLEF 2015 collection contains 50 queries and 1437 documents that have been assessed as relevant by clinical experts and have an assessment for understandability [
To support the investigation of methods to automatically estimate the understandability of Web pages, we further considered correlations between multiple human assessors (interassessor agreement). For CLEF 2015, we used the publicly available additional assessments made by unpaid medical students and health consumers collected by Palotti et al [
Several methods have been used to estimate the understandability of health Web pages, with the most popular methods (at least in the biomedical literature) being RF based on surface level characteristics of the text. Next, we outline the categories of methods to estimate understandability used in this study; an overview is shown in
These include the most popular RF [
These are formed by the
These include methods that count the number of words with a medical prefix or suffix, that is, beginning or ending with Latin or Greek particles (eg, amni-, angi-, algia-, and arteri-), and text strings included in lists of acronyms or in medical vocabularies such as the International Statistical Classification of Diseases and Related Health Problems (ICD), Drugbank and the OpenMedSpel dictionary [
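To illustrate, a minimal sketch of the affix-based counts is given below; the particle lists are shortened examples of ours, not the full lists used in the study.

```python
# Sketch of the affix-based counts; the particle lists below are shortened
# illustrative examples, not the full lists used in the study.
MEDICAL_PREFIXES = ("amni", "angi", "arteri", "cardi", "gastr", "hepat")
MEDICAL_SUFFIXES = ("algia", "ectomy", "itis", "osis", "pathy")

def count_medical_affixes(tokens):
    # str.startswith/endswith accept a tuple of candidate particles.
    n_prefix = sum(w.lower().startswith(MEDICAL_PREFIXES) for w in tokens)
    n_suffix = sum(w.lower().endswith(MEDICAL_SUFFIXES) for w in tokens)
    return n_prefix, n_suffix
```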
The popular MetaMap [
Similar to the CHV features, we used MetaMap to convert the content of Web pages into MeSH entities, studying symptom and disease concepts separately. A complete list of methods is provided in
These included commonly used natural language heuristics such as the ratio of POS classes, the height of the POS parser tree, the number of entities in the text, the sentiment polarity [
These include the identification of a large number of HTML tags, which were extracted with the Python library BeautifulSoup [
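For illustration, a minimal sketch of these tag counts, assuming BeautifulSoup; the tag list here abbreviates the full HTML features table below.

```python
# Sketch of the HTML feature extraction: count occurrences of each tag of
# interest in the raw page (tag list abbreviates the HTML features table).
from bs4 import BeautifulSoup

HTML_TAGS = ["abbr", "a", "blockquote", "b", "cite", "div", "form",
             "h1", "h2", "h3", "h4", "h5", "h6", "img", "input",
             "link", "dl", "ul", "ol", "q", "script", "span", "table", "p"]

def html_features(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    return {f"n_{tag}": len(soup.find_all(tag)) for tag in HTML_TAGS}
```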
Readability formulae (RF)
Automated Readability Index [
Coleman-Liau Index (CLI) [
Dale-Chall Index (DCI) [
Flesch-Kincaid Grade Level [
Flesch Reading Ease (FRE) [
Gunning Fog Index (GFI) [
Lasbarhetsindex (LIX) [
Simple Measure of Gobbledygook (SMOG) [
Components of readability formulae (CRF)
# of Characters
# of Words
# of Sentences
# of Difficult Words (Dale-Chall list [
# of Words Longer than 4 Characters
# of Words Longer than 6 Characters
# of Words Longer than 10 Characters
# of Words Longer than 13 Characters
# of Syllables
# of Polysyllable Words (>3 Syllables)
General medical vocabularies (GMVs)
# of words with medical prefix
# of words with medical suffix
# of acronyms
# of International Statistical Classification of Diseases and Related Health Problems (ICD) concepts
# of Drugbank
# of words in medical dictionary (OpenMedSpel)
Consumer medical vocabularies (CMV)
Consumer health vocabulary (CHV) mean score for all concepts
# of CHV concepts
CHV mean score for symptom concepts
# of CHV symptom concepts
CHV mean score for disease concepts
# of CHV disease concepts
Expert medical vocabulary (EMV)
# of Medical Subject Headings (MeSH) concepts
Average tree depth of MeSH concepts
# of MeSH symptom concepts
Average tree depth of MeSH symptom concepts
# of MeSH disease concepts
Average tree depth of MeSH disease concepts
Natural language features (NLF)
Positive words
Negative words
Neutral words
# of verbs
# of nouns
# of pronouns
# of adjectives
# of adverbs
# of adpositions
# of conjunctions
# of determiners
# of cardinal numbers
# of particles or other function words
# of other part of speech (POS; foreign words and typos)
# of punctuation
# of entities
Height of POS parser tree
# of stop words
# of words not found in Aspell English dictionary
Generally speaking, common and known words are usually frequent words, whereas unknown and obscure words are generally rare. This idea is implemented in RF such as the DCI, which uses a list of common words and counts the number of words that fall outside this list (complex words) [
Medical Reddit: Reddit [
Medical English Wikipedia: after obtaining a recent Wikipedia dump [
PubMed Central: PubMed Central is a Web-based database of biomedical literature. We used the collection distributed for the Text Retrieval Conference (TREC) 2014 and 2015 Clinical Decision Support Track [
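To make the word frequency idea concrete, a minimal sketch of such features (matching the rank percentiles listed in the word frequency features table) is given below, assuming whitespace-tokenized lowercase text; function names are ours.

```python
# Sketch of the word frequency features: words are ranked by frequency in a
# background corpus, and a document is summarized by percentiles of its words'
# ranks (rarer words have larger ranks).
from collections import Counter
import numpy as np

def rank_table(corpus_tokens):
    """Map each word to its frequency rank in a background corpus (1 = most frequent)."""
    counts = Counter(corpus_tokens)
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def word_frequency_features(doc_tokens, ranks):
    # Variant that skips out-of-vocabulary (OV) words; the "includes OV"
    # features would instead assign a maximal rank to unseen words.
    known = [ranks[w] for w in doc_tokens if w in ranks]
    return {
        "p25": np.percentile(known, 25),
        "p50": np.percentile(known, 50),
        "p75": np.percentile(known, 75),
        "mean_rank": float(np.mean(known)),
    }
```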
HTML features (HF)
# of abbreviations (abbr tags)
# of links (A tags)
# of blockquote tags
# of bold tags
# of cite tags
# of divisions or sections (div tags)
# of form tags
# of heading H1 tags
# of heading H2 tags
# of heading H3 tags
# of heading H4 tags
# of heading H5 tags
# of heading H6 tags
Total # of headings (any heading H above)
# of image tags
# of input tags
# of link tags
# of description lists (DL tags)
# of unordered lists (UL tags)
# of ordered lists (OL tags)
Total # of any list (DL+UL+OL)
# of short quotations (Q tags)
# of script tags
# of span tags
# of table tags
# of paragraphs (P tags)
A summary of the statistics of the corpora is reported in
These include machine learning methods for estimating Web page understandability. Although Collins-Thompson highlighted the promise of estimating understandability using machine learning methods, a challenge is identifying the background corpus to be used for training [
Medical Reddit (label 1): Documents in this corpus are expected to be written in a colloquial style, and thus the easiest to understand. All the conversations are, in fact, explicitly directed to assist inexpert health consumers.
Medical English Wikipedia (label 2): Documents in this corpus are expected to be less formal than scientific papers, but more formal than a Web forum like Reddit, thus somewhat more difficult to understand.
PubMed Central (label 3): Documents in this corpus are expected to be written in a highly formal style, as the target audience are physicians and biomedical researchers.
Statistics for the corpora used as background models for understandability estimations.
Statistics | Medical Wikipedia | Medical Reddit | PubMed Central |
Documents, n | 11,868 | 43,019 | 733,191 |
Words, n | 10,655,572 | 11,978,447 | 144,024,976 |
Unique words, n | 467,650 | 317,106 | 2,933,167 |
Average words per document, mean (SD) | 898.90 (1351.76) | 278.45 (359.70) | 227.22 (270.44) |
Average characters per document, mean (SD) | 5107.81 (7618.57) | 1258.44 (1659.96) | 1309.11 (1447.31) |
Average characters per word, mean (SD) | 5.68 (3.75) | 4.52 (3.52) | 5.76 (3.51) |
Word frequency features (WFF)
25th percentile English Wikipedia
50th percentile English Wikipedia
75th percentile English Wikipedia
Mean rank English Wikipedia
Mean rank English Wikipedia—includes out-of-vocabulary (OV) words
25th percentile Medical Reddit
50th percentile Medical Reddit
75th percentile Medical Reddit
Mean rank Medical Reddit
Mean rank Medical Reddit—includes OV
25th percentile PubMed
50th percentile PubMed
75th percentile PubMed
Mean rank PubMed
Mean rank PubMed—includes OV
25th percentile Wikipedia+Reddit+PubMed
50th percentile Wikipedia+Reddit+PubMed
75th percentile Wikipedia+Reddit+PubMed
Mean rank Wikipedia+Reddit+PubMed
Mean rank Wikipedia+Reddit+PubMed—includes OV
Machine learning regressors (MLR)
Linear regressor
Multilayer perceptron regressor
Random forest regressor
Support vector machine regressor
eXtreme Gradient Boosting Regressor
Machine learning classifiers (MLC)
Logistic regression
Multilayer perceptron classifier
Random forest classifier
Support vector machine classifier
Multinomial naive Bayes
eXtreme Gradient Boosting Classifier
On the basis of the labels of each class above, models were learnt using all documents from these corpora, with features extracted using latent semantic analysis with 10 dimensions on top of TF-IDF weights calculated for each word. We modeled both a classification task and a regression task using these corpora. In the classification task, the first step is to train a classifier on documents belonging to these 3 collections, using the 3 classes shown above as labels. The second step is to use the classifier to estimate which of these 3 classes an unseen document from CLEF 2015 or CLEF 2016 would belong to. Similarly, in the regression task, after training, a regressor estimates an understandability value for an unseen CLEF document. We hypothesized that documents that are more difficult to read are more similar to PubMed documents than to Wikipedia or Reddit ones. A complete list of methods is provided in
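A minimal sketch of this pipeline, assuming scikit-learn and xgboost; the names docs, labels, and clef_docs are placeholders, not variables from the original code.

```python
# Minimal sketch of the understandability regression pipeline described above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from xgboost import XGBRegressor

# docs: texts from Medical Reddit, Medical English Wikipedia, and PubMed Central
# labels: 1 (Reddit), 2 (Wikipedia), 3 (PubMed), as defined above
model = make_pipeline(
    TfidfVectorizer(),              # TF-IDF weight for each word
    TruncatedSVD(n_components=10),  # latent semantic analysis, 10 dimensions
    XGBRegressor(),                 # gradient boosting regressor
)
model.fit(docs, labels)

# Regression: an unseen CLEF document receives a continuous estimate; values
# closer to 3 suggest PubMed-like, harder-to-understand text.
estimates = model.predict(clef_docs)
```

For the classification task, the final estimator is swapped for a classifier (eg, multinomial naive Bayes) that predicts one of the 3 class labels.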
As part of our study, we investigated the influence that the preprocessing of Web pages had on the estimation of understandability computed using the methods described above. We did so by comparing the combination of a number of preprocessing pipelines, heuristics, and understandability estimation methods with human assessments of Web page understandability. Our experiments extended our previous work [
To extract the content of a Web page from the HTML source, we tested 3 pipelines: BeautifulSoup (the Naive pipeline), Boilerpipe (Boi), and Justext (Jst).
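As an illustration, a minimal sketch of the Naive extraction step and of the Force Period heuristic, assuming BeautifulSoup; the function name and details are ours, not the exact code used in the study.

```python
# Minimal sketch of the Naive extraction step, assuming BeautifulSoup.
from bs4 import BeautifulSoup

def extract_text(html: str, force_period: bool = True) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):  # drop non-visible content
        tag.decompose()
    blocks = soup.get_text(separator="\n", strip=True).split("\n")
    if force_period:
        # Force Period (FP) heuristic: ensure each text block ends with a
        # period so that sentence tokenizers do not merge unrelated blocks
        # (our reading of the heuristic; DNFP skips this step).
        blocks = [b if b.endswith(".") else b + "." for b in blocks]
    return " ".join(blocks)
```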
We then investigated how understandability estimations can be integrated into retrieval methods to increase the quality of search results. Specifically, we considered 3 retrieval methods of differing quality for the initial retrieval. These included the best 2 runs submitted to each CLEF task, and a plain BM25 baseline (default Terrier parameters: b=0.75 and k1=1.2). BM25 is a probabilistic term weighting scheme commonly used in information retrieval and is defined with respect to the frequency of a term in a document, the collection frequency of that term, and the ratio between the length of the document and the average document length. As understandability estimators, we used the XGB regressor [
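For reference, a standard formulation of the BM25 score of a document D for a query q (Terrier's implementation may differ in minor details) is:

\[ \mathrm{BM25}(D,q) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{f(t,D)\,(k_1 + 1)}{f(t,D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)} \]

where f(t,D) is the frequency of term t in document D, |D| is the document length, avgdl is the average document length in the collection, and IDF(t) is the inverse document frequency of t; k1=1.2 and b=0.75 are the Terrier defaults used here.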
To integrate understandability estimators into the retrieval process, we first investigated
Learning to rank settings.
Strategy | Explanation | Labeling function (CLEFa 2015) | Labeling function (CLEF 2016) |
LTRb 1 | Model built with topical relevance labels only. Uses only IRc features | Fd(Re,Uf)=R | F(R,U)=R |
LTR 2 | Model built with topical relevance labels only. Uses IR and understandability features | F(R,U)=R | F(R,U)=R |
LTR 3 | Model combines understandability and topicality labels. Uses IR and understandability features | F(R,U)=R×U/3 | F(R,U)=R×(100-U)/100 |
LTR 4 | Model built considering only easy-to-read documents. Uses IR and understandability features | F(R,U)=R, if U≥2 | F(R,U)=R, if U≤40 |
LTR 5 | Model built boosting easy-to-read documents. Uses IR and understandability features | F(R,U)=2×R, if U≥2 | F(R,U)=2×R, if U≤40 |
aCLEF: Conference and Labs of the Evaluation Forum.
bLTR: learning to rank.
cIR: information retrieval.
dF: function.
eR: topical relevance of a document.
fU: understandability.
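To illustrate the first, 2-step strategy, a minimal sketch of reranking an initial run by estimated understandability is given below; it assumes an estimator with a scikit-learn-style predict (eg, the TF-IDF/LSA pipeline sketched earlier), and the cut-off k is the reranking depth studied in the Results.

```python
# Minimal sketch of the 2-step reranking strategy: retrieve by topical
# relevance, then reorder the top-k results by estimated understandability.
def rerank_by_understandability(run, estimator, k):
    """run: list of (doc_id, text) pairs ranked by topical relevance, best first."""
    head, tail = run[:k], run[k:]
    # Lower estimates = easier to understand (label 1 = Reddit-like text).
    head = sorted(head, key=lambda doc: estimator.predict([doc[1]])[0])
    return head + tail  # documents beyond the cut-off keep their original rank
```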
As an alternative to the previous 2-step ranking strategy for combining topical relevance and understandability, we explored the
Finally, we considered a third alternative to combine relevance and understandability: using
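The specific fusion method is referenced above; for illustration only, one widely used option is reciprocal rank fusion (RRF), which merges a relevance-ordered list and an understandability-ordered list as sketched below (this is an assumed example, not necessarily the exact method used).

```python
# Reciprocal rank fusion (RRF), shown for illustration; k=60 is the
# constant commonly used in the RRF literature.
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: ranked lists of document ids, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a relevance-ordered list with an understandability-ordered list:
fused = reciprocal_rank_fusion([["d1", "d2", "d3"], ["d3", "d1", "d2"]])
```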
In the experiments, we used Pearson, Kendall, and Spearman correlations to compare the understandability assessments of human assessors with estimations obtained by the considered automated approaches, under all combinations of pipelines and heuristics. Pearson correlation is used to calculate the strength of the linear relationship between 2 variables, whereas Kendall and Spearman measure the rank correlations among the variables. We opted to report all 3 correlation coefficients to allow for a thorough comparison with other work, as they are equally used in the literature.
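A minimal sketch of this analysis, assuming SciPy, where human holds the human assessments and pred the automatic estimations (placeholder values for illustration):

```python
# Correlation between human assessments and automatic estimations.
from scipy.stats import pearsonr, spearmanr, kendalltau

human = [3, 1, 2, 2, 1]           # eg, assessor labels (placeholders)
pred = [2.7, 1.2, 2.1, 1.8, 1.4]  # eg, regressor estimates (placeholders)

pearson, _ = pearsonr(human, pred)    # strength of linear relationship
spearman, _ = spearmanr(human, pred)  # rank correlation
kendall, _ = kendalltau(human, pred)  # rank correlation (concordant pairs)
```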
For the retrieval experiments, we used evaluation measures that rely on both (topical) relevance and understandability. The uRBP measure [
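In its standard form, uRBP extends rank biased precision (RBP) by weighting the relevance gain at each rank by an understandability gain:

\[ \mathrm{uRBP@}n = (1 - \rho) \sum_{k=1}^{n} \rho^{\,k-1}\, r(d_k)\, u(d_k) \]

where ρ is the RBP persistence parameter, d_k is the document at rank k, and r(d_k) and u(d_k) are its relevance and understandability gains, respectively.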
A drawback of uRBP is that relevance and understandability are combined into a unique evaluation score, thus making it difficult to interpret whether improvements are because of more understandable or more topical documents being retrieved. To overcome this, we used the multidimensional metric (MM) framework introduced by Palotti et al [
For all measures, we set n=10 because shallow pools were used in CLEF along with measures that focused on the top 10 search results (including RBPr@10). Shallow pools refer to the selection of a limited number of documents to be assessed for relevance, among the documents retrieved at the top ranks by a search engine.
Along with these measures of search effectiveness, we also recorded the number of unassessed documents, the RBP residuals, RBP*r@10, RBP*u@10, and MM*RBP, that is, the corresponding measures calculated by ignoring unassessed documents. These latter measures implement the condensed measures approach proposed by Sakai as a way to deal with unassessed documents [
To keep this paper succinct, in the following we only report a subset of the results. The remaining results (which show similar trends to those reported here) are made available in the
Using the CLEF eHealth 2015 and 2016 collections, we studied the correlations of methods to estimate Web page understandability compared with human assessments. For each category of understandability estimation,
Overall, Spearman and Kendall correlations obtained similar results (in terms of which methods exhibited the highest correlations): this was expected as, unlike Pearson, they are both rank-based correlations.
For traditional RF, SMOG had the highest correlations for CLEF 2015 and DCI for CLEF 2016, regardless of correlation measure. These results resonate with those obtained for the category of raw components of readability formulae (CRF). In fact, the polysyllable words measure, which is the main feature used in SMOG, had the highest correlation for CLEF 2015 among methods in this category. Similarly, the number of difficult words, which is the main feature used in DCI, had the highest correlation for CLEF 2016 among methods in this category.
When examining the expert vocabulary category (EMV), we found that the number of MeSH concepts obtained the highest correlations with human assessments; however, its correlations were substantially lower than those achieved by the best method from the consumer medical vocabularies category (CMV), that is, the scores of CHV concepts. For the natural language category (NLF), we found that the number of pronouns, the number of stop words, and the number of OV words had the highest correlations, and these were even higher than those obtained with MeSH- and CHV-based methods. In turn, the methods that obtained the highest correlations in the HTML category (HF), the counts of P tags and of list tags, exhibited overall the lowest correlations compared with methods in the other categories. P tags are used to create paragraphs in a Web page and are thus a rough proxy for text length. Among methods in the word frequency category (WFF), the use of Medical Reddit (but also of PubMed) showed the highest correlations, and these were comparable with those obtained by the RF.
Finally, regressors (MLR) and classifiers (MLC) exhibited the highest correlations among all methods: in this category, the XGB regressor and the multinomial Naive Bayes best correlated with human assessments.
Methods with the highest correlation per category for Conference and Labs of the Evaluation Forum (CLEF) 2015.
Category | Method | Preprocessing | Pearson | Spearman | Kendall |
Readability formulae | Simple Measure of Gobbledygook Index | Jst Do Not Force Period (DNFP) | |||
Components of readability formulae (CRF) | Average number of polysyllable words per sentence | Jst Force Period (FP) | .364 | .268 |
CRF | Average number of polysyllable words per sentence | Jst DNFP | .192 |
General medical vocabularies (GMVs) | Average number of medical prefixes per word | Naïve FP | .312 | .229 | |
GMVs | Number of medical prefixes | Naïve FP | .131 | ||
Consumer medical vocabulary (CMV) | Consumer health vocabulary (CHV) mean score for all concepts | Naïve FP | |||
Expert medical vocabulary (EMV) | Number of medical concepts | Naïve FP | |||
Natural language features (NLF) | Number of words not found in Aspell dictionary | Jst DNFP | .276 | .203 | |
NLF | Number of pronouns per word | Naïve FP | .271 | ||
HTML features (HF) | Number of P tags | None | |||
Word frequency features (WFF) | Mean rank Medical Reddit | Jst DNFP | .277 | .197 | |
WFF | 25th percentile Pubmed | Jst DNFP | .330 | ||
Machine learning regressors (MLR) | eXtreme Gradient Boosting (XGB) Regressor | Boi DNFP | .394 | .287 | |
MLR | XGB Regressor | Jst FP | .565 | ||
Machine learning classifiers | Multinomial Naïve Bayes | Naïve FP |
aItalics used to highlight the best result of each group.
Methods with the highest correlation per category for Conference and Labs of the Evaluation Forum (CLEF) 2016.
Category | Method | Preprocessing | Pearson | Spearman | Kendall |
Readability formulae (RF) | Dale-Chall Index (DCI) | Jst Force Period (FP) | .381 |
RF | DCI | Boi FP | .437 | ||
Components of readability formulae (CRF) | Average number of difficult words per word | Boi FP |||
General medical vocabularies (GMVs) | Average prefixes per sentence | Jst FP | .242 | .164 | |
GMVs | International Statistical Classification of Diseases and Related Health Problems concepts per sentence | Jst Do Not Force Period (DNFP) | .014 |
Consumer medical vocabulary (CMV) | Consumer health vocabulary (CHV) mean score for all concepts | Jst FP | .313 | .216 | |
CMV | CHV mean score for all concepts | Boi FP | |||
Expert medical vocabulary (EMV) | Number of Medical Subject Headings (MeSH) concepts | Boi DNFP | .166 | .113 |
EMV | Number of MeSH disease concepts | Boi DNFP | .179 |
Natural language features (NLF) | Average stop words per word | Boi FP | .312 | .213 |
NLF | Number of pronouns | Boi FP | .341 | ||
HTML features (HF) | Number of lists | None | .021 | .015 | |
HF | Number of P tags | None | .110 | ||
Word frequency features (WFF) | Mean rank Medical Reddit | Boi DNFP | .312 | .214 | |
WFF | 50th percentile Medical Reddit | Jst DNFP | .351 | ||
Machine learning regressors (MLR) | eXtreme Gradient Boosting (XGB) Regressor | Jst DNFP | .258 | ||
MLR | Random Forest Regressor | Boi DNFP | .389 | .355 | |
Machine learning classifiers | Multinomial Naïve Bayes | Jst FP |
aItalics used to highlight the best result of each group.
Results from experiments with different preprocessing pipelines and heuristics are shown in
We first examined the correlations between human assessments and RF. We found that the
Overall, among RF, the best results (highest correlations) were obtained by SMOG and DCI (see also
When considering methods beyond those based on RF, we found that the highest correlations were achieved by the regressors (MLR) and classifiers (MLC), independently of the preprocessing method used. There was little difference in the effectiveness of methods in these categories, with the exception of regressors on CLEF 2015, which exhibited non-negligible variance: the Pearson correlation was .44 for the neural network regressor, whereas it was only .30 for the support vector regressor.
Correlations between understandability estimators and human assessments for Conference and Labs of the Evaluation Forum 2015. For example, the first boxplot on the top represents the distribution of Spearman correlations with human assessments across all features in the category readability formulae, obtained with the Naive Force Period preprocessing. Each box extends from the lower to the upper quartile values, with the red marker representing the median value for that category. Whiskers show the range of the data in each category and circles represent values considered outliers for the category (eg, Spearman correlation for Simple Measure of Gobbledygook (SMOG) index was .296 and for Automated Readability Index (ARI) was .194: these were outliers for that category). CMV: consumer medical vocabulary; CRF: components of readability formulae; DNFP: Do Not Force Period; EMV: expert medical vocabulary; FP: Force Period; GMV: general medical vocabulary; MLC: machine learning classifiers; MLR: machine learning regressors; NLF: natural language features; RF: readability formulae; WFF: word frequency features.
Correlations between understandability estimators and human assessments for Conference and Labs of the Evaluation Forum (CLEF) 2016. CMV: consumer medical vocabulary; CRF: components of readability formulae; DNFP: Do Not Force Period; EMV: expert medical vocabulary; FP: Force Period; GMV: general medical vocabulary; MLC: machine learning classifiers; MLR: machine learning regressors; NLF: natural language features; RF: readability formulae; WFF: word frequency features.
A common trend when comparing preprocessing pipelines is that the Naïve pipeline provided the weakest correlations with human assessments for CLEF 2016, regardless of estimation methods and heuristics. This result, however, was not confirmed for CLEF 2015, where the Naive preprocessing negatively influenced correlations for the RF category, but not for other categories, although it was generally associated with larger variances for the correlation coefficients.
Results for the considered retrieval methods are reported in
The effectiveness of the top 2 submissions to CLEF 2016 and of the BM25 baseline is reported in
Baseline results for the best 2 submissions to Conference and Labs of the Evaluation Forum (CLEF) 2016 (Georgetown University Information Retrieval [GUIR] and East China Normal University [ECNU]) and the Best Match 25 (BM25) baseline of Terrier. MM: multidimensional metric; RBP: rank biased precision.
Reranking of the runs based on the Dale-Chall readability formula. ECNU: East China Normal University; GUIR: Georgetown University Information Retrieval; MM: multidimensional metric; RBP: rank biased precision.
Reranking of the runs based on the eXtreme Gradient Boosting (XGB) regressor to estimate understandability. ECNU: East China Normal University; GUIR: Georgetown University Information Retrieval; MM: multidimensional metric; RBP: rank biased precision.
Reranking combining topical relevance (original run) and understandability (eXtreme Gradient Boosting [XGB]) through rank fusion. ECNU: East China Normal University; GUIR: Georgetown University Information Retrieval; MM: multidimensional metric; RBP: rank biased precision.
Results of the learning to rank (LTR) method on the Best Match 25 (BM25) baseline. The BM25 baseline (light blue) is shown for direct comparison. MM: multidimensional metric; RBP: rank biased precision.
In the experiments, we also studied the influence of the number of documents considered for reranking (cut-off). The top/middle/bottom plots of
Note that with the increase of the number of documents considered for reranking, there is an increase in the number of unassessed documents being considered by the evaluation measures. Nevertheless, we note that if unassessed documents are excluded from the evaluation, similar trends are observed, for example, compare findings with those for the condensed measures uRBP*, RBP*r, RBP*u, and MM*RBP.
Overall, statistically significant improvements over the baselines were observed for most configurations and measures.
Next, we report the results of automatically combining topical relevance and understandability through rank fusion in
As with reranking, for the rank fusion approaches we found that, in general, higher cut-offs were associated with higher effectiveness in terms of understandability measures, but with higher losses in terms of relevance-oriented measures. Overall, results obtained with rank fusion were superior to those obtained with reranking only, although most differences were not statistically significant. Statistically significant improvements over the baselines were instead observed for most configurations and measures.
Finally, we analyzed the results obtained by the learning to rank methods in
When considering RBPr and uRBP, learning to rank exhibited effectiveness that was significantly inferior to that of the GUIR and ECNU baseline runs, although higher than that of the BM25 baseline (for some configurations). The examination of the number of unassessed documents (and the RBP residuals, see
Unassessed documents thus need to be carefully accounted for, by considering the residuals of the RBP measures as well as the condensed measures. When this was done, we observed that learning to rank methods overall provided substantial gains over the original runs and other methods (when considering RBP*r, RBP*u, and MM*RBP), or large potential gains over these methods (when considering the residuals). Next, we analyzed these results in more detail.
No improvements over the baselines were found for LTR 1, and the high residuals for RBPr were not matched by other residuals or by considering only assessed documents (see
Compared with LTR 1, LTR 2 included the understandability features listed in
LTRs 4 and 5 were devised based on a fixed understandability threshold, U=40. Although LTR 4 took into consideration only documents that were easy to read (understandability label≤U), LTR 5 considered all documents but boosted the relevance score of easy-to-read ones. LTR 4 reached the highest understandability score among the learning to rank approaches (RBP*u=50.06), but it failed to retrieve a substantial number of relevant documents (RBP*r=2.20). In turn, LTR 5 reached the best understandability-relevance trade-off (MM*RBP=29.20). Compared with the BM25 baseline (on which it was based), LTR 5 largely increased both relevance (RBP*r from 26.01 to 32.60—a 25% increase,
The empirical experiments suggested the following:
Machine learning methods based on regression are best suited to estimate the understandability of health Web pages
Preprocessing does affect effectiveness (both for understandability prediction and document retrieval), although compared with other methods, ML-based methods for understandability estimation are less subjected to variability caused by poor preprocessing
Learning to rank methods can be specifically trained to promote more understandable search results, while still providing an effective trade-off with topical relevance.
In this study, we relied on data collected through the CLEF 2015 and CLEF 2016 evaluation efforts to evaluate the effectiveness of methods that estimate the understandability of Web pages. These assessments were obtained by asking medical experts and practitioners to rate documents; although they were asked to estimate the understandability of the content as if they were the patients they treat, there might have been noise and imprecision in the collection mechanism because of the subjectivity of the task.
Relevance assessments on the CLEF 2015 and 2016 collections are incomplete [
We have examined approaches to estimate the understandability of health Web pages, including the impact of HTML preprocessing techniques and how to integrate these within retrieval methods to provide more understandable search results for people seeking health information. We found that machine learning methods are better suited than traditionally employed readability measures for assessing the understandability of health-related Web pages and that learning to rank is the most effective strategy to integrate this into retrieval. We also found that HTML and text preprocessing do affect the effectiveness of both understandability estimations and of the retrieval process, although machine learning methods are less sensitive to this issue.
This paper contributes to improving search engines tailored to consumer health search by thoroughly investigating the promises and pitfalls of understandability estimations and their integration into retrieval methods. The paper further highlights which methods and settings should be used to provide better search results to health information seekers. As shown in
The methods investigated here do not provide fully personalized search that accounts for how much of the health content consumers with different levels of health knowledge are able to understand. Instead, we focused on making the results understandable by anyone, and thus promoted in the search results the content with the highest level of understandability. However, people with above-average medical knowledge might benefit more from specialized content. We leave this personalization aspect, that is, tailoring the understandability level of the promoted content to the user's knowledge and abilities, to future work.
The impact of feature sets on the Spearman correlation between the predicted understandability and the ground truth assessed by human assessors in Conference and Labs of the Evaluation Forum (CLEF) eHealth 2015.
Distribution of Understandability Scores for Conference and Labs of the Evaluation Forum (CLEF) 2016.
Correlations between understandability estimators and human assessments for Conference and Labs of the Evaluation Forum (CLEF) 2015 and CLEF 2016.
Correlation results of different readability formulae with human assessments from Conference and Labs of the Evaluation Forum (CLEF) eHealth 2015 and 2016.
Results obtained by integrating understandability estimations within retrieval methods on Conference and Labs of the Evaluation Forum (CLEF) 2015 and CLEF 2016.
Best Match 25
consumer health vocabulary
Conference and Labs of the Evaluation Forum
Coleman-Liau Index
consumer medical vocabulary
components of readability formulae
Dale-Chall Index
East China Normal University
electronic health
expert medical vocabulary
Flesch Reading Ease
general medical vocabulary
Georgetown University Information Retrieval
HTML features
HealthOnNet
International Statistical Classification of Diseases and Related Health Problems
Korean Institute of Science and Technology Information
learning to rank
Medical Subject Headings
machine learning classifiers
machine learning regressors
multidimensional metric
natural language features
out-of-vocabulary
part of speech
rank biased precision
readability formulae
Simple Measure of Gobbledygook
word frequency features
eXtreme Gradient Boosting
The authors acknowledge the Technische Universität Wien (TU Wien) University Library for financial support through its Open Access Funding Programme. GZ is the recipient of an Australian Research Council Discovery Early Career Researcher Award (DECRA) Research Fellowship (DE180101579) and a Google Faculty Award. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 644753 (KConnect).
None declared.