Improving Web Searches: Case Study of Quit-Smoking Web Sites for Teenagers
Background: The Web has become an important and influential source of health information. With the vast number of Web sites on the Internet, users often resort to popular search sites when searching for information. However, little is known about the characteristics of Web sites returned by simple Web searches for information about smoking cessation for teenagers.
Objective: To determine the characteristics of Web sites retrieved by search engines about smoking cessation for teenagers and how information quality correlates with the search ranking.
Methods: The top 30 sites returned by 4 popular search sites in response to the search terms "teen quit smoking" were examined. The information relevance and quality characteristics of these sites were evaluated by 2 raters. Objective site characteristics were obtained using a page-analysis Web site.
Results: Only 14 of the 30 Web sites are of direct relevance to smoking cessation for teenagers. The readability of about two-thirds of the 14 sites is below an eighth-grade school level and they ranked significantly higher (Kendall rank correlation, tau = -0.39, P= .05) in search-site results than sites with readability above or equal to that grade level. Sites that ranked higher were significantly associated with the presence of e-mail address for contact (tau = -0.46, P= .01), annotated hyperlinks to external sites (tau = -0.39, P= .04), and the presence of meta description tag (tau = -0.48, P= .002). The median link density (number of external sites that have a link to that site) of the Web pages was 6 and the maximum was 735. A higher link density was significantly associated with a higher rank (tau = -0.58, P= .02).
Conclusions: Using simple search terms on popular search sites to look for information on smoking cessation for teenagers resulted in less than half of the sites being of direct relevance. To improve search efficiency, users could supplement results obtained from simple Web searches with human-maintained Web directories and learn to refine their searches with more advanced search syntax.
(J Med Internet Res 2003;5(4):e28)
The World Wide Web, with over 3 million public Web sites and over 1.4 billion Web pages, has become an important and influential source of health information [ ]. In September 2002, there were an estimated 605 million people online worldwide [ ]. In the United States, 90% (48 million) of the children and adolescents between the ages of 5 and 17 use computers, and 75% of the 14 to 17 year olds use the Internet [ ]. With the vast amount and dynamic nature of information on the World Wide Web, it is not surprising to find that over 75% of those online use search sites to navigate the Web [ ]. However, the number of results returned from a search is often overwhelming. For example, 115000 results were found with the search terms "teen quit smoking" in Google.
Of the several thousand search sites or directories, only a few are of high popularity as indicated by their audience reach and time spent on them [ ]. Although Google will provide up to a thousand results from a query, few users are likely to examine them all. In an observational study on 16 adult subjects, only 9 participants ever looked beyond the first search pages and only 5 of them ever clicked a link on those pages [ ]. A survey done in 2002 on 1403 e-mail participants showed that only 23% of the users went beyond the second page [ ]. Another pilot study of 12 teenagers found they looked past the fourth page of results less than 5% of the time [ ]. Thus, position ranking in Web-search results, especially on the first few pages, is an important determinant of information accessibility by users.
Several studies have reported substantial variability in health-related Web-site content [ - ]. While guidelines for evaluating the quality of health information on the Web are available [ - ], the correlation between these guidelines and accuracy of health information is debated [ - ]. Position ranking in search results was not associated with content quality [ ]. Using the search term "breast cancer," Meric et al [ ] reported that popularity of Web sites was associated with type rather than quality of content. In a sample of 75 Web sites that provided information on urinary incontinence, the Internet popularity indexes (as measured by the number of links to the main incontinence page of each Web site and by the number of links to all pages of each Web site divided by the number of pages of the site) were not correlated with a quality score based on Silberg et al [ ] and the HONcode principles [ ].
The aim of this study was: (a) to identify the characteristics of Web sites with information on smoking cessation for teenagers that ranked in the top 30 positions in a typical Web search on popular search sites and (b) to evaluate the association between those characteristics and the position ranking for sites that are of direct relevance to smoking cessation for teenagers. The findings are relevant for improving consumer access to health information.
This study was carried out from May 2003 through June 2003. Web sites with information on smoking cessation for teenagers were identified with 4 popular search sites using a specific search term. The characteristics of the identified sites were collected with a Web-site characteristic checklist; 2 raters evaluated each Web site independently (details below).
Four popular search sites () were used in this study. Users spend over 5 million search hours per month at each site. A search hour equals the number of visitors to a site multiplied by the average number of hours each visitor is estimated to have spent at the site.
The search term on smoking cessation for teenagers was selected based on information from the Overture Search Term Suggestion Tool and the 7search Keyword Suggestion Tool [ ]. These sites provide a count of the search terms that were submitted to their search engines. Overture provides their search results to various popular search sites including Yahoo, MSN, AltaVista, Lycos, HotBot, and AllTheWeb [ ]. For example, in April 2003 there were 40036 searches submitted to Overture with "quit smoking," 27812 with "stop smoking," and 9001 with "smoking cessation." Various other combinations of "teen," "youth," "adolescent," "quit smoking," "stop smoking," and "smoking cessation" were compared. Based on the frequency of searches performed on the Web as recorded by the Overture database, the search terms "teen quit smoking" were submitted to the 4 search sites to locate sites with information on smoking cessation for teenagers.
To mimic the search behavior of Web users, only the top 30 search results were included in the study. Sites ranking below the top 30 results are likely to be found only by more-persistent searchers. Thirty results are equivalent to 3 pages (2 clicks) of the default number of results per page in Google and AOL, 2 such pages (1 click) in MSN, and one and a half such pages (one click) in Yahoo. The results from the 4 search sites were combined into one list to provide an overall picture of the search activity on the Web. The sites were reranked by first grouping the sites into 4 groups by the number of search sites that included them (1 to 4 search sites) and then by the position ranking provided by the search results within each group. The top 30 reranked sites formed the sample for the analysis.
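The reranking procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say how positions from different search sites were combined within a group, so the sketch assumes the best (lowest) position is used, and the engine names and URLs are hypothetical.

```python
from collections import defaultdict


def merge_rankings(results_by_engine, top_n=30):
    """Combine per-engine ranked URL lists into one reranked list.

    Sites are ordered first by how many engines returned them (more is
    better), then by position within that group. Using the best position
    across engines is an assumption made for this sketch.
    """
    positions = defaultdict(list)
    for urls in results_by_engine.values():
        for position, url in enumerate(urls, start=1):
            positions[url].append(position)

    ordered = sorted(
        positions,
        # The URL itself is the final key so that ties sort deterministically.
        key=lambda url: (-len(positions[url]), min(positions[url]), url),
    )
    return ordered[:top_n]


# Hypothetical results from three engines (the study used four).
example = {
    "engine_a": ["site1.example", "site2.example", "site3.example"],
    "engine_b": ["site2.example", "site4.example", "site1.example"],
    "engine_c": ["site2.example", "site5.example"],
}
merged = merge_rankings(example)
```

Here `site2.example` comes first because all three engines returned it, followed by `site1.example` (two engines) and then the sites returned by a single engine in order of position.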
Since the rankings of Web sites within search-site results change frequently, the search results were captured in spreadsheet format using the Google API Search Tool. The Web pages of sites identified by search results were captured using Offline Explorer software [ ] to facilitate the recall of the exact page content when necessary and to provide consistency for the 2 raters.
Checklist of Web-Site Characteristics
A checklist was used to evaluate the characteristics of the Web sites (see for checklist items). The readability was estimated by the Flesch-Kincaid grade-level score [ ]. (The Flesch-Kincaid grade-level score rates text on a United States grade-school level. For example, a score of 8.0 means that an eighth grader can understand the document.) Sample passages from the Web pages with information pertaining to smoking cessation of the identified sites were pasted into Microsoft Word XP for Windows to obtain the score. The results were recorded in a spreadsheet and subsequently imported into SPSS [ ] for analysis. The number of broken links, page size, presence of meta tags, and presence of persistent cookies were obtained from WebXact Watchfire Page Analysis [ ]. (Meta tags are HTML [hypertext markup language] tags that provide information about the content of a Web page for indexing by search engines but do not affect how a Web page is displayed by a browser.) Link density was obtained by using a reverse-lookup query (link:siteURL, where siteURL is replaced by the Web site's URL) in Google. The link density of a site is the number of external sites that have a link to that site. A site with a higher link density is generally more likely to be found by visitors because they may find it through the external sites.
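For readers who want to approximate the readability score without Microsoft Word, the Flesch-Kincaid grade level can be computed from the published formula, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below is not the procedure used in this study: the syllable counter is a crude vowel-group heuristic introduced here for illustration, so its scores will differ somewhat from those reported by Word.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Treat a trailing silent "e" as non-syllabic (for example, "smoke"),
    # but keep the final syllable of words ending in "le" (for example, "table").
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text):
    """Return the Flesch-Kincaid grade level of a text passage."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(word) for word in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


simple = "The cat sat. The dog ran."
harder = "Nicotine replacement medication substantially alleviates withdrawal."
simple_grade = flesch_kincaid_grade(simple)
harder_grade = flesch_kincaid_grade(harder)
```

Short sentences made of one-syllable words score far below grade 8, while a single sentence of long, polysyllabic words scores well above it.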
Correlations between position ranking and the Web-site characteristics were calculated using the Kendall rank correlation. The value of the coefficient (tau) ranges from -1 to 1. A value of zero indicates no correlation, values near 1 indicate a strong direct correlation, and values near -1 indicate a strong inverse correlation. Interobserver reliability between the 2 raters was calculated using kappa statistics on all variables except readability, link density, and those returned by WebXact Watchfire Page Analysis. We regarded P ≤ .05 as statistically significant.
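As an illustration of the rank correlation described above, the following sketch computes Kendall's tau for a small data set. It assumes the tie-corrected tau-b variant, which is a guess on our part since the text does not name the variant, and the ranks and indicator values are invented for the example. Because rank 1 is the top position, a characteristic that is more common among top-ranked sites yields a negative tau.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences (handles ties)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied on both variables: ignored by tau-b
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denominator = math.sqrt((untied + tied_x_only) * (untied + tied_y_only))
    if denominator == 0:
        raise ValueError("tau is undefined when one variable is constant")
    return (concordant - discordant) / denominator


# Hypothetical data: search position (1 = top) and whether an e-mail
# contact address is present (1) or absent (0) on each site.
position = [1, 2, 3, 4, 5, 6]
has_email = [1, 1, 1, 0, 1, 0]
tau = kendall_tau_b(position, has_email)
```

In this invented example tau is negative, which matches the direction of the associations reported in this study: the characteristic appears mostly on sites with small position numbers, that is, higher-ranked sites.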
Of the top 30 sites identified by the 4 search sites using the search terms "teen quit smoking," only 14 were relevant to teenagers who are seeking information on smoking cessation. We also evaluated the search results from Google by using other similar search terms. The number of relevant sites ranged from 5 to 17 (). Although we used only 1 search site to illustrate the effect of search terms on the type of Web sites found, the result should be similar at other search sites.
Characteristics of the 14 Relevant Web Sites
The characteristics of the 14 sites are summarized in 3 categories ().
The essential-characteristic category contains those characteristics that contribute to user dissatisfaction if absent or inadequately provided. The presence of a privacy statement and disclaimer, although it appears not to be required for the functioning of a Web site, was reported to be essential in a Web-user interface study.
The correlation between the 2 raters ranged from 1.00 for 2 characteristics (presence of phone number or mailing address and presence of material in video or audio format) to 0.19 for indication of sponsorship. The median correlation was 0.69 for the 15 characteristics evaluated by both raters.
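Agreement between 2 raters on a present/absent checklist item can be quantified with a kappa statistic, which corrects the observed agreement for the agreement expected by chance. The sketch below assumes Cohen's kappa for two raters (the text says only "Kappa statistics") and uses invented ratings.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(rater_a)
    # Proportion of items on which the raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of 10 sites (1 = characteristic present, 0 = absent).
ratings_a = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
ratings_b = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
kappa = cohen_kappa(ratings_a, ratings_b)
```

In this example the raters agree on 8 of 10 sites (observed agreement 0.80) against a chance agreement of 0.52, giving a kappa of roughly 0.58.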
In the essential category, 8 sites (57%) contained a site-search feature and 11 sites (79%) contained links for navigation in the site. However, 2 sites contained neither of the features. Over half of the sites contained either a privacy statement (57%) or a disclaimer (64%) but only a third of the sites contained both. About one-third of the sites have readability below eighth-grade school level and they ranked significantly higher (tau = -0.39, P= .05) than those that have readability above or equal to that level. The median grade level was 8.5. Half the sites contained one or more broken internal or external hyperlinks.
In the enhancement-characteristic category, 11 sites (79%) indicated their sponsorship. Apparently because most of the sites were sponsored by organizations, government bodies, or educational institutions, only 4 sites (29%) had either pop-up advertisements or in-page banner advertisements. E-mail address (71%) was the most-common contact information available while phone number or mailing address was present in 29% of the sites. Sites that ranked higher were significantly associated with the presence of e-mail address for contact (tau = -0.46, P= .01). Eleven sites (79%) had information on behavioral approach as a method of smoking cessation. Ten sites (71%) had information on a medication (nicotine replacement) approach, and 5 sites (36%) had information on alternative approaches such as acupuncture, hypnosis, laser therapy, and herbal cigarettes. Both the presence of medication (tau = -0.43, P= .02) and alternative approaches (tau = -0.42, P= .02) were significantly associated with a higher search ranking. Five sites provided annotated hyperlinks to external sites and their presence was significantly associated with a higher search ranking (tau = -0.39, P= .04). Eight sites contained interactive components such as quizzes, games, or bulletin boards. Only 1 site provided material in video or audio format.
In the technical-characteristic category, the largest file size of the landing page (the page reached when clicking on the search-site result) was 134 kilobytes, which is equivalent to approximately 19 seconds of download time on a 56 Kbps modem. Sites that were equal to or larger than 35 kilobytes (57%) were ranked significantly higher (tau = -0.39, P= .04) by the search sites. Eight (57%) and 11 (79%) of the sites had meta description and meta keywords tags, respectively. The presence of a meta description tag was significantly associated with a higher search rank (tau = -0.48, P= .002). Although 5 sites used cookies (small files sent to the browser along with a Web page for tracking a visit), only 3 of them used a persistent cookie that is stored on the user's hard disk and 4 used a session cookie that is automatically deleted from the browser's cache when the browser is closed. Six (43%) sites were just part of larger Web sites containing information other than smoking. The median link density of the 14 Web pages was 6 and the maximum was 735. A higher link density was significantly associated with a higher search rank (tau = -0.58, P= .02).
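The presence of meta description and meta keywords tags, as counted above, can also be checked directly from a page's HTML. This study obtained these values from WebXact Watchfire Page Analysis; the sketch below is a stand-in that uses Python's standard `html.parser` module, and the sample page is invented.

```python
from html.parser import HTMLParser


class MetaTagFinder(HTMLParser):
    """Collect the content of meta description and meta keywords tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Attribute values can be None for bare attributes, hence "or".
        name = (attributes.get("name") or "").lower()
        if name in ("description", "keywords"):
            self.meta[name] = attributes.get("content") or ""


def find_meta_tags(html):
    """Return a dict of the description/keywords meta tags found in html."""
    finder = MetaTagFinder()
    finder.feed(html)
    return finder.meta


# Hypothetical landing page.
sample_page = """
<html><head>
<title>Quit smoking help for teens</title>
<meta name="description" content="Tips to help teenagers quit smoking.">
<meta name="Keywords" content="teen, quit smoking">
</head><body></body></html>
"""
tags = find_meta_tags(sample_page)
has_description = "description" in tags
has_keywords = "keywords" in tags
```

The two Boolean flags correspond to the presence/absence variables that were correlated with search position in this analysis.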
The key finding of this study was that when simple search terms were used on popular search sites to look for information on smoking cessation for teenagers, fewer than half (14 of 30) of the sites found were of direct relevance. The remaining sites were study reports, news, and hyperlinks.
We did not include all information retrieved from Web searches, as has been done in studies on other topics, since users tend not to go beyond the first few pages of search results [ , ]. Instead, we evaluated only the top 30 search results to mimic typical Web search behavior.
Searching with the terms "teen quit smoking" on 7 popular search sites, Edwards et al also reported that only 40% of the 140 potential hits were focused on cessation. In our study, 1 site of pornographic nature was found when using the search terms "teen smoking cessation" but no such sites were found when using the search terms "teen quit smoking," in contrast to a previous report [ ] where 7 out of the top 20 sites were teen pornography sites.
Several important associations were found between Web-site characteristics and position ranking in the top 30 search results. These results can be used for optimizing site development in future smoking-cessation Web sites.
As an example of how these results can be used, of the 6 items in the essential-characteristic category, readability (lower grade level) was associated with higher position ranking. It was not uncommon for sites to lack a search box, navigational menu, privacy statement, or disclaimer, or to contain broken links, but these shortcomings were not associated with lower position ranking.
In the enhancement-characteristic category, presence of contact e-mail address, medication-cessation information, alternative-approach information, and annotated external links were associated with higher position ranking. It is surprising to find that only 1 site displayed a HONcode insignia which, along with the associated membership, is an indication that a site complies with an 8-point code of conduct put forth by Health on the Net. Although 73% of young people said that knowing who produced health information is very important to them, only 29% of those who looked up health information online checked the source the last time they conducted a search [ ] and it is likely that fewer will check for the authenticity (for example, verify the membership status of a site at the HON Web site) of any indications of external recognition even if they are present [ ].
In the technical-characteristic category, page size that was larger than 35 kilobytes, presence of a meta description tag, and a high link density were associated with higher ranking. The strong association between site description meta tag and ranking (tau = -0.48, P= .002) suggests that such information is relevant to the ranking algorithms of the search engines used. Including a concise description tag is likely to be more effective in improving search-engine visibility than just a comprehensive keywords list. In fact, due to the high rate of keyword repetition and spam, search sites such as Google and AltaVista do not give consideration to the keywords meta tag in their ranking [ , ]. As expected, link density is strongly associated with ranking (tau = -0.58, P= .02). Search engines generally use the number of incoming links (link density) in their ranking algorithm. However, Google's PageRank algorithm also takes into account the number of outgoing links on the page of each of the incoming links [ ]. Therefore, to achieve a high ranking a Web site should try to get listed on as many sites as possible and, in particular, on those sites that have as few external links as possible. Since search engines assign higher ranking to sites with incoming links that originate from pages containing fewer external links, and sites with annotated external links tend to have fewer links than those sites without annotated external links, this may explain the association between the presence of annotated external links and higher ranking (tau = -0.39, P= .04).
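The effect of outgoing links described above can be illustrated with the simplified PageRank formula given by Brin and Page, PR(A) = (1 − d) + d × Σ PR(T)/C(T), where T ranges over the pages linking to A, C(T) is the number of outgoing links on T, and d is a damping factor (0.85 in their paper). The sketch below is a toy iteration on an invented link graph, not a description of how any search site actually computes its rankings.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterate the simplified PageRank formula on a small link graph.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Each linking page passes on its rank divided by the number
            # of outgoing links it carries.
            incoming = sum(
                rank[source] / len(targets)
                for source, targets in links.items()
                if page in targets
            )
            new_rank[page] = (1 - damping) + damping * incoming
        rank = new_rank
    return rank


# Hypothetical graph: one page links only to site_a, while another page
# spreads its links over site_b and three other sites.
links = {
    "focused_page": ["site_a"],
    "link_list_page": ["site_b", "other_1", "other_2", "other_3"],
}
ranks = pagerank(links)
```

Although `site_a` and `site_b` each have exactly one incoming link, `site_a` ends with the higher score because the page linking to it carries only 1 outgoing link rather than 4.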
To improve search efficiency, users may want to supplement results from search sites with those from subject-based Web directories that are created and maintained by people, rather than by algorithms, such as Yahoo! Directory, which has a teen-smoking section. Using the Yahoo! directory, we found 25 sites listed, of which only 4 were found using our search terms at the 4 popular search sites. In addition, users may want to learn and apply the specific syntax of their favorite search sites when searching for information. For example, quit-smoking Web sites of the commercial (.com) domain can be eliminated from the search results by entering "quit smoking -site:.com" in the search box in Google.
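The refined query in the example above can also be assembled programmatically. The helper names below are hypothetical, and the base URL and its `q` parameter are assumptions about the search site's query interface; only the "-site:" exclusion syntax comes from the example in the text.

```python
from urllib.parse import urlencode


def build_query(terms, exclude_sites=()):
    """Append a -site: exclusion for each domain to the search terms."""
    parts = [terms] + [f"-site:{site}" for site in exclude_sites]
    return " ".join(parts)


def search_url(query, base="https://www.google.com/search"):
    """Return a search URL with the query percent-encoded in 'q'."""
    return f"{base}?{urlencode({'q': query})}"


query = build_query("quit smoking", exclude_sites=[".com"])
url = search_url(query)
```

Here `query` is the string a user would type into the search box, and `url` is the same request expressed as a link.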
The authors thank Sherry Biscope for her help in the data analyses. This study was supported by grants from the Canadian Institutes for Health Research and from the Ontario Ministry of Health and Long-Term Care.
Conflicts of Interest
- O'Neill ET, Lavoie BF, Bennett R. Trends in the evolution of the public Web: 1998-2002. D-Lib Magazine 2003 Apr;9(4).
- Cline RJ, Haynes KM. Consumer health information seeking on the Internet: the state of the art. Health Educ Res 2001 Dec;16(6):671-692.
- How many online? URL: http://www.nua.ie/surveys/how_many_online/ [accessed 2003 Jun 16]
- In: Clancy RE, editor. A Nation Online: How Americans Are Expanding Their Use of the Internet. Washington, DC: Nova Science Pub Inc; May 1, 2002. URL: http://www.ntia.doc.gov/ntiahome/dn/
- Sullivan D. Survey reveals search habits. 2000 Jun 2. URL: http://www.searchenginewatch.com/sereport/article.php/2162681 [accessed 2003 Jun 16]
- Sullivan D. Guides to search engines. 2002 Jan 23. URL: http://www.searchenginewatch.com/links/article.php/2156161 [accessed 2003 Jun 16]
- Sullivan D. Nielsen NetRatings search engine ratings. 2002 Feb 25. URL: http://www.searchenginewatch.com/reports/article.php/34701_2156451 [accessed 2003 Jun 16]
- Eysenbach G, Köhler C. How do consumers search for and appraise health information on the world wide web? Qualitative study using focus groups, usability tests, and in-depth interviews. BMJ 2002 Mar 9;324(7337):573-577.
- Greenspan R. Search engine usage ranks high. URL: http://cyberatlas.internet.com/markets/advertising/article/0,,5941_1500821,00.html [accessed 2003 Jun 16]
- Richardson CR, Resnick PJ, Hansen DL, Derry HA, Rideout VJ. Does pornography-blocking software block access to health information on the Internet? JAMA 2002 Dec 11;288(22):2887-2894.
- Biermann JS, Golladay GJ, Greenfield ML, Baker LH. Evaluation of cancer information on the Internet. Cancer 1999 Aug 1;86(3):381-390.
- Croft DR, Peterson MW. An evaluation of the quality and contents of asthma education on the World Wide Web. Chest 2002 Apr;121(4):1301-1307.
- Latthe PM, Khan KS. Quality of medical information about menorrhagia on the worldwide web. BJOG 2000 Jan;107(1):39-43.
- Hoffman-goetz L, Clarke JN. Quality of breast cancer sites on the World Wide Web. Can J Public Health 2000;91(4):281-284.
- Kim P, Eng TR, Deering MJ, Maxfield A. Published criteria for evaluating health related web sites: review. BMJ 1999 Mar 6;318(7184):647-649.
- Silberg WM, Lundberg GD, Musacchio RA. Assessing, controlling, and assuring the quality of medical information on the Internet: Caveant lector et viewor--Let the reader and viewer beware. JAMA 1997 Apr 16;277(15):1244-1245.
- Winker MA, Flanagin A, Chi-lum B, White J, Andrews K, Kennett RL, et al. Guidelines for medical and health information sites on the internet: principles governing AMA web sites. American Medical Association. JAMA 2000 Mar 22;283(12):1600-1606.
- HON code of conduct (HONcode) for medical and health Web sites. URL: http://www.hon.ch/HONcode/Conduct.html [accessed 2003 Jun 16]
- eEurope 2002: Quality Criteria for Health Related Websites. J Med Internet Res 2002 Nov 29;4(3):e15.
- Jadad AR, Gagliardi A. Rating health information on the Internet: navigating to knowledge or to Babel? JAMA 1998 Feb 25;279(8):611-614.
- Kunst H, Groot D, Latthe PM, Latthe M, Khan KS. Accuracy of information on apparently credible websites: survey of five common health topics. BMJ 2002 Mar 9;324(7337):581-582.
- Gagliardi A, Jadad AR. Examination of instruments used to rate quality of health information on the internet: chronicle of a voyage with an unclear destination. BMJ 2002 Mar 9;324(7337):569-573.
- Griffiths KM, Christensen H. The quality and accessibility of Australian depression sites on the World Wide Web. Med J Aust 2002 May 20;176 Suppl:S97-S104.
- Meric F, Bernstam EV, Mirza NQ, Hunt KK, Ames FC, Ross MI, et al. Breast cancer on the world wide web: cross sectional survey of quality of information and popularity of websites. BMJ 2002 Mar 9;324(7337):577-581.
- Sandvik H. Health information and interaction on the internet: a survey of female urinary incontinence. BMJ 1999 Jul 3;319(7201):29-32.
- Sullivan D. Nielsen NetRatings search engine ratings. 2003 Feb 25.
- Search term suggestion tool. URL: http://inventory.overture.com/d/searchinventory/suggestion/ [accessed 2003 Jun 16]
- Keyword suggestion tool. URL: http://conversion.7search.com/scripts/advertisertools/keywordsuggestion.aspx [accessed 2003 Jun 16]
- Sullivan D. Who powers whom? Search providers chart. 2003 May 5. URL: http://www.searchenginewatch.com/reports/article.php/34701_2156401 [accessed 2003 Jun 16]
- Marckini FW. Search Engine Positioning (With CD-ROM). Plano, TX: Wordware Publishing; May 15, 2001.
- Google API Search Tool [computer program]. URL: http://www.searchenginelab.com/common/products/gapis/ [accessed 2003 Jun 16]
- Offline Explorer [computer program]. URL: http://www.metaproducts.com/ [accessed 2003 Jun 16]
- Flesch R. How to write plain English. URL: http://www.mang.canterbury.ac.nz/courseinfo/AcademicWriting/Flesch.htm [accessed 2003 Oct 13]
- SPSS (Statistical Package for the Social Sciences) [computer program]. Chicago, IL: SPSS.
- WebXACT. URL: http://www.webxact.com/ [accessed 2003 Jun 16]
- Zhang P, Von Dran GM. Satisfiers and dissatisfiers: a two-factor model for website design and evaluation. J Am Soc Info Sci 2000 Oct;51(14):1253-1268.
- Ribisl KM. The potential of the internet as a medium to encourage and discourage youth tobacco use. Tob Control 2003 Jun;12 Suppl 1(90001):i48-i59.
- Edwards CC, Elliott SP, Conway TL, Woodruff SI. Teen smoking cessation help via the Internet: a survey of search engines. Health Promot Pract 2003 Jul;4(3):262-265.
- Elliott SP, Edwards CC, Woodruff SI, Conway TL. On-line smoking cessation: what's porn got to do with it? Tob Control 2001 Dec;10(4):397.
- Sullivan D. Death of a meta tag. 2002 Oct 1. URL: http://www.searchenginewatch.com/sereport/article.php/2165061 [accessed 2003 Jun 16]
- Whalen J. Ten tips to the top of Google. 2003 Apr 30. URL: http://www.searchenginewatch.com/searchday/article.php/2198931 [accessed 2003 Jun 16]
- Brin S, Page L. The anatomy of a large-scale hypertextual web search engine. URL: http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm [accessed 2003 Nov 10]
- Yahoo! directory teen health > teen smoking. URL: http://dir.yahoo.com/health/teen_health/teen_smoking/ [accessed 2003 Nov 10]
Edited by G. Eysenbach; submitted 10.09.03; peer-reviewed by M Slater, WE Haefeli, D Ilic; comments to author 22.09.03; revised version received 29.09.03; accepted 30.09.03; published 14.11.03
© Malcolm Koo, Harvey Skinner. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 14.11.2003. Except where otherwise noted, articles published in the Journal of Medical Internet Research are distributed under the terms of the Creative Commons Attribution License (http://www.creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited, including full bibliographic details and the URL (see "please cite as" above), and this statement is included.