Rating Health Web sites using the principles of Citation Analysis: A Bibliometric Approach

The rapid growth in the number of health care related web sites requires medical librarians to be able to evaluate the quality of those sites. This study applied citation analysis to the sources linked from the medical library web pages of nineteen of the top U.S. medical schools. This bibliometric approach identified a set of the 78 most highly cited WWW sites out of thousands of cited links. Identifying the current core of health sciences related web sites with a bibliometric method gives librarians and information scientists another approach for evaluating web sites.


Introduction
The rapid growth and constant change in the number of health care related web sites make evaluating the quality of these sites a difficult but beneficial task. The Internet is "a medium in which anyone with a computer can serve simultaneously as author, editor, and publisher and can fill any or all of these roles anonymously if he or she so choose. In such an environment, novices and savvy Internet users alike can have trouble distinguishing the wheat from the chaff, the useful from the harmful" [1]. In a systematic search using two search engines (Yahoo and Excite) for parent-oriented web pages on the home management of feverish children [2], the investigators compared the web site information with the guidelines for managing fever at home given to parents in a printed book. Among the 41 web pages retrieved and reviewed, the investigators found that 28 gave a specific temperature above which a child is feverish, 26 indicated the optimal site for taking temperature, 38 recommended non-drug measures, and 36 gave some indication of when a doctor should be called. Only four web pages adhered closely to the main recommendations in the guidelines. The investigators concluded from these observations that only a few web sites provided complete and accurate information for this common and widely discussed condition. According to McClung [3], 48 out of 60 major medical institution web sites checked had inaccurate information about the treatment of childhood diarrhea. While it is virtually impossible (and probably undesirable) to control the content of web pages, it is certainly useful to have some measure of the quality of the information provided.
One possibility is to establish an official rating system based on standard criteria. The investigators in the survey mentioned above [2] also suggested an urgent need to check public-oriented health care information on the Internet for accuracy, completeness, and consistency. Many attempts have been made, and core standards that can help to achieve these goals have been developed. The most widely accepted suggestion is to adapt the five traditional print evaluation criteria (accuracy, authority, objectivity, currency and coverage) to web resources [4,5,6].
However, "many Internet users object strongly to any 'official' attempts to regulate information", though few want to see inaccurate information appearing! In addition, "the Web's interactive format means criteria used for paper-based journals may not be valid for web-based information." [7]. Jadad points out that the "Net's very nature makes this difficult, if not impossible". After an investigation to identify instruments used to rate web sites providing health information on the Internet, Jadad concluded, "many incompletely developed instruments to evaluate health information exist on the Internet. It is unclear, however, whether they should exist in the first place, whether they measure what they claim to measure, or whether they lead to more good than harm" [8]. At this point it is very difficult to reach or develop a standard that every user of the Internet could observe.
It has been suggested that Web sites can be evaluated in a similar way to traditional print media. When we evaluate a textbook or a journal, we not only assess the authors, content, and structure but also, more objectively, measure the impact of the publication on its readers. Citation analysis, the practice of counting citations to determine the scholarly impact of a work, is a method long used by librarians as an important tool of collection development. In bibliometrics, the impact of a journal is evaluated by how frequently it is cited during a given period.
One major instrument for evaluating scientific journals is Journal Citation Reports (JCR) [9]. JCR is published by the Institute for Scientific Information and includes several citation-based measures of journal impact for the journals that it reviews. Librarians and researchers can use JCR to see how many times and how quickly articles published in certain journals are cited. There is also a measure of effectiveness, the impact factor, which normalizes the citations received by the selected journals and looks only at the previous two years of publication.
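As a rough illustration of the normalization described above, the two-year impact factor is conventionally computed as the citations received in a given year by items published in the two preceding years, divided by the number of citable items published in those two years. The sketch below assumes that conventional definition; the function name and the journal figures are hypothetical and are not taken from JCR.

```python
def impact_factor(citations_this_year: int, items_prev_two_years: int) -> float:
    """Citations received this year by items from the two preceding years, per item."""
    if items_prev_two_years <= 0:
        raise ValueError("no citable items in the two-year window")
    return citations_this_year / items_prev_two_years


# Hypothetical journal: 200 citable items published in the two preceding
# years, cited 500 times during the current year.
example_if = impact_factor(500, 200)
print(example_if)  # 2.5
```

Dividing by the number of items is what allows a small journal to be compared with a large one.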
Though there is no similar tool available to evaluate the impact of a WWW page, it is comparatively easy to determine which pages are cited ("linked to") by the compilers of other pages. A study of this kind has been conducted on the WWW pages of selected fine art libraries [10]. By analyzing the linked sources on art library Web pages, Neth's study found a set of twenty commonly cited WWW sites out of thousands of cited links. Among health science related web sites, some well-established sites already use this method successfully. For example, the compilers of the Hardin-Meta of the University of Iowa look at many sites in each field and choose the lists that are most frequently cited by people in the field. This analysis provides a rudimentary form of peer evaluation. They call it a "list of lists" [11]. As another example, in a paper on the quality management of medical information on the Internet, Eysenbach presented some indirect quality indicators, among them "Web citation". A "webcite index," analogous to the Science Citation Index, could be compiled from the absolute number of hyperlinks to a certain website or from new hyperlinks established over a period of time [17]. The author has developed a website network (http://webcite.net) that applies this methodology.
In this paper, we analyze the pages linked to in the "other links" sections of the web pages of a selection of the top 25 US medical schools, on the assumption that a Web Master will only cite or link to pages he/she thinks are authoritative. We examine the links made from these pages and obtain a listing of the most cited pages. This affords a new approach to evaluating web sites by using the principles of citation analysis.

Methods
(1). Sample selection: The selection of the "key sites" used to count the most frequently cited web sites is very important. For our approach, we used the listing of "the top 25 medical schools in the United States" as published by U.S. News and World Report [12]. Next we identified the primary health information WWW site of each school. Normally this was the home page of the medical school library. Among these 25 medical schools, the web pages for seven of the medical schools were eliminated because of technical limitations of the URL-checking software and variations among the web sites.
We finally examined the web pages of nineteen of the top twenty-five US medical schools. The top 25 are listed in Table 1, with those eliminated from this study indicated by an asterisk (*).
(2). Ranking the web sites by cited frequency: The next step was to examine the links made from these pages. This was done with a software program, "Checkweb" [13], which checks the links of the selected web page and reports which ones have moved or cannot be located or connected to. The second step was to clean up this list by eliminating the orphans (Status 404, no longer existing, and Status 301 and 302, moved) and the "noise items". Noise items are "noise" from the host web page, such as "go home" links or links to other sections of the same site. This ensures that the final list contains only active links to external URLs.
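The clean-up step can be sketched as a simple filter. The sketch below assumes the link checker's report has already been reduced to (URL, HTTP status) pairs; that input shape, the function name and the example URLs are illustrative assumptions and do not reflect Checkweb's actual output format.

```python
from urllib.parse import urlsplit

# Status codes treated as orphans in the text: 404 (missing), 301/302 (moved).
DEAD_OR_MOVED = {301, 302, 404}


def clean_links(source_url, checked_links):
    """Keep only live links that point outside the source page's own host.

    checked_links is an iterable of (url, http_status) pairs, a hypothetical
    stand-in for a link checker's report.
    """
    source_host = urlsplit(source_url).netloc.lower()
    kept = []
    for url, status in checked_links:
        if status in DEAD_OR_MOVED:
            continue  # orphan: no longer existing, or moved
        host = urlsplit(url).netloc.lower()
        if not host or host == source_host:
            continue  # noise: relative link or navigation within the same site
        kept.append(url)
    return kept


# Hypothetical report for one library page.
report = [
    ("http://www.cdc.gov/", 200),
    ("http://library.example.edu/index.html", 200),  # same-site "go home" link
    ("/hours.html", 200),                            # relative, same site
    ("http://gone.example.org/", 404),
    ("http://moved.example.org/", 301),
]
external = clean_links("http://library.example.edu/links.html", report)
print(external)  # ['http://www.cdc.gov/']
```

Comparing host names is only one possible reading of "links to other sections on the same site"; a stricter or looser rule (for example, comparing registered domains) would change which links count as noise.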
The final step was to count the frequency of these URLs at their different levels. Consider, for example, the URL http://www.lib.uiowa.edu/hardin/md/speech.html. This URL can be broken down into its component parts as shown in Table 1. We separated the URLs into their different component levels and counted the frequency of each. In this example, the first level domain name is the portion before the first single slash, "http://www.lib.uiowa.edu".
The Top Level Domains (TLDs) include designators such as .edu, .com, .ca, and .nl. Sorting the TLDs resulted in Table 2.
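The breakdown and counting described above can be sketched with the standard library. This is a minimal illustration under stated assumptions: the function names are hypothetical, the three levels are taken to be the TLD, the first level domain (scheme plus host) and the whole URL, and only the Hardin URL comes from the text; the other URLs are made up.

```python
from collections import Counter
from urllib.parse import urlsplit


def url_levels(url):
    """Split a URL into the three levels counted in the study."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # TLD: the last dot-separated label of the host, e.g. ".edu".
    tld = "." + host.rsplit(".", 1)[-1] if "." in host else ""
    # First level domain: everything before the first single slash.
    first_level = f"{parts.scheme}://{host}"
    return {"tld": tld, "first_level": first_level, "whole": url}


def count_levels(urls):
    """Count how often each TLD, first level domain and whole URL occurs."""
    counts = {"tld": Counter(), "first_level": Counter(), "whole": Counter()}
    for url in urls:
        for level, value in url_levels(url).items():
            counts[level][value] += 1
    return counts


levels = url_levels("http://www.lib.uiowa.edu/hardin/md/speech.html")
print(levels["tld"])          # .edu
print(levels["first_level"])  # http://www.lib.uiowa.edu

# Hypothetical cleaned link list gathered from several source pages.
sample = [
    "http://www.lib.uiowa.edu/hardin/md/speech.html",
    "http://www.lib.uiowa.edu/hardin/md/",
    "http://www.example.gov/reports/",
    "http://www.example.gov/",
]
counts = count_levels(sample)
print(counts["tld"].most_common())
```

Sorting each counter with `most_common()` gives the ranked lists from which frequency tables of this kind can be built.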

Results and Analysis
The three levels of URLs were counted and the results are shown in Table 3, Table 4 and Table 5.
The links are heavily concentrated in a few TLDs, notably .edu, .com, .gov, and .org, which together account for 88.61% of the links. As shown in Table 3, the most highly cited TLDs (each cited more than 600 times) are .edu, .com, .gov, and .org. These TLDs are all registered in the United States. Other, less cited US TLDs are .net, .us and .mil. United States related web pages account for almost 90% of the URLs cited. This was not unexpected, because the source samples are U.S. medical schools and the Internet is highly developed in this country. Among the US TLDs, those from four-year colleges and universities, which are entitled to use the .edu suffix, are cited most frequently and are therefore considered the most important. The .edu suffix accounts for almost one third of all links.
Other countries whose TLDs are frequently cited are United Kingdom (uk), Switzerland (ch), Canada (ca), Germany (de), Australia (au), Sweden (se) and Netherlands (nl). This distribution is very similar to the ranking of 30 nations by citations per paper from 1992 to 1996 compiled by the Institute for Scientific Information (ISI) and published in Science Watch [14]. In that ranking the top ten nations were Switzerland, United States, Netherlands, Sweden, Denmark, United Kingdom, Belgium, Finland, Canada, and Germany. It seems that to some degree our results may also reflect the developmental level of medical information publishing and research in the world. However, the comparison of these two lists is not the focus of this paper.
(2). Distribution of the First Level Domains: One of the goals of this study was to identify the web sites cited most frequently by US academic health sciences libraries. Table 3 shows that a total of 1731 web sites were cited by (linked to from) these 19 institutional home pages.
According to Bradford's Law of Scatter [15], "if scientific journals are arranged in order of decreasing productivity of articles on a given subject, they may be divided into a nucleus of periodicals more particularly devoted to the subject and several groups or zones containing the same number of articles as the nucleus, when the numbers of periodicals in the nucleus and succeeding zones will be as 1:n:n²". In our study, we list the web sites in order of decreasing frequency of citation and, as Bradford did in his original paper, divide the total number of citations into 3 equal sections. The first section is the top 78 web sites (shown in detail in Table 3), which account for 33.69% of all citations; the second section runs from rank No. 79 to No. 530 and accounts for nearly another 33%; the last section runs from No. 531 to No. 1731. The numbers of web sites in these sections of almost equal citation frequency are thus 78:452:1201, close to 1:4:4². Applying this law to the citation analysis of web sites, we can therefore take the first section (78 web sites) as the core of these 1731 web sites.
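The zoning procedure can be sketched as follows: rank the sites by citation count, accumulate the counts, and close a zone each time the running total passes another third of all citations. This is a minimal sketch under stated assumptions; the function name and the small citation list are hypothetical, and only the zone sizes 78, 452 and 1201 are taken from the text.

```python
def bradford_zones(citation_counts, n_zones=3):
    """Return the number of sites in each zone of roughly equal total citations."""
    ranked = sorted(citation_counts, reverse=True)
    total = sum(ranked)
    sizes = []
    running = 0
    start = 0  # number of sites already assigned to earlier zones
    zone = 1
    for i, count in enumerate(ranked, start=1):
        running += count
        # Integer comparison avoids floating-point error at the boundaries.
        while zone < n_zones and running * n_zones >= zone * total:
            sizes.append(i - start)
            start = i
            zone += 1
    sizes.append(len(ranked) - start)
    return sizes


# Hypothetical citation counts for eight sites (27 citations in total).
toy_sizes = bradford_zones([9, 5, 4, 3, 2, 2, 1, 1])
print(toy_sizes)  # [1, 2, 5]

# Zone sizes reported in the text, expressed relative to the nucleus.
reported = (78, 452, 1201)
ratios = [round(size / reported[0], 2) for size in reported]
print(ratios)  # [1.0, 5.79, 15.4]
```

Relative to the nucleus, the reported zone sizes are thus roughly 1 : 5.8 : 15.4, which can be set against the 1:n:n² form of the law.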
(3). Distribution of the Whole Domain Name Web sites: Most of the web sites listed in the whole domain name table (Table 4) are already listed in the earlier tables. This is because most "other links" are directed to the first level domains of URLs. Only URLs with asterisks (*) in this table have more details.
In fact, most of the URLs in the whole-URL list had already been identified in the first level domain list (Table 3). A few links point to pages deeper within a site and tell us why the site was selected for a link. For example, many, though not all, visitors to CDC want to look up the Morbidity and Mortality Weekly Report (MMWR), and many visitors to NIH want information on grant and fellowship programs; both pages therefore often receive direct links in addition to a more general link to the CDC or NIH sites. Combining the results of the first level domain analysis and the whole-URL analysis, we replaced a first level domain with its whole-URL expansion where one existed. From this analysis a guide to the most cited health sciences related web sites was determined. We hope this list might serve as a more complete listing of the core web sites on health care.
To present these health-related core web sites more clearly, we classified them by their main utility and originating site into 6 clusters (Table 5).

Conclusions
Among the URLs cited by the selected academic medical institutions, almost 90% of the Top Level Domains (TLDs) are from the United States. Less than 10% come from the United Kingdom, Switzerland, Canada, Germany, Australia and the Netherlands. The remaining TLDs account for less than 2%.
The first level domains are distributed according to Bradford's Law. There is a nucleus that contains the 78 most highly cited health sciences related web sites. These core web sites represented a broad field of information needs.

Discussion
The identification of a core section of health-related web sites with a bibliometric method gives librarians and information scientists another approach to evaluating web sites. While "core lists" of printed publications have their drawbacks, they are useful guides that help librarians and users select publications. Similarly, lists of commonly linked-to WWW pages can suggest important health-related sites and assist home-page compilers in selecting suitable and reliable links. It would be desirable to examine the home pages of all U.S. medical school libraries and to compare these results with those from the pages produced by medical school libraries in other English-speaking countries such as Canada, the United Kingdom and Australia.