The use of medical experts in rating the content of health-related sites on the Internet has flourished in recent years. In this research, it has been common practice to use a single medical expert to rate the content of the Web sites. In many cases, the expert has rated the Internet health information as poor, and even potentially dangerous. However, one problem with this approach is that there is no guarantee that other medical experts will rate the sites in a similar manner.
The aim was to assess the reliability of medical experts' judgments of threads in an Internet newsgroup related to a common disease. A secondary aim was to show the limitations of commonly-used statistics for measuring reliability (eg, kappa).
The participants in this study were 5 medical doctors, who worked in a specialist unit dedicated to the treatment of the disease. They each rated the information contained in newsgroup threads using a 6-point scale designed by the experts themselves. Their ratings were analyzed for reliability using a number of statistics: Cohen's kappa, gamma, Kendall's W, and Cronbach's alpha.
Reliability was absent for ratings of questions, and low for ratings of responses. The various measures of reliability used gave conflicting results. No measure produced high reliability.
The medical experts showed low agreement when rating the postings from the newsgroup. Hence, it is important to test inter-rater reliability in research assessing the accuracy and quality of health-related information on the Internet. A discussion of the different measures of agreement that could be used reveals that the choice of statistic can be problematic. It is therefore important to consider the assumptions underlying a measure of reliability before using it. Often, more than one measure will be needed for "triangulation" purposes.
The importance of the Internet for contemporary public health has been acknowledged for some time. People have used the Internet for many years to access health-related information. Pallen points out that, although health professionals originally assumed that health-related Internet sites would be used chiefly by professionals themselves for research, consultation with colleagues, continuing education, and library work, this concept has been extended and modified [
The Graphics, Visualization & Usability Center at Georgia Institute of Technology estimated that 27% of female Internet users and 15% of male Internet users use the Internet to get medical information on a regular basis [
Leaving aside the question of whether a reliance on medical opinion will "dismiss the input of non medical readers" [
The 5 medical experts who participated were all doctors experienced in the treatment of the chosen chronic illness. They worked together in the same specialist unit and all had at least 5 years' experience in treating the illness.
The material to be categorized came from a newsgroup used mainly by sufferers of the illness who were not medical professionals. Overall, 61 threads (series of connected messages) were selected from a week's postings because they contained medically-related information; each was examined by at least one medical expert. A random sample of 18 of these threads was assessed by all 5 experts, and these 18 threads form the basis of this report.
Each thread consisted of a start message, usually in the form of a question, and a number of responses. Both the start message and the responses were rated using a coding scheme devised by the medical experts. For start messages, there was a 6-part scheme: A = excellent; B = less good but with some details; C = poor with little detail; D = vague; E = misleading or irrelevant; F = incomprehensible. The responses were also coded according to a 6-part scheme: A = evidence based, excellent; B = accepted wisdom; C = personal opinion; D = misleading, irrelevant; E = false; F = possibly dangerous.
There are 3 main ways (kappa, gamma, and Kendall's W) to analyze the agreement of judges rating the threads from the Internet. Perhaps the most familiar to medical researchers and practitioners is Cohen's kappa. We present the version of kappa described in Siegel & Castellan [
There is a choice to be made about the most appropriate statistic for analyzing such data. One could use a weighted-kappa procedure, but this statistic is controversial because the values of the weights for each level are arbitrary [
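For readers who wish to reproduce this kind of analysis, the following is a minimal sketch of an overall kappa for several raters in the style of Fleiss' multiple-rater kappa, which is closely related to the Siegel & Castellan formulation cited above; the small rating matrix is invented purely for illustration and is not the study data.

```python
import numpy as np

def multirater_kappa(ratings, categories):
    """Fleiss-style overall kappa for m raters assigning N items
    to nominal categories."""
    ratings = np.asarray(ratings)          # shape (n_items, n_raters)
    n_items, n_raters = ratings.shape
    # counts[i, j] = number of raters placing item i in category j
    counts = np.stack([(ratings == c).sum(axis=1) for c in categories], axis=1)
    # observed agreement per item, averaged over items
    p_item = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_obs = p_item.mean()
    # chance agreement from the marginal category proportions
    p_cat = counts.sum(axis=0) / (n_items * n_raters)
    p_exp = (p_cat ** 2).sum()
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical example: 4 threads rated A-F by 5 raters
example = [list("AABBA"), list("CCCDC"), list("FFEEF"), list("BBBBB")]
print(round(multirater_kappa(example, list("ABCDEF")), 3))
```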
For the start messages, the kappa statistic was 0.024; this value was not significant (
Gamma Statistics for the Rating of Start Messages
| | Rater 1 | Rater 2 | Rater 3 | Rater 4 | Rater 5 |
| --- | --- | --- | --- | --- | --- |
| Rater 1 | 1 | 0.000 | 0.181 | 0.247 | -0.659** |
| Rater 2 | 0.000 | 1 | 0.345 | 0.262 | 0.368 |
| Rater 3 | 0.181 | 0.345 | 1 | 0.475 | 0.250 |
| Rater 4 | 0.247 | 0.262 | 0.475 | 1 | 0.409 |
| Rater 5 | -0.659** | 0.368 | 0.250 | 0.409 | 1 |
*P < .05
**P < .01
***P < .001
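For completeness, the sketch below shows how a pairwise gamma of the kind reported in the table can be computed from two raters' ordinal ratings, by counting concordant and discordant pairs; the example ratings (A-F coded as 1-6) are hypothetical.

```python
from itertools import combinations

def goodman_kruskal_gamma(x, y):
    """Goodman-Kruskal gamma for two raters' ordinal ratings:
    (concordant - discordant) / (concordant + discordant),
    ignoring pairs tied on either variable."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    if concordant + discordant == 0:
        return float("nan")
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical ratings (A-F coded 1-6) from two raters on 8 threads
rater_1 = [1, 2, 2, 4, 3, 5, 6, 2]
rater_2 = [1, 3, 2, 4, 4, 5, 5, 1]
print(round(goodman_kruskal_gamma(rater_1, rater_2), 3))
```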
Overall, agreement on the ratings of responses to these start messages was somewhat better. The kappa statistic for these ratings was 0.243 and was significant (
Gamma Statistics for the Rating of Replies
| | Rater 1 | Rater 2 | Rater 3 | Rater 4 | Rater 5 |
| --- | --- | --- | --- | --- | --- |
| Rater 1 | 1 | 0.431 | 0.377 | 0.730*** | 0.602* |
| Rater 2 | 0.431 | 1 | 0.578*** | 0.621*** | 0.311 |
| Rater 3 | 0.377 | 0.578*** | 1 | 0.592*** | 0.504** |
| Rater 4 | 0.730*** | 0.621*** | 0.592*** | 1 | 0.690*** |
| Rater 5 | 0.602* | 0.311 | 0.504** | 0.690*** | 1 |
*P < .05
**P < .01
***P < .001
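Kendall's W, the third overall statistic listed in the Methods, can be computed for the full set of raters as sketched below; this implementation omits the correction for tied ranks and uses invented data, so it is illustrative only.

```python
import numpy as np
from scipy.stats import rankdata

def kendalls_w(ratings):
    """Kendall's coefficient of concordance (no tie correction).
    ratings: array of shape (n_items, n_raters)."""
    ratings = np.asarray(ratings, dtype=float)
    n_items, n_raters = ratings.shape
    # rank each rater's scores across the items (ties get average ranks)
    ranks = np.column_stack([rankdata(ratings[:, j]) for j in range(n_raters)])
    rank_sums = ranks.sum(axis=1)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (n_raters ** 2 * (n_items ** 3 - n_items))

# Hypothetical ratings (A-F coded 1-6): rows are threads, columns are raters
example = np.array([
    [1, 2, 1, 2, 1],
    [3, 3, 4, 3, 2],
    [5, 6, 5, 4, 5],
    [2, 1, 2, 2, 3],
])
print(round(kendalls_w(example), 3))
```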
A more imaginative approach to the problem of assessing reliability and validity for ratings of this type was suggested by an anonymous reviewer. The first suggestion was to treat the data as interval level rather than ordered categorical, which would allow greater flexibility in analysis; this approach is relatively common in the social sciences, and more particularly in psychometric research. The second suggestion was that a simple and effective way of presenting the data would be to give the Spearman rank order correlations between raters. We present these for the ratings of the replies in the table below.
Spearman Rank Order Correlations for Replies
| | Rater 1 | Rater 2 | Rater 3 | Rater 4 | Rater 5 |
| --- | --- | --- | --- | --- | --- |
| Rater 1 | 1 | 0.296* | 0.248* | 0.519*** | 0.416*** |
| Rater 2 | 0.296* | 1 | 0.454*** | 0.538*** | 0.233* |
| Rater 3 | 0.248* | 0.454*** | 1 | 0.452*** | 0.334** |
| Rater 4 | 0.519*** | 0.538*** | 0.452*** | 1 | 0.516*** |
| Rater 5 | 0.416*** | 0.233* | 0.334** | 0.516*** | 1 |
*P < .05
**P < .01
***P < .001
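A correlation matrix of this kind can be produced directly with scipy's spearmanr; the sketch below assumes the A-F codes have been converted to the numbers 1-6 and uses an invented rating matrix purely for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical ratings (A-F coded 1-6): rows are threads, columns are raters
ratings = np.array([
    [1, 2, 1, 2, 1],
    [3, 3, 4, 3, 2],
    [2, 1, 2, 2, 3],
    [5, 6, 5, 4, 5],
    [4, 4, 3, 5, 4],
    [6, 5, 6, 6, 6],
])

# With a 2-D array, spearmanr correlates every pair of columns (raters)
rho, p_values = spearmanr(ratings)
print(np.round(rho, 2))       # 5 x 5 matrix analogous to the table above
print(np.round(p_values, 3))  # corresponding P values
```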
In this case, the Cronbach's alpha for the 5 doctors' ratings of the replies was 0.78. This reliability, however, would be increased to 0.876 by doubling the number of raters to 10, and to 0.914 by increasing the number of raters to 15. If we only had 2 raters, the reliability would be reduced to a very worrying 0.59.
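These projections follow from the Spearman-Brown prophecy formula discussed below. As a sketch, the first function computes an alpha of this kind when the raters are treated as "items" and the threads as cases (assuming the A-F ratings have been coded numerically); the second reproduces the projected reliabilities quoted above from the observed alpha of 0.78 for 5 raters. Both are illustrative rather than a reproduction of our analysis.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha with raters treated as items.
    scores: array of shape (n_threads, n_raters), ratings coded numerically."""
    scores = np.asarray(scores, dtype=float)
    n_raters = scores.shape[1]
    rater_variances = scores.var(axis=0, ddof=1).sum()   # sum of per-rater variances
    total_variance = scores.sum(axis=1).var(ddof=1)      # variance of thread totals
    return (n_raters / (n_raters - 1)) * (1 - rater_variances / total_variance)

def spearman_brown(reliability, current_raters, new_raters):
    """Predicted reliability when the number of raters changes by a factor
    m = new_raters / current_raters (Spearman-Brown prophecy formula)."""
    m = new_raters / current_raters
    return (m * reliability) / (1 + (m - 1) * reliability)

alpha_5 = 0.78  # observed alpha for the 5 doctors' ratings of replies
for n in (2, 10, 15):
    print(n, round(spearman_brown(alpha_5, 5, n), 3))
# prints roughly 0.59 for 2 raters, 0.876 for 10, and 0.914 for 15
```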
For medical evidence of this type, we would want to have information that is as reliable as possible; 5 doctors, as in our example, may be too few. The reliability can be increased by increasing the number of items to be rated as well as by increasing the number of raters. The Spearman-Brown formula, however, is limited to estimating differences in one dimension, in this case the number of raters. Brown [
Overall, the results suggest that there is a fair degree of disagreement between medical experts when they are asked to rate medically-related postings from the Internet. In this case, the experts were using a system that they had devised themselves, so any possibility of this result being forced on them by a poor or deliberately misleading category system is negated. We acknowledge that the start-message coding is less important, as it deals with questions rather than answers, is based on a small sample, and seems by its nature to be less precise, which may explain the very low levels of agreement. The rating of responses, however, seems to us to use sensible and relatively transparent categories. The agreement between response ratings is still relatively poor, and certainly not consistent across all the experts.
One particularly interesting finding was the divergence of the different statistics used to measure agreement on the same ratings. It seems that the choice of a statistic to measure the agreement of judges in this sort of research could be problematic. The two main issues are the power of a statistic and the use of pair-wise versus overall statistics. In particular, we have shown that it is possible to achieve a reasonably high level of agreement with an overall test when individual pair-wise statistics show no agreement or significant disagreement (as was the case for start messages). We have also shown that overall statistics can conflict with pair-wise statistics when there are subgroups within the raters who agree with each other but disagree with the other subgroups. This was the case with the replies: the overall level of agreement was very low, but individual pair-wise statistics showed high agreement between pairs of raters. The selection of a homogeneous group of experts (such as ours) did not seem to eliminate this problem.
The anonymous reviewer's suggestion of adopting psychometric techniques to look at the reliability of the raters is interesting, and we believe it could be a valuable procedure for the future. Both factor analysis and latent structure analysis [
These results call into question the numerous studies that have claimed to show that the information on the Internet is of poor quality, and suggest that future studies should employ more than one rater. That one expert fails to agree with the Internet is perhaps less important than that several experts disagree with each other. It is possible that training or other resources might increase agreement between experts, and future research could consider this. Any measure producing a greater agreement between raters of Internet sites could have great benefits to medical and nonmedical users of the Internet alike.
The research findings were drawn from a project that is being funded by the Economic and Social Research Council (award number L132251029) under the auspices of its Virtual Society? Programme. Full details of the project can be found on the project Web site at
There are no conflicts of interest for any of the authors.