This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Increasing numbers of patients are raising their voice in online forums. This shift is welcome as an act of patient autonomy, reflected in the term “expert patient”. At the same time, there is considerable concern that patients can be easily misguided by pseudoscientific research and debate. Little is known about the sources of information used in health-related online forums, how users apply this information, and how they behave in such forums.
The intent of the study was to identify (1) the sources of information used in online health-related forums, and (2) the roles and behavior of active forum visitors in introducing and disseminating this information.
This observational study used the largest German multiple sclerosis (MS) online forum as a database, analyzing the user debate about the recently proposed and controversial Chronic Cerebrospinal Venous Insufficiency (CCSVI) hypothesis. After extracting all posts and then filtering relevant CCSVI posts between 01 January 2008 and 17 August 2012, we first identified hyperlinks to scientific publications and other information sources used or referenced in the posts. Employing
Of 139,912 posts from 11,997 threads, 8628 posts discussed or at least mentioned CCSVI. We detected hyperlinks pointing to CCSVI-related scientific publications in 31 posts. In contrast, 2829 different URLs were posted to the forum, most frequently referring to social media, such as YouTube or Facebook. We identified a total of 6 different roles of hyperlink posters including Social Media Fans, Organization Followers, and Balanced Source Users. Apart from the large and nonspecific residual category of the “average user”, several specific behavior patterns were identified, such as the small but relevant groups of CCSVI-Focused Responders or CCSVI Activators.
The bulk of the observed contributions were not based on scientific results, but on various social media sources. These sources seem to contain mostly opinions and personal experience. A small group of people with distinct behavioral patterns played a core role in fuelling the discussion about CCSVI.
In the past few decades, we have witnessed a powerful movement toward an active, self-managing, and responsible patient, coined the “expert patient” [
Most studies in this area have investigated how often people use the Internet for retrieving health information [
However, we still know very little about what mechanisms of information dissemination are effective as well as what sources of information people in online forums rely on, how they form their opinions, and how they act. A better understanding of these mechanisms may help to assess their influence on laypeople and to forecast the benefits and dangers of these new forms of information dissemination and exchange.
One promising area for such research is the recently proposed Chronic Cerebrospinal Venous Insufficiency (CCSVI) hypothesis in multiple sclerosis (MS) and its repercussions in patient communities. In short, this hypothesis was first proposed by Paolo Zamboni [
Before we can make a statement on whether and how this multitude of information sources and opinions may contribute to the enlightenment of some participants in the debate or the confusion of others, we need to know more about the sources of information used in online health forums and how users and participants use this information, including their different roles and contribution behavior in such forums. To examine these questions, we can build on a UK study on online self-harm discussion forums [
Our observational study takes advantage of free access to a large German online forum related to multiple sclerosis, with the aim of identifying (1) sources of information used in online health forums, and (2) roles and patterns of behavior of people actively engaging in the forum in introducing and disseminating this information.
In this observational study, we extracted the content of an online health forum with a custom Web crawler to build a large database of discussions. We then used an Information Retrieval algorithm, designed and implemented specifically for this task, to identify a comprehensive sample of posts dealing with CCSVI.
The database for the study comprised contributions posted to the online forum of the Deutsche Multiple Sklerose Gesellschaft (DMSG, German Multiple Sclerosis Society) [
Between 01 January 2008 and 17 August 2012, all 139,912 posts from 11,997 threads were extracted. Because the forum is about MS in general, only a fraction of the extracted posts were expected to be about CCSVI. Preliminary analysis showed that the assumption of “one thread discusses one topic” does not hold in the observed forum. Instead, users tended to deviate from the original topic as time progressed. Therefore, a custom Information Retrieval algorithm was developed to classify individual posts as either relevant (“discussing CCSVI at least partially”) or irrelevant. For details on the algorithm design, training, and evaluation, see
A screenshot of a forum post.
Because the term “expert patient” implies intelligent use of scientific information, we aimed to assess to what degree scientific sources were used in the forum. Users occasionally included hyperlinks in their posts, and these links referred to content on which the users based their opinions. We analyzed which of these links were references to scientific papers in order to get an overview of the kind of papers cited and the temporal citation patterns. Two steps were necessary for this identification process.
First, we generated a presumably exhaustive list of publications dealing with CCSVI. A citation network starting from Zamboni’s original publication and using the CiteXplore Web service was constructed [
Second, a program fetched every hyperlink (also those in “irrelevant” posts) from the corpus, extracted the textual content from the referenced webpage or PDF document and searched it for titles or publication IDs from the publication list. In the case of a hit, one of the authors verified whether one of the publications was indeed referenced and, if so, which one. Every match was also classified as either a direct reference or an indirect reference. An indirect reference in this context was regarded as a resource that solely discussed or explained a certain publication, not including other work based on the publication. A direct reference linked to the publication itself.
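The automated part of this second step can be illustrated with a minimal sketch. It assumes the page text has already been fetched and extracted, and that the publication list is available as a mapping from a publication ID to a title. The function names, the ID format, and the example titles are hypothetical and are not taken from the study's program. A hit only marks a candidate, which would still need the manual verification described above.

```python
import re

def normalize(text):
    """Lowercase the text and collapse every run of non-alphanumeric characters to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def find_candidate_publications(page_text, publications):
    """Return the IDs of publications whose title or ID occurs in the page text.

    `publications` maps a publication ID (hypothetical format) to its title.
    """
    haystack = f" {normalize(page_text)} "
    hits = []
    for pub_id, title in publications.items():
        needles = (normalize(title), normalize(pub_id))
        # Pad with spaces so that only whole-word sequences match
        if any(n and f" {n} " in haystack for n in needles):
            hits.append(pub_id)
    return hits

# Placeholder publication list (invented entries, for illustration only)
publications = {
    "ID-0001": "An example title about venous insufficiency",
    "ID-0002": "Another example title about ultrasound findings",
}
page = "A blog post discussing: An Example Title About Venous Insufficiency (full text)."
candidates = find_candidate_publications(page, publications)
```

Normalizing both sides makes the match tolerant of differences in case and punctuation, at the cost of occasional false positives, which is why a manual check follows.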
Apart from searching for scientific information sources in the posts, we also strove to identify other information sources used or referenced in the posts. In order to obtain an overview of the wide spectrum of referenced websites, we defined a classification scheme. First, we reduced every URL found in the reduced corpus to the basic domain part of the URL (ie, only “domainname.com” was used; any additional content after the domain name, such as directories, folders, webpages, or file extensions, was removed). Second, we classified the remaining domains into the 8 classes shown in
Primary domain classes.
Organization | Meant in a broader sense, including foundations, associations, and unions. These are sometimes professional and often promote some kind of agenda. |
Commerce | Private business selling products or services that do not include treatment. |
News | Commercial news providers. |
Other | Various content not fitting into the other classes. |
Personal | Static content from a single person. |
Scientific | Sources of scientific work and knowledge including Wikipedia. We included the latter in this class, because its reliability was established in [ |
Social | Social media websites revolving around communication and user-generated content. |
Health care providers | Doctors’ offices, clinics, Q&A by professionals. Not limited to Multiple Sclerosis. |
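The two-step scheme (URL reduction, then domain classification) could be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: the function names and the lookup table are hypothetical, and keeping only the last two host labels is a naive simplification that fails for multi-part suffixes such as "co.uk". The three example entries follow the text, which names YouTube and Facebook as social media and places Wikipedia in the Scientific class.

```python
from urllib.parse import urlparse

def basic_domain(url):
    """Reduce a URL to 'domainname.tld', dropping scheme, subdomains, path, and query."""
    if "://" not in url:
        url = "http://" + url  # urlparse only recognizes the host when a scheme is present
    host = urlparse(url).hostname or ""
    # Assumption: the last two labels form the basic domain (wrong for e.g. 'co.uk')
    return ".".join(host.split(".")[-2:])

# Hypothetical, manually maintained lookup table from basic domain to domain class
DOMAIN_CLASSES = {
    "youtube.com": "Social",
    "facebook.com": "Social",
    "wikipedia.org": "Scientific",
}

def classify_url(url):
    """Return the domain class of a URL, or 'Unclassified' if the domain still needs manual review."""
    return DOMAIN_CLASSES.get(basic_domain(url), "Unclassified")
```

For example, `classify_url("https://de.wikipedia.org/wiki/CCSVI")` reduces the URL to "wikipedia.org" and returns "Scientific".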
To characterize user behavior, we tried to identify distinct behavior patterns. Since nothing was known in advance about the behavior patterns of forum users, we employed a method of exploratory data analysis to reveal possible patterns. A clustering algorithm groups users based on their similarity according to a set of predefined features. We thus wanted to define two separate feature sets with the aim of describing two different aspects of user behavior and revealing patterns in these features through clustering. We employed the popular
Two behavioral aspects in particular were analyzed in detail by separate clusterings: (1) the preference for discussed sources of information, and (2) the general contribution behavior or posting habits. In the first clustering, we focused on the hyperlinks from each of the 8 domain classes. A user was represented by a vector in 8-dimensional space: for example, a value of 3 for the 2nd dimension meant the user had posted 3 hyperlinks from the domain class “Organization”. The second clustering focused on 9 quantitative features describing what and how a user had posted. The features (measures) were either taken from similar approaches discussed in the literature [
In both cases, the
In the first clustering, we had to compensate for different general activity levels of users because we wanted to group the users according to their information source preferences only. We divided every vector by its Euclidean norm in order to obtain unit vectors showing only “taste” (preference), but not “activity”. In the second clustering, the different features had different scales. For instance, users often showed several hundred days of activity, but the fraction of their initiated threads cannot, by definition, exceed 1. We thus performed a
We visualized the resulting clusters in radar charts [
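The normalization used for the first clustering, dividing each user's 8-dimensional count vector by its Euclidean norm, can be illustrated with a short sketch. The function name and the example counts are hypothetical. The example places a count of 3 in the 2nd dimension, as in the text, and an arbitrary count in another dimension.

```python
import math

def to_unit_vector(counts):
    """Scale a vector of hyperlink counts to Euclidean length 1.

    Returns a zero vector unchanged to avoid division by zero.
    """
    norm = math.sqrt(sum(c * c for c in counts))
    if norm == 0:
        return [0.0] * len(counts)
    return [c / norm for c in counts]

# Two hypothetical users with the same preference but a tenfold difference in activity
light_user = [0, 3, 0, 0, 0, 0, 4, 0]
heavy_user = [0, 30, 0, 0, 0, 0, 40, 0]

light_unit = to_unit_vector(light_user)
heavy_unit = to_unit_vector(heavy_user)
```

Both users end up with the same unit vector, so a distance-based clustering would treat them as identical in "taste" regardless of how many hyperlinks each posted.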
Definition of behavior features.
Measure | Definition | Rationale |
Average message length (from [ |
Average post content length in characters without counting references. | The message length is an indicator of the amount of effort that is put into a post by a user and it also tells us something about the discussion style of a user. Some users prefer elaborate, essay-like contributions while others use the forum in a more conversational way. |
Average number of posts per day (from [ |
Average number of posts per day that a user made. | This is the most important activity feature of a user and it also provides an insight into the selectiveness of the user. A user with a high number of posts per day over a long time period can be expected to be a frequent visitor, who makes posts regardless of outside events. |
Average number of references per post | Average number of unique references that are included in a post. | The feature describes the tendency of a user to bring new sources of information to the forum and may also describe the ability to support the stance of the user with evidence. |
Average number of threads per day (from [ |
Average number of different threads a user posts to per day. | While this is also an activity feature, it provides an insight into the focus of interest a user has. A low value may indicate a preference to discuss only specific topics while a high value may indicate a preference to join any sort of discussion. |
Days active (from [ |
Number of days between the first post and the last one. | The feature indicates the consistency of the contribution behavior and posting habits of a user and is an important piece of context information when interpreting the other features. |
Fraction of posts that were cited | Fraction of the posts that have been cited at least once. | While it can only be assumed what users try to express when they use the citation function, the feature is expected to show the tendency to provoke direct responses from other forum participants. |
Fraction of relevant posts | Fraction of the posts that were classified as relevant by the Information Retrieval algorithm. | This feature is a solid indicator of the user’s interest in CCSVIa. While it cannot be inferred from this feature alone whether the user has a pro-CCSVI or anti-CCSVI stance, it seems plausible that users with a high interest in CCSVI believe in the hypothesis. |
Fraction of initiated threads (from [ |
Fraction of the threads the user initiated based on the total number of threads the user contributed to. | This feature measures the tendency of a user to start discussions, which is often related to the introduction of new information to the forum. |
Coverage of users in relevant parts per post | Number of users the user discussed CCSVI with divided by the total number of posts the user made. An uninterrupted sequence of relevant posts is regarded a single discussion. The users that co-occurred in these discussions are counted as discussion partners. | This feature can be described as the efficiency in opinion exchange about CCSVI. |
a CCSVI: Chronic Cerebrospinal Venous Insufficiency.
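The last feature in the table, coverage of users in relevant parts per post, is the least obvious to compute, so a minimal sketch may help. It assumes each thread is a chronologically ordered list of (author, is_relevant) pairs, that discussions do not span threads, and that "total number of posts" counts relevant and irrelevant posts alike. These assumptions, the function name, and the example data are hypothetical and are not taken from the study's implementation.

```python
from collections import Counter, defaultdict
from itertools import groupby

def coverage_per_post(threads):
    """Map each user to (number of distinct CCSVI discussion partners) / (total posts by the user)."""
    partners = defaultdict(set)
    post_counts = Counter()
    for thread in threads:
        post_counts.update(author for author, _ in thread)
        # An uninterrupted run of relevant posts counts as one discussion
        for is_relevant, run in groupby(thread, key=lambda post: post[1]):
            if not is_relevant:
                continue
            authors = {author for author, _ in run}
            for author in authors:
                partners[author] |= authors - {author}
    return {user: len(partners[user]) / n for user, n in post_counts.items()}

# One hypothetical thread: two relevant runs separated by an irrelevant post
thread = [
    ("anna", True), ("ben", True),   # discussion 1: anna and ben
    ("anna", False),                 # off-topic post interrupts the run
    ("cara", True), ("anna", True),  # discussion 2: cara and anna
]
coverage = coverage_per_post([thread])
```

Here "anna" has 2 discussion partners over 3 posts (coverage 2/3), while "ben" and "cara" each have 1 partner over 1 post (coverage 1.0).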
We detected hyperlinks pointing to CCSVI-related scientific publications in 31 posts.
The large differences in the total number of posted references per month correlate roughly with the total number of relevant posts. Interestingly, the highest peak (September 2010 - November 2010) was observed when the aforementioned phases shifted. The external events causing the other significant fluctuations are not known. However, when the total number of posted references rose from a given point in time to another, the change was typically reflected in all of the domain classes, which indicates a certain echo of external events equally affecting the different types of resources. The plot also shows how quickly the topic caught on in the layperson forum and that users seemed to have lost interest in the debate, as suggested by the few references posted in 2012.
Timeline of posted hyperlinks for each domain class.
We included only a fraction of the users in the clustering because we wanted to focus on those who took part in CCSVI discussions. Furthermore, a sufficient amount of information about each user was required. Therefore, we clustered only users who had posted at least 5 relevant hyperlinks, in the case of hyperlink use (first clustering). In the case of posting habits (second clustering), we included only users who had made at least 5 relevant posts. The filtering process is shown as a flow diagram in
The first clustering of the users into 6 groups revealed clusters shown in
Clustering users, who had made at least 5 relevant posts, revealed the 6 groups shown in
Flowchart of the sampling procedure for clusterings.
Venn diagram showing the user sets used in the clusterings.
Reference use clusters with number of users in each cluster (n=64 included cases).
Radar chart showing aggregated domain class use of each cluster (the user vectors belonging to the cluster are summed up). Each cluster vector is normalized to be a unit vector. The length of a spoke is proportional to the value it represents.
Posting behavior, according to the second clustering, with number of users in each cluster (n=171 included cases).
Radar chart showing feature means (over all users within a cluster) of the contribution behavior clusters. The means are min-max-normalized to a [0;1] range. The length of a spoke is proportional to the value it represents.
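The min-max normalization mentioned in this caption can be sketched as below. The caption does not state over which values the minimum and maximum are taken. This sketch assumes each feature is scaled across the cluster means, so that for every feature the cluster with the lowest mean gets 0 and the one with the highest gets 1. The function name, the cluster labels, and the numbers are hypothetical.

```python
def min_max_normalize(cluster_means):
    """Scale each feature's cluster means to the range [0, 1].

    `cluster_means` maps a cluster label to a dict of feature name -> mean value.
    A feature with identical means in all clusters is mapped to 0.0.
    """
    features = list(next(iter(cluster_means.values())))
    scaled = {cluster: {} for cluster in cluster_means}
    for feature in features:
        values = [means[feature] for means in cluster_means.values()]
        low, span = min(values), max(values) - min(values)
        for cluster, means in cluster_means.items():
            scaled[cluster][feature] = (means[feature] - low) / span if span else 0.0
    return scaled

# Invented cluster means for two of the nine features
example_means = {
    "A": {"days_active": 100.0, "fraction_initiated_threads": 0.2},
    "B": {"days_active": 300.0, "fraction_initiated_threads": 0.2},
    "C": {"days_active": 500.0, "fraction_initiated_threads": 0.2},
}
spokes = min_max_normalize(example_means)
```

Scaling each feature separately keeps large-valued features such as days active from dominating the chart, although it also stretches small absolute differences between clusters to the full spoke length.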
The bulk of the observed contributions were not based on scientific results, but on various social media sources. These sources seem to contain mostly opinions and personal experience. A small group of people with distinct behavioral patterns played a core role in fuelling the discussion about CCSVI. The identification of this group was an unintended consequence of our exploratory analysis technique. Our identification method is behavior driven and thus provides a viable alternative to the influence-based identification of so-called “opinion leaders” in forums, as discussed in [
Scientific publications were brought to the forum at a “boom phase” of CCSVI discussion, followed by a phase of critical views, beginning September 2010 with the opponents of the CCSVI hypothesis getting the upper hand in the forum. Although scientific and lay discourse seem to go hand in hand, it is obvious that scientific publications and scientific sources such as Wikipedia played, in the end, only a minor role in the layperson forum. Instead, social media were the most important source of information. The nature of social media content varies, but we believe that social media are often about personal experiences and exchange of opinions. This is further illustrated by the reference use patterns we identified, such as Social Media Fans or Homepage Promoters. We would suggest characterizing the nature of this lay discourse more as an elementary discourse or an interdiscourse [
Our 6 groups of posting behavior are based on a careful inspection of different characteristics and are similar to the participants in 5 online forums on self-harm [
Only a small set of the involved users showed enough activity to be suitable for meaningful descriptions of their behavior. This is consistent with the common observation of significant participation inequality in social media. Typically, activity levels are characterized by the power law with about 1% of the users exercising the core influence on a community [
The Highly Active Relational Posters are expected to be important community builders, as a substantial amount of personal “small talk” is attributed to them. Interestingly, a group of 17 people, the CCSVI Activators, played a core role in fuelling the discussion about CCSVI, because they often initiated threads about CCSVI and included many hyperlinks. While there is considerable concern that social media and Internet applications permit a minority of individuals to spread misinformation and damage useful interactions as recently discussed in the case of anti-vaccinationism [
Some Sophisticated Contributors were identified, but these people did not participate in CCSVI discussions very often. Additionally, 4 very short-lived and CCSVI-focused accounts were identified. One possible explanation is that they were the temporarily-used alternative accounts of some users.
One major advantage of this study is its observational nature. Real-world data was observed in an unobtrusive way. We analyzed a public Internet forum, which was unstructured and unmoderated, over a 3-year period of CCSVI discussion. We thus avoided self-reporting biases and artificial setups. Furthermore, we applied a Machine Learning approach in order to shed some light on the complex nature of user interaction.
However, there are several limitations. There is no demographic data available for the forum users and it is even possible that some persons used different accounts. Furthermore, before 27 August 2010, users were able to choose their aliases freely for every individual contribution. Due to the lack of a log-in mechanism, it is possible that different individuals posted under the same name.
The identification of relevant content was non-trivial and did not have 100% accuracy, which resulted in a possibly biased database. The reduction of URLs to the basic domain was a simplification. When assessing user patterns, we had to deal with small sample sizes (N=64 and N=171). The clustering approach itself relied on several assumptions. We assumed that constant behavioral patterns exist, that we defined appropriate features to describe them, and that they are linearly separable in the feature space. The interpretation of the assigned roles is subjective, but based solely on the quantitative data documented in this study.
We had to decide how to identify scientific sources of information in the posts. To be on the safe side, we accepted only the posting of URLs with a link to scientific publications as a use of scientific publications. Of course, other users may have discussed scientific publications in a rather elaborate way without posting URLs. Moreover, publications are often hidden behind a paywall, which may make posting URLs unpopular. They are also written in English, which may pose a language barrier. Our approach underestimates the discussion of scientific publications in online health forums but is highly specific in identifying the introduction of scientific publications.
Our description of participants in this online health forum was based solely on “metrics”, similar to the Jones et al study [
Scientific sources were far less important than social media in the posts and forum discussions. While some of the uncovered evidence may indicate the successful propagation of scientific results into discussions among laypeople within an online health forum, scientific results represented only a small fraction of the information sources discussed in the forum under study. Whether this is any indication of the rise of the “expert patient” remains a subject for further studies. Some of the participants in the forum, especially the Sophisticated Contributors, could be considered experts based on their contribution behavior, with rather extensive posts often including scientific and other references. However, they also represent only a tiny fraction of users, and before we can draw reliable conclusions we need to conduct semantic analyses of their statements. In contrast, the majority of users tend to rely on social media-based sources of information, which often feature personal experiences and opinions.
The health care system can be described as a two-sided network: a network with large components linked to each other through multiple platforms so that clinicians, health care institutions, and companies can interact with patients and communities [
Our study has used some sophisticated methods for extracting information on the posting behavior in online forums to address important questions in this field. To eliminate some of the limitations of the study and to determine more precisely the role and behavior of forum contributors with regard to scientific information, a qualitative approach is needed, preferably a discourse analysis of the social exchange processes and argumentative strategies in online health forums, similar to a Canadian study of online social support forums for gamblers, in which the interaction of the participants, their common discussions, and how they constructed identities and negotiated legitimacy were analyzed [
Forum information retrieval.
Clustering results data tables.
Referenced scientific publications.
CCSVI: Chronic Cerebrospinal Venous Insufficiency
DMSG: Deutsche Multiple Sklerose Gesellschaft (German Multiple Sclerosis Society)
MS: multiple sclerosis
We are indebted to Lara Weibezahl for critical input and review of the final draft of the paper and to Dr Richard Nicholas for a neurological perspective on CCSVI.
The Ethics Committee of the University Medical Center Göttingen confirmed (ref 11/5/13) that ethical approval was not necessary due to the nature of the data (secondary data analysis of anonymized data).
Some of the methods used and results are part of a Master’s thesis available online [
None declared.