This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Tools used to appraise the credibility of health information are time-consuming to apply and require context-specific expertise, limiting their use for quickly identifying and mitigating the spread of misinformation as it emerges.
The aim of this study was to estimate the proportion of vaccine-related Twitter posts linked to Web pages of low credibility and measure the potential reach of those posts.
Sampling from 143,003 unique vaccine-related Web pages shared on Twitter between January 2017 and March 2018, we used a 7-point checklist adapted from validated tools and guidelines to manually appraise the credibility of 474 Web pages. These were used to train several classifiers (random forests, support vector machines, and recurrent neural networks) using the text from a Web page to predict whether the information satisfies each of the 7 criteria. Estimating the credibility of all other Web pages, we used the follower network to estimate potential exposures relative to a credibility score defined by the 7-point checklist.
The best-performing classifiers were able to distinguish between low, medium, and high credibility with an accuracy of 78% and labeled low-credibility Web pages with a precision of over 96%. Across the set of unique Web pages, 11.86% (16,961 of 143,003) were estimated as low credibility, and they generated 9.34% (1.64 billion of 17.6 billion) of potential exposures. The 100 most popular links to low-credibility Web pages were each potentially seen by an estimated 2 million to 80 million Twitter users globally.
The results indicate that although a small minority of low-credibility Web pages reach a large audience, low-credibility Web pages tend to reach fewer users than other Web pages overall and are more commonly shared within certain subpopulations. An automatic credibility appraisal tool may be useful for finding communities of users at higher risk of exposure to low-credibility vaccine communications.
The spread of misinformation, which we define here to include communications that are not a fair representation of available evidence or communicate that evidence poorly, has become an increasingly studied topic in various domains [
Most studies aimed at finding or tracking misinformation on social media define misinformation using
A range of tools have been developed to assess the credibility of health information available on the Web. Most were designed as checklists to be used by experts to assess the credibility and transparency of what they are reading. The DISCERN tool was designed as a general purpose tool for evaluating the quality of health information [
Public perception of vaccines is an exemplar of the problem of misinformation spread through news and social media [
Given the rate at which new information is made available and the resources needed to appraise them, there is currently no way to keep up with new health-related stories as soon as they appear. Although the challenge of managing information volume versus quality was discussed two decades ago [
We sought to characterize the sharing and potential reach of vaccine-related Web pages shared on Twitter, relative to credibility. As it would not have been feasible to manually assess the credibility of all Web pages, we developed and evaluated classifiers to automatically estimate their credibility.
The study used a retrospective observational design. To estimate the credibility of vaccine-related Web pages shared on Twitter, we collected text from vaccination-related Web pages by monitoring links from tweets that mentioned relevant keywords. We manually appraised the credibility of a sample of Web pages by applying a checklist-based appraisal tool, using the sample to train classifiers to predict a credibility score in unseen Web pages. Applying an ensemble classifier to the full set of Web pages collected as part of the surveillance, we examined patterns of sharing relative to credibility scores.
We collected 6,591,566 English language, vaccine-related tweets and retweets from 1,860,662 unique Twitter users between January 17, 2017, and March 14, 2018, using the Twitter Search Application Programming Interface with a set of predefined search terms (including “vaccin*,” “immunis*,” “vax*,” and “antivax*”). For all unique users posting vaccine-related tweets during the study period, we collected the lists of their followers to construct the social network.
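As a rough illustration of how the wildcard search terms behave, the sketch below applies them as a local text filter. The regular expression, the function name `mentions_vaccines`, and the idea of filtering locally are assumptions made for illustration; the source lists the terms only as examples ("including") and does not describe the query syntax sent to the Twitter Search API.

```python
import re

# Hypothetical pattern built from the four wildcard stems quoted in the text.
# The study's full term list is not given, so this is only a partial stand-in.
VACCINE_TERMS = re.compile(r"\b(?:vaccin|immunis|vax|antivax)\w*", re.IGNORECASE)


def mentions_vaccines(text: str) -> bool:
    """Return True if any word in the text starts with one of the stems."""
    return VACCINE_TERMS.search(text) is not None
```

A word-boundary anchor keeps the stems from matching inside unrelated words, and the case-insensitive flag treats "Vaccine" and "vaccine" alike.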
We extracted 1.27 million unique URLs from the set of tweets to identify the set of text-based Web pages to include in the analysis. To restrict the set of Web pages to only English language text, we used a Google library [
To modify how we sampled tweets for constructing a manually labeled dataset, we used PubMed to search for vaccine-related research articles using the search terms “vaccine” or “immunisation” in the title or abstract, automatically expanded by PubMed to include synonyms and MeSH terms. The search returned 306,886 articles. We then used the PubMed identifiers of these articles with Altmetric (Digital Science) to identify Web pages (news, blogs, and social media posts) that linked to these articles via their digital object identifier, PubMed entry, or journal Web page. We found 647,879 unique URLs from Altmetric that cited the selected vaccine-related PubMed articles.
The intersection of the URLs extracted from Altmetric and the URLs extracted from the tweets allowed us to oversample from the set of Web pages for which we expected to have higher-credibility scores (described below). This approach also allowed us to exclude most of the URLs shared on Twitter that linked directly to research articles by removing the tweets that were identified by Altmetric.
The steps used to define the training dataset and automatically label Web pages.
The credibility appraisal tool was developed by 3 investigators (AGD, AD, and MS) with expertise in public health, public health informatics, science communication, and journalism. To develop a tool that would work specifically with vaccine-related Web pages, the investigators adapted and synthesized individual criteria from the following checklist-based tools and guidelines [
Centers for Disease Control and Prevention guide for creating health materials [
The DISCERN tool [
Health News Review criteria [
Media Doctor review criteria [
World Health Organization report on vaccination and trust [
The QIMR [
Using these documents as a guide, we adapted criteria from the DISCERN and QIMR checklists and added 2 criteria that were specific to vaccine-related communications. The tool was pilot tested on 30 randomly selected Web pages and iteratively refined through discussion among the 3 investigators. The resulting credibility appraisal tool included the following 7 criteria: (1) information presented is based on objective, scientific research; (2) adequate detail about the level of evidence offered by the research is included; (3) uncertainties and limitations in the research in focus are described; (4) the information does not exaggerate, overstate, or misrepresent available evidence; (5) provides context for the research in focus; (6) uses clear, nontechnical language that is easy to understand; and (7) is transparent about sponsorship and funding.
The 3 investigators then applied the credibility appraisal tool to an additional 474 vaccine-related Web pages. For each Web page, investigators navigated to the website, read the article, and decided whether it satisfied each of the 7 criteria. This process produced a set of values (0 or 1) for each criterion and Web page. We then summarized the information as a
The investigators independently undertook duplicate appraisals of a subset of the Web pages to measure inter-rater reliability, and it was found to be reasonable for separating Web pages as low, medium, or high credibility (Fleiss kappa 0.46; 95% CI 0.41-0.52;
The proportion of Web pages that met the individual criteria in the 474 Web pages used to train the classifiers. cri: criterion.
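The appraisal output can be pictured as a small data structure: 7 binary values per Web page, summarized as a single score. The sketch below assumes that the score is the count of criteria satisfied and uses the low (0 to 2) and high (5 to 7) bands given in a later figure caption of this article; the medium band (3 to 4) is inferred from those two, and the function names are hypothetical.

```python
from typing import Sequence

N_CRITERIA = 7


def credibility_score(criteria_met: Sequence[int]) -> int:
    """Count how many of the 7 checklist criteria a page satisfies."""
    if len(criteria_met) != N_CRITERIA:
        raise ValueError("expected one 0/1 value per criterion")
    if any(value not in (0, 1) for value in criteria_met):
        raise ValueError("criterion values must be 0 or 1")
    return sum(criteria_met)


def credibility_band(score: int) -> str:
    """Map a score to a band: low 0-2, medium 3-4 (inferred), high 5-7."""
    if not 0 <= score <= N_CRITERIA:
        raise ValueError("score must be between 0 and 7")
    if score <= 2:
        return "low"
    if score >= 5:
        return "high"
    return "medium"
```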
We compared 3 machine learning methods that are commonly used for document classification problems: support vector machines (SVM), random forests (RF), and recurrent neural networks (RNN). The SVM method trains a large-margin classifier that aims to find a decision boundary between 2 classes that is maximally far from any point in the training data. In the RF method, classification trees are constructed by randomly selecting a subspace of features at each node of the decision tree to grow branches. The method then uses bagging to generate subsets of training data for constructing individual trees, which are then combined to form the RF model. The RNN method refers to a class of artificial neural networks comprising neural network blocks that are linked to each other to form a directed graph along a sequence. The method is used to model dynamic temporal behavior for a time sequence, which is useful for modeling language.
The aim of these supervised machine learning techniques was to train a model to predict the class of an unseen document by learning how to distinguish the language used across classes. To apply the classifiers, we cleaned the text downloaded from Web pages by removing extra spaces, tabs, extra newlines, and nonstandard characters including emoticons. Each Web page was then included as a document in our corpus.
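The cleaning step described above might look like the following minimal sketch. The source does not define "nonstandard characters", so treating everything outside printable ASCII (plus newlines) as nonstandard is an assumption, as is collapsing repeated newlines to a single one; `clean_page_text` is a hypothetical name.

```python
import re


def clean_page_text(raw: str) -> str:
    """Normalize whitespace and drop characters outside printable ASCII."""
    text = raw.replace("\t", " ")                # tabs become spaces
    text = re.sub(r"[^\x20-\x7E\n]", "", text)   # assumption: drop non-ASCII, incl. emoticons
    text = re.sub(r" {2,}", " ", text)           # collapse runs of spaces
    text = re.sub(r" *\n *", "\n", text)         # trim spaces around line breaks
    text = re.sub(r"\n{2,}", "\n", text)         # collapse extra newlines
    return text.strip()
```

Note that this ASCII-only rule would also strip accented letters, so a corpus with many non-English names might call for a narrower definition.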
To develop the RNN classifier, we used average-stochastic gradient descent weight-dropped long short-term memory [
Language models are trained to capture the structure of the language used in a corpus of documents, and their performance is measured by their ability to predict the next word in a sentence based on the set of previous words. After the language model is trained for this task, the complete DL-based classifier is then fine-tuned to predict whether a document satisfies each of the credibility checklist criteria. Language models are often trained to learn the structure of the language in a target corpus, but recent advances in transfer learning have produced superior results including shorter training times and higher performance. An example is the Universal Language Model Fine-Tuning method [
We used transfer learning to create the language model backbone. The language model was developed with 3 layers, 1150 hidden units, and an embedding size of 400 per word, and the weights were initialized from a pretrained WikiText-103 language model produced by Howard et al [
The parameters and corresponding values for the initialization of the language model and classifier.
Parameters | Value |
Weight decay | 1.00E-04 |
Backpropagation through time | 60 |
Batch size | 52 |
Dropouts | 0.25, 0.1, 0.2, 0.02, 0.15 |
Embedding size | 400 |
Number of layers | 3 (language model), 5 (classifier) |
Optimizer | Adam |
β1, β2 | 0.8, 0.99 |
The performance difference of the language model (LM) for 2 different settings, including training loss (top-left), validation cross-entropy loss (top-right), and the accuracy of the LM predicting the next word in a sentence given previous words in the validation text (bottom).
For the SVM- and RF-based classifiers, we performed additional preprocessing to remove stop words and low-frequency words to improve accuracy. After preprocessing, there were 60,660 unique words used across the entire corpus; these were used as features for training and testing RF and SVM classifiers. Each document was represented as a feature vector, where features were defined by term frequency–inverse document frequency (tf-idf) weights. The tf-idf weight represents the importance of a word to a document in a corpus: it increases proportionally to the number of times the word appears in the document but is offset by the frequency of the word in the corpus, ensuring that the similarity between documents is more influenced by discriminative words with relatively low frequencies in the corpus. The best parameters for SVM and RF were found using the grid search functionality of
Using the expert-labeled data, we trained 21 classifiers (1 per criterion for each of the RF-, SVM-, and DL-based classifiers) and evaluated the performance of the classifiers in 10-fold cross-validation tests, reporting the average F1 score and accuracy for all 3 classifiers. Although the comparison of the performance across the set of classifiers may be of interest, our aim was to provide the basis for an ensemble classifier that could reliably estimate which of the criteria were met by each Web page.
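To make the tf-idf weighting described above concrete, the sketch below computes the textbook form, tf × log(N / df), in plain Python. It is an illustration under stated assumptions rather than the study's feature pipeline: library implementations commonly add smoothing and vector normalization, which are omitted here, and the function name `tfidf` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} mapping per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)  # raw term frequency within this document
        weights.append(
            {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in counts.items()}
        )
    return weights
```

In this unsmoothed form, a term that appears in every document receives a weight of 0, which is the "offset by the frequency of the word in the corpus" effect in its most extreme case.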
The parameters used for support vector machine and random forest classifiers; all other parameters are kept as default.
Parameters | Value |
Support vector machine | |
C | 100 |
Gamma | 1 |
Kernel | linear |
Norm | l1 |
Use-idf^a | TRUE |
Max-df^b | 1 |
N-gram range | (1,1) |
Random forest | |
N-estimators | 10 |
Criterion | Gini |
Min-impurity-split | 1.00E-07 |
^a Use-idf: when true, term weights are scaled by the number of documents they appear in.
^b Max-df: when set to 1, words that appear in every document are not removed.
Following the development of a reliable tool for automatically estimating the credibility of vaccine-related communications at scale, we aimed to characterize patterns of potential exposure to low-credibility vaccine communications on Twitter. For each Web page that met our study inclusion criteria, we estimated its credibility score using the best-performing classifiers for each criterion. We then aggregated the total number of tweets posted during the study period that included a link to the Web page, including tweets and retweets. We then estimated the
To examine how users posting links to low-credibility Web pages might be concentrated within or across subpopulations, we also estimated a per-user measure of credibility, which was defined by the list of credibility scores for any user sharing links to one or more Web pages. We used these lists in conjunction with information about followers to construct a
The RF classifiers produced the highest performance overall and, in most cases, predicted whether the text on a vaccine-related Web page satisfied each of the credibility criteria with over 90% accuracy (
When the best-performing classifiers were combined into an ensemble (SVM for criteria 3 and 7 and RF for all other criteria) to distinguish between low-, medium-, and high-credibility Web pages, the overall accuracy was 78.30%. In terms of labeling low-credibility Web pages, the ensemble classifier rarely mislabeled a high- or medium-credibility Web page as low credibility; more than 19 out of every 20 Web pages labeled as low credibility were correct.
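One way to picture the ensemble is as a lookup from criterion to classifier family, with the 7 binary predictions summed into a score. The sketch below follows the assignment stated in the text (SVM for criteria 3 and 7, RF otherwise); the `Predictor` type, the nested dictionary layout, and the function names are hypothetical, the medium band is inferred from the low and high bands given elsewhere in the article, and any callable returning 0 or 1 can stand in for a trained model.

```python
from typing import Callable, Dict

# Classifier family used per criterion, as reported in the text.
BEST_MODEL = {c: ("SVM" if c in (3, 7) else "RF") for c in range(1, 8)}

# Hypothetical interface: a trained model is any function text -> 0 or 1.
Predictor = Callable[[str], int]


def ensemble_score(text: str, models: Dict[int, Dict[str, Predictor]]) -> int:
    """Sum the predictions of the selected classifier for each criterion."""
    return sum(models[c][BEST_MODEL[c]](text) for c in range(1, 8))


def band(score: int) -> str:
    """Low 0-2 and high 5-7 follow the article; medium 3-4 is inferred."""
    return "low" if score <= 2 else "high" if score >= 5 else "medium"
```

Keeping the per-criterion choice in a single mapping makes it straightforward to swap in a different classifier for one criterion without touching the scoring logic.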
To consider the expected robustness of the classifiers, we additionally analyzed the set of terms that were most informative of low-credibility Web pages. We used a Fisher exact test to compare the proportion of low-credibility Web pages a term appeared in at least once relative to the proportion of other Web pages in which the term appeared at least once, examining the terms that were over-represented in either direction (
The results indicate a set of mostly general terms; terms that are most indicative of low-credibility Web pages are related to stories about individuals and individual autonomy (eg, “her,” “son,” “autistic,” “right,” and “allowed”), and terms that are most indicative of other Web pages are related to research and populations (eg, “institute,” “phase,” “placebo,” “countries,” “improve,” and “tropical”). The results suggest that the sample of Web pages used to construct the training data is a broad enough sample to capture general patterns rather than specific repeated topics that would limit the external validity of the approach.
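The term analysis above rests on a 2×2 table per term: low-credibility pages with and without the term, and other pages with and without it. The sketch below is a plain-Python two-sided Fisher exact test and odds ratio for such a table. The argument layout, the function names, and the choice of the common "sum of tables no more probable than the observed one" definition of the two-sided P value are assumptions, since the source does not state which variant was used.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided P value for the table [[a, b], [c, d]].

    a, b: low-credibility pages with / without the term
    c, d: other pages with / without the term
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x term-bearing pages in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables at most as probable as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Sample odds ratio; infinite when b*c is 0 and a*d is positive."""
    if b * c == 0:
        return float("inf") if a * d > 0 else float("nan")
    return (a * d) / (b * c)
```

An infinite odds ratio arises when a term never appears in one of the two groups, which is one way such values can enter a table of informative terms.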
Performance of the classifiers (average F1 score and accuracy in 10-fold cross-validation).
Criterion | Deep learning^a, mean (SD) | | Support vector machines^a, mean (SD) | | Random forests^a, mean (SD) | |
 | F1 score | Accuracy | F1 score | Accuracy | F1 score | Accuracy |
1 | 0.851 (0.005) | 0.740 (0.008) | 0.903 (0.032) | 0.842 (0.045) | 0.924 (0.019) | |
2 | 0.000 (0.000) | 0.638 (0.003) | 0.802 (0.044) | 0.828 (0.018) | 0.943 (0.006) | |
3 | 0.000 (0.000) | 0.865 (0.009) | 0.917 (0.011) | 0.745 (0.088) | 0.944 (0.018) | |
4 | 0.882 (0.001) | 0.789 (0.002) | 0.903 (0.042) | 0.833 (0.068) | 0.936 (0.022) | |
5 | 0.551 (0.249) | 0.486 (0.051) | 0.787 (0.034) | 0.721 (0.051) | 0.920 (0.020) | |
6 | 0.867 (0.002) | 0.765 (0.004) | 0.912 (0.006) | 0.852 (0.010) | 0.943 (0.004) | |
7 | 0.000 (0.000) | 0.840 (0.008) | 0.924 (0.006) | 0.764 (0.057) | 0.936 (0.004) |
^a The classifier with the highest F1-score is italicized for each criterion.
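For readers less familiar with the two metrics reported in the table, the sketch below computes accuracy and the F1 score for binary labels in plain Python. The function names are hypothetical, and the convention of returning 0 when there are no true positives, false positives, or false negatives is an assumption. The sketch also shows how an F1 score of 0 can coexist with a fairly high accuracy when a classifier never predicts the positive class; the source does not say whether this explains any particular row.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of labels predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if 2 * tp + fp + fn == 0:
        return 0.0  # assumed convention when the positive class never occurs
    return 2 * tp / (2 * tp + fp + fn)
```

With true labels `[1, 0, 0, 0]` and an all-negative prediction, three of four labels are correct, yet no positive case is ever recovered, so the F1 score is 0.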
A subset of the terms that were informative of low-credibility scores in the training set of 474 Web pages. Terms at the top are those most over-represented in low-credibility Web pages compared with other Web pages, and terms at the bottom are those most under-represented in low-credibility Web pages compared with other Web pages. OR: odds ratio; Inf: infinity.
Satisfied with the performance of the ensemble classifier, we then applied it to the full set of 143,003 unique vaccine-related Web pages, producing an estimated credibility score for every page. Fewer Web pages with low-credibility scores were shared on Twitter relative to those with medium- or high-credibility scores (
When we examined the total number of potential exposures by counting cumulative followers across all tweets and retweets for each Web page, we found that the distributions were similar (illustrated by the slopes of the 3 distributions in
The sum of tweets and retweets for links to included Web pages relative to the number of credibility criteria satisfied.
The distribution of potential exposures per Web page for low (orange), medium (gray), and high (cyan) credibility scores, where low credibility includes scores from 0 to 2, and high credibility includes scores from 5 to 7.
Measured by the proportion of total exposures to links to relevant Web pages, tweets linking to low-credibility Web pages produced 9.34% (1.64 billion of 17.6 billion) of total exposures, compared with 24.59% (4.33 billion of 17.6 billion) for high-credibility Web pages. This indicates that Twitter users sharing links to high-credibility and medium-credibility vaccine-related Web pages tended to have a greater number of followers than those sharing links to low-credibility vaccine-related Web pages. However, the shape of the distribution shows that some of the low-credibility Web pages may have been influential; the top 100 Web pages by exposure were included in tweets that may have been seen by 2 million to 80 million users, and more than 200 Web pages of low credibility were included in tweets that could have reached 1 million users.
Links to low-credibility vaccine-related Web pages were more heavily concentrated among certain groups of users posting tweets about vaccines on Twitter. This is evident in a visualization of the follower network for the set of 98,663 Twitter users who posted at least two links to Web pages included in the study (
A network visualization representing the subset of 98,663 Twitter users who posted tweets including links to vaccine-related Web pages at least twice and were connected to at least one other user in the largest connected component. Users who posted at least 2 high-credibility Web pages and no low-credibility Web pages (cyan) and those who posted at least two low-credibility Web pages and no high-credibility Web pages (orange) are highlighted. The size of the nodes is proportional to the number of followers each user has on Twitter, and nodes are positioned by a heuristic such that well-connected groups of users are more likely to be positioned close together in the network diagram.
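The highlighting rule in the caption above can be expressed as a small function over the list of credibility scores attached to each user. The sketch assumes the low (0 to 2) and high (5 to 7) bands used elsewhere in the article, and `user_highlight` is a hypothetical name; users who match neither rule are returned as `None`, which corresponds to nodes that are not highlighted.

```python
from typing import Iterable, Optional


def user_highlight(scores: Iterable[int]) -> Optional[str]:
    """Return 'cyan', 'orange', or None for one user's shared-page scores."""
    scores = list(scores)
    n_low = sum(1 for s in scores if s <= 2)    # low-credibility pages shared
    n_high = sum(1 for s in scores if s >= 5)   # high-credibility pages shared
    if n_high >= 2 and n_low == 0:
        return "cyan"
    if n_low >= 2 and n_high == 0:
        return "orange"
    return None
```

Medium-credibility pages do not affect either rule, so a user who shares two high-credibility pages and one medium-credibility page is still highlighted in cyan.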
We found that it is feasible to produce machine learning classifiers to identify vaccine-related Web pages of low credibility. Applying a classifier to vaccine-related Web pages shared on Twitter between January 2017 and March 2018, we found that fewer low-credibility Web pages were shared overall, though some had a potential reach of tens of millions of Twitter users. A network visualization suggested that certain communities of Twitter users were much more likely to share and be exposed to low-credibility Web pages.
This research extends knowledge related to the surveillance of health misinformation on social media. Where much of the prior research has aimed to label individual social media posts or the claims made on social media by veracity [
The work presented here is also different from previous studies examining opinions and attitudes expressed by Twitter users, which mostly label individual tweets or users based on whether they are promoting vaccination or advocating against vaccines [
Our study is also not directly comparable with previous studies that have examined how misinformation spreads through social media [
This study has implications for public health. The ability to measure how people engage with and share misinformation on social media may help us better target and monitor the impact of public health interventions [
There were several limitations to this study. Although we used a modified sampling strategy to ensure a more balanced representation of Web pages, the manually labeled sample used for training and internal validation was relatively small, and this might have affected the results in 2 ways. First, our results showed that the DL-based classifiers were less accurate than the RF-based classifiers, but this might have been the consequence of the available training data rather than the general value of the DL approach. Without testing on larger sets of training data, we are unable to draw reliable conclusions about the comparative performance of the machine learning methods. Second, in some document classification tasks where features are relatively sparse or many documents are very similar, using a smaller set of labeled examples can lead to overfitting. To avoid this, we were careful about removing duplicates and Web pages with overlapping text.
A second type of limitation relates to the choices we made about the methods. Other methods and architectures could have been used to predict credibility from text. For example, we could have used simpler methods including Naïve Bayes and logistic regression, used a single multi-label classifier to predict whether a document extracted from a Web page satisfied any of the criteria, or constructed a model that directly predicts the credibility score rather than the individual components.
A further limitation relates to the external validity of the classifier and our inability to draw conclusions about Web pages that do not include contiguous sections of text. We included only Web pages from which we could extract contiguous blocks of text and used a novel approach to sampling from those Web pages to create a reasonably balanced sample across the set of credibility scores. Other URLs included in vaccine-related tweets included links to other social media posts (including links to other tweets), links to YouTube and Instagram, links to memes in which text is embedded in an image, links to dynamic pages that no longer show the same information, and links to a range of other pages that included videos or images alongside a small amount of text. As we were unable to estimate the credibility of the vaccine-related information presented on these other Web pages, our conclusions are limited to the characterization of text-based Web pages. It is likely that a substantial proportion of Instagram, Facebook, and YouTube Web pages would receive a low-credibility score if they were assessed [
Our estimates of exposure were imperfect. To estimate how many Twitter users might have been exposed to information relative to credibility, we summed the follower counts of every user who posted the link. We did not count the total number of unique followers who might have seen the link, did not report the number of likes, and do not have access to the number of replies. In the absence of more detailed measures of engagement that can estimate the number of times a Web page was accessed via Twitter, we felt measures of potential exposure were a reasonable upper bound. The conclusions related to measures of potential exposure, therefore, need to be interpreted with caution, and further studies using robust epidemiological designs are needed to reliably estimate exposure.
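The potential exposure measure discussed above can be sketched as a sum of follower counts over posts. The sketch assumes that every tweet or retweet contributes the poster's full follower count, in line with the earlier description of counting cumulative followers across all tweets and retweets, and that followers are not deduplicated, as the limitation states; the data layout and the name `potential_exposures` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def potential_exposures(
    posts: Iterable[Tuple[str, str]], followers: Dict[str, int]
) -> Dict[str, int]:
    """Sum poster follower counts per URL over (user, url) posts.

    Followers shared between users are counted once per post, so the
    result is an upper bound rather than a count of unique viewers.
    """
    totals: Dict[str, int] = defaultdict(int)
    for user, url in posts:
        totals[url] += followers.get(user, 0)  # unknown users add nothing
    return dict(totals)
```

Because the same follower may follow several posters, and may never see a given tweet, this sum overstates actual views, which is why the text treats it as an upper bound.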
We developed and tested machine learning methods to support the automatic credibility appraisal of vaccine-related information on the Web, showing that it is feasible. This allowed us to scale our analysis of large-scale patterns of potential exposure to low-credibility vaccine-related Web pages shared on Twitter. We found that although low-credibility Web pages were shared less often overall, there were certain subpopulations where the sharing of low-credibility Web pages was common. The results suggest two new ways to address the challenge of misinformation, including ongoing surveillance to identify at-risk communities and better target resources in health promotion and embedding the tool in interventions that flag low-credibility communications for consumers as they engage with links to Web pages on social media.
DL: deep learning
HPV: human papillomavirus
QIMR: Quality Index for health-related Media Reports
RF: random forests
RNN: recurrent neural networks
SVM: support vector machines
tf-idf: term frequency–inverse document frequency
The study was funded by the National Health & Medical Research Council Project Grant APP1128968. The authors would like to thank Paige Martin for contributing to the research with the management of the database, data collection, and data synthesis and Maryke Steffens for the design and development of the credibility criteria and expert labeling.
None declared.