Characterizing Twitter Discussions About HPV Vaccines Using Topic Modeling and Community Detection

Background: In public health surveillance, measuring how information enters and spreads through online communities may help us understand geographical variation in decision making associated with poor health outcomes.

Objective: Our aim was to evaluate the use of community structure and topic modeling methods as a process for characterizing the clustering of opinions about human papillomavirus (HPV) vaccines on Twitter.

Methods: The study examined Twitter posts (tweets) about HPV vaccines collected between October 2013 and October 2015. We tested Latent Dirichlet Allocation (LDA) and Dirichlet Multinomial Mixture (DMM) models for inferring the topics associated with tweets, and two community detection methods, one based on community agglomeration (Louvain) and one based on the encoding of random walks (Infomap), for detecting the community structure of the users from their social connections. We examined the alignment between community structure and topics using several common clustering alignment measures and introduced a statistical measure of alignment based on the concentration of specific topics within a small number of communities. Visualizations of the topics and of the alignment between topics and communities are presented to support the interpretation of the results in the context of public health communication and the identification of communities at risk of rejecting the safety and efficacy of HPV vaccines.

Results: We analyzed 285,417 tweets about HPV vaccines from 101,519 users connected by 4,387,524 social connections. Examining the alignment between the community structure and the topics of tweets, the results indicated that the Louvain community detection algorithm together with DMM produced consistently higher alignment values and that alignments were generally higher when the number of topics was lower. After applying the Louvain method and DMM with 30 topics and grouping semantically similar topics in a hierarchy, we characterized 163,148 (57.16%) tweets as evidence and advocacy, and 6244 (2.19%) tweets as describing personal experiences. Among the 4548 users who posted experiential tweets, 3449 users (75.84%) were found in communities where the majority of tweets were about evidence and advocacy.

Conclusions: The use of community detection in concert with topic modeling appears to be a useful way to characterize Twitter communities for the purpose of opinion surveillance in public health applications. Our approach may help identify online communities at risk of being influenced by negative opinions about public health interventions such as HPV vaccines.

The plate (rectangle) represents a repetition of a variable (e.g. words in a document) and a circle represents the variable. An observed variable (e.g. words, w) is represented by a shaded circle, while an unobserved variable is represented by an unshaded circle (e.g. the topic mixture of a document, the topic assignment of a word, z, and the distributions over words for topics). The goal of a topic model is to learn the latent variables, which is a Bayesian inference problem.

Dirichlet Mixture Model
The Dirichlet Multinomial Mixture (DMM) model is a generative model that differs from LDA in that each document m is associated with a single topic z_m rather than a distribution over topics as in LDA [5]. Thus, DMM is a mixture model, whereas LDA is an admixture model.
Recently, Yin et al. [6] showed that DMM achieved significantly better performance on short-text clustering tasks, such as those on Twitter data sets.
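The contrast between a mixture and an admixture model can be illustrated with a minimal generative sketch. It is not the implementation used in the study: the function names, the symmetric Dirichlet prior, and the toy vocabulary of integer word ids are assumptions made for illustration only.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw a k-dimensional sample from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw one index according to a probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_dmm_doc(topic_weights, topic_word, length, rng):
    """DMM (mixture): a single topic z_m generates every word of document m."""
    z_m = sample_index(topic_weights, rng)
    words = [sample_index(topic_word[z_m], rng) for _ in range(length)]
    return z_m, words


def generate_lda_doc(alpha, topic_word, length, rng):
    """LDA (admixture): a per-document topic mixture, then one topic per word."""
    theta = sample_dirichlet(alpha, len(topic_word), rng)
    topics = [sample_index(theta, rng) for _ in range(length)]
    words = [sample_index(topic_word[z], rng) for z in topics]
    return topics, words


rng = random.Random(0)
# Hypothetical setup: 3 topics over a vocabulary of 20 word ids.
topic_word = [sample_dirichlet(0.5, 20, rng) for _ in range(3)]
topic_weights = sample_dirichlet(1.0, 3, rng)
dmm_topic, dmm_words = generate_dmm_doc(topic_weights, topic_word, 8, rng)
lda_topics, lda_words = generate_lda_doc(1.0, topic_word, 8, rng)
```

In the DMM draw a tweet-length document carries exactly one topic label, whereas the LDA draw yields one topic label per word, which is one intuition for why a single-topic assumption may suit very short texts.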

Cluster alignment
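The adjusted Rand index (ARI) is an extended version of the Rand index (RI), which measures the proportion of pairs of tweets on which the two groupings agree: pairs with the same topic that are placed in the same community, and pairs with different topics that are placed in different communities. The ARI assumes the generalised hypergeometric distribution as the model of randomness. Thus an ARI score is bounded above by 1, and a score close to 0 is expected if tweets are distributed at random among the communities. The ARI is defined as:
$$\mathrm{ARI} = \frac{\mathrm{RI} - E[\mathrm{RI}]}{\max(\mathrm{RI}) - E[\mathrm{RI}]}$$
where E[RI] is the expected value of the RI under the model of randomness.
The normalized mutual information (NMI) rescales the mutual information I(A;B) between the two labelings by their marginal entropies H(A) and H(B), where H represents marginal entropy, I represents mutual information, A={a1,...,aN} represents community labels, and B={b1,…,bN} represents topic assignments. A value close to 0 represents poor alignment, while a value of 1 represents perfect alignment between the community structure and the topics.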
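As a rough illustration, the two measures can be computed from a pair of label lists (one community label and one topic label per tweet) using only the standard library. The function names are hypothetical, and the geometric-mean normalization used for the NMI is an assumption: several normalizations are in common use and the text here does not say which one was applied.

```python
from collections import Counter
from math import comb, log, sqrt


def adjusted_rand_index(a, b):
    """ARI from pair counts, with the hypergeometric model of randomness."""
    n = len(a)
    if n < 2:
        return 1.0
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (pairs_joint - expected) / (maximum - expected)


def entropy(labels):
    """Marginal entropy H of a labeling (natural logarithm)."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information I(A;B) between two labelings."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in Counter(zip(a, b)).items()
    )


def normalized_mutual_information(a, b):
    """NMI with geometric-mean normalization (an assumed choice)."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 or h_b == 0:
        return 1.0 if h_a == h_b else 0.0
    return mutual_information(a, b) / sqrt(h_a * h_b)


# Toy example: four tweets, two communities, two topics.
communities = [0, 0, 1, 1]
aligned_topics = [1, 1, 0, 0]    # same grouping, different label names
crossed_topics = [0, 1, 0, 1]    # every community splits evenly across topics
```

Both measures depend only on how the tweets are grouped, not on the label names, so relabelled but identical groupings score 1.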
The purity of a community is the number of elements of the largest class (topic assignment) in the community divided by the total number of tweets in the community. Thus, the purity is defined as:
$$P(V_r) = \frac{1}{n_r} \max_k n_r^k$$
where $n_r$ is the size of a particular community $V_r$ and $n_r^k$ is the number of tweets in the community $V_r$ that are assigned to topic $k$. A purity close to 0 indicates a poor alignment between the community structure and the topics, and a purity of 1 represents a perfect alignment.
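The purity definition translates directly into a few lines of code. The sketch below is illustrative only; the function name and the toy labels are assumptions, and it reports one purity value per community as described in the text.

```python
from collections import Counter, defaultdict


def community_purity(communities, topics):
    """Per-community purity: share of tweets assigned to the community's largest topic."""
    counts = defaultdict(Counter)
    for community, topic in zip(communities, topics):
        counts[community][topic] += 1
    return {
        community: max(topic_counts.values()) / sum(topic_counts.values())
        for community, topic_counts in counts.items()
    }


# Toy example: community "a" has three tweets (two on topic 1), "b" has two (both on topic 3).
purity = community_purity(["a", "a", "a", "b", "b"], [1, 1, 2, 3, 3])
```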
The ARI, NMI and purity were used in an attempt to quantify how often individual topics were concentrated within a small number of communities. To do this, we compared clusters of tweets by topic with clusters of tweets by community, defining a community cluster as the set of tweets posted by any user within that community, and a topic cluster as all the tweets that were assigned to that topic. Figure A3 shows the ARI, NMI and purity scores for combinations of DMM and LDA with Louvain and Infomap, with the number of topics ranging from 5 to 200. Values close to 1 indicate good alignment between the community structure and the topics.
In general, the alignment between the community structure and the topics was higher across all measures for the DMM method compared to the LDA method. Under the assumption that we expected to observe a concentration of some topics within a small number of communities, and given that the topic modelling was undertaken without any consideration of the social connections between users, the results of these experiments suggest that the DMM method may have produced a more realistic clustering of the tweets by topic.

Individual topic concentration
We found that in the combination of the DMM method for topic modelling (with the number of topics set to 30) and the Louvain method for community detection, a random assignment of the 30 topics across the set of tweets, without any consideration of the community structure, most often required 9 communities to cover 95% of any topic. In the observed topic distribution, the TC95 values (the number of communities needed to cover 95% of a topic) range between 6 and 11 communities, and the majority of topics are 95% covered by 8 or fewer communities. The difference between the two distributions suggests that in the observed network, topics are more concentrated within communities than would be expected by chance.
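As a rough illustration of this concentration measure, the sketch below counts, for each topic, the smallest number of communities needed to cover 95% of its tweets, and builds a chance baseline by shuffling topic labels across tweets. The function names, the largest-first ordering of communities, and the single-shuffle baseline are assumptions made for illustration; the text does not spell out these details.

```python
import random
from collections import Counter, defaultdict


def topic_concentration(communities, topics, coverage=0.95):
    """For each topic, count the communities needed to reach the coverage share."""
    per_topic = defaultdict(Counter)
    for community, topic in zip(communities, topics):
        per_topic[topic][community] += 1
    result = {}
    for topic, counts in per_topic.items():
        needed = coverage * sum(counts.values())
        covered = 0
        used = 0
        # Assumed ordering: take communities from largest to smallest share of the topic.
        for _, count in counts.most_common():
            covered += count
            used += 1
            if covered >= needed:
                break
        result[topic] = used
    return result


def shuffled_baseline(communities, topics, seed=0, coverage=0.95):
    """Chance baseline: reassign topic labels at random, ignoring community structure."""
    rng = random.Random(seed)
    shuffled = list(topics)
    rng.shuffle(shuffled)
    return topic_concentration(communities, shuffled, coverage)


# Toy example: topic "x" is spread over three communities, topic "y" sits in one.
toy_communities = ["c1"] * 10 + ["c2"] * 5 + ["c3"] * 5 + ["c1"] * 20
toy_topics = ["x"] * 20 + ["y"] * 20
observed = topic_concentration(toy_communities, toy_topics)
baseline = shuffled_baseline(toy_communities, toy_topics)
```

Comparing the observed values with those from many shuffles (rather than the single shuffle shown here) would give a distribution of values expected by chance.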
We calculated the TC95 values for all topics with at least one tweet for each combination of the community detection and topic modelling methods, varying the number of topics between 5 and 200 (Figure A4). We found that when the number of topics was relatively low, the DMM method tended to find topics that had higher levels of concentration within communities.

Figure A5 shows the results of the manual intrusion test. Each value represents how many times in five test cases the investigator correctly identified the intrusion topics. Higher values mean that the intrusion topics were easily identified.

Figure A5. Manual intrusion test results. The numbers in the table represent how many times the investigator correctly identified the intrusion topics in the baseline topics. The colours on the horizontal and vertical axes represent the themes: harms/conspiracies (red), evidence/advocacy (green), and experiential (blue).