This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
An estimated 3.9 billion individuals live in a location endemic for common mosquito-borne diseases. The emergence of Zika virus in South America in 2015 marked the largest known Zika outbreak and caused hundreds of thousands of infections. Internet data have shown promise in identifying human behaviors relevant for tracking and understanding other diseases.
Using Twitter posts regarding the 2015-16 Zika virus outbreak, we sought to identify and describe considerations and self-disclosures of a specific behavior change relevant to the spread of disease—travel cancellation. If this type of behavior is identifiable on Twitter, this approach may provide an additional source of data for disease modeling.
We combined keyword filtering and machine learning classification to identify first-person reactions to Zika in 29,386 English-language tweets in the context of travel, including considerations and reports of travel cancellation. We further explored demographic, network, and linguistic characteristics of users who change their behavior compared with control groups.
We found differences in the demographics, social networks, and linguistic patterns of 1567 individuals identified as changing or considering changing travel behavior in response to Zika as compared with a control sample of Twitter users. We found significant differences between geographic areas in the United States, significantly more discussion by women than men, and some evidence of differences in levels of exposure to Zika-related information.
Our findings have implications for informing the ways in which public health organizations communicate with the public on social media, and the findings contribute to our understanding of the ways in which the public perceives and acts on risks of emerging infectious diseases.
Internet data, including data from social media platforms such as Twitter, have been used extensively in recent years to study health patterns and better understand infectious disease outbreaks [
A particularly successful area of research has used internet data to improve the forecasting of disease outbreaks. Several studies have found that these data, when combined with traditional sources of epidemiological data, can improve the surveillance and forecasting of seasonal diseases such as flu [
In this study, we have considered disease epidemics from the perspective of human
We have answered these questions by analyzing a collection of 29,386 English-language tweets filtered for keywords describing Zika and travel. We used a cascade of 3 machine learning classifiers to identify behavior mentions in tweets, and we have proposed a method of incorporating classifier error into our statistical analyses to test our hypotheses.
Mosquito-borne infections have long been known to cause large outbreaks that result in substantial morbidity and mortality. An estimated 3.9 billion individuals live in a location endemic for common mosquito-borne diseases, for example, dengue, chikungunya, and now, Zika [
For the overwhelming majority of those infected, Zika is a mild illness; most cases are asymptomatic [
Importantly, these causal relationships have only been recently established. In October 2015, Brazil reported an association between Zika cases and microcephaly, a condition where an infant’s head circumference is extremely small and is accompanied by severe developmental and health complications [
Travel advisories are an important public health intervention because of the documented impact of travel on the emergence of infectious diseases [
Simulations find that the impact of travel on disease spread varies based on a number of factors. For example, Bajardi et al found that travel restrictions could reduce cases but probably only minimally [
Internet data have been used to better understand individual health behaviors and health discourse on the Web. Studies have found evidence that users publicly discuss a variety of ailments [
As the largest known Zika outbreak occurred recently, researchers are only now beginning to investigate the use of internet data to understand this particular disease. McGough et al used an autoregressive modeling approach to combine epidemiological data from PAHO, Twitter, Google search queries, and reports from HealthMap to build short-term forecasts for several Central and South American countries. They found that the lowest error models were produced when using Google search query volumes [
Others have found important information in Twitter data. Stefanidis et al used tweets from the first 3 months of the outbreak to characterize discourse around Zika [
Sharma et al investigated information dispersion on Facebook and specifically noted that inaccurate or misleading posts were more popular than those with scientifically sound information [
Seltzer et al used Instagram to look at image-sharing practices around Zika [
Zika is likely to continue to be an emerging illness of concern with considerable impacts in South, Central, and North America [
Human behaviors directly impact disease transmission [
This section describes the process used to identify relevant tweets and the techniques used to train and tune the classifiers. We then provide details on the collection of the Twitter timeline and followee data used in later analyses.
Data processing and experimental overview. Dotted boxes show datasets and corresponding sizes where applicable. Solid boxes show methods used and reference relevant text figures or tables. Black arrows show the flow of data through the pipeline. The gray arrows denote that the final classifiers were used to identify first person, travel consideration, and travel change tweets from the keyword filtered tweets.
Our data come from a set of 15 million Zika-related tweets from March 1, 2015, to October 31, 2016, with about 7 million in English, described in Daughton et al [
Qualitatively, we observed that the bulk of these Zika tweets were sharing news or other information, usually with links to external articles. However, we also observed a number of English-language tweets describing personal or shared experiences with Zika, including behavior changes in response to concerns about Zika (eg, changing travel plans or buying a mosquito repellent). This section describes our approach to identifying such personal mentions of travel-related behavior through a pipeline of keyword filtering and supervised machine learning.
As personal mentions of travel behavior are a very small proportion of the dataset, we first filtered the dataset to provide a subset with a higher fraction of relevant tweets. This is a standard approach in many social media applications to obtain a large enough fraction of relevant instances to build a reasonably balanced training set [
To be as comprehensive as possible when constructing the list of travel-related terms, we included all major airlines in the United States and all airlines with flights to South America [
After filtering and excluding retweets, 29,386 English-language tweets matched these criteria.
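The two-stage filter described above can be sketched as follows. The keyword lists here are short illustrative examples, not the full lists used in the study:

```python
import re

# Illustrative keyword lists (not the study's full lists): a tweet must
# mention Zika and at least one travel-related term to pass the filter.
ZIKA_TERMS = {"zika", "zikv"}
TRAVEL_TERMS = {"travel", "trip", "flight", "vacation", "honeymoon", "airline"}

def is_retweet(text):
    """Crude retweet check; the study excluded retweets."""
    return text.startswith("RT @")

def keyword_match(text):
    """True if the tweet mentions both Zika and a travel term."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return bool(tokens & ZIKA_TERMS) and bool(tokens & TRAVEL_TERMS)

tweets = [
    "RT @news: Zika spreads ahead of the Olympics",
    "Thinking of cancelling our honeymoon because of Zika",
    "Zika vaccine research update",
]
kept = [t for t in tweets if not is_retweet(t) and keyword_match(t)]
# kept contains only the honeymoon tweet
```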
After keyword filtering, we still observed a variety of tweet topics in the data. This included mentions of changes in travel, opinions about the Olympics (which were hosted in Brazil during the outbreak), opinions about quarantining travelers, and general worry about Zika. The filters also captured tweets that were neither first person nor about travel, such as the headline,
To further filter the dataset to tweets of relevance to this study—tweets in which people express that they are personally changing or thinking about changing their travel behavior—we constructed 3 binary classifiers:
Each category only applies to tweets positively labeled with the previous category—travel consideration tweets must also be first-person tweets, and travel change tweets must also be travel consideration tweets.
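This hierarchical labeling scheme can be sketched as a simple pipeline, with the 3 trained classifiers represented here as placeholder predicate functions:

```python
def cascade_label(tweet, is_first_person, is_travel_consideration, is_travel_change):
    """Apply the 3 binary classifiers as a cascade: each stage runs only on
    tweets that the previous stage labeled positive. The three predicate
    arguments stand in for the trained classifiers."""
    labels = {"first_person": False, "travel_consideration": False, "travel_change": False}
    if not is_first_person(tweet):
        return labels
    labels["first_person"] = True
    if not is_travel_consideration(tweet):
        return labels
    labels["travel_consideration"] = True
    labels["travel_change"] = bool(is_travel_change(tweet))
    return labels
```

A tweet can thus be labeled travel change only if it was already labeled both first person and travel consideration.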
To create a training set for learning supervised classifiers, we randomly sampled 2000 English-language tweets from the keyword-filtered dataset and annotated them with the 3 categories above. Furthermore, 2 researchers independently annotated all tweets to measure agreement. As tweets were only labeled for travel consideration and travel change when they were labeled with the previous category, we only calculated agreement for these categories when annotators also agreed on the previous category. This can be interpreted as measuring: in the cases where annotators agreed on first person, what was their agreement on travel consideration?
Examples of each category, frequency, and agreement are shown in
Label frequency (%), annotator agreement (Cohen’s κ), and example tweets for each classification category.
Category | Example (paraphrased) | % (n/N) | κ |
First person | When Zika explodes after the Olympics, I’m going to say I told you so! | 41.15% (823/2000) | .52 |
Travel consideration | Thinking about going to Rio for honeymoon. Will I be safe with Zika? | 17.5% (350/2000) | .76 |
Travel change | So mad I had to cancel my island babymoon because of Zika | 10.8% (216/2000) | .66 |
All classifiers were binary logistic regression classifiers built using the Python package scikit-learn (version 0.19.1) [
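A minimal sketch of one such classifier in scikit-learn follows; the n-gram range and chi-square feature-selection percentage shown are illustrative placeholders, not necessarily the tuned values reported here, and the training texts are hypothetical:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.linear_model import LogisticRegression

# Illustrative hyperparameters; the study tuned the n-gram range and the
# percentage of chi-square-selected features via cross-validation.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), lowercase=True)),
    ("select", SelectPercentile(chi2, percentile=50)),
    ("model", LogisticRegression()),
])

# Toy training data (label 1 = travel change, 0 = other).
texts = [
    "I cancelled my trip because of Zika",
    "Zika vaccine trial begins",
    "We are not flying to Rio, too risky",
    "Health officials issue an update",
]
labels = [1, 0, 1, 0]
clf.fit(texts, labels)
pred = clf.predict(["Cancelling our Rio trip over Zika fears"])
```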
Performance results on the held-out test data are shown in
Precision, related to type I error (false positives), describes the proportion of selected items that are actually relevant (the percentage of tweets classified positive that are truly positive). Recall, related to type II error (false negatives), describes how many relevant items are selected (the percentage of truly positive instances in the full dataset that are classified positive). F1 combines these 2 metrics using a harmonic mean to describe the system overall. We show both the F1 of the pipelined approach (the final classifier) and the F1 score if each classifier is built independently (see
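These metrics can be computed directly from confusion-matrix counts; the counts below are hypothetical, for illustration only:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn);
    F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 80 true positives, 20 false positives, 10 false negatives.
p, r, f = precision_recall_f1(80, 20, 10)
# p = 0.80, r = 8/9 (about 0.889), f = 16/19 (about 0.842)
```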
Final precision, recall, and F1 of the 3 classifiers.
Classifier | Precision | Recall | F1 | F1 (no pipeline) |
First person | 0.89 | 0.94 | 0.92 | 0.92 |
Travel consideration | 0.61 | 0.74 | 0.67 | 0.63 |
Travel change | 0.66 | 0.81 | 0.73 | 0.65 |
Our analyses involve measuring the proportion of tweets classified as the various categories along different dimensions. When appropriate, we have provided CIs of these estimates. Our CIs are based on
We further modify this approach to account for the uncertainty present in the classifier, using the negative predictive value (NPV) and the positive predictive value (PPV). The NPV is the ratio of true negatives to the sum of true negatives and false negatives whereas the PPV (equivalent to precision in classification) is the ratio of true positives to the sum of true positives and false positives (see [
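One common way to apply such a correction is to re-estimate the number of true positives from the raw classifier counts. The sketch below illustrates the idea and may differ in detail from the exact procedure used in this study; the counts and rates are hypothetical:

```python
def corrected_positive_count(n_pos, n_neg, ppv, npv):
    """Adjust a raw classifier count for misclassification: of the n_pos
    tweets classified positive, about ppv * n_pos are true positives; of
    the n_neg classified negative, about (1 - npv) * n_neg are false
    negatives that actually belong to the positive class."""
    return n_pos * ppv + n_neg * (1.0 - npv)

# Hypothetical values: 1000 tweets classified positive, 9000 classified
# negative, PPV = 0.66, NPV = 0.97.
est = corrected_positive_count(1000, 9000, 0.66, 0.97)
# est = 660 + 270 = 930 tweets estimated truly positive
```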
Owing to the widespread attention the Zika outbreak received in the media, we wanted to determine whether other characteristics differentiate users who changed or considered changing travel from users who tweeted about Zika but did not discuss travel plans.
Using our labeled training data, we collected a set of 100 users sampled at random for each of the 3 classification categories. To construct comparison groups, we also sampled 100 users from the entire set of English-language Zika tweets, as well as 100 English-language users selected at random from all of Twitter. When sampling, we excluded verified users, as the inclusion of celebrities and other prolific accounts could bias the results. We then identified 3 sets of 100 users at random for each classifier. For each group, we collected the Twitter timelines of the users and the list of individuals they follow (their
Owing to Twitter’s application programming interface (API) restrictions on user timelines, we were only able to collect the most recent 3200 tweets for each user. This means that, for prolific users, we were often unable to collect tweets from the time period of the Zika outbreak itself. This could affect the analyses but will be a close approximation as long as these users have not substantially changed their tweeting behavior since 2016. Tweets were preprocessed in the same manner as described in the Classification section.
Applying the classifiers to the keyword-filtered tweets resulted in a final dataset of 13,225 first-person tweets, 3083 travel consideration tweets, and 1567 travel change tweets. This section describes the results of our analyses of these tweets and the users who posted these tweets.
Temporal trends in the 3 datasets are shown in
We also explored temporal differences in the destinations of the users’ cancelled travel. To do this, we manually labeled the destinations in all 1567 tweets that were classified in the
Temporal trends in classifications by week.
Temporal trends in decisions to change international (outside of the United States) and domestic (within the United States) travel.
To evaluate spatial trends, we geolocated tweets using Carmen [
We grouped tweets into geographic regions defined by the US Department of Health and Human Services (HHS). HHS Regions are regional groupings of states in the United States that are commonly used to aggregate states for health studies. As the traditional HHS Regions group geographically disparate states together (eg, Hawaii and island territories are grouped with mainland regions), we modified the HHS Regions as follows:
R1: Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, Vermont.
R2: New Jersey, New York.
R3: Delaware, District of Columbia, Maryland, Pennsylvania, Virginia, West Virginia.
R4: Alabama, Florida, Georgia, Kentucky, Mississippi, North Carolina, South Carolina, Tennessee.
R5: Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin.
R6: Arkansas, Louisiana, New Mexico, Oklahoma, Texas.
R7: Iowa, Kansas, Missouri, Nebraska.
R8: Colorado, Montana, North Dakota, South Dakota, Utah, Wyoming.
R9: Arizona, California, Nevada.
R10: Alaska, Idaho, Oregon, Washington.
Caribbean Islands: Puerto Rico, US Virgin Islands.
Pacific Islands: Hawaii, American Samoa, Northern Mariana Islands, Federated States of Micronesia, Guam, Marshall Islands, Republic of Palau.
We ultimately excluded both the Pacific Islands and Caribbean Islands from this analysis because there were not enough tweets classified in these regions (fewer than 50 tweets each).
As tweet volume varies by location, we created a type of per-capita estimate to adjust for the overall popularity of Twitter in each region. We collected a 1% sample of tweets from the Twitter streaming API over approximately 10 nonconsecutive days throughout December 2017 and January 2018 to normalize the estimates (42.1 million tweets). The number of tweets classified from each region was then divided by the total number of tweets from that region in the random sample.
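This normalization amounts to a simple ratio per region; a sketch with hypothetical counts:

```python
def weighted_volume(classified_counts, baseline_counts):
    """Divide the number of classified tweets in each region by the
    region's total tweet volume in a random baseline sample, adjusting
    for how popular Twitter is in each region."""
    return {region: classified_counts[region] / baseline_counts[region]
            for region in classified_counts}

# Hypothetical counts per modified HHS Region.
classified = {"R2": 120, "R4": 300}
baseline = {"R2": 400000, "R4": 500000}
rates = weighted_volume(classified, baseline)
# R4's weighted volume (6e-4) exceeds R2's (3e-4)
```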
As Zika is primarily a concern for women who are pregnant or trying to become pregnant, we investigated the relative percentage of women tweeting versus men (
Weighted volume of classified tweets by modified US Department of Health and Human Services Region. Bars show median weighted volume. Error bars represent 95% confidence intervals obtained using weighted bootstrapped sampling.
Relative percent of women in a sample of Twitter (red), English Zika dataset (orange), travel consideration dataset (yellow), and the travel change dataset (blue). Bars show 95% weighted bootstrapped confidence intervals.
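The figure captions above refer to bootstrapped confidence intervals. A generic percentile bootstrap can be sketched as follows; this is an unweighted illustration with made-up data, not the authors' exact weighted procedure:

```python
import random

def bootstrap_ci(values, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample with replacement, recompute the
    statistic, and take the alpha/2 and 1 - alpha/2 quantiles."""
    rng = random.Random(seed)
    stats = sorted(
        stat([rng.choice(values) for _ in values]) for _ in range(n_boot)
    )
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

# Hypothetical per-region proportions; the CI brackets their mean.
data = [0.2, 0.3, 0.25, 0.4, 0.35, 0.28, 0.31, 0.22, 0.37, 0.29]
lo, hi = bootstrap_ci(data, lambda xs: sum(xs) / len(xs))
```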
To better understand the factors that contribute to a decision to change travel, we compared the style and content of messages between users in the travel consideration and travel change groups with the random sample of Twitter users. We hypothesized that those who discuss Zika travel are more likely to talk about health in general than typical Twitter users and that those who consider changing travel may have higher levels of fear or anxiety.
We used Linguistic Inquiry Word Count (LIWC) [
For each user timeline and each LIWC category, we calculated the percentage of tweets that contain a term from the category. In this calculation, we excluded the tweets mentioning Zika so that the analysis does not reflect the same data used to select users. In addition, we restricted the analysis to timelines with a minimum of 10 tweets across the timeline. Finally, for each category, we calculated the average percentage across all timelines in each user group. The results are shown in
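The per-timeline calculation can be sketched as follows; the category terms and timelines shown are hypothetical toys, whereas the real LIWC categories use curated dictionaries:

```python
def category_prevalence(timelines, category_terms, min_tweets=10):
    """For each user timeline with at least min_tweets tweets, compute the
    fraction of tweets containing any term from the category, then average
    those fractions across timelines."""
    fractions = []
    for tweets in timelines:
        if len(tweets) < min_tweets:
            continue
        hits = sum(1 for t in tweets
                   if any(term in t.lower().split() for term in category_terms))
        fractions.append(hits / len(tweets))
    return sum(fractions) / len(fractions) if fractions else 0.0

# Hypothetical 10-tweet timelines and a toy "anxiety" term list.
timeline_a = ["i am anxious about zika"] * 3 + ["nice day today"] * 7
timeline_b = ["so worried right now"] * 5 + ["hello world"] * 5
avg = category_prevalence([timeline_a, timeline_b], {"anxious", "worried"})
# (3/10 + 5/10) / 2 = 0.4
```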
Compared with a random sample of Twitter users, users who tweeted about changing or considering changing travel in reaction to Zika are significantly more likely to use past and present tense, as well as terms indicating social processes, perhaps indicating increased planning. Travel consideration users are significantly more likely to use personal pronouns and singular first-person pronouns and were significantly higher in the anxiety category. Travel change users were significantly more likely to use plural first-person pronouns, had higher inhibition, and tweeted more about pregnancy. There are no significant differences between the travel consideration and travel change groups.
Contrary to our expectations, the travel groups do not tweet significantly differently from the overall Twitter population about health or bodily functions. This indicates that the users we identified as part of this behavior change pipeline were uniquely concerned about Zika and did not appear to be generally more aware or interested in discussing health-related topics on social media (with the important exception of pregnancy). It would be useful to explore more on this line of inquiry in future work, as understanding who talks about infectious diseases (and how) is of immediate interest to the disease surveillance community [
Average percent of Linguistic Inquiry Word Count category prevalence per group.
Type | Category | All Twitter | Consideration | Change |
Linguistic processes | Personal pronouns | 0.6080 | 0.7501 | |
Linguistic processes | 1st singular | 0.2788 | 0.3214 | |
Linguistic processes | 1st plural | 0.0458 | 0.0699 | |
Linguistic processes | 3rd singular | 0.0692 | 0.0699 | |
Linguistic processes | 3rd plural | 0.0474 | 0.0561 | 0.0571 |
Linguistic processes | Past tense | 0.1794 | ||
Linguistic processes | Future tense | 0.0648 | 0.0842 | 0.0871 |
Linguistic processes | Present tense | 0.6053 | ||
Psychological processes | Social processes | 0.7181 | ||
Psychological processes | Affective processes | 0.6648 | 0.7362 | 0.7587 |
Psychological processes | Positive emotion | 0.4323 | 0.5106 | 0.5105 |
Psychological processes | Negative emotion | 0.2290 | 0.2225 | 0.2440 |
Psychological processes | Anxiety | 0.0246 | 0.0331 | |
Psychological processes | Tentativeness | 0.1556 | 0.2019 | 0.2075 |
Psychological processes | Certainty | 0.1203 | 0.1437 | 0.1375 |
Psychological processes | Inhibition | 0.0470 | 0.0633 | |
Psychological processes | Biological processes | 0.2230 | 0.2712 | 0.2401 |
Psychological processes | Body | 0.0705 | 0.0787 | 0.0674 |
Psychological processes | Health | 0.0495 | 0.0744 | 0.0734 |
Psychological processes | Sexual | 0.0857 | 0.0648 | 0.0526 |
Other (non- Linguistic Inquiry Word Count) | Pregnancy | 0.0004 | 0.0106 |
aInstances where there are significant differences from the random sample. Significance is estimated using an unpaired 2-sided
As a final experiment, we examined the number of followees each of the randomly selected users had that were also present elsewhere in the Zika dataset—that is, the accounts a user follows that had at least one Zika-related tweet.
Although it is impossible to replicate Twitter’s algorithm for showing information on the timeline, we have the unique capability to look at network effects because we have 100% of the tweets during the time period that explicitly mention either
Indeed, we did find that those individuals who considered or changed their travel plans had a higher number of followees and tweets that they could have been exposed to in the sample. Although the travel groups had higher counts under every metric when compared with the control group, the difference is only significant under the normalized metrics.
The number of followees an individual user has who are also in the dataset, and the number of tweets that followees tweeted that are also in the dataset. We normalized to the number of total followees for each individual. Values in italics are significant (
Metric | All Twitter, median (95% CI) | Consideration, median (95% CI) | Change, median (95% CI) |
Number of followees (raw) | 92.8 (58.3-135.4) | 111.6 (71.1-170.9) | 122.2 (82.3-177.4) |
Number of followees (normalized) | 0.08 (0.06-0.11) | ||
Number of tweets (raw) | 93.6 (56.2-141.2) | 111.3 (67.7-169.8) | 122.7 (79.6-179.0) |
Number of tweets (normalized) | 1.71 (1.02-2.62) |
In an age where infectious diseases are emerging and re-emerging rapidly [
We present supervised classifiers that identify evidence of behavior changes with regard to concerns and changes in travel plans owing to Zika on Twitter. Although previous work has observed that individuals mention protective health behaviors on social media [
We additionally find significant differences in the gender distribution of users tweeting about travel consideration and change compared with the general population of Twitter. In particular, we find that the relative proportion of women engaging in conversation indicating travel change behaviors on Twitter is higher than that of men. This, in combination with the results of RQ2(b) discussed below, is evidence that pregnancy was playing a role in these considerations.
For comparison with existing knowledge on this subject, we discuss 2 small surveys (85 and 121 participants) conducted in New York (NY) [
There are several limitations of the data and our methodology that must also be considered. First, it is known that Twitter is a demographically biased data source [
Second, we recognize the lack of external validity owing to the absence of comparable ground truth data. We view this as a motivation for this research, where findings from this study can be viewed as hypotheses to test with future experiments. It is well known that human behaviors directly impact disease transmission [
Third, machine learning classifiers introduce error [
Finally, there are some limitations of our labeled dataset. It is relatively small compared with some previous work. We specifically chose not to scale up the annotation process with crowdsourcing [
In addition, the labeling criteria we used could introduce bias. In particular, we can only capture people who explicitly state that they are canceling travel and that they are doing so because of Zika. Research in this field is limited, but initial work on self-reports of cold and flu illness indicates that it is rare for individuals to tweet about their health concerns [
The results of this study show that people do describe first-person behavior changes on Twitter and that such tweets can be classified and analyzed at scale. In particular, we find that our behavior change classifier produces a dataset that corresponds to events during the outbreak and shows evidence of geographic and gender-based differences in the behavior change.
These data support hypotheses that social media can play a role in an individual’s health choices. Other research has shown that an important predictor of population health is knowledge and that this knowledge can be disproportionate across different geographical areas based on access to health care expertise [
Eventually, we envision these types of algorithms being used within the disease surveillance community. There is substantial previous work using internet data to gather traces of information about individuals’ health to monitor and forecast infectious disease outbreaks (eg, search query volumes used for Google Flu Trends). In principle, social media–derived data about behaviors that affect the spread of disease could be incorporated into forecasting models to better describe disease transmission dynamics. In future work, we plan to incorporate this type of data into such models.
In addition to monitoring and forecasting, data and conclusions from studies such as this work can inform preventative health messaging. Previous research has found that the ways infectious diseases are framed contribute in important ways to the public perception of the event’s severity [
Travel-related keywords used to filter tweets.
Cross-validated F1 scores for classifiers stratified by n-gram range and percentage of features used (based on chi-square).
application programming interface
US Department of Health and Human Services
Linguistic Inquiry Word Count
negative predictive value
New York
Pan American Health Organization
positive predictive value
research question
World Health Organization
ARD and MJP conceptualized the work, developed methodology, wrote associated software, performed the analyses and wrote and edited the original manuscript. MJP provided supervision and project administration.
The LANL publication number is LA-UR-18-24423.
MJP serves on the advisory board to Sickweather, a company that uses social media to forecast illness.