This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
The COVID-19 vaccine is considered to be the most promising approach to alleviate the pandemic. However, in recent surveys, acceptance of the COVID-19 vaccine has been low. To design more effective outreach interventions, there is an urgent need to understand public perceptions of COVID-19 vaccines.
Our objective was to analyze the potential of leveraging transfer learning to detect tweets containing opinions, attitudes, and behavioral intentions toward COVID-19 vaccines, and to explore temporal trends as well as automatically extract topics across a large number of tweets.
We developed machine learning and transfer learning models to classify tweets, followed by temporal analysis and topic modeling, on a dataset of COVID-19 vaccine–related tweets posted from November 1, 2020 to January 31, 2021. We used the F1 values as the primary outcome to compare the performance of the machine learning and transfer learning models, and report the corresponding test statistics and P values.
We collected 2,678,372 tweets related to COVID-19 vaccines from 841,978 unique users and annotated 5000 tweets. The F1 values of the transfer learning models were 0.792 (95% CI 0.789-0.795) for identifying opinions, 0.578 (95% CI 0.572-0.584) for classifying attitudes, and 0.614 (95% CI 0.606-0.622) for classifying behavioral intentions, significantly outperforming the machine learning models (logistic regression, random forest, and support vector machine). The prevalence of tweets containing attitudes and behavioral intentions varied significantly over time. Specifically, tweets containing positive behavioral intentions increased significantly in December 2020. In addition, we selected tweets in the following categories: positive attitudes, negative attitudes, positive behavioral intentions, and negative behavioral intentions. We then identified 10 main topics and relevant terms for each category.
Overall, we provided a method to automatically analyze the public understanding of COVID-19 vaccines from real-time data in social media, which can be used to tailor educational programs and other interventions to effectively promote the public acceptance of COVID-19 vaccines.
The outbreak of COVID-19 has affected 219 countries and territories with 102,083,344 confirmed cases causing 2,209,195 deaths as of January 31, 2021, as reported by the World Health Organization (WHO) [
Vaccine hesitancy, defined as “a behavior with delay in acceptance or refusal of vaccines despite available services,” was identified by the WHO as a global threat in 2019 [
With the rapid growth of internet-based applications, more people have begun sharing their opinions on social media platforms. In particular, during the current COVID-19 pandemic, people may increase their use of social media due to social distancing [
Machine learning and deep learning techniques have been used as efficient methods to detect public perceptions on social media platforms. In health care, researchers have developed deep learning models to perform longitudinal and geographic analyses to understand human papillomavirus (HPV) vaccine discussions [
Although previous studies have used machine learning and deep learning methods to extract knowledge in the context of other vaccines, several questions related to COVID-19 vaccines remain unanswered: What is the prevalence of user opinions on a social media platform? How many tweets express positive or negative attitudes and behavioral intentions toward vaccination? Which topics are most associated with this content? To answer these questions, we developed machine learning models (logistic regression, random forest, support vector machine) and transfer learning models to detect content expressing user opinions, attitudes, and behavioral intentions toward COVID-19 vaccines. We then performed a temporal analysis to explore trends over time and developed probabilistic topic models to obtain the most important and valuable topics. We believe that this study will benefit the timely rollout of COVID-19 vaccines by extracting the latest public opinions, attitudes, and behavioral intentions, which can help tailor promotion programs to different populations.
We collected tweets related to COVID-19 vaccines posted from November 1, 2020 to January 31, 2021, and annotated 5000 tweets as the gold standard. We developed machine learning and transfer learning models to classify tweets for three tasks: (1) opinions (yes, no); (2) attitudes (positive, negative, neutral); and (3) behavioral intentions (positive, negative, unknown). The above tasks all focused on COVID-19 vaccines. We then applied the models to predict unlabeled tweets and performed a temporal analysis to capture trends in the unlabeled tweets. In addition, we performed a topic analysis using word clouds and a latent Dirichlet allocation (LDA) model to further understand the content of tweets in the following categories: positive attitudes, negative attitudes, positive behavioral intentions, and negative behavioral intentions. The overall framework is shown in
Overall study framework. API: application programming interface.
We used a combination of keywords and hashtags related to COVID-19 vaccines to collect tweets in English published from November 1, 2020 to January 31, 2021. We intentionally chose November, following the announcement of the first effective vaccine on November 9, 2020, to determine whether the announcement of successful vaccine trial results might influence perceptions of vaccines or vaccination. The search strategy employed the following search terms: “(#covid OR covid OR #covid19 OR covid19) AND (#vaccine OR vaccine OR #vacine OR vacine OR vaccinate OR immunization OR immune OR vax) since:2020-11-01 until:2021-01-31 lang:en.” We used snscrape and tweepy in Python 3 to collect the data and to exclude retweets. To clean the original tweets, we removed nonalphanumeric characters and converted the text to lowercase.

We randomly selected 5000 tweets posted from November 1, 2020 to November 22, 2020, which were annotated by two independent reviewers (SL and JL) in batches of 200. Annotation disagreements were discussed and adjudicated by the supervising investigators. For each tweet, we first labeled whether it included a user opinion toward the COVID-19 vaccines (yes or no). We considered a tweet to include an opinion about the COVID-19 vaccines if it met both of the following conditions: (1) it targeted the COVID-19 vaccines and (2) it was generated by a user. For tweets that expressed user opinions toward the COVID-19 vaccines, we labeled the attitude (positive, negative, or neutral) and the behavioral intention (positive, negative, or unknown) toward COVID-19 vaccines. Attitudes were categorized using traditional emotional polarity, and the analysis of attitudes was performed at the aspect level. If both positive and negative attitudes toward COVID-19 vaccines were present in the same tweet, we labeled it in the neutral category. The coding rules were developed iteratively by our group: an independent review was performed, disagreements were discussed, and the coding rules were revised.
This process continued until the interrater agreement reached ≥0.80. The annotated corpus was used as a gold standard to train and evaluate the machine learning and transfer learning models.
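The text cleanup step described above (removing nonalphanumeric characters and lowercasing) can be sketched as follows; the exact regular expressions and whitespace handling are assumptions, not the study's published code:

```python
import re

def clean_tweet(text: str) -> str:
    """Strip nonalphanumeric characters and lowercase the text,
    mirroring the cleanup step described in the data collection section."""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # drop punctuation and symbols
    text = re.sub(r"\s+", " ", text).strip()     # collapse runs of whitespace
    return text.lower()

print(clean_tweet("Got my #COVID19 vaccine today!! Feeling GREAT :)"))
# → got my covid19 vaccine today feeling great
```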
For data preprocessing, we used the tweet-preprocessor package in Python 3 to remove URLs, hashtags, mentions, reserved words (eg, RT, FAV), emojis, smileys, and numbers from each tweet. We split the annotated dataset into three parts: training (60%), validation (20%), and testing (20%). The training and validation datasets were used to train the models and select optimal hyperparameters through 5-fold cross-validation. We developed traditional machine learning models (logistic regression, random forest, and support vector machine) on term frequency-inverse document frequency (TF-IDF) features for comparison with the transfer learning models. The machine learning models were developed using the scikit-learn package in Python 3.
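A minimal sketch of this machine learning pipeline in scikit-learn: TF-IDF features feeding a classic classifier, with 5-fold cross-validation for hyperparameter selection. The toy corpus, labels, and hyperparameter grid below are illustrative assumptions, not the study's data or settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data: 1 = positive behavioral intention, 0 = negative.
# Repeated so every cross-validation fold contains both classes.
texts = [
    "i will get the covid vaccine tomorrow", "so excited for my vaccine dose",
    "grateful to receive the shot today", "never taking this rushed vaccine",
    "i refuse the vaccine it is unsafe", "no vaccine for me ever",
] * 5
labels = [1, 1, 1, 0, 0, 0] * 5

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # unigram + bigram features
    ("clf", LogisticRegression(max_iter=1000)),
])
# 5-fold cross-validation over a small illustrative grid, scored by macro-F1
search = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]},
                      cv=5, scoring="f1_macro")
search.fit(texts, labels)
print(search.best_params_, round(search.best_score_, 3))
```

The same pipeline shape applies to the random forest and support vector machine models by swapping the final estimator.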
For transfer learning, we used BERT-base-cased as the pretrained language model and the “BERT for sequence classification” model as the pretrained classification model. Because the BERT model requires each sequence to have the same length, we padded or truncated each tweet to 64 tokens, as most tweets fall within this length. We then fine-tuned this model on the training and validation datasets using the Adam algorithm with weight decay (AdamW) as the optimizer. We performed three text classification tasks. We first developed a binary classifier to determine whether a tweet states an opinion related to the COVID-19 vaccines. We then developed two multiclass classifiers to categorize attitudes and behavioral intentions, respectively. The BERT models were generated using the huggingface package in Python 3. The models were developed on the Google Colab platform using a high-RAM GPU.
We evaluated the models on the testing dataset and report outcomes with 1000 rounds of bootstrapping. The primary outcome was the macro-F1 value and the secondary outcomes were recall, precision, and accuracy. We performed the Nemenyi test to compare the F1 values of traditional machine learning models and transfer learning models [
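The bootstrap procedure can be sketched as follows: resample the test set with replacement 1000 times and take the 2.5th and 97.5th percentiles of the macro-F1 values as a 95% CI. The synthetic predictions below stand in for real model output:

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1] * 20)  # toy test labels
y_pred = y_true.copy()
flip = rng.choice(len(y_true), size=20, replace=False)   # inject some errors
y_pred[flip] = 1 - y_pred[flip]

scores = []
for _ in range(1000):  # 1000 bootstrap rounds
    idx = rng.integers(0, len(y_true), size=len(y_true))  # resample w/ replacement
    scores.append(f1_score(y_true[idx], y_pred[idx], average="macro"))

lo, hi = np.percentile(scores, [2.5, 97.5])
print(f"macro-F1 95% CI: {lo:.3f}-{hi:.3f}")
```

The Nemenyi post hoc comparison is available in third-party packages (eg, scikit-posthocs) rather than scikit-learn itself.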
We applied the optimal models to predict the unlabeled data for 3 months starting from November 1, 2020. For the task of extracting opinions, we calculated the proportion of tweets classified as containing opinions to the total number of tweets posted each day about the COVID-19 vaccines. For the tasks of classifying attitudes and behavioral intentions toward the COVID-19 vaccines, we calculated the percentage of tweets predicted to exhibit a particular attitude or behavioral intention to all tweets indicating attitudes or behavioral intentions, respectively. To assess the statistical significance of variability over time, we performed the Augmented Dickey-Fuller (ADF) test [
To understand the content of tweets in each category, we used word clouds to illustrate the frequency of words appearing in the content. More frequently used words appear in larger sizes, indicating greater importance in the category [
We annotated 5000 tweets from 4796 unique users with an average interrater reliability (κ) of 0.76. The prediction performances of models on the testing dataset using four different algorithms for three tasks are presented in
Metrics of transfer learning models and machine learning models in classifying tweets related to COVID-19 vaccines.
Task | Recall, mean (95% CI) | Precision, mean (95% CI) | F1, mean (95% CI) | Accuracy, mean (95% CI)

Opinions
BERTa | 0.762 (0.759-0.766) | 0.862 (0.858-0.866) | 0.792b (0.789-0.795) | 0.854 (0.852-0.856)
Logistic regression | 0.774 (0.770-0.779) | 0.757 (0.753-0.762) | 0.764 (0.761-0.767) | 0.807 (0.805-0.810)
Random forest | 0.754 (0.750-0.758) | 0.732 (0.728-0.735) | 0.740 (0.737-0.743) | 0.783 (0.781-0.786)
Support vector machine | 0.767 (0.764-0.771) | 0.752 (0.748-0.755) | 0.758 (0.755-0.761) | 0.803 (0.801-0.806)

Attitudes
BERT | 0.529 (0.521-0.536) | 0.698 (0.686-0.710) | 0.578b (0.572-0.584) | 0.873 (0.871-0.875)
Logistic regression | 0.475 (0.468-0.482) | 0.530 (0.520-0.541) | 0.495 (0.490-0.500) | 0.859 (0.856-0.861)
Random forest | 0.518 (0.511-0.526) | 0.558 (0.545-0.570) | 0.508 (0.502-0.514) | 0.830 (0.827-0.833)
Support vector machine | 0.506 (0.498-0.514) | 0.551 (0.541-0.562) | 0.523 (0.517-0.530) | 0.863 (0.860-0.865)

Behavioral intentions
BERT | 0.562 (0.549-0.575) | 0.734 (0.716-0.752) | 0.614b (0.606-0.622) | 0.961 (0.960-0.962)
Logistic regression | 0.472 (0.461-0.483) | 0.725 (0.699-0.752) | 0.527 (0.519-0.536) | 0.951 (0.949-0.952)
Random forest | 0.447 (0.437-0.457) | 0.577 (0.543-0.611) | 0.466 (0.457-0.476) | 0.935 (0.934-0.937)
Support vector machine | 0.469 (0.458-0.479) | 0.710 (0.684-0.737) | 0.523 (0.513-0.533) | 0.950 (0.948-0.951)
aBERT: Bidirectional encoder representations from transformers.
bHighest F1 value among the four models for the task.
We collected 2,678,372 tweets related to COVID-19 vaccines posted by 841,978 unique users from November 1, 2020 to January 31, 2021. The daily prevalence distributions of opinions, attitudes, and behavioral intentions are shown in
Distribution of the prevalence of the tweets containing opinions (A), attitudes (B), and behavioral intentions (C) about COVID-19 vaccines for each day from November 1, 2020 to January 31, 2021.
After tuning hyperparameters of the LDA models, each model had 10 components (topics).
Intertopic distance maps for tweets that contained information in the following categories: negative attitudes (A), positive attitudes (B), negative behavioral intentions (C), and positive behavioral intentions (D).
1. worry, prevent, covid, stop, need, spread, symptom, transmission, catch, people, reduce, infection, virus, eat, doesn
2. death, covid, case, people, rate, die, number, cause, population, test, trial, fear, report, survival, day
3. risk, covid, test, people, health, worker, trial, know, need, woman, work, child, pregnant, safe, age
4. effect, long, term, know, covid, bad, unknown, risk, affect, people, study, concern, damage, potential, impact
5. covid, year, make, anti, month, mask, rush, people, want, safe, need, just, know, sense, wear
6. covid, dose, use, virus, immune, antibody, body, immunity, trial, second, make, protein, cell, test, response
7. virus, new, covid, strain, effective, work, mutate, year, develop, mutation, research, cold, variant, different, make
8. covid, people, just, say, think, make, know, trust, want, cure, government, believe, thing, come
9. covid, die, people, life, chance, treatment, old, kill, effective, want, say, sick, save, safe, family
10. flu, covid, reaction, shot, drug, adverse, expect, people, shoot, allergic, just, high, bad, year, polio
1. covid, thank, work, great, today, day, make, worker, scientist, happy, mom, care, just, hard, nurse
2. covid, feel, effect, day, long, arm, just, little, work, fine, hour, term, good, excited, sore
3. safe, stay, end, covid, news, pandemic, effective, trial, good, amp, light, continue, home, hope, step
4. covid, hope, soon, look, forward, normal, life, hopefully, come, available, new, world, news, return, year
5. covid, good, year, just, time, wait, thing, hope, think, come, pray, love, wish, news, day
6. people, covid, want, need, know, die, risk, just, really, say, think, make, life, safe, fear
7. covid, dose, receive, today, grateful, second, family, feel, patient, able, thankful, protect, friend, happy, excited
8. flu, virus, covid, make, immune, fight, sure, body, new, immunity, just, strain, world, distribute, cause
9. mask, wear, covid, stop, social, spread, distancing, hand, catch, need, distance, people, virus, stay, help
10. covid, vaccinate, amp, case, symptom, prevent, ready, immunity, just, mean, virus, reduce, life, rate, infection
1. covid, virus, stop, prevent, symptom, test, dose, immune, spread, mask, antibody, sick, just, catch, body
2. covid, flu, shot, shit, shoot, just, allow, work, win, scare, dead, year, virus, arm, sure
3. risk, covid, say, immune, make, high, virus, people, disease, just, healthy, sense, dangerous, case, good
4. want, covid, vaccinate, child, use, kill, kid, new, wait, way, cure, effective, doctor, just, people
5. covid, body, rate, vaccination, survival, choice, eat, mandatory, know, worry, life, fear, want, hear, need
6. covid, anti, just, tell, say, refuse, vaxxer, afraid, reason, people, stop, right, make, job, stupid
7. covid, year, trust, chance, inject, month, government, test, old, develop, cold, make, research, come
8. effect, know, long, term, covid, dna, affect, change, people, bad, rush, chance, unknown, study, test
9. people, covid, die, need, think, just, kill, family, care, damn, believe, say, real, death, chance
10. covid, force, try, reaction, people, bad, severe, look, allergic, medical, receive, say, fine, pay
1. covid, people, want, just, think, say, know, mask, wear, make, really, ask, scare, right
2. covid, want, need, look, tomorrow, let, know, life, forward, ready, dose, morning, normal, receive, volunteer
3. covid, wait, long, turn, effect, line, term, finally, eat, excited, worried, afraid, use, drink, polio
4. just, dose, covid, second, got, day, effect, symptom, receive, fever, ache, hour, experience, headache, body
5. flu, covid, shot, year, time, bad, shoot, sick, immune, just, need, month, make, think, doctor
6. covid, arm, sore, sign, just, feel, today, hour, little, hurt, yesterday, injection, far, nervous, appointment
7. work, covid, home, thank, stay, patient, hospital, help, safe, care, protect, family, trial, receive, vaccinate
8. covid, risk, immune, die, people, virus, chance, high, know, need, vaccinate, healthy, live, just, catch
9. covid, today, hope, mom, test, dose, happy, able, soon, dad, positive, receive, good, grateful
10. feel, covid, day, week, fine, great, make, shit, ago, better, worker, body, job, good, healthcare
Ten topics were extracted among the tweets that contained negative attitudes. The interactive display interface of pyLDAvis is shown in
For tweets containing positive attitudes, in a dominant topic (topic 3), relevant key terms included “safe,” “stay,” “end,” “pandemic,” “news,” “effective,” “trial,” “continue,” and “hope.” This indicates that some positive attitudes might have been derived from news of effective trial results, and that some users hoped COVID-19 vaccines could end the pandemic. Relevant terms for topic 4 were “hope,” “normal,” “life,” “return,” “start,” “new,” “world,” and “great.” Tweets in topic 4 showed that some users expressed positive attitudes toward vaccines because of the desire to return to a normal life.
PyLDAvis visualization highlighting the top 30 relevant keywords for a topic found in the tweets that contained negative attitudes toward COVID-19 vaccines.
For tweets containing negative behavioral intentions, topics 8 and 10 clustered independently; however, the other topics showed some degree of mutual inclusiveness, indicating similarities among them. Key terms for topic 8 were “effect,” “know,” “long,” “term,” “DNA,” “unknown,” and “rush.” This topic reflected that some users’ negative behavioral intentions stemmed from concerns about the long-term and unknown side effects of COVID-19 vaccines. As another distinct topic, the most relevant terms for topic 10 were “force,” “reaction,” “bad,” “allergic,” “pay,” “adverse,” and “government.” This analysis highlighted that some users said they would not take the vaccine if it were forced on them by the government, while others worried about adverse reactions to the COVID-19 vaccines. Some users compared COVID-19 to influenza and mentioned that, because they had not previously been vaccinated against influenza, there was no need to vaccinate against a disease they mistakenly believed had a similarly low lethality (topic 2). Other users reported that their immune system could naturally help them fight the virus.
For tweets containing positive behavioral intentions, mutual inclusivity existed among topics 1-4 and between topics 9 and 10. Other topics clustered independently. In topic 8, the keywords were “risk,” “immune,” “healthy,” “antibody,” and “immunity.” In this topic, users would like to become immune to the virus causing COVID-19 and stay healthy by being vaccinated.
In this study, we provided an annotated dataset of 5000 COVID-19 vaccine–related tweets with labels supporting three classification tasks (opinions, attitudes, and behavioral intentions). We demonstrated that transfer learning models can be used to analyze COVID-19 vaccine–related tweets and that they outperformed common machine learning models. We analyzed the temporal trends and topics in the COVID-19 vaccine–related tweets posted over a 3-month period (from November 1, 2020 to January 31, 2021). The prevalence of tweets containing positive behavioral intentions increased over time. The word clouds and the LDA analysis proved to be efficient tools for understanding the topics of tweets in each category.
Transfer learning is now widely used to analyze social media content. Some researchers have applied transfer learning with datasets of tweets related to COVID-19 [
Several researchers have applied the Valence Aware Dictionary and Sentiment Reasoner (VADER) tool [
Temporal analysis and topic modeling provide an efficient approach to monitor public perceptions of the COVID-19 vaccines on social media platforms. The following events could explain the significant increase in the prevalence of positive behavioral intentions in mid-December. The US Food and Drug Administration (FDA) issued an emergency use authorization for the Pfizer-BioNTech COVID-19 vaccine on December 11, 2020, turning the vaccines from a hypothetical situation into a reality. The United States launched its rollout to high-risk health care facilities on December 14, 2020. A large number of health care workers and influential figures such as Joe Biden received COVID-19 vaccines to increase public confidence. This also suggests that more people might be willing to be vaccinated after successful vaccine development and a large-scale rollout. Indeed, social influence has been shown to positively affect the acceptance rate [
This study has several limitations. First, users of the Twitter platform are not representative of the entire public. The Twitter platform is often considered to attract antivaccination users and to spread misinformation. This group of users is the main subgroup of the population with vaccine hesitancy and should therefore be one of the main targets of vaccine education. Compared to other populations, they tend to question vaccines from specific perspectives such as the alleged presence of microchips in vaccines [
For future work, we will perform a theory-based content analysis to gain insight into the reasons that led to the changes in behavioral intentions we noted in the temporal analysis. Using the transfer learning model in this study, researchers can automatically collect tweets containing COVID-19 vaccine–related behavioral intentions and systematically analyze the data through a theoretical model (eg, Capability, Opportunity, Motivation, Behavior model [
In this study, we presented an annotated corpus of 5000 tweets and analyzed the potential of transfer learning with a pretrained BERT model to automatically identify public opinions, behavioral intentions, and attitudes toward COVID-19 vaccines from social media. We demonstrated that transfer learning models generally outperformed traditional machine learning models. In addition, we explored the temporal trends of the public’s changing attitudes and behavioral intentions on a larger dataset of 2,678,372 tweets posted from November 1, 2020 to January 31, 2021. We found that the LDA technique is useful for extracting topics from identified tweets. Overall, we provided an automatic method to analyze the public’s understanding of COVID-19 vaccines from real-time data, which could be used to tailor education programs and other interventions to promote COVID-19 vaccine acceptance in a timely manner.
ADF: augmented Dickey-Fuller
BERT: bidirectional encoder representations from transformers
CT-BERT: COVID-Twitter-bidirectional encoder representations from transformers
FDA: Food and Drug Administration
HPV: human papillomavirus
LDA: latent Dirichlet allocation
VADER: Valence Aware Dictionary and Sentiment Reasoner
WHO: World Health Organization
This work was supported by Sichuan Science and Technology Program (grant number 2020YFS0162).
JLiu and SL conceived the study. SL, JLiu, and JLi performed the analysis, interpreted the results, and drafted the manuscript. All authors revised the manuscript. All authors read and approved the final manuscript.
None declared.