Twitter Discussions and Emotions About the COVID-19 Pandemic: Machine Learning Approach

Background: It is important to measure the public response to the COVID-19 pandemic. Twitter is an important data source for infodemiology studies involving public response monitoring.
Objective: The objective of this study is to examine COVID-19–related discussions, concerns, and sentiments using tweets posted by Twitter users.
Methods: We analyzed 4 million Twitter messages related to the COVID-19 pandemic using a list of 20 hashtags (eg, “coronavirus,” “COVID-19,” “quarantine”) from March 7 to April 21, 2020. We used a machine learning approach, Latent Dirichlet Allocation (LDA), to identify popular unigrams and bigrams, salient topics and themes, and sentiments in the collected tweets.
Results: Popular unigrams included “virus,” “lockdown,” and “quarantine.” Popular bigrams included “COVID-19,” “stay home,” “corona virus,” “social distancing,” and “new cases.” We identified 13 discussion topics and categorized them into 5 different themes: (1) public health measures to slow the spread of COVID-19, (2) social stigma associated with COVID-19, (3) COVID-19 news, cases, and deaths, (4) COVID-19 in the United States, and (5) COVID-19 in the rest of the world. Across all identified topics, the dominant sentiments for the spread of COVID-19 were anticipation that measures can be taken, followed by mixed feelings of trust, anger, and fear related to different topics. The public tweets revealed a significant feeling of fear when people discussed new COVID-19 cases and deaths compared to other topics.
Conclusions: This study showed that Twitter data and machine learning approaches can be leveraged for an infodemiology study, enabling research into evolving public discussions and sentiments during the COVID-19 pandemic. As the situation rapidly evolves, several topics are consistently dominant on Twitter, such as confirmed cases and death rates, preventive measures, health authorities and government policies, COVID-19 stigma, and negative psychological reactions (eg, fear). Real-time monitoring and assessment of Twitter discussions and concerns could provide useful data for public health emergency responses and planning. Pandemic-related fear, stigma, and mental health concerns are already evident and may continue to influence public trust when a second wave of COVID-19 occurs or there is a new surge of the current pandemic.


Introduction
Thirty million cases of COVID-19 have been confirmed across 110 countries as of mid-September 2020, and the death toll has reached close to 947,000 [1]. The widespread use of social media, such as Twitter, accelerates the process of exchanging information and expressing opinions about public events and health crises [2-5]. COVID-19 has been one of the trending topics on Twitter since January 2020 and has continued to be discussed to date. Since quarantine measures have been implemented across most countries (eg, the shelter-in-place order in the United States), people have been increasingly relying on different social media platforms to receive news and express opinions. Twitter data are valuable for revealing public discussions and sentiments related to various topics, as well as real-time news updates during global pandemics, such as H1N1 and Ebola [6-9]. Chew and Eysenbach's study [6] showed that Twitter could be used for real-time "infodemiology" studies, providing a source of opinions for health authorities to respond to public concerns. During the COVID-19 pandemic, many government officials worldwide have used Twitter as one of their main communication channels to regularly share policy updates and news related to COVID-19 with the general public [10].
Since the COVID-19 outbreak, a growing number of studies have collected Twitter data to understand the public responses to and discussions around COVID-19 [11-18]. For instance, Abd-Alrazaq and colleagues [11] adopted topic modeling and sentiment analysis to determine the main discussion themes and sentiments around COVID-19, using tweets collected between February 2 and March 15, 2020. Budhwani and Sun [14] compared Twitter discussions before and after March 16, 2020, when President Trump tweeted about the "Chinese virus," and found a significantly increased use of the phrase "Chinese virus" in people's tweets across many US states afterward. Mackey and colleagues [16] analyzed 3465 tweets collected between March 2 and 20, 2020, using a topic model to explore users' self-reported experiences with COVID-19 and related symptoms. Ahmed and colleagues [12] conducted social network analysis and content analysis of tweets collected between March 27 and April 4, 2020, to understand what may have driven the misinformation that linked 5G towers in the United Kingdom to the COVID-19 pandemic. As conversations on Twitter continue to evolve, it is worth continuing to use tweets as a data source to identify the salient topics discussed in response to the COVID-19 pandemic and to track how they change over time.
To expand the literature on public reactions to the COVID-19 pandemic, this study aims to examine the public discourse and emotions related to the COVID-19 pandemic by analyzing more than 4 million tweets collected between March 7 and April 21, 2020.

Research Design
We used a purposive sampling approach to collect COVID-19-related tweets published between March 7 and April 21, 2020. Our Twitter data mining approach followed the pipeline displayed in Figure 1. Data preparation included the following three steps: (1) sampling, (2) data collection, and (3) preprocessing the raw data. The data analysis stage included unsupervised machine learning, sentiment analysis, and thematic qualitative analysis. The unit of analysis was each message-level tweet. Unsupervised learning is one approach in machine learning; it is used to examine data for patterns, and derives a probabilistic clustering based on text data. We chose unsupervised learning because it is commonly used when existing studies have few observations of or insights into unstructured text data [19]. Since a qualitative approach would be challenging when analyzing large-scale Twitter data, unsupervised learning allows us to conduct exploratory analyses of large text data for social science research. In this study, we first employed an unsupervised machine learning approach to identify salient latent topics. We used a thematic analysis approach to develop themes further, allowing a deeper dive into the data, such as through manual coding and inductively developing themes based on the latent topics generated by machine learning algorithms.
Twitter's open application programming interface (API) allowed us to collect up-to-date Twitter messages, which are public by default. From March 7 to April 21, 2020, we collected 35,204,604 tweets (Figure 2). After removing non-English tweets, 23,817,948 tweets remained. After removing duplicates and retweets (ie, tweets that only repost the original message without adding any more words), we had 4,196,020 tweets in our final data set. We collected and downloaded the following features for each tweet: (1) the full text, (2) the numbers of favorites, followers, and followings, (3) users' geolocation, and (4) users' description/self-created profile.
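The filtering sequence described above (language filter, then retweet and duplicate removal) can be sketched roughly as follows. This is a minimal illustration and not the study's actual code: it assumes each tweet is a dictionary with a `text` field, a `lang` field, and, for pure retweets, a `retweeted_status` key (modeled on the Twitter API v1.1 payload). The helper name `filter_tweets` and the sample records are hypothetical.

```python
def filter_tweets(tweets):
    """Keep English, non-retweet, non-duplicate tweets, preserving order."""
    seen_texts = set()
    kept = []
    for tweet in tweets:
        # Step 1: drop tweets not tagged as English (assumed `lang` field).
        if tweet.get("lang") != "en":
            continue
        # Step 2: drop pure retweets (assumed `retweeted_status` marker).
        if "retweeted_status" in tweet:
            continue
        # Step 3: drop exact duplicates, ignoring case and outer whitespace.
        key = tweet["text"].strip().lower()
        if key in seen_texts:
            continue
        seen_texts.add(key)
        kept.append(tweet)
    return kept


# Hypothetical sample records, for illustration only.
sample = [
    {"text": "Stay home and stay safe", "lang": "en"},
    {"text": "Stay home and stay safe", "lang": "en"},
    {"text": "Quedate en casa", "lang": "es"},
    {"text": "RT @user: New cases reported today", "lang": "en",
     "retweeted_status": {}},
    {"text": "New cases reported today", "lang": "en"},
]
filtered = filter_tweets(sample)
```

In this toy input, two of the five records survive; how duplicates were actually defined in the study (exact text match or otherwise) is not stated, so the matching rule here is an assumption.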

Preprocessing the Raw Data
We used Python to clean the raw data (Figure 1). The process was as follows [18]:
1. We removed the hashtag symbol, @users, and URLs from the tweets in the data set.
2. We removed non-English characters (non-ASCII characters) because this study focused on tweets in English.
3. We removed special characters, punctuation, and stop-words [19] from the data set as they do not contribute to the semantic meanings of messages.
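The three cleaning steps above could look something like the following sketch, which uses only the Python standard library. The function name `clean_tweet` and the short `STOP_WORDS` set are hypothetical stand-ins; the study's actual stop-word list [19] and exact regular expressions are not given in the text. URLs are stripped first so that later punctuation removal does not break them into stray tokens.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "are", "for"}

# Translation table that maps every ASCII punctuation mark to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_tweet(text):
    """Return a list of lowercase tokens after the three cleaning steps."""
    # Step 1: remove URLs and @users, and drop the hashtag symbol
    # while keeping the hashtag word itself.
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    text = text.replace("#", " ")
    # Step 2: remove non-ASCII characters (emoji, non-English scripts).
    text = text.encode("ascii", "ignore").decode("ascii")
    # Step 3: remove punctuation and special characters, then stop-words.
    text = text.translate(PUNCT_TO_SPACE)
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


tokens = clean_tweet("@WHO Stay home! #COVID19 cases are rising 😷 https://t.co/abc123")
# tokens == ["stay", "home", "covid19", "cases", "rising"]
```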

Unsupervised Machine Learning
Latent Dirichlet Allocation (LDA) [20] is a widely used unsupervised machine learning approach that allows researchers to analyze unstructured text data (eg, Twitter messages). Based on the data itself, the algorithm produces frequently mentioned pairs of words, pairs of words that co-occur, and latent topics together with their distributions across the documents [21]. Existing studies have indicated the feasibility of using LDA to identify the patterns and themes of tweets related to COVID-19 [11,22].
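As a small illustration of the unigram and bigram counts that feed this kind of analysis, the sketch below tallies n-grams over tokenized tweets with the standard library. It is not an LDA implementation and is not the study's code; the function name `count_ngrams` and the three toy documents are assumptions made for the example.

```python
from collections import Counter


def count_ngrams(token_lists, n):
    """Count contiguous n-grams (as tuples) across tokenized documents."""
    counts = Counter()
    for tokens in token_lists:
        # zip over n shifted views of the token list yields its n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


# Hypothetical tokenized tweets, for illustration only.
docs = [
    ["stay", "home", "social", "distancing"],
    ["social", "distancing", "saves", "lives"],
    ["stay", "home", "stay", "safe"],
]
unigrams = count_ngrams(docs, 1)
bigrams = count_ngrams(docs, 2)
```

Here `bigrams[("stay", "home")]` and `bigrams[("social", "distancing")]` are both 2. A topic model would then group such co-occurring terms into latent topics, a step this sketch does not attempt.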

Qualitative Analysis
To triangulate and contextualize findings from the LDA model, we employed a qualitative approach to develop themes further. Specifically, we used Braun and Clarke's [23] six steps of thematic analysis: (1) getting familiar with the keyword data, (2) generating initial codes, (3) searching for themes, (4) reviewing potential themes, (5) defining themes, and (6) reporting. In addition to following the six-phase approach, our process was iterative and reflective by moving backward and forward through the six phases [24]. The thematic approach relied on human interpretation, a process that can be significantly influenced by personal understanding of the topics and a variety of biases. Two team members who have experience analyzing Twitter data documented their thoughts about potential codes in NVivo independently. Two other team members then reviewed the initial codes and considered whether they reflected the identified topics. For example, two team members collapsed several similar codes into one theme to ensure the topics corresponded meaningfully under one theme. The next stage was naming the themes to ensure the themes fitted into the overall meanings of the identified salient topics. We finalized themes corresponding to each of the 13 topics.

Sentiment Analysis
We used sentiment analysis, a natural language processing (NLP) approach, to classify the main sentiments of a given Twitter message, such as fear and joy [25]. In this study, we used the NRC Emotion Lexicon, which consists of 8 primary emotions: anger, anticipation, fear, surprise, sadness, joy, disgust, and trust [26]. We followed 4 steps to calculate the emotion index for each Twitter message: (1) removed articles and pronouns (eg, "and," "the," "to"), (2) applied a stemmer that strips a predefined list of prefixes and suffixes (eg, "running" becomes "run" after stemming) [27], (3) calculated the emotion index (if a sentence had multiple emotions, we only kept the emotion with the highest matching count), and (4) calculated the scores for each of the 8 emotion types. We discussed these 4 steps in detail in a previous study [18].
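The 4 steps above can be approximated with the following sketch. Everything named here is a stand-in: `TOY_LEXICON` is a made-up word-to-emotion mapping rather than actual NRC Emotion Lexicon entries, `naive_stem` is a crude suffix stripper rather than the stemmer cited in [27], and dividing by the total number of tweets in `emotion_shares` is an assumption about how the per-emotion scores were normalized.

```python
from collections import Counter

# Hypothetical mapping for illustration; not real NRC lexicon entries.
TOY_LEXICON = {
    "death": ("fear", "sadness"),
    "outbreak": ("fear",),
    "protect": ("trust",),
    "vaccine": ("anticipation", "trust"),
}
STOP_WORDS = {"and", "the", "to", "a", "of", "we"}
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip one common suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def dominant_emotion(text):
    """Return the emotion with the highest lexicon match count, or None."""
    tokens = [naive_stem(t) for t in text.lower().split() if t not in STOP_WORDS]
    counts = Counter()
    for token in tokens:
        counts.update(TOY_LEXICON.get(token, ()))
    return counts.most_common(1)[0][0] if counts else None


def emotion_shares(texts):
    """Share of tweets whose dominant emotion is each emotion type."""
    labels = Counter(dominant_emotion(t) for t in texts)
    labels.pop(None, None)  # tweets with no lexicon match carry no label
    return {emotion: n / len(texts) for emotion, n in labels.items()}


shares = emotion_shares([
    "the outbreak and deaths",   # fear matches twice, sadness once
    "we protect the vaccines",   # trust matches twice, anticipation once
    "stay home",                 # no lexicon match
    "outbreak",                  # fear matches once
])
```

With these toy inputs, half of the tweets are labeled fear and a quarter are labeled trust. When two emotions tie, this sketch keeps whichever was matched first, which is one possible convention among several.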

COVID-19-Related Topics
Our approach, LDA, produced frequently co-occurring pairs of words related to COVID-19 and organized these co-occurring words into different topics. LDA allowed us to manually define the number of topics (eg, 10 topics, 20 topics) that we would like to generate. Consistent with previous studies, we used the coherence model in Gensim (RARE Technologies Ltd) [28] to calculate the most appropriate number of topics based on the data itself. For this data set, the coherence scores indicated that 13 topics offered a high coherence score with the smallest number of topics (eg, although 19 or 20 topics would give a higher coherence score, they involve more topics; Figure 5). We further analyzed the document-term matrix and obtained the distributions of the 13 topics. We presented the results of the 13 salient topics and the most popular pairs of words (bigrams) within each topic in Table 2. For example, Topic 3 had the highest distribution (8.87%) among all 13 latent topics. The bigrams associated with Topic 3 included "tested positive," "coronavirus outbreak," "New York," "shelter place," and "mental health." These pairs of words frequently co-occurred, and therefore the LDA model assigned them to the same topic.
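The trade-off described for choosing the number of topics (a high coherence score with as few topics as possible) can be expressed as a simple selection rule. The sketch below is one possible formalization and not the study's procedure: the `tolerance` parameter, the function name `pick_num_topics`, and the coherence values are all hypothetical, and in practice the scores would come from a coherence model fitted for each candidate topic count.

```python
def pick_num_topics(coherence_by_k, tolerance):
    """Return the smallest topic count whose coherence is within
    `tolerance` of the best score observed."""
    best = max(coherence_by_k.values())
    candidates = [k for k, score in coherence_by_k.items()
                  if best - score <= tolerance]
    return min(candidates)


# Made-up coherence scores for illustration; not values from this study.
toy_scores = {2: 0.31, 4: 0.38, 6: 0.45, 8: 0.44, 10: 0.46}

chosen = pick_num_topics(toy_scores, tolerance=0.02)
strict = pick_num_topics(toy_scores, tolerance=0.0)
```

With the toy scores, a tolerance of 0.02 selects 6 topics even though 10 topics scores slightly higher, while a tolerance of 0 simply returns the best-scoring count. This mirrors the stated preference for fewer topics when the coherence gain from adding more is small.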

COVID-19-Related Themes
The thematic analysis enabled us to categorize these topics into distinct themes. The team considered the identified topics, bigrams, and representative tweet samples in each topic and categorized them into different themes. To protect the privacy and anonymity of the Twitter users, we did not present any user-related information, such as users' Twitter handles or other identifying information. Therefore, the sample tweets in Table 3 are excerpts drawn from the original tweets.
We organized 13 topics into 5 themes: "Public health measures to slow the spread of COVID-19" (eg, face masks, test kits, vaccine), "Social stigma associated with COVID-19" (eg, Chinese virus, Wuhan virus), "Coronavirus news cases and deaths" (eg, new cases, deaths), "COVID-19 in the United States" (eg, New York, protests, task force), and "Coronavirus cases in the rest of the world" (eg, UK, global issue). For example, the theme "public health measures to slow the spread of COVID-19" included the relevant topics of "facemasks," "quarantine," "test kits," "lockdown," "safety," "vaccine," and "shelter-in-place." In addition, "home quarantine" and "self-quarantine" were two of the most frequently co-occurring word pairs under the topic quarantine.
Table 3. Themes based on topic classification, bigrams, and sample tweets.

Theme: Public health measures to slow the spread of COVID-19
Topic: Face masks
Bigrams: face masks, wear masks
Sample tweet: We protect us and our family by wearing masks every day.
Bigrams: home quarantine,
Sample tweet: @realDonaldTrump @JustineTrudeau They're all under mandatory 2 week quarantine, and they are essential workers…

Theme: COVID-19 new cases and deaths
Topic: New cases
Bigrams: new cases, total number, confirmed cases
Sample tweet: RT @neeratanden: 4,591 people died in a day from the virus, the highest number anywhere ever that we know of.
Topic: Deaths
Bigrams: coronavirus death, death toll, people died
Sample tweet: #Britain's death toll could be DOUBLE official tally as care homes

Theme: COVID-19 in the United States
Topic: Mental health and COVID-19 in New York
Bigrams: new york, shelter place, mental health
Sample tweet: New Yorkers on their apartment roofs during quarantine is a whole different vibe. This is gonna be in history books
Topic: Protests against the lockdown
Bigrams: anti lockdown, people protesting, protesting stay
Sample tweet: I stand with the Healthcare workers!!! Bravo! Healthcare workers face off against anti-lockdown protesters in Colorado
Topic: Task force in the United States
Bigrams: task force
Sample tweet: RT @Jim_Jordan: There are #coronavirus task forces doing great work. But there is one task force that's missing in action: the U.S. congress
Bigrams: united states, white house, new jersey, 21 million, million people, dr fauci,
Sample tweet: Stay-at-home orders continue in much of the United States

Theme: COVID-19 cases in the rest of the world
Topic: United Kingdom
Bigrams: Herd immunity, UK lockdown, Prime Minister
Sample tweet: The Prime Minister gave the game away early on when he openly said to Scrofulous and Willibooby that the government's plan was Herd Immunity the REAL people in charge must have been so furious with him he had to be sent to an isolation ward with the virus to shut him up!
Bigrams: Entire world, south Korea, world health, global pandemic, new Zealand
Sample tweet: Worldwide it is now 182,726." And "New Zealand Prime Minster Jacinda Ardern says the government will partially relax its lockdown in a week, as a decline in …

Sentiment Analysis
We presented the results of the sentiment analysis for each of the 13 latent topics in Figure 6 and Table 4. Figure 6 presented 8 emotions of trust, anticipation, joy, surprise, anger, fear, disgust, and sadness. Results showed that across all 13 topics, anticipation (dark blue line) dominated 12 topics, followed by fear (orange line), trust (grey line), and anger (yellow line).
We also ran a one-tailed z test to examine whether each of the 8 emotions differed significantly across topics. A P value <.01 was set as the threshold for significance. For example, about 23.8% of tweets in Topic 5 revealed feelings of anticipation that "necessary steps and precautions will be taken" [18,29]. The test indicated that anticipation was expressed significantly more often in Topic 5 (23.8%) than in all other topics (P<.001). The emotion fear (of the impacts of the virus) was found in 18.8% of the tweets in Topic 10, which was significantly different from the fear expressed in other topics.
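One common way to run such a comparison is a pooled two-proportion z test, sketched below with the standard library. The text does not specify the exact test variant, so the pooled form is an assumption, as are the function name `one_tailed_two_proportion_z` and the tweet counts in the example (only the 23.8% share is taken from the results above).

```python
import math


def one_tailed_two_proportion_z(x1, n1, x2, n2):
    """Test H1: proportion 1 > proportion 2. Returns (z, one-tailed P)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 238 of 1000 tweets in one topic (23.8%) versus
# 1800 of 10000 tweets in all other topics (18.0%).
z_stat, p_val = one_tailed_two_proportion_z(238, 1000, 1800, 10000)
significant = p_val < 0.01  # threshold stated in the text
```

With these illustrative counts the z statistic is roughly 4.5, so the difference would clear the P<.01 threshold; with real data the counts per topic would replace the made-up values.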

Principal Results
In this study, we addressed public discussions and emotions using COVID-19-related messages on Twitter. Twitter users discussed 5 main themes related to COVID-19 between March 7 and April 21, 2020. Topic modeling of the tweets was useful for providing insights about COVID-19 topics and concerns. Results showed several essential points. First, the public uses a variety of terms when referring to COVID-19, including virus, COVID-19, coronavirus, and corona virus. Second, COVID-19 has been referred to as the "China virus," which can create stigma and harm efforts to address the COVID-19 outbreak [14]. Third, discussions about the pandemic in New York were salient, and its associated public sentiment was anger. Fourth, public discussions about the Chinese Communist Party (CCP) and the spread of the virus emerged as a new topic that was not identified in previous studies [18], suggesting the connection between COVID-19 and politics is increasingly circulating on Twitter as the situation evolves. Fifth, public sentiments on the spread of COVID-19 reveal anticipation for the potential measures that can be taken, followed by mixed feelings of trust, anger, and fear. Results suggest that the public is not surprised by the rapid spread of COVID-19. Sixth, people have a significant feeling of fear when they discuss the COVID-19 crisis and deaths. Lastly, trust is no longer a prominent emotion when Twitter users discuss COVID-19, which is different from the findings of an earlier study [18].
Compared with a study examining public discussions and concerns related to COVID-19 using Twitter data from January 20 to March 7, 2020, we found that several salient topics are no longer popular: (1) an outbreak in South Korea, (2) the Diamond Princess cruise ship, (3) the economic impact [11,32], and (4) supply chains [18]. Given current preventive measures, washing hands is no longer a prevalent topic; instead, quarantine has become dominant.
In addition, our study identified new discussion topics about COVID-19 occurring between March 7 and April 21: (1) the need for a vaccine to stop the spread, (2) quarantine and shelter-in-place orders, (3) protests against the lockdown, and (4) the COVID-19 pandemic in the United States. The new salient topics suggest that Twitter users (tweeting in English) are focusing their attention on COVID-19 in the United States (eg, New York, protests, task force, millions of confirmed cases) rather than global news (eg, South Korea, Diamond Princess cruise ship, Dr Li Wenliang in China).

Limitations
First, we only sampled 20 hashtags as the key search terms to collect Twitter data (Multimedia Appendix 1). New hashtags keep coming up as the situation evolves. For example, a hashtag may become widely used after a related topic becomes more popular, such as the official name for the virus (COVID-19). Second, Twitter users are not representative of the whole global population, and topics of tweets only indicate online users' opinions about and reactions to COVID-19. However, the Twitter data set is still a valuable resource, allowing us to examine real-time Twitter users' responses and online activities related to COVID-19. Third, non-English tweets were removed from our analyses, and hence the results are limited to users who posted in English only. Future COVID-19 studies should include other languages, such as Italian, French, German, and Spanish.

Future Research
Future research could further explore public trust and confidence in existing measures and policies, which are essential. Compared to prior work, our study showed that Twitter users had a feeling of joy when talking about herd immunity. Sentiments of fear and anticipation were related to the topics of quarantine and shelter-in-place. Future studies could evaluate how government officials (eg, President Trump) and international organizations (eg, World Health Organization) deliver and convey messages to the public, and the subsequent impact on public opinions and sentiments. Anti-Chinese/Asian sentiments spread on social media, and it would be worth assessing how people use these platforms to resist and challenge COVID-19 stigma. Misinformation during the COVID-19 pandemic was not a prominent theme in this study. An existing study showed that 25% (n=153) of sampled tweets contained misinformation [34]. The term COVID-19 was associated with lower rates of misinformation than #2019_ncov and Corona. Future research should investigate misinformation and how it spreads on social media. Finally, trust is no longer prominent when people tweet about confirmed cases and deaths. Instead, fear has replaced trust as the dominant emotion. Future research should examine the changes in trust over time.

Conclusions
Twitter data and machine learning approaches can be leveraged for infodemiology studies by studying evolving public discussions and sentiments during the COVID-19 pandemic. Our findings facilitate an understanding of public discussions and concerns about the COVID-19 pandemic among Twitter users between March 7 and April 21, 2020. Several topics were consistently dominant on Twitter, such as "the confirmed cases and death rates," "preventive measures," "health authorities and government policies," "stigma," and "negative psychological reactions" (eg, fear). As the situation rapidly evolves, new salient topics emerge accordingly. Fear arises in messages of new cases or death reports [18]. Real-time monitoring and assessment of Twitter users' concerns can be promising for informing public health emergency responses and planning. Hearing and reacting to real concerns from the public can enhance trust between the health care system and the public and enable better preparation for a future public health emergency.

Conflicts of Interest
None declared.