Twitter discussions and concerns about COVID-19 pandemic: Twitter data analysis using a machine learning approach.

The objective of the study is to examine coronavirus disease (COVID-19) related discussions, concerns, and sentiments that emerged from tweets posted by Twitter users. We collected 22 million Twitter messages related to the COVID-19 pandemic using a list of 25 hashtags such as "coronavirus," "COVID-19," and "quarantine" from March 1 to April 21, 2020. We used a machine learning approach, Latent Dirichlet Allocation (LDA), to identify popular unigrams and bigrams, salient topics and themes, and sentiments in the collected Tweets. Popular unigrams included "virus," "lockdown," and "quarantine." Popular bigrams included "COVID-19," "stay home," "corona virus," "social distancing," and "new cases." We identified 13 discussion topics and categorized them into different themes, such as "Measures to slow the spread of COVID-19," "Quarantine and shelter-in-place order in the U.S.," "COVID-19 in New York," "Virus misinformation and fake news," "A need for a vaccine to stop the spread," "Protest against the lockdown," and "Coronavirus new cases and deaths." The dominant sentiment regarding the spread of coronavirus was anticipation that measures can be taken, followed by mixed feelings of trust, anger, and fear across different topics. The public revealed a significant feeling of fear when discussing new coronavirus cases and deaths. The study concludes that Twitter continues to be an essential source for infodemiology studies, as it allows tracking rapidly evolving public sentiment and measuring public interests and concerns. Pandemic fear, stigma, and mental health concerns that have already emerged may continue to influence public trust if a second wave of COVID-19 or a new surge of the pandemic occurs. Hearing and reacting to real concerns from the public can enhance trust between healthcare systems and the public and help prepare for a future public health emergency.


Introduction
More than four million people had tested positive for COVID-19 across 110 countries as of May 2020, and the death toll had reached close to 300,000 [1]. The widespread use of social media, such as Twitter, accelerates the exchange of information and opinions about public events and health crises. COVID-19 has been one of the trending topics on Twitter in the four months since the outbreak. Since quarantine measures have been implemented across most countries (e.g., the Shelter-in-Place order in the United States), people have been increasingly relying on different social media platforms to receive news and express opinions. Twitter data are valuable in revealing public discussions and sentiments about topics of interest, and in providing real-time news updates during global pandemics, such as H1N1 and Ebola [2][3][4][5]. In the current COVID-19 pandemic, many government officials worldwide are using Twitter as one of their main communication channels to regularly share policy updates and news related to COVID-19 with the general public [6].
Although a growing body of empirical literature examines issues related to COVID-19 using Twitter data [6 7], most study samples were small, and analyses using large samples of Tweets remain scant. Furthermore, methodologically, data processing and analysis have mainly relied on traditional qualitative coding techniques to make meaning of tweets. To extend the literature on public responses to COVID-19, the present study examines public responses in the face of the pandemic by analyzing approximately 22 million Tweets collected between March 1 and April 21, 2020. We integrated both unsupervised machine learning methods and qualitative coding techniques to triangulate the findings.
The present study aims to examine: (1) What does the public discuss about the COVID-19 outbreak? (2) What are the communication patterns of the Twitter-based discussions about the pandemic? And (3) What concerns has the public expressed about the COVID-19 pandemic?

Research design
We used a purposive sampling approach to collect COVID-19 related Tweets published between March 1 and April 21, 2020. Our Twitter data mining approach followed the pipeline displayed in Figure 1. Data preparation included three steps: (1) sampling, (2) data collection, and (3) pre-processing the raw data. The data analysis stage included unsupervised machine learning, sentiment analysis, and qualitative analysis. The unit of analysis was the individual tweet (message level).
Unsupervised learning is a machine learning approach that examines data for patterns and derives a probabilistic clustering from the text data. We chose unsupervised learning because it is commonly used when existing studies offer few observations or insights about the unstructured text data [8]. Since a purely qualitative approach struggles with large-scale Twitter data, unsupervised learning allows us to conduct exploratory analyses of large text data in social science research. In the present study, we first employed an unsupervised machine learning approach to identify salient latent topics. We then used a qualitative approach to develop these topics into themes, as it allows a deeper dive into the data through manual coding and inductive development of themes based on the latent topics generated by the machine learning algorithm.

Pre-processing the raw data
As shown in Figure 1, we used Python to clean the raw data by removing non-English characters, hashtag symbols and their content, repeated words, special characters, punctuation, and numbers from the dataset. More details are discussed in previous work [9].
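The cleaning rules listed above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the function name `clean_tweet`, the regular expressions, and the reading of "repeated words" as immediately repeated tokens are all assumptions.

```python
import re

def clean_tweet(text: str) -> str:
    """Illustrative cleaning of a single tweet (rules are assumptions)."""
    # Drop hashtags together with their content, e.g. "#StayHome".
    text = re.sub(r"#\w+", " ", text)
    # Keep ASCII letters, apostrophes, and whitespace only. This removes
    # non-English characters, special characters, punctuation, and numbers.
    text = re.sub(r"[^A-Za-z'\s]", " ", text)
    tokens = text.lower().split()
    # Assumed meaning of "repeated words": collapse immediate repetitions,
    # so "stay stay home" becomes "stay home".
    deduped = [tok for i, tok in enumerate(tokens)
               if i == 0 or tok != tokens[i - 1]]
    return " ".join(deduped)

print(clean_tweet("Stay stay home!!! #COVID19 2020 don't panic"))
# stay home don't panic
```

Apostrophes are kept so that contractions such as "don't" survive, which matches the bigrams reported later in the paper.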

Unsupervised machine learning
Latent Dirichlet Allocation (LDA) [10] is one of the most widely used unsupervised machine learning approaches for analyzing unstructured text data (e.g., Twitter messages). Based on the data itself, the algorithm produces frequently mentioned pairs of words, pairs of words that co-occur, and the latent topics together with their distributions across the documents [11].
Existing studies have indicated the feasibility of using LDA to identify the patterns and themes of tweet texts related to COVID-19 [12 13].
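For readers less familiar with the model, the generative process commonly assumed by (smoothed) LDA can be written as below. The notation is the conventional one and is not taken from this study: $K$ is the number of topics, $D$ the number of documents (here, each tweet is treated as one document), $\alpha$ and $\beta$ are Dirichlet hyperparameters, $\theta_d$ is the topic mixture of document $d$, $\phi_k$ is the word distribution of topic $k$, and $z_{d,n}$ and $w_{d,n}$ are the topic assignment and the observed word at position $n$ of document $d$.

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{align*}
```

Fitting the model amounts to inferring $\theta_d$ and $\phi_k$ from the observed words alone, which is why no labeled data are needed.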

Qualitative analysis
To triangulate and contextualize findings from the LDA model, we employed a qualitative approach to develop themes further. Specifically, we followed Braun and Clarke's [14] six steps of thematic analysis: (1) getting familiar with the keyword data, (2) generating initial codes, (3) searching for themes, (4) reviewing potential themes, (5) defining themes, and (6) reporting. Since the thematic approach relies on human interpretation, a process that can be significantly influenced by personal understanding of the topics and a variety of biases, we had two team members conduct the first five steps independently. Then, the two members reviewed all independently identified themes and resolved disagreements together. Finally, we finalized the themes corresponding to each of the 13 topics.

Sentiment analysis
We used sentiment analysis, a natural language processing (NLP) approach, to classify the main sentiments of a given Twitter message, such as fear and joy [15]. In the study, we used the NRC Emotion Lexicon, which consists of eight primary emotions: anger, anticipation, fear, surprise, sadness, joy, disgust, and trust [16]. We followed four steps to calculate the emotion index for each Twitter message: (1) removing articles, pronouns, and similar function words (e.g., "and," "the," or "to"); (2) applying a stemmer that removes a predefined list of prefixes and suffixes (e.g., "running" becomes "run" after stemming) [17]; (3) calculating the emotion index, keeping only the emotion with the maximum matching count if a sentence matches multiple emotions; and (4) calculating the scores for each of the eight emotion types. We discussed the four steps in detail in the previous study [9].
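The four steps above can be sketched as follows. Everything specific in this block is an assumption made for illustration: `TOY_LEXICON` is a hypothetical six-entry stand-in for the NRC Emotion Lexicon, `toy_stem` is a crude suffix stripper rather than the stemmer cited in [17], ties between emotions are broken by a fixed order, and the per-emotion "score" is taken to be the number of tweets whose dominant emotion is that emotion.

```python
import re
from collections import Counter

EMOTIONS = ["anger", "anticipation", "fear", "surprise",
            "sadness", "joy", "disgust", "trust"]

# Hypothetical miniature lexicon (stem -> emotions); NOT the real NRC data.
TOY_LEXICON = {
    "death": {"fear", "sadness"},
    "afraid": {"fear"},
    "hope": {"anticipation", "joy"},
    "trust": {"trust"},
    "angry": {"anger"},
    "wait": {"anticipation"},
}
FUNCTION_WORDS = {"and", "the", "to", "a", "an", "of", "we", "are", "is", "it"}

def toy_stem(word):
    """Step 2 (toy version): strip one suffix, then undo consonant doubling."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if len(word) >= 3 and word[-1] == word[-2] and word[-1] not in "aeiouls":
        word = word[:-1]  # "runn" -> "run"
    return word

def dominant_emotion(text):
    """Steps 1-3: filter function words, stem, count matches, keep the max."""
    tokens = re.findall(r"[a-z']+", text.lower())
    stems = [toy_stem(t) for t in tokens if t not in FUNCTION_WORDS]
    counts = Counter()
    for stem in stems:
        counts.update(TOY_LEXICON.get(stem, ()))
    if not counts:
        return None  # no lexicon word matched
    # max() returns the first maximum, so ties follow the EMOTIONS order.
    return max(EMOTIONS, key=lambda e: counts[e])

def emotion_scores(tweets):
    """Step 4 (assumed definition): tweets per dominant emotion."""
    tally = Counter(dominant_emotion(t) for t in tweets)
    return {e: tally[e] for e in EMOTIONS}

print(dominant_emotion("We are afraid of the death toll"))  # fear
```

In the example, "afraid" and "death" both match fear while only "death" matches sadness, so fear is kept as the single emotion for the message.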

Descriptive results
Our final dataset consists of about four million Tweets (n=4,196,020) after pre-processing all raw data (e.g., removing duplicates). These Tweets were posted by Twitter users between March 7 and April 21, 2020. We identified the most popular tweeted bigrams (pairs of words) related to COVID-19. Bigrams captured "two consecutive words regardless of the grammar structure and semantic meaning and may not be self-explanatory" [18], including "covid 19," "stay home," "social distancing," "new cases," "don't know," "confirmed cases," "home order," "New York," "tested positive," "death toll," and "stay safe." Popular unigrams included virus, lockdown, quarantine, people, new, home, like, stay, don't, and cases. We present the most popular unigrams and bigrams related to COVID-19 in Table 1 and visualize them using word clouds in Figure 2 and Figure 3.
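Counting unigrams and bigrams of this kind is straightforward once the tweets are cleaned. The sketch below is illustrative only: `top_ngrams` is a hypothetical helper, whitespace tokenization is an assumption, and the three sample tweets are made up.

```python
from collections import Counter

def top_ngrams(tweets, n, k):
    """Return the k most frequent n-grams (n consecutive tokens) in tweets."""
    counts = Counter()
    for tweet in tweets:
        tokens = tweet.split()
        # Slide a window of n tokens over the tweet, ignoring grammar.
        counts.update(" ".join(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return counts.most_common(k)

# Made-up sample tweets, already cleaned and lower-cased.
sample = ["stay home stay safe", "please stay home", "new cases in new york"]
print(top_ngrams(sample, n=1, k=1))  # [('stay', 3)]
print(top_ngrams(sample, n=2, k=1))  # [('stay home', 2)]
```

Because the window ignores sentence structure, a bigram such as "home order" can appear without being meaningful on its own, which is the caveat quoted above.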

COVID-19 related topics
Our approach, Latent Dirichlet Allocation (LDA), produced frequently co-occurring pairs of words related to COVID-19. We organized these co-occurring words into different topics. LDA allows researchers to manually define the number of topics (e.g., ten or twenty) to generate. Consistent with previous studies, we used the coherence model in gensim [19] to calculate the most appropriate number of topics based on the data itself. For this dataset, 13 topics offered the best trade-off between a high coherence score and a small number of topics. For example, models with 19 or 20 topics had higher coherence scores than the 13-topic model, but they required larger numbers of topics, as shown in Table 2. Among the 13 latent topics, Topic 3 had the highest distribution (8.87%). Within Topic 3, pairs of words such as "tested positive," "coronavirus outbreak," "New York," "shelter place," and "mental health" tend to co-occur.
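The trade-off between coherence and the number of topics can be made explicit with a simple selection rule. The paper does not state the exact rule, so the sketch below is an assumption: it picks the smallest number of topics whose coherence lies within a tolerance of the best score. The function name `pick_num_topics`, the tolerance, and all scores are hypothetical; in practice the scores could come from gensim's `CoherenceModel.get_coherence()`.

```python
def pick_num_topics(coherence_by_k, tolerance=0.02):
    """Smallest topic count whose coherence is within `tolerance` of the best.

    This rule is an assumed reading of "high coherence, few topics".
    """
    best = max(coherence_by_k.values())
    candidates = [k for k, score in coherence_by_k.items()
                  if best - score <= tolerance]
    return min(candidates)

# Hypothetical coherence scores for illustration; not values from the study.
scores = {5: 0.41, 10: 0.47, 13: 0.52, 16: 0.50, 19: 0.53, 20: 0.535}
print(pick_num_topics(scores))  # 13
```

With these made-up scores, 19 and 20 topics score slightly higher than 13, but 13 is the smallest count within the tolerance, mirroring the situation described above.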

Sentiment analysis of each latent topic
We presented the results of the sentiment analysis for each of the thirteen latent topics in Table 3.

COVID-19 related themes
The qualitative content analysis approach allows researchers to categorize these topics into distinct themes. Two team members discussed the bigrams and generated sample Tweets in each topic and then categorized the 13 identified topics into different themes. In addition, we computed the topic distance [10] to cross-validate the classification of the themes. Figure 6 shows a 2D plane of the intertopic distance [21], in which each circle represents a topic from Topic 1 to Topic 13 in the study. To protect the privacy and anonymity of the Twitter users, we did not present any user-related information, such as users' Twitter handles or other identifying information; therefore, the sample tweets shown in Table 4 are excerpts drawn from original tweets. The theme identified from the keywords in Topic 1 was public discussion around "measures or solutions to slow the spread of COVID-19." Keywords such as "lockdown," "herd immunity," "face masks," "home quarantine," and "test kits" indicated this overarching theme. Topic 3 was about public attention to COVID-19 in New York and mental health concerns. We also identified other themes such as "misinformation and fake news (fake news, wuhan lab)," "need for vaccine (coronavirus vaccine)," "protests against the lockdown (healthcare workers, don't understand, anti-lockdown)," "coronavirus deaths (death toll, people die, people dying, coronavirus death)," and "coronavirus in the US (the United States, work home, god bless, wear mask, grocery store)."
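As background for the intertopic distance mentioned above, one measure often used to compare two topics is the Jensen-Shannon divergence between their word distributions. Whether the study used this exact measure is an assumption; the sketch below only illustrates the idea, and the example distributions are made up.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, between 0 and 1 with log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Made-up word distributions of two topics over a three-word vocabulary.
topic_a = [0.7, 0.2, 0.1]
topic_b = [0.1, 0.2, 0.7]
print(js_divergence(topic_a, topic_a))  # 0.0
print(js_divergence(topic_a, topic_b) > 0)  # True
```

Topics that share most of their high-probability words have a divergence close to 0 and would appear close together in a 2D projection, which is what makes such a plot useful for checking whether topics grouped into one theme are in fact similar.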

Principal results
The results show several essential points. First, the public is using a variety of terms to refer to COVID-19, including "virus," "COVID 19," and "coronavirus." In addition, coronavirus has been referred to as the "China virus," which can create stigma and harm efforts to address the COVID-19 outbreak [22]. Second, discussions about the pandemic in New York are salient, and the associated public sentiment is anger. Third, public discussions about the Chinese Communist Party (CCP) and the spread of the virus emerged as new topics, which were not identified in a previous study using Twitter data collected between January 20 and March 7, 2020 [9], suggesting that the connection between COVID-19 and politics is increasingly circulating on Twitter as the situation evolves. Fourth, public sentiment on the spread of coronavirus was anticipation of the potential measures that can be taken, followed by mixed feelings of trust, anger, and fear. The results suggest that the public was not surprised by the rapid spread of the virus. Fifth, the public reveals a significant feeling of fear when discussing the coronavirus crisis and deaths.

Comparison with prior work
Our findings are consistent with previous studies using social media data to assess the public health response and sentiments for COVID-19, and suggest that public attention has focused on the following topics since January 2020: (1) the confirmed cases and death rates [9 23]; (2) preventive measures [9 23]; (3) health authorities and government policies [6 9]; (4) daily life impacts such as food supplies and school closing [13 23]; (5) fake news and misinformation about the coronavirus [13]; (6) an outbreak in New York [9]; and (7) COVID-19 stigma arising from references to the coronavirus as the "Chinese virus" [22].
Compared with the study examining public discussions and concerns about COVID-19 using Tweets from January 20 to March 7, 2020, we find that several salient topics are no longer popular, including (1) the outbreak in South Korea; (2) the Diamond Princess cruise; (3) economic impact [24]; and (4) supply chains [9]. Among preventive measures, washing hands is no longer a prevalent topic. Instead, quarantine has become dominant.

Future research
First, future research could further explore public trust and confidence in existing measures and policies, which is essential. Compared to prior work, our study shows that Twitter users reveal a feeling of joy when talking about "herd immunity." We also find sentiments of fear and anticipation related to topics of quarantine and shelter-in-place. In addition, future studies could evaluate how government officials (e.g., President Trump) and international organizations (e.g., WHO) deliver and convey messages to the public, and how these messages affect public opinions and sentiments. Finally, future studies could examine the spread of anti-Chinese/Asian sentiments on social media and how people use social media platforms to resist and challenge COVID-19 stigma.

Conclusions
Studies have shown that Twitter data and machine learning approaches can be leveraged for