This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
COVID-19, caused by SARS-CoV-2, has led to a global pandemic. The World Health Organization has also declared an infodemic (ie, a plethora of information regarding COVID-19 containing both false and accurate information circulated on the internet). Hence, it has become critical to test the veracity of information shared online and analyze the evolution of discussed topics among citizens related to the pandemic.
This research analyzes the public discourse on COVID-19. It characterizes risk communication patterns in four Asian countries with outbreaks at varying degrees of severity: South Korea, Iran, Vietnam, and India.
We collected tweets on COVID-19 from four Asian countries in the early phase of the disease outbreak, from January to March 2020. The data set was collected using relevant keywords in each language, as suggested by locals. We present a method to automatically extract a time–topic cohesive relationship in an unsupervised fashion based on natural language processing. The extracted topics were evaluated qualitatively based on their semantic meanings.
This research found that each government’s official phases of the epidemic were not well aligned with the degree of public attention represented by the daily tweet counts. Inspired by the issue-attention cycle theory, the presented natural language processing model can identify meaningful transition phases in the discussed topics among citizens. The analysis revealed an inverse relationship between the tweet count and topic diversity.
This paper compares similarities and differences of pandemic-related social media discourse in Asian countries. We observed multiple prominent peaks in the daily tweet counts across all countries, indicating multiple issue-attention cycles. Our analysis identified which topics the public concentrated on; some of these topics were related to misinformation and hate speech. These findings and the ability to quickly identify key topics can empower global efforts to fight against an infodemic during a pandemic.
The COVID-19 pandemic has affected global health and the economy. The use of social media and the internet to seek and share information about the virus has increased rapidly [
Analysis of risk communication is critical because it helps better understand how and why people propagate or consume certain information upon a threat to their health, economic, or social well-being. Such analysis helps stakeholders prepare and reach informed conclusions about how their decisions affect individuals’ interests, values, and well-being [
Studies have identified online risk communication topics by collectively considering temporal tweet trends by adopting, for instance, a statistical clustering method that scans over time [
This research used the data gathered from social media to understand public discourse on COVID-19. Understanding public concerns will help determine which unproven claims or pieces of misinformation need to be debunked first and will contribute to fighting the disease. Primarily, we aim to identify what people say without gatekeeping. For instance, identifying new misinformation in countries that are experiencing a pandemic at an early stage can buy time to debunk the same piece of misinformation in other countries before it poses a threat to public health [
To detect meaningful topical shifts of risk communication, one needs to demarcate temporal phases from the public discourse that reflect prevailing circumstances in the real world. If social media conversations were to change by the epidemic phases announced by local governments, one might use the same phases. However, government announcements do not necessarily match with the public interest. Following the issue-attention cycle theory [
We used a spatiotemporal approach and considered tweets from different countries to provide more holistic views of risk communication. We present views from four Asian countries. Such a multicountry view was used to explore possible opportunities for joint efforts in managing risk communication. For example, early detection of misinformation can help social media services, social media communicators, journalists, policy makers, and medical professionals fight infodemics worldwide.
We ask the following research questions (RQs):
RQ1: Do the official epidemic phases announced by governments reflect online interaction patterns?
RQ2: Can topic phases be demarcated automatically based on a bottom-up approach?
RQ3: What are the major topics corresponding to each topic phase?
RQ4: What are the unique traits of the topic trends by country, and what are the distinguishing online communicative characteristics?
By answering these RQs, this study makes four contributions. First, we propose an end-to-end method of extracting risk communication topics in a spatial–temporal fashion with less gatekeeping. Second, we provide a theoretical ground (issue-attention cycle) to the framework and successfully assess its validity by observing multiple prominent peaks in the daily conversation. Third, we demonstrate via a case study of four countries a common risk communication trait. During the peak moments of conversation, users on social media concentrate on a few topics. Finally, we show from the case study which topics were directly linked to misinformation and hateful speech in the studied data.
The gathered data from Twitter and the codes (including language tokenizers and analysis codes) are accessible in
The issue-attention cycle model [
Not all issues follow the five stages of the issue-attention cycle [
Despite these fragmented findings, the issue-attention cycle framework provides insights into how public attention dramatically waxes and wanes. An issue that has gone through the cycle is different from issues that have not gone through the cycle in at least two ways. First, when an issue has achieved national prominence, new institutions, programs, and measures will have been developed to address the situation. These developments and their societal impacts are likely to persist even after public attention has shifted elsewhere. Second, the prolonged impacts of these developments are shaped by what was heavily discussed when the issue was of primary public concern.
Although the issue-attention cycle was initially proposed to model traditional media such as newspapers and television, there is a burgeoning literature applying the model to social media platforms. Among them, Twitter serves as a forum that the public is increasingly turning toward to seek and share information that is not subjected to a gatekeeping process [
Building on these prior studies, we analyzed Twitter conversations about COVID-19 to examine social media’s issue-attention cycle. We present how to build an end-to-end method of identifying meaningful
Studies have examined various impacts of the pandemic. Researchers have focused on predicting the transmissibility of the virus. One study estimated the basic reproduction number (R0) of SARS-CoV-2, which is known to be higher than that of the severe acute respiratory syndrome (SARS)–related coronavirus, the cause of the SARS outbreak that first appeared in Guangdong Province in southern China in 2002 [
Other studies have sought to understand the propagation of misinformation related to COVID-19. One study used an epidemic model to represent the spread of misinformation about COVID-19 on various social media platforms such as Twitter, Instagram, YouTube, Reddit, and Gab; the study showed that users interact and consume information differently on each platform [
Among the regional research, one article argued that fake online news in Japan has led to xenophobia toward patients and Chinese visitors [
More recently, a report showed that the public could not easily receive the information on COVID-19 shared by public health officials due to prevalent misinformation on fake cures and conspiracy theories [
Several studies have used data gathered from Twitter to analyze risk communication amid COVID-19. Some of them focus on sentiment analysis based on conventional rule-based lexicon models [
Many types of data sets have been released to the public and research communities on COVID-19. One study crawled Twitter for approximately 3 months and collected information on tweets with relevant keywords in 10 languages [
Natural language processing such as topic modeling is increasingly used to process extensive documents and extract hidden thematic patterns of textual information [
One work analyzed COVID-19–related tweets over 2 weeks to study ongoing topics and found that Twitter can be considered a rich medium to understand public opinions in real time [
Despite the growing literature on risk communication during COVID-19, most studies that use topic modeling extract topics from either the entire studied period or manually segmented periods. This study considers time and topics jointly; we used an algorithmic approach to identify topical phases that arise naturally. Our goal is to observe changing risk communication contexts (even when conversations contain similar keywords) from the issue-attention cycle perspective. We also chose to study risk communication in Asian countries that have received relatively little attention. Our method is not restricted to the studied countries; it can be applied to other languages and countries.
We crawled Twitter for messages by using the Twint Python library [
The four countries were selected as a case study to demonstrate differences in their COVID-19 developments. In Iran, confirmed cases have gradually increased. In contrast, the case count in Vietnam has consistently stayed low. There was an abrupt increase in the numbers after the first confirmed case in South Korea, but the rising curve of confirmed cases has since flattened, unlike other countries. In India, the situation was relatively mild until mid-March 2020, and since then, there has been a drastic surge. Future research can replicate our methodology in other countries.
We set up two keywords,
Statistics of the scraped tweets.
Language | Duration | Keywords^a used | Tweets, n |
Korean | January 1 to March 27, 2020 | corona, Wuhan pneumonia | 1,447,489 |
Farsi | January 1 to March 30, 2020 | #corona, #coronavirus, #Wuhan, #pneumonia | 459,610 |
Vietnamese | January 1 to March 31, 2020 | corona, n-CoV, COVID, acute pneumonia | 87,763 |
Hindi | January 1 to March 27, 2020 | corona, Wuhan pneumonia | 1,373,333 |
^a Keywords were used to collect relevant data for each country. We used two kinds of keywords: one official naming of COVID-19 and
As shown in
The pipeline structure of the topic analysis.
The first step is tokenization, which converts text into the smallest units that carry meaning. We filtered out unnecessary textual information such as stop words, special characters (nonletters), special commands, and emojis, and then applied existing Python tokenizer libraries for each language. Detailed information about the language-specific tokenizers is available on GitHub [
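The filtering step can be illustrated with a minimal sketch that uses only the Python standard library. The stop-word list, the regular expressions, and the whitespace split (which stands in for the language-specific tokenizers) are illustrative assumptions and are not taken from the released code.

```python
import re

# Hypothetical stop-word list; a real pipeline would load a language-specific one.
STOP_WORDS = {"the", "a", "is", "of", "in"}

# "Special commands" are assumed here to mean URLs, @mentions, and the RT marker.
URL_OR_COMMAND = re.compile(r"https?://\S+|@\w+|\bRT\b")
# Anything that is not a letter or whitespace: punctuation, symbols, emojis, digits.
NON_LETTER = re.compile(r"[^\w\s]|_|\d")


def preprocess(tweet):
    """Strip commands and nonletters, lowercase, split, and drop stop words."""
    text = URL_OR_COMMAND.sub(" ", tweet)
    text = NON_LETTER.sub(" ", text)
    tokens = text.lower().split()  # stand-in for a language-specific tokenizer
    return [t for t in tokens if t not in STOP_WORDS]


tokens = preprocess("RT @who: The #corona virus is spreading 😷 https://t.co/abc")
```

Because Python's `\w` is Unicode-aware for string patterns, the same filter keeps Korean, Farsi, Vietnamese, and Hindi letters while removing emojis and punctuation.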
The next step is to demarcate specific phases divided by dates to extract topics. This is nontrivial since there are multiple fluctuations and changes in topics reflecting real events such as increased patients with COVID-19. Furthermore, we ruled out using the epidemic phases announced by each government because the offline epidemic phases do not seem to capture actual online topic trends as explained in the forthcoming Basic Daily Trends section.
The issue-attention cycle moderating public attention to a given issue can be measured in media attention, such as the number of news stories [
We set the
We established joint thresholds for
We adopted a low-pass filter with 0.2 as the low-frequency threshold to remove noisy signals and smooth the data. Finally, the temporal data are divided into topic phases (see
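One possible reading of this demarcation step is sketched below. It assumes that the two thresholded quantities are the day-over-day change in tweet volume (velocity) and the change in that change (acceleration), as suggested by the supplementary file description. Exponential smoothing is swapped in as a simple low-pass filter, and its `alpha=0.2` is not necessarily equivalent to the 0.2 threshold above. The thresholds `v_min` and `a_min` are hypothetical.

```python
def low_pass(series, alpha=0.2):
    """First-order exponential smoothing, used here as a simple low-pass filter."""
    smoothed = []
    for x in series:
        prev = smoothed[-1] if smoothed else x
        smoothed.append(alpha * x + (1 - alpha) * prev)
    return smoothed


def diff(series):
    """Day-over-day change; the first day has no predecessor, so it is set to 0."""
    return [0.0] + [b - a for a, b in zip(series, series[1:])]


def phase_starts(daily_counts, v_min, a_min, alpha=0.2):
    """Return the day indices at which a new phase is assumed to begin.

    A boundary is placed on the first day of each run in which both the
    smoothed velocity and the smoothed acceleration exceed their thresholds.
    """
    velocity = low_pass(diff(daily_counts), alpha)
    acceleration = low_pass(diff(velocity), alpha)
    starts, in_burst = [0], False
    for day, (v, a) in enumerate(zip(velocity, acceleration)):
        burst = v > v_min and a > a_min
        if burst and not in_burst and day > 0:
            starts.append(day)
        in_burst = burst
    return starts


# Toy series: flat attention followed by a sudden surge on day 5.
toy_counts = [10, 10, 10, 10, 10, 500, 1500, 1600, 1600, 1600]
boundaries = phase_starts(toy_counts, v_min=50, a_min=10)
```

Requiring both quantities to exceed a threshold means that a slow, steady rise in volume does not open a new phase; only a rapid and still-accelerating surge does.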
We used latent Dirichlet allocation (LDA) for the topic modeling task. LDA is a well-known machine learning method to extract topics from given textual documents (ie, a collection of discrete data points) [
The topic count for each phase is a hyperparameter, which we searched over the range of 2 to 50. We calculated perplexity, which measures how uncertain a model is about the next token (lower values indicate less ambiguity). Perplexity is a metric that is often used to optimize language models [
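As an illustration, perplexity can be computed from a model's total log-likelihood on a set of documents as exp(-log-likelihood / number of tokens). The sketch below assumes the log-likelihood for each candidate topic count has already been obtained from some LDA implementation. Choosing the count with the lowest perplexity is an assumption about the selection rule, and the numbers are made up.

```python
import math


def perplexity(total_log_likelihood, n_tokens):
    """Perplexity = exp(-average log-likelihood per token); lower is better."""
    return math.exp(-total_log_likelihood / n_tokens)


def best_topic_count(log_likelihood_by_k, n_tokens):
    """Pick the topic count whose model has the lowest perplexity."""
    scores = {k: perplexity(ll, n_tokens) for k, ll in log_likelihood_by_k.items()}
    return min(scores, key=scores.get), scores


# Hypothetical log-likelihoods for three candidate topic counts on 1000 tokens.
candidates = {2: -7000.0, 3: -6500.0, 4: -6800.0}
best_k, scores = best_topic_count(candidates, n_tokens=1000)
```

In the study, the candidate topic counts would range from 2 to 50 for each phase rather than the three values used here.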
The optimal number of phases and topics by country.
Country | Measure | Phase 0 | Phase 1 | Phase 2 | Phase 3 | Phase 4 | Phase 5 |
South Korea | Time period | Jan 1-19, 2020 | Jan 20-Feb 12, 2020 | Feb 13-Mar 9, 2020 | Mar 10-27, 2020 | N/A^a | N/A |
South Korea | Total tweets, n | 507 | 161,790 | 672,080 | 366,073 | N/A | N/A |
South Korea | Average users per day | 14.06 | 2415.52 | 5376.77 | 5577.88 | N/A | N/A |
South Korea | Average original tweets per day | 28.17 | 5244.09 | 17,796.08 | 13,095.65 | N/A | N/A |
South Korea | Average retweets per day | 21.78 | 56,809.78 | 211,310.89 | 147,759.41 | N/A | N/A |
South Korea | Tweet depth^b | 0.77 | 10.83 | 11.87 | 11.28 | N/A | N/A |
South Korea | Topics determined by perplexity, n | 2 | 41 | 15 | 43 | N/A | N/A |
South Korea | 75th percentile of topics^c, n | 1 | 18 | 6 | 14 | N/A | N/A |
South Korea | Final topics^d, n | 1 | 8 | 5 | 11 | N/A | N/A |
Iran | Time period | Jan 1-Feb 18, 2020 | Feb 19-Mar 30, 2020 | N/A | N/A | N/A | N/A |
Iran | Total tweets, n | 15,473 | 437,176 | N/A | N/A | N/A | N/A |
Iran | Average users per day | 245.34 | 1442.46 | N/A | N/A | N/A | N/A |
Iran | Average original tweets per day | 385.63 | 5272.04 | N/A | N/A | N/A | N/A |
Iran | Average retweets per day | 1315.13 | 22,128.76 | N/A | N/A | N/A | N/A |
Iran | Tweet depth | 3.41 | 4.20 | N/A | N/A | N/A | N/A |
Iran | Topics determined by perplexity, n | 3 | 5 | N/A | N/A | N/A | N/A |
Iran | 75th percentile of topics, n | 2 | 4 | N/A | N/A | N/A | N/A |
Iran | Final topics, n | 3 | 6 | N/A | N/A | N/A | N/A |
Vietnam | Time period | Jan 1-20, 2020 | Jan 21-25, 2020 | Jan 26-Feb 15, 2020 | Feb 16-Mar 4, 2020 | Mar 5-22, 2020 | Mar 23-31, 2020 |
Vietnam | Total tweets, n | 140 | 1499 | 18,424 | 28,458 | 26,950 | 12,292 |
Vietnam | Average users per day | 3.79 | 131.25 | 179.65 | 485.59 | 340.65 | 433.29 |
Vietnam | Average original tweets per day | 7.37 | 218.50 | 686.60 | 1238.77 | 1089.94 | 1224.00 |
Vietnam | Average retweets per day | 0.21 | 20.75 | 159.80 | 582.29 | 192.24 | 201.86 |
Vietnam | Tweet depth | 0.03 | 0.09 | 0.23 | 0.47 | 0.18 | 0.16 |
Vietnam | Topics determined by perplexity, n | 19 | 3 | 6 | 46 | 48 | 16 |
Vietnam | 75th percentile of topics, n | 1 | 1 | 3 | 22 | 19 | 4 |
Vietnam | Final topics, n | 1 | 2 | 4 | 7 | 10 | 2 |
India | Time period | Jan 1-29, 2020 | Jan 30-Mar 9, 2020 | Mar 10-27, 2020 | N/A | N/A | N/A |
India | Total tweets, n | 3088 | 151,210 | 1,219,030 | N/A | N/A | N/A |
India | Average users per day | 107.41 | 1364.95 | 13,318.63 | N/A | N/A | N/A |
India | Average original tweets per day | 269.72 | 4261.13 | 58,924.55 | N/A | N/A | N/A |
India | Average retweets per day | 415.69 | 14,467.8 | 318,368.05 | N/A | N/A | N/A |
India | Tweet depth | 1.54 | 3.40 | 5.40 | N/A | N/A | N/A |
India | Topics determined by perplexity, n | 3 | 50 | 47 | N/A | N/A | N/A |
India | 75th percentile of topics, n | 2 | 22 | 20 | N/A | N/A | N/A |
India | Final topics, n | 3 | 5 | 9 | N/A | N/A | N/A |
^a N/A: not applicable.
^b Measured as the ratio of retweets to original tweets.
^c Major topics.
^d After human annotators merged similar themes.
This step involves labeling the themes of the extracted topics and allocating semantic meanings to each topic. We first sorted all tweets with the identified topics in descending order (ie, tweets on the most prevalent topics listed first) and discarded the minor topics that accounted for less than 25% of all tweets.
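The filtering of minor topics can be sketched as follows. The sketch assumes that "minor" refers to the tail of the ranking once the most prevalent topics together cover 75% of tweets; this cumulative reading is an interpretation, and the function name and sample counts are hypothetical.

```python
def major_topics(tweet_counts, coverage=0.75):
    """Keep the most prevalent topics until they jointly cover `coverage` of tweets."""
    total = sum(tweet_counts.values())
    ranked = sorted(tweet_counts.items(), key=lambda kv: kv[1], reverse=True)
    kept, covered = [], 0
    for topic, count in ranked:
        if covered >= coverage * total:
            break  # the remaining topics form the discarded minor tail
        kept.append(topic)
        covered += count
    return kept


# Hypothetical tweet counts for four topics in one phase.
kept = major_topics({"topic_a": 50, "topic_b": 30, "topic_c": 15, "topic_d": 5})
```

With these counts, the two largest topics already cover 80% of tweets, so the two smallest are dropped.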
We then extracted the top 1000 retweeted tweets and the 30 keywords with the highest probability of usage for each topic. We provided these data sets to local users from each country and asked them to label themes for each topic based on the given data sets. Any similar or hierarchical topics were then merged via qualitative coding into a higher category. If one topic corresponded to several themes, then it was given multiple class labels. The maximum number of multiple classes within topics was two, and each class within a topic was weighted as 0.5 in the plot of daily trends in the number of tweets.
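The 0.5 weighting for topics with two class labels can be illustrated with a short sketch. The input layout (pairs of date and topic ID, plus a mapping from topic ID to labels) and the label names are assumptions made for illustration.

```python
from collections import defaultdict


def daily_theme_counts(tweets, topic_themes):
    """Count tweets per (date, theme), splitting weight evenly across a topic's labels.

    tweets: iterable of (date, topic_id) pairs.
    topic_themes: dict mapping topic_id to a list of one or two theme labels.
    """
    totals = defaultdict(float)
    for date, topic in tweets:
        labels = topic_themes[topic]
        weight = 1.0 / len(labels)  # 1.0 for one label, 0.5 each for two labels
        for label in labels:
            totals[(date, label)] += weight
    return dict(totals)


# Hypothetical topics: topic 0 has one label, topic 1 has two.
themes = {0: ["news_confirmed"], 1: ["news_confirmed", "fake_news"]}
tweets = [("2020-02-18", 0), ("2020-02-18", 1), ("2020-02-19", 1)]
counts = daily_theme_counts(tweets, themes)
```

Splitting the weight keeps the daily totals across themes equal to the number of tweets, so a two-label topic is not counted twice in the trend plot.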
Human annotators who are familiar with the local language and Twitter qualitatively assessed the extracted topics at two levels. The first was the intralevel, where annotators labeled each topic based on the contents of the sampled top 1000 tweets and top 30 words. The second was the interlevel, where the annotators compared tweet contents and top-occurring words among topics regardless of the phase. Other annotators then cross-checked the assessment.
The Cohen kappa coefficient to measure the intercoder reliability was 0.766 (see
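For reference, the Cohen kappa coefficient compares the observed agreement between two coders with the agreement expected by chance, kappa = (p_o - p_e) / (1 - p_e). The sketch below shows the computation on made-up labels; it is not the annotation data from this study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen kappa for two coders who labeled the same items.

    Undefined (division by zero) when chance agreement equals 1, for example
    when both coders assign one identical label to every item.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two coders for four topics.
coder_1 = ["news", "news", "fake", "fake"]
coder_2 = ["news", "news", "fake", "news"]
kappa = cohen_kappa(coder_1, coder_2)
```

Here the coders agree on three of four items (p_o = 0.75) while chance agreement is 0.5, which gives a kappa of 0.5.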
Concerning the local and global news themes, we narrowed down the labels since people talked about different news categories. We sublabeled tweets as “_confirmed” if they were about confirmed cases or deaths, “_hate” if they were about hate crimes toward particular races, “_economy” if they were about the economic situation and economic policies, “_cheerup” if they were about supporting each other, and “_education” if they were about when to reopen schools; no sublabel was given to tweets about general information.
Daily trends in the four countries. The x-axis shows the date, and the y-axis shows the number of tweets on a log scale.
Daily trends in South Korea. Start/end dates of the official epidemic phases (vertical dashed lines), trends in the number of tweets (blue lines), and trends in the number of confirmed cases (red bars).
Daily trends in Iran. Start/end dates of the official epidemic phases (vertical dashed lines), trends in the number of tweets (blue lines), and trends in the number of confirmed cases (red bars).
Daily trends in Vietnam. Start/end dates of the official epidemic phases (vertical dashed lines), trends in the number of tweets (blue lines), and trends in the number of confirmed cases (red bars).
Daily trends in India. Start/end dates of the official epidemic phases (vertical dashed lines), trends in the number of tweets (blue lines), and trends in the number of confirmed cases (red bars).
The first patient with COVID-19 was reported in South Korea on January 20, 2020. This explains why the tweet count remains relatively low during early January and mostly increases only after late January (see
On February 18, 2020, the tweet numbers increased sharply following the 31st confirmed case, which was linked to Shincheonji, a cult religious group in Daegu City. After this case was confirmed, the quarantine authority began rigorous testing, focusing on Daegu, and the number of confirmed cases increased drastically until mid-March. The tweet trends follow an identical pattern. However, the official epidemic phases announced by the government, represented by vertical dashed lines in the figure, seem to lag behind the increases in the number of tweets. This pattern shows that the official epidemic phases do not align well with the amount of online attention.
We repeated the analysis with the other three countries, as shown in
We used the daily theme labels acquired from the “Label Topics” module and analyzed the topic changes over time with plots for the four countries. One plot shows daily trends based on the number of tweets, while another shows trends based on the number of tweets mentioning country names such as the United States. Overall, as people talked more about the COVID-19 outbreak (ie, as the daily number of tweets increased), the topics they discussed became less diverse.
The data yielded a total of four topic phases, which are used in
Daily topic trends in South Korea. Trends based on number of tweets (top) and based on number of tweets mentioning country names (bottom).
We portrayed daily trends of interest in other countries by counting the tweets mentioning other countries’ names in local languages or English. Korea, China, and Japan were mentioned most frequently; we suspect that this was mainly triggered by political and diplomatic relationships. Meanwhile, the United States and Italy were both mentioned steadily across the 3 months, likely because media outlets were broadcasting global news.
We repeated the same analysis and interpreted the results for the other cases (Iran, Vietnam, and India), as depicted in
Daily topic trends in Iran based on the number of tweets.
Daily topic trends in Vietnam based on the number of tweets.
Daily topic trends in India based on the number of tweets.
This paper analyzes tweets to understand the public discourse on the COVID-19 pandemic. In South Korea, the daily numbers of tweets reached their local maxima in tandem with major offline events. However, in Iran and Vietnam, the tweet counts did not synchronize well with offline events; this may be because of various reasons (eg, Twitter is only one of the platforms used by citizens of these countries). Overall, it is interesting to observe that the Twitter data peaks do not necessarily correlate with local governments’ announcements. Social media attention can precede the official announcements, while the official announcements can reinforce the attention.
Based on the labeled topics, as people talked more about COVID-19, they tended to refer to a smaller number of topics. This was more apparent when the tweet depth value was used for the phases, as presented in
Tweet depth is defined as the number of retweets per day divided by the number of original tweets per day. It can be deemed a measure of standardized cascading depth, with a higher value signifying a greater depth for one tweet. The country-level sociopolitical and cultural background and Twitter popularity may lead to the observed differences in tweet depth. We verified that tweet depth tended to increase in the South Korea and Vietnam cases when people communicated more about COVID-19. This phenomenon reaffirms the finding in another study that the online coronavirus network’s diameter value was smaller than that of other keyword networks [
The topical phases with the most considerable tweet depth appeared in the second stage of the issue-attention cycle, where public awareness of an issue soars. In Iran and India, the number of phases might have been too small to discern any such trends. It is also worth noting that this pattern has no intercountry temporal dependence. In other words, even though the pandemic hit the countries at different times, our analysis shows that the tweet depth reached a maximum when the pandemic worsened in that country. This observation could prove to be an effective forewarning of upcoming misinformation cascades.
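As a worked example, the tweet depth values reported for South Korea can be reproduced from the per-phase daily averages of retweets and original tweets. The input figures come from the table of phases and topics by country; the function and variable names are illustrative.

```python
def tweet_depth(avg_retweets_per_day, avg_original_tweets_per_day):
    """Ratio of retweets to original tweets, a proxy for cascade depth."""
    return avg_retweets_per_day / avg_original_tweets_per_day


# South Korea, phases 0 to 3: (average retweets per day, average original tweets per day)
south_korea = [
    (21.78, 28.17),
    (56809.78, 5244.09),
    (211310.89, 17796.08),
    (147759.41, 13095.65),
]
depths = [round(tweet_depth(r, o), 2) for r, o in south_korea]
# depths == [0.77, 10.83, 11.87, 11.28], matching the reported values
```

The depth peaks in phase 2, the phase with the largest tweet volume for South Korea.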
Moreover, the daily tweet volume peaks reflected the daily number of confirmed cases. In Iran, Vietnam, and India, the daily tweet volume peak anticipated the peak of the number of daily confirmed cases by up to a few weeks. Although the two peaks are close to each other for South Korea, it is worth noting that, around the time of their occurrence, South Korea was becoming the country most affected by COVID-19 outside mainland China.
Interestingly, as shown in
We also observed a number of countrywise differences. One of them is the national versus international focus of South Korea and Vietnam during the initial phase. Phase 0 tweets in Korea were not directly related to COVID-19 but simply contained the word
With specific reference to each country, in South Korea, when the local (offline) pandemic situation became severe (phase 2), the number of topics discussed on Twitter decreased, which means that people focused more on only a handful of issues. A unique feature of phase 0 was that people sought to cheer each other up and express solidarity in difficult times. In Iran’s case, the topic count was relatively steady over time. The significant topics discussed were confined to news and information; we interpreted this as a sign that Iranian users tend to be cautious about using social media.
For Vietnam, in phase 4, when tweet traffic was lower than in phase 3, the number of topics was larger, and the topic themes became less related to the numbers of confirmed cases and death tolls. For instance, people talked more about the economy in phases 2 and 4. The Indian case also displayed a unique trait: many topics were related to misinformation, the scale of which was much lower in the other countries. A large portion of the topics consisted of misinformation and hateful content; this trend was observed throughout phases 2 and 3 (see
There are several limitations to be considered. First, we analyzed tweets from only four countries; therefore, we need to be cautious about generalizing our explanations and insights. We plan to extend this study by including more countries. Second, there are other ways to demarcate the topic phases. Our approach was informed by the issue-attention cycle framework, as we computed unique communication traits (ie,
Last, there are also other methodologies to model topics. One natural extension would be to use the external web links that are embedded in the relevant tweets. Scraping the content from external web pages could provide richer contexts in understanding risk communication on social media. One recent work used multilingual Bidirectional Encoder Representations from Transformers, a well-known transformer-based deep embedding model, and fine-tuned it by considering topical and temporal information to model topics of COVID-19 tweets [
The current literature on the infodemic has emphasized the social media platform’s content moderation efforts [
Our research found that when the tweet count on COVID-19 increased, it did not lead to an increased number of topics; regardless of the tweet count, much of the public attention remained focused on a limited set of topics. The early days of the COVID-19 pandemic also involved various misinformation and hateful speech in the studied countries; fake news was one of the central topics discussed (not a peripheral topic). The proposed steps could indicate the global effects of infodemics during a pandemic and identify the emergence of misinformation and its prevalence, which will help prioritize which misinformation to debunk.
Data/code description, computed daily velocity/acceleration trends for each country, and derived temporal phases.
Daily topic trends on social media, the reliability of the topic modeling results, intercoder reliability, and the labeled list of major topics by country.
Daily numbers of COVID-19 confirmed cases and tweet trends by country.
ground truth
LDA: latent Dirichlet allocation
RQ: research question
SARS: severe acute respiratory syndrome
The authors would like to thank Tae-Gwan Kang for his insightful comments. SP, SH, and MC were supported by the Institute for Basic Science (IBS-R029-C2) and the Basic Science Research Program through the National Research Foundation of Korea (No. NRF-2017R1E1A1A01076400).
None declared.