This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
The COVID-19 pandemic has affected the lives of people globally for over 2 years. Changes in lifestyles due to the pandemic may cause psychosocial stressors for individuals and could lead to mental health problems. To provide high-quality mental health support, health care organizations need to identify COVID-19–specific stressors and monitor the trends in the prevalence of those stressors.
This study aims to apply natural language processing (NLP) techniques to social media data to identify the psychosocial stressors during the COVID-19 pandemic and to analyze the trend in the prevalence of these stressors at different stages of the pandemic.
We obtained a data set of 9266 Reddit posts from the subreddit r/COVID19_support, posted between February 14, 2020, and July 19, 2021. We used the latent Dirichlet allocation (LDA) topic model to identify the topics that were mentioned on the subreddit and analyzed the trends in the prevalence of the topics. Lexicons were created for each of the topics and were used to identify the topics of each post. The prevalences of topics identified by the LDA and lexicon approaches were compared.
The LDA model identified 6 topics from the data set: (1) “fear of coronavirus,” (2) “problems related to social relationships,” (3) “mental health symptoms,” (4) “family problems,” (5) “educational and occupational problems,” and (6) “uncertainty on the development of pandemic.” According to the results, there was a significant decline in the number of posts about the “fear of coronavirus” after vaccine distribution started. This suggests that the distribution of vaccines may have reduced the perceived risks of coronavirus. The prevalence of discussions on the uncertainty about the pandemic did not decline with the increase in the vaccinated population. In April 2021, when the Delta variant became prevalent in the United States, there was a significant increase in the number of posts about the uncertainty of pandemic development but no obvious effects on the topic of fear of the coronavirus.
We created a dashboard to visualize the trend in the prevalence of topics about COVID-19–related stressors being discussed on a social media platform (Reddit). Our results provide insights into the prevalence of pandemic-related stressors during different stages of the COVID-19 pandemic. The NLP techniques leveraged in this study could also be applied to analyze event-specific stressors in the future.
The COVID-19 pandemic has affected the lives of people globally for over 2 years. To mitigate infection, safety measures such as social distancing, lockdowns, and school closures have been implemented. The mental health of many has been impacted due to stress, anxiety, loneliness, and feelings of uncertainty about the pandemic [
Social media platforms, such as Twitter and Reddit, are commonly used as the data source for obtaining insights regarding mental health status. As people share their feelings or experiences on the platforms, the content may reflect users’ emotions. The changes in emotions of the population could be reflected by their behavior on social media. Many researchers have utilized NLP techniques and social media data to analyze the mental health status of the population. De Choudhury et al [
In the field of public health and informatics, some researchers have utilized NLP topic modeling to summarize COVID-19–related discussions on social media. Medford et al [
Although some prior NLP research aimed at devising methods to utilize social media data to measure the population’s mental health status, few studies focused on identifying psychosocial stressors. According to the American Psychological Association [
In this paper, we used the latent Dirichlet allocation (LDA) topic model to identify pandemic-related distress from the topics discussed on the subreddit r/COVID19_support. After applying the LDA model, we visualized the trends in the prevalence of topics at different stages of the pandemic. Several existing works leveraged NLP to analyze the population’s mental health status during the pandemic, but they did not explore COVID-19–related stressors, which are the focus of this study. In addition, although most existing research focused on the mental health impacts at the beginning of the outbreak, the data set extracted in this study covers posts on the subreddit from February 2020, the start of the outbreak, to July 2021. This allowed us to visualize the changes in the prevalence of stressors at different stages of the pandemic. Observing the trends can provide insights into the latest predominant stressors, and the findings could be useful for mental health support providers and policy makers. We believe applying NLP to summarize text can help us obtain insights into mental health status and stressors during the pandemic.
Overview of the research framework. LDA: latent Dirichlet allocation; TF-IDF: term frequency-inverse document frequency.
In this study, Reddit was selected as the data source for applying machine learning models to obtain insights into the prevalence of stressors during COVID-19. There are several advantages of using Reddit in this study. De Choudhury and De [
There are several subreddits related to COVID-19. We selected r/COVID19_support for studying COVID-19–related stressors. Users of r/COVID19_support ask questions about COVID-19 and share their experiences of hard times during the pandemic. Discussions on other subreddits, such as r/COVID-19, r/Coronavirus, and r/CoronaVirus2019nCov, focus on news and information rather than on personal experiences. As the topics discussed on r/COVID19_support are more closely related to our research topic, we selected this subreddit as our data source.
Reddit posts were extracted from the subreddit r/COVID19_support using the Pushshift API [
On Reddit, some of the posts are tagged with “flairs,” which describe the content or the nature of the posts. The flairs in the data set included: “Support,” “Questions,” “Discussions,” “Trigger Warning,” “Good News,” “Firsthand Account,” “Resources,” “Vaccines are SAFE,” “News,” “Biosafety Request,” “The answer is NO,” “Misinformation-debunked,” and “Deperate mod” (sic).
We used the flairs to filter out posts that were unlikely to include content related to psychosocial stressors. Some posts on the subreddit shared information such as the latest news about COVID-19, potential adverse effects of vaccines, and tips about infection prevention. Some users asked questions such as which vaccines are safe, whether the adverse effects of vaccines are normal, and whether it is safe to visit their grandparents during the pandemic. Posts tagged with flairs such as “Resources,” “News,” and “Questions” therefore tend not to include content about stressors. In contrast, posts labelled with the flairs “Support” and “Trigger Warning” have a high tendency to include content about users’ personal experiences, stressors, or feelings during the pandemic. As this study focused on understanding stressors, we extracted the subset of posts labelled with flairs that have a high tendency to include relevant content.
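The flair-based filtering described above can be sketched as follows. This is a minimal illustration and not the authors' code: the record layout (`id`, `flair`, `text`), the function name `select_stressor_posts`, and the choice to treat only “Support” and “Trigger Warning” as stressor-related are assumptions made for the example. Posts without a flair are simply dropped here; the flair prediction applied to unlabelled posts in the study is not reproduced.

```python
# Flairs the text describes as likely to contain personal experiences or stressors.
# Restricting the set to these two flairs is an assumption for illustration.
STRESSOR_FLAIRS = {"Support", "Trigger Warning"}


def select_stressor_posts(posts):
    """Return only the posts whose flair is in STRESSOR_FLAIRS."""
    return [post for post in posts if post.get("flair") in STRESSOR_FLAIRS]


# Hypothetical example records (not taken from the data set).
posts = [
    {"id": 1, "flair": "Support", "text": "I feel so alone lately."},
    {"id": 2, "flair": "News", "text": "New case counts were released."},
    {"id": 3, "flair": "Trigger Warning", "text": "My dad is in hospital."},
    {"id": 4, "flair": None, "text": "Is it safe to visit my grandparents?"},
]

selected = select_stressor_posts(posts)
print([post["id"] for post in selected])
```

Keeping the flair set in one named constant makes it easy to rerun the analysis with a broader or narrower definition of stressor-related posts.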
In the data set, 4654 posts were labelled by flairs, and 4612 posts were not. The number of posts tagged by each of the flairs is shown in
Number of posts tagged by each of the flairs in the data set.
Flairs | Subset with labelled flairs (n=4654) | Subset with predicted flairs (n=4612) | Data set with labelled or predicted flairs (n=9266) |
Support | 2386 | —[a] | — |
Trigger warning | 197 | — | — |
Deperate mod[b] | 1 | — | — |
Group total | 2584 | 2888 | 5472 |
Questions | 1069 | — | — |
Discussion | 597 | — | — |
Vaccines are SAFE | 55 | — | — |
The answer is NO | 7 | — | — |
Biosafety request | 14 | — | — |
Group total | 1742 | 1417 | 3159 |
Good news | 146 | — | — |
Resources | 59 | — | — |
News | 18 | — | — |
Misinformation—debunked | 3 | — | — |
Group total | 226 | 225 | 451 |
Firsthand account | 102 | — | — |
Group total | 102 | 82 | 184 |
[a] Not applicable.
[b] The flair in the original post was “Deperate mod,” which is likely a misspelling of “desperate mood.”
In this study, we used the LDA topic model to identify the COVID-19–related distress that was mentioned on Reddit. The topic model is an unsupervised machine learning model that can be applied to different research topics such as computational social science and understanding scientific publications [
In this study, the LDA implementation in the Python scikit-learn package was used to identify the topics in the data set. The model was trained on TF-IDF features created from the data set. For each post, the topic model outputs the proportion of the content belonging to each topic. The dominant topic of each post was then taken to be the topic with the highest value in the LDA output. The number of topics is the major hyperparameter of the LDA model and affects the interpretability of the topics. There are different ways to determine a suitable number of topics to achieve high interpretability. In the study by Jelodar et al [
The output of topic modeling is highly dependent on the feature matrix (the TF-IDF values of each token for each of the documents). For the first trial, the feature matrix used to train the LDA model was built from TF-IDF with max_features set to 300. This means the feature matrix includes 300 columns of tokens, which are the terms consisting of one or more words that have the highest TF-IDF values in the given corpus. The model was then evaluated by the aforementioned methods, namely selecting sample posts from each of the topics and evaluating the topic coherence manually. In addition, the features were manually evaluated to determine whether they were likely to cause the LDA topic model to cluster posts in the desired ways. For example, the topic model may group posts with the tokens “suggestion,” “anyone,” and “thank you” into the same topic because those words tend to appear together when authors ask for suggestions at the end of posts. However, such a topic does not help with understanding the stressors and may act as noise. To avoid this, those tokens were removed from the feature matrix. We also identified some tokens that were useful for identifying the topics. For example, the tokens “grocery shopping,” “maskless,” and “no mask” commonly appeared when users expressed their fears of getting infected. We hypothesized that including those tokens in the feature matrix would help the LDA topic model identify topics related to fear of coronavirus, so the feature matrix was updated to include them. The LDA model was then retrained on the new feature matrix, bringing the output of the model closer to the desired result. The iteration process to improve the topic model is illustrated in
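One pass of the procedure described above (TF-IDF features, LDA, dominant topic per post) might look like the following sketch in scikit-learn. It is illustrative only: the toy posts, the use of `stop_words` to drop noise tokens, the 2-topic setting, and `random_state=0` are assumptions for the example, not the settings used in the study. Note also that scikit-learn's `max_features` keeps the terms with the highest term frequency across the corpus.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy posts standing in for the Reddit data (hypothetical examples).
posts = [
    "saw people with no mask while grocery shopping, scared of infection",
    "maskless coworker kept coughing, afraid that I was exposed",
    "online learning this semester is lonely, worried about graduation",
    "lost my job and cannot find jobs after university graduation",
    "no mask at the grocery store again, thank you for any suggestion",
    "my college class moved online, anyone have a suggestion",
]

# Noise tokens named in the text; removing them via stop_words is an assumption.
noise_tokens = ["suggestion", "anyone", "thank", "you"]

vectorizer = TfidfVectorizer(
    max_features=300,      # cap on vocabulary size, as in the first trial
    ngram_range=(1, 2),    # allow one- and two-word tokens such as "no mask"
    stop_words=noise_tokens,
)
tfidf = vectorizer.fit_transform(posts)

# The study grouped its LDA topics into 6 topic groups; 2 keeps the toy example small.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(tfidf)  # rows: posts, columns: topic proportions

# Dominant topic = topic with the highest proportion for each post.
dominant_topic = doc_topic.argmax(axis=1)

# List the highest-weighted terms of each topic for manual coherence checks.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
for k, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {k}: {top_terms}")
```

After inspecting the printed terms and a sample of posts per topic, the noise list and the vocabulary can be adjusted and the model refitted, which mirrors the iteration loop in the text.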
A psychosocial stressor lexicon is a list of keywords for each of the stressors. After the topic groups were defined, each topic group was examined to find keywords that directly indicate the presence of the topic in the text. For example, if a post includes education-related words such as “college” and “online learning,” it can be concluded that the author mentioned educational problems in the post. Lexicons were created for some of the topics. If a post included any of the words listed in a lexicon, it was assumed that the post included content related to the corresponding topic. Each post could contain more than one topic. Using the lexicons, each post was annotated according to whether it included content on each of the topics.
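The lexicon-based annotation can be sketched as below. The tokens are a subset of those listed in the paper's lexicon table; the matching rule (case-insensitive, whole-word or whole-phrase match) and the helper names `compile_lexicon` and `label_post` are assumptions, since the text does not specify how matches were detected.

```python
import re

# Subset of the lexicon reported in the paper; one list of tokens per topic.
LEXICON = {
    "Education problems": ["college", "online learning", "class", "semester", "freshman"],
    "Occupational problems": ["lost job", "unemployed", "laid off", "income", "money", "quit job", "career"],
    "Fear of coronavirus": ["no mask", "without mask", "maskless", "unmasked", "grocery"],
    "Pandemic development": ["forever", "permanent", "new normal", "never ending", "endless"],
}


def compile_lexicon(lexicon):
    """Build one case-insensitive, word-bounded pattern per topic."""
    patterns = {}
    for topic, tokens in lexicon.items():
        alternation = "|".join(re.escape(token) for token in tokens)
        patterns[topic] = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return patterns


def label_post(text, patterns):
    """Return {topic: True/False}; a post may match more than one topic."""
    return {topic: bool(pattern.search(text)) for topic, pattern in patterns.items()}


PATTERNS = compile_lexicon(LEXICON)
labels = label_post(
    "I was laid off and my college went to online learning forever.", PATTERNS
)
print(labels)
```

With word boundaries, “class” does not match “classes”; whether inflected forms should count is a design choice that a real lexicon would need to settle, for example by stemming the text first.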
In LDA, topics are defined as mixtures of terms with different probability distributions; a word can therefore belong to more than one topic, which can cause inaccurate predictions. In contrast, the lexical approach offers higher interpretability in topic classification, but it requires careful selection of keywords to avoid including terms that belong to more than one topic. In this study, we assumed no prior knowledge of how Reddit users expressed their feelings on the subreddit. To obtain insights into which topics existed in the subreddit and the keywords for each topic, we applied the LDA model before applying the lexical approach. The lexicon was created for 2 purposes. The first was to further analyze the subtopics within a topic group: topics that share common words may be merged into the same topic group by the LDA model, and they can be separated by choosing unique keywords. The second was to verify the results of the LDA model.
In the previous steps, each post was annotated with the LDA output, which represents the proportion of the content belonging to each of the topic groups, and with the results of the lexical approach, which indicate whether the post includes words listed in the lexicons. Then, the monthly sum of the LDA model output for each topic group was computed. The trend of topics in the data set can then be visualized and compared with the development of the pandemic. Regarding the pandemic development, the numbers of total cases, new cases per day, and vaccinated population were obtained from Our World in Data [
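The monthly aggregation of the LDA output can be sketched as follows. The record layout (an ISO date string paired with a list of topic proportions) and the function name `monthly_topic_sums` are assumptions made for illustration.

```python
from collections import defaultdict


def monthly_topic_sums(records, n_topics):
    """Sum per-post topic proportions within each calendar month (YYYY-MM)."""
    totals = defaultdict(lambda: [0.0] * n_topics)
    for created, proportions in records:
        month = created[:7]  # "2021-02-14" -> "2021-02"
        for k, proportion in enumerate(proportions):
            totals[month][k] += proportion
    return dict(sorted(totals.items()))


# Hypothetical records: (post date, LDA topic proportions for 2 topics).
records = [
    ("2021-01-05", [0.5, 0.5]),
    ("2021-01-20", [0.25, 0.75]),
    ("2021-02-14", [1.0, 0.0]),
]

sums = monthly_topic_sums(records, n_topics=2)
print(sums)
```

Summing proportions rather than counting dominant topics lets a post that mixes several topics contribute to each of them in proportion to its content.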
This study uses public datasets derived from social media and does not involve any human participants. As such it does not require ethics board approval.
After grouping LDA topics, 6 topics were identified: (1) fear of coronavirus, (2) educational and occupational problems, (3) family problems, (4) problems related to social relationships, (5) mental health symptoms, and (6) uncertainty about the development of the pandemic. We grouped the posts according to the topics and created word clouds for each of the 6 subsets.
We used t-distributed Stochastic Neighbor Embedding (t-SNE), which can be used to visualize high-dimensional data, to visualize the coherence of topic groups.
The major topic in the data set was “fear of coronavirus.” The words frequently appearing in the topic of “fear of coronavirus” are shown in
The LDA model identified the topic of “educational and occupational problems.” Some students shared their feelings about the online learning experience, and some felt upset because they missed the social engagement from school activities. Some people expressed worries about finding jobs during the pandemic. Other posts mentioned the situation of current university students having online classes for the final year and worried about finding a job after graduation. Due to the existence of this type of content and tokens related to education and work such as “online learning,” “graduation,” “university,” and “find jobs” appearing together, we grouped the educational and occupational LDA topics into the same group.
The topic “family problems” included several subtopics: worry about elderly parents getting infected, worry about parents who were already infected, and anger over opposing views among family members due to different levels of acceptance of social distancing measures. The topic “mental health symptoms” included mentions of symptoms such as insomnia, obsessive-compulsive disorder, and feeling depressed. Some users described their feelings and mental health symptoms on Reddit and asked for mental health support but did not explain the stressors. The topic “problems related to social relationships” included content about loneliness and uncertainty about how to balance social activities with pandemic safety measures. The topic “uncertainty about the development of the pandemic” refers to content about the worry that the pandemic would last forever. For this topic, some posts included discussions about whether vaccination would help life get back to normal.
Word clouds for each of the topics identified: (A) uncertainty about the development of the pandemic, (B) problems related to social relationships, (C) family problems, (D) educational and occupational problems, (E) mental health symptoms, (F) fear of the coronavirus.
t-distributed Stochastic Neighbor Embedding (t-SNE) plot for the topic groups identified.
On March 11, 2020, the World Health Organization declared COVID-19 a “pandemic” [
The trends in the prevalence of topics have been plotted separately in
As shown in
The prevalence of the topics “family problems” and “educational and occupational problems” dropped in February 2021. The prevalence of the uncertainty topic also declined in February 2021 but increased in April 2021, when the new variant was becoming prevalent in the United States. The most common topic in December 2020 and January 2021 was “uncertainty about the development of the pandemic.” Starting from May 2020, the prevalence of this topic was highly correlated with the number of cases.
To determine the distribution of topics in each month, we calculated the percentage of posts belonging to each topic in each month, and the corresponding trend was plotted (
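The monthly distribution of topics can be computed along the following lines. This sketch assumes each post has already been assigned a single dominant topic and a month label; the function name `monthly_topic_share` and the example values are hypothetical.

```python
from collections import Counter, defaultdict


def monthly_topic_share(posts):
    """posts: iterable of (month, dominant_topic). Returns {month: {topic: percent}}."""
    counts = defaultdict(Counter)
    for month, topic in posts:
        counts[month][topic] += 1

    shares = {}
    for month, counter in sorted(counts.items()):
        total = sum(counter.values())
        shares[month] = {topic: 100.0 * n / total for topic, n in counter.items()}
    return shares


# Hypothetical dominant-topic assignments.
posts = [
    ("2021-01", "fear of coronavirus"),
    ("2021-01", "fear of coronavirus"),
    ("2021-01", "uncertainty"),
    ("2021-01", "family problems"),
    ("2021-02", "uncertainty"),
]

shares = monthly_topic_share(posts)
print(shares)
```

Because the percentages within a month sum to 100, this view shows which topic dominates at each stage even when the total number of posts changes from month to month.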
Trend of topics on r/COVID19_support. The stacked area plot represents the sum of the latent Dirichlet allocation (LDA) output for each month, for each of the topics. The line plot represents the total number of cases and the vaccinated population.
Trends in the prevalence of topic groups: (A) educational and occupational problems, (B) mental health symptoms, (C) family problems, (D) problems related to social relationships, (E) fear of the coronavirus, and (F) uncertainty on the development of the pandemic. The line plots represent the number of cases for each month. The area plots represent the prevalence of topics, which was measured using the output of the latent Dirichlet allocation (LDA).
Trend in the proportions of topics mentioned on r/COVID19_support. The dashed lines represent the total number of cases and the vaccinated population. The solid lines represent the proportion of each topic for each month.
According to
The keywords for each of the topics were selected by evaluating the sample posts from the topic groups in the LDA model. As we assumed we had no prior knowledge of which mental health issues were expressed on the platform and the commonly used keywords for each of the topics, the LDA model was applied before using the lexical approach. Once the keywords and topics were obtained, we could use the lexical approach to label the topics mentioned in each of the posts and visualize the trends. In some cases, the posts may describe the topics without using the keywords. For example, in the topic “mental health symptoms” in the LDA model, the text in a post may include the words “feel,” “anxious,” “depressed,” and “tired”; however, those words are also common in other topics. Therefore, it may be unsuitable to use those keywords to identify the topic “mental health symptoms” using the lexical approach. For this case, the LDA model is needed to identify that topic. In this study, both methods were used, showing similar results for the topics “Fear of coronavirus” and “Pandemic development” (measured using the correlation coefficient).
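The comparison between the two approaches can be illustrated with a correlation between their monthly series. The text does not state which correlation coefficient was used; the sketch below assumes the Pearson coefficient, and the monthly values are made-up numbers chosen only to demonstrate the calculation.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (spread_x * spread_y)


# Hypothetical monthly prevalence of one topic under each approach.
lda_trend = [40.0, 35.0, 20.0, 10.0]   # monthly sum of LDA output
lexicon_trend = [80, 70, 40, 20]       # monthly count of lexicon matches

r = pearson(lda_trend, lexicon_trend)
print(round(r, 3))
```

The two series are on different scales (summed proportions versus post counts), which is why a scale-free measure such as a correlation coefficient is a reasonable way to compare their trends.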
We also plotted the trends in the number of posts for each topic separately in
To understand which stressor was the most prevalent at different stages of the pandemic, we measured the percentage of posts containing the topics in each month and plotted the trend.
According to
Lexicon for COVID-19 stressors.
Topics | Tokens |
Education problems | college, online learning, class, semester, freshman |
Occupational problems | lost job, unemployed, laid off, income, money, quit job, career |
Lonely | social interaction, interact, connection, lonely, friendless, feel alone, loneliness, social life, friendship, socialize, make friends, new friends, disconnected |
Fear of coronavirus | no mask, without mask, maskless, unmasked, grocery, panic, precautions, coworker, cough, exposed, wash, temperature, OCD |
Pandemic development | forever, permanent, back normal, new normal, ever end, never ending, endless, lose hope, normal life |
This figure compares the trends in the prevalence of topics between the latent Dirichlet allocation (LDA) model (solid lines) and the lexicon (dashed lines).
Trends in the number of posts mentioning each of the topics in the lexical approach: (A) educational problems, (B) occupational problems, (C) lonely, (D) fear of coronavirus, (E) pandemic development.
Percentage of posts mentioning each of the topics in each month (dominant topic).
In this study, by applying the topic model to Reddit data (subreddit r/COVID19_support), we identified 6 topics related to the pandemic: “fear of coronavirus,” “educational and occupational problems,” “family problems,” “problems related to social relationships,” “mental health symptoms,” and “uncertainty about development of pandemic.”
According to the result of our study, the prevalence of discussions on “fear of coronavirus” dropped significantly after the start of vaccination. One of the possible explanations is that the increase in the vaccinated population may have reduced the perceived risk of COVID-19. Perez-Arce et al [
The number of posts about “uncertainty about pandemic development” did not have a notable drop while the vaccinated population was growing, but the trend shows a correlation with the number of cases. People may have been uncertain about the length of lockdown and how long social distancing requirements would last. This could be correlated with the number of cases instead of the perceived risks of infections. Briscese et al [
In our study, the LDA models identified some of the COVID-19–related stressors proposed in prior studies. Taylor et al [
The trend observed in our study is consistent with those of prior studies. Yarrington et al [
In this study, the prevalence of stressors was only compared with the number of cases and the vaccinated population and not compared with specific safety measures in specific cities. This was due to the anonymity on Reddit. The demographic information of users was unknown, and the data set obtained in this study included posts written by users in different countries. Every city has implemented social distancing measures and lockdowns at different times, depending on the number of cases and the hospitalization rates. Due to this situation, we could not analyze the relationship between mental health status and the safety measures.
Similar to other studies that utilized social media data, the data extracted could only represent the population who would share their experience and feelings on social media. During the pandemic, children and older adults were strongly impacted by the changes, but it is unlikely that they shared their experiences on Reddit or other social media. To understand the needs of people with different demographics, questionnaires or interview-based studies are still required.
According to the results of this study, the stressors caused by perceived risks eased after vaccine distribution began. However, stressors related to frustration over uncertainty about the length of social distancing measures then became the major stressors. Lockdown and social distancing policies may depend on the hospitalization rate and on the transmissibility and severity of the virus. With a high proportion of the population vaccinated and more experience in handling patients with COVID-19, lockdowns such as those that occurred in the first 2 waves of COVID-19 are not likely to be required. To alleviate mental distress, governments may consider explaining to the public that the health care system is prepared to handle new outbreaks of coronavirus and that the process of getting back to normal life is underway.
With the lexicons created in this study, we obtained the posts or sentences that described worries about getting infected and uncertainty about whether the pandemic would ever end. These could be used as a data set for training machine learning classifiers to detect tweets that describe the stressors. Unlike Reddit posts, tweets can carry geographic information. By taking into account the location of the users who post the tweets, we could analyze the impact of specific social distancing policies on mental distress. We could also establish time series models to quantify and predict the expected effects on stressors. This could help policy makers estimate the impact on mental distress before implementing a policy.
In 2022, a new variant, Omicron, became prevalent, and there were updates to social distancing measures. In the future, methods similar to those in this study could be used to compare the trends in stressors during the periods when the Delta and Omicron variants were prevalent.
In this study, we applied topic modeling to a data set of Reddit posts from r/COVID19_support to identify COVID-19–related psychosocial stressors and to visualize the trend in the prevalence of the stressors. Compared with existing research, which utilized NLP techniques on social media data to study the mental health impacts of the pandemic, our study focused on stressors instead of mental health status. The data set used in this study included posts created over a period of more than 1 year during the pandemic. This allowed us to compare the prevalence of stressors before and after vaccines were distributed. The proposed topic model also allowed for monitoring the dominant stressors, which enables mental health support providers to notice changes in stressors at different stages of the pandemic. This study demonstrated the potential of using topic modeling on social media discussions to identify event-specific stressors and to create a dashboard for analyzing and monitoring the trends. We hope the findings in this study will provide insights for health care providers and social workers to address the needs for COVID-19–related mental health support. Furthermore, we hope the NLP techniques used in this study will be applied to analyze psychosocial stressors and create corresponding lexicons for future events such as pandemics, protests, or financial crises.
LDA: latent Dirichlet allocation
NLP: natural language processing
PHQ-4: Patient Health Questionnaire 4
t-SNE: t-distributed Stochastic Neighbor Embedding
TF-IDF: term frequency-inverse document frequency
None declared.