Natural Language Processing Reveals Vulnerable Mental Health Support Groups and Heightened Health Anxiety on Reddit During COVID-19: Observational Study

Background: The COVID-19 pandemic is impacting mental health, but it is not clear how people with different types of mental health problems were differentially impacted as the initial wave of cases hit.

Objective: The aim of this study is to leverage natural language processing (NLP) with the goal of characterizing changes in 15 of the world’s largest mental health support groups (eg, r/schizophrenia, r/SuicideWatch, r/Depression) found on the website Reddit, along with 11 non–mental health groups (eg, r/PersonalFinance, r/conspiracy) during the initial stage of the pandemic.

Methods: We created and released the Reddit Mental Health Dataset including posts from 826,961 unique users from 2018 to 2020. Using regression, we analyzed trends from 90 text-derived features such as sentiment analysis, personal pronouns, and semantic categories. Using supervised machine learning, we classified posts into their respective support groups and interpreted important features to understand how different problems manifest in language. We applied unsupervised methods such as topic modeling and unsupervised clustering to uncover concerns throughout Reddit before and during the pandemic.

Results: We found that the r/HealthAnxiety forum showed spikes in posts about COVID-19 early on in January, approximately 2 months before other support groups started posting about the pandemic. There were many features that significantly increased during COVID-19 for specific groups including the categories “economic stress,” “isolation,” and “home,” while others such as “motion” significantly decreased. We found that support groups related to attention-deficit/hyperactivity disorder, eating disorders, and anxiety showed the most negative semantic change during the pandemic out of all mental health groups. Health anxiety emerged as a general theme across Reddit through independent supervised and unsupervised machine learning analyses. For instance, we provide evidence that the concerns of a diverse set of individuals are converging in this unique moment of history; we discovered that the more users posted about COVID-19, the more linguistically similar (less distant) the mental health support groups became to r/HealthAnxiety (ρ=–0.96, P<.001). Using unsupervised clustering, we found the suicidality and loneliness clusters more than doubled in the number of posts during the pandemic. Specifically, the support groups for borderline personality disorder and posttraumatic stress disorder became significantly associated with the suicidality cluster. Furthermore, clusters surrounding self-harm and entertainment emerged.

Conclusions: By using a broad set of NLP techniques and analyzing a baseline of prepandemic posts, we uncovered patterns of how specific mental health problems manifest in language, identified at-risk users, and revealed the distribution of concerns across Reddit, which could help provide better resources to its millions of users. We then demonstrated that textual analysis is sensitive to uncover mental health complaints as they appear in real time, identifying vulnerable groups and alarming themes during COVID-19, and thus may have utility during the ongoing pandemic and other world-changing events such as elections and protests.


Economic Stress
Unemploy, economy, rent, mortgage, evict, enough money, more money, pay the bills, owe, debt, make ends meet, afford, save enough, salary, wage, income, job, eviction

Table S1: Manually constructed lexicons. Developed to assess the prevalence of tokens related to these topics in all of the subreddits.
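As an illustration of how a lexicon such as the one in Table S1 might be applied, the following minimal sketch counts lexicon hits in a post. The matching rule (case-insensitive, anchored at the start of a word so that stems like "unemploy" also match "unemployed") and the function names are assumptions for illustration, not the authors' documented procedure.

```python
import re

# Lexicon copied from Table S1 ("Economic Stress"), lowercased.
ECONOMIC_STRESS = [
    "unemploy", "economy", "rent", "mortgage", "evict", "enough money",
    "more money", "pay the bills", "owe", "debt", "make ends meet",
    "afford", "save enough", "salary", "wage", "income", "job", "eviction",
]


def build_pattern(lexicon):
    """Compile one regex that matches any lexicon entry at a word start.

    Assumption: prefix matching, so "evict" also matches "evicted".
    Longer entries are tried first so phrases win over shorter stems.
    """
    alternatives = sorted((re.escape(t) for t in lexicon), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")", re.IGNORECASE)


def lexicon_count(text, pattern):
    """Number of non-overlapping lexicon matches in a single post."""
    return len(pattern.findall(text))


pattern = build_pattern(ECONOMIC_STRESS)
print(lexicon_count("I lost my job and cannot afford rent.", pattern))
```

Prefix matching is deliberately permissive (for example, "wage" would also match "wager"); a stricter variant could require whole-word matches for entries that are not stems.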

Latent Dirichlet Allocation Topic Modeling on Prepandemic and Midpandemic Subreddits
The dictionary created as described in Section 1.2 was applied to all posts to create a bag-of-words corpus, which was used to train an LDA model with 25 passes and 3 workers. Models with 5, 10, 15, 20, and 25 topics were created. Models were also generated multiple times with different subsamples of posts to assess the stability of topics. A manually chosen LDA model with 10 topics was then applied to all posts across all subreddits (mental health and non-mental health) to assess the distribution of topics, allowing for comparison between the distribution of posts prepandemic vs midpandemic. A manually chosen LDA model created on midpandemic data was applied to posts from r/COVID19_support to assess any change in topic distribution.
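The comparison of topic distributions between the two periods can be sketched as follows. This is a minimal illustration that assumes each post has already been assigned a single dominant topic index by a fitted model; that assignment rule and the function names are hypothetical and not taken from the text.

```python
from collections import Counter


def topic_distribution(dominant_topics, n_topics):
    """Share of posts whose dominant topic is k, for k = 0 .. n_topics - 1."""
    counts = Counter(dominant_topics)
    total = len(dominant_topics)
    return [counts.get(k, 0) / total for k in range(n_topics)]


def distribution_shift(pre, mid):
    """Per-topic change in share (midpandemic minus prepandemic)."""
    return [m - p for p, m in zip(pre, mid)]


# Toy topic assignments for one subreddit (made-up values).
pre = topic_distribution([0, 0, 1, 2], n_topics=3)
mid = topic_distribution([0, 1, 1, 1], n_topics=3)
print(distribution_shift(pre, mid))
```

Working with shares rather than raw counts keeps the two periods comparable even when they contain different numbers of posts.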

Measuring Similarity Between Subreddits over Time with Supervised Dimensionality Reduction
Both UMAP's parameters and the downsampling used to obtain balanced classes could affect the relative distances between subreddits, so we estimated the precision of this approach on 2019 data. First, hyperparameter tuning was performed (2700 samples for each subreddit) to find the parameter set that optimized clustering, as measured by the silhouette score, over n_neighbors (2, 10, 20, 50, 100, 200), min_dist (0.0, 0.1, 0.25, 0.5, 0.8, 0.99), and metric (euclidean, cosine). Second, to address the variance caused by subsampling, we measured the pairwise Hausdorff distances between 2019 clusters across 50 runs, each with new random subsampling. Using a distance between clusters, rather than their absolute centroid locations, avoids the effects of rotated or flipped dimensions. Bootstrapping across runs provides an estimate of the method's precision and also allows us to measure how rare 2020 changes in distances are with respect to a distribution of regular fluctuations for a non-pandemic year (2019). For 2020, we also computed the median distance across 50 bootstrapping samples for our final analysis.
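The two ingredients of this procedure, the hyperparameter grid and the Hausdorff distance between clusters, can be sketched in plain Python. The grid values come from the text; the function names and the empirical rarity measure are illustrative assumptions, and the embedding itself (fitting UMAP and scoring silhouettes) is omitted.

```python
import itertools
import math

# Grid from the text: 6 x 6 x 2 = 72 candidate UMAP settings.
N_NEIGHBORS = [2, 10, 20, 50, 100, 200]
MIN_DIST = [0.0, 0.1, 0.25, 0.5, 0.8, 0.99]
METRIC = ["euclidean", "cosine"]
PARAM_GRID = [
    {"n_neighbors": n, "min_dist": d, "metric": m}
    for n, d, m in itertools.product(N_NEIGHBORS, MIN_DIST, METRIC)
]


def directed_hausdorff(a, b):
    """Largest distance from a point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def fraction_at_least(baseline, observed):
    """Share of baseline values >= observed (assumed rarity measure)."""
    return sum(v >= observed for v in baseline) / len(baseline)


# Toy 2-D clusters standing in for two embedded subreddits.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(0.0, 0.0), (4.0, 0.0)]
print(len(PARAM_GRID), hausdorff(cluster_a, cluster_b))
```

Because the Hausdorff distance depends only on distances between points, it is unchanged when both clusters are rotated or flipped together, which is the property the text relies on. The brute-force version here is quadratic in cluster size, so a real analysis would likely use an optimized implementation.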

Trend Analysis
See Figure S4 for examples of trends and regression. See Figure S5 for 2020 main results and Figures S6 and S7 for comparisons to 2019 and 2018 trends. See Table 2 in the main text for examples of significant trends in Figure S5. See Figure S5 for more details.

Table S4. Cluster annotations. Cluster annotations were assigned based on a review of the features found to distinguish each cluster, using Wilcoxon rank-sum tests with Bonferroni correction. Representative significantly cluster-associated features informing the annotation are shown, along with the total number of significant features per cluster. Clustering was performed separately with k=20 on the prepandemic posts dataset from 2019 and the midpandemic posts dataset from 2020. The majority of clusters were approximately replicated between the two time periods. Each pair of clusters (one from the prepandemic data and one from the midpandemic data) assigned the same annotation is shown side-by-side to illustrate the overlap of their predictive features. A few clusters were detected only in the prepandemic dataset (e.g., General Mental Health, Seeking Advice) and a few were detected only in the midpandemic dataset (e.g., Self Harm, Entertainment). Two "Resources" clusters were detected in the prepandemic dataset, and three were detected in the midpandemic dataset. The characteristic features were not meaningfully distinct among these "Resources" clusters, and the clusters partitioned an island of posts visible in UMAP space, so all "Resources" clusters detected in a given dataset were collapsed into a single cluster. Columns are marked with stars for subreddits on which posts from the given cluster were significantly enriched. See Figure 3B for the full set of cluster enrichments on the analyzed subreddits.
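The feature test named in the Table S4 caption can be sketched as follows: a two-sided Wilcoxon rank-sum test on one feature (posts inside a cluster vs all other posts), followed by a Bonferroni adjustment across features. This is a minimal sketch using the normal approximation without tie or continuity correction, which is an assumption; the function names are illustrative and the exact variant used by the authors is not stated.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided rank-sum test; returns (z statistic, p-value)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])  # rank sum of the first sample
    expected = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - expected) / sd
    return z, math.erfc(abs(z) / math.sqrt(2))


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Toy feature values: posts in a cluster vs posts outside it.
z, p = rank_sum_test([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
print(round(z, 3), round(p, 4), bonferroni([p, 0.5]))
```

With 90 features tested per cluster, the Bonferroni step would multiply each p-value by 90 under this scheme, which is conservative but simple to interpret.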

Topic Modeling with Latent Dirichlet Allocation (LDA)
We tested whether the subreddits with the largest increases in the Health Anxiety topic were also those with the largest increases in negative semantic change, as measured by the trend analysis, but the correlation was not significant (ρ = -0.046, P = 0.819).

This highlights that the distributions of these particular topics within a single subreddit largely did not change between the prepandemic and midpandemic timeframes, except for an increase in the topics "Health Anxiety" and "Life" and a decrease in the "Alcohol/Addiction" topic.
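For reference, a rank correlation of the kind reported in the significance test above can be computed as below. This is a minimal sketch that assumes ρ denotes Spearman's rank correlation (the Pearson correlation of the ranks); the function names and the toy values are illustrative only.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the ranks of x and y.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


# Toy per-subreddit changes (made-up values, one entry per subreddit).
topic_change = [0.02, 0.05, 0.01, 0.08]
semantic_change = [0.3, 0.1, 0.4, 0.2]
print(spearman_rho(topic_change, semantic_change))
```

Because only ranks enter the calculation, the coefficient is insensitive to the scale of either measure, which suits a comparison between topic shares and trend-based change scores.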
Figure S11: Prepandemic LDA model over non-mental health subreddits. Distribution of prepandemic LDA topics for posts in non-mental health subreddits prepandemic (left) and midpandemic (right). As with the mental health subreddits, distributions of these particular topics within a single subreddit largely did not change between pre