Understanding Weight Loss via Online Discussions: Content Analysis of Reddit Posts Using Topic Modeling and Word Clustering Techniques

Background: Maintaining a healthy weight can reduce the risk of developing many diseases, including type 2 diabetes, hypertension, and certain types of cancers. Online social media platforms are popular among people seeking social support regarding weight loss and sharing their weight loss experiences, which provides opportunities for learning about weight loss behaviors. Objective: This study aimed to investigate the extent to which the content posted by users in the r/loseit subreddit, an online community for discussing weight loss, and online interactions were associated with their weight loss in terms of the number of replies and votes that these users received. Methods: All posts that were published before January 2018 in r/loseit were collected. We focused on users who revealed their start weight, current weight, and goal weight and were active in this online community for at least 30 days. A topic modeling technique and a hierarchical clustering algorithm were used to obtain both global topics and local word semantic clusters. Finally, we used a regression model to learn the association between weight loss and topics, word semantic clusters, and online interactions. Results: Our data comprised 477,904 posts that were published by 7660 users within a span of 7 years. We identified 25 topics, including food and drinks, calories, exercises, family members and friends, and communication. Our results showed that the start weight ( β =.823; P <.001), active days ( β =.017; P =.009), and median number of votes ( β =.263; P =.02), mentions of exercises ( β =.145; P <.001), and nutrition ( β =.120; P <.001) were associated with higher weight loss. Users who lost more weight might be motivated by the negative emotions ( β =−.098; P <.001) that they experienced before starting the journey of weight loss. In contrast, users who mentioned vacations ( β =−.108; P =.005) and payments ( β =−.112; P =.001) tended to experience relatively less weight loss. 
Mentions of family members ( β =−.031; P =.03) and employment status ( β =−.041; P =.03) were associated with less weight loss as well. Conclusions: Our study showed that both online interactions and offline activities were associated with weight loss, suggesting that future interventions based on existing online platforms should focus on both aspects. Our findings suggest that online personal health data can be used to learn about health-related behaviors effectively.


Introduction
Background
Maintaining a healthy weight can reduce the risk of developing many diseases, including type 2 diabetes, hypertension, heart disease and strokes, kidney diseases, and certain types of cancers [1][2][3][4][5][6]. Unfortunately, overweight and obesity have become a public health crisis that affects many Americans. For example, it was reported that nearly 94 million US adults were affected by obesity in 2015 and the annual medical cost was approximately US $150 billion [7]. To promote public health and help control overweight and obesity, it is critical to understand what factors are associated with weight loss and to design effective weight loss interventions.
Over the past decade, people have increasingly used online social media platforms to share personal experiences and seek social support regarding weight loss, understand the impact of obesity, and learn about factors that contribute to a healthy life [8]. This huge amount of online information enables health care providers and researchers to gain insights into both public and personal health. For example, studies showed that what people shared on Instagram and Twitter could be used to effectively assess obesity prevalence in the United States [9,10]. While a content analysis showed that Twitter users were more likely to discuss weight loss during and after holidays [11], a survey-based study suggested that the interactions on Twitter were too brief and shallow [12], which might keep its users from gaining deep social support. Such support is, however, a major motivation for people to engage in other online weight loss forums or communities [13].
As such, recent studies, which aimed to prevent obesity and promote healthy weight loss, tended to incorporate online social media platforms into their intervention design, but the effects were found to be mixed. While some investigations showed that these online platforms had the potential for an innovative weight loss intervention [14,15], others found their effects were limited because of a low retention and engagement rate [16,17]. Moreover, a meta-analysis of over 2000 studies concluded that the effects of the interventions incorporating online social networks were very modest in improving health-promoting behaviors [18]. This suggested that before designing interventions based on qualitative evidence that online support and engagement are helpful [19], a quantitative analysis is necessary in determining how online interactions with other users (as a potential external influential factor), and a user's offline activities recorded in online discussions (as a potential internal driving factor), are associated with their weight loss.

Current Research and Its Limitation
Several studies have shown that consistent online activities (eg, updating progress in weight loss and interacting with others) in online weight loss programs or training were associated with higher weight loss [20][21][22]. In particular, a study using data from r/loseit showed that higher BMI levels and higher online activities were associated with more weight loss [23]. Similarly, another study based on causal inference found that, on average, users who received comments in r/loseit lost 9 lb more than users who did not receive any comments [24]. Although online interactions were shown to have a significant impact on weight loss [25], existing studies focused less on the content of posts. While both the aforementioned studies used topic modeling to extract topics, they merely focused on the most popular ones. This method, however, was too general to identify detailed offline activities that were disclosed in such a large number of posts. Moreover, these studies used data that were generated between 2010 and 2014, which, as we show later, constituted only a small fraction of the posts published in r/loseit.
In fact, highlighting both online interactions and personal offline activities aligns with social cognitive theory (SCT) [26]. The theory emphasizes that external and internal social reinforcement together lead to behavior change in a dynamic fashion, and it is often applied to guide the design of effective intervention strategies [27]. This suggests that focusing on either online interactions or personal offline activities, but not both, might lead to an incomplete view of the roles of online communities in the process of weight loss, but this needs to be examined and confirmed with evidence.

Objective
Therefore, considering the limitations of previous studies and inspired by SCT, we aimed to investigate the extent to which the offline activities communicated by users in the r/loseit subreddit, an online community for discussing weight loss, and online interactions were associated with their weight loss. Specifically, we focused on a data set consisting of 477,904 posts that were published by 7660 users before January 2018 in r/loseit. We used the self-reported weight change to measure weight loss and the average number of comments and votes that they received from other users to characterize online interactions. We applied both topic modeling and word clustering to obtain detailed and interpretable contributing factors from online posts. Finally, we used a linear regression model to quantitatively examine the association between online interactions, factors described in online discussions, and the amount of weight loss.
Our work provided evidence that an online social media platform can serve as an effective data source to understand weight loss, and our findings implied that in future weight loss analyses or interventions, online interactions should be considered as a factor that influences long-term self-efficacy.

Data
Our data were collected from r/loseit, a subreddit focusing on weight loss in Reddit, an online discussion platform. Within the subreddit, users can either publish a submission to start a new discussion thread or make comments on either a submission or another comment to an existing discussion thread. For simplicity, we used the word post to denote either a submission or a comment when we did not differentiate them. In addition, Reddit users can upvote or downvote a comment but can only upvote a submission. Furthermore, users in many subreddits are allowed to enter text or symbols into a flair, which appears next to their usernames in a post, to show some basic information about themselves. For example, in r/loseit, users can show their start weight, current weight, and goal weight and even their gender, age, and height information in flairs. However, as creating a flair is not required, users can ignore it when publishing a post.
In this study, we used the Python Reddit API Wrapper (PRAW) Python package (version 5.3.0) to extract data from the Reddit application programming interface. Specifically, we collected all the posts in r/loseit that were published before January 13, 2018. We used the flairs tagged with usernames to confine our study cohort to those users who disclosed their start, current, and goal weights and were active for at least 30 days in this subreddit [23]. It should be noted that we did not ask for permission to use the data from the Reddit community because the data are publicly accessible. However, we never tried to identify any Reddit user by linking their Reddit data with additional data sets. All the results and post samples presented in this paper were carefully examined and revised such that no personally identifiable information was disclosed.
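The cohort filter described above can be illustrated with a minimal sketch. The flair layout ("SW: 210 CW: 185 GW: 150"), the regular expression, and the function names `parse_flair` and `is_eligible` are all assumptions made for illustration; real r/loseit flairs are free-form text, and the paper does not state how flairs were parsed or exactly how "active for at least 30 days" was computed. Here, activity is approximated as the span between a user's first and last post.

```python
import re
from datetime import date

# Hypothetical flair layout, e.g. "28F 5'6\" SW: 210 CW: 185 GW: 150".
# Real flairs are free-form, so this pattern is an illustrative assumption.
WEIGHT_PATTERN = re.compile(r"\b(SW|CW|GW)\s*:?\s*(\d+(?:\.\d+)?)", re.IGNORECASE)


def parse_flair(flair_text):
    """Return {'SW': ..., 'CW': ..., 'GW': ...}, or None if any weight is missing."""
    if not flair_text:
        return None
    weights = {}
    for label, value in WEIGHT_PATTERN.findall(flair_text):
        weights[label.upper()] = float(value)
    if {"SW", "CW", "GW"} <= weights.keys():
        return weights
    return None


def is_eligible(flair_text, first_post_day, last_post_day, min_active_days=30):
    """Keep users who disclose all three weights and whose posts span >= 30 days.

    Assumption: "active days" is the span between first and last post.
    """
    weights = parse_flair(flair_text)
    active_days = (last_post_day - first_post_day).days
    return weights is not None and active_days >= min_active_days
```

For example, `is_eligible("SW 250 CW 230 GW 180", date(2017, 1, 1), date(2017, 2, 15))` keeps the user, whereas a flair without a goal weight is excluded regardless of activity.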

Topic Modeling and Word Semantic Clustering
Owing to high dimensionality, noise, and ambiguity of natural language text, processing and analyzing raw post content are often challenging, and the analyzed results are difficult to interpret. In this study, we used 2 types of methods to mitigate this problem: topic modeling and word semantic clustering based on low dimensional representation (eg, word2vec) [28]. While topic modeling can help identify themes in a global context, word semantic clusters can provide more detailed concepts in a local context [29].
Specifically, we used the implementation of latent Dirichlet allocation (LDA) in Mallet (version 2.0.8) to identify the main topics of online discussions in r/loseit [30]. Since LDA is an unsupervised algorithm, we relied on a coherence score to determine the optimal number of topics [31]. In this study, we trained LDA models for 5 to 75 topics (with a step size of 5) on all of the posts and chose the number of topics that corresponded to the highest coherence score. To mitigate word sparsity, we only kept nouns, verbs, adjectives, and adverbs.
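The model selection step can be sketched as a simple grid search. The function `select_num_topics` and its `train_and_score` argument are hypothetical: the caller is assumed to supply a function that trains an LDA model with k topics (for example, by invoking Mallet) and returns its coherence score. The sketch does not commit to a particular coherence measure, since the text only cites [31].

```python
def select_num_topics(train_and_score, candidates=range(5, 80, 5)):
    """Pick the topic count with the highest coherence score.

    train_and_score: hypothetical callable, k -> coherence (higher is better).
    candidates: 5, 10, ..., 75, matching the grid described in the text.
    """
    scores = {k: train_and_score(k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Toy stand-in for a real training run: coherence peaks at 25 topics.
best_k, scores = select_num_topics(lambda k: -abs(k - 25))
```

Keeping the full `scores` dictionary, rather than only the maximum, makes it easy to plot coherence against the number of topics and check that the chosen value is not a marginal winner.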
To obtain word semantic clusters, we relied on the Google pretrained word2vec model because our data set was not large enough to train an accurate word2vec model. We relied on the standard deviation of cluster size to determine the optimal number of clusters [29]. Specifically, we ran a hierarchical clustering algorithm with 25 to 1000 clusters (with a step size of 25) and applied the elbow rule to the standard deviation of the number of words in clusters. Intuitively, a large word cluster is more likely to contain multiple concepts, while a small word cluster is more likely to contribute little to reducing hundreds of thousands of word dimensions [29].
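The elbow rule on cluster-size dispersion can be sketched as follows. The text does not say how the elbow was located (it may have been read off a plot), so the geometric heuristic below, which picks the point farthest from the straight line joining the first and last points, is an assumption. The function names `size_std_by_k` and `elbow_point` are hypothetical.

```python
from statistics import pstdev


def size_std_by_k(cluster_sizes_by_k):
    """Map each candidate k to the standard deviation of its cluster sizes."""
    return {k: pstdev(sizes) for k, sizes in sorted(cluster_sizes_by_k.items())}


def elbow_point(xs, ys):
    """Return the x whose point lies farthest from the chord between the endpoints.

    Both axes are rescaled to [0, 1] so that neither dominates the distance.
    This is one common heuristic, not necessarily the authors' procedure.
    """
    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]
    dx = (x1 - x0) or 1.0
    dy = (y1 - y0) or 1.0
    best_x, best_dist = xs[0], -1.0
    for x, y in zip(xs, ys):
        nx, ny = (x - x0) / dx, (y - y0) / dy
        dist = abs(nx - ny)  # distance to the diagonal, up to a constant factor
        if dist > best_dist:
            best_x, best_dist = x, dist
    return best_x


# Toy curve: the dispersion drops quickly, then flattens.
ks = [25, 50, 75, 100, 125]
stds = [100, 40, 20, 15, 12]
chosen_k = elbow_point(ks, stds)
```

On the toy curve the heuristic selects 50, the point after which additional clusters reduce the dispersion only slowly.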

Regression Analysis
In this study, we investigated the association between weight loss and online discussions by using a linear regression model. Specifically, we characterized a user's online discussion by using the following predictors:
• The days that the user was active in the subreddit.
• The number of posts that the user published.
• The topics conveyed in the posts, measured by topic distribution.
• The word semantic clusters, measured by term frequency-inverse document frequency values.
• The median karma score or votes that the user received for each post, measured by subtracting the number of downvotes from the number of upvotes [32].
• The median number of comments for each post that the user published.
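As an illustration of the word semantic cluster predictor, the sketch below replaces each word by its cluster identifier and computes a term frequency-inverse document frequency value per user. Several details are assumptions, since the text does not specify them: each user's concatenated posts are treated as one document, the term frequency is length-normalized, and the inverse document frequency is the unsmoothed log(N/df). The name `cluster_tfidf` and the toy cluster labels are hypothetical.

```python
import math
from collections import Counter


def cluster_tfidf(user_tokens, word_to_cluster):
    """user_tokens: {user: [word, ...]} -> {user: {cluster_id: tf-idf value}}.

    Words without a cluster assignment are ignored.
    """
    counts = {
        user: Counter(word_to_cluster[w] for w in tokens if w in word_to_cluster)
        for user, tokens in user_tokens.items()
    }
    n_users = len(counts)
    # Number of users who mention each cluster at least once.
    doc_freq = Counter(c for user_counts in counts.values() for c in user_counts)
    result = {}
    for user, user_counts in counts.items():
        total = sum(user_counts.values())
        result[user] = {
            c: (n / total) * math.log(n_users / doc_freq[c])
            for c, n in user_counts.items()
        }
    return result


word_to_cluster = {"run": "C_exercise", "jog": "C_exercise",
                   "pizza": "C_food", "pasta": "C_food"}
user_tokens = {"a": ["run", "jog", "pizza"], "b": ["pizza", "pasta"]}
tfidf = cluster_tfidf(user_tokens, word_to_cluster)
```

In the toy data, the food cluster is mentioned by every user and therefore receives a weight of zero, while the exercise cluster, mentioned by only one user, receives a positive weight.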
We used weight loss, measured by subtracting the current weight from the start weight, as the outcome variable of the regression model. As the distribution of the weight loss variable is right-skewed, we log-transformed it before feeding it into the regression model. All the predictors were normalized and scaled into a range of (0, 1). It should be noted that the active days and the number of comments were also log-transformed because of their right-skewed distributions.
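A minimal sketch of these transformations follows. It assumes that weight loss is positive (start weight greater than current weight) so that the logarithm is defined; the text does not say how users with zero or negative weight loss were handled. Min-max scaling is also an assumption about how predictors were mapped into the unit range. The function names are hypothetical.

```python
import math
from statistics import median


def log_weight_loss(start_weight, current_weight):
    """Log of (start - current); assumes the user has lost weight."""
    loss = start_weight - current_weight
    if loss <= 0:
        raise ValueError("log transform requires a positive weight loss")
    return math.log(loss)


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def median_karma(posts):
    """posts: list of (upvotes, downvotes); karma is upvotes minus downvotes."""
    return median(up - down for up, down in posts)
```

For instance, `min_max_scale([2, 4, 6])` yields `[0.0, 0.5, 1.0]`, and `median_karma([(5, 1), (3, 0), (10, 2)])` yields 4.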
Considering that a person who had higher weight at the beginning is more likely to lose more weight, we also introduced start weight as a control variable in the model. Before applying the regression analysis, we used the findCorrelation function, as implemented in the caret R package (version 6.0-81), with a cutoff of 0.3 to remove correlated predictors. We reported predictors with a statistical significance level of .05.
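The idea behind correlation-based predictor removal can be sketched in plain Python. This is not the caret implementation: it is a simplified greedy variant written for illustration, in which the most correlated remaining pair is examined and the member with the larger mean absolute correlation to the other remaining predictors is dropped, until no pair exceeds the cutoff. The names `pearson` and `drop_correlated` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def mean_abs_corr(name, kept, corr):
    """Mean absolute correlation of one predictor with the other kept predictors."""
    others = [corr[(name, o)] for o in kept if o != name]
    return sum(others) / len(others)


def drop_correlated(columns, cutoff=0.3):
    """columns: {name: [values]} -> names kept after greedy removal."""
    names = list(columns)
    corr = {(a, b): abs(pearson(columns[a], columns[b]))
            for a in names for b in names if a != b}
    kept = set(names)
    while True:
        pairs = [p for p in corr
                 if p[0] in kept and p[1] in kept and corr[p] > cutoff]
        if not pairs:
            break
        a, b = max(pairs, key=lambda p: corr[p])
        # Drop whichever member is, on average, more correlated with the rest.
        kept.discard(a if mean_abs_corr(a, kept, corr) >= mean_abs_corr(b, kept, corr) else b)
    return [n for n in names if n in kept]


columns = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8.5], "z": [1, -1, -1, 1]}
kept_names = drop_correlated(columns, cutoff=0.3)
```

In the toy data, `x` and `y` are almost perfectly correlated, so one of them is removed, while `z` is nearly uncorrelated with both and is retained.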

Data Statistics
We collected 2,526,277 posts published by 205,722 users during the period between July 30, 2010, and January 13, 2018. Focusing on the users who disclosed their start, current, and goal weights in flairs, we finally obtained 7660 users with a total of 477,904 posts. These posts included 16,332 submissions and 461,572 comments. Table 1 summarizes the basic statistics of key factors regarding this study cohort. From the table, we observed that most posts received a small number of comments and karma scores.

Topics Discovered in Online Discussions
We identified 25 topics, the number that corresponded to the highest coherence score (Multimedia Appendix 1). Table 2 shows the inferred topics, their marginal distribution, and the most relevant terms. The marginal distribution of a topic was measured by the probability that the topic was sampled from online discussions, while the relevance of a term was measured by the probability that it was sampled from a topic. The identifier of each topic was assigned based on the descending order of the topic distribution. For example, topic T1, talking about drinks, had the highest distribution, while topic T25, one of the weight change-related topics, had the lowest topic distribution.
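The naming convention for topics can be illustrated with a short sketch. It assumes that the marginal distribution of a topic is the unweighted mean of the per-document topic proportions; the text does not state whether documents were weighted, for example by length. The function name `rank_topics` is hypothetical.

```python
def rank_topics(doc_topic):
    """doc_topic: list of per-document topic proportion lists.

    Returns {"T1": (topic_index, marginal), ...}, ordered by descending marginal.
    """
    n_docs = len(doc_topic)
    n_topics = len(doc_topic[0])
    marginal = [sum(row[t] for row in doc_topic) / n_docs for t in range(n_topics)]
    order = sorted(range(n_topics), key=lambda t: marginal[t], reverse=True)
    return {f"T{rank}": (t, marginal[t]) for rank, t in enumerate(order, start=1)}


labels = rank_topics([[0.2, 0.8], [0.4, 0.6]])
```

In the toy input, the second topic has the larger average proportion and is therefore labeled T1.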
We also manually summarized the 25 topics into 11 categories and provided the associated labels in Table 2. The table shows that people in this subreddit often talked about food and drinks, exercise, calories, clothes, time, health issues, family members and friends, weight change, feelings, plans, and communication.

Regression Analysis
We chose 425 as the optimal number of word clusters based on the elbow rule (Multimedia Appendix 1). The word semantic clusters, together with the other proposed predictors, were used to fit a linear regression model. After examining the feature correlation, we included 6 topics and 402 word semantic clusters in the regression model. Figure 2 shows the distribution of the log-transformed weight loss, which loosely satisfies the distributional assumption for applying a linear regression model.

Goodness of Fit and Non-content Predictors
The fitted linear regression model had an adjusted R²=0.315, F 7,191 =9.553, and P<.001, suggesting that the model with the proposed predictors predicted weight loss better than the basic intercept-only model. Among the non-content-related predictors, the start weight (β=.823; P<.001), the active days in the subreddit (β=.02; P=.009), and the median karma score (β=.263; P=.02) were associated with higher weight loss. However, the median number of comments was not significantly associated with weight loss in our analysis (β=.001; P=.95).
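For readers less familiar with these fit statistics, the standard textbook definitions are given below. They are stated under the assumption that the reported values follow the usual ordinary least squares conventions, with n observations (users) and p predictors; the symbols n and p are introduced here for illustration and are not values taken from the paper.

```latex
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1},
\qquad
F = \frac{R^2 / p}{\left(1 - R^2\right) / \left(n - p - 1\right)}
```

The adjusted statistic penalizes the number of predictors, which matters for a model with several hundred word cluster predictors, and the F statistic tests the fitted model against the intercept-only model.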

Topic-Related Predictors
Three topics were found to be significantly associated with weight loss. Topics T16 (exercise in the gym) and T19 (purchase of clothes) were associated with higher weight loss (β=.072; P=.007 and β=.080; P<.001, respectively), while topic T7 (calorie counting) was associated with lower weight loss (β=−.074; P=.007). Table 3 shows word semantic clusters that were significantly associated with higher weight loss, which are summarized below.

Diet- or Dining-Related Clusters
Diet- or dining-related clusters included C301 (nutrition), C143 (minerals), C291 (restaurant), C293 (leftover), C108 (hosting), C66 (cookout), C25 (evil gluttony), and C285 (city-related food, exercise). The following are some examples that were communicated in related posts:
today I eat a pancake with plain greek yoghurt as a pre-workout in-work snack, after gym I will inhale two greek yoghurts with two spoons of protein powder, which catapults me to 170g of protein today.

I think I've spent more time replacing bad items (like I have asparagus and Brussel sprouts instead of freezer fries now, for example) with better options than I have really giving things up.
I learned the same thing with pizza. I love pizza, but nowadays I would much rather enjoy a slice of pizza from a good local restaurant, than an entire pizza from Domino's or Papa John's.

Other Clusters
Other clusters included C304 (performances), C334 (negative emotions), C24 (simple, straightforward, economic), C193 (flowers), C110 (beginner), and C383 (replicate). The following are some examples of these clusters:
I ended up just drinking water at the theater because everything was either full of calories or full of sugar. ended up having some leftover chicken and a banana when I got home.
About the anxiety, yea same boat. I struggle with social anxiety since I'm 13 years old but it did get a lot better over the last few months.

Look into a Hot Pot-they are relatively inexpensive (< $15) and at least allow you to do some basics like cook rice/pasta/soup/sauces.
Finished the 13-week beginner program on DDP Yoga. Hit every workout on the schedule, didn't lag behind the schedule by even a single day once...Feels great to have stuck with it so well.
I got flowers at work! And a sweet little card that said "I'm proud of you" from my mom because of how much effort and progress I've made in the last two months.
Table 4 shows word semantic clusters that were significantly associated with lower weight loss, which are summarized below.
I have three diagnosed illnesses and have been hospitalized twice. Users who exhibit suicidal behavior should be pointed to suicide prevention hotlines... I've backtracked by a couple weeks, which is partially water weight, and partially actual weight gain. It sucks.
However, I'm over my 1200 for the day, not by much but I made a silly calorie budgeting decision earlier in the day. I readjusted dinner to try to make up for it, but I was too far in the hole.

Principal Findings
We used topic modeling to identify 25 general topics from the r/loseit subreddit. These topics covered a broad range of weight loss-related themes, including food and drinks, exercises, calories, health issues, family members and friends, feelings, and communication. Among these topics, those regarding food and drinks, health issues, family members and friends, calories, and exercise were the most discussed. These topics were consistent with the findings of another study [23].
Our regression analysis showed that the start weight and active days were associated with higher weight loss, which is consistent with intuition. Furthermore, our results showed that receiving a higher karma score was associated with higher weight loss, but the median number of comments received was not significantly associated with weight loss. Our findings differed slightly from those of two previous studies, in which both the karma score and comments were associated with higher weight loss [24,25]. We suspected that this might be because (1) we included far more users in our study and most of the posts received a very limited number of comments (Table 1) and (2) the previous studies did not control for more detailed content in their models.
After adjusting for active days, start weight, karma score, and the median number of comments received, our analysis suggested that exercises, including coaching and nutrition, were the content factors most strongly associated with higher weight loss, which is consistent with previous investigations [33,34]. In addition, users with higher weight loss mentioned negative emotions that they experienced before they started to make efforts for weight loss. Our findings also suggested that mentioning food-related topics (eg, not eating too much, eating healthy food) was associated with higher weight loss, which was also found in a previous study [35]. Interestingly, we found that the mention of Xbox games was associated with higher weight loss as well. Evidence suggested that incorporating active video games had a positive effect on increasing physical activity and promoting healthy weight for both overweight adults and children [36,37].
In addition, we found that many content factors were associated with lower weight loss. For example, we found that people who mentioned vacations and clubs were more likely to have lower weight loss [38]. Furthermore, users in this subreddit mentioned that they gained weight after college graduation. Those users who had lower weight loss often mentioned supermarkets, payments or refunds for exercise programs, and employment [39,40]. We also found that users who experienced health issues related to otolaryngology tended to have less weight loss. This might be because the related treatment disturbed the weight loss plan. However, a study found that otorhinolaryngologic diseases themselves were associated with obesity [41].
Another interesting finding was that users who used the word "skyrocket" to describe their weight loss experience (eg, feeling of eating) were less likely to have significant weight loss. This suggested that controlling the diet extensively during this process might not be an effective, healthy strategy [42]. In addition, users who often mentioned maintenance (eg, maintaining the intake of calories) were less likely to lose weight as expected. After a close examination of the related posts, we found that some of these users struggled with weight loss activities. We also found that expense-related content was associated with lower weight loss. This could be explained by a recent finding that low socioeconomic status was associated with lower weight loss outcomes [43]. Finally, users who mentioned family members were less likely to lose more weight, suggesting that family members may not always have a positive impact on weight loss, as found by other studies [44].
It should be noted that after examining feature correlation, only 6 topic predictors were included in the regression model, suggesting that word semantic clusters can capture more detailed offline activities. It was interesting that we found a calorie-related topic (T7) associated with lower weight loss, which could be partially supported by a previous finding that reducing calorie intake alone may not help in weight loss [45].

Implications
We acknowledge that while some associations were statistically significant, the values of the coefficients were very low, indicating a weak correlation between the predictors and the dependent variable. However, we did not directly interpret predictor importance from the values of the coefficients. This is because it is practically meaningless to say that more weight can be reduced by increasing the distribution of a certain topic discussed in an online community [46]. Rather, we believe that it is the actual offline activities described in online discussions (or self-efficacy) that matter in weight loss. By using word clusters, we obtained more detailed, concrete offline activities that were often ignored by other social media-based studies but were significantly associated with the amount of weight loss. While karma scores (votes) from other users were associated with higher weight loss, considering the right-skewed distribution of karma scores (Table 1 and Multimedia Appendix 1), a majority of posts in this online community received very small karma scores.
These were somewhat aligned with the findings in an offline, SCT-based weight loss intervention program [47], where self-efficacy and intention, instead of online interaction, were found to be significant factors leading to weight loss. While self-efficacy performed well in weight loss interventions [48][49][50], there was evidence that self-efficacy may face the challenge of decreasing over time [51]. This is very interesting because it implied that interactions, either in online or offline environments, may serve as an indirect factor that affects weight loss through maintaining a participant's long-term self-efficacy.
From this perspective, future weight loss analyses or interventions should consider online interaction as a key factor to improve self-efficacy, instead of directly being linked to weight loss. Our study also implied that an aggressive weight loss plan may not work in the long run.

Limitations and Future Work
There are several limitations that we want to highlight here. First, our findings were based solely on the r/loseit subreddit, which constrained the generalizability of the findings. Future work may consider extending the research to other online platforms. Second, we did not incorporate gender into the analysis. It might be possible to first infer such information from online discussions [52] and then investigate how the association between posting content and weight loss changes after controlling for this information. Third, we relied on self-reported weight loss in this study, which limited our investigation; our findings were applicable to only the small fraction of Reddit users who disclosed weight change. It will be interesting to investigate the specific characteristics that are related to the majority of Reddit users who did not report such information. Furthermore, self-reported weight changes might not be accurate because Reddit users might not update their weights in a timely manner. Our study only investigated what was presented in the online discussions, instead of examining the real-world events. Finally, it would be interesting to investigate the extent to which online interaction (not merely responses and votes but their detailed categories) and offline activities that were recorded in online discussions can lead to weight loss change in a dynamic setting.

Conclusions
In this study, we analyzed online discussions regarding weight loss in r/loseit. We used topic modeling and the hierarchical clustering algorithm to extract topics and word clusters that were discussed in this subreddit. We used a regression analysis to determine the association between weight loss change and the factors that were conveyed in these online discussions. We found that the start weight and median karma scores were associated with higher weight loss. Users who had higher weight loss might be motivated by negative emotions experienced before starting weight loss. By contrast, users who mentioned vacations and payments were less likely to lose more weight. Furthermore, mentions of family members and employment were also found to be associated with lower weight loss. Our findings suggest that future interventions based on online social media platforms should focus on both online interaction and offline activities and that online personal health data can be effectively used to learn about users' health-related behaviors.