This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Mental health problems have become increasingly prevalent in the past decade. With the advance of Web 2.0 technologies, social media present a novel platform for Web users to form online health groups. Members of online health groups discuss health-related issues and mutually help one another by anonymously revealing their mental conditions, sharing personal experiences, exchanging health information, and providing suggestions and support. The conversations in online health groups contain valuable information to facilitate the understanding of their mutual help behaviors and their mental health problems.
We aimed to characterize the conversations in a major online health group for major depressive disorder (MDD) patients in a popular Chinese social media platform. In particular, we intended to explain how Web users discuss depression-related issues from the perspective of the social networks and linguistic patterns revealed by the members’ conversations.
Social network analysis and linguistic analysis were employed to characterize the social structure and linguistic patterns, respectively. Furthermore, we integrated both perspectives to exploit the hidden relations between them.
We found an intensive use of self-focus words and negative affect words. In general, group members used a higher proportion of negative affect words than positive affect words. The social network of the MDD group possessed small-world and scale-free properties, with a much higher reciprocity ratio and clustering coefficient value compared with the networks of other social media platforms and classic network models. We observed a number of interesting relationships, either strong correlations or convergent trends, between the topological properties and linguistic properties of the MDD group members.
(1) The MDD group members have the characteristics of self-preoccupation and negative thought content, according to Beck’s cognitive theory of depression; (2) the social structure of the MDD group is much stickier than those of other social media groups, indicating the tendency of mutual communications and efficient spread of information in the MDD group; and (3) the linguistic patterns of MDD members are associated with their topological positions in the social network.
Mental health problems, such as anxiety, bipolar disorder, and depression, have become increasingly prevalent in recent years. According to a recent report from the World Health Organization [
During the past decade, social media has played an increasingly important role in the promotion of mental health. It has been widely utilized by people to deal with health-related issues because of its publicity, broad reach, usability, and immediacy [
Many research works have used social media for the detection and monitoring of depression. In Ramirez-Esparza et al [
In addition to linguistic patterns, the topological properties of the social networks formed in social media also play an important role in the understanding of depression-related issues [
In this paper, we investigate online health groups for depression with data from Douban
In this research, we attempt to answer the following questions:
What are the unique language use patterns in the conversations of the MDD group?
What are the unique characteristics of the social networks formed by the conversations in the MDD group?
What are the relations between the language use patterns and the topological properties of members in the MDD group?
The group we studied was the MDD group on Douban [
In Douban interest groups, members use an anonymous user ID to communicate with one another. Once the group is created, any Douban user can start a discussion thread in this group. Each thread contains the title, the user ID of the creator (initiator), the created time, and the content of the message posted by the creator. MDD group members can join the discussion by posting messages in a certain thread. A message contains the user ID of the member, the post time, and the content of the message. A member can specify the message is a reply to the original post of the initiator or to an existing message posted by another member, thus forming a reply-to relationship.
We focused on the textual content of messages and the reply-to relationship between members in the MDD group. The messages conveyed valuable information about the linguistic patterns of the members. The reply-to relationship among the members represented their mutual conversations in the MDD group. We collected the full information of 3700 threads and 40,357 messages from 5050 members, covering the period from the founding date of August 26, 2008 to January 6, 2015.
Distribution of the number of participants in a thread.
To process Chinese content, we employed a popular open-sourced toolkit, jieba, to perform the Chinese segmentation task [
To analyze Chinese text, we used the Chinese version dictionary of Linguistic Inquiry Word Count (SC-LIWC) [
We constructed the social network formed by the conversations (indicated by the reply-to relationship of messages) of the members. A unique node in the network represented a unique user ID. An edge between two nodes represented the existence of a reply-to relationship between the two corresponding nodes. The direction of an edge is from the user who posted replies to the user who posted the original message. There could be multiple edges between two nodes because there could be multiple replies. The procedure of network construction was as follows:
When a Douban user joined the MDD group, either by initiating a new discussion thread or posting a message in an existing discussion thread, a new node representing the new member was generated and a new edge representing his or her behavior was constructed based on the following rules.
When a user
When a user
When a user
We named the constructed network the MDD network, which was a directed network with self-loops and multiple edges.
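The construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the record layout (`initiator`, `messages`, `author`, `reply_to`) is hypothetical, a thread initiation is assumed to be recorded as a self-loop on the initiator, and a message with no explicit target is assumed to reply to the original post.

```python
from collections import Counter


def build_reply_network(threads):
    """Return a Counter mapping (source, target) to the number of parallel edges.

    Edges point from the user who replied to the user who was replied to.
    Assumption: starting a thread is stored as a self-loop on the initiator.
    """
    edge_counts = Counter()
    for thread in threads:
        initiator = thread["initiator"]
        edge_counts[(initiator, initiator)] += 1  # assumed self-loop rule
        for message in thread["messages"]:
            # Assumption: no explicit target means a reply to the original post.
            target = message.get("reply_to") or initiator
            edge_counts[(message["author"], target)] += 1
    return edge_counts


def degrees(edge_counts):
    """Compute in-degree and out-degree, counting multi-edges, ignoring self-loops."""
    in_degree, out_degree = Counter(), Counter()
    for (source, target), count in edge_counts.items():
        if source == target:
            continue
        out_degree[source] += count
        in_degree[target] += count
    return in_degree, out_degree


# Hypothetical toy thread: u1 starts it, u2 replies to the post,
# then u1 and u3 both reply to u2.
sample_threads = [
    {
        "initiator": "u1",
        "messages": [
            {"author": "u2", "reply_to": None},
            {"author": "u1", "reply_to": "u2"},
            {"author": "u3", "reply_to": "u2"},
        ],
    }
]
network = build_reply_network(sample_threads)
in_degree, out_degree = degrees(network)
```

Storing edge multiplicities in a `Counter` keeps the multigraph structure (repeated replies between the same pair) without requiring a graph library.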
We adopted a set of well-established network metrics to analyze the MDD network, including degree centrality (including both in-degree and out-degree), average shortest path length, betweenness centrality, and clustering coefficient. For detailed explanations of these metrics, please refer to Newman [
In addition, the Bow-Tie model was used to examine the general structure of the MDD network in more detail. In the Bow-Tie model [
To have an in-depth understanding of the formation of the MDD group, we compared the network properties of the MDD network with other online health groups, social media communities, and the Web. We also employed the classic Erdös-Rényi random network model [
We collected the full information of the MDD group with 2,281,678 words written by 5013 group members in 3565 threads. There were a total of 5050 users in the MDD group, but 37 users deleted their Douban accounts making the content of their messages inaccessible; therefore, we have the information about their social networking behaviors without the actual text content of their posts. So we analyzed the linguistic patterns of 5013 users instead of 5050 users in this section. In all, 74.82% (1,707,151/2,281,678) of the words in our dataset were tagged into one or more categories defined by the SC-LIWC dictionary. Compared to the tag rate of other studies using LIWC/SC-LIWC [
The distributions of the word count and the seven main categories are shown in
Previous psychological studies observed that style words made up approximately 55% of all the words people speak, hear, and read [
According to Beck’s cognitive theory of depression, self-preoccupation and negative thought content are the characteristics of depression [
We first calculated the frequency of individual words and drew the word clouds for better visualization.
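As a rough illustration of word frequency counting and of a per-member category occurrence rate, the following sketch assumes that messages have already been segmented into tokens. The category word lists are hypothetical placeholders, not entries from the SC-LIWC dictionary, and the function names are ours.

```python
from collections import Counter

# Hypothetical stand-in for a LIWC-style dictionary: category -> set of words.
CATEGORY_WORDS = {
    "negative_affect": {"sad", "tired", "afraid"},
    "first_person": {"i", "we"},
}


def word_frequencies(token_lists):
    """Count how often each token occurs across all segmented messages."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts


def occurrence_rates(tokens, category_words=CATEGORY_WORDS):
    """Share of a member's tokens that fall into each category."""
    total = len(tokens)
    if total == 0:
        return {category: 0.0 for category in category_words}
    return {
        category: sum(token in words for token in tokens) / total
        for category, words in category_words.items()
    }


# Hypothetical pre-segmented messages from one member.
member_messages = [["i", "feel", "sad", "today"], ["i", "am", "tired"]]
frequencies = word_frequencies(member_messages)
all_tokens = [token for tokens in member_messages for token in tokens]
rates = occurrence_rates(all_tokens)
```

The frequency table is what a word cloud would be drawn from; the occurrence rates correspond to the per-member proportions discussed for the word categories.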
The affective processes category has negative and positive subcategories. Negative words can be further divided into anxious, angry, and sad words. In the MDD group, there were 11.87% (595/5013) of users who used negative words but not positive words; 11.03% (553/5013) of users used positive words but not negative words. In all, 58.51% (2933/5013) of users used both positive and negative words in their messages and 18.59% (932/5013) used neither positive nor negative words, as shown in
The pronoun category had five subcategories, in which “I” and “we” were combined into first-person pronouns, “you” belonged to second-person pronouns, and “she,” “he,” and “they” belonged to third-person pronouns. Members who used only first-person pronouns or only second-person pronouns were not differentiated because both proportions were small (<5%). As with the affective processes category, we analyzed the occurrence rates of pronoun words for members who used all three types of pronouns and the results are shown in
To summarize, more intensive use of negative and first-person pronoun words verified the characteristics of self-preoccupation and the negative content focus, which indicated that the users of the MDD group possessed the two characteristics of depression depicted by Beck’s cognitive theory. In addition, the MDD group members also revealed additional linguistic signals of depression (eg, the occurrence rates of the seven main categories), which could be further used in the surveillance and detection of depression in public health. These findings were also verified by aggregating the messages on individual threads (refer to
Distributions of word count and the seven main categories of words.
Pie charts of (a) the mean occurrence rates of the seven main word categories and (b) the proportions of members (N=5013) using negative and positive words.
Word clouds of the content of the conversations in the MDD group in (a) translated English and (b) Chinese.
Box plot of the occurrence rate of two categorical words in the MDD group in the (a) affective processes and (b) pronoun categories. The horizontal line within each box represents the median. The top and bottom borders of the box are the 75th and 25th percentiles, respectively. The whiskers above and below the box mark the 90th and 10th percentiles. The points beyond the whiskers are outliers beyond the 90th percentile.
In the MDD network, there were 5050 nodes (representing individual user IDs) and 36,657 edges (representing the reply-to relations of the messages posted by two corresponding user IDs). It is worth noting that although 37 members deleted their accounts, we still had access to the information on their conversations with others, except for the content of their messages. We excluded self-loops; the network is visualized in
A node has both in-degree and out-degree. The distributions of in-degree and out-degree are shown in
Comparison of the MDD network, Myspace (in [
Network metric | MDD | Myspace | Erdös-Rényi model | Barabási-Albert model | MDD-friend | MedHelp | Erdös-Rényi -friend model | Watts-Strogatz model |
Type | Directed | Directed | Directed | Directed | Undirected | Undirected | Undirected | Undirected |
Nodes | 5050 | 36,459 | 5050 | 5050 | 5050 | 30,915 | 5050 | 5050 |
Edges | 36,657 | 80,675 | 36,657 | 36,657 | 17,401 | 113,273 | 17,401 | 17,401 |
Connected components | 162 | 1 | 1 | 1 | 162 | 2 | 5 | 2 |
Nodes in the largest weakly connected component | 4881 | 36,459 | 5050 | 5050 | 4881 | 30,870 | 5046 | 5049 |
Network density (10^-5 scale) | 143.8 | 6.07 | 143.8 | 138.6 | 136.5 | 23.7 | 136.5 | 136.5 |
Network diameter | 10 | 11 | 8 | 11 | 11 | — | 8 | 8 |
Network reciprocity (10^-2 scale) | 34.0 | 1.45 | 0.08 | 0.03 | — | — | — | — |
Clustering coefficient (10^-2 scale) | 4.47 | 0.031 | 0.29 | 0.64 | 4.47 | 3.1 | 0.13 | 1.19 |
Average shortest path length | 4.11 | 5.14 | 4.53 | 2.83 | 3.80 | 3.81 | 4.64 | 4.69 |
Power law exponent (in/out) | 2.13/2.20 | 2.65/1.99 | — | 2.15/3.01 | 2.29 | 2.12 | — | — |
Min degree (in/out) | 0/0 | 1/0 | 5/5 | 0/4 | 0 | — | 0 | 3 |
Max degree (in/out) | 451/942 | 558/6077 | 19/20 | 6053/7 | 464 | — | 19 | 27 |
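The density row of the comparison table can be checked from the node and edge counts alone. The sketch below assumes the standard definitions of density for simple graphs (edges divided by the number of possible ordered or unordered node pairs, self-loops excluded), which appear to reproduce the tabulated values; the function names are ours.

```python
def directed_density(nodes, edges):
    """Edges divided by the number of ordered node pairs, n * (n - 1)."""
    return edges / (nodes * (nodes - 1))


def undirected_density(nodes, edges):
    """Edges divided by the number of unordered node pairs, n * (n - 1) / 2."""
    return edges / (nodes * (nodes - 1) / 2)


# Node and edge counts taken from the comparison table.
mdd = directed_density(5050, 36657)
mdd_friend = undirected_density(5050, 17401)
myspace = directed_density(36459, 80675)

print(f"MDD: {mdd * 1e5:.1f} x 10^-5")                # 143.8
print(f"MDD-friend: {mdd_friend * 1e5:.1f} x 10^-5")  # 136.5
print(f"Myspace: {myspace * 1e5:.2f} x 10^-5")        # 6.07
```

Although the MDD network has far fewer nodes than the Myspace network, its density is more than 20 times higher, which is consistent with the description of the MDD group as a tightly connected community.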
In the MDD network, 36.03% (1820/5050) of nodes had zero in-degree, indicating that no other member replied to their messages in the MDD group. On the other hand, the in-degree could be as large as 541. For the out-degree, 50% of members had a value of either 1 or 2, whereas the largest was 991. We examined the node with the largest in-degree and the node with the largest out-degree and found that they were the same node, indicating that this user was the most active member in the group from both perspectives. This member was also one of the administrators invited by the group initiator. This finding suggests that additional rewarding and ranking mechanisms (eg, coadministrators or facilitators) would be useful for improving communications in the design of online health groups. In addition, the users with high in-degrees also had high out-degrees, indicating that the users had mixed behaviors. This will be further examined in the next section.
To better understand the unique features of the conversations in the MDD group, we compared the topological properties of the MDD network with the conversation-based network of Myspace [
In addition, the reciprocity of the MDD network was approximately 30 times higher than that in Myspace. That means if a member
Comparison of Bow-Tie model analysis between the MDD network and other networks.
Networks | MDD | Myspace | Java forum | Web | Erdös-Rényi model | Barabási-Albert model |
SCC | 54.53% | 1.17% | 12.30% | 27.70% | 99.91% | 0.10% |
IN | 29.27% | 0% | 54.90% | 21.20% | 0.04% | 98.10% |
OUT | 7.80% | 81.50% | 13.00% | 21.20% | 0.04% | 0% |
TENDRILS | 4.22% | 0.027% | 17.50% | 21.50% | 0% | 0.61% |
TUBES | 0.04% | 0% | 0.40% | 0.40% | 0% | 0% |
DISC | 4.22% | 17.30% | 1.90% | 8.00% | 0.01% | 1.21% |
The Bow-Tie model was used to examine the general structure of the network and its reciprocity in more detail. As shown in
Note that the definition of the edge direction in the Java forum was the opposite of ours. Therefore, the IN (OUT) component in the Java forum corresponds to OUT (IN) by our definition. Members belonging to the IN component behave like members who only answer questions in the Java forum: these members posted messages to others but never received any response. Both the Java forum and the MDD network had a large IN component (29.27% for the MDD network), and the proportion of the IN component was almost four times larger than that of the OUT component (7.80%). This also revealed an imbalanced tendency: many members in the MDD group expressed their points without receiving any responses. To alleviate this imbalance between the IN and OUT components, special encouragement functions could be offered to users with imbalanced communications when designing or organizing health forums to improve online communications.
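To make the Bow-Tie components concrete, the following sketch splits a small directed graph into the largest strongly connected component (SCC), IN, and OUT, and lumps everything else (tendrils, tubes, and disconnected nodes) into one remainder set. It is a simplified illustration with hypothetical node names, not the procedure used in the study, and it finds the SCC by intersecting forward and backward reachability, which is simple but not efficient for large networks.

```python
from collections import defaultdict, deque


def reachable(start_nodes, adjacency):
    """Breadth-first search: all nodes reachable from start_nodes (inclusive)."""
    seen = set(start_nodes)
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def bow_tie(edges):
    """Split nodes into SCC, IN, OUT, and OTHER (tendrils, tubes, disconnected)."""
    forward, backward, nodes = defaultdict(set), defaultdict(set), set()
    for source, target in edges:
        nodes.update((source, target))
        if source != target:  # self-loops do not affect reachability
            forward[source].add(target)
            backward[target].add(source)

    # A node's SCC is the set of nodes it can reach that can also reach it.
    largest_scc, assigned = set(), set()
    for node in sorted(nodes):
        if node in assigned:
            continue
        component = reachable([node], forward) & reachable([node], backward)
        assigned |= component
        if len(component) > len(largest_scc):
            largest_scc = component

    in_part = reachable(largest_scc, backward) - largest_scc   # can reach the SCC
    out_part = reachable(largest_scc, forward) - largest_scc   # reached from the SCC
    other = nodes - largest_scc - in_part - out_part
    return {"SCC": largest_scc, "IN": in_part, "OUT": out_part, "OTHER": other}


# Hypothetical toy graph; an edge (x, y) means "x replied to y".
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a"), ("c", "e"), ("f", "g")]
parts = bow_tie(toy_edges)
```

In the toy graph, `d` replies to a core member but never receives a reply, so it falls into IN, which matches the interpretation of the IN component given above.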
Visualization of the MDD network. The size of the node is proportional to the in-degree of the node and the darkness of the edge represents the edge betweenness centrality value.
In-degree and out-degree distributions of the MDD network.
The position of a member in a social network represented his or her role in this social group. The roles of members were potentially associated with the language use of the members. A key question that remained unanswered was “What is the relationship between the language use and the topological positions in the network?”
To answer this question, we integrated the linguistic properties and the topological properties observed from the previous two sections. We chose two sets of representative linguistic and topological properties: (1) the word count, the seven categories, and positive, negative, and pronoun words (first-, second-, and third-person); and (2) degree, including in-degree and out-degree, the number of self-loops representing the initiated threads, average shortest path length, betweenness centrality, and clustering coefficient. Spearman correlations were run to assess the relationship between the two sets of properties using the 5013 members in the MDD group; the results are shown in
In terms of the relations between topological and linguistic properties of the same sample size, we found that the correlations were moderately strong (coefficient >.3 or <-.3 [
There was a strong correlation (ρ=.70) between the in-degree and the out-degree of a member. This indicated that if a member posted more messages to other members, this user had a higher chance of receiving more replies from others and vice versa. In addition, there was also a strong correlation between the number of threads a user created and the in-degree and out-degree of the member (ρ=.74 and ρ=.40, respectively), which was expected from the definition of the MDD network.
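For readers unfamiliar with the statistic, Spearman's ρ is the Pearson correlation computed on ranks, with tied values sharing their average rank. The sketch below is a plain-Python illustration with made-up degree values; it is not the software used in the study.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda index: values[index])
    ranks = [0.0] * len(values)
    position = 0
    while position < len(order):
        end = position
        while end + 1 < len(order) and values[order[end + 1]] == values[order[position]]:
            end += 1
        shared_rank = (position + end) / 2 + 1
        for k in range(position, end + 1):
            ranks[order[k]] = shared_rank
        position = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / (variance_x * variance_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical in-degree and out-degree values for six members.
in_degrees = [0, 1, 1, 4, 10, 30]
out_degrees = [1, 2, 1, 6, 9, 50]
rho = spearman(in_degrees, out_degrees)
```

Because the statistic depends only on ranks, it suits heavy-tailed quantities such as degree, where a few highly active members would otherwise dominate a Pearson correlation.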
We observed monotonic relations between a set of linguistic properties (word count, second-person pronouns, third-person pronouns) and the topological properties, and also between the topological properties (in-degree, out-degree, the number of threads a user created, average shortest path length, betweenness centrality, and clustering coefficient). Refer to the upper-left parts in
Conversely, we did not find clear monotonic relations between other pairs of linguistic (including the seven main categories
Spearman correlations of the topological (1-6) and linguistic (7-19) properties.
Scatterplot of the relationship between topological (in-degree, out-degree, the number of threads [NumOfThreads]) and linguistic properties (word count [WC] and 7 main categories) (part 1).
Scatterplot of the relationship between topological (average shortest path length [AsP], betweenness centrality [BwC], clustering coefficient [CuC]) and linguistic properties (word count [WC] and 7 main categories) (part 2).
Scatterplot of the relationship between topological (in-degree, out-degree, average shortest path length [AsP], betweenness centrality [BwC]) and linguistic properties (word count [WC], detailed positive and negative affective and pronoun words) (part 3).
In this paper, we characterize both the language use and the network properties of a popular online health group for MDD in China. For language use, we aggregate messages by member and verify the characteristics of self-preoccupation and negative focus of depressed individuals revealed in previous psychological studies and in other social media platforms. For network properties, the MDD network differs from other social networks in its highly sticky structure, imbalanced in-degree and out-degree, and high reciprocity. By integrating these two types of properties, we find a set of interesting correlations and convergent relations between the linguistic and the topological properties.
This work sheds light on how Web users communicate with one another in MDD online health groups. The analysis of language use helps in understanding the expression of depression on a large scale. The results provide important insights for depression surveillance in public health. Our findings help explain the dissemination of depression-related information in a highly mutually connected community devoted to depression (the MDD group). The social network analysis presents novel and efficient information spread patterns of the MDD group that can be further adapted by health care providers to develop better and more effective functions to facilitate online communications in the design of Health 2.0 applications.
There are also a number of limitations and questions that need further investigation:
How to identify the topics of the discussions in MDD group and other online health communities? We plan to adapt state-of-the-art text-mining methods into the linguistic analysis with LIWC in our future work to address this issue.
How to propose new network models to describe and replicate the unique topological properties of the MDD group? We are now developing a new generative network model based on the basic Barabási-Albert model with a focus on being able to control the value of the reciprocity of the network to replicate the higher tendency of mutual communications in the MDD group. We will also develop models of multiple (interdependent) networks [
Do these unique topological properties only exist in the MDD group or in other online health groups as well? We are collecting data from online health groups for different mental health problems and other types of diseases on different social media platforms. Empirical studies with the new data will be our future work.
This supplementary file contains content, figures, and tables that support the conclusion of the paper but are too redundant to be included in the main manuscript. There are two figures in this document (in the text). The high resolution versions of the supplementary figures were uploaded to the system as well.
LIWC: Linguistic Inquiry Word Count
MDD: major depressive disorder
SC-LIWC: Simplified Chinese LIWC
This research was supported by The National Natural Science Foundation of China (NSFC) Grant No. 71402157, CityU Start-up Grant No 7200399, the Natural Science Foundation of Guangdong Province, China (2014A030313753), and The Theme-Based Research Scheme of the Research Grants Council of Hong Kong Grant No. T32-102/14N.
None declared.