This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Mental illness affects a significant portion of the worldwide population. Online mental health forums can provide a supportive environment for those afflicted and also generate a large amount of data that can be mined to predict mental health states using machine learning methods.
This study aimed to benchmark multiple methods of text feature representation for social media posts and compare their downstream use with automated machine learning (AutoML) tools. We tested on datasets that contain posts labeled for perceived suicide risk or moderator attention in the context of self-harm. Specifically, we assessed the ability of the methods to prioritize posts that a moderator would identify for immediate response.
We used 1588 labeled posts from the Computational Linguistics and Clinical Psychology (CLPsych) 2017 shared task collected from the Reachout.com forum. Posts were represented using lexicon-based tools, including Valence Aware Dictionary and sEntiment Reasoner, Empath, and Linguistic Inquiry and Word Count, and also using pretrained artificial neural network models, including DeepMoji, Universal Sentence Encoder, and Generative Pretrained Transformer-1 (GPT-1). We used Tree-based Optimization Tool and Auto-Sklearn as AutoML tools to generate classifiers to triage the posts.
The top-performing system used features derived from the GPT-1 model, which was fine-tuned on over 150,000 unlabeled posts from Reachout.com. Our top system had a macroaveraged F1 score of 0.572, providing a new state-of-the-art result on the CLPsych 2017 task. This was achieved without additional information from metadata or preceding posts. Error analyses revealed that this top system often misses expressions of hopelessness. In addition, we have presented visualizations that aid in the understanding of the learned classifiers.
In this study, we found that transfer learning is an effective strategy for predicting risk with relatively little labeled data and noted that fine-tuning of pretrained language models provides further gains when large amounts of unlabeled text are available.
Mental health disorders are highly prevalent, with epidemiological studies reporting roughly half the population in the United States meeting the criteria for one or more mental disorders in their lifetime and roughly a quarter meeting the criteria in a given year [
Mental disorders are among the strongest predictors for nonsuicidal self-injury and suicidal behaviors; however, little is known about how people transition from suicidal thoughts to attempts [
Peer support forums can be a useful and scalable approach to social therapy for mental health issues [
Previous research suggests that the language of individuals with mental health conditions is characterized by distinct features [
Such automated systems typically start with a feature extraction step that converts the variable length input text into fixed-length numeric vectors (features). This step is required to apply machine learning classifiers that operate on such vectors. An example is the bag-of-words representation, where each numeric feature represents the count of a specific word that is selected based on frequency or from a lexicon. With such a representation, a classifier may learn that mentions of
Recently, word embeddings have been shown to provide rich representations where words from the same context of a corpus tend to occupy a similar feature space [
Reachout.com is an Australian youth-based mental health peer support forum. It is targeted at those aged 14 to 25 years, and the community is maintained by staff and trained volunteer moderators. Staff and moderators monitor the forums and respond, as required, with empathy, support, and referrals to relevant information and available services. The organizers of the 2017 Computational Linguistics and Clinical Psychology (CLPsych) shared task provided a corpus of posts from Reachout.com to assess the ability of automated methods to triage forum posts based on the urgency of moderator response [
In this paper, we benchmarked multiple feature extraction methods on forum posts from Reachout.com by evaluating their ability to predict the urgency of moderator response. Furthermore, we explored the interpretability through emoji representations and by visualizing word importance in text that mimics themes from suicide notes. We have shown that modern transfer learning approaches that take advantage of large corpora of unlabeled text, in combination with automated machine learning (AutoML) tools, improve performance.
Our primary data source was made available for the 2017 CLPsych shared task and was collected from the Australian mental health peer support forum, Reachout.com [
To test the generalizability of the system developed on the Reachout.com data, we used a subset of the data made available from the University of Maryland (UMD) Reddit Suicidality Dataset [
Of the subset with labels by expert annotators, we then selected only data from users who had posted once in /r/SuicideWatch to minimize ambiguity in understanding which of their posts was the cause of the associated label. Predictions were made only on posts from /r/SuicideWatch. In total, there were 179 user posts across the categories (
To better gauge our performance on the UMD Reddit Suicidality Dataset posts, we calculated an empirical distribution of random baselines for the macro-F1 metric. This baseline distribution quantifies the performance of random shuffles of the true labels (including the class
We used 10 composite quotes to share example predictions of our system on text that could be predictive/indicative of self-harming and/or suicidality. These composite quotes were created by Furqan et al [
Features were extracted from only the text body of the posts. For all posts, any quotes from previous posts or links to images were removed.
We extracted features using lexicon-based tools such as Valence Aware Dictionary and sEntiment Reasoner (VADER; 4 features) [
With Empath and LIWC, sentence splitting was not performed. With the remaining feature encoding methods (VADER, DeepMoji, Universal Sentence Encoder, and both GPT models), we first preprocessed the text body of each post into sentences using the sentence boundary detection from spaCy version 2.1. Sentence feature vectors were aggregated to the post level by taking their mean, maximum, and minimum for each extracted feature.
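The aggregation step can be sketched as below. The regex-based `split_sentences` is a naive stand-in for spaCy's sentence boundary detection, used here only to keep the example self-contained:

```python
import re
import numpy as np

def split_sentences(text):
    """Naive stand-in for spaCy's sentence boundary detection."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def aggregate_post(sentence_vectors):
    """Collapse per-sentence feature vectors into one post-level vector
    by concatenating the element-wise mean, maximum, and minimum."""
    m = np.asarray(sentence_vectors, dtype=float)  # (n_sentences, n_features)
    return np.concatenate([m.mean(axis=0), m.max(axis=0), m.min(axis=0)])
```

This mean/max/min concatenation triples each encoder's dimensionality, which is why the post-level feature counts reported later are 3 times the underlying sentence embedding sizes.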
To train classifiers on the various feature sets, we used 2 AutoML methods that are built upon scikit-learn [
We used the Tree-based Optimization Tool (TPOT) [
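Both AutoML tools search over combinations of preprocessing steps and classifiers, scoring candidates by cross-validation. The following toy stand-in, written against plain scikit-learn rather than the TPOT or Auto-Sklearn APIs, illustrates the core idea over a deliberately tiny search space:

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def search_pipelines(X, y, scoring="f1_macro", cv=10):
    """Score a small space of candidate pipelines by cross-validation
    and return the best one refit on all data. Real AutoML tools search
    far larger spaces with genetic programming (TPOT) or Bayesian
    optimization plus ensembling (Auto-Sklearn)."""
    candidates = [
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        make_pipeline(StandardScaler(), PCA(n_components=10),
                      LogisticRegression(max_iter=1000)),
        make_pipeline(RandomForestClassifier(n_estimators=100, random_state=0)),
    ]
    scored = [(cross_val_score(p, X, y, scoring=scoring, cv=cv).mean(), p)
              for p in candidates]
    best_score, best_pipe = max(scored, key=lambda t: t[0])
    return best_pipe.fit(X, y), best_score
```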
To compute the matrix of pairwise Euclidean distances between posts for each set of features, we used SciPy’s distance matrix function [
To better understand the distribution of the 64 emoji features represented across the labeled posts, we aggregated the mean of an emoji feature across sentences in a post. Each of these aggregate features was then normalized to be between 0 and 1 to better compare features against each other. To obtain a measure of feature importance, we permuted each feature column and assessed the decrease in classification performance on the macro-F1 metric while using the best-performing pipeline derived from TPOT. For each emoji feature, we performed this procedure 10,000 times. Images of the emojis were obtained from EmojiOne (currently JoyPixels Inc) and converted to grayscale.
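The permutation procedure can be sketched as follows; `predict` stands in for the best-performing TPOT pipeline, and `metric` for the macro-F1 scorer:

```python
import numpy as np

def permutation_importance(predict, X, y, metric, n_repeats=10, seed=0):
    """Mean drop in a score (eg, macro-F1) when each feature column is
    independently shuffled, estimated over n_repeats permutations."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, predict(X))
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # shuffle one column
            drops[j, r] = baseline - metric(y, predict(Xp))
    return drops.mean(axis=1)
```

Features whose shuffling produces the largest average drop are the ones the classifier depends on most; columns the model ignores produce a drop of zero.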
The CLPsych 2017 and UMD Reddit Suicidality datasets are available upon request from the original sources [
To benchmark the performance of various text-derived features for the automated classification of online forum posts, we ran both TPOT and Auto-Sklearn on the features generated from the post bodies. In
Benchmarking by features, automated machine learning methods, and datasets with the macro-F1 metric.
Feature set | Feature count | TPOT: train (10-fold, 5 times) | TPOT: test | TPOT: Reddit validation | Auto-Sklearn: train (10-fold, 5 times) | Auto-Sklearn: test | Auto-Sklearn: Reddit validation
Empath (post) | 195 | 0.280 | 0.253 | 0.385a | 0.292 | 0.344 | 0.321 |
Linguistic Inquiry and Word Count | 70 | 0.434 | 0.354 | 0.346a | 0.433 | 0.380 | 0.315 |
Valence Aware Dictionary and sEntiment Reasoner (sentence) | 12 | 0.363 | 0.263 | 0.356a | 0.340 | 0.263 | 0.353a |
Emoji 64 | 192 | 0.425 | 0.369 | 0.280 | 0.424 | 0.461 | 0.308 |
DeepMoji | 6912 | 0.442 | 0.452 | 0.345a | 0.391 | 0.437 | 0.351a |
Universal Sentence Encoder | 1536 | 0.457 | 0.446 | 0.300 | 0.484 | 0.479 | 0.236 |
GPTb default | 2304 | 0.373 | 0.334 | 0.344a | 0.396 | 0.383 | 0.402a |
GPT fine-tuned | 2304 | 0.510 | 0.559 | 0.320 | 0.492 | 0.572 | 0.324 |
aReddit validation performance better than chance.
bGPT: Generative Pretrained Transformer.
Confusion matrices for 2 models trained with Auto-Sklearn. Each cell gives the count of posts with the predicted label indicated by its row and the true label indicated by its column. Counts are colored from the highest cell (blue) to the lowest (white). The top-left to bottom-right diagonal cells count correctly predicted posts. Panel A was trained with Valence Aware Dictionary and sEntiment Reasoner (VADER) features; panel B with features from a fine-tuned Generative Pretrained Transformer (GPT) language model.
We noted that the average macro-F1 obtained during training was a fairly reliable predictor of the score on the held-out test set. Auto-Sklearn performed better on average than TPOT (mean test macro-F1 of 0.414 versus 0.379, respectively). We also observed that features extracted from pretrained neural models performed better in general (average Auto-Sklearn test macro-F1 of 0.466, versus 0.329 for the lexicon-based feature sets). However, the features extracted from the default GPT model (without any additional fine-tuning) were the worst performing of those obtained from neural models, whereas the GPT model that was fine-tuned on the unlabeled posts performed best across all experiments. The Universal Sentence Encoder and fine-tuned GPT features exceeded the highest macro-F1 score reached in the 2017 CLPsych shared task when a classifier was learned with Auto-Sklearn (0.467; submission by Xianyi Xia and Dexi Liu). Upon inspection, the Auto-Sklearn–generated classifier for the GPT fine-tuned features was a complex ensemble of pipelines with multiple preprocessing steps and random forest classifiers. The TPOT-generated classification pipeline first selects features using the analysis of variance
To better understand the low Reddit validation scores, we calculated a random baseline. Although random, this baseline uses information about the class distributions. We marked Reddit validation performance as better than chance in
To better assess the variability of our best-performing system (Auto-Sklearn trained with features generated from the fine-tuned GPT model), we reran the Auto-Sklearn training and testing process 20 times. For each run, Auto-Sklearn was allotted 24 hours of compute time. Across those 20 systems, the average macro-F1 score on the held-out test set was 0.5293 (SD 0.0348). The best- and worst-performing of the 20 systems had final test scores of 0.6156 and 0.4594, respectively. Importantly, despite this variability and the reduced compute time, the average macro-F1 score of these classifiers was still higher than the scores obtained with the other feature sets.
To determine the impact of the amount of data used for fine-tuning the GPT model on its effectiveness for feature extraction in the classification task, we fine-tuned models with increasing amounts of unlabeled posts before extracting post-level features to train a classifier (
A graph of macro-F1 test scores versus the number of posts used for Generative Pretrained Transformer-1 fine-tuning. Auto-Sklearn methods are marked with continuous red (Auto-Sklearn) and dashed blue (Tree-based Optimization Tool, TPOT) lines.
To compare the different representations or embeddings of the post contents, we used the Mantel test (
Mantel correlations between the extracted feature sets.
Feature Set | VADERa | Empath | LIWCb | Universal Sentence | Emoji 64 | DeepMoji | GPTc default | GPT fine-tuned |
VADER | 1.000 | 0.003 | 0.098 | 0.453 | 0.211 | 0.422 | 0.430 | 0.429 |
Empath | 0.003 | 1.000 | 0.009 | 0.006 | −0.005 | −0.008 | 0.004 | 0.001 |
LIWC | 0.098 | 0.009 | 1.000 | 0.148 | 0.403 | 0.507 | 0.267 | 0.253 |
Universal Sentence | 0.453 | 0.006 | 0.148 | 1.000 | 0.193 | 0.509 | 0.823 | 0.823 |
Emoji 64 | 0.211 | −0.005 | 0.403 | 0.193 | 1.000 | 0.523 | 0.302 | 0.335 |
DeepMoji | 0.422 | −0.008 | 0.507 | 0.509 | 0.523 | 1.000 | 0.632 | 0.631 |
GPT default | 0.430 | 0.004 | 0.267 | 0.823 | 0.302 | 0.632 | 1.000 | 0.799 |
GPT fine-tuned | 0.429 | 0.001 | 0.253 | 0.823 | 0.335 | 0.631 | 0.799 | 1.000 |
aVADER: Valence Aware Dictionary and sEntiment Reasoner.
bLIWC: Linguistic Inquiry and Word Count.
cGPT: Generative Pretrained Transformer.
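The Mantel comparison underlying the table above can be sketched as below. This is a simplified permutation version under the assumption of Euclidean distances between post-level feature vectors; the exact implementation details of the published analysis may differ:

```python
import numpy as np
from scipy.spatial import distance_matrix

def mantel(feats_a, feats_b, n_perm=999, seed=0):
    """Correlate the pairwise Euclidean distance matrices of two feature
    sets; the p-value comes from permuting post order in one matrix."""
    da = distance_matrix(feats_a, feats_a)
    db = distance_matrix(feats_b, feats_b)
    iu = np.triu_indices_from(da, k=1)  # upper triangle, no diagonal
    r = np.corrcoef(da[iu], db[iu])[0, 1]
    rng = np.random.default_rng(seed)
    n = da.shape[0]
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(n)
        hits += np.corrcoef(da[np.ix_(p, p)][iu], db[iu])[0, 1] >= r
    return r, (hits + 1) / (n_perm + 1)
```

A high Mantel correlation (eg, the 0.823 between the Universal Sentence Encoder and GPT features) indicates that two representations induce very similar neighborhood structures over the posts, even if their raw dimensions differ.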
In
To better understand judgments made by our trained classifier, we present predictions in
All words were iteratively masked to indicate their effects on the predicted class (see Methods section). In
For 2 of the quotes predicted as red, no words were highlighted, suggesting that, in these instances, many words were key to the prediction. Overall, the quotes would all be flagged as requiring some level of moderator attention, and for the most part, the nature of words that were important in classifying the severity of quotes made conceptual sense.
Violin plot showing the distributions of the 10 most discriminative emoji features across labeled classes. The classes are colored according to label, with crisis in gray. The y-axis shows the predicted scores for each emoji, scaled to the 0-1 interval. The emojis along the x-axis are marked with their images and their official Unicode text labels and are ranked from the most to least important feature (left to right).
Predictions and highlights of suicide-related composite quotes from Furqan and colleagues. Words that changed predictions are color coded. Replacing a yellow or red word with an unknown word shifts the prediction to a less severe class by 1 or 2 levels, respectively (ie, replacing a yellow word in text classified as crisis would change the prediction to red, while replacing a red word would change it to amber). In contrast, replacement of green words results in a more severe prediction.
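The masking procedure behind these highlights can be sketched as follows, assuming a classifier that maps text to an ordinal severity level (eg, 0 = green through 3 = crisis); both the classifier and the unknown-token placeholder here are illustrative:

```python
def word_effects(classify, text, unk="<unk>"):
    """Occlusion analysis: replace one word at a time with an unknown
    token and record how the predicted severity level shifts."""
    words = text.split()
    base = classify(text)
    effects = []
    for i in range(len(words)):
        masked = " ".join(words[:i] + [unk] + words[i + 1:])
        effects.append((words[i], classify(masked) - base))
    return base, effects
```

Words whose removal lowers the predicted severity are the ones driving the flag; a delta of zero means the word is interchangeable with an unknown token for this prediction.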
We have shown that there are highly informative signals in the text body alone of posts from the Reachout.com forum. More specifically, we identified a transfer learning approach as particularly useful for extracting features from raw social media text. In combination with the training of classifiers using AutoML methods, we showed that these representations of the post content can improve triage performance without considering the context or metadata of the posts. These methods take advantage of the large amount of unlabeled free text that is often available to diminish the need for labeled examples. We also showed that these methods can generalize to new users on a support forum, for which there would not be preceding posts to provide context on their mental states. By combining the pretrained language models with AutoML, we were able to achieve state-of-the-art macro-F1 on the CLPsych 2017 shared task. Our content-only approach could be complemented by previous work, which used hand-engineered features to account for contextual information, such as a user’s post history or the thread context of posts [
Our current approach follows methods outlined by Radford et al [
We compared the use of AutoML tools, such as Auto-Sklearn and TPOT, to generate classification pipelines with a variety of features extracted from free text. We also identified them as sources of variability in the final scores of our system. When developing our top-performing systems with features extracted from a fine-tuned GPT and using Auto-Sklearn on 20 trials, we obtained macroaverage F1 scores ranging from 0.4594 to 0.6156. In part, this is because of the small size of the dataset and the weighted focus of the macroaverage F1 metric toward the
There are a variety of limitations to the approaches we benchmarked, depending on how they are deployed. Further experiments would be needed to determine whether Reachout.com moderator responsiveness improves when more accurate classifiers are used. The present system's performance cannot be extrapolated far into the future because of changes in the population of users on the forum, shifts in the topics discussed, or variations in the language used. Furthermore, any implemented system would require ongoing performance monitoring.
To further understand how our trained models would perform in a new context, we assessed performance on an independently collected dataset and composite quotes that were derived from suicide notes. All composite quotes were flagged as requiring moderator attention. Our classifiers generalize to some degree on the UMD Reddit Suicidality Dataset, which approximates the task outlined for Reachout.com. We noted that the Reddit user base is not specific to Australia, is not targeted explicitly to youth, and may have substantially different topics of discussion than Reachout.com. This performance is primarily driven by good accuracy on the
We manually reviewed the errors made by the best-performing system (Auto-Sklearn classifier with the GPT fine-tuned features). The most worrisome prediction errors occur when the classifier mistakes a crisis post for one of lesser importance, which could potentially delay a moderator response. When posts were not classified as crisis posts (but should have been), this was often due to vague language referring to self-harm or suicide (eg, “time’s up,” “get something/do it,” “to end it,” “making the pain worse”). Sometimes, forum users deliberately referred to self-harm or suicide with nonstandard variations, such as “SH” or “X” (eg, “attempt X,” “do X”). Future work could be instructive in determining whether these words are associated with higher levels of distress/crisis relative to the words they are meant to replace. Alternatively, custom lexicons might be developed to capture instances of self-harm or suicide represented by vague language or nonstandard variations.
In some failure cases (ie, posts that should be classified as being of higher risk than they were), the classifier did not notice expressions of hopelessness, which may cue the imminence of risk. Other prominent failure cases were instances when the classifier did not notice a poster’s dissatisfaction with mental health services that provide real-time help (eg, suicide call-back services and crisis helplines, etc). According to the labeling scheme, these posts should be classified as red. However, this dissatisfaction was often conveyed in diverse and highly contextualized ways, likely making it difficult for the system to identify. There were also posts that did not indicate imminent risk but described sensitive topics such as feeling lonely or losing a parent. These were often misclassified as green (when they should have been amber), possibly because they also contained positive language, or the sensitivity of the topic was difficult for the system to grasp.
In some of these failure cases, it may have been useful to take into account the previous post; eg, when the post in question is short or vague, the system may classify the level of risk more accurately if the previous post expresses a high level of concern about the poster or tries to convince the poster to seek immediate help.
Neural networks can build complex representations of their input features, and it can be difficult to interpret how these representations are used in the classification process. In a deeper analysis of DeepMoji features, we identified the most important emoji for classification and found that the emotional features follow a linear arrangement of expression at the class level corresponding to label severity. We also used input masking to iteratively highlight the contributions of individual words to the final classification. Such highlighting and pictorial/emoji visualizations could speed moderator review of posts. Ultimately, we believe the further development of methods to improve model interpretability will be essential in facilitating the work of mental health professionals in Web-based contexts.
In conclusion, we showed that transfer learning combined with AutoML provides state-of-the-art performance on the CLPsych 2017 triage task. Specifically, we found that an AutoML classifier trained on features from a fine-tuned GPT language model was the most accurate. We suggest this automated transfer learning approach as the first step to those building natural language processing systems for mental health because of the ease of implementation. Although such systems lack interpretability, we showed that emoji-based visualizations and masking can aid explainability.
Table including fine-grained annotations of Reachout posts.
automated machine learning
Centre for Addiction and Mental Health
Computational Linguistics and Clinical Psychology
Generative Pretrained Transformer
Linguistic Inquiry and Word Count
Tree-based Optimization Tool
Valence Aware Dictionary and sEntiment Reasoner
The Centre for Addiction and Mental Health (CAMH) Specialized Computing Cluster, which is funded by the Canada Foundation for Innovation and the CAMH Research Hospital Fund, was used to perform this research. The authors thank the Nvidia Corporation for the Titan Xp GPU that was used for this research. The authors acknowledge the assistance of the American Association of Suicidology in making the University of Maryland Reddit Suicidality Dataset available. The authors also thank the 3 anonymous reviewers for their helpful suggestions and comments. This study was supported by the CAMH Foundation and a National Science and Engineering Research Council of Canada Discovery Grant to LF.
LF owns shares in Alphabet Inc, which is the parent company of Google, the developer of the freely available Universal Sentence Encoder, which was compared with other methods.