This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Social media data are being increasingly used for population-level health research because they provide near real-time access to large volumes of consumer-generated data. Recently, a number of studies have explored the possibility of using social media data, such as data from Twitter, for monitoring prescription medication abuse. However, there is a paucity of annotated data or guidelines for data characterization that discuss how information related to abuse-prone medications is presented on Twitter.
This study discusses the creation of an annotated corpus suitable for training supervised classification algorithms for the automatic classification of medication abuse–related chatter. The annotation strategies used for improving interannotator agreement (IAA), a detailed annotation guideline, and machine learning experiments that illustrate the utility of the annotated corpus are also described.
We employed an iterative annotation strategy, with interannotator discussions held and updates made to the annotation guidelines at each iteration to improve IAA for the manual annotation task. Using the grounded theory approach, we first characterized tweets into fine-grained categories and then grouped them into 4 broad classes—
Our final annotated set consisted of 16,443 tweets mentioning at least one of 20 abuse-prone medications, spanning opioids, benzodiazepines, atypical antipsychotics, central nervous system stimulants, and gamma-aminobutyric acid analogs. Our final overall IAA was 0.86 (Cohen kappa), which represents high agreement. The manual annotation process revealed the variety of ways in which prescription medication misuse or abuse is discussed on Twitter, including expressions indicating coingestion, nonmedical use, nonstandard routes of intake, and consumption above the prescribed doses. Among the machine learning classifiers, support vector machines obtained the highest automatic classification accuracy of 73.00% (95% CI 71.4-74.5) over the test set (n=3271).
Our manual analysis and annotations of a large number of tweets have revealed types of information posted on Twitter about a set of abuse-prone prescription medications and their distributions. In the interests of reproducible and community-driven research, we have made our detailed annotation guidelines and the training data for the classification experiments publicly available, and the test data will be used in future shared tasks.
Social media has provided a platform for internet users to share experiences and opinions, and the abundance of data available has turned social networking websites into valuable resources for research. Social media chatter encapsulates knowledge regarding diverse topics such as politics [
Although the volume of data in social media is attractive, owing to the various complexities associated with the data, such as the use of nonstandard language and the presence of misspellings, advanced natural language processing (NLP) pipelines are required for automated knowledge discovery from this resource. These pipelines typically require the application of machine learning approaches, supervised or unsupervised, for information classification and extraction. Unsupervised approaches such as topic modeling are capable of automatically identifying themes associated with health topics from large unlabeled datasets [
The importance of building high-quality datasets and annotation processes cannot be overstated—the reliability of the systems and their performance estimates depend directly on it. When annotating datasets for training machine learning algorithms, the standard approach is to have multiple annotators annotate the same sample of data and then compute agreement among the different annotators. Interannotator agreement (IAA) measures provide estimates about how well defined a task is, its level of difficulty, and the ceiling for the performance of automated approaches (ie, it is assumed to be impossible for an automated system to be better than human agreement). IAA values reported for social media–based annotation tasks are often relatively low [
One of the most important steps in preparing high-quality corpora is the development of detailed and consistent annotation guidelines that are followed by all the annotators involved. Methodically prepared annotation guidelines for a target task have multiple advantages, as outlined below:
They enable the annotation process to be more consistent, leaving fewer decisions to the subjective judgments of different annotators. Consequently, this also inevitably improves IAA, naturally raising the performance ceilings for automated systems.
Well-defined guidelines document the clinical or public health purposes of the studies, enabling researchers from informatics or computer science domains to better understand the high-level objectives of the studies, thereby helping bridge the gap between the domains.
Data science approaches to health-related problems are seeing incremental development (ie, as one problem is addressed successfully, additional follow-up problems are addressed). Therefore, well-defined annotation guidelines can be crucial to enable extensions of the annotated corpora for future studies.
Datasets for a specific problem (eg, adverse drug event detection [
The considerations documented within the annotation guidelines of one study can be beneficial for research teams developing corpora for other tasks, as they can follow identical standards or make similar considerations.
In addition to datasets and automated systems that are valuable for the health informatics research community, detailed explanations of methods and justifications for annotation guidelines can impact data-centric automation—particularly for domain-specific problems, where the potential for automation is at the exploratory or early development phase.
In this paper, we discuss the preparation of a dataset from Twitter involving misuse- and abuse-prone prescription medications. Prescription medication misuse and abuse, and more generally, drug abuse, is currently a major epidemic globally, and the problem has received significant attention particularly in the United States in recent years because of the opioid crisis. Given the enormity of the problem and the obstacles associated with the active monitoring of drug abuse, recent publications have suggested the possibility of using innovative sources for close-to-real-time monitoring of the crisis [
The contribution of prescription medications to the broader drug abuse crisis has been well documented and understood in recent years. Nonmedical use of prescription medications may result in an array of adverse effects, from nonserious ones, such as vomiting, to addiction and even death. A significant portion of emergency department visits are due to the nonmedical use of prescription medications [
In this paper, we do not distinguish between prescription drug misuse and abuse and use these terms interchangeably to represent all types of nonmedical use. There are, however, subtle differences between the definitions of the terms.
We present here an analysis of how prescription medication abuse information is presented on Twitter, the details of a large-scale annotation process that we have conducted, annotation guidelines that may be used for future annotation efforts, and a large annotated dataset involving various abuse-prone medications that we envision will drive community-driven data science and NLP research on the topic. Although we primarily focus on the annotation process, guidelines, and the data, we also illustrate the utility of the corpus by presenting the performances of several supervised classification approaches, which will serve as strong baselines for future research.
In consultation with the toxicology expert of our study (JP), we selected 20 medications (generic) to include in the study. We selected drugs belonging to the classes of prescription medications that have been identified as more commonly abused: opioids (including those used for medication-assisted treatment), benzodiazepines, atypical antipsychotics, central nervous system stimulants, and gamma-aminobutyric acid analogs.
The protocol for this study was reviewed by the University of Pennsylvania’s institutional review board and was determined to meet the criteria for exempt human subjects research as all data collected and used are publicly available. In the examples presented in this paper, all identifiers have been removed, and slight modifications have been made to tweets to protect the anonymity of users.
Main drug categories, generic names, and brand names for prescription medications included in this study.
Drug category | Generic name | Brand name(s)
Opioids | Oxycodone | Oxycontin, Percocet
Opioids | Methadone | Dolophine
Opioids | Morphine | Avinza
Opioids | Tramadol | Conzip
Opioids | Hydrocodone | Vicodin, Zohydro
Opioids | Buprenorphine | Suboxone
Benzodiazepines | Diazepam | Valium
Benzodiazepines | Alprazolam | Xanax
Benzodiazepines | Clonazepam | Klonopin
Benzodiazepines | Lorazepam | Ativan
Atypical antipsychotics | Olanzapine | Zyprexa
Atypical antipsychotics | Risperidone | Risperdal
Atypical antipsychotics | Aripiprazole | Abilify
Atypical antipsychotics | Asenapine | Saphris
Atypical antipsychotics | Quetiapine | Seroquel
Central nervous system stimulants | Amphetamine mixed salts | Adderall
Central nervous system stimulants | Lisdexamfetamine | Vyvanse
Central nervous system stimulants | Methylphenidate | Ritalin
GABAa analogs | Gabapentin | Neurontin
GABAa analogs | Pregabalin | Lyrica
aGABA: gamma-aminobutyric acid.
In a preliminary study that paved the way for a long-term project [
Potential Abuse or Misuse (A): These tweets contain possible indications that the user is abusing or is seeking to abuse or misuse the medication. The user may have a valid prescription for the medication, but their manner of use is indicative of abuse or misuse, or the medication may have been obtained illegally. We also include in this category tweets that can possibly indicate abuse without confirming evidence. As the end goals of this project are to identify all potential mentions of nonmedical or improper drug use by users, we do not differentiate between misuse and abuse.
Non-abuse Consumption (C): These tweets indicate that the user has a valid prescription for the medication and is taking the medication as prescribed, or is seeking to obtain the medication for a valid indicated reason. Tweets should be placed in this category when there is evidence of possible consumption, but there is no evidence of abuse or misuse. This category only applies to personal consumption.
Drug Mention Only (M): In these tweets, the mention of the medication name is not related to wanting, needing, or using the medication, whether as prescribed or in a manner indicating misuse or abuse. For example, these tweets may share information or news about the medication, jokes, movie or book titles, or lines from movies or songs. This category also includes mentions of use by a third person that do not indicate abuse or misuse by that person.
Unrelated (U): These tweets contain the medication keywords, but the keywords do not refer to the drug; they refer to something else.
We decided on these categories and built our initial guidelines using the grounded theory approach [
From the initial topic categorization of the tweets, we added identifying markers that could be found within the tweets to help determine their classifications. With the exception of
For example, an identifying marker of abuse or misuse is the explicit or implied mention of consuming a higher dose of medication than prescribed:
let's see how fast a double dose of hydrocodone will knock me out
An identifying marker of consumption is the taking of a prescribed medication as indicated with no evidence of it being abused or misused:
I was prescribed Ritalin by my doctor to help me. i feel more hyper than focused
Meanwhile, a tweet categorized as mention gives no indication that the person mentioning the medication is taking the medication themselves:
the adderall tweets are not even funny to me. if you saw what i see daily at work it wouldn't be funny to you either.
The creation of the gold standard corpus commenced after consistent levels of agreement between the annotators were achieved. The corpus of tweets was divided into 3 overlapping sets, ensuring that each tweet was annotated at least twice, with some being annotated 3 times. The annotations were completed by 3 expert annotators trained on the guidelines (AU1, AN1, and AN2). The annotators coded each tweet according to the entire text contained in the tweet by following the guidelines established to distinguish between classes. There were no further annotations at the subtweet level. The disagreements from each set were annotated by a fourth annotator (AU2) for resolution. For the tweets that were annotated by 3 annotators, majority agreement was used to resolve disagreements. In the event that all 3 annotators disagreed on the classification, the tweet was reviewed and resolved by AU2. An overview of the process is shown in
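The resolution logic described above (majority vote where one exists, with the fourth annotator breaking full ties) can be sketched as follows. This is an illustration only; the function and its tie-breaking callback are stand-ins, not the team's actual tooling.

```python
from collections import Counter

def resolve_label(labels, adjudicate):
    """Resolve a tweet's final class from the annotators' labels.

    labels: category codes from 2 or 3 annotators, eg, ["A", "C", "C"]
    adjudicate: callback standing in for the fourth annotator (AU2),
                invoked when no majority exists.
    """
    top_label, votes = Counter(labels).most_common(1)[0]
    if votes > len(labels) / 2:      # clear majority: 2 of 2, or 2+ of 3
        return top_label
    return adjudicate(labels)        # disagreement goes to the adjudicator

# A 2-1 split resolves by majority; a 3-way split falls to the adjudicator
final = resolve_label(["A", "C", "C"], adjudicate=lambda ls: "M")  # "C"
```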
The tweet explicitly states that the user has taken or is going to take the medication to
The tweet expresses that the user
The tweet expresses a
The user mentions
In the tweet, the user expresses
The tweet conveys some information about the medication but
The mention of the medication is
The mention of the medication
The only tweets that belong to this category are those that include a drug/medication name as keyword, but the keyword is referring to something else and not the drug/medication. It can be, for example, a person’s name or a misspelling of something else.
Overview of the creation of the annotation guideline and the iterative annotation process.
To demonstrate the utility of the corpus for training systems for the automatic classification of medication abuse–related Twitter chatter, we performed a set of supervised classification experiments. Our intent with these experiments was to illustrate that machine learning algorithms are trainable using this dataset and to establish a set of baseline performance metrics that can serve as reference for future research. We split the annotated dataset into 2 sets at an approximately 80:20 ratio and used the larger set (13,172/16,443, 80.11%) for training and the smaller set (3271/16,443, 19.89%) for evaluation.
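A class-stratified 80:20 split of this kind can be sketched in plain Python as below. The seed value and the exact splitting procedure used in the study are not stated, so this is only an illustration of the general approach.

```python
import random
from collections import defaultdict

def stratified_split(items, labels, test_frac=0.20, seed=42):
    """Split labeled items so each class contributes ~test_frac to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append((item, label))
    train, test = [], []
    for group in by_class.values():
        rng.shuffle(group)                    # randomize within each class
        cut = round(len(group) * test_frac)   # per-class test allocation
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test
```

Stratification keeps the class proportions of the full corpus roughly intact in both partitions, which matters here because the abuse class is the smallest.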
We experimented with 4 classifiers—multinomial naive Bayes (NB), random forest (RF), support vector machines (SVM), and deep convolutional neural network (dCNN). Our extensive past work on social media mining for health research and social media text classification has demonstrated that identifying the best classification strategy requires elaborate experimentation and is best identified by means of community-driven efforts such as shared tasks [
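As one illustration of these baselines, a multinomial NB classifier over bag-of-words counts with Laplace smoothing can be written from first principles as below. The actual experiments' tokenization, feature sets, and library choices are not reproduced here, so treat this as a minimal sketch rather than the evaluated system.

```python
import math
from collections import Counter

class MultinomialNB:
    """Minimal multinomial naive Bayes over bag-of-words token counts."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.priors = {c: math.log(labels.count(c) / len(labels)) for c in self.classes}
        self.word_counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = doc.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.totals = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def predict(self, doc):
        tokens = doc.lower().split()
        v = len(self.vocab)

        def score(c):
            # log prior + sum of smoothed log likelihoods
            s = self.priors[c]
            for t in tokens:
                s += math.log((self.word_counts[c][t] + self.alpha) /
                              (self.totals[c] + self.alpha * v))
            return s

        return max(self.classes, key=score)
```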
In total, a sample of 16,443 tweets was selected for annotation from more than 1 million posts collected from April 2013 to July 2018. This somewhat arbitrary number resulted from the various filtering methods (eg, removing short tweets and undersampling tweets mentioning stimulants) that we applied to a much larger random sample of about 50,000 tweets. Before undersampling, approximately three-quarters of the retrieved tweets mentioned stimulants, and only approximately one-fifth of those were kept following the sampling process. From this chosen set, 517 randomly selected tweets were used in the initial iterations for improving agreement and developing the guidelines; these were then adjudicated and added to the gold standard corpus. The rest of the corpus was split into 3 sets containing 15,405 (set 1), 8016 (set 2), and 6906 tweets (set 3). In addition, a fourth set contained overlapping tweets that were annotated by all 3 annotators (set 4). Each of these sets shared an arbitrary number of tweets with at least one other set, and the annotators were not aware of these overlaps during annotation. Pairwise IAA, measured using Cohen kappa [
An analysis of the disagreements suggested that they were somewhat evenly distributed across the categories of interest. Over the first 3 sets, there were a total of 3631 disagreements among the annotators: 1082 (29.80%) were between the abuse and mention classifications, 1160 (31.95%) were between abuse and consumption, 1186 (32.66%) were between consumption and mention, and the remaining 203 (5.59%) were between unrelated and all other categories. The analyses also showed that the disagreements did not result from the annotators' incorrect interpretations of the guidelines but from their interpretations of the tweets themselves. We, therefore, concluded that it was unlikely that we could further increase the IAA by updating or modifying the annotation guidelines.
Annotation agreement results.
Set | Annotators | Tweets, n | Agreement, n (%) | IAAa |
1 | AN1+AU1 | 15,405 | 13,560 (88.02) | 0.815 |
2 | AN1+AN2 | 8016 | 6414 (80.02) | 0.681 |
3 | AU1+AN2 | 6906 | 6709 (97.15) | 0.953 |
4 | AN1+AN2+AU1 | 6906 | —c | 0.904b |
aInterannotator agreement.
bFleiss Kappa.
cNot applicable.
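The pairwise figures in the table above use Cohen kappa, which corrects the observed agreement for the chance agreement implied by each annotator's label distribution. A self-contained sketch of the computation:

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen kappa for two annotators' label sequences of equal length."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # observed agreement: fraction of items labeled identically
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # expected (chance) agreement from each annotator's marginal distribution
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[label] / n) * (cb[label] / n) for label in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)
```

For set 1, for example, an observed agreement of 88.02% paired with a kappa of 0.815 reflects this chance correction; kappa is always at or below raw percent agreement when the label distribution is skewed.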
Distribution of tweets in the annotated corpus by annotation category and drug class.
Class-specific F1 scores, overall accuracy, and 95% CIs for the accuracy for 4 classifiers.
Classifier | Abuse | Consumption | Mention | Unrelated | Correct predictions and accuracy (N=3271), n (%) | 95% CI |
NBa | 0.51 | 0.66 | 0.77 | 0.81 | 2257 (69.00) | 67.4-70.6 |
SVMb | 0.53 | 0.67 | 0.82 | 0.78 | 2388 (73.00) | 71.4-74.5 |
RFc | 0.30 | 0.66 | 0.81 | 0.79 | 2352 (71.90) | 70.3-73.4 |
dCNNd | 0.35 | 0.64 | 0.79 | 0.16 | 2355 (72.00) | 70.3-73.5 |
aNB: naive Bayes.
bSVM: support vector machine.
cRF: random forest.
ddCNN: deep convolutional neural network.
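The 95% CIs in the table are consistent with a standard normal (Wald) approximation for a binomial proportion. The paper does not state the exact interval method, so the sketch below may differ from the reported endpoints by a rounding step.

```python
import math

def accuracy_ci(correct, n, z=1.96):
    """Wald 95% CI for classification accuracy: p +/- z * sqrt(p(1-p)/n)."""
    p = correct / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# SVM row of the results table: 2388 correct of 3271
lo, hi = accuracy_ci(2388, 3271)
```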
The iterative process undertaken for our guideline development was crucial for concretizing the definitions of each class, identifying sample tweets that present the variety of information types within each class, and reducing decision-making uncertainties among the annotators. Through the process, we raised the IAA from 0.569 in the first round to a combined average of 0.861, which can be interpreted as an “almost perfect agreement” [
Examples of difficult-to-annotate instances.
Tweet | Category | Justification |
generic xanax and adderall look way too alike. oh no what have i done...? | Ca | There is implicit evidence that the user took the medication, although there is no evidence of abuse. |
Going by a restaurant before 10:30 and not stopping to get breakfast is how you know you're on Vyvanse | C | There is implicit evidence that the user took the medication, although there is no evidence of abuse. |
if this tweet sticks i'll eat my shorts (made of adderall) | Ab | The user is expressing an intent to abuse, with an implicit indication that he/she has access to the medication. |
i always freak out before a speech, always... this is the part where i'm supposed to ask my gp for zoloft or roofies but nooo, | Mc | The user is expressing that he/she does not have access to the medication and expressing a situation. |
i swear vyvanse got you finishing things you didn't know you had to doo #justironedmysocks | C | The tweet expresses the effect of Vyvanse more like a side effect, with no evidence or hint to indicate that the drug is being abused. |
so glad i did my research and never let anyone convince me to take tysabri or gilenya. dr. was so informative! | M | The user is expressing that he or she never took the medication. |
vyvanse i love you so much omg like i want to marry you i want to love you | C | The user is expressing love for Vyvanse, although never really expressing or hinting at possible abuse. If there was any hint of abuse, this tweet would be labeled as such. |
took double dose vyvanse today by accident. i'm bouncin all around. | A | Although the misuse is unintentional, the user is expressing certain sensations brought about by the drug, so it was considered to be abuse-indicating. This is another borderline case. |
aC: Non-abuse consumption.
bA: Potential abuse or misuse.
cM: Drug mention only.
We also performed a word-level analysis to better understand how the contents of the tweets belonging to the 4 classes differed, if at all. We found that the consumption tweets contain more health-related terms (eg, pain, anxiety, sleep, and doctor), whereas the unrelated tweets contain mostly irrelevant terms (eg, song, Anderson, and Hollywood). There are similarities in the word frequencies in the abuse or misuse and mention categories, indicating that discussion about abusing medications is not remarkably different from general discussions about the medications. This adds to the difficulty of accurately classifying the tweets belonging to the smaller abuse or misuse class.
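A word-level comparison of this kind can be approximated with simple per-class frequency counts, as sketched below. The stopword list and whitespace tokenization here are placeholders, not the analysis pipeline actually used in the study.

```python
from collections import Counter

# Placeholder stopword list; a real analysis would use a fuller one
STOP = frozenset({"the", "a", "i", "to", "my", "and", "on", "is", "you"})

def top_words(tweets_by_class, cls, k=5):
    """Most frequent non-stopword tokens among tweets of one annotation class."""
    counts = Counter(
        tok
        for tweet in tweets_by_class[cls]
        for tok in tweet.lower().split()
        if tok not in STOP)
    return [word for word, _ in counts.most_common(k)]
```

Comparing the resulting ranked lists between classes (eg, consumption vs unrelated) surfaces exactly the kind of contrast described above: health-related terms dominate one list and incidental terms the other.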
In addition to the word-level similarities between the abuse or misuse and mention classes, the ambiguity in the language and the lack of context within the tweets leave them open to subjective interpretation, which affects the annotation process itself. These interpretations are troublesome when there can be multiple meanings in the clues that are present. For example, a tweet may have no explicit mention of abuse, but the use of certain keywords (eg, popped) or the situation may suggest that there might be misuse or abuse involved (possible abuse). However, it is not unreasonable that the use of such expressions would also be adopted by a patient taking their medication in the prescribed manner, making it difficult for the annotators to decide when it should be considered abuse and when it should be considered consumption. We sought to mitigate the effect of this uncertainty on the quality of the corpus by double, or even triple, annotating each tweet to achieve consensus.
The key objective behind creating detailed annotation guidelines and making them publicly available is to ensure the reproducibility of the annotation experiments. This is of particular importance for health-related data, from public social media or other sources such as electronic health records, which may have restrictions on public sharing, requiring researchers from different institutions to annotate their own data. For example, Twitter requires researchers to make a
The expansion of the classes did decrease the accuracy we achieved from our prior pilot study [
The principal findings and outcomes of the work described in this paper are summarized as follows:
Creation of annotated data that will be used to promote community-driven research focusing on social media mining for prescription medication abuse research. We have made the manually labeled training data available with this manuscript, and the evaluation set will be used to evaluate systems via shared tasks [
We have provided elaborate descriptions about how prescription medication misuse or abuse is discussed on Twitter for a number of medications. Our detailed annotation guideline may be used by others to contribute more annotated datasets involving additional sets of medications.
The machine learning results mentioned in the paper present strong baseline and benchmark results for future systems trained and evaluated on this dataset.
A number of recent studies, including our preliminary studies on the topic [
The study has several limitations, particularly in terms of scope. Only Twitter data are included in this study and the accompanying dataset, although data on misuse or abuse are also available from other social networks such as Instagram and Reddit [
In this paper, we discussed how users present information about prescription medication abuse and consumption on Twitter, described the iterative annotation of a large corpus containing 16,443 tweets, outlined the annotation guidelines that we have made available along with this publication, and presented the performance of several baseline classifiers over a sample of the corpus to demonstrate its utility. In our annotation guideline, we identified and defined 4 broad categories of discussion related to abuse-prone prescription medications: potential abuse or misuse, non-abuse consumption, mention only, and unrelated. The guidelines were improved over a series of annotation iterations and reviews until we reached an acceptable level of consistency in our annotations. Through this process, we created a high-quality annotated corpus that can serve as a standardized dataset for future research on the topic. We expect that our annotation strategy, guidelines, and dataset will provide a significant boost to community-driven, data-centric approaches for monitoring prescription medication misuse or abuse from Twitter. Considering the growing problem of drug abuse, social media–based research may provide important, previously unavailable insights about the problem and perhaps even enable the discovery of novel abuse-prone medications or medication combinations.
Full annotation guidelines.
deep convolutional neural network
interannotator agreement
naive Bayes
National Institute on Drug Abuse
National Institutes of Health
natural language processing
random forest
support vector machine
Research reported in this publication was supported by NIDA of the NIH under award number R01DA046619. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
KC led the annotation process and the guideline preparation, served as an annotator, and contributed to the writing of the manuscript. AS performed some annotation and disagreement resolution, conducted the classification experiments, and wrote significant portions of the manuscript. JP provided domain expertise for the study, finalized medications to be included, reviewed the guidelines, and contributed to the writing of the manuscript. GG provided high-level guidance to the project and contributed to the writing of the manuscript.
None declared.