Characterizing and Identifying the Prevalence of Web-Based Misinformation Relating to Medication for Opioid Use Disorder: Machine Learning Approach

Background: Expanding access to and use of medication for opioid use disorder (MOUD) is a key component of overdose prevention. An important barrier to the uptake of MOUD is exposure to inaccurate and potentially harmful health misinformation on social media or web-based forums where individuals commonly seek information. There is a significant need to devise computational techniques to describe the prevalence of web-based health misinformation related to MOUD to facilitate mitigation efforts. Objective: By adopting a multidisciplinary, mixed methods strategy, this paper aims to present machine learning and natural language analysis approaches to identify the characteristics and prevalence of web-based misinformation related to MOUD to inform future prevention, treatment, and response efforts. Methods: The team harnessed public social media posts and comments in the English language from Twitter (6,365,245 posts), YouTube (99,386 posts), Reddit (13,483,419 posts), and Drugs-Forum (5549 posts). Leveraging public health expert annotations on a sample of 2400 of these social media posts that were found to be semantically most similar to a variety of prevailing opioid use disorder–related myths based on representational learning, the team developed a supervised machine learning classifier. This classifier identified whether a post’s language promoted one of the leading myths challenging addiction treatment: that the use of agonist therapy for MOUD is simply replacing one drug with another. Platform-level prevalence was calculated thereafter by machine labeling all unannotated posts with the classifier and noting the proportion of myth-indicative posts over all posts. 
Results: Our results demonstrate promise in identifying social media postings that center on treatment myths about opioid use disorder, with an accuracy of 91% and an area under the curve of 0.9. These discussions varied across platforms in terms of both prevalence and linguistic characteristics, with the lowest prevalence on web-based health communities such as Reddit and Drugs-Forum and the highest on Twitter. Specifically, the prevalence of the stated MOUD myth ranged from 0.4% on web-based health communities to 0.9% on Twitter.


Background
In the United States, opioid overdose continues to be a leading cause of death [1]. The Centers for Disease Control and Prevention estimates that the total economic burden of prescription opioid misuse in the country alone is US $78.5 billion a year, including the costs of health care, lost productivity, treatment, and criminal justice involvement [2]. Alarmingly, opioid overdoses increased by 30% from July 2016 to September 2017 in 52 areas in 45 US states [3]. Consequently, in 2017, the Department of Health and Human Services declared it a public health emergency [4]. Central to addressing the opioid crisis is expanding access to medication treatment for opioid use disorder (MOUD) [5]. MOUD increases treatment retention and reduces opioid use, risk behaviors that transmit blood-borne pathogens, and overdose mortality [6]. However, despite its well-documented effectiveness, studies have found that MOUD is underused due in part to stigma and misperceptions about treatment [7].
In recent years, many individuals have been seeking both conventional and nonconventional ways to recover from substance use, including using web-based resources [8]. For substance use disorders in general, as well as opioid use disorder (OUD) specifically, research has shown that individuals turn to the web for promoting and discovering recovery strategies, for example, appropriating the Forum77 forum for prescription drug use recovery [9] and participating in 12-step programs such as Narcotics Anonymous [10,11]. Social support is another motivation behind individuals with substance use disorders turning to social media; Rubya and Yarosh [12] examined peer support for substance use disorder recovery meetings through video chat, discovering that video chat support groups not only provide immediacy and convenience in meeting needs but can also be places for obtaining emotional and informational support. More recently, researchers have examined patterns of anonymity in web-based recovery communities [13]. Specific to OUD, previous studies have investigated the different types of web-based discourse associated with opioid use, including personal use, whether it is associated with legitimate use or abuse of opioids [14], or whether it involves the promotion of clinically unverified treatments [15]. Abuse discourse on social media platforms has been further broken down into stand-alone use and co-use of opioids with other opioids, illicit drugs, and alcohol [16]. In addition, a prior study analyzed the web-based discourse surrounding the perception of opioids [17]. The perception of opioids included commentary on the opioid crisis, opioids in general, and interaction with news surrounding the opioid crisis or medical use of opioids [17]. Researchers in the past have also harnessed social media data as unobtrusive sensors to identify individuals who might benefit from or be receptive to treatment and recovery interventions [18].
Others have computationally examined and compared web-based discussion communities to discover the intent to contribute to web-based mental health communities [19]. In general, social media platforms have been found to allow increased self-disclosure for users to discuss otherwise sensitive and stigmatizing topics such as OUD [20]. Apart from self-disclosure, social media data provide unique opportunities for understanding the users' sentiments and opinions [21], which may be insightful from the perspective of addiction treatment.
Despite the positive benefits of social media, the recovery efforts of individuals with OUD are often challenged because of the pervasiveness of inaccurate and potentially harmful health misinformation on social media platforms [15]. Health misinformation is defined as a health-related claim of fact that is currently false because of a lack of scientific evidence [22]. In general, misinformation is usually attributed to misconceptions and is not intended to cause harm. Disinformation is false information that is created deliberately to cause harm, with motivations that are often social, political, or financial. Although misinformation and disinformation are inherently false, malinformation is usually based on real information that is taken out of context to inflict harm [23]. Fake news is defined as fabricated information that mimics news media content in form but not in organizational process or intent [24,25]. Molina et al [24] have outlined key indicators of fake news such as content that is not fact-checked, is emotionally charged, is written in narrative style, has unverified sources, or comes from an unknown source. In this study, we focused on the language of false claims surrounding MOUDs regardless of intent; therefore, it might be the case that we captured a few instances of disinformation, possibly on web-based platforms that lack constant domain-specific moderation. Thus, we use the term health misinformation as we assume that the spread of these claims is not intentional.
From the discourse on infectious disease outbreaks and global epidemics to alternative therapies to tackle behavioral health problems, web-based misinformation can have adverse effects on public health, including negatively influencing people's health literacy, attitudes, beliefs, and health-related decision-making [22]. For example, antivaccine-promoting social media posts legitimize debate about vaccine safety, contribute to reductions in vaccination rates, and increase vaccine-preventable diseases such as measles [26]. In the context of public health crises, social media rumors circulating during the Ebola outbreak in 2014 were found to create hostility toward health workers, which posed challenges in controlling the epidemic [27]. Most recently, the COVID-19 pandemic has come to be defined by a tsunami of persistent misinformation to the public on everything from the utility of masks and the effectiveness of social distancing to even the promise of vaccines, together contributing to an increased COVID-19 pandemic burden [28]. At-risk populations are known to be particularly vulnerable to misinformation [22,29] because of a lack of reliable information outside of formal clinical or rehabilitation contexts [30,31]. In fact, studies show that because of exposure to such misinformation, people worry that they will be ostracized by their community if their substance use is revealed to others, thus delaying treatment [32].
Given the limited uptake of MOUD, the potential contribution of health misinformation to this public health problem, and the fact that information about barriers to MOUD is challenging to ascertain from other data sources, exploring digital health-seeking behavior through passive sensing of misinformation related to MOUD provides an important avenue for addressing this problem. Thus, infodemiology, which refers to the science of studying the distribution and determinants of information and user-generated content in an electronic medium such as the web in general and social media in particular [33], has the opportunity to shape MOUD-related health promotion strategies and policies. Given the potential impact of misinformation in the midst of the ongoing overdose crisis, there is a critical need to better understand misinformation-related social media posts on OUD treatment. In fact, in recent years, approaches in infodemiology have been noted to be important in mitigating public health problems stemming from infodemics [34,35], a portmanteau of information and epidemic that typically refers to a rapid and far-reaching spread of both accurate and inaccurate information about a disease.

Objective
In this study, we focus on one particular myth (and its language variants) related to MOUD: agonist therapy or medication-assisted treatment (MAT) is simply replacing one drug with another. For example, someone might express this myth by saying "You are not really in recovery if you are on Suboxone." This myth is believed to be one of the major reasons cited for individual hesitancy to initiate MOUD; it has been discussed extensively in clinical literature [29,36,37] and has been discredited by evidence that MOUDs facilitate recovery and that multiple other chronic health conditions such as diabetes and asthma necessitate reliance on daily medication to maintain health.
By adopting a multidisciplinary, mixed methods strategy, this paper aims to present the first work that investigates the characteristics and prevalence of web-based misinformation related to MOUD across 3 types of web-based social platforms to inform future prevention, treatment, and response efforts. Our contributions include a set of machine learning (ML) models that classify whether a post revolves around the myth that MOUD is simply replacing one drug with another, as well as explorations of the lexical variations characterizing web-based conversations relating to this myth.

Data Set Curation
We first identified and curated a set of clinically grounded and publicly prevalent myths that surround OUD treatment and developed a lexicon of opioid-related keywords associated with different aspects of OUD. We captured different types of opioids, such as natural opiates, semisynthetic opioids, and synthetic opioids, and included opioids that were over-the-counter, prescription based, or illicit. For each generic name, we also included trade and combination product names in consultation with the substance use literature and the public health coauthors. This resulted in a total of 152 keywords curated in the lexicon. We then curated a diverse data set from Twitter, YouTube, and the web-based health communities Reddit and Drugs-Forum. These platforms were selected as (1) they are adopted pervasively by Americans and (2) there are well-established means and infrastructures for collecting meaningful data sets by leveraging application programming interfaces (APIs) to query them and access public posts on these platforms. According to the Pew Research Center, in 2021, 18% of US adults use Reddit, 23% use Twitter, and 81% use YouTube [38]. In addition, these platforms have been mined in prior substance abuse literature for abuse monitoring and digital epidemiology purposes [39][40][41]. For all the platforms we investigated, we focused on public posts and messages created between January 1, 2018, and December 31, 2019.
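The keyword-filtering step described above can be sketched as follows. The handful of lexicon entries here are illustrative stand-ins; the study's actual lexicon contained 152 generic, trade, and combination product names.

```python
import re

# Illustrative subset of an opioid lexicon (the study used 152 keywords).
LEXICON = ["buprenorphine", "naltrexone", "methadone",
           "suboxone", "fentanyl", "kratom"]

# One case-insensitive pattern with word boundaries so substrings inside
# unrelated words do not match.
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, LEXICON)) + r")\b", re.IGNORECASE
)

def matches_lexicon(post: str) -> bool:
    """Return True if the post mentions any keyword in the lexicon."""
    return PATTERN.search(post) is not None

posts = [
    "Started Suboxone last month and it helps",
    "Nothing drug-related in this post",
]
filtered = [p for p in posts if matches_lexicon(p)]
```

In practice, each platform query would feed its returned posts through such a filter before any further processing.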
Our data set collection methodology for Twitter comprised querying for all tweets that included 1 of the words in our lexicon. This process yielded a total of 6,365,245 tweets. For YouTube, owing to limitations in the number of comments that can be accessed, we restricted the 152 keywords to 11 OUD treatment keywords such as buprenorphine and naltrexone. We used the YouTube application programming interface (API) to identify 552 YouTube videos that contained 1 of the 11 keywords in the title and then collected all of the associated comments (99,386 comments). We relied on expert domain knowledge to identify subforums pertinent to OUD for Reddit and Drugs-Forum and used the full set of 152 keywords for these sites. For Reddit, we used data from 22 opioid-specific subreddits: r/Carfentanil, r/opiates, r/fentanyl, r/opiatesmemorial, r/modquittingkratom, r/Methadone, r/suboxone, r/kratom, r/heroin, r/quittingkratom, r/Tianeptine, r/loperamide, r/naltrexone, r/oxycodone, r/OpiatesRecovery, r/Opiatewithdrawal, r/lean, r/heroinaddiction, r/HeroinHeroines, r/OpiateChurch, r/suboxone, and r/OurOverUsedVeins. This resulted in a total of 1,189,590 posts and 12,293,829 comments. In addition, we collected all 5549 messages posted under the Opiates and Opioids subforums on Drugs-Forum [42]. Throughout the paper, we have combined Reddit and Drugs-Forum content under the category of web-based health communities, as both have similar structure, format, and affordances.

ML Approach Using Expert Involvement
Web-based discourse surrounding OUD is semantically rich; that is, there are different words and combinations of words that people use to convey meaning. Previous literature has quantitatively and qualitatively investigated various categories of language pertaining to OUD, including OUD use (own use, use by others, abuse, legitimate use, and co-use), OUD perception (commentary on the opioid crisis or opioids in general), and OUD advertisements [14,16,17]. In light of such linguistic richness and prior investigations, we adopted an ML and natural language analysis methodology to identify posts relevant to the myth under investigation within this large search space.
We first leveraged representation learning, a family of techniques that allow a system to automatically discover the representations needed for feature detection or classification from raw data [43], to construct document-level embeddings (consisting of 4096 dimensions) of the myth statement noted earlier. For this, we used a bidirectional long short-term memory (LSTM) sentence encoder model universally trained on a natural language inference task [44]. LSTM was a suitable choice here as it allowed us to learn long-term dependencies among words in sentence structures. We then used this model to encode all the collected posts. Following this step, for the MOUD-related seed myth under investigation, we obtained the k-nearest neighbors (KNN; k=200), that is, the semantically most similar posts per platform. Second, using a mixed methods approach, we harnessed qualitative content analysis in the form of public health expert annotations to label a total of 800 posts (200 KNNs per platform) and annotate whether each post was relevant to the myth (ie, whether the post discussed MOUD and described MOUD as using one drug to replace another). Hence, we modeled this problem as a binary classification task where the positive class denoted a post discussing the aforementioned piece of misinformation and the negative class represented any post that was not relevant to the myth. Each myth KNN post was annotated by the same expert public health annotator to provide consistent annotations within the linguistic domain of a given myth.
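The retrieval step can be sketched as below. Note the stand-ins: the study encoded text with a pretrained bidirectional LSTM sentence encoder (4096-dimensional embeddings), whereas this sketch substitutes TF-IDF vectors so that it runs without a pretrained model; the seed myth phrasing and posts are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Seed myth statement and a small corpus of posts (illustrative).
myth = "medication assisted treatment is just replacing one drug with another"
posts = [
    "suboxone is just swapping one drug for another",
    "my dog loves the park",
    "methadone replaces one addiction with another addiction",
    "great weather today",
    "you are not really in recovery if you are on suboxone",
]

# The study used 4096-d BiLSTM sentence embeddings; TF-IDF vectors
# stand in for those embeddings here.
vectorizer = TfidfVectorizer().fit(posts + [myth])
post_vecs = vectorizer.transform(posts)
myth_vec = vectorizer.transform([myth])

# Retrieve the k posts semantically closest to the myth (k=200 in the study).
knn = NearestNeighbors(n_neighbors=3, metric="cosine").fit(post_vecs)
distances, indices = knn.kneighbors(myth_vec)
candidates = [posts[i] for i in indices[0]]  # sent to expert annotators
```

The retrieved candidates, rather than the full multimillion-post corpus, are what the public health experts then annotated.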
Leveraging these annotations as training data, we finally built and evaluated a series of supervised ML models, ranging from logistic regression (LR) and support vector machines to feedforward neural networks and LSTM networks. Our feature set included lexical features such as n-grams (n=1, 2, 3), term frequency-inverse document frequency (TF-IDF) weights, and representation learning features, including sentence-based embeddings (semantic) and transformer-based embeddings, such as bidirectional encoder representations from transformers [45] and bidirectional encoder representations from transformers for biomedical text mining [46]. We used all annotations belonging to our myth and considered all the samples from other myths as negative training samples. On the basis of this process, we obtained 171 positive samples and 2229 negative samples. Owing to this large imbalance, we leveraged an oversampling technique from the rare class, called the synthetic minority oversampling technique [47]. We then split the data set into training and test samples with an 80% to 20% split, respectively. We leveraged 2 techniques for cross-validation: k-fold cross-validation (for LR and support vector machine models) and an independent validation sample to tune a model's hyperparameters (for the LSTM model). Table 1 and Figure 1 show the best-performing ML models in terms of their area under the curve, precision, recall, and F1 scores. Our best-performing model was a combination of TF-IDF features and an LR classifier, achieving a precision of 0.85, a recall of 0.91, an F1 score of 0.88, and an area under the curve of 0.9. By applying our best-performing model to machine label all posts in our data sets, we were able to estimate the prevalence of posts related to the myth under investigation on each platform.
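The training setup can be sketched as below. This is a minimal stand-in rather than the study's pipeline: the toy posts and labels are invented, simple random oversampling with replacement approximates the synthetic minority oversampling technique, and only the best-performing feature-classifier pair (TF-IDF with LR) is shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# Toy annotated posts (illustrative, not from the study's data set).
# Positive class: promotes the "replacing one drug with another" myth.
myth_posts = [
    "suboxone is just replacing one drug with another",
    "methadone just swaps one addiction for another",
    "mat is trading one drug for another not real recovery",
]
other_posts = [
    "buprenorphine saved my life and kept me in recovery",
    "ask your doctor about all available treatment options",
    "naltrexone blocks the effects of opioids in the brain",
    "congrats on one year in recovery we are proud of you",
    "withdrawal symptoms can be managed with medical support",
    "finding a prescriber near you is the first step",
]

# The study used SMOTE; here the minority class is randomly oversampled
# (with replacement) until the classes are balanced.
myth_posts = resample(myth_posts, replace=True,
                      n_samples=len(other_posts), random_state=42)
texts = myth_posts + other_posts
labels = [1] * len(myth_posts) + [0] * len(other_posts)

# TF-IDF n-grams + logistic regression, the study's best combination.
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(texts)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

def predict_myth(post: str) -> int:
    """1 if the post is classified as myth-indicative, else 0."""
    return int(clf.predict(vectorizer.transform([post]))[0])
```

In the study, this fitted classifier was then evaluated on a held-out 20% test split before being applied to the unannotated corpus.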
The prevalence of posts among our sampled comments that were related to the myth that the use of MOUD does not constitute true recovery was 0.4%, 0.9%, and 0.58% for web-based health communities, Twitter, and YouTube, respectively. For additional context and interpretability in terms of how our best-performing models operated per platform, 2 examples of posts that were classified correctly by our classifier are provided in Table 2, along with the top words used by the classifier to attain a relevancy decision for each post on each platform. Here we observed some consistencies in the discussions of the myth across platforms. For example, we noted that our model was able to pick up on the use of verbs synonymous with replac, such as switch, which was not originally included in the myth phrasing. In addition, the verb go was used in multiple contexts, such as going to Alcoholics Anonymous meetings instead of relying on MATs and going through withdrawals from MAT. We also noted the presence of multiple drug names such as Ativan, buprenorphine, methadone, and suboxone.
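The platform-level prevalence estimate itself reduces to the labeled proportion of posts; a minimal sketch, using a hypothetical keyword-based stand-in for the trained classifier, is:

```python
def prevalence(posts, classify) -> float:
    """Fraction of posts the classifier labels as myth-indicative."""
    labeled = [classify(p) for p in posts]
    return sum(labeled) / len(labeled) if labeled else 0.0

# Stand-in classifier: flags the stemmed token "replac" (illustrative only;
# the study applied its trained TF-IDF + LR model here).
flag = lambda p: int("replac" in p)
rate = prevalence(["suboxon replac drug", "stay strong", "good luck"], flag)
```

Computing this quantity separately over each platform's machine-labeled corpus yields the per-platform figures reported above.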

Web-based health communities (example posts; stemmed model input followed by the original text):

"take kratom switch one drug anoth go aa meeting for real iv ativan usual go drug symptom" ("Don't take the kratom. Don't switch one drug for another. Go to an aa meeting. for real. IV Ativan is usually the go to drug for such symptoms.")

"fulli understand fear withdraw symptom suck doctor know intak prescript game plan set quit also effort ween med sure heard suboxon prescript medic short summari itll help withdraw well act like crutch anoth thing kratom go withdraw one one learn walk away medfre person take lot gut courag take" ("Your fear of the withdrawal symptoms is totally legit. They suck. Did you tell your doctor about your intake of the prescription? There needs to be some sort of a planned approach for not just quitting, but also to make sure you ween off your meds properly. Have you heard of Suboxone? It's a prescription medication that basically will help you with withdrawals as well as give you a crutch. Kratom is another option, but going through the withdrawal alone and learning how to walk away as a substance-free person takes a lot of daring and audacity, so you need to have what it takes for it.")

The top terms used by the classifier are listed in Table 3. These terms include mat, assist, treatment, replac, therapi, rehab, methadon, behavior, habit, and substitut. Furthermore, to provide additional insight into the words used by the ML model to identify myth-related posts, for each of the top 10 terms, we display the 15 words with the closest semantic proximity (based on training a Word2Vec embedding model [48]) as measured by cosine similarity. Qualitative assessment of the identified words revealed excellent identification of synonymous terms and phrases, including those that were unlikely to be readily suggested or identified by human readers, such as ost (opioid substitution therapy). In Table 3, the features column depicts the features and their term frequency-inverse document frequency scores, and the nearest neighbors column depicts the cosine similarity between each word and the corresponding feature.

Words in posts are stemmed before being fed to models (eg, recovery is stemmed to its root recoveri). Web-based health communities refer to Reddit and Drugs-Forum.
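The study identified each top term's semantic neighbors with a trained Word2Vec model [48]. As a dependency-free illustration of the same idea, the sketch below ranks vocabulary words by the cosine similarity of their co-occurrence profiles over a toy stemmed corpus; the corpus and terms are invented.

```python
import math
from collections import Counter, defaultdict

# Toy corpus of stemmed posts (illustrative).
corpus = [
    "mat replac one drug anoth".split(),
    "methadon substitut one drug anoth".split(),
    "suboxon switch one drug anoth".split(),
    "therapi help recoveri stay clean".split(),
    "meeting help recoveri stay clean".split(),
]

# Co-occurrence vectors: counts of other words appearing in the same post.
cooc = defaultdict(Counter)
for post in corpus:
    for w in post:
        for v in post:
            if w != v:
                cooc[w][v] += 1

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def nearest(term: str, n: int = 3) -> list:
    """Words whose co-occurrence profile is most similar to the term's."""
    scores = {w: cosine(cooc[term], cooc[w]) for w in cooc if w != term}
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

On this toy corpus, nearest("substitut") surfaces terms such as replac and switch, mirroring how the study's Word2Vec neighbors revealed synonymous phrasings of the myth.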

Principal Findings
Harms propagated by web-based misinformation are plentiful and come at both financial and societal costs. People often accept what they read as true, especially if it comes from a reasonably reputable source, and do not question the information, no matter how astounding or alarming. In fact, people even repeat the more remarkable information regardless of how accurate it is. In the context of MOUD, such misinformation can lead to grave consequences, including overdose deaths [29]. To the best of our knowledge, this is the first study to examine MOUD-related misinformation on a large scale, harnessing conversations happening on the web.
Closely related to our work is the study by Jamison et al [49], which leverages a collection of tweets to quantify vaccine misinformation. Similar to our work, Jamison et al [49] coded tweets into thematic categories based on vaccine sentiment (positive, negative, or neutral). However, our work leveraged thematic categories (relevant and not relevant to the myth) to design ML-based models that are able to identify misinformation in the context of MOUDs. Heimer et al [29] discussed prevalent misconceptions about OUDs in the United States through 3 crises (1865-1913, 1960-1975, and 1995-today). Similar to our focus, the authors acknowledged opioid abstinence-based recovery models as a prevailing misconception and promoted the large-scale expansion of MAT. Our work complements their work by investigating this misconception quantitatively through the lens of social media. Chenworth et al [50] investigated the perception of the general public toward methadone and buprenorphine-naloxone on Twitter. The authors identified that a common barrier to treatment with these medications was the idea of opioid substitution, that is, the exchange of one opioid addiction for another [50]. Our work investigates this barrier at a deeper level by building models that are able to recognize this type of discourse on social media.
Our results have important public health implications. Across multiple platforms, we detected that the prevalence of posts about a single myth related to medication treatment for OUD in our sample ranged from 4 per 1000 posts on web-based health communities to 9 per 1000 posts on Twitter. This is notable, as, at any time, there are likely multiple myths being discussed on the web, suggesting that the total volume of misinformation content related to opioids may be a substantial proportion of the total posts. The prevalence of such information has not been previously quantified, and this study offers important insights into the potential scope of this health information issue.
Although we cannot speculate on the exact reason why Twitter presented more OUD-related misinformation, as doing so requires causal inference analysis that is beyond the scope of this paper, prior literature has pointed out the lack of active expert or clinical-based moderation on Twitter [51]. Although web-based health communities are also not immune to bad behavior and antisocial activities such as trolling, spamming, and harassment, these communities are often guided by strict norms against such behavior and moderated to ensure the quality and credibility of the content being shared [52]. Prior studies on different types of web-based health communities have demonstrated that adequate active moderation increases the engagement of members and consequently also increases the beneficial outcomes for members in a web-based community [53]. In fact, the moderators themselves regard their moderation style as important for the regulation and stimulation of membership engagement [54,55]. We suspect that, because of these established moderation norms, we observed a relatively lower prevalence of MOUD misinformation in the web-based communities we studied. We noted that Twitter does implement some broad governance rules that allow for certain types of information to stay on the platform, whereas others are removed (eg, graphic violence and adult content [56]). The platform also has provisions to tackle the widespread presence of hate speech and abusive content [57]. However, to the best of our knowledge, Twitter does not implement policies toward the moderation of MOUD misinformation. Our conjecture is that, because of this existing practice, our study revealed a greater prevalence of this misinformation on the platform.
Nevertheless, in light of the ongoing COVID-19 pandemic, Twitter has broadened its definition of harm to address "content that goes directly against guidance from authoritative sources of global and local public health information" [58]. We hope that the findings of this study can motivate social media platforms to consider moderation approaches toward substance misuse information as well.
Given the significant prevalence of myths around OUD treatment, as shown in this study, a possible approach to counter web-based misinformation could be to perform targeted, expert fact-checking of social media posts. This could mirror and harness guidelines adopted by public health organizations to debunk unverified information about OUD treatment. For instance, substance use experts can be identified and asked to review the content of social media posts to determine their accuracy. These experts could critically appraise a post and produce a response comprising a lay summary of the evidence in addition to a detailed, referenced evidence review. This review could be directly linked to the original post through appropriate platform affordances to provide users with quick access to fact-checked information. Specific fact-checking processes could also be tailored to individual social media platforms, given the differences we observed both in terms of prevalence and the linguistic characteristics of the myth discussions. Qualitative exploration of the characteristics of the statements identified by the ML approach revealed linguistic and topical diversity. Some statements explicitly referenced the main concept we queried for: that MOUD represents replacing one drug with another. However, related statements were identified in which alternative treatments such as kratom entered into the discussion. Rationales for hesitancy toward MOUD also became apparent, including concerns about the addictiveness of MOUD, the nature of withdrawal symptoms from MOUD, and concerns about industry or governmental motivations for recommending MOUD. Understanding these concerns is directly relevant to providing health information, understanding the role of digital information ecosystems as a supplementary or adjuvant resource in substance misuse treatment, and addressing treatment hesitancy.
In addition to fact-checking efforts, public health engagement campaigns could also be used to address specific cases of misinformation. Recent research suggests that information campaigns led by trusted community members and health partners can help address health misinformation on social platforms [59]. Accordingly, alliances can be forged with social media influencers and key opinion leaders to run targeted health promotion campaigns. Interventions such as those with positive messaging can also be tailored to the preferences, perceptions, and cultures of different platforms. Educational interventions that improve literacy around OUD treatment and reduce the stigma that precludes seeking help, as well as ecologically sensitive interventions that open up avenues to access social support, could also empower individuals to be better equipped to deal with OUD treatment myths on the web. In short, although the literature on strategies to effectively counter health misinformation is still emerging, at minimum, this work highlights the importance of ongoing assessment and awareness of what health information is being prominently discussed on the web to guide both the provision of effective health care and public health prevention activities.
We note some limitations of this work. Although our analysis included large data sets from diverse web-based platforms, MOUD-related discussions happen on a wide variety of social platforms, and the prevalence of misinformation across a broader set of web-based environments needs characterization. For one platform, YouTube, limitations in the number of comments that can be accessed required restriction of the keyword list, which may have affected the prevalence of misinformation, although the estimate from YouTube was comparable with the other platforms. Furthermore, this research did not examine the nature of conversations surrounding the OUD treatment myth we focused on in this paper, such as whether a conversation might be reinforcing or countering the myth or discussing other previously known myths. Future work may unpack these characteristics of web-based discussions while also investigating additional myths about OUD misuse that surface on web-based platforms. Finally, geospatial-temporal studies on MOUD misinformation that originates and spreads via social media platforms can be a promising and significant direction for future research; they can influence interventions such as targeted location-based misinformation-countering campaigns as well as help clinicians respond to patients' false beliefs or misperceptions.