Developing an Automatic System for Classifying Chatter About Health Services on Twitter: Case Study for Medicaid

Background: The wide adoption of social media in daily life renders it a rich and effective resource for conducting near real-time assessments of consumers’ perceptions of health services. However, its use in these assessments can be challenging because of the vast amount of data and the diversity of content in social media chatter. Objective: This study aims to develop and evaluate an automatic system involving natural language processing and machine learning to automatically characterize user-posted Twitter data about health services using Medicaid, the single largest source of health coverage in the United States, as an example. Methods: We collected data from Twitter in two ways: via the public streaming application programming interface using Medicaid-related keywords (Corpus 1) and by using the website’s search option for tweets mentioning agency-specific handles (Corpus 2). We manually labeled a sample of tweets in 5 predetermined categories or other and artificially increased the number of training posts from specific low-frequency categories. Using the manually labeled data, we trained and evaluated several supervised learning algorithms, including support vector machine, random forest (RF), naïve Bayes, shallow neural network (NN), k-nearest neighbor, bidirectional long short-term memory, and bidirectional


Introduction
Consumers' perspectives and feedback are crucial for improving products or services. Over the last two decades, widespread adoption and use of the internet has led to its use as a major platform for collecting targeted consumer feedback. Businesses often allow consumers to rate specific products and services and provide detailed comments or reviews, and this has become a key feature of e-commerce platforms. For example, consumer-generated reviews and ratings of products play an important role in the differentiation on Amazon's e-commerce site, which currently has a global presence [1,2]. There are also companies, such as Yelp, that focus specifically on crowdsourcing consumer feedback [3][4][5][6]. Similarly, as social media has become the primary platform of communication for many people, many companies have started maintaining and communicating via social media accounts, often enabling direct communications, both private and public, with consumers. Not only do consumers provide comments or seek assistance through these social media accounts, they also often engage in discussions about products or services within their own social networks. Consequently, such consumer-generated chatter is often used to assess perceptions about specific topics, which may range from products or services to social programs, legislation, and politics.
Social media is a rich resource for obtaining perspectives on public health, as it enables the collection of large amounts of data directly and in real time. It is commonly used for sentiment analysis-a field of study that analyzes opinions, sentiments, attitudes, and emotions from written language. Sentiment analysis research involving social media data has covered a wide range of topics, events, individuals, issues, services, products, and organizations [7,8]. However, the use of social media has not been limited to sentiment analysis in open domains. In recent years, research within the broader medical domain has embraced social media, and it is currently being used for conducting real-time public health surveillance, including for topics such as influenza surveillance, pharmacovigilance, and toxicovigilance [9][10][11]. Meanwhile, similar to corporate businesses in the United States, health service providers such as local health departments and hospitals have also started adopting social media specifically as a consumer-facing communication channel [12,13]. Prior studies in this area have investigated how social media data linked to such health services accounts reflect the consumers' perspectives about them. The simplest studies have focused on using structured or numeric information, such as likes or ratings, associated with the accounts belonging to hospitals or nursing homes, and these metrics have been compared with traditional quality reports and ratings [14][15][16]. Building on the advances in open-domain natural language processing (NLP), some studies within the broader health domain have attempted to use unstructured data, including postings related to patients' experiences about hospitals, to infer consumer sentiments [17,18] or extract topics that summarize content [19].
Extracting knowledge from social media data is notoriously difficult for NLP methods because of factors such as the presence of misspellings, colloquial expressions, lack of context, and noise. These problems are exacerbated for health-related data because of the complexities of domain-specific terminologies, the lack of expert knowledge among common social media users, and the uniqueness of health-related topics. Consequently, there is considerably less research using the free-text data on social media for health-related tasks. Past studies closely related to ours have focused on analyzing sentiments toward attributes of health insurance plans [20] and social media users' responses to public announcements about health policies [21]. However, to the best of our knowledge, there has been no near real-time automatic system that provides comprehensive data collection and analysis on social media chatter about health services and insurance coverage provided by large public insurers such as Medicaid and Medicare.
Nevertheless, such a system is essential for analyzing the public's perspective toward public insurers and their governance and policies from social media. For example, the customers using Medicaid might provide their feedback based on their experiences or even engage in discussion of their experiences, which they might not have a chance or even willing to reveal to the Medicaid providers' customer service representatives. Analyzing such chatter could provide researchers and policy makers with information complementary to traditional customer feedback channels and possibly help improve the services and related policies. However, chatter associated with an entity such as Medicaid contains discussions about politics and legislation; academic research, statistics, and factual information; consumer feedback; and so on. Chatter related to politics can be very different in terms of content, compared with chatter related to consumer feedback. Thus, properly categorizing these tweets based on content is crucial for providing accurate information. Furthermore, sentiment may also have different meanings for these 2 broad categories of chatter-negative sentiment in political chatter may represent a user's emotions associated with a political decision about the health service (eg, changes in policies related to insurance coverage or covered benefits within Medicare or Medicaid) rather than the service itself.
Therefore, there is a need to identify and categorize the content of the chatter before it can be used for targeted analyses. A good categorization scheme can not only help in bringing forth good analysis but can also help to avoid contaminating the chatter with irrelevant content. To achieve this and effectively use social media big data, automatic classification and analysis systems based on machine learning methods are required. This, together with the promise of social media data and the lack of past research in this specialized area, served as the primary motivation for this study. We chose Medicaid as our target health service because it is the single largest public insurance program in the country [22] and contains large volumes of related chatter on social media.
The specific objectives of this paper are as follows: • To assess if a social media platform, specifically Twitter, contains sufficient volumes of chatter about health services so that it can be used to conduct large-scale analyses, using Medicaid as our target service • To develop and discuss a data-centric system involving NLP and machine learning to automatically collect, categorize, and analyze Twitter chatter associated with Medicaid, as shown in Figure 1 • To describe the manual annotation of a Twitter-Medicaid data set and its composition • To describe supervised classification strategies for automatically classifying Medicaid-related tweets into broad categories and evaluating the performances of several machine learning models, with particular emphasis on tweets that potentially represent consumer feedback • To conduct postclassification content analyses to verify the potential utility of our data-centric system Figure 1. Workflow of the natural language processing system for automatic data collection, classification, and content analysis of the Medicaid chatter on Twitter. API: application programming interface; TFIDF: term frequency-inverse document frequency.
The main contributions of this paper are as follows: • We present the methods and results of collecting Medicaid-related Twitter data, analyzing a sample of the data manually, and developing an annotation guideline suitable for preparing a large data set for training classification algorithms. • We present details of automatic supervised classification experiments, including methods, results, and evaluations, and provide suggestions on how to further improve the performance.

•
We discuss the postclassification analyses of the collected data, including data distribution and content analyses. • We make the NLP and machine learning scripts in this study publicly available, along with the labeled training data set and a larger set of unlabeled Medicaid-related data.

Data Collection
To develop our models for analyses of Twitter data related to Medicaid, we collected 2 sets of publicly available data from the network, which we labeled Corpus 1 and Corpus 2. Corpus 1 contains tweets mentioning the term "Medicaid," or Medicaid agency (MA) and managed care organization (MCO, an organization that provides Medicaid-related health services under contracts from the agency) names that are branded and thus easily distinguishable on Twitter (eg, Medi-cal: California's Medicaid program, and TennCare, Tennessee's Medicaid program). These tweets were collected via Twitter's public streaming application programming interface from May 1, 2018, to October 31, 2019, and were limited to only English tweets. It has been reported that misspellings appear frequently on social media platforms [23], particularly Twitter; hence, we used an automatic spelling variant generator to generate common misspellings for "medicaid" and used them to capture tweets referring "medicaid" as one of the misspellings [24]. This can increase the retrieval rate and increase the volume of streaming data. The full list of keywords, including misspellings, is shown in Table S1 in Multimedia Appendix 1. We then identified and removed tweets whose contents were not directly related to Medicaid and tweets with repeated or duplicated contents (eg, fundraising or political campaigns). To focus on tweets expressing personal opinions, we removed retweets, which were deemed as duplicates of the original tweets. The final data set consisted of 628,411 tweets for Corpus 1.
Although most of the chatter regarding Medicaid posted by consumers only included the term "medicaid" (or its variants), some directly tagged or mentioned relevant Twitter handles associated with MAs or the MCOs (eg, "@organization_name"). Corpus 2 is composed of such tweets, and the MA and MCO Twitter handles were identified in a previous study [25]. The full list of the handles used in data collection is presented in Table S2 in Multimedia Appendix 1. These tweets were retrieved by targeted searching (eg, "to:organization_name") on Twitter. These tweets were posted between December 12, 2008, and the time of search (January 9, 2020). We filtered the tweets using the same approaches used for Corpus 1. Overall, there were 27,337 tweets in the corpus. Additional notes regarding our data are provided in Multimedia Appendix 1.

Tweet Contents and Manual Annotations
To better understand the contents of the tweets posted by users and to develop methods to automatically characterize the posts, we first performed manual inspections of the contents of the posts and identified commonly occurring themes. We used the grounded theory approach to conduct a thorough analysis [26]. We analyzed a random sample of tweets to identify recurring topics and then grouped the topics into broader categories and themes. The analysis was conducted by multiple authors of this paper, and the topics discovered initially were discussed. The discovered topics were either merged into broader themes (eg, combining information and outreach), discarded from our consideration (eg, for topics that were observed rarely or only once), or split into multiple themes (eg, splitting of information tweets into academic, information/outreach, and news). Following iterative discussions and finalization by the domain expert authors of this paper (JMZ and DG), we classified tweet contents into 5 broad categories: (1) academic, (2) consumer feedback, (3) information/outreach, (4) news, and (5) political opinion/advocacy. Tweets that could not be categorized as any of these were labeled as other. The descriptions of these classes are as follows: • Academic (academic)-tweets related to research on Medicaid. These included tweets by persons or organizations with academic affiliations or think tanks that expressed the perspective from the affiliated organizations or any tweet relating to education, scholarship, and thought, including (links to) journal publications and reports.
• Consumer feedback (consumer): These included tweets related to consumers' experiences or questions related to Medicaid services, coverage, benefits, or health issues. The tweets were typically from Medicaid consumers or family members of consumers and also included discussions with others.
• Information/outreach (information): These included tweets directed at consumers and beneficiaries of Medicaid to convey information including agency services, programs, events, enrollment, eligibility criteria, etc. Tweets containing information about general health or public health reminders were also included.
• News (news): These included news and announcements, including any tweets from a news agency or organization.
Tweets that explicitly expressed political opinions and tweets from Medicaid agencies or plans were excluded.
• Political opinion/advocacy (political): These included comments, personal opinions, and feedback about politics related to Medicaid.
• Other (other): These included tweets that were not relevant, typically the noise that is not captured by the initial screening.
Following the establishment of the desired categories and the development of annotation guidelines by JMZ, 2 trained annotators performed the first round of annotations (for the data in Corpus 1) in multiple iterations, developed annotation guidelines, and resolved ambiguities via discussions. Following the completion of this round of annotations, the annotation disagreements were resolved by AS and WH. We found the class distribution to be very imbalanced, with most of the tweets annotated as news, political, and other, whereas only a small portion were in academic, consumer, and information categories ( Table 1). Examples of each category are provided in Table S3 in Multimedia Appendix 1. To understand how this imbalanced distribution affected the classifier performances on the smaller classes, particularly the consumer class, we performed preliminary automatic classification experiments using 3 classifiers: naïve Bayes (NB), support vector machine (SVM), and random forest (RF). We split the data into training (5795/7244, 80%) and validation (1449/7244, 20%) sets and found the best performance on consumer feedback to be low for all the classifiers, with the best F 1 score=0.3 (SVM). Tweets belonging to the consumer feedback class were of particular importance to our overarching project objectives, so we devised 2 strategies for improving performance for this class: the first involved additional annotations of targeted tweets from the same data set and the second focused on collecting an additional data set (Corpus 2, as described earlier). Table 1. Distribution (counts and percentages) of annotated data in the first round of annotations (rows 2 and 3) and the final data sets (Corpus 1 for rows 4 and 6; Corpus 2 for rows 5 and 7).
Total, n (%) Other, n (%) Political, n (%) News, n (%) Information, n (%) Consumer, n (%) Academic, n (%) Data set 5795 (100)  For the first strategy, we conducted another round of annotation of tweets from Corpus 1 to increase the number of tweets for the consumer class. Owing to the very low number of consumer class tweets in the original data set, we realized that it would not be feasible to annotate sufficient numbers of these tweets by drawing random samples because of budgetary and other constraints. Therefore, rather than randomly drawing tweets for the next round of annotations, which would again lead to finding a small number of tweets belonging to the consumer feedback category, we attempted to artificially increase the number of tweets for this category. We achieved this by running our above-described weak classifier on a larger set of unlabeled tweets and only picking tweets classified as consumer feedback by the SVM classifier. This significantly increased the number of consumer feedback tweets in the data to be annotated. The new set of annotated data was then added to the training set, and the data distribution is presented in Table 1.
We followed the same annotation strategy for Corpus 2 (ie, annotating tweets classified as consumer feedback by the classifier trained on the previously annotated data), but this time, we also annotated equal amount of nonconsumer tweets. This is because Corpus 2 is rich in consumer feedback tweets, and we would also like to include tweets in other categories to improve performance. An outline of the overall annotation process is presented in Figure 2. Although we tried to decrease the class imbalance in the training sets of the 2 corpora, to ensure that our evaluations represented the classifier performances on real-world distributions of the data, we did not artificially balance the validation set. We also annotated the test set randomly generated from the 2 corpora, 1000 tweets each, so they would reflect the data composition of the original corpora, allowing us to evaluate how the classifier would perform when deployed for streaming data.

Classification
We experimented with 5 traditional classification algorithms, including Gaussian NB [27,28], SVM [29,30], RF [31], k-nearest neighbor (KNN) [28], and shallow neural network (NN), and 2 advanced classification algorithms, bidirectional long short-term memory (BLSTM) [32,33] and bidirectional encoder representations from transformers (BERT) [34,35]. Although the origin and distributions of tweets in the 2 corpora were different, we decided to combine them as our previous research suggests that multicorpus training, or distant supervision, leads to performance improvements for social media text classification [36]. The feature extraction and classification training for traditional classifiers was done using the "Scikit-learn" package in Python [37], the BLSTM classification was implemented using package "Keras" in Python [38], and the BERT classification approach was implemented using package "simpletransformers," which is based on the package "transformers" [39]. The performance on the validation set and the test set from Corpus 1 and Corpus 2 are shown in Table 2. The tweets were preprocessed by lowercasing and anonymizing the URLs and user names. For the traditional classifiers, the non-English characters were further removed (keeping underline), and each word was stemmed by the Porter stemmer. The features were the unnormalized counts of the 3000 most frequent n-grams (contiguous sequences of words with n ranging from 1 to 3, with 1380 unigrams, 1296 bigrams, and 324 trigrams). We also introduced a "word cluster" feature, which are clusters or generalized representations of semantically similar words or phrases learned from Twitter chatter [10,40].
The word clusters were represented as bag-of-word vectors, and the feature space consisted of 972 word clusters. We used the Twitter word clusters, "50mpaths2," provided by Owoputi et al [41]. For the advanced classifiers, each word or character sequence was replaced with a dense vector, and the vectors were then fed into the relevant algorithms for training.
We performed hyperparameter tuning using the validation set to improve the classification task on the imbalanced data set. Specifically, we focused on improving the F 1 score for consumer feedback. For traditional classifiers, we optimized the number of nearest neighbors for KNN, the number of estimators (trees) for RF, and the c parameter and weights for SVM. We also experimented with oversampling using the synthetic minority oversampling technique, but the performance was not improved (provided in Table S4 in Multimedia Appendix 1). The optimal hyperparameters for the traditional classifiers are listed in Table  S5 in Multimedia Appendix 1. We used the Twitter GloVe word embeddings for the BLSTM [42] classifier, where each word was converted to a 200-dimensional vector. BLSTM was then trained with 40 epochs and dropout regularization, and the best model was selected based on the accuracy of the validation data. We chose RoBERTa-large for the BERT algorithms [35], trained with 3 epochs. The technical details are provided in Table S5 in Multimedia Appendix 1.

Postclassification Analyses
To assess the utility of our classification approaches and gain an understanding of the data, we used the best-performing classifier (the classifier based on BERT) to label all collected unlabeled data and compute the data distribution. We then performed content analysis using the term frequency-inverse document frequency (TFIDF) method [43], focusing on the tweets in Corpus 1 that contained the term "medicaid" and its misspellings, and using the latent Dirichlet allocation (LDA) for topic modeling [44], focusing on consumer feedback tweets. Our first intent was to qualitatively assess whether the classifier was capable of distinguishing tweets based on contents that were manually verifiable. Second, we wanted to obtain a basic understanding of the content of each category by identifying the top rated TFIDF words. The TFIDF method adjusts the term frequencies with inverse document frequency so that the high-frequency terms unique in one category would rank higher than the high-frequency words that are common across the categories. This helps identify important terms unique to the target category. Our third objective was to summarize consumer feedback chatter using LDA topic modeling, going beyond the TFIDF method. For all content analyses, the text was first preprocessed by lowercasing and removing URLs, user names, non-English characters (keeping underline and hyphen), stopwords, and any word with less than 4 characters. For LDA topic modeling, we experimented with different hyperparameters (number of topics=5, 10, 20, 50, and 100) and selected the model with the highest coherence score.

Annotation and Class Distributions in Test Sets
We annotated a total of 9179 tweets from Corpus 1 and 2200 tweets from Corpus 2. We obtained substantial interannotator agreement (Cohen κ=0.734) [45,46] over 892 double-annotated tweets. The test data sets were randomly selected from the corpora, and therefore, they can be considered a sample of the collected data. For Corpus 1, the test data contained 1000 tweets, among which political discussion was the dominant class (603/1000, 60.30%), followed by news (199/1000, 19.90%), whereas consumer feedback comprised 4.60% (46/1000) of the tweets. In contrast, consumer feedback comprised 57.90% (579/1000) of the tweets in Corpus 2, and 30.80% (308/1000) of the tweets could not be categorized, most of which were part of conversations and could not be understood without full context.

Classification Results
The F 1 scores for each class and the accuracies of the classifiers on the validation set and the test sets are presented in Table 2, including CIs estimated using bootstrapping, whereas the precisions and recalls are given in Table S6 of Multimedia Appendix 1. For the validation and test sets from Corpus 1, the classifiers showed high performance for political discussion, but relatively low performance for consumer feedback. This was expected based on the large imbalance described earlier.
Among all the traditional classifiers tested, SVM performed the best on the validation set, with an F 1 score of 0.53 on the consumer feedback. However, the F 1 score on the consumer feedback on the test set from Corpus 1 was only 0.38. In contrast, we found that the BERT classifier had the highest F 1 scores on consumer feedback for both the validation set (0.61) and the test set from Corpus 1 (0.58).
For the test set from Corpus 2, most of the classifiers performed well on the consumer feedback. Among the traditional classifiers, RF performed the best, with an F 1 score of 0.82 on consumer feedback. On the other hand, BERT still performed the best, with a consumer feedback F 1 score of 0.90.
As the BERT classifier performed the best in terms of accuracy and the consumer feedback F 1 score on the validation set and the 2 test sets, we used the BERT classification for postclassification analysis.

Error Analysis
We conducted a brief analysis of the errors made by the BERT-based classifier. We first calculated the confusion matrix for both test sets (Table 3). In Table 4, we provide examples of the most frequent classification errors, omitting unnecessary details. For Corpus 1, we highlighted that the classifier frequently misclassified political tweets as news or consumer feedback, and vice versa. This is not surprising because users sometimes commented on and discussed politics with personal experience, and some news content was related to opinions about the policy. We also highlighted that the uncategorized tweets, whose content is often not directly related to Medicaid or lack of information, are frequently misclassified as consumer feedback or political. The confusion between consumer feedback and political or uncategorized tweets, along with the low volume of consumer feedback, contributes to the low performance of consumer feedback. We also observed that some news tweets were confused with the information tweets because information is frequently spreading as news or blog articles. Uncategorized because of lack of related content but is similar to political discussion Other (political) <USERNAME> I hope someone will ask him "What's the difference between Medicaid and Medicare?"

Test set (Corpus 2)
Most likely comment on customer service but hard to pick up by algorithm Consumer (other) <organization_name> poorly worded Classified as others because of lack of information, but the algorithm might recognize that it could be a conversation between a customer and a customer representative Other (consumer) <organization_name> My pleasure! For Corpus 2, the dominant classes were consumer feedback and uncategorized tweets, and they were most frequently misclassified as each other. We suspect they were misclassified because tweets sometimes lacked context, making their meanings ambiguous and difficult for the machine to understand. For example, the tweet "<organization_name> poorly worded," though ambiguous, might be understood as that some document for the customer or the customer service representative's expression was poorly worded, and thus, we categorized it as consumer feedback. However, the machine learning algorithms were not capable of deciphering such implicit contexts: "poorly worded" is usually associated with a feedback and, in tweets directed to the agency's handle, it is likely related to customer service. Similarly, the tweet "<organization_name> My pleasure!," may belong to a conversation between a customer and a representative, but the lack of information renders it to the other class. However, machine learning could not capture this understanding.

Postclassification Analyses: Data Distribution
We applied the best-performing classifier (BERT) to label both corpora. The obtained class distributions are shown in Figure  3. We found that the majority of tweets in Corpus 1 were news (142047/628411, 22.60%) and political discussion (400778/628411, 63.78%), whereas consumer feedback accounted for only 4.55% (28604/628411), consistent with the data distribution of the test set of Corpus 1. The data distribution indicates that this corpus is suitable for analyzing chatter regarding political discussion or news. For Corpus 2, the majority of the tweets were labeled as consumer feedback (15073/27337, 55.14%) and uncategorized (8590/27337, 31.42%), which is also consistent with the data distribution in the test set.

Postclassification Analyses: Content of Each Class in Corpus 1
We now briefly summarize the findings from content analyses of the tweets in Corpus 1 that contain the terms associated with "medicaid" to understand, from a high-level perspective, the contents within each category. The 10 highest ranking bigrams and trigrams detected by the TFIDF method are listed in Table  S7 of Multimedia Appendix 1 [43]. Not surprisingly, the academic tweets are dominated by terms starting with "study..." and terms indicating research finding. Similarly, the information tweets contain terms related to "service," "care,"... etc, consistent with information outreach. For the news tweets, we found that many tweets were about news on medicaid work requirements in Kentucky and Arkansas (blocked by the federal judge on March 27, 2019). In addition, "social security" and "Trump..." are also highly ranked among the news and political classes. For the tweets belonging to the consumer feedback, some of the high-ranking terms were shared with other classes (eg, "... insurance," "social security," or "... care"), whereas some were specific to this class ("make much" or "doesn cover") and potentially indicated comments about Medicaid income cap and coverage.
We did not know the compositions of the 2 data sets that we had collected a priori. Thus, the results of our classification experiments provided us with very important knowledge about which type of Twitter data to use when conducting targeted studies about Medicaid or health services in general. For example, when studying consumer feedback, it is best to use data from Corpus 2 (ie, tweets containing Twitter handles of the MA or MCO); for studying public perceptions of political decisions, Corpus 1 would be more useful. Detailed content analyses of the tweets in each category, such as their temporal and geolocation-specific distributions, are likely to reveal more relevant information. However, such analyses are outside the scope of this study, and we plan to build on the NLP system described in this paper to conduct more thorough content analyses in the future.

Postclassification Analyses: LDA Topic Modeling on Consumer Feedback
We found that the model with 20 topics achieved the highest coherence score. The top 20 words in each topic are listed in Multimedia Appendix 1, Table S8. We now summarize the main findings based on these top words, with example top words provided in parentheses. We deduced that this chatter contains discussion related to (1) applying for Medicaid, either for oneself or even family members (eg, deny, apply, family, and child); (2) Medicaid coverage for dentists, specialists, prescription medications, emergency department visits (eg, cover, dentist, therapist, prescription, medication, and emergency); (3) interacting with customer representatives, especially through phone (eg, call, phone, hour, hold, tell, and wait); (4) hospital-related bills (eg, hospital, bill, and copay); and (5) comparing different insurance plans (switch, insurance, private, and plan). The list of topics can be a guide to further categorize the consumer feedback chatter, which could lead to a more detailed analysis and even provide recommendations on how to further improve the Medicaid program. A more in-depth analysis is left for future work.

Principal Findings
As many classification errors occur because the tweets lie in the boundary between 2 classes, we note that a multilabel classification scheme might improve the performance [47]. However, in the experiments conducted earlier in this project, we found that the multilabel scheme only improved the classification performance by a small margin, while making the annotation process more difficult. Thus, we focused on the single-label classification scheme in this work, leaving the development of multilabel models to future work.
In addition to multilabel classification models, the classification error might also be remedied by creating new categories for tweets lying at the boundary of current categories. For example, we can further divide the political discussion into 2 categories: discussion of policy without personal experiences or experiences from friends or relatives and discussion of policy with experiences as supporting evidence. The classification performance may be further improved by including more user profile information. For example, we can include features such as whether the account belongs to a news agency or if the user is affiliated with an academic organization or think tanks, which could improve the classification performance on the news class or the academic class. As the 2 corpora have very different distributions, developing a corpus-specific classifier might further improve the performance.
Although our content analysis is limited to the high-ranking TFIDF terms and LDA topic modeling on consumer feedback, additional analyses could include topic modeling of other chatter [44] or sentiment analysis to understand sentiments toward Medicaid in general or to specific aspects of the Medicaid program [8]. The manual analysis of selected samples can deepen the understanding of these topics and potentially generate recommendations toward policy change. We also note that content analysis can not only help researchers further understand the Medicaid chatter but it can also improve the classification performance in reverse.

Limitations
This analysis has limitations related to the quality of Twitter data, which contain high volumes of noise that may affect the accuracy and generalizability of our content analyses and annotation guidelines. In addition, Twitter users may not be representative of Medicaid enrollees. Older age groups tend to be underrepresented among Twitter users [48], and more vulnerable populations who rely on Medicaid may not use this platform to discuss their health coverage.

Conclusions
We have developed a social media mining system, involving NLP and machine learning, for continuously collecting and categorizing Twitter chatter about the Medicaid program. Our study demonstrates that it is possible to collect data about large, complex health services and coverage programs such as Medicaid using Twitter to obtain near real-time knowledge about consumer perceptions and opinions. The automatic classification of streaming data is crucial, specifically for smaller classes, such as consumer feedback, for studying targeted topics.
Our analysis can inform public health researchers on how to use public discussions about health programs and services, such as Medicaid. Similarly, our system can be deployed by research groups or Medicaid agencies for continuous, ongoing research on the evolution of public opinions on social media (eg, the impact of certain policy changes or rulings). We also note that although this work focuses on Medicaid, our methods and open source code can readily be applied to other health services. Annotated data and methods are available for future studies [49].