This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
The electronic cigarette (e-cigarette) is an emerging product whose market has grown rapidly in recent years. Social media has become an important platform for information seeking and sharing. We aim to mine hidden topics from e-cigarette datasets collected from different social media platforms.
This paper aims to gain a systematic understanding of the characteristics of various types of social media, which will provide deep insights into how consumers and policy makers effectively use social media to track e-cigarette-related content and adjust their decisions and policies.
We collected data from Reddit (27,638 e-cigarette flavor-related posts from January 1, 2011, to June 30, 2015), JuiceDB (14,433 e-juice reviews from June 26, 2013, to November 12, 2015), and Twitter (13,356 “e-cig ban”-related tweets from January 1, 2010, to June 30, 2015). Latent Dirichlet Allocation, a generative model for topic modeling, was used to analyze the topics from these data.
We found four types of topics across the platforms: (1) promotions, (2) flavor discussions, (3) experience sharing, and (4) regulation debates. Promotions included sales from vendors to users, as well as trades among users. A total of 10.72% (2,962/27,638) of the posts from Reddit were related to trading. Promotion links were found between social media platforms. Most of the links (87.30%) in JuiceDB were related to Reddit posts. JuiceDB and Reddit identified consistent flavor categories. E-cigarette vaping methods and features such as steeping, throat hit, and vapor production were broadly discussed both on Reddit and on JuiceDB. Reddit provided space for policy discussions, and the majority of the posts (60.7%) held a negative attitude toward regulations, whereas Twitter was used to launch campaigns using certain hashtags. Our findings are based on data across different platforms. The topic distribution between Reddit and JuiceDB was significantly different (
This study examined Reddit, JuiceDB, and Twitter as social media data sources for e-cigarette research. These mined findings could be further used by other researchers and policy makers. By utilizing the automatic topic-modeling method, the proposed unified feedback model could be a useful tool for policy makers to comprehensively consider how to collect valuable feedback from social media.
Electronic cigarettes (e-cigarettes) have become increasingly popular in recent years. As a new type of nicotine delivery system, e-cigarettes, as defined by the US Food and Drug Administration (FDA), are battery-operated products designed to deliver nicotine, flavor, and other chemicals in aerosol form [
Many e-cigarette studies have used the survey method to collect information on the pattern of usage [
The rapid growth of online communities and social media provides a new approach in collecting evidence for policy-making processes. Large social media platforms, including Facebook, Twitter, YouTube, and Reddit, enable new channels for e-cigarette users to share information and experiences. These platforms have provided efficient methods of information access for health surveillance and social intelligence [
More insights were generated from studies based on specific social media platforms. For example, one study found that the vast majority of e-cigarette information on YouTube promoted their use and depicted it as socially acceptable [
E-cigarettes are also discussed on forums. Reddit, one of the most comprehensive forums on the Internet, was used as a source to identify vulnerable populations [
Moreover, we have noticed that different social media platforms have different characteristics, both for posts and users. For instance, Reddit is essentially an online bulletin system that includes all kinds of discussions [
In a previous study, we collected data from Reddit [
Data from JuiceDB were collected by using its public application program interface (API). We collected 14,433 JuiceDB e-liquid reviews from June 26, 2013, to November 12, 2015. The dataset comprised reviews of e-liquids, including an overall rating, subratings of e-liquid components, and detailed comments.
We also collected some data from Twitter. We created crawling agents and simulated human behavior in the searching page of Twitter to retrieve historical data from January 1, 2010 to June 30, 2015. We used the keywords “e cigarettes,” “electronic cigarettes,” “ecigarettes,” “ecigs,” “smoking electronic cigarettes,” “smoking ecigarettes,” and “smoking ecigs” in the searches and collected 353,984 tweets. Compared with Reddit, Twitter is good at information transmission, which makes it an important platform for advertising and social media campaigns. Results from the Reddit dataset showed that the e-cigarette ban debate was an interesting discussion topic. “E-cig ban” and “e-cigarette ban” were general keywords describing the topic. Thus, we used these keywords to collect data and analyze the detailed discussion topic on Twitter. Some tweets were not written in English. They were collected because they used English hashtags that contained the keywords. In order to analyze English tweets only, we filtered out other tweets by using a stop words list to detect the most probable language the tweet was written in. Finally, we collected 13,356 tweets that were valid for analysis.
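The stop-word-based language filter described above can be sketched as follows. This is a minimal illustration, not the filter used in the study: the tiny stop-word lists, the function names `guess_language` and `keep_english`, and the rule of picking the language with the most stop-word hits are all assumptions made for the example.

```python
import re

# Tiny illustrative stop-word lists (assumption); real lists would be much longer.
STOP_WORDS = {
    "english": {"the", "and", "is", "to", "of", "in", "that", "it", "for", "on"},
    "spanish": {"el", "la", "de", "que", "y", "en", "los", "se", "por", "un"},
    "french": {"le", "la", "de", "et", "les", "des", "en", "un", "une", "est"},
}


def guess_language(text):
    """Return the language whose stop words occur most often, or None if none occur."""
    tokens = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOP_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def keep_english(tweets):
    """Keep only the tweets whose most probable language is English."""
    return [tweet for tweet in tweets if guess_language(tweet) == "english"]
```

Very short tweets may contain no stop words at all; in this sketch they are dropped rather than guessed.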
We used natural language processing (NLP) and Latent Dirichlet Allocation (LDA), which are information science techniques, to analyze the data. “Natural language” means the language used by humans, whereas processing means using computers to understand natural language input [
LDA is a generative model for unsupervised topic modeling that automatically discovers hidden topics from a set of documents, such as posts, reviews, or tweets in this study, each of which contains a bag of words [
Practically, it was challenging to determine the number of topics in the LDA method. We used the hierarchical Dirichlet process (HDP-LDA) to evaluate our decision, which was also supported by the Python gensim package [
The output of LDA in this study was a set of topics and the main words associated with each topic. For example, 13,356 tweets were treated as the input after preprocessing by the NLP tools. After LDA processing, five topics with associated words were summarized from these tweets. Consider each of the topics as a group. Every post belonged to one of the groups based on the words it contained.
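The grouping step can be illustrated with a short sketch. It assumes that each post comes with a list of (topic id, weight) pairs, similar in shape to the per-document output of a fitted LDA model, and that a post is assigned to the topic with the largest weight. The assignment rule and the names `dominant_topic` and `group_by_topic` are assumptions for illustration; the text does not specify the exact rule.

```python
def dominant_topic(topic_weights):
    """Return the topic id with the largest weight for one document."""
    return max(topic_weights, key=lambda pair: pair[1])[0]


def group_by_topic(doc_topic_weights):
    """Map each topic id to the list of document ids assigned to it."""
    groups = {}
    for doc_id, weights in doc_topic_weights.items():
        groups.setdefault(dominant_topic(weights), []).append(doc_id)
    return groups


# Hypothetical topic weights for three tweets.
example = {
    "tweet_1": [(0, 0.70), (3, 0.30)],
    "tweet_2": [(0, 0.10), (3, 0.90)],
    "tweet_3": [(0, 0.55), (1, 0.45)],
}
print(group_by_topic(example))
```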
We performed LDA on the three datasets. The number of topics for each dataset was set to five. For a specific topic, the top 20 associated keywords are listed in
Top five topics and keywords for posts from Reddit, JuiceDB, and Twitter.
Platform and topic | Keywords (a) | |
Reddit | | |
1. Individual trades and vendor promotions | Liquid, size, mini, sold, brand, shipping, free, cream, retail, price, sample, purchase, list, prices, items, high, left, love, prefer, natural | |
2. Flavor-related experiences and sentiments | Juice, flavor, good, flavors, vape, taste, great, juices, well, sweet, liquid, tastes, menthol, love, tank, nice, pretty, coffee, hit, find | |
3. E-liquid components | Strawberry, flavor, VG, juice, vanilla, cream, custard, thanks, vapor, banana, PG, flavors, TFA, apple, mL, milk, 12 mg, bottles, menthol, 30 mL | |
4. Relationship with traditional tobacco products | Tobacco, nicotine, vaping, smoking, cigarette, people, smoke, ecig, quit, products, health, product, year, electronic, know, companies, pack, stop, addiction, quit | |
5. Personal experiences and questions | Time, know, well, feel, best, love, long, pretty, thought, start, find, want, favorite, give, question, experience, idea, hear, start, thanks | |
JuiceDB | | |
1. Throat hit and vapor production | Throat hit, VG, vape, coil, tank, cloud, use, RDA, PG, vapor, max VG, liquid, dripper, high, drip, vapor production, price, higher, 50/50, 6 mg | |
2. Fruit and cream flavors | Sweet, like, strawberry, exhale, flavor, nice, get, really, fruit, fruity, vape, cream, inhale, taste, candy, good, tart, well, menthol, little | |
3. Cream, tobacco, and seasonings flavors | Sweet, like, creamy, rich, exhale, custard, cinnamon, get, tobacco, nice, vanilla, inhale, good, banana, cream, really, caramel, vape, smooth, hint | |
4. Product promotion and recommendation | Try, vape, bottle, great, juice, order, favorite, recommend, best, flavor, day, love, time, first, adv, go, would, price, amaze, definite | |
5. Vaping experiences | Like, steep, try, taste, really, get, good, vape, would, bottle, don’t, much, first, got, smell, think, bit, better, still, even | |
Twitter | | |
1. Euecigban | Euecigban, eu, save, tobacco, stop, smoke, live, vaper, help, swof, try, want, people, million, smoker, please, go, via, need, product | |
2. New York and noecigban | Vape, smoke, Twitter, come, pic, health, public, nyc, euecigban, cig, ad, noecigban, like, via, citi, call, propose, look, tobacco, news | |
3. General discussion of e-cigarette ban | Vape, smoke, vote, blog, post, huge, electroniccigarette, consequence, citi, include, council, new, school, report, fda, house, county, harm, propose, cig | |
4. Petition | Sign, vape, health, flavor, RT, want, tobacco, petitition, euecigban, say, please, support, sale, regulate, us, minor, use, propose, govern, plane | |
5. Noecigban and freevape | Vape, public, noecigban, vaping, sale, smoke, place, bill, minor, freevape, new, indoor, use, would, cig, call, consider, New York, lawmaker, wale |
(a) PG: propylene glycol; RDA: rebuildable dripping atomizer; RT: retweet; TFA: The Flavor Apprentice; VG: vegetable glycerin.
The first topic was about purchasing e-cigarette products. It contained vendor promotions and advertisements, but also individual trading information. The keywords included product descriptions and prices. Topic 2 was flavor-related experiences and sentiments. People discussed their vaping experience with specific flavors and expressed their sentiment or evaluation. Topic 3 was the discussion of e-liquid components. It is known that e-liquid consists of vegetable glycerin (VG), propylene glycol (PG), nicotine, and flavors [
The outcome of LDA on JuiceDB reviews was quite different. JuiceDB is a specific platform only for e-liquid reviews and the LDA results supported this. The top five topics were narrower and more focused on e-liquids (
Topic 1 referred to throat hit and vapor production, which were two major features of the e-cigarette vaping experience. Topics 2 and 3 were discussions of specific flavors. From the previous study, we knew that fruit and cream flavors were the most popular, which was supported by the result that these two flavors made up one topic and other flavors were a separate topic [
The LDA performance on the Twitter data was even more specific because we focused on the tweets related to e-cigarette bans. Almost all tweets had a URL link that brought noise to the LDA analysis. Thus, we built the LDA model after removing URL links.
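A minimal sketch of the URL-removal step is given below. The regular expression and the name `strip_urls` are assumptions for illustration; the exact rule used in the study is not stated.

```python
import re

# Matches http(s) links and bare "www." links up to the next whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def strip_urls(text):
    """Remove URL-like tokens and collapse the leftover whitespace."""
    without_urls = URL_PATTERN.sub(" ", text)
    return " ".join(without_urls.split())
```

Links without a scheme or a "www." prefix (for example, shortened picture links) are not matched by this pattern and would need an additional rule.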
Twitter is famous for its hashtag system. The hashtag is a word coming after a hash (#) sign. It is used as a label to tag the tweet to a specific group so that users can easily find and share information in a specific community. Some of the keywords (
The preceding results described different topics for different social media platforms. Generally speaking, Reddit is a comprehensive forum, so its topics are broader than those of JuiceDB, which is a specific platform for e-liquid reviews. The data from Twitter showed that this platform was used for campaigns. We summarize the topics in these three platforms and present our insights for policy makers. In total, there were four types of topics: promotions, flavor discussions, experience sharing, and regulation debates.
Promotion as a topic included trading among e-cigarette users and sales from vendors to users. For instance, on Reddit, one example of a vendor promotion to users was:
Wednesday Purple Drank, Banana Berry Milkshake, AND Hot Cider Donut Giveaway! Coupon code inside for 15% off ALL liquids! | Vapor Trails NW.
JuiceDB had promotions as well. However, the vendor promotions on JuiceDB were written in the format of user reviews because JuiceDB did not accept advertisements. For example:
Mountain Dew-inspired flavor. I have been using this juice for a few days now and it’s actually really good! Tastes pretty close to the real Mountain Dew flavor. It’s not exactly the same flavor as the drink but it is VERY close. I recommend it!
Trading among users was another important type of e-cigarette promotion. These posts were easy to identify on Reddit because their titles usually started with want to trade (WTT), want to sell (WTS), or want to buy (WTB). For example:
WTT/WTS: Avid and MBV Juice, Also a Kanger Aerotank + full 5 pack of coils.
Among all the posts, 1636 posts had WTS in their title, 895 posts were labeled as WTT, and 431 posts were WTB posts.
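The title-based identification of trading posts can be sketched as follows. The pattern and the name `count_trade_tags` are assumptions for illustration. In particular, a title such as "WTT/WTS: ..." is counted once under each tag here, which may differ from how the reported counts were produced. The last lines only re-derive the trading share from the counts reported above.

```python
import re
from collections import Counter

TAG_PATTERN = re.compile(r"\b(WTT|WTS|WTB)\b", re.IGNORECASE)


def count_trade_tags(titles):
    """Count how many titles carry each trading tag (each tag at most once per title)."""
    counts = Counter()
    for title in titles:
        for tag in {match.upper() for match in TAG_PATTERN.findall(title)}:
            counts[tag] += 1
    return counts


# Counts reported in the text, used to re-derive the share of trading posts.
reported = {"WTS": 1636, "WTT": 895, "WTB": 431}
total_trading = sum(reported.values())
trading_share = total_trading / 27638
print(total_trading, round(trading_share * 100, 2))
```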
Reddit, as a comprehensive platform, provides a promotion channel for both vendors and individual users. Of 27,638 posts, 2962 (10.72%) were related to trading, which indicates that secondhand e-cigarette transaction channels exist, raising new challenges for regulation and surveillance. Teenagers, for example, could easily acquire e-cigarette products through such channels, which would decrease the effectiveness of the FDA’s proposed e-cigarette ban policy. The existence of secondhand markets introduces other possible problems as well. Without regulations and standards, product safety is not guaranteed, raising potential risks for users. More than half of the trading posts (1636 of 2962) were on the supply side, which suggests that e-cigarette users' preferences change frequently. This phenomenon warrants further investigation.
Reddit and JuiceDB both provided detailed descriptions of e-cigarette products. Moreover, some posts linked these two platforms together. For instance, the posts in
It is possible that users might refer to several platforms to find useful information and suggestions for vaping. We examined several other platforms, including Facebook, Twitter, the Vaping Forum, UK Vapers, E-cigarette Forum, and Aussievapers. The results are shown in
Platform links.
Link | Reddit title (n=27,638), n (%) | Reddit content (n=27,638), n (%) | JuiceDB content (n=14,433), n (%) |
Facebook | 32 (0.12) | 650 (2.35) | 15 (0.10) |
Twitter | 7 (0.03) | 290 (1.05) | 0 |
JuiceDB (Reddit) | 14 (0.05) | 68 (0.25) | 110 (0.76) |
The Vaping Forum | 4 (0.01) | 7 (0.03) | 0 |
UK Vapers | 13 (0.05) | 4 (0.01) | 1 (0.01) |
E-cigarette Forum | 0 | 38 (0.14) | 0 |
Aussievapers | 4 (0.01) | 13 (0.05) | 0 |
Reddit is a comprehensive platform that links many other forums and social media. However, JuiceDB seemed to be exclusively related to Reddit.
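One way to count cross-platform links is sketched below. It assumes that a link is detected by a simple domain substring match. The domain list and the name `count_posts_linking` are assumptions for illustration, and the study may have detected links differently. The last lines re-derive the share of JuiceDB links that point to Reddit from the JuiceDB column of the table.

```python
# Hypothetical domain list; extend with the forums of interest.
PLATFORM_DOMAINS = {
    "Facebook": "facebook.com",
    "Twitter": "twitter.com",
    "JuiceDB": "juicedb.com",
    "Reddit": "reddit.com",
}


def count_posts_linking(posts, domains=PLATFORM_DOMAINS):
    """Count how many posts mention each platform's domain at least once."""
    counts = {name: 0 for name in domains}
    for text in posts:
        lowered = text.lower()
        for name, domain in domains.items():
            if domain in lowered:
                counts[name] += 1
    return counts


# JuiceDB column of the table: links found in 14,433 review texts.
juicedb_links = [15, 0, 110, 0, 1, 0, 0]
reddit_share = 110 / sum(juicedb_links)
print(round(reddit_share * 100, 2))
```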
Flavor was one of the most discussed topics among e-cigarette users. Both Reddit and JuiceDB had many posts related to e-liquid flavors. In previous research, we identified eight categories of flavors: fruits, cream, tobacco, menthol, beverages, sweet, seasonings, and nuts [
From the Reddit LDA results, the topic contained several keywords related to the taste of flavors, such as strawberry, vanilla, custard, banana, apple, menthol, candy, blueberry, mango, watermelon, cinnamon, peach, caramel, lemon, chocolate, honey, cake, tea, raspberry, orange, cherry, cereal, coconut, pear, grape, cookie, peanut, mint, pineapple, and coffee. This set of flavors covered the majority of flavors found in previous research [
A study about e-cigarette flavors pointed out that new flavors would come out every now and then as the e-cigarette market develops [
The findings on JuiceDB were similar. However, because JuiceDB focuses on e-liquid reviews, the topics we found were more focused. Thus, fruit and cream flavors composed a single topic, whereas other flavors made up a separate one. These two topics identified by the LDA method could help us build and complete the flavor list, as well as identify new types and trends.
Social media is a way for e-cigarette users to share their vaping experience with one another. People may ask and answer questions about e-cigarettes, or simply write down their feelings after trying a particular product. For example, a Reddit user raised a question about sweet e-juice and cavities, which is shown in
Users also shared their methods of using e-cigarettes to help others improve their vaping experience. For example, a common method is called steeping, a way of processing the e-liquid that is used especially for new products. Vapers usually believe that steeping helps to disperse chemicals and flavors throughout the juice. Steeping is simple: the user shakes the e-liquid and stores it in a cool, dark place. This is an example from JuiceDB:
Steeped this juice for 4 days, the color darkened just a bit, the flavor really came out as well.
In comparison with traditional tobacco products, e-cigarettes use e-liquid to deliver nicotine and other chemicals. Thus, the method of vaping is totally different from smoking. As far as we know, e-liquid steeping is still not well studied in the literature.
Throat hit and vapor production are two other major features of using e-cigarettes. Both JuiceDB and Reddit have thousands of posts related to them. Throat hit is the feeling of smoke hitting the back of the throat [
This juice is basically Boba’s Bounty with Banana added in. A nice tobacco/graham cracker flavor bursting with banana but not too overwhelming, it’s just right. Great vapor production and throat hit.
The other type of user has never smoked traditional tobacco products and adopted vaping directly. These users are less likely to enjoy a strong throat hit, and the products they share and recommend are milder in taste. For example:
Very little throat hit in my mix (50pg/50vg 6mg) but very good vapor production.
However, both types of users tend to like thick vapor production. We believe that the vapor helps users gain a visually pleasing experience. A huge amount of vapor could produce a salient social image that is perceived and evaluated by e-cigarette users, similar to traditional cigarettes [
In summary, both Reddit and JuiceDB provide users a platform to share vaping experiences. JuiceDB content is in the form of reviews and focuses more on e-liquids. Reddit, however, offers more approaches for user interactions, such as questions and answers.
Reddit and Twitter had topics about regulations and policy debates, but JuiceDB did not. The keywords from the LDA-identified topics included “kids,” “addiction,” “house,” “quitting,” “safe,” “cancer,” “chemicals,” “government,” “drug,” “control,” “regulation,” and “harmful.” People were discussing the effect of using e-cigarettes, especially the effects on children, and the risk of diseases from chemicals. These discussions went further and led to debates on regulations and bans.
Some Reddit users expressed concerns, whereas others appealed for not banning e-cigarettes. Examples are shown in
In general, we used the keywords “policy,” “policies,” “ban,” “bans,” “regulate,” “regulates,” “regulated,” and “regulation” to search the Reddit database, finding 872 posts. We were interested in generating a basic understanding of people’s attitudes toward e-cigarette regulations. Thus, by reading through the contents, 224 posts were considered to contain personal attitudes, which are summarized in
Regulation debates posts on Reddit (n=224).
Post themes | n (%) | |
Law | 1 (0.4%) | |
Research | 5 (2.2%) | |
Moral requirement | 9 (4.0%) | |
Legislation benefit | 5 (2.2%) | |
Tax | 1 (0.4%) | |
Personal freedom | 5 (2.2%) | |
Safer product | 52 (23.2%) | |
Law | 4 (1.8%) | |
Politics | 8 (3.6%) | |
Employee efficiency | 1 (0.4%) | |
Research | 8 (3.6%) | |
Call to action | 51 (22.8%) | |
How to oppose | 7 (3.1%) | |
Possible regulation | 11 (4.9%) | |
Current regulation status | 23 (10.3%) | |
Regulation effect | 15 (6.7%) | |
Company rule | 17 (7.6%) | |
Comparison | 1 (0.4%) |
Correspondingly, some vapers looked for suggestions to oppose e-cigarette bans, not only federal or state regulations, but also company and university rules.
Some posts were neutral, including forecasting possible future regulations, introducing the current regulation status, analyzing regulation effects, and discussing company-specific rules. Some posts compared e-cigarettes with other addictive products, such as junk food, to discuss e-cigarette regulations.
Twitter, on the other hand, focused more on information transmission. Tweets are restricted to 140 characters, so they contain much less information than a complete Reddit post. Thus, the contents on Twitter were more straightforward and less descriptive. Twitter users tended to use other websites as references to support their point rather than describe it in detail. For instance:
RT @DeLaConcha: RT @tobacconistu: Judge rules FDA cannot ban E-Cigarettes [URL].
Twitter is also famous for its social networking function. Users connect to one another by following relationships. By retweeting posts from other users, information is quickly transmitted all over the world. Thus, the contents are more timely than Reddit posts. For example, an e-cigarette ban proposal in Coconino County could be tracked on Google as early as April 8, 2014. In our dataset, there was a tweet directing to this page right after it was published.
Finally, as we have mentioned, Twitter is a well-known platform for social media campaigns. By using certain hashtags, users become involved and influence specific topics. Ideas spread quickly through such campaigns. The hashtags #euecigban, #noecigban, and #freevape were broadly used on Twitter.
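Counting how many tweets carry a given hashtag can be sketched as follows. The pattern and the name `count_hashtags` are assumptions for illustration. Hashtags are lowercased and counted at most once per tweet, so the result is the number of tweets per hashtag, not the number of occurrences.

```python
import re
from collections import Counter

HASHTAG_PATTERN = re.compile(r"#(\w+)")


def count_hashtags(tweets):
    """Return the number of tweets that contain each (lowercased) hashtag."""
    counts = Counter()
    for tweet in tweets:
        counts.update({tag.lower() for tag in HASHTAG_PATTERN.findall(tweet)})
    return counts
```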
There were 3118 tweets containing the hashtag #euecigban, 916 posts containing the hashtag #noecigban, and 299 posts containing the hashtag #freevape. We analyzed the same number of posts for each hashtag group. For each hashtag, we randomly picked out 299 posts (the total number of posts that #freevape had), analyzed the content, and classified them into themes, as shown in
1. No harm: tweets with this theme argued that e-cigarettes should not be banned because their use has little or no negative impact on human health, especially for 0 mg nicotine e-liquid.
2. Smoking cessation and saving lives: this theme stated that e-cigarettes should not be banned because e-cigarettes could act as a substitute for traditional tobacco and, therefore, e-cigarettes could help users quit smoking and save lives.
3. Pharma interests/tax income: some tweets argued that e-cigarette bans were proposed because of the interests of traditional tobacco/pharma companies or taxation from the sales of traditional tobacco.
4. Biased research: some people thought the evidence from research that supports e-cigarette bans was biased.
5. Personal freedom and rights: some people believed banning e-cigarettes was a violation of personal freedom and rights.
6. Simple opposition: some tweets just opposed e-cigarette regulations without providing any evidence.
7. Call to action: tweets in this theme were appealing for some action to oppose the ongoing bills. Usually, it was an imperative sentence with keywords “support,” “sign,” and “action.”
8. Only tag: these tweets contained a hashtag but not any other text content. Usually these tweets had URLs or pictures, which were not analyzed by this research.
9. Neutral descriptions: text content in the tweets was just a description without personal attitudes.
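The equal-size sampling that precedes this theme coding can be sketched as follows. The function name `sample_per_hashtag` and the fixed random seed are assumptions for illustration. The sample size is taken from the smallest hashtag group, which corresponds to 299 tweets for the group sizes reported in the text.

```python
import random


def sample_per_hashtag(tweets_by_hashtag, seed=0):
    """Draw the same number of tweets from every hashtag group, without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    size = min(len(tweets) for tweets in tweets_by_hashtag.values())
    return {
        tag: rng.sample(tweets, size)
        for tag, tweets in tweets_by_hashtag.items()
    }


# Placeholder tweet ids with the group sizes reported in the text.
groups = {
    "euecigban": list(range(3118)),
    "noecigban": list(range(916)),
    "freevape": list(range(299)),
}
samples = sample_per_hashtag(groups)
print({tag: len(sample) for tag, sample in samples.items()})
```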
In summary, Reddit, which is essentially a forum, has more user discussions and interactions than Twitter. But Twitter is good at information transmission and social media campaigns.
Twitter hashtag analysis.
Hashtag and category | n (%) | |
#euecigban (n=299) | |
No harm | 21 (7.0) | |
Smoking cessation and life saving | 141 (47.2) | |
Pharma interests/tax income | 8 (2.7) | |
Biased research | 2 (0.7) | |
Personal freedom and right | 10 (3.3) | |
Simply opposition | 46 (15.4) | |
Call to action | 32 (10.7) | |
Only tag | 10 (3.3) | |
Neutral description | 29 (9.7) | |
#noecigban (n=299) | |
No harm | 10 (3.3) | |
Smoking cessation and life saving | 71 (23.7) | |
Pharma interests/tax income | 14 (4.7) | |
Biased research | 0 (0.0) | |
Personal freedom and right | 11 (3.7) | |
Simply opposition | 69 (23.1) | |
Call to action | 21 (7.0) | |
Only tag | 52 (17.4) | |
Neutral description | 51 (17.1) | |
#freevape (n=299) | |
No harm | 3 (1.0) | |
Smoking cessation and life saving | 24 (8.0) | |
Pharma interests/tax income | 7 (2.3) | |
Biased research | 0 (0) | |
Personal freedom and right | 2 (0.7) | |
Simply opposition | 15 (5.0) | |
Call to action | 5 (1.7) | |
Only tag | 23 (7.7) | |
Neutral description | 220 (73.6) |
Tweet theme comparison.
The comprehensive analysis in the previous part presented the results summarized from all the data available. However, another interesting question came from the differences across platforms; specifically, whether the posts from different platforms had different topic distributions. As shown previously, the dataset collected from Twitter was more related to regulation debates, whereas the datasets from Reddit and JuiceDB were more comprehensive because of the keywords selected in the data collection processes. Thus, in this study, we only compared the topic distributions between Reddit and JuiceDB.
As stated in the data analysis section, the LDA algorithm identified five topics from a collection of Reddit or JuiceDB posts. In order to compare across the platforms, we manually classified those topics into three groups: promotion, flavor, and experience. Each of the posts was categorized into one of the groups. For Reddit, the numbers of posts in promotion, flavor, and experience were 2152, 21,752, and 3734, respectively; for JuiceDB, the numbers of posts in promotion, flavor, and experience were 4203, 5196, and 5034, respectively.
We ran a chi-square test to compare the differences in topic distribution between Reddit and JuiceDB. The results showed that the topic distribution was significantly different (
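A chi-square test of independence on the post counts given above can be sketched in plain Python as follows. This is an illustrative re-computation, not the authors' analysis script. The function name `chi_square_independence` is an assumption, and 5.991 is the standard critical value for 2 degrees of freedom at the 0.05 level.

```python
def chi_square_independence(table):
    """Return the Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Columns: promotion, flavor, experience (post counts from the text).
counts = [
    [2152, 21752, 3734],  # Reddit
    [4203, 5196, 5034],   # JuiceDB
]
statistic, dof = chi_square_independence(counts)
CRITICAL_VALUE_05_DOF2 = 5.991
print(dof, statistic > CRITICAL_VALUE_05_DOF2)
```

With 2 rows and 3 columns the test has 2 degrees of freedom, so the null hypothesis of identical topic distributions is rejected at the 0.05 level when the statistic exceeds the critical value.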
We provide a general framework to analyze user-generated content from social media. After the raw materials are collected, we suggest applying a topic-modeling method to generate insights for further analysis. For instance, we found several topics by applying LDA to datasets collected from different social media platforms. These topics are classified into four types: promotions, flavor discussions, experience sharing, and regulation debates. Compared to the results from surveys and experiments, data from social media are collected in the field and have a large data size, which provides a potential approach to generate valuable insights. Moreover, collecting data online uses less time and money than recruiting participants to complete questionnaires. Based on the previous analysis, we propose a unified model for e-cigarette policy proposals, as shown in
A unified e-cigarette social media feedback collection and analysis model.
Consider two simple examples. Assume that government departments, such as the FDA, want to collect some data about symptoms and adverse events from using different flavored e-liquids [
In summary, the rapid growth of e-cigarette user communities indicates the importance of research in this field. Social media has proven to play an indispensable role in promotions and communications. Previous research has utilized social media as the data source to study e-cigarettes. Most of them focused on only one specific platform [
We collected data from Reddit, JuiceDB, and Twitter, which was feasible for our current research. However, several other platforms, such as Facebook and E-cigarette Forum, could be considered to expand the current dataset for further analysis. We only collected regulation-related data from Twitter, but other e-cigarette-related tweets could be of interest. A more general keyword set should be created for data collection across the platforms. Moreover, the keywords “vape,” “vapor,” and “vaping” should be included in the next step of data collection. However, we still believe the research findings from the current dataset provide valid and valuable insights.
Another limitation of this paper was the lack of demographic information. Because Reddit, JuiceDB, and Twitter do not provide reliable personal characteristics, such as age and gender, we cannot divide our dataset into several subgroups to analyze the different patterns among different age or gender groups.
Finally, this study only used LDA to identify topics among posts. There are many other data mining tools that could be applied to further explore the dataset. For instance, sentiment analysis could be conducted on the regulation-related posts. Positive, neutral, or negative sentiments are an important indicator for understanding public comments.
We envision three possible approaches for further study. First, the LDA model could be modified and extended for further analysis. In this paper, we applied the standard LDA techniques as the topic-modeling algorithm, and the results were feasible enough to conduct some analysis. However, given the special context of e-cigarettes, we believe that some modifications to the standard LDA model could produce better and more precise results. For instance, topic-in-set knowledge could be added to achieve supervised learning [
Second, major types of topics are identified, each of which is interesting and makes practical sense. Some findings and discussions could be further explored. For example, individual trading is an emerging phenomenon in the e-cigarette market, which could produce potential risks to e-cigarette regulations. Vendors’ promotions are also worth studying to find patterns. Automatic emerging e-liquid detection and symptoms collection are important as well. Studying feedback on proposed policies would generate insights for policy makers to make better decisions.
Finally, the characteristics of social media platforms should be further analyzed. For example, the problem of bots, fake accounts, and spam on Twitter is worth exploring, from both a research perspective and an application perspective. It will be challenging and meaningful if we can develop an automatic filter for more accurate analysis on Twitter. The algorithm itself and the patterns of spammers are worth studying. The connections between platforms are interesting as well. If we could identify the same account across platforms, the information flow could be easily understood, providing a valuable signal for public health surveillance.
Using the LDA topic-modeling technique, we identified topics among posts generated by e-cigarette users. This automatic method could be used to analyze the state of the art in the e-cigarette field. New brands, flavors, and trends could be found using our method, which is of great importance to the fast-developing e-cigarette market. We compared the results from Reddit, JuiceDB, and Twitter and discussed the similarities and differences of the platforms. We hope the characteristics analyzed in this paper can be further used by other researchers and policy makers.
Graphical representation of the LDA model.
LDA model.
Close connections between Reddit and JuiceDB.
Sweet e-juice and cavity.
Regulation debates from Reddit.
Instructions for opposing an e-cigarette ban by mail.
API: application program interface
e-cigarette: electronic cigarette
FDA: Food and Drug Administration
LDA: latent Dirichlet allocation
NLP: natural language processing
PG: propylene glycol
RDA: rebuildable dripping atomizer
RT: retweet
VG: vegetable glycerin
WTB: want to buy
WTS: want to sell
WTT: want to trade
Several members of the SMILES (Social Media-based Informatics pLatform for E-cigarette regulatory research) group at the Institute of Automation, Chinese Academy of Sciences, assisted in this study, which we gratefully acknowledge. In particular, we would like to thank Xin Peng, Xuezhen Zhang, Na Chen, and Xiang Zhou for help downloading and coding the data and making valuable suggestions. This work was supported by the US National Institutes of Health under Grant No. 5R01DA037378-03, the National Key Research and Development Program under Grant No. 2016YFC1200702, the Key Research Program of the Chinese Academy of Sciences under Grant No. ZDRW-XH-2017-3, and the National Natural Science Foundation of China under Grant Nos. 71621002, 61671450, and 71272236.
Scott J Leischow has served as a paid consultant to or conducted research for Pfizer, GSK, Cypress BioScience, and McNeil Consumer. McNeil Consumer is collaborating with GSK on a current study on nicotine replacement, which is being conducted by Scott J Leischow, and GSK markets bupropion.