This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Our previous infodemiological study was performed by manually mining health-effect data associated with electronic cigarettes (ECs) from online forums. Manual mining is time-consuming and limits the number of posts that can be retrieved.
Our goal in this study was to automatically extract and analyze a large number (>41,000) of online forum posts related to the health effects associated with EC use between 2008 and 2015.
Data were annotated with medical concepts from the Unified Medical Language System using a modified version of the MetaMap tool. Of over 1.4 million posts, 41,216 were used to analyze symptoms (undiagnosed conditions) and disorders (physician-diagnosed terminology) associated with EC use. Each post was also assigned a sentiment (positive, negative, or neutral).
Symptom and disorder data were categorized into 12 organ systems or anatomical regions. Most posts on symptoms and disorders contained negative sentiment, and affected systems were similar across all years. Health effects were reported most often in the neurological, mouth and throat, and respiratory systems. The most frequently reported symptoms were headache (n=939), coughing (n=852), and malaise (n=468), and the most frequently reported disorders were asthma (n=916), dehydration (n=803), and pharyngitis (n=565). In addition, users often reported linked symptoms (eg, coughing and headache).
Online forums are a valuable repository of data that can be used to identify positive and negative health effects associated with EC use. By automating extraction of online information, we obtained more data than in our prior study, identified new symptoms and disorders associated with EC use, determined which systems are most frequently adversely affected, identified specific symptoms and disorders most commonly reported, and tracked health effects over 7 years.
At the time of their introduction 10 years ago, there was little information on the health effects associated with electronic cigarettes (ECs); nevertheless, they were often considered safer than conventional cigarettes because they do not burn tobacco and therefore produce aerosols with fewer chemicals. Since their introduction, a wide range of studies concerning the health effects associated with ECs have been conducted using various approaches that include online informatics and survey studies [
Infodemiological approaches, which mine data from the internet and social media, have yielded new information such as EC topography and the effects of EC use on human health [
The objective of this study was to use automated computer methods to mine an online forum and extract a large set of posts dealing with the effects of EC use on human health. These data were analyzed to identify the symptoms (undiagnosed conditions) and disorders (physician-diagnosed terminology) associated with EC use. Data were analyzed over a 7-year period, and the sentiment in each post (positive, negative, or neutral) was determined.
We collected data posted between January 2008 and July 2015 on a large EC discussion forum. Data from 2008 and 2015 were each collected for approximately 6 months. We analyzed the layout of the website and built a crawler in Java using the Java library jsoup [
Online forum data pipeline showing processing, post sorting, and classification workflow.
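The post-extraction step of such a pipeline can be illustrated with a short sketch. The crawler described above was written in Java with jsoup; the Python code below is not that crawler, only a minimal stand-in built on the standard library's html.parser. The forum markup (a div with class "post" and a data-date attribute), the sample page, and the names PostExtractor and extract_posts are all hypothetical.

```python
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collect the text of <div class="post"> elements (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.posts = []    # one dict per post: {"date": ..., "text": ...}
        self._depth = 0    # > 0 while the parser is inside a post <div>
        self._chunks = []  # text fragments of the post being read
        self._date = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._depth:
            # Track nested <div> elements so the post closes at the right tag.
            if tag == "div":
                self._depth += 1
        elif tag == "div" and "post" in (attrs.get("class") or "").split():
            self._depth = 1
            self._chunks = []
            self._date = attrs.get("data-date")

    def handle_endtag(self, tag):
        if self._depth and tag == "div":
            self._depth -= 1
            if self._depth == 0:
                # Join fragments and normalize whitespace.
                text = " ".join(" ".join(self._chunks).split())
                self.posts.append({"date": self._date, "text": text})

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_posts(html):
    """Return the posts found in one forum page."""
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return parser.posts


# Invented sample page; a real crawler would fetch pages over HTTP instead.
SAMPLE_PAGE = """
<html><body>
  <div class="post" data-date="2012-03-04">I get a <b>headache</b> after vaping.</div>
  <div class="sidebar">Forum rules</div>
  <div class="post" data-date="2013-07-19">My cough is gone since I switched.</div>
</body></html>
"""

posts = extract_posts(SAMPLE_PAGE)
```

Keeping the posting date with each post is what later allows the posts to be grouped by year.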
We used a modified version of the MetaMap tool [
For the 2 semantic types (symptoms and disorders), we ordered the concepts by their frequencies.
We analyzed the different terms mapped to each concept.
We removed the misclassified concepts from our results. Examples of misclassified concepts include:
mod, which refers to vape mods, was mapped to Type 2 diabetes mellitus (C0011860)
ect, which is a type of vape mod, was mapped to Benign Rolandic epilepsy (C2363129)
pic was mapped to Punctate inner choroidopathy (C0730321)
For each semantic type, we reported the most frequent disorders and symptoms overall and by year.
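The filtering and ranking steps above can be sketched as follows. The three (term, concept ID) pairs are the misclassification examples listed above; the annotation records, the function name concept_frequencies, and the concept IDs shown for headache and coughing are included for illustration only and should be checked against the UMLS before reuse.

```python
from collections import Counter

# (matched term, concept ID) pairs reported above as misclassified.
MISCLASSIFIED = {
    ("mod", "C0011860"),  # vape mod mapped to Type 2 diabetes mellitus
    ("ect", "C2363129"),  # vape mod type mapped to Benign Rolandic epilepsy
    ("pic", "C0730321"),  # "pic" mapped to Punctate inner choroidopathy
}


def concept_frequencies(annotations):
    """Rank concepts by frequency after dropping misclassified mappings.

    annotations: iterable of (matched term, concept ID) pairs.
    Returns a list of (concept ID, count), most frequent first.
    """
    kept = [cui for term, cui in annotations
            if (term.lower(), cui) not in MISCLASSIFIED]
    return Counter(kept).most_common()


# Invented annotation records for illustration.
annotations = [
    ("headache", "C0018681"),
    ("Headache", "C0018681"),
    ("cough", "C0010200"),
    ("mod", "C0011860"),
    ("pic", "C0730321"),
]

ranked = concept_frequencies(annotations)
```

Filtering on the (term, concept) pair rather than on the concept alone means that a post that genuinely mentions type 2 diabetes would still be counted; only the spurious mapping from "mod" is removed.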
To measure the positive and negative health effects produced by EC use, we used a supervised learning classifier (Random Forest) on a set of manually labeled posts to predict the sentiment for unseen posts. We randomly selected 1080 posts, which were labeled independently by 3 of the authors as the following:
Negative: if a post clearly contained a health effect or unpleasant experience or complaint that co-occurred with the use of EC.
Positive: if a post clearly mentioned a health improvement or a recovery from previous health effects when switching from smoking analogs (conventional cigarettes) to EC.
Neutral: if a post did not express any sentiment.
Our interpretation of positive and negative is different from typical sentiment classifications, and mainly focuses on health-related effects. We first asked the labelers to categorize 400 posts, and then we measured the intercoder reliability between the labelers. Using
Sample data summary.
Class | Posts, n | Example |
Positive | 180 | “I’ve only been vaping for 2 1/2 weeks, but I’ve already noticed a big difference in my lungs (after 20+ years of smoking). For example, I had a chest cold when I started, and in the past, once a cold moved into my chest it took a couple of months to get rid of it. ...E-cigs are pretty darn amazing, IMHO.” [sic] |
Neutral | 416 | “I dont [sic] think there are any tests since flavoring were not meant to be inhaled [sic]. I think we are taking our chances untill [sic] some evidence comes out...” |
Negative | 484 | “Hi Everyone, I have been using e-cigarrette [sic] for the past 2 months and very disappointed [sic] that I have to stop, reason being my teeth, gums are sensitive and my tooth cracked yesterday, I have to have a crown fitted.8-o [sic]. I think that the nicotine is seriously not good for the mouth. My husband and work collegue [sic] have also reported sore gums, little sores in the mouth…” |
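The intercoder reliability check among the 3 labelers can be illustrated with a small example. The statistic used in the study is not visible in this text, so the sketch below assumes Fleiss' kappa, a common choice when more than 2 raters label the same items. The function name and the labels are invented for illustration.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that were each labeled by the same number of raters.

    ratings: list of per-item label lists, eg [["negative", "negative", "neutral"], ...]
    categories: all possible labels.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()
    agreement_sum = 0.0
    for labels in ratings:
        counts = Counter(labels)
        totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1))
    observed = agreement_sum / n_items
    # Agreement expected by chance from the overall label proportions.
    expected = sum((totals[c] / (n_items * n_raters)) ** 2 for c in categories)
    if expected == 1.0:
        return 1.0  # every rating fell in a single category
    return (observed - expected) / (1 - expected)


# Invented labels from 3 raters for 4 posts.
ratings = [
    ["negative", "negative", "negative"],
    ["positive", "positive", "neutral"],
    ["neutral", "neutral", "neutral"],
    ["negative", "negative", "neutral"],
]
kappa = fleiss_kappa(ratings, ["positive", "neutral", "negative"])
```

A value of 1 indicates perfect agreement, and values near 0 indicate agreement no better than chance.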
Using Weka machine-learning toolkit v. 3.8.1 [
To improve the classifier’s accuracy, we needed to address a well-known issue in our sample data, which is the imbalanced class distribution [
After using the new training set, the classifier’s accuracy increased from 66.95% to 75.42%.
Training data summary (N=400).
Class | Training, n (%) | Training (extended), n (%) |
Positive | 67 (16.75) | 112 (28.0) |
Neutral | 154 (38.50) | 136 (34.0) |
Negative | 179 (44.75) | 152 (38.0) |
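How the extended training set was constructed is not stated in this text. One simple way to reduce class imbalance while keeping the total number of posts fixed is to randomly oversample the minority class and undersample the larger classes. The sketch below assumes that approach and uses the class counts from the table as targets; the function name, placeholder posts, and seed are illustrative.

```python
import random


def resample_to_targets(examples, targets, seed=0):
    """Resample labeled examples to the requested per-class counts.

    examples: list of (text, label); targets: {label: desired count}.
    Classes above their target are undersampled without replacement;
    classes below their target keep all originals and add random duplicates.
    """
    rng = random.Random(seed)
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    resampled = []
    for label, want in targets.items():
        pool = by_label.get(label, [])
        if not pool:
            raise ValueError(f"no examples for class {label!r}")
        if want <= len(pool):
            resampled.extend(rng.sample(pool, want))
        else:
            resampled.extend(pool)
            resampled.extend(rng.choices(pool, k=want - len(pool)))
    rng.shuffle(resampled)
    return resampled


# Placeholder posts with the original training counts from the table.
original = ([("post", "positive")] * 67
            + [("post", "neutral")] * 154
            + [("post", "negative")] * 179)
extended = resample_to_targets(
    original, {"positive": 112, "neutral": 136, "negative": 152})
```

With this kind of resampling, duplicates should be created only within the training set so that no copy of a test post leaks into training.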
Test data classification accuracy (N=118).
Class | Precision | Recall | F-measure | Posts, n |
Positive | 0.73 | 0.76 | 0.74 | 21 |
Neutral | 0.67 | 0.77 | 0.71 | 39 |
Negative | 0.84 | 0.74 | 0.79 | 58 |
Average | 0.76 | 0.75 | 0.76 | 118 |
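As a check on how the reported metrics relate to each other, the F-measure is the harmonic mean of precision and recall, and accuracy is the share of test posts classified correctly. The sketch below recomputes the negative-class F-measure from the table. The counts of 79 and 89 correct posts out of 118 are inferred from the reported accuracies of 66.95% and 75.42%; they are not stated in the text.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def accuracy_percent(correct, total):
    """Percentage of test posts that were classified correctly."""
    return 100.0 * correct / total


negative_f = f_measure(0.84, 0.74)             # negative row of the table
accuracy_before = accuracy_percent(79, 118)    # inferred count, see note above
accuracy_after = accuracy_percent(89, 118)     # inferred count, see note above

print(round(negative_f, 2), round(accuracy_before, 2), round(accuracy_after, 2))
```

Because the table values are rounded to 2 decimals, an F-measure recomputed from them can differ from the reported value by 0.01.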
All health-related effects (symptoms and disorders) data reported by EC users in posts were collected iteratively and sorted into Microsoft Excel spreadsheets. The symptoms and disorders were further grouped according to the organ system and anatomical region, which we defined as
The 41,216 posts we collected spanned the years from 2008 to 2015 (2008 and 2015 were half years). We analyzed the frequency of reports for various symptoms and disorders by consolidating the reported health effects into structural or physiological systems (eg, sore throat was classified into mouth and throat;
Frequency distribution of reported symptom (A) and disorder (B) posts grouped into their systems or anatomical regions. The frequency of positive, neutral, and negative posts is shown for symptoms (A) and for disorders (B).
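The consolidation of reported health effects into systems, split by sentiment, amounts to a lookup followed by a count. In the minimal sketch below, the mapping of sore throat to mouth and throat is the example given above; the other mappings follow the systems named in the results, and the posts and the "unclassified" fallback are invented for illustration.

```python
from collections import Counter

# Health effect -> organ system or anatomical region.
SYSTEM_OF = {
    "sore throat": "mouth and throat",  # example given in the text
    "headache": "neurological",
    "coughing": "respiratory",
}


def count_by_system(posts):
    """Count posts per (system, sentiment).

    posts: iterable of (health effect, sentiment) pairs.
    Effects missing from the mapping are counted as "unclassified".
    """
    counts = Counter()
    for effect, sentiment in posts:
        system = SYSTEM_OF.get(effect, "unclassified")
        counts[(system, sentiment)] += 1
    return counts


# Invented posts for illustration.
posts = [
    ("headache", "negative"),
    ("headache", "negative"),
    ("coughing", "positive"),
    ("sore throat", "negative"),
    ("hiccups", "neutral"),
]
counts = count_by_system(posts)
```

Adding the posting year to the key gives the per-year breakdown in the same way.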
After examining the overall frequency distribution for all posts, we grouped the posts by year within their symptom or disorder categories and determined the frequency distribution of reports per year. In addition, the posts for symptoms and disorders were categorized according to sentiment (positive, negative, or neutral), and their frequency per year was summarized in stacked bar graphs (
Frequency distribution of positive, neutral, and negative sentiment by year for reported symptoms (A-H) and disorders (I-P).
For the symptoms, the systems with the most posts were consistently the neurological, respiratory, digestive, integumentary, and mouth and throat systems. For all years except 2008, the neurological and respiratory systems were the top 2 systems. The digestive, integumentary, and mouth and throat systems alternated in rank in some years but were generally among the top 5 systems in each year.
The posts containing disorders associated with EC use showed a similarly stable set of top 5 systems across the 7 years of reporting. The 2 top systems reported between 2008 and 2012 were the respiratory and mouth/throat systems. The integumentary, neurological, and immune systems alternated within the top 5.
Negative sentiment was associated with most symptoms and disorders in each system or anatomical region (
Heat maps were made by plotting the frequency with which individual symptoms/disorders occurred for all 41,216 posts (
Although positive symptoms were not frequently reported in this online forum, those reports that were posted most often dealt with improvements in the neurological (n=77), respiratory (n=60), digestive (n=19), and mouth and throat (n=18) systems (
For each system/anatomical region, there were 1 to 3 top disorders. In the respiratory system, the most common disorders were asthma (n=916), chronic obstructive pulmonary disease (COPD; n=471), pneumonia (n=367), and bronchitis (n=232;
Heat map of specific symptoms reported in the neurological, respiratory, digestive, mouth and throat, and integumentary systems. The total number of posts for each symptom is shown on a log scale ranging from high (red) to low (blue).
Heat map of specific disorders reported in the respiratory, mouth and throat, neurological, integumentary, and immune systems. The total number of posts for each disorder is shown on a log scale ranging from high (red) to low (blue).
To compare the frequency with which different symptoms/disorders appeared across different systems, frequency distribution graphs were created (
Frequency distribution of specific symptoms with over 100 posts and frequency distribution of their systems or anatomical regions (inset). Digest.: digestive; Integ.: integumentary; Mo./Th.: mouth and throat; Musc./Skel.: muscular/skeletal; Neuro.: neurological; Resp.: respiratory.
Frequency distribution of specific disorders with over 100 posts and frequency distribution of their systems or anatomical regions (inset). Neuro.: neurological; Resp.: respiratory; Mo./Th.: mouth and throat; Integ.: integumentary; Digest.: digestive; Circ.: circulatory; Endoc.: endocrine.
A total of 46 paired symptoms were frequently reported (
Graph showing frequency with which symptoms in various systems were linked. Digest.: digestive; Integ.: integumentary; Mo./Th.: mouth and throat; Musc./Skel.: muscular/skeletal; Neuro.: neurological; Resp.: respiratory.
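Linked symptoms can be counted as pairs of symptoms that occur together. How a link was defined in the study is not specified in this text, so the sketch below assumes that 2 symptoms are linked when they appear in the same post. The posts and the function name are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def linked_symptoms(posts):
    """Count how often each unordered pair of symptoms shares a post.

    posts: iterable of symptom collections, one per post.
    """
    pairs = Counter()
    for symptoms in posts:
        # Sort so that each pair has a single canonical key.
        for first, second in combinations(sorted(set(symptoms)), 2):
            pairs[(first, second)] += 1
    return pairs


# Invented posts for illustration.
posts = [
    {"coughing", "headache"},
    {"coughing", "wheezing"},
    {"headache", "coughing", "nausea"},
    {"dizziness"},
]
pairs = linked_symptoms(posts)
top_pair = pairs.most_common(1)[0]
```

Mapping each symptom in a pair to its system then gives the frequency with which systems are linked.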
The internet is a dynamic resource containing information that can be mined to learn about the health effects associated with EC use. In our previous study, we manually mined 632 posts from 3 online EC forums to identify both positive and negative health effects reported by EC users [
The results from this study are in overall good agreement with our prior publication [
There are numerous reports on the health effects of EC, many of which are in agreement with our data. In the neurological system, the most commonly reported adverse symptoms we observed included headache, fatigue, nausea, dizziness, and seizures, which have also been reported in human studies [
For the respiratory system, the most frequently reported symptoms included coughing, wheezing, and dyspnea, and the top respiratory symptom pair was coughing with wheezing. In the national Population Assessment of Tobacco and Health (PATH) Study and in some human surveys, EC use was associated with increased wheezing (an important potential risk factor for respiratory disease) [
The circulatory, mouth/throat, chest, integumentary, and immunological systems were also affected by EC use in our study. Symptoms such as pain in throat, dry skin, pounding heart, and chest pain have been reported in survey/human EC studies [
EC refill fluids and aerosols are complex mixtures that contain flavor chemicals, solvents, nicotine, and metals that could contribute to adverse health effects (
Examples of chemical components in electronic cigarettes that may cause major symptoms/disorders with reference citations of studies.
Symptom/disorder | System | Flavor chemicals (study) | Metals (study) | PGa/VGb/byproducts (study) | Nicotine (study) |
Headache | Neurological | [ | [ | [ | [ |
Fatigue/malaise | Neurological | [ | [ | —c | — |
Dizziness | Neurological | [ | [ | — | [ |
Nausea | Neurological | [ | [ | [ | [ |
Dehydration | Neurological | — | — | [ | [ |
Coughing | Respiratory | [ | [ | [ | — |
Wheezing | Respiratory | [ | [ | [ | — |
Dyspnea | Respiratory | [ | [ | [ | [ |
Asthma | Respiratory | [ | [ | [ | — |
COPDd | Respiratory | [ | [ | [ | — |
Pneumonia | Respiratory | — | [ | — | — |
Bronchitis | Respiratory | [ | [ | — | — |
Sinusitis | Respiratory | [ | [ | [ | — |
Pain in throat | Mouth and throat | — | — | [ | [ |
Dental caries | Mouth and throat | [ | [ | — | [ |
Itching/urticaria | Integumentary | — | [ | [ | [ |
Dry skin | Integumentary | — | — | [ | — |
Acne | Integumentary | — | [ | — | [ |
Heartburn | Digestive | — | — | — | [ |
Cramp | Digestive | — | [ | — | — |
aPG: propylene glycol.
bVG: vegetable glycerin.
dCOPD: chronic obstructive pulmonary disease.
Elements/metals (eg, aluminum, copper, cadmium, chromium, iron, nickel, silicon, lead, cobalt, and zinc) have been identified in EC aerosols [
Propylene glycol and glycerin, 2 solvents in EC aerosols, are generally considered safe for ingestion; however, they are known respiratory tract and integumentary irritants [
Nicotine, a major component in most EC fluids, has various neurological, respiratory, digestive, mouth/throat, and circulatory system effects that overlap the symptoms/disorders observed in our study. Most cases of EC nicotine poisonings result from oral ingestion or intravenous injection [
Recently, an epidemic of e-cigarette, or vaping, product use-associated lung injury (EVALI) has been identified by the Centers for Disease Control and Prevention (CDC) [
The sudden uptick in health-related symptoms and conditions related to vaping comes at least 10 years after the products gained widespread popularity in the United States, including the rise in popularity of JUUL and marijuana (THC) vape products. Our data show that many of the symptoms characterizing the current patients have been reported online for at least 7 years, suggesting that cases similar to those in the current epidemic existed previously and were unreported or not linked to vaping. Our data further suggest that this epidemic will continue to grow given the many reports of symptoms characteristic of EVALI on the internet. The specific causes of the reported health effects are not yet known, but it is important to continue vigilant case reporting, symptom tracking, and research on the health effects related to EC use to understand and contain the vaping epidemic.
Our data may underestimate positive health effects, which EC users are less likely to post on online forums. The factors causing the symptoms and disorders reported by EC users could be complex and will require further investigations. Demographic data on the study population were not extractable. It is not known if any individuals were dual users or if they had preexisting health conditions that may have affected their response to EC.
This study is the first to use automated methods to analyze posts on an EC website over a span of 7 years and to identify the symptoms and disorders most frequently reported online by EC users. We demonstrate the value of using automated methods to acquire and analyze large datasets, thereby increasing the power of infodemiological analyses. In addition, from our dataset, we identified a condensed list of symptoms and disorders and ranked them according to post frequency. These symptoms and disorders may be of interest to physicians and health care providers who treat patients using ECs and could potentially be reported more frequently by EC users. Moreover, informative data were collected from a large population of EC users irrespective of their EC products and individual topographies and were not limited to a small selection of EC products or human subjects, as is often the case with experimental studies and case reports. Data collected using our automated method contribute to the growing body of knowledge linking EC use to adverse health effects, mainly in the mouth and throat and the neurological, respiratory, digestive, and integumentary systems. Our study identified hundreds of negative effects that were not previously described in case reports and peer-reviewed literature. The results from our study are in good agreement with previous surveys, human studies, and case reports. Although many of the symptoms that were reported with high frequency are not life-threatening (eg, headache, coughing, heartburn, sore throat), they can be disabling and reduce the quality of life. Of particular concern are the respiratory disorders that appeared with high frequency, such as asthma, COPD, pneumonia, and bronchitis, which not only severely impact the quality of life but may also be life-threatening. Our data support the idea that EC use is not free of adverse health effects and that it is important to continue tracking the health of EC users.
Advances in internet data mining provide a novel method for monitoring the health of EC users over time. Infodemiological data gathered on EC users will be valuable to physicians, regulatory agencies, and the users themselves.
Supplementary Figure 1: Heat map of symptoms less frequently reported in forums along with their associated systems. The total number of posts for each symptom is shown on a log scale.
Supplementary Figure 2: Heat map of disorders less frequently reported in forums along with their associated systems. The total number of posts for each disorder is shown on a log scale.
Supplementary Table 1: Listing of all symptoms not included in heat map.
Supplementary Table 2: Listing of all linked symptoms by frequency.
CDC: Centers for Disease Control and Prevention
COPD: chronic obstructive pulmonary disease
CTP: Center for Tobacco Products
EC: electronic cigarette
EVALI: e-cigarette, or vaping, product use-associated lung injury (also termed vaping-associated pulmonary illness)
FDA: Food and Drug Administration
PATH: Population Assessment of Tobacco and Health
UMLS: Unified Medical Language System
The research reported in this publication was partially supported by the National Institute on Drug Abuse, the National Institute of Environmental Health Sciences, and the FDA Center for Tobacco Products (CTP) through Grant #s R01DA036493 and R01ES029741 (CTP) to PT. MH was supported in part by a Cornelius Hopper Fellowship and a Predoctoral Dissertation Fellowship from the Tobacco-Related Disease Research Program of California (Grant # 28DT-0009). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The authors are grateful to Malcolm Tran, Jacqueline Wong, and Avni Parekh for their help in organizing and sorting the data. They would like to thank Kevin Huang for his help with the manuscript and references and Dr Duc Nguyen for reading the manuscript.
Conflicts of Interest: None declared.