This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Investigation into personal health has increasingly focused on conditions at the local level, while declining survey response rates have complicated the collection of individual-level data. Simultaneously, social media data have exploded in availability and have been shown to correlate with the prevalence of certain health conditions.
Facebook likes may be a source of digital data that can complement traditional public health surveillance systems and provide data at a local level. We explored the use of Facebook likes as potential predictors of health outcomes and their behavioral determinants.
We performed principal components and regression analyses to examine the predictive qualities of Facebook likes with regard to mortality, diseases, and lifestyle behaviors in 214 counties across the United States and 61 of 67 counties in Florida. These results were compared with those obtainable from a demographic model. Health data were obtained from both the 2010 and 2011 Behavioral Risk Factor Surveillance System (BRFSS) and mortality data were obtained from the National Vital Statistics System.
Facebook likes added significant value in predicting most examined health outcomes and behaviors even when controlling for age, race, and socioeconomic status, with model fit improvements (adjusted
Facebook likes provide estimates for examined health outcomes and health behaviors that are comparable to those obtained from the BRFSS. Online sources may provide more reliable, timely, and cost-effective county-level data than those obtainable from traditional public health surveillance systems as well as serve as an adjunct to those systems.
The development of the Internet and the explosion of social media have provided many new opportunities for health surveillance. The use of the Internet for personal health and participatory health research has exploded, largely due to the availability of online resources and health care information technology applications [
The Internet has spawned several sources of big data, such as Facebook [
Understanding how big data can be used to predict lifestyle behavior and health-related data is a step toward the use of these electronic data sources for epidemiologic needs [
In this study, we focused on harnessing the predictive power of Facebook likes for enhancing population health surveillance. Toward this end, we viewed Facebook likes as a class of big data that may help us understand population health at a local level. Given that risk factors and associated health outcomes are often clustered in populations geographically [
In this paper, we examine how big data might be used to complement traditional surveillance systems. We explored the use of Facebook likes as potential predictors of health outcomes and the behavioral determinants of poor health outcomes at the county level. Specifically, we hypothesized that (1) Facebook likes provide a means of predicting county-level mortality, (2) Facebook likes can be used as an indicator of chronic disease outcomes (obesity, diabetes, and heart disease) that contribute to increased mortality, and (3) Facebook likes can be used as an indicator of adverse lifestyle behaviors that impact disease. If these hypotheses hold, then Facebook likes could ultimately be used to enhance population health surveillance.
Data for the analysis were collected from 4 sources. Objective reports on key health indicators (ie, life expectancy, mortality, and low birth weight) were collected from the National Vital Statistics System (NVSS) for 2011, which provides population data on deaths and births in the United States. According to its website, “these data are provided through contracts between [National Center for Health Statistics] NCHS and vital registration systems operated in the various jurisdictions legally responsible for the registration of vital events—births, deaths, marriages, divorces, and fetal deaths” [
Self-reported health outcome and risk behavior data were obtained from the Behavioral Risk Factor Surveillance System (BRFSS) [
Facebook likes data were collected using the Facebook advertising application program interface (API) [
All constituent elements of these supercategories were used, regardless of a clear relationship to health, because the exact contents and means of construction of these data are not reported by Facebook. Other supercategories lacked these explicit links, although we acknowledge the possibility that potentially powerful indirect relationships may exist. Due to rounding performed automatically by the API that routinely led to overestimates, counties with fewer than 1000 profiles overall were excluded from the analysis. Facebook likes for each category were scored as a percentage of completed profiles in an area. Finally, to reduce multicollinearity caused by variation in levels of Facebook usage by county, values were divided by the average percentage of likes across all categories. The resulting variables can be characterized as a measure of popularity for each category relative to that of other categories. Although the individual variables resulting from this transformation were sometimes entirely uncorrelated with the originals, estimates using the raw and transformed variables correlated at
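The two scaling steps described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name, the dictionary layout, and the example category names and counts are hypothetical, and the sketch assumes the counts have already been retrieved from the Facebook advertising API.

```python
def relative_popularity(like_counts, completed_profiles, min_profiles=1000):
    """Scale raw like counts for one county as described in the text.

    like_counts: dict mapping category name -> number of profiles liking it
    completed_profiles: number of completed profiles in the county
    Returns None when the county is excluded or the measure is undefined.
    """
    # Counties with fewer than 1000 profiles overall were excluded.
    if completed_profiles < min_profiles or not like_counts:
        return None

    # Step 1: score each category as a percentage of completed profiles.
    shares = {cat: 100.0 * n / completed_profiles
              for cat, n in like_counts.items()}

    # Step 2: divide by the average percentage across all categories, so the
    # result measures popularity relative to the other categories.
    mean_share = sum(shares.values()) / len(shares)
    if mean_share == 0:
        return None  # no likes recorded; relative popularity is undefined
    return {cat: share / mean_share for cat, share in shares.items()}


# Hypothetical county with two categories (names and counts are made up).
county = {"Outdoor fitness": 300, "Television": 900}
print(relative_popularity(county, completed_profiles=3000))
```

A value above 1 means the category is more popular in that county than the average category there, which is why overall Facebook usage no longer drives every variable in the same direction.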
Population data, such as average income, median age, and sex ratio, were collected using the 2010 US Census [
Several sociodemographic, health outcome, and risk factor variables were selected for analysis. These included income, age, education, employment, nonwhite population, obesity, diabetes, physical activity, and smoking, as well as other measures such as general health status. A comprehensive listing, as well as the data source and assessment of each variable of interest are available in
We began by using principal components analysis on the 37 Facebook likes categories within the 3 selected supercategories as a data reduction technique. We then used these factors in an ordinary least squares (OLS) regression to determine whether Facebook likes could predict a number of health outcomes, conditions, and related behaviors. Finally, by limiting our analysis to Florida, where available data were more comprehensive, we formed a predictive model via bootstrap regression [
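The text does not spell out the bootstrap procedure, so the following is only a sketch of one common form, case resampling: counties are redrawn with replacement, the regression is refit on each resample, and the spread of the refitted coefficients is summarized. A single predictor is used for brevity, and all function names and defaults are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None  # degenerate resample: every x value is identical
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def bootstrap_slopes(xs, ys, n_resamples=1000, seed=0):
    """Refit the line on resampled cases and collect the slopes."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(n_resamples):
        # Draw n cases (counties) with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        fit = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        if fit is not None:
            slopes.append(fit[1])
    return slopes


def percentile_interval(values, lower=0.025, upper=0.975):
    """Simple percentile interval from the bootstrap distribution."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]
```

With a small number of units, such as the 61 Florida counties, resampling of this kind gives a sense of how stable each coefficient is without relying on large-sample assumptions.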
The first stage in the analysis was to establish that health outcomes could indeed be predicted from Facebook likes. Through principal components analysis, the 37 categories were reduced to 9 factors (varimax rotation) purely as a means of simplifying modeling efforts by reducing these categories into the latent sociobehavioral dimensions we believed they represented. This number was arrived at by applying the Cattell scree test (shown in
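The data reduction step can be sketched with NumPy as below. This is an illustrative outline under stated assumptions, not the authors' implementation: the function names are hypothetical, the input is random placeholder data with the same shape as the study's (214 counties by 37 categories), and the rotation uses a standard SVD-based varimax iteration.

```python
import numpy as np


def pca_loadings(X, n_factors):
    """Eigenvalues of the correlation matrix and unrotated loadings."""
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)        # variables are in columns
    eigvals, eigvecs = np.linalg.eigh(corr)    # returned in ascending order
    order = np.argsort(eigvals)[::-1]          # sort descending for a scree plot
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Loadings are eigenvectors scaled by the square root of their eigenvalues.
    scale = np.sqrt(np.clip(eigvals[:n_factors], 0.0, None))
    return eigvals, eigvecs[:, :n_factors] * scale


def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a (variables x factors) loading matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        target = rotated ** 3 - rotated @ np.diag((rotated ** 2).sum(axis=0)) / p
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt                       # stays orthogonal at every step
        new_criterion = s.sum()
        if criterion != 0.0 and new_criterion / criterion < 1 + tol:
            break
        criterion = new_criterion
    return loadings @ rotation


rng = np.random.default_rng(0)
likes = rng.normal(size=(214, 37))              # placeholder data, not the study's
eigvals, loadings = pca_loadings(likes, n_factors=9)
rotated = varimax(loadings)
print(eigvals[:10])                             # values one would plot for a scree test
```

Plotting the sorted eigenvalues and looking for the point where the curve levels off is the idea behind the scree test; the rotation then redistributes variance among the retained factors without changing how much of each category they explain in total.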
To test our hypothesis that Facebook likes can be used to predict mortality on their own, we used OLS regression. We used the 9 Facebook factors to predict life expectancy, with no other controls included in the initial model. The results, as shown in the “Facebook only” column of
Ordinary least squares regression coefficients (β) for life expectancy (all independent variables are standardized).
Variable | Facebook only, β | Facebook only, P | SES only, β | SES only, P | Facebook and SES, β | Facebook and SES, P |
Facebook factor 1 | –0.14 | <.001 | — | — | 0.20 | <.001 |
Facebook factor 2 | 0.79 | <.001 | — | — | 0.43 | <.001 |
Facebook factor 3 | –0.96 | <.001 | — | — | –0.30 | <.001 |
Facebook factor 4 | 0.60 | <.001 | — | — | 0.42 | <.001 |
Facebook factor 5 | 0.69 | <.001 | — | — | 0.41 | <.001 |
Facebook factor 6 | 0.21 | <.001 | — | — | –0.04 | .05 |
Facebook factor 7 | –0.08 | <.001 | — | — | –0.04 | .04 |
Facebook factor 8 | –0.61 | <.001 | — | — | –0.49 | <.001 |
Facebook factor 9 | 0.12 | <.001 | — | — | 0.10 | .70 |
Age | — | — | 0.16 | <.001 | 0.01 | .87 |
Income | — | — | 0.62 | <.001 | 0.59 | <.001 |
Education | — | — | 0.88 | <.001 | 0.61 | <.001 |
Unemployment | — | — | –0.05 | .07 | 0.01 | .70 |
Nonwhite population | — | — | –0.85 | <.001 | –0.47 | <.001 |
Constant | 77.08 | <.001 | 77.06 | <.001 | 77.06 | <.001 |
Adjusted R² | .69 | | .64 | | .81 | |
RMSE | 1.28 | | 1.29 | | 1.01 | |
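For readers who want to reproduce the two fit statistics in the bottom rows of the table, a minimal sketch follows. The function name is hypothetical, and the choice of residual degrees of freedom in the RMSE denominator is an assumption (it matches what many regression packages report); the article does not state which convention was used.

```python
import math


def fit_statistics(actual, predicted, n_predictors):
    """Return (adjusted R squared, root mean squared error) for a fitted model."""
    n = len(actual)
    mean_y = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_y) ** 2 for a in actual)
    df_resid = n - n_predictors - 1        # assumes an intercept was fitted

    r2 = 1 - ss_res / ss_tot
    # Adjusted R squared penalizes each additional predictor.
    adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid
    # Assumption: RMSE uses residual degrees of freedom, not n.
    root_mse = math.sqrt(ss_res / df_resid)
    return adj_r2, root_mse
```

Because the adjustment penalizes extra predictors, the combined model's higher adjusted value is not simply an artifact of adding 9 more variables to the SES model.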
Our third hypothesis posited that Facebook likes, as a measure of behavior, should be able to determine the behaviors that drive health outcomes. The results in
Facebook likes impact on model fit for 214 counties.
Dependent variable | Sourceᵃ | Facebook, adjusted R² | SES, adjusted R² | SES + Facebook, adjusted R² | Improvement with Facebook, % |
Life expectancy | NVSS | .69 | .64 | .81 | 27% |
Mortality | NVSS | .57 | .49 | .60 | 22% |
Low birthweight | NVSS | .53 | .17 | .57 | 235% |
Obesity | BRFSS | .46 | .56 | .60 | 7% |
Diabetes | BRFSS | .36 | .39 | .55 | 41% |
Heart attack | BRFSS | .32 | .46 | .46 | 0% |
Stroke | BRFSS | .27 | .30 | .41 | 46% |
Exercise | BRFSS | .57 | .51 | .76 | 49% |
Insured | BRFSS | .48 | .37 | .65 | 76% |
Self-Reported health | BRFSS | .51 | .20 | .55 | 175% |
Smoker | BRFSS | .40 | .42 | .54 | 29% |
Last checkup | BRFSS | .69 | .30 | .72 | 140% |
Declined treatment | BRFSS | .39 | .35 | .49 | 40% |
a BRFSS: Behavioral Risk Factor Surveillance System; NVSS: National Vital Statistics System.
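The improvement column appears to be the relative gain of the combined model over the SES-only model. That formula is inferred from the reported values rather than stated in the text, so the sketch below should be read as an assumption; the function name is hypothetical.

```python
def improvement_pct(ses_r2, combined_r2):
    """Relative gain in fit from adding Facebook factors to the SES model, in %."""
    return round(100 * (combined_r2 - ses_r2) / ses_r2)


# Values taken from the table above.
print(improvement_pct(0.64, 0.81))  # life expectancy
print(improvement_pct(0.17, 0.57))  # low birthweight
print(improvement_pct(0.46, 0.46))  # heart attack
```

Expressing the gain relative to the SES-only fit explains why outcomes that demographics predict poorly, such as low birthweight, show the largest percentages.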
The natural extension of these findings would be to map out predicted prevalence of health conditions in data-deficient counties. Although 214 counties were sampled sufficiently for the BRFSS to provide county-specific estimates, the remaining 2895 counties were not. An additional source of data, such as Facebook, would be a cost-effective way to augment existing state-level data sources that are used to produce county-level estimates, such as the BRFSS.
However, attempting to apply predictions nationally from the 2011 SMART data creates a problem. Although predictions correlate well with actual levels in non-SMART data, mean levels are consistently upwardly biased. We hypothesized that the selection method that leads counties to be weighted according to the SMART program creates a nonrepresentative sample with better levels of general health than we see in the United States in general, particularly in areas that are more rural. As an alternative without such problematic selection issues, we limited our predictive model to 2010 Florida data. Florida collects more than 500 interviews in 61 of its 67 counties every 3 years, leading to a dataset that has neither sample size shortages nor selection biases relative to the state at large.
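The distinction drawn here, between predictions that correlate well and predictions whose mean level is off, can be made concrete with a short sketch. The prevalence values are hypothetical and the function names are illustrative only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def mean_bias(predicted, actual):
    """Average signed difference between predicted and observed values."""
    return sum(p - a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical county prevalence values (%), shifted by a constant offset.
actual = [20.0, 25.0, 30.0, 35.0]
predicted = [23.0, 28.0, 33.0, 38.0]
print(pearson_r(predicted, actual), mean_bias(predicted, actual))
```

Correlation is unaffected by a constant shift, so a model trained on a nonrepresentative sample can rank counties correctly while still misstating their absolute levels.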
Using data exclusively from one state creates its own problems for a predictive model. Although the integrity of the data is very good, there is no easy way to correct for the various cultural differences between Florida and other states. Attempting to apply Florida-based models to the full set of SMART counties results in only a fair level of correlation (
The results of a predictive model are shown in
Ordinary least squares regression (β) results for prediction of obesity.
Variable | Facebook only, β | Facebook only, P | SES only, β | SES only, P | Facebook and SES, β | Facebook and SES, P |
Facebook factor 1 | 0.04 | .05 | — | — | –0.03 | <.001 |
Facebook factor 2 | –0.02 | .06 | — | — | –0.01 | .14 |
Facebook factor 3 | 0.03 | <.001 | — | — | –0.01 | .07 |
Facebook factor 4 | –0.02 | .06 | — | — | –0.01 | .74 |
Facebook factor 5 | –0.02 | .04 | — | — | 0.03 | .01 |
Facebook factor 6 | –0.02 | .07 | — | — | –0.02 | .13 |
Facebook factor 7 | –0.05 | .30 | — | — | 0.02 | .04 |
Facebook factor 8 | 0.01 | .34 | — | — | 0.01 | .90 |
Facebook factor 9 | 0.02 | .36 | — | — | –0.01 | .17 |
Age | — | — | –0.01 | .01 | –0.01 | .01 |
Income | — | — | –0.01 | .37 | –0.01 | .59 |
Education | — | — | –0.03 | <.001 | 0.01 | .35 |
Unemployment | — | — | –0.01 | .04 | 0.01 | .58 |
Nonwhite population | | | 0.02 | .04 | | |
Constant | 0.29 | <.001 | 0.30 | <.001 | 0.30 | <.001 |
Adjusted R² | .77 | | .72 | | .80 | |
RMSE | 0.03 | | 0.03 | | 0.03 | |
Actual statistics compared with predicted values for obesity, 2010 BRFSS. Darker colors represent higher prevalence. Light gray indicates missing data.
When we first undertook this research, we expected that most of the measurement error affecting our results would come from the imprecise categorization and geographic aggregation of the Facebook data. However, although there are some exceptions, the consistency and strength of fit we found are evident. Our models do extremely well in predicting levels of health variables across counties where data are plentiful, and often diverge from BRFSS estimates where they are not. This suggests the possibility that data imputed from Facebook and vital statistics may provide a more accurate picture in small counties than the current methodology that aggregates data across several years.
Thus, we argue that Facebook can serve an intermediary role in augmenting sparse data at a community level. We have shown that it can do so already, but additional health survey data, especially in less extensively measured regions (eg, rural), could only help. Although complete measurement is unfeasible and would render the Facebook modeling moot, ensuring that communities of all types are represented in sufficient number when estimating the model is a necessary step in avoiding the risk of systematic error in its predictions.
The ultimate goal of our analysis of Facebook likes is to establish the potential contribution of big data to research that directly affects government spending and public policy and, most importantly, contributes to improved population health. At a fraction of the cost of traditional research, data that might seem on their face to have little to do with health can predict epidemic-level health problems such as diabetes and obesity. With the need to augment traditional public health surveillance systems with readily available, cost-effective, and geographically relevant health data, the use of “big epidemiologic data” comes at just the right time.
The nature of the Facebook data source prevents it from being a useful tool in several situations. In the case of very small counties (approximately 9% of the total) and in smaller geographic areas, rounding error becomes so great that estimates cannot be reliably used, even though they may be provided by Facebook. Additionally, Facebook profiles are untested as a tool for tracking the prevalence of infectious diseases. They may be better suited to predicting endemic and ongoing conditions that are unlikely to fluctuate over the course of short time periods.
Further, some might find it counterintuitive that Facebook data are being used to “predict” health data that not only predate them but to which they are not causally related through any theoretical mechanism. Likes data for a given geographic area should be viewed as a product of sociobehavioral conditions within that region in the same manner that health outcomes are. As such, the likes data can be viewed as an instrument for those conditions, which are causally linked. Although the temporal concerns are not ideal, they are not especially problematic because the health metrics used in this research are not especially prone to fluctuation over short time periods.
Finally, without a clear insight into the manner in which the categories of Facebook likes are constructed and by which individuals are tagged as being interested in a given category, it is difficult to achieve more nuanced insights into the relationships between social network behavior and health outcomes. Unless Facebook becomes more transparent regarding the ways in which these data are compiled, they will remain a “black box” and we must take on faith that the interests and activities being measured are indeed those it claims to measure.
The relationships examined here demonstrate that social media may hold promise as an indicator of local conditions, even those that have little relationship to the activity that takes place on Facebook. As we predicted, significant relationships that extend beyond the predictive power of local demographics exist between an area’s aggregate Facebook behavior and the incidence of diseases and of adverse lifestyle behaviors that may well lead to those diseases.
We have also indicated the severe shortage of health data that are available in most American counties. Although Facebook data may not reach into every corner of the United States, they seem an effective enough tool to augment the existing county-level data in the majority of counties. With demand for local health data growing, such tools seem far more cost-effective than an increase in survey surveillance regardless of the mode through which it might be conducted.
Whether these data ultimately come from Facebook or not is of little importance. The online landscape may change and it may provide a different source of data that proves more viable in the future. So long as the source reflects people’s activities in daily life, the same relationships may hold. Even if Facebook does prove to endure as a social institution, however, there is still room for a great deal of improvement on the models presented here. With cooperation from the social media outlets themselves, we may be able to obtain better estimates in categories that align better with our needs. In the end, our data may not suffer because of the rising costs of research. Instead, exploring newly opened avenues of data collection online could lead to more reliable, timely, and cost-effective county-level data than those obtainable from traditional public health surveillance systems as well as serve as an adjunct to those systems.
Facebook category structure.
Demographic variable descriptions.
Scree plot for principal components analysis.
Rotated (orthogonal varimax) factors.
Factor loadings.
API: application program interface
BRFSS: Behavioral Risk Factor Surveillance System
NVSS: National Vital Statistics System
OLS: ordinary least squares
SES: socioeconomic status
SMART: Selected Metropolitan/Micropolitan Area Risk Trends
We thank the state BRFSS coordinators for their help in collecting the data used in this analysis and members of the Population Health Surveillance Branch for their assistance in developing the database. The authors would also like to express their thanks to Youjie Huang, MD, MPH, DrPH, Florida Acting BRFSS coordinator, for the data used in the 2010 county-level analysis.
None declared.