This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Reaping the benefits from massive volumes of data collected in all sectors to improve population health, inform personalized medicine, and transform biomedical research requires a delicate balance between the benefits and risks of using individual-level data. There is a patchwork of US data protection laws that vary depending on the type of data, who is using it, and their intended purpose. Differences in these laws challenge big data projects using data from different sources. The decisions to permit or restrict data uses are determined by elected officials; therefore, constituent input is critical to finding the right balance between individual privacy and public benefits.
This study explores the US public’s preferences for using identifiable data for different purposes without their consent.
We measured data use preferences of a nationally representative sample of 504 US adults by conducting a web-based survey in February 2020. The survey used a choice-based conjoint analysis. We selected choice-based conjoint attributes and levels based on 5 US data protection laws (Health Insurance Portability and Accountability Act, Family Educational Rights and Privacy Act, Privacy Act of 1974, Federal Trade Commission Act, and the Common Rule). There were 72 different combinations of attribute levels, representing different data use scenarios. Participants were given 12 pairs of data use scenarios and were asked to choose the scenario they were the most comfortable with. We then simulated the population preferences by using a hierarchical Bayes regression model fitted with the ChoiceModelR package in R.
Participants strongly preferred data reuse for public health and research over data reuse for profit-driven, marketing, or crime-detection activities. Participants also strongly preferred data use by universities or nonprofit organizations over data use by businesses and governments. Participants were fairly indifferent about the different types of data used (health, education, government, or economic data).
Our results show a notable incongruence between public preferences and current US data protection laws. Our findings appear to show that the US public favors data uses promoting social benefits over those promoting individual or organizational interests. This study provides strong support for continued efforts to provide safe access to useful data sets for research and public health. Policy makers should consider more robust public health and research data use exceptions to align laws with public preferences. In addition, policy makers who revise laws to enable data use for research and public health should consider more comprehensive protection mechanisms, including transparent use of data and accountability.
Cleaning, integrating, and managing the uncertainty in chaotic real data is essential for reproducible science and to unleash the potential power of big data for biomedical research. This often requires access to very detailed data that inevitably raise privacy concerns. Despite the widespread use of personal information for big data purposes (eg, marketing, intelligence gathering, political campaigns), big data analytics are still challenged in health applications owing to concerns about privacy and complex and differing federal and state laws [
An increasing number of published stories highlight the fact that different privacy protections apply in different contexts. For example, popular news stories have addressed how health information is treated differently when it is collected by health care providers as opposed to commercial companies such as Fitbit, Apple, or Ancestry.com [
Recent high profile breaches (eg, Equifax) and scandals (eg, Facebook and the 2016 US election) have raised awareness of these different privacy standards [
The purpose of this paper is to report on the results of a nationally representative survey examining US residents’ preferences for which of their identifiable personal data should be available for use, by whom, and for what purposes. Prior research focusing on Americans’ attitudes on data use and privacy shows strong support for socially beneficial uses such as research [
In February 2020, we conducted a web-based survey to explore the comfort levels and the preferences of the US population when individually identifiable data are reused for different purposes without their consent. Potential participants were recruited via a third-party research company (Dynata) that specializes in deploying surveys by using nationally representative sampling. We sought to balance the sample on 6 targets based on population characteristics used by the census (gender, race/ethnicity, age, education, household income, and region) where possible. Our goal was to recruit 500 adult (≥18 years) US residents fluent in English to enable reasonable sample balancing [
We selected attributes based on 4 of the 5 elements of the data protection laws (excluding violation penalties) (
Attributes and levels for data reuse scenarios.
Who | Purpose | Source of identifiable data |
Researcher, University | Research, scientific knowledge dissemination | Education records |
Nonprofit Organization | Promoting population health | Health records |
Government | Identify criminal activity | Government program or activity |
Business | Marketing, recruitment | Economic activity, customer behavior |
Because presenting every possible combination of scenarios to each participant was not feasible, a fractional factorial design was used, through a standard web-based platform called "conjoint.ly", to randomly generate subsets of the combinations that were sufficient to obtain robust and meaningful differences in preferences, similar to that reported in previous work [
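The count of 72 scenarios can be reproduced with a short enumeration. The following is a minimal Python sketch, not the survey platform's own procedure. It assumes a fifth purpose level, labeled here "Profit-driven activity" (a hypothetical wording), and it excludes the pairing of that purpose with government and nonprofit users, since the survey did not pair for-profit purposes with those users.

```python
from itertools import product

# Attribute levels as listed in the attributes table.
WHO = [
    "Researcher, University",
    "Nonprofit Organization",
    "Government",
    "Business",
]
PURPOSE = [
    "Research, scientific knowledge dissemination",
    "Promoting population health",
    "Identify criminal activity",
    "Marketing, recruitment",
    "Profit-driven activity",  # assumed label for the for-profit purpose
]
SOURCE = [
    "Education records",
    "Health records",
    "Government program or activity",
    "Economic activity, customer behavior",
]

# User/purpose pairings that were not shown to respondents.
EXCLUDED = {
    ("Government", "Profit-driven activity"),
    ("Nonprofit Organization", "Profit-driven activity"),
}


def enumerate_scenarios():
    """Return every (who, purpose, source) scenario that is not excluded."""
    return [
        (who, purpose, source)
        for who, purpose, source in product(WHO, PURPOSE, SOURCE)
        if (who, purpose) not in EXCLUDED
    ]


scenarios = enumerate_scenarios()
print(len(scenarios))  # 4 * 5 * 4 = 80 combinations, minus 2 * 4 = 8 excluded
```

A fractional factorial design then presents each respondent with only a subset of these scenarios (here, 12 pairs) rather than the full list.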
Sample pair scenario question.
To estimate the parameters at the individual level, we used a hierarchical Bayes regression model and generated 10,000 posterior draws by using Markov chain Monte Carlo simulation [
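As an illustration of what the estimated parameters represent, the sketch below assumes the standard multinomial logit form commonly used in choice-based conjoint analysis: the utility of a scenario is the sum of the part-worths of its levels, and the probability of choosing one scenario from a pair is a logistic function of the utility difference. The part-worth values and shortened level names are hypothetical placeholders, not results from this study, and the code does not perform the hierarchical Bayes estimation itself.

```python
import math


def scenario_utility(part_worths, scenario):
    """Sum the part-worths of the levels that make up one scenario."""
    return sum(part_worths[level] for level in scenario)


def prob_choose_first(part_worths, scenario_a, scenario_b):
    """Probability of choosing scenario_a over scenario_b under a logit model."""
    u_a = scenario_utility(part_worths, scenario_a)
    u_b = scenario_utility(part_worths, scenario_b)
    return 1.0 / (1.0 + math.exp(u_b - u_a))


# Hypothetical part-worths for one respondent (illustrative values only).
example_part_worths = {
    "University": 0.5,
    "Business": -0.5,
    "Population health": 0.8,
    "Marketing": -0.8,
    "Health records": 0.1,
    "Economic activity": -0.1,
}

scenario_a = ("University", "Population health", "Health records")
scenario_b = ("Business", "Marketing", "Economic activity")

p = prob_choose_first(example_part_worths, scenario_a, scenario_b)
print(f"P(choose A over B) = {p:.3f}")
```

In a hierarchical Bayes setting, each respondent has their own set of part-worths drawn from a population-level distribution, and the posterior draws describe the uncertainty in those values.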
The survey was distributed to 687 individuals. Of these, 22 individuals declined to participate (3.2%), 157 did not fully complete the survey (22.8%), and 4 participant responses (0.6%) were marked as low quality based on detected participant behavior (eg, rapidly clicking through without mouse movement). This resulted in 504 respondents who fully completed the web-based survey (504/687, response rate 73.4%), which was our final analytic sample. Generally, we were able to meet our census sampling targets for gender, race/ethnicity, age, education, income, and census region (
Sociodemographic data, clinical characteristics, and privacy attitude scores of the participants (N=504).
Participant characteristics | Values, n (%) | Target sample percentage (a, b) |
Age (years) | | |
18-24 | 41 (8.1) | 13.1 |
25-34 | 75 (14.9) | 17.5 |
35-44 | 100 (19.8) | 17.5 |
45-54 | 101 (20.0) | 19.2 |
55-64 | 68 (13.5) | 15.6 |
65 or older | 89 (17.7) | 17.2 |
Gender | | |
Male | 224 (44.4) | 48.5 |
Female | 278 (55.2) | 50.5 |
Other/prefer not to answer | 2 (0.4) | — (c) |
Race/ethnicity | | |
White | 315 (62.5) | 63.7 |
African American | 77 (15.3) | 12.2 |
Hispanic | 51 (10.1) | 16.4 |
Asian | 46 (9.1) | 4.7 |
Other | 15 (3.0) | 3.0 |
Household income | | |
$20,000 or less | 103 (20.4) | 19.9 |
$20,000 to $49,999 | 149 (29.6) | 30.6 |
$50,000 to $99,999 | 137 (27.2) | 29.1 |
$100,000 to $149,999 | 67 (13.3) | 12.0 |
$150,000 or more | 48 (9.5) | 8.3 |
Education | | |
High school or less | 172 (34.1) | 32.0 |
Some college completed | 99 (19.6) | 19.0 |
College degree | 191 (37.9) | 31.0 |
Master’s | 37 (7.3) | — |
PhD/doctoral | 5 (1.0) | — |
Census region | | |
Midwest | 95 (18.8) | 22.0 |
Northeast | 126 (25.0) | 18.2 |
South | 174 (34.5) | 36.2 |
West | 109 (21.6) | 23.6 |
Insurance | | |
Private | 169 (33.5) | 64.7 |
Medicare | 112 (22.2) | 17.7 |
Medicaid | 83 (16.5) | 17.9 |
Uninsured | 52 (10.3) | 8.5 |
VA/TRICARE | 10 (2.0) | 3.6 |
Multiple | 78 (15.5) | 14.5 |
| | |
No | 319 (63.3) | — |
Yes | 181 (35.9) | — |
| | |
No | 93 (18.5) | — |
Yes | 256 (50.8) | — |
| | |
No | 404 (80.2) | — |
Yes | 100 (19.8) | — |
| | |
No | 423 (83.9) | — |
Yes | 77 (15.3) | — |
Concern for information privacy scores, mean (SD) | 5.8 (1.1) | — |
(a) Survey sampling targets based on census data.
(b) Insurance data were not used as the sampling target. These data show 2018 insurance statistics from the US census for survey sampling comparisons [
(c) Not available.
Relative importance by level within each attribute in percentage (SD).
Public preferences for use of data by users and purpose in percentage (SD). Our survey did not pair “for-profit” purposes with government or nonprofit users because these pairings were implausible and likely to confuse survey respondents.
Top 10 and bottom 10 ranked data use scenarios derived from the sum of scenario attributes' relative values (who/use purpose/data source) in percentage (SD).
In contrast to federal and state laws, US residents make few distinctions across types of data. However, they express much more favorable preferences for uses by academic researchers and nonprofit organizations than by the government or the business community. Moreover, for all types of users, they consistently preferred uses that focus on public health and scientific research rather than on crime detection, marketing, or for-profit activities. Our data demonstrate interesting inconsistencies between public preferences and US privacy laws. These inconsistencies are best exemplified by our participants’ most preferred data reuse (researchers using education data to promote population health) and least preferred data reuse (businesses using consumer data for profit). Ironically, our data indicate that the US public’s most preferred data reuse scenario is currently prohibited under the federal Family Educational Rights and Privacy Act of 1974 while the US public’s least preferred data reuse is completely legal and ubiquitous under the permissive Federal Trade Commission Act [
Our participants also strongly favored data uses by universities and nonprofit organizations. Both universities and nonprofit organizations received higher preference ratings for all data use activities when compared to those received by the government or businesses. In some cases, activities that participants viewed as highly undesirable when conducted by the government or business (crime detection, marketing) were rated favorably when conducted by a nonprofit organization or university. In contrast, the least preferred scenarios involved data reused for profit-driven or marketing activities by businesses or government. Mistrust in government has been documented in other research on attitudes toward research and is perhaps unsurprising in the present partisan political environment [
We did find some preference differences for certain data types, but these differences were modest. Our data show that the public prefers the use of health or educational data (both heavily regulated under US laws) as compared to government data or economic data. Still, our data do not show any strong preferences. The public seems to view data as data. We noted that 4 of the 5 data use purposes we included in our study fall neatly into 2 broad categories: altruistic purposes and self-serving purposes. Public health and scientific purposes both ultimately contribute to the greater good, and our data suggest that these purposes are strongly preferred by the US public, regardless of who is doing the activity. In contrast, our respondents generally found those activities that are primarily self-serving (ie, profit-driven or marketing/recruitment activities) undesirable, regardless of who was doing the activity. The lone exception was marketing by universities, which received a modest positive relative importance score. Consequently, it could be that our participants based some of their preference decisions on whether they saw the data use as contributing to an altruistic or common good objective as opposed to primarily benefiting the data user’s self-interests.
Identifying criminal activity was the one data use that did not fit neatly into the broad categories of altruistic or self-serving purposes. While law enforcement clearly has some social benefits (as do all the activities used in our study), identifying criminal activity implies punishment for some individuals. Consequently, it is not entirely altruistic and not entirely self-serving. Interestingly, participant preferences for identifying criminal activity seemed to vary depending on the data user. Universities and nonprofit organizations both received positive relative importance scores whereas governments and businesses received negative scores. Just as with other data uses, it could be that participants associate universities and nonprofit organizations with motivations more in line with social benefits than with individual benefits.
Collectively, our results do not support the current patchwork of US data protection laws. Many US data protection laws focus primarily on the type of data (ie, health, education, governmental program data), but our respondents were fairly indifferent toward these distinctions. Instead, our findings suggest that the US public is much more interested in who is using the data and for what purposes the data are being used. In particular, our results suggest that the US public has a strong preference for data uses that promote the common good as opposed to individual or self-serving interests.
In fact, our findings suggest that US preferences more closely align with a comprehensive data protection framework such as the General Data Protection Regulation enacted by the European Union where rules vary based on data use but are broadly applicable to all identifiable data [
There are 2 important limitations. We did not capture the universe of data use possibilities; therefore, the measured participants’ preferences are relative to the 72 provided scenarios. Additionally, this design measured participants’ preferences rather than acceptability, meaning that a participant’s least preferred scenario could still be acceptable to them or the most preferred scenario might be unacceptable.
Importantly, these results support a close re-examination of the absence of public health and research data use exceptions in US laws. It is clear that the US public strongly prefers using data to promote population health (as compared to other legal data uses); yet, few laws allow this kind of exception. The Family Educational Rights and Privacy Act provides an excellent example, given that it does not have a public health exception (or a research exception that permits exploring health implications) despite education being one of the most potent known social determinants of health. Moreover, the absence of these data use exceptions within the current patchwork of inconsistent US data protection laws persistently frustrates secondary database researchers and public health professionals, thereby delaying, impeding, or increasing the cost of data-intensive scientific discovery and public health practice [
This study was funded in part by the Presidential Impact Fellow Award at Texas A&M University and the Population Informatics Lab in the School of Public Health at Texas A&M University.
CS and HCK were responsible for conceptualization, study design and implementation, data analysis, interpretation, writing and revisions, and supervision. TG and MR were responsible for study implementation, data analysis, interpretation, writing, and revisions. QZ and MM were responsible for data analysis, interpretation, writing, and revisions.
None declared.