This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
There has been growing interest in data synthesis for enabling the sharing of data for secondary analysis; however, a comprehensive privacy risk model for fully synthetic data is still needed: if a generative model has been overfit, it is possible to identify individuals from the synthetic data and learn something new about them.
The purpose of this study is to develop and apply a methodology for evaluating the identity disclosure risks of fully synthetic data.
A full risk model is presented, which evaluates both identity disclosure and the ability of an adversary to learn something new if there is a match between a synthetic record and a real person. We term this “meaningful identity disclosure risk.” The model is applied to samples from the Washington State Hospital discharge database (2007) and the Canadian COVID-19 cases database. Both of these datasets were synthesized using a sequential decision tree process commonly used to synthesize health and social science data.
The meaningful identity disclosure risk for both of these synthesized samples was below the commonly used 0.09 risk threshold (0.0198 and 0.0086, respectively), and was 4 and 5 times lower than the risk values for the corresponding original datasets.
We have presented a comprehensive identity disclosure risk model for fully synthetic data. The results for this synthesis method on 2 datasets demonstrate that synthesis can reduce meaningful identity disclosure risks considerably. The risk model can be applied in the future to evaluate the privacy of fully synthetic data.
Access to data for building and testing artificial intelligence and machine learning (AIML) models has been problematic in practice and presents a challenge for the adoption of AIML [
A key obstacle to data access has been analyst concerns about privacy and meeting growing privacy obligations. For example, a recent survey by O’Reilly [
Anonymization is one approach for addressing privacy concerns when making data available for secondary purposes such as AIML [
Synthetic data generation is another approach for addressing privacy concerns that has been gaining interest recently [
There are different types of privacy risks. One of them is identity disclosure [
Some researchers have argued that fully synthetic data does not have an identity disclosure risk [
Another type of privacy risk is attribution risk [
Key definitions and requirements will be presented, followed by a model for assessing identity disclosure risk. As a general rule, we have erred on the conservative side when presented with multiple design or parameter options to ensure that patient privacy would be less likely to be compromised.
The basic scheme that we are assuming is illustrated in
The relationships between the different datasets under consideration. Matching between a synthetic sample record and someone in the population goes through the real sample and can occur in 2 directions.
The data custodian makes the synthetic sample available for secondary purposes but does not share the generative model that is used to produce the synthetic sample. Therefore, our risk scenario is when the adversary only has access to the synthetic data.
Synthetic records can be identified by matching them with individuals in the population. When matching is performed to identify synthetic records, that matching is done on the
The variables that are not quasi-identifiers will be referred to as
To illustrate the privacy risks with fully synthetic data, consider the population data in
As can be seen, there is only one North African individual and one European individual in the population, and both are in the real sample. Therefore, these unique real sample records would match 1:1 with the population and would have a very high risk of being identified. The population-unique European and North African records are also in the synthetic data, so here we have a 1:1 match between the synthetic records and the population.
The sensitive income value in the synthetic sample is very similar to the value in the real sample for the North African record. Therefore, arguably, we also learn something new about that individual. The sensitive income value is not as close for the European record, and therefore, even though we are able to match on the quasi-identifier, we will not learn meaningful information about that specific individual from the synthetic data.
Example of a population dataset, with one’s origin as the quasi-identifier and one’s income as the sensitive variable.
National ID  Origin  Income ($) 
1  Japanese  110k 
2  Japanese  100k 
3  Japanese  105k 
4  North African  95k 
5  European  70k 
6  Hispanic  100k 
7  Hispanic  130k 
8  Hispanic  65k 
Example of a real sample, with one’s origin as the quasi-identifier and one’s income as the sensitive variable.
Origin  Income ($) 
European  70k 
Japanese  100k 
Hispanic  130k 
Hispanic  65k 
North African  95k 
Example of a synthetic sample, with one’s origin as the quasi-identifier and one’s income as the sensitive variable.
Origin  Income ($) 
Japanese  115k 
Japanese  120k 
North African  100k 
European  110k 
Hispanic  65k 
This example illustrates that it is plausible to match synthetic sample records with individuals in the population and thus identify these individuals, since a synthesized record can have the same value as a real record on the quasi-identifiers. However, such identification is only meaningful if we learn somewhat correct sensitive information about these matched individuals. Learning something new is considered when evaluating identifiability risks in practical settings [
To formulate our model, we first need to match a synthetic sample record with a real sample record. Consider the synthetic sample in
The key information here is that there was a match—it is a binary indicator. If there is a match between real sample record
A concept that is well understood in the disclosure control literature is that the probability of a successful match between someone in the population and a real record will depend on the direction of the match [
In our hypothetical example, an adversary may know Hans in the population and can match that with the European record in the synthetic sample through the real sample. Or the adversary may select the European record in the synthetic sample and match that with the only European in a population registry through the real sample, which happens to be Hans. Both directions of attack are plausible and will depend on whether the adversary already knows Hans as an acquaintance or not.
Now we can combine the 2 types of matching to get an overall match rate between a synthetic record and the population: the synthetic sample–to–real sample match followed by the real sample–to–population match, and likewise in the reverse direction. We formalize this below.
We start off by assessing the probability that a record in the real sample can be identified by an adversary matching it with an individual in the population. The population-to-sample attack is denoted by
Under the assumption that an adversary will only attempt one of them, but without knowing which one, the overall probability of one of these attacks being successful is given by the maximum of both [
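The original equation images were not reproduced in this extraction; under the definitions above, the overall risk in equation (1) plausibly takes the following form (the subscript labels here are ours, not the paper’s):

```latex
R_{\text{overall}} = \max\bigl(R_{\text{pop}\to\text{samp}},\; R_{\text{samp}\to\text{pop}}\bigr)
```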
The match rate for populationtosample attacks is given by El Emam [
This models an adversary who selects a random individual from the population and matches them with records in the real sample. A selected individual from the population may not be in the real sample, and therefore, the sampling does have a protective effect.
Under the sample-to-population attack, the adversary randomly selects a record from the real sample and matches it to individuals in the population. The match rate is given by El Emam [
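The closed-form match-rate expressions are not reproduced in this extraction, but the two attack directions can be illustrated with a small Monte Carlo simulation on the toy population and real sample from the tables above. This is an illustrative sketch of the attack scenarios as described in the text, not the paper’s estimator; the function names and trial counts are ours.

```python
import random
from collections import defaultdict

# Toy data from the worked example: (national ID, origin quasi-identifier).
population = [(1, "Japanese"), (2, "Japanese"), (3, "Japanese"), (4, "North African"),
              (5, "European"), (6, "Hispanic"), (7, "Hispanic"), (8, "Hispanic")]
# The real sample records, mapped back to their population IDs via the income values.
sample = [(5, "European"), (2, "Japanese"), (7, "Hispanic"), (8, "Hispanic"), (4, "North African")]

pop_by_qi, samp_by_qi = defaultdict(list), defaultdict(list)
for pid, qi in population:
    pop_by_qi[qi].append(pid)
for pid, qi in sample:
    samp_by_qi[qi].append(pid)

def population_to_sample(trials, rng):
    """Adversary picks a random person from the population and matches them
    to a random real sample record with the same quasi-identifier value."""
    hits = 0
    for _ in range(trials):
        pid, qi = rng.choice(population)
        if qi in samp_by_qi:
            hits += rng.choice(samp_by_qi[qi]) == pid
    return hits / trials

def sample_to_population(trials, rng):
    """Adversary picks a random real sample record and matches it to a
    random population member with the same quasi-identifier value."""
    hits = 0
    for _ in range(trials):
        pid, qi = rng.choice(sample)
        hits += rng.choice(pop_by_qi[qi]) == pid
    return hits / trials

rng = random.Random(0)
p2s = population_to_sample(50_000, rng)   # analytically 0.5 for this toy data
s2p = sample_to_population(50_000, rng)   # analytically 0.6 for this toy data
overall = max(p2s, s2p)                   # maximum over both attack directions
```

Note that the two directions give different risks here (the sample-to-population direction is riskier), which is why the model takes the maximum of the two.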
We now extend this by accounting for the matches between the records in the synthetic sample and the records in the real sample. Only those records in the real sample that match with a record in the synthetic sample can then be matched with the population. We define an indicator variable,
And similarly, the synthetic sample-to-population identification risk can be expressed as
And then we have the overall identification risk from equation (1):
The population value of 1/
Notation used in this paper.
Notation  Interpretation 

An index to count records in the real sample 

An index to count records in the synthetic sample 

The number of records in the true population 

The equivalence class group size in the real sample for a particular record 

The equivalence group size in the population that has the same quasiidentifier values as record 

The number of records in the (real or synthetic) sample 

A binary indicator of whether record 

A binary indicator of whether the adversary would learn something new if record 

Number of quasi-identifiers 
λ  Adjustment to account for errors in matching and a verification rate that is not perfect 

The minimal percentage of sensitive variables that need to be similar between the real sample and synthetic sample to consider that an adversary has learned something new 
In practice, 2 adjustments should be made to equation (6) to take into account the reality of matching when attempting to identify records [
Real data has errors in it, and therefore, the accuracy of the matching based on adversary knowledge will be reduced [
A previous review of identification attempts found that when there is a suspected match between a record and a real individual, the suspected match could only be verified 23% of the time [
We can now adjust equation (6) with the λ parameter:
However, equation (8) does not account for the uncertainty in the values obtained from the literature and assumes that verification rates and error rates are independent. Specifically, when there are data errors, they would make the ability to verify less likely, which makes these 2 effects correlated. We can model this correlation, as explained below.
The verification rate and data error rate can be represented as triangular distributions, which is a common way to model phenomena for risk assessment where the real distribution is not precisely known [
We can also model the correlation between the 2 distributions to capture the dependency between (lack of) data errors and verification. This correlation was assumed to be medium, according to Cohen guidelines for the interpretation of effect sizes [
We can use the λ_s value directly in our model. However, to err on the conservative side and to avoid this adjustment for data errors and verification over-attenuating the actual risk, we instead use the midpoint between λ_s and the maximum value of 1. We define
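The simulation behind λ_s can be sketched as follows. The triangular-distribution parameters, the negative correlation value of 0.3 (a medium effect per Cohen), and the rule for combining the verification rate with the error rate are all placeholders of ours, not the paper’s calibrated values; only the 23% verification mode is taken from the text. The correlation is induced with a Gaussian copula.

```python
import random
from statistics import NormalDist

def tri_inv_cdf(u, a, c, b):
    """Inverse CDF of a triangular distribution with min a, mode c, max b."""
    f = (c - a) / (b - a)
    if u < f:
        return a + (u * (b - a) * (c - a)) ** 0.5
    return b - ((1 - u) * (b - a) * (b - c)) ** 0.5

def correlated_lambda(n_draws=50_000, rho=-0.3, seed=0):
    """Estimate lambda_s by Monte Carlo: sample correlated (verification
    rate, error rate) pairs and average one plausible combination of them.
    rho is negative: fewer data errors make verification more likely."""
    rng, nd = random.Random(seed), NormalDist()
    total = 0.0
    for _ in range(n_draws):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)  # Gaussian copula
        # Placeholder triangular parameters (NOT the paper's values):
        verify = tri_inv_cdf(nd.cdf(z1), 0.1, 0.23, 0.4)  # verification rate, mode 23%
        error = tri_inv_cdf(nd.cdf(z2), 0.0, 0.05, 0.1)   # data error rate
        total += verify * (1 - error)   # one plausible way to combine the two
    return total / n_draws

lam_s = correlated_lambda()
lam_star = (lam_s + 1) / 2   # conservative midpoint between lambda_s and 1
```

The midpoint adjustment in the last line is the conservative choice described in the text: it pulls the attenuation factor halfway back toward 1 so the risk is not reduced too aggressively.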
This more conservative adjustment can be entered into equation (6) as follows:
We now extend the risk model in equation (10) to determine if the adversary would learn something new from a match. We let
Because a real sample record can match multiple synthetic sample records, the
In practice, we compute
Learning something new in the context of synthetic data can be expressed as a function of the sensitive variables. Also note that for our analysis, we assume that each sensitive variable is at the same level of granularity as in the real sample since that is the information that the adversary will have after a match.
The test of whether an adversary learns something new is defined in terms of 2 criteria: (1) Is the individual’s real information different from other individuals in the real sample (ie, to what extent is that individual an outlier in the real sample)? And (2) to what extent is the synthetic sample value similar to the real sample value? Both of these conditions would be tested for every sensitive variable.
Let us suppose that the sensitive variable we are looking at is the cost of a procedure. Consider the following scenarios: If the real information about an individual is very similar to other individuals (eg, the value is the same as the mean), then the information gain from an identification would be low (note that there is still some information gain, but it would be lower than the other scenarios). However, if the information about an individual is quite different, say the cost of the procedure is 3 times higher than the mean, then the information gain could be relatively high because that value is unusual. If the synthetic sample cost is quite similar to the real sample cost, then the information gain is still higher because the adversary would learn more accurate information. However, if the synthetic sample cost is quite different from the real sample cost, then very little would be learned by the adversary, or what will be learned will be incorrect, and therefore, the correct information gain would be low.
This set of scenarios is summarized in
The relationship between a real observation to the rest of the data in the real sample and to the synthetic observation, which can be used to determine the likelihood of meaningful identity disclosure.
We propose a model to assess what the adversary would learn from each sensitive variable. If the adversary learns something new for at least
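Once each sensitive variable yields a binary “learned something new” indicator (the per-variable tests are defined in the following subsections), the record-level decision can be sketched as below. The 0.5 default threshold is a placeholder for the minimal-percentage parameter from the notation table, not a value from the paper.

```python
def learned_something_new(indicators, r_threshold=0.5):
    """indicators: one boolean per sensitive variable, True if the adversary
    would learn something new from that variable. r_threshold is the minimal
    fraction of sensitive variables that must trigger (placeholder default)."""
    return sum(indicators) / len(indicators) >= r_threshold
```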
We start off with nominal/binary sensitive variables and then extend the model to continuous variables. Let
We can then determine the distance that the
The distance is low if the value
Let the matching record on the sensitive variable in the synthetic record be denoted by
How do we know if that value indicates that the adversary learns something new about the patient?
We set a conservative threshold; if the similarity is larger than 1 standard deviation, assuming that taking on value
The inequality compares the weighted value with the standard deviation of the proportion
Continuous sensitive variables should be discretized using univariate k-means clustering, with optimal cluster sizes chosen by the majority rule [
In the same manner as for nominal and binary variables, the distance is defined as
Let
We need to determine if this value signifies learning too much. We compare this value to the median absolute deviation (MAD) over the
When this inequality is met, then the weighted difference between the real and synthetic values on the sensitive variable for a particular patient indicates that the adversary will indeed learn something new.
The 1.48 value makes the MAD equivalent to 1 standard deviation for Gaussian distributions. Of course, the multiplier for MAD can be adjusted since the choice of a single standard deviation equivalent was a subjective (albeit conservative) decision.
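Our reading of this test can be sketched with the income values from the example tables earlier in the paper (real sample: 70k, 100k, 130k, 65k, 95k). Two caveats: the per-record outlier weighting from the paper’s full model is omitted here, and the direction of the inequality, that a synthetic value within 1.48 × MAD of the real value means the adversary learns something new, is our interpretation.

```python
import statistics

def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)

def learns_continuous(real_value, synthetic_value, sample_values, k=1.48):
    """Hedged reading of the continuous-variable test: the adversary is
    treated as learning something new when the synthetic value falls within
    one robust standard deviation (k * MAD, k = 1.48) of the real value.
    The paper's per-record outlier weighting is omitted from this sketch."""
    return abs(real_value - synthetic_value) <= k * mad(sample_values)
```

On the example data this reproduces the narrative earlier in the paper: the North African record’s synthetic income (100k vs 95k real) counts as learning something new, while the European record’s (110k vs 70k real) does not.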
An adversary may not attempt to identify records on their original values but instead generalize the values in the synthetic sample and match those. The adversary may also attempt to identify records on a subset of the quasi-identifiers. Therefore, it is necessary to evaluate generalized values on the quasi-identifiers and subsets of quasi-identifiers during the matching process.
In
We describe the methods used to apply this meaningful identity disclosure risk assessment model on 2 datasets.
We apply the meaningful identity disclosure measurement methodology on 2 datasets. The first is the Washington State Inpatient Database (SID) for 2007, a population dataset covering hospital discharges for that year. The dataset has 206 variables and 644,902 observations. The second is the Canadian COVID-19 case dataset, with 7 variables and 100,220 records, gathered by Esri Canada [
We selected a 10% random sample from the full SID and synthesized it (64,490 patients). Then, meaningful identity disclosure of that subset was evaluated using the methodology described in this paper. The whole population dataset was used to compute the population parameters in equation (5) required for calculating the identity disclosure risk values according to equation (11). This ensured that there were no sources of estimation error that needed to be accounted for.
The COVID-19 dataset has 7 variables: the date of reporting, health region, province, age group, gender, case status (active, recovered, deceased, and unknown), and type of exposure. A 20% sample was taken from the COVID-19 dataset (20,045 records), and the population was used to compute the meaningful identity disclosure risk in the same way as for the Washington SID dataset.
State inpatient databases have been attacked in the past, and therefore, we know the quasi-identifiers that have been useful to an adversary. One attack was performed on the Washington SID [
Quasi-identifiers included in the analysis of the Washington State Inpatient Database (SID) dataset.
Variable  Definition 
AGE  patient's age in years at the time of admission 
AGEDAY  age in days of a patient under 1 year of age 
AGEMONTH  age in months for patients under 11 years of age 
PSTCO2  patient's state/county Federal Information Processing Standard (FIPS) code 
ZIP  patient's zip code 
FEMALE  sex of the patient 
AYEAR  hospital admission year 
AMONTH  admission month 
AWEEKEND  admission date was on a weekend 
For the COVID-19 dataset, all of the variables, except exposure, would be considered quasi-identifiers since they would be knowable about an individual.
For data synthesis, we used classification and regression trees [
The specific method we used to generate synthetic data is called conditional trees [
Let us say that we have 5 variables, A, B, C, D, and E. The generation is performed sequentially, and therefore, we need to have a sequence. Various criteria can be used to choose a sequence. For our example, we define the sequence as A→E→C→B→D.
Let the prime notation indicate that the variable is synthesized. For example, A’ means that this is the synthesized version of A. The following are the steps for sequential generation:
Sample from the A distribution to get A’
Build a model F1: E ∼ A
Synthesize E as E’ = F1(A’)
Build a model F2: C ∼ A + E
Synthesize C as C’ = F2(A’, E’)
Build a model F3: B ∼ A + E + C
Synthesize B as B’ = F3(A’, E’, C’)
Build a model F4: D ∼ A + E + C + B
Synthesize D as D’ = F4(A’, E’, C’, B’)
The process can be thought of as having 2 steps, fitting and synthesis. Initially, we are fitting a series of models (F1, F2, F3, F4). These models make up the generator. Then these models can be used to synthesize data according to the scheme illustrated above.
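The fit-then-synthesize scheme above can be sketched in code. The conditional sampler below is a deliberately simple stand-in for the conditional-tree models F1–F4: it groups training rows by the exact values of the parent variables and samples the target from the matching group, whereas a real implementation (e.g., CART or conditional inference trees, as used in the paper) learns that grouping from the data. The function names are ours.

```python
import random

def fit_conditional_sampler(rows, target, parents):
    """Toy stand-in for a conditional-tree model: group training rows by
    their parent-variable values and sample the target from the matching
    group (falling back to the pooled values for unseen parent combinations)."""
    groups = {}
    for r in rows:
        groups.setdefault(tuple(r[p] for p in parents), []).append(r[target])
    pooled = [r[target] for r in rows]
    def sample(synth_row, rng):
        vals = groups.get(tuple(synth_row[p] for p in parents), pooled)
        return rng.choice(vals)
    return sample

def synthesize(rows, order, rng):
    """Sequential synthesis: fit one conditional model per variable (the
    generator), then generate each synthetic record variable by variable."""
    models = {}
    for k, var in enumerate(order[1:], start=1):          # fitting step: F1, F2, ...
        models[var] = fit_conditional_sampler(rows, var, order[:k])
    first = order[0]
    synth = []
    for _ in range(len(rows)):                            # synthesis step
        s = {first: rng.choice([r[first] for r in rows])}  # sample A' marginally
        for var in order[1:]:
            s[var] = models[var](s, rng)                  # E' = F1(A'), C' = F2(A', E'), ...
        synth.append(s)
    return synth
```

With the sequence from the example, `synthesize(rows, ["A", "E", "C", "B", "D"], rng)` performs exactly the nine steps listed above: the marginal draw of A’ followed by the four conditional models.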
As well as computing the meaningful identity disclosure risk for the synthetic sample, we computed the meaningful identity disclosure risk for the real sample itself. With the latter, we let the real sample play the role of the synthetic sample; that is, we compare the real sample against itself. This provides a baseline against which to compare the risk values for the synthetic data and allows us to assess the reduction in meaningful identity disclosure risk due to data synthesis. Note that both of the datasets we used in this empirical study were already de-identified to some extent.
For the computation of meaningful identity disclosure risk, we used an acceptable risk threshold value of 0.09, consistent with values proposed by large data custodians and suggested by the European Medicines Agency and Health Canada for the public release of clinical trial data (
This study was approved by the CHEO Research Institute Research Ethics Board, protocol numbers 20/31X and 20/73X.
The meaningful identity disclosure risk assessment results according to equation (11) for the Washington hospital discharge data are shown in
The risk result on the real dataset is consistent with the empirical attack results [
The results for the synthetic Canadian COVID-19 case data are also about 10 times below the threshold and about 4 times below the risk values for the real data, although the original data has a risk value that is also below the threshold.
Overall, it is clear that the synthetic datasets demonstrate a considerable reduction in meaningful identity disclosure risk compared to the original real datasets.
Overall meaningful identity disclosure risk results. (The italicized values are the maximum risk values.)
Parameter  Synthetic data risk  Real data risk  

Populationtosample risk  Sampletopopulation risk  Populationtosample risk  Sampletopopulation risk 
Washington State Inpatient Database  0.00056 

0.016 

Canadian COVID-19 cases  0.0043 

0.012 

The objective of this study was to develop and empirically test a methodology for the evaluation of identity disclosure risks for fully synthetic health data. This methodology builds on previous work on attribution risk for synthetic data to provide a comprehensive risk evaluation. It was then applied to a synthetic version of the Washington hospital discharge database and the Canadian COVID-19 cases dataset.
We found that the meaningful identity disclosure risk was between 4.5 and 10 times below the commonly used risk threshold of 0.09. Note that this reduced risk level was achieved without implementing any security and privacy controls on the dataset, suggesting that the synthetic variant can be shared with limited controls in place. The synthetic data also had a risk between 4 and 5 times lower than that of the original data.
These results are encouraging in that they provide strong empirical support for claims in the literature that the identity disclosure risks from fully synthetic data are low. Further tests and case studies are needed to add more weight to these findings and determine whether they generalize to other types of datasets.
This work extends, in important ways, previous privacy models for fully synthetic data. Let
This is similar to our definition of learning something new conditional on identity disclosure. Our model extends this work by also considering the likelihood of matching the real sample record to the population using both directions of attack, including a comprehensive search for possible matches between the real sample and synthetic sample. We also consider data errors and verification probabilities in our model, and our implementation of
Some previous data synthesis studies examined another type of disclosure: membership disclosure [
Privacy risk measures that assume that an adversary has white-box or black-box access to the generative model [
Meaningful identity disclosure evaluations should be performed on a regular basis on synthetic data to ensure that the generative models do not overfit. This can complement membership disclosure assessments, providing 2 ways of performing a broad evaluation of privacy risks in synthetic data.
With our model, it is also possible to include meaningful identity disclosure risk as part of the loss function in generative models to simultaneously optimize on identity disclosure risk as well as data utility, and to manage overfitting during synthesis since a signal of overfitting would be a high meaningful identity disclosure risk.
The overall risk assessment model is agnostic to the synthesis approach that is used; however, our empirical results are limited to using a sequential decision tree method for data synthesis. While this is a commonly used approach for health and social science data, different approaches may yield different risk values when evaluated using the methodology described here.
We also made the worst-case assumption that the adversary’s knowledge is perfect and not subject to data errors. This is a conservative assumption, but it was made because we do not have data or evidence on errors in adversary background knowledge.
Future work should extend this model to longitudinal datasets, as the current risk model is limited to cross-sectional data.
Details of calculating and interpreting identity disclosure risk values.
AIML: artificial intelligence and machine learning
MAD: median absolute deviation
SID: State Inpatient Database
We wish to thank Yangdi Jiang for reviewing an earlier version of this paper.
This work was performed in collaboration with Replica Analytics Ltd. This company is a spin-off from the Children’s Hospital of Eastern Ontario Research Institute. KEE is a co-founder and has equity in this company. LM and JB are data scientists and software engineers employed by Replica Analytics Ltd.